
VideoNet: Domain-Specific Action Recognition with VLMs

CVPR 2026 Highlight

📄 Paper | 🤗 Data | 🌐 Website | ▶️ Demo

*Diagram listing the 37 domains in VideoNet, with sample frames from videos in each domain.*

Overview

VideoNet contains 1,000 actions across 37 domains.

It supports two evaluation setups:

| Setup | Test Set Size | Val Set Size | Targeted Model Capability |
|---|---|---|---|
| Multiple Choice | 4,000 | 1,000 | domain-specific action recognition |
| Binary | 4,000 | N/A | video in-context learning |

We expect that most researchers will want to evaluate on the multiple-choice setting.

An explanation of the binary setup.

In the binary setup:

  • A model is shown $k \in \{0, 1, 2, 3\}$ example videos of some action $a_1$
  • The model is then shown a test clip (which may or may not contain $a_1$)
  • The model must determine whether the test clip contains $a_1$
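As a rough sketch, a $k$-shot binary question follows the structure below. The helper function and filenames here are hypothetical, written only to illustrate the prompt shape; the actual prompts ship pre-built in the benchmark JSONL files.

```python
# Sketch of the k-shot binary prompt structure (hypothetical helper;
# the real prompts are pre-built in the benchmark JSONL files).
def build_binary_prompt(action, domain, example_videos, test_video):
    parts = []
    k = len(example_videos)
    if k == 0:
        # 0-shot: no examples, just define the action and ask directly.
        parts.append({"type": "text", "text": (
            f"Recall that a {action} is an action in {domain}. "
            f"Does the following video show a {action}?")})
    else:
        # k-shot: show k example clips of the action, then the test clip.
        parts.append({"type": "text", "text": (
            f"The following {k} videos show a {action}, which is an action "
            f"in {domain}. Examine the videos closely.")})
        for v in example_videos:
            parts.append({"type": "video", "video": v})
        parts.append({"type": "text", "text": (
            f"Now consider the following video. "
            f"Does it also show a {action}?")})
    parts.append({"type": "video", "video": test_video})
    return parts

# Example: a 3-shot prompt has 6 parts (text, 3 videos, text, test video).
prompt = build_binary_prompt("Double Flip", "Figure Skating",
                             ["a.mp4", "b.mp4", "c.mp4"], "test.mp4")
```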

Examples

Multiple-Choice Example

Question:

Which of the following Pen Spinning actions is shown in the video?

A. warped sonic

B. twisted sonic reverse

C. charge reverse

D. devil's sonic

Please respond with only the letter of the correct answer.

be4435b1-615e-4d82-a440-dd5a0ed9f54d.mp4

Answer:

B

Binary 0-shot Example

Question:

Recall that a Double Flip is an action in Figure Skating. Does the following video show a Double Flip? Please reason through your answer. It is critical that you output 'yes' or 'no' on the final line of your answer.

8ad9dd61-769c-4d7b-8627-6f8770ea48c6.mp4.15-18-50-735.mp4

Answer:

Yes

Binary 3-shot Example

Question:

The following three videos show a Double Flip, which is an action in Figure Skating. Examine the videos closely.

169870c1-59ba-414b-abe1-f11741cc722a.mp4.15-17-51-544.mp4
fbadd7aa-354a-4fd2-8b20-3384920b5bfc.mp4
468e04b0-e88f-4691-a14a-2922547d11ef.mp4

Now consider the following video. Does it also show a Double Flip? Please reason through your answer. It is critical that you output 'yes' or 'no' on the final line of your answer.

8ad9dd61-769c-4d7b-8627-6f8770ea48c6.mp4.15-18-50-735.mp4

Answer:

Yes

Evaluation

There are three ways to evaluate models on VideoNet:

  1. Use our repo
  2. Use lmms-eval
  3. Integrate our JSONL files into your codebase

Option 1 (Recommended)

We provide code to run GPT, Gemini, Qwen, and Intern models on VideoNet.

1.1 Clone this repo.

git clone https://github.com/RAIVNLab/VideoNet.git
cd VideoNet

1.2 Download the videos.

Tip

Already have a copy of VideoNet? You can skip the steps below by setting the env variable VN_VIDEOS_DIR to the folder containing your copy of VideoNet's MP4 files.

Prereq: Install the HuggingFace library
pip install -U huggingface_hub
hf auth login

Since our dataset is gated on 🤗, you may need to generate a token and store it in the HF_TOKEN env variable.

Option A: Download via 🤗 CLI
hf download raivn/VideoNet --include "videos/*" --repo-type dataset --local-dir .

Ensure that all videos are placed in a videos folder at the root of this repository.

Option B: Download via 🤗 Python library
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="raivn/VideoNet",
    allow_patterns="videos/*",
    local_dir=".",
    repo_type="dataset",
)

Ensure that all videos are placed in a videos folder at the root of this repository.
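The directory-resolution convention described above (VN_VIDEOS_DIR if set, otherwise a `videos` folder at the repo root) can be sketched as follows. This is an illustrative sketch, not the repository's actual code:

```python
import os
from pathlib import Path

# Resolve where the VideoNet MP4s live: VN_VIDEOS_DIR if set,
# else the default videos/ folder at the repo root (a sketch,
# mirroring the convention described in this README).
def resolve_videos_dir():
    env = os.environ.get("VN_VIDEOS_DIR")
    if env:
        return Path(env)
    return Path("videos")
```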

1.3 Pick a model.

GPT
Options:
  • gpt-5.4
  • gpt-5
Requirements:
pip install openai
pip install opencv-python
export OPENAI_API_KEY="your_openai_key"
Gemini
Options:
  • gemini-3.1-pro
  • gemini-3-flash
Requirements:
pip install -U google-genai
export GEMINI_API_KEY="your_gemini_key"
Qwen VL
Options:
  • Qwen3-VL-8B-Instruct
Requirements:
pip install torch torchvision
pip install accelerate
pip install -U transformers
pip install qwen-vl-utils[decord]

Optionally, you can decide where Qwen's weights will be downloaded by setting the env variable VN_MODELS_DIR. If not specified, we will create a models directory at the root of this repository to store all models.

Intern VL
Options:
  • InternVL3_5-8B
  • InternVL3-8B
Requirements:
pip install torch torchvision
pip install transformers==4.55.0
pip install timm
pip install einops

Optionally, you can decide where Intern's weights will be downloaded by setting the env variable VN_MODELS_DIR. If not specified, we will create a models directory at the root of this repository to store all models.

1.4 Run the evaluation.

Once you have completed the steps above, you are ready to evaluate on VideoNet!

Here are some quick-start examples:

python src/eval.py -m gemini-3.1-pro --mcq-test
python src/eval.py -m gpt-5.4 --mcq-val --max-frames 256 --reasoning-level xhigh
python src/eval.py -m Qwen3-VL-8B-Instruct -k 0 --fps 1 --enable-flash-attn

To start a model evaluation, run src/eval.py from the root of the repository. You can specify your desired model with the -m flag. For MCQ evaluations, provide the --mcq-test flag or the --mcq-val flag. For binary evaluations, provide -k {0, 1, 2, 3} to specify the number of in-context examples shown to the model.

For certain models, you can further specify the video sampling rate (via --fps) or the max number of frames sampled (via --max-frames). For GPT and Gemini models, you can specify the reasoning level via --reasoning-level. For local models, you can also enable the use of Flash Attention with --enable-flash-attn.

Your outputs will be saved in a JSON file in the auto-generated results directory. A summary of the results will be printed once your evaluation run concludes.

Results Structure

The results JSON file holds a dictionary.

Conceptually, each key in this dictionary corresponds to a question in your chosen configuration of the VideoNet benchmark.

Literally, keys correspond to their respective question's key field in the JSONL file for your chosen benchmark configuration.

A key maps to a dictionary containing the following:

  • accurate: 1, 0, or -1 (as ints), depending on whether the model's response was correct, incorrect, or unparsable, respectively.
  • prediction: the prediction we extracted from the model's response (see the parser functions in src/utils.py)
  • response: the raw model response

If loaded into Python, the entire results dictionary would have the type signature dict[str, dict[str, int | str]].
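Given that structure, a results file can be summarized with a few lines of Python. The path below is a placeholder; actual filenames are generated under the results directory:

```python
import json

# Summarize one results JSON file of type dict[str, dict[str, int | str]]
# (path is a placeholder; real files are auto-generated under results/).
def summarize(results_path):
    with open(results_path) as f:
        results = json.load(f)
    correct = sum(1 for r in results.values() if r["accurate"] == 1)
    unparsable = sum(1 for r in results.values() if r["accurate"] == -1)
    total = len(results)
    return {"accuracy": correct / total,
            "unparsable": unparsable,
            "total": total}
```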

Option 2: lmms-eval

VideoNet is integrated into the lmms-eval codebase. Here are the specific task names:

  • videonet_mcq_test
  • videonet_mcq_val
  • videonet_binary_0shot
  • videonet_binary_1shot
  • videonet_binary_2shot
  • videonet_binary_3shot

Option 3: Integrating with Your Codebase

We provide six JSONL files in the benchmarks directory of this repo, one per evaluation configuration.

Each line in the JSONL file is a benchmark question represented as a JSON dictionary.

Here is an example from the MCQ validation set.

{
    "key": "1-mcq-val-1", 
    "question": [
        {
            "type": "text", 
            "text": "Which of the following Figure Skating actions is shown in the video?\nA. split jump\nB. camel spin\nC. cantilever\nD. biellmann spin\nPlease respond with only the letter of the correct answer."
        }, 
        {
            "type": "video", 
            "video": "d0f9e765-a102-4e63-90bf-5605440e4adf.mp4"
        }
    ],
    "answer": "D"
}
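Entries in this format can be consumed with a small loader like the sketch below, which separates each question's text parts from its video parts:

```python
import json

# Iterate over one benchmark configuration's JSONL file, yielding
# (key, prompt text, video filenames, answer) per question.
def load_questions(jsonl_path):
    with open(jsonl_path) as f:
        for line in f:
            q = json.loads(line)
            videos = [p["video"] for p in q["question"]
                      if p["type"] == "video"]
            text = "\n".join(p["text"] for p in q["question"]
                             if p["type"] == "text")
            yield q["key"], text, videos, q["answer"]
```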
Technical Specification of VideoNet JSONL files

Each line is a JSON dictionary with the following keys:

  • key: a string; the unique identifier for the question.
    • e.g., 29-pos-1 indicates this is the 1st positive (i.e., the ground truth is "yes") binary test clip for action #29.
    • e.g., 29-neg-2 indicates this is the 2nd negative (i.e., the ground truth is "no") binary test clip for action #29.
    • e.g., 32-mcq-val-1 indicates this is the 1st multiple-choice question (from the MCQ val set) for action #32.
    • e.g., 32-mcq-test-4 indicates this is the 4th multiple-choice question (from the MCQ test set) for action #32.
  • question: a list of dictionaries
    • all dictionaries have a type key which takes the value text or video.
    • dictionaries also have either a text key or video key. In the latter case, the corresponding value is a string containing the video's filename.
  • answer: a string; the ground-truth for this question.
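Under the key format described above, a key splits into an action number, a split label, and an index. A minimal parser, assuming only the four key formats listed:

```python
# Split a VideoNet question key into (action_id, split, index),
# assuming the formats listed above: "29-pos-1", "29-neg-2",
# "32-mcq-val-1", "32-mcq-test-4".
def parse_key(key):
    parts = key.split("-")
    action_id = int(parts[0])   # action number
    index = int(parts[-1])      # question index within the split
    split = "-".join(parts[1:-1])  # "pos", "neg", "mcq-val", or "mcq-test"
    return action_id, split, index
```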

Training Data

Please refer to our HuggingFace dataset.

Acknowledgements

This project is partially funded by a grant from Apple.

The structure of this repository is inspired by TOMATO.

Citation

@misc{yadav2026videonet,
      title={VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition}, 
      author={Tanush Yadav and Mohammadreza Salehi and Jae Sung Park and Vivek Ramanujan and Hannaneh Hajishirzi and Yejin Choi and Ali Farhadi and Rohun Tripathi and Ranjay Krishna},
      year={2026},
      eprint={2605.02834},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.02834}, 
}
