CVPR 2026 Highlight
📄 Paper | 🤗 Data | 🌐 Website |
VideoNet contains 1,000 actions across 37 domains.
It supports two evaluation setups:
| Setup | Test Set Size | Val Set Size | Targeted Model Capability |
|---|---|---|---|
| Multiple Choice | 4,000 | 1,000 | domain-specific action recognition |
| Binary | 4,000 | N/A | video in-context learning |
We expect most researchers will want to evaluate on the multiple-choice setup.
An explanation of the binary setup
The binary setup is as follows:
- A model is shown $k \in \{0, 1, 2, 3\}$ example videos of some action $a_1$.
- The model is then shown a test clip (which may or may not contain $a_1$).
- The model must determine if the test clip contains $a_1$.
Multiple-Choice Example
Question:
Which of the following Pen Spinning actions is shown in the video?
A. warped sonic
B. twisted sonic reverse
C. charge reverse
D. devil's sonic
Please respond with only the letter of the correct answer.
be4435b1-615e-4d82-a440-dd5a0ed9f54d.mp4
Answer:
B
Binary 0-shot Example
Question:
Recall that a Double Flip is an action in Figure Skating. Does the following video show a Double Flip? Please reason through your answer. It is critical that you output 'yes' or 'no' on the final line of your answer.
8ad9dd61-769c-4d7b-8627-6f8770ea48c6.mp4.15-18-50-735.mp4
Answer:
Yes
Binary 3-shot Example
Question:
The following three videos show a Double Flip, which is an action in Figure Skating. Examine the videos closely.
169870c1-59ba-414b-abe1-f11741cc722a.mp4.15-17-51-544.mp4
fbadd7aa-354a-4fd2-8b20-3384920b5bfc.mp4
468e04b0-e88f-4691-a14a-2922547d11ef.mp4
Now consider the following video. Does it also show a Double Flip? Please reason through your answer. It is critical that you output 'yes' or 'no' on the final line of your answer.
8ad9dd61-769c-4d7b-8627-6f8770ea48c6.mp4.15-18-50-735.mp4
Answer:
Yes
There are three ways to evaluate models on VideoNet:
- Use our repo
- Use `lmms-eval`
- Integrate our JSONL files into your codebase
We provide code to run GPT, Gemini, Qwen, and Intern models on VideoNet.
```bash
git clone https://github.com/RAIVNLab/VideoNet.git
cd VideoNet
```

**Tip:** Already have a copy of VideoNet? You can skip the steps below by simply setting the env variable `VN_VIDEOS_DIR` to the folder that contains your copy of VideoNet's MP4 files.
Prereq: Install the HuggingFace library
```bash
pip install -U huggingface_hub
hf auth login
```

Since our dataset is gated on 🤗, you may need to generate a token and store it in the `HF_TOKEN` env variable.
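If you prefer to authenticate from Python rather than the CLI, here is a minimal sketch (assuming you have already generated a token and exported it as `HF_TOKEN`):

```python
import os

from huggingface_hub import login

# Explicit programmatic login; most huggingface_hub calls will also pick up
# the HF_TOKEN environment variable automatically, so this call is optional.
login(token=os.environ["HF_TOKEN"])
```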
Option A: Download via 🤗 CLI
```bash
hf download raivn/VideoNet --include "videos/*" --repo-type dataset --local-dir .
```

Ensure that all videos are placed in a `videos` folder at the root of this repository.
Option B: Download via 🤗 Python library
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="raivn/VideoNet",
    allow_patterns="videos/*",
    local_dir=".",
    repo_type="dataset",
)
```

Ensure that all videos are placed in a `videos` folder at the root of this repository.
GPT
Supported models: `gpt-5.4`, `gpt-5`

```bash
pip install openai
pip install opencv-python
export OPENAI_API_KEY="your_openai_key"
```

Gemini
Supported models: `gemini-3.1-pro`, `gemini-3-flash`
```bash
pip install -U google-genai
export GEMINI_API_KEY="your_gemini_key"
```

Qwen VL
Supported models: `Qwen3-VL-8B-Instruct`
```bash
pip install torch torchvision
pip install accelerate
pip install -U transformers
pip install qwen-vl-utils[decord]
```

Optionally, you can decide where Qwen's weights will be downloaded by setting the env variable `VN_MODELS_DIR`. If not specified, we will create a `models` directory at the root of this repository to store all models.

Intern VL
Supported models: `InternVL3_5-8B`, `InternVL3-8B`
```bash
pip install torch torchvision
pip install transformers==4.55.0
pip install timm
pip install einops
```

Optionally, you can decide where Intern's weights will be downloaded by setting the env variable `VN_MODELS_DIR`. If not specified, we will create a `models` directory at the root of this repository to store all models.
Once you have completed the steps above, you are ready to evaluate on VideoNet!
Here are some quick-start examples:
```bash
python src/eval.py -m gemini-3.1-pro --mcq-test
python src/eval.py -m gpt-5.4 --mcq-val --max-frames 256 --reasoning-level xhigh
python src/eval.py -m Qwen3-VL-8B-Instruct -k 0 --fps 1 --enable-flash-attn
```

To start a model evaluation, run `src/eval.py` from the root of the repository. You can specify your desired model with the `-m` flag. For MCQ evaluations, provide the `--mcq-test` flag or the `--mcq-val` flag. For binary evaluations, provide `-k {0, 1, 2, 3}` to specify the number of in-context examples shown to the model.
For certain models, you can further specify the video sampling rate (via `--fps`) or the max number of frames sampled (via `--max-frames`). For GPT and Gemini models, you can specify the reasoning level via `--reasoning-level`. For local models, you can also enable the use of Flash Attention with `--enable-flash-attn`.
Your outputs will be saved in a JSON file in the auto-generated results directory. A summary of the results will be printed once your evaluation run concludes.
Results Structure
The results JSON file will hold a dictionary. Conceptually, each key in this dictionary corresponds to a question in your chosen configuration of the VideoNet benchmark.
Literally, keys correspond to their respective question's `key` field in the JSONL file for your chosen benchmark configuration.
A key maps to a dictionary containing the following:
- `accurate`: `1`, `0`, or `-1` (as ints) depending on whether the model response was correct, incorrect, or unparsable, respectively.
- `prediction`: the prediction we extracted from the model response (see parser functions in `src/utils.py`)
- `response`: the raw model response

If loaded into Python, the entire results dictionary would have the type signature `dict[str, dict[str, int | str]]`.
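For reference, here is a minimal sketch of how you could load and summarize one of these results files yourself (the filename below is hypothetical; `src/eval.py` already prints this summary at the end of a run):

```python
import json

# Hypothetical filename; the actual name depends on the model and
# configuration you evaluated.
with open("results/gpt-5_mcq_test.json") as f:
    results: dict[str, dict[str, int | str]] = json.load(f)

correct = sum(r["accurate"] == 1 for r in results.values())
unparsable = sum(r["accurate"] == -1 for r in results.values())
print(f"Accuracy: {correct / len(results):.2%} ({unparsable} unparsable responses)")
```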
VideoNet is integrated into the lmms-eval codebase. Here are the specific task names:
- `videonet_mcq_test`
- `videonet_mcq_val`
- `videonet_binary_0shot`
- `videonet_binary_1shot`
- `videonet_binary_2shot`
- `videonet_binary_3shot`
We provide six JSONL files in the benchmarks directory of this repo, one per evaluation configuration.
Each line in the JSONL file is a benchmark question represented as a JSON dictionary.
Here is an example from the MCQ validation set.
```json
{
    "key": "1-mcq-val-1",
    "question": [
        {
            "type": "text",
            "text": "Which of the following Figure Skating actions is shown in the video?\nA. split jump\nB. camel spin\nC. cantilever\nD. biellmann spin\nPlease respond with only the letter of the correct answer."
        },
        {
            "type": "video",
            "video": "d0f9e765-a102-4e63-90bf-5605440e4adf.mp4"
        }
    ],
    "answer": "D"
}
```

Technical Specification of VideoNet JSONL files
Each line is a JSON dictionary with the following keys:
- `key`: a string; the unique identifier for the question.
  - e.g., `29-pos-1` indicates this is the 1st positive (i.e., the ground truth is "yes") binary test clip for action #29.
  - e.g., `29-neg-2` indicates this is the 2nd negative (i.e., the ground truth is "no") binary test clip for action #29.
  - e.g., `32-mcq-val-1` indicates this is the 1st multiple-choice question (from the MCQ val set) for action #32.
  - e.g., `32-mcq-test-4` indicates this is the 4th multiple-choice question (from the MCQ test set) for action #32.
- `question`: a list of dictionaries
  - all dictionaries have a `type` key which takes the value `text` or `video`.
  - dictionaries also have either a `text` key or a `video` key. In the latter case, the corresponding value is a string containing the video's filename.
- `answer`: a string; the ground truth for this question.
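If you are integrating our JSONL files into your own codebase, a minimal sketch for reading a configuration looks like this (the filename below is an assumption; use whichever file in `benchmarks/` matches the configuration you want):

```python
import json
from pathlib import Path

# Hypothetical filename; pick the JSONL for your chosen configuration.
benchmark_path = Path("benchmarks/videonet_mcq_val.jsonl")

with benchmark_path.open() as f:
    questions = [json.loads(line) for line in f]

for q in questions:
    prompt = "\n".join(p["text"] for p in q["question"] if p["type"] == "text")
    videos = [p["video"] for p in q["question"] if p["type"] == "video"]
    # Feed `prompt` and the video files (found in your videos/ folder or under
    # VN_VIDEOS_DIR) to your model, then compare its output against q["answer"].
```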
Please refer to the dataset page on HuggingFace.
This project is partially funded by a grant from Apple.
The structure of this repository is inspired by TOMATO.
```bibtex
@misc{yadav2026videonet,
      title={VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition},
      author={Tanush Yadav and Mohammadreza Salehi and Jae Sung Park and Vivek Ramanujan and Hannaneh Hajishirzi and Yejin Choi and Ali Farhadi and Rohun Tripathi and Ranjay Krishna},
      year={2026},
      eprint={2605.02834},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.02834},
}
```
