Abstract
Demo
This is a data demo from our SVBench. The video is played at 2x speed.
Overview
A temporal dialogue path represents a conversation that unfolds over time as a video plays. Our SVBench evaluates the capabilities of LVLMs in long-context streaming video understanding by constructing temporal dialogue paths that assess 9 critical skills.
Figure 1: Illustration of temporal multi-turn dialogues.
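To make the notion of a temporal dialogue path concrete, here is a minimal data-structure sketch; the class and field names (`QATurn`, `QAChain`, `start_time`, `turns`, and so on) are illustrative assumptions, not the released SVBench annotation schema.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: class and field names are assumptions,
# not the released SVBench annotation schema.

@dataclass
class QATurn:
    question: str
    answer: str

@dataclass
class QAChain:
    start_time: float                      # segment start within the video (seconds)
    end_time: float                        # segment end within the video (seconds)
    turns: List[QATurn] = field(default_factory=list)

@dataclass
class TemporalDialoguePath:
    video_id: str
    chains: List[QAChain] = field(default_factory=list)   # QA chains ordered by time

    def turns_up_to(self, t: float) -> List[QATurn]:
        """Return all dialogue turns whose video segment has ended by time t."""
        return [turn for c in self.chains if c.end_time <= t for turn in c.turns]
```

Restricting turns by `end_time` mirrors the streaming setting, where only dialogue grounded in the video content seen so far is available to the model.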
Annotation Pipeline
Overview of the proposed SVBench framework:
1. Filtering raw videos from diverse streaming sources.
2. Detecting scenes and splitting videos accordingly.
3. Constructing QA chains for dialogues within videos.
4. Performing manual annotation and quality assessment.
5. Identifying temporal linkages between QA chains.
6. Connecting QA chains to facilitate temporal reasoning.
7. Building temporal dialogue paths for evaluating LVLMs.
Figure 2: Overview of the proposed SVBench framework.
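Read as a linear pipeline, the seven stages could be strung together as in the toy sketch below; every function is a simplified placeholder for a stage that is partly manual in practice (notably annotation and quality assessment), and none of these names come from the SVBench toolkit.

```python
# A toy, heavily simplified walk-through of the seven annotation stages.
# Every function is a hypothetical placeholder, not code from the SVBench toolkit.

def filter_videos(raw):                      # (1) keep usable streaming-source videos
    return [v for v in raw if v.get("usable", True)]

def split_by_scene(video):                   # (2) scene detection -> time segments
    return video.get("scenes", [])

def construct_qa_chain(segment):             # (3) draft a QA chain for one segment
    return {"segment": segment, "turns": list(segment.get("draft_turns", []))}

def annotate_and_check(chain):               # (4) manual annotation + quality assessment
    return chain if chain["turns"] else None

def identify_linkages(chains):               # (5) temporal linkages between adjacent chains
    return [(i, i + 1) for i in range(len(chains) - 1)]

def connect_chains(chains, links):           # (6) connect chains to support temporal reasoning
    return {"chains": chains, "links": links}

def extract_paths(graph):                    # (7) temporal dialogue paths for evaluation
    return [graph["chains"]] if graph["chains"] else []

def build_temporal_dialogue_paths(raw_videos):
    videos = filter_videos(raw_videos)
    segments = [s for v in videos for s in split_by_scene(v)]
    chains = [construct_qa_chain(s) for s in segments]
    chains = [c for c in map(annotate_and_check, chains) if c is not None]
    return extract_paths(connect_chains(chains, identify_linkages(chains)))

# Toy usage: one video with two annotated scenes.
demo = [{"usable": True,
         "scenes": [{"draft_turns": [("Q1", "A1")]},
                    {"draft_turns": [("Q2", "A2")]}]}]
print(build_temporal_dialogue_paths(demo))
```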
Statistical Analysis
Our dataset contains videos organized into 12 primary categories and 36 subcategories. To facilitate a more comprehensive evaluation of the capabilities of LVLMs, we classify the questions into 9 distinct categories.
Figure 3: Distributions of videos and QA categories.
Leaderboard
To evaluate the performance of current LVLMs in streaming video understanding, we design two distinct experimental setups, dialogue evaluation and streaming evaluation, on the SVBench evaluation set to rigorously assess their capabilities.
The six metric columns (SA, CC, LC, TU, IC, OS) are reported twice for each model: the first group gives Dialogue Evaluation scores and the second gives Streaming Evaluation scores.

| Model | SA | CC | LC | TU | IC | OS | SA | CC | LC | TU | IC | OS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Open-source LVLMs** | | | | | | | | | | | | |
| MovieChat | 20.46 | 20.05 | 27.76 | 21.81 | 22.21 | 21.89 | 17.99 | 16.42 | 20.37 | 15.77 | 19.08 | 17.43 |
| Video-ChatGPT | 31.86 | 32.58 | 40.28 | 35.32 | 36.26 | 33.80 | 27.98 | 29.54 | 33.81 | 27.95 | 31.00 | 28.88 |
| Video-LLaVA | 35.62 | 36.52 | 42.93 | 38.63 | 38.84 | 37.34 | 32.22 | 32.83 | 36.35 | 32.46 | 34.54 | 32.79 |
| ShareGPT4Video | 39.01 | 40.42 | 47.89 | 41.42 | 43.18 | 40.70 | 34.65 | 36.70 | 41.07 | 35.76 | 37.22 | 35.79 |
| VideoLLaMA2 | 39.13 | 40.33 | 47.60 | 42.36 | 41.80 | 40.60 | 35.68 | 36.40 | 42.23 | 34.65 | 36.70 | 35.84 |
| TimeChat | 36.19 | 37.06 | 44.72 | 40.42 | 37.12 | 37.22 | 35.72 | 37.88 | 42.65 | 36.23 | 36.34 | 36.32 |
| InternVL2 | 45.91 | 46.30 | 52.67 | 49.81 | 46.25 | 46.13 | 43.55 | 44.10 | 48.91 | 40.95 | 44.17 | 42.71 |
| VILA | 46.83 | 48.41 | 54.92 | 48.30 | 50.12 | 48.51 | 46.19 | 47.95 | 51.60 | 44.84 | 48.56 | 46.26 |
| InternLM-XC2.5 | 51.57 | 53.93 | 59.69 | 51.57 | 56.28 | 52.31 | 52.22 | 53.39 | 58.14 | 48.05 | 54.79 | 51.46 |
| MiniCPM-V 2.6 | 53.50 | 55.42 | 60.88 | 55.03 | 55.78 | 54.30 | 53.33 | 54.30 | 58.97 | 49.64 | 54.71 | 52.19 |
| StreamingChat | 59.48 | 61.31 | 66.05 | 58.61 | 61.09 | 59.41 | 55.10 | 56.66 | 60.72 | 51.78 | 55.87 | 53.90 |
| **Closed-source LVLMs** | | | | | | | | | | | | |
| Gemini 1.5 Pro | 54.89 | 56.05 | 61.45 | 53.08 | 56.06 | 54.29 | 49.06 | 50.05 | 54.62 | 45.73 | 49.84 | 48.02 |
| GPT-4V | 65.56 | 68.02 | 71.78 | 63.80 | 68.01 | 65.19 | 58.82 | 59.55 | 64.29 | 54.08 | 60.61 | 57.35 |
| GPT-4o | 65.73 | 68.10 | 71.95 | 66.54 | 68.40 | 66.29 | 59.52 | 60.42 | 65.45 | 55.10 | 61.36 | 58.17 |
Table 1: Evaluation results of various models on SVBench in dialogue and streaming evaluation.
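To illustrate how the streaming setup differs from plain dialogue evaluation, the sketch below shows one plausible driver loop in which each question is issued only once the video has played up to its timestamp; the `answer_fn` interface and field names such as `timestamp` are assumptions for illustration, not the official evaluation harness.

```python
from typing import Callable, Dict, List, Tuple

# Illustrative sketch of a streaming-style evaluation loop, not the official harness.
# `answer_fn` stands in for any LVLM that answers a question given the frames seen so
# far and the dialogue history; field names such as "timestamp" are assumptions.
def streaming_eval(video_frames: List[Tuple[float, object]],
                   qa_pairs: List[Dict],
                   answer_fn: Callable) -> List[Dict]:
    history, results = [], []
    for qa in sorted(qa_pairs, key=lambda q: q["timestamp"]):
        # The model only sees frames up to the question's timestamp, mimicking live playback.
        visible = [frame for t, frame in video_frames if t <= qa["timestamp"]]
        prediction = answer_fn(visible, history, qa["question"])
        history.append((qa["question"], prediction))
        results.append({"question": qa["question"],
                        "reference": qa["answer"],
                        "prediction": prediction})
    return results

# Toy usage with a dummy model that just echoes the question.
frames = [(0.0, "frame0"), (5.0, "frame1"), (10.0, "frame2")]
qas = [{"timestamp": 5.0, "question": "What happened so far?", "answer": "..."}]
print(streaming_eval(frames, qas, lambda v, h, q: f"answer to: {q}"))
```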
Comparisons with Existing Benchmarks
Column definitions for Table 2:

- Avg. Q/V: the average number of QA pairs per video.
- Open-Domain: whether the video sources are diverse.
- Long: whether the average video length exceeds 2 minutes.
- Dialogue: whether there are contextual connections between QA pairs.
- Streaming: whether the QA pairs can be tested in sync with the video over time.
Table 2: The comparison of different datasets.
StreamingChat
We develop StreamingChat, a streaming LVLM baseline built upon InternVL2. It comprises a vision encoder (InternViT), an MLP projector, and an LLM (InternLM2).
Figure 4: Architecture of the proposed StreamingChat model.
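As a rough sketch of this three-component design, the PyTorch snippet below wires a stand-in vision encoder, an MLP projector, and a stand-in decoder into one module; the dimensions and submodules are illustrative assumptions and do not reproduce the actual InternViT or InternLM2 architectures.

```python
import torch
import torch.nn as nn

# Minimal sketch of the three-component StreamingChat design.
# Dimensions and submodules are illustrative assumptions, not the real
# InternViT / InternLM2 configurations.
class StreamingChatSketch(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=2048, vocab_size=32000):
        super().__init__()
        # Stand-in for the InternViT vision encoder: 14x14 patches -> patch tokens.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, vis_dim, kernel_size=14, stride=14), nn.Flatten(2))
        # MLP projector mapping visual tokens into the LLM embedding space.
        self.projector = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                       nn.Linear(llm_dim, llm_dim))
        # Stand-in for the InternLM2 decoder: a single Transformer layer plus an LM head.
        self.llm_layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, frames, text_embeds):
        # frames: (B, 3, H, W); text_embeds: (B, T, llm_dim) already-embedded text tokens.
        vis = self.vision_encoder(frames).transpose(1, 2)  # (B, num_patches, vis_dim)
        vis = self.projector(vis)                          # (B, num_patches, llm_dim)
        tokens = torch.cat([vis, text_embeds], dim=1)      # visual tokens prepended to text
        return self.lm_head(self.llm_layer(tokens))        # next-token logits

# Toy forward pass: one 224x224 frame plus 8 text-token embeddings.
model = StreamingChatSketch()
logits = model(torch.randn(1, 3, 224, 224), torch.randn(1, 8, 2048))
print(logits.shape)  # torch.Size([1, 264, 32000])
```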
Citation
@article{turing1936computable,
  title={On computable numbers, with an application to the {Entscheidungsproblem}},
  author={Turing, Alan Mathison},
  journal={Proceedings of the London Mathematical Society},
  volume={s2-42},
  number={1},
  pages={230--265},
  year={1936}
}