SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding

Zhenyu Yang1,2,3, Yuhang Hu4, Zemin Du5, Dizhan Xue1,2, Shengsheng Qian1,2, Jiahong Wu3, Fan Yang3, Weiming Dong1,2, Changsheng Xu1,2,6
1Institute of Automation, Chinese Academy of Sciences, 2University of Chinese Academy of Sciences, 3Kuaishou Technology, 4Zhengzhou University, 5ShanghaiTech University, 6Peng Cheng Laboratory
ICLR'2025 (Spotlight🔥)

Abstract

Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in evaluating their applicability to the emerging domain of long-context streaming video understanding. Current benchmarks for video understanding typically emphasize isolated single-instance text inputs and fail to evaluate the capacity to sustain temporal reasoning throughout the entire duration of video streams. To address these limitations, we introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains specifically designed to thoroughly assess the streaming video understanding capabilities of current LVLMs. We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs from 1,353 streaming videos, which includes generating QA chains that represent a series of consecutive multi-turn dialogues over video segments and constructing temporal linkages between successive QA chains. Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding. We also construct a StreamingChat model, which significantly outperforms open-source LVLMs on our SVBench and achieves comparable performance on diverse vision-language benchmarks. We expect SVBench to advance research on streaming video understanding by providing a comprehensive and in-depth analysis of current LVLMs. Our benchmark and model can be accessed at https://github.com/yzy-bupt/SVBench.

Demo

This is a data demo from our SVBench. The video plays at 2x speed.

Overview

A temporal dialogue path represents a conversation within a video progressing over time. Our SVBench evaluates the capabilities of LVLMs in long-context streaming video understanding by constructing temporal dialogue paths to assess 9 critical skills.

overview

Figure 1: Illustration of temporal multi-turn dialogues.
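To make the notion of a temporal dialogue path concrete, the sketch below models QA pairs, QA chains, and their temporal ordering as simple Python data classes. The field names (e.g., segment_start, all_turns) are hypothetical and chosen only for illustration; they do not reflect the released SVBench data format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical field names for illustration only; the released SVBench
# annotation files may use a different schema.

@dataclass
class QAPair:
    question: str
    answer: str

@dataclass
class QAChain:
    """A multi-turn dialogue grounded in one video segment."""
    segment_start: float               # segment start time (seconds)
    segment_end: float                 # segment end time (seconds)
    turns: List[QAPair] = field(default_factory=list)

@dataclass
class TemporalDialoguePath:
    """QA chains linked across time into one streaming conversation."""
    video_id: str
    chains: List[QAChain] = field(default_factory=list)

    def all_turns(self) -> List[QAPair]:
        # Flatten the path into the ordered turns a model would answer
        # as the video streams forward.
        return [turn for chain in self.chains for turn in chain.turns]
```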

Annotation Pipeline

Overview of the proposed SVBench framework:
(1) Filtering raw videos from diverse streaming sources;
(2) Detecting scenes and splitting videos accordingly;
(3) Constructing QA chains for dialogues within videos;
(4) Performing manual annotation and quality assessment;
(5) Identifying temporal linkages between QA chains;
(6) Connecting QA chains to facilitate temporal reasoning;
(7) Building temporal dialogue paths for evaluating LVLMs.

framework

Figure 2: Overview of the proposed SVBench framework.
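For concreteness, here is a minimal sketch of how steps (1)-(7) could be organized in code. The function names, the fixed-length segmentation standing in for scene detection, and the omitted LLM-based QA generation are assumptions for illustration, not the authors' released pipeline; step (4), manual annotation and quality assessment, is a human step and is only noted in a comment.

```python
# Hypothetical sketch of the semi-automated annotation pipeline.

def filter_videos(raw_videos):
    """Step 1: keep only videos suitable for streaming QA annotation."""
    return [v for v in raw_videos if v.get("duration_s", 0) > 0]

def split_into_segments(video):
    """Step 2: detect scenes and split the video accordingly.
    A fixed 30-second split stands in for a real scene detector."""
    step, end = 30.0, video["duration_s"]
    segments, start = [], 0.0
    while start < end:
        segments.append((start, min(start + step, end)))
        start += step
    return segments

def generate_qa_chain(video, segment):
    """Step 3: build a multi-turn QA chain for one segment, e.g. by
    prompting an LLM with sampled frames (generation omitted here).
    Step 4 (manual annotation and quality assessment) would revise
    the generated turns before they enter the benchmark."""
    return {"segment": segment, "turns": []}

def link_chains(chains):
    """Steps 5-6: identify and connect temporal linkages between
    successive QA chains so dialogues can reference earlier segments."""
    for prev, curr in zip(chains, chains[1:]):
        curr["links_to_previous"] = prev["segment"]
    return chains

def build_temporal_dialogue_path(video):
    """Step 7: assemble the temporal dialogue path used for evaluation."""
    chains = [generate_qa_chain(video, seg) for seg in split_into_segments(video)]
    return {"video_id": video["id"], "chains": link_chains(chains)}
```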

Statistical Analysis

Our dataset contains videos organized into 12 primary categories and 36 subcategories. To facilitate a more comprehensive evaluation of the capabilities of LVLMs, we classify the questions into 9 distinct categories.

ring

Figure 3: Distributions of videos and QA categories.

Leaderboard

To evaluate the performance of current LVLMs in streaming video understanding, we design two experimental setups on the SVBench evaluation set: dialogue evaluation and streaming evaluation.

| Model | Dialogue: SA | CC | LC | TU | IC | OS | Streaming: SA | CC | LC | TU | IC | OS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Open-source LVLMs | | | | | | | | | | | | |
| MovieChat | 20.46 | 20.05 | 27.76 | 21.81 | 22.21 | 21.89 | 17.99 | 16.42 | 20.37 | 15.77 | 19.08 | 17.43 |
| Video-ChatGPT | 31.86 | 32.58 | 40.28 | 35.32 | 36.26 | 33.80 | 27.98 | 29.54 | 33.81 | 27.95 | 31.00 | 28.88 |
| Video-LLaVA | 35.62 | 36.52 | 42.93 | 38.63 | 38.84 | 37.34 | 32.22 | 32.83 | 36.35 | 32.46 | 34.54 | 32.79 |
| ShareGPT4Video | 39.01 | 40.42 | 47.89 | 41.42 | 43.18 | 40.70 | 34.65 | 36.70 | 41.07 | 35.76 | 37.22 | 35.79 |
| VideoLLaMA2 | 39.13 | 40.33 | 47.60 | 42.36 | 41.80 | 40.60 | 35.68 | 36.40 | 42.23 | 34.65 | 36.70 | 35.84 |
| TimeChat | 36.19 | 37.06 | 44.72 | 40.42 | 37.12 | 37.22 | 35.72 | 37.88 | 42.65 | 36.23 | 36.34 | 36.32 |
| InternVL2 | 45.91 | 46.30 | 52.67 | 49.81 | 46.25 | 46.13 | 43.55 | 44.10 | 48.91 | 40.95 | 44.17 | 42.71 |
| VILA | 46.83 | 48.41 | 54.92 | 48.30 | 50.12 | 48.51 | 46.19 | 47.95 | 51.60 | 44.84 | 48.56 | 46.26 |
| InternLM-XC2.5 | 51.57 | 53.93 | 59.69 | 51.57 | 56.28 | 52.31 | 52.22 | 53.39 | 58.14 | 48.05 | 54.79 | 51.46 |
| MiniCPM-V 2.6 | 53.50 | 55.42 | 60.88 | 55.03 | 55.78 | 54.30 | 53.33 | 54.30 | 58.97 | 49.64 | 54.71 | 52.19 |
| StreamingChat | 59.48 | 61.31 | 66.05 | 58.61 | 61.09 | 59.41 | 55.10 | 56.66 | 60.72 | 51.78 | 55.87 | 53.90 |
| Closed-source LVLMs | | | | | | | | | | | | |
| Gemini 1.5 Pro | 54.89 | 56.05 | 61.45 | 53.08 | 56.06 | 54.29 | 49.06 | 50.05 | 54.62 | 45.73 | 49.84 | 48.02 |
| GPT-4V | 65.56 | 68.02 | 71.78 | 63.80 | 68.01 | 65.19 | 58.82 | 59.55 | 64.29 | 54.08 | 60.61 | 57.35 |
| GPT-4o | 65.73 | 68.10 | 71.95 | 66.54 | 68.40 | 66.29 | 59.52 | 60.42 | 65.45 | 55.10 | 61.36 | 58.17 |

Table 1: Evaluation results of various models on SVBench in dialogue and streaming evaluation.
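The key difference between the two setups is that streaming evaluation must answer each QA chain in sync with playback, seeing only the frames observed so far. A minimal sketch of such a loop is shown below; the model interface, the score_answer judge, and the dictionary keys are placeholders for illustration, not the official SVBench evaluation code.

```python
# Hypothetical streaming-evaluation loop; `model` and `score_answer`
# are placeholders for an LVLM API and a per-answer scoring function.

def streaming_evaluate(model, video_frames, dialogue_path, score_answer, fps=1.0):
    """Reveal frames up to each QA chain's segment end, then run the
    multi-turn dialogue for that segment and score every answer."""
    history, scores = [], []
    for chain in dialogue_path["chains"]:
        # Only frames observed so far are visible to the model.
        _, t_end = chain["segment"]
        visible = video_frames[: int(t_end * fps)]
        for turn in chain["turns"]:
            pred = model.answer(frames=visible,
                                history=history,
                                question=turn["question"])
            scores.append(score_answer(pred, turn["answer"]))
            history.append((turn["question"], pred))
    return sum(scores) / max(len(scores), 1)
```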

Comparisons with Existing Benchmarks

Avg. Q/V: the average number of QA pairs per video. Open-Domain: whether the video sources are diverse. Long: whether the average video length is greater than 2 minutes. Dialogue: whether there are contextual connections between QA pairs. Streaming: whether the QA pairs can be tested in sync with the video over time.

comparison

Table 2: Comparison of SVBench with existing benchmarks.

StreamingChat

Building upon InternVL2, we develop a streaming LVLM baseline named StreamingChat. It comprises a vision encoder (InternViT), an MLP projector, and an LLM (InternLM2).

model_framework

Figure 4: Architecture of the proposed StreamingChat model.
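As a rough illustration of this three-part design, the PyTorch-style sketch below wires a vision encoder, an MLP projector, and a language model together. The module interfaces, hidden dimensions, and token-fusion strategy are simplified assumptions made for illustration; they are not the released StreamingChat implementation, which uses InternViT and InternLM2.

```python
import torch
import torch.nn as nn

class StreamingChatSketch(nn.Module):
    """Simplified LVLM: vision encoder -> MLP projector -> LLM.
    Both backbones are passed in as generic modules so the sketch
    stays self-contained; dimensions are illustrative defaults."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a ViT backbone
        self.projector = nn.Sequential(           # maps visual tokens into
            nn.Linear(vision_dim, llm_dim),       # the LLM embedding space
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                            # causal LM accepting inputs_embeds (HF-style)

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor):
        # frames: (batch, num_frames, channels, height, width)
        b, t = frames.shape[:2]
        vis = self.vision_encoder(frames.flatten(0, 1))   # (b*t, tokens, vision_dim)
        vis = self.projector(vis).unflatten(0, (b, t))    # (b, t, tokens, llm_dim)
        vis = vis.flatten(1, 2)                           # concatenate frame tokens
        # Prepend the projected visual tokens to the text embeddings and
        # let the LLM attend over the combined sequence.
        inputs = torch.cat([vis, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```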

Citation

@inproceedings{yang2025svbench,
  title={SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding},
  author={Yang, Zhenyu and Hu, Yuhang and Du, Zemin and Xue, Dizhan and Qian, Shengsheng and Wu, Jiahong and Yang, Fan and Dong, Weiming and Xu, Changsheng},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}