👁️ StreamGaze: Gaze-Guided Temporal Reasoning and
Proactive Understanding in Streaming Videos

UNC Chapel Hill · Adobe Research

📌 TL;DR

We introduce StreamGaze, the first benchmark for evaluating how well MLLMs can use human gaze signals for temporal reasoning in streaming egocentric videos.

Abstract

Streaming video understanding requires models not only to process frames as they arrive, but also to anticipate user intent in realistic applications such as AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction.

📊 StreamGaze Task Taxonomy

StreamGaze provides a unified suite of gaze-conditioned tasks spanning past, present, and proactive reasoning.

Figure 1. StreamGaze's task taxonomy. We introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos.

⏮️ Past Tasks

  • Scene Recall (SR)
  • Object Transition Prediction (OTP)
  • Gaze Sequence Matching (GSM)
  • Non-Fixated Object Identification (NFI)

▶️ Present Tasks

  • Object Identification (OI-Easy/Hard)
  • Object Attribute Recognition (OAR)
  • Gaze Target Anticipation (GTA)

⏭️ Proactive Tasks

  • Future Action Prediction (FAP)
  • Object Appearance Alert (OAA)

🎯 Key Contributions

1. Gaze-Guided Data Construction Pipeline
We propose the first gaze-guided data construction pipeline that integrates gaze trajectories with egocentric video to produce spatio-temporally aligned, gaze-guided QA pairs. Our pipeline models full scanpath dynamics, tracking how attention evolves over time.
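
As a concrete illustration of what "scanpath dynamics" means here, the sketch below shows minimal data structures one could use for raw gaze samples, fixations, and the resulting scanpath. This is an illustrative assumption, not the released pipeline code; the field names and units (seconds, normalized coordinates) are ours.

```python
# Minimal illustrative data structures (not the authors' code) for the gaze
# signals the pipeline reasons about: a raw gaze sample, a fixation, and the
# scanpath that orders fixations over time.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GazeSample:
    t: float              # timestamp in seconds
    x: float              # normalized horizontal gaze coordinate in [0, 1]
    y: float              # normalized vertical gaze coordinate in [0, 1]

@dataclass
class Fixation:
    t_start: float        # fixation onset
    t_end: float          # fixation offset
    x: float              # centroid of the gaze samples in the fixation
    y: float
    object_label: Optional[str] = None   # filled in later by visual prompting

@dataclass
class Scanpath:
    video_id: str
    fixations: List[Fixation] = field(default_factory=list)

    def attention_order(self) -> List[str]:
        """Objects in the order they were fixated (e.g., for sequence questions)."""
        return [f.object_label for f in self.fixations if f.object_label]
```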

2. StreamGaze Benchmark
We introduce StreamGaze, the first benchmark specifically designed for streaming gaze-guided video understanding, comprising 8,521 QA pairs across 10 tasks spanning past, present, and proactive timestamps.

3. Comprehensive Evaluation & Analysis
We evaluate state-of-the-art MLLMs on StreamGaze, uncovering substantial and consistent gaps relative to human performance. Our in-depth analyses reveal that current MLLMs struggle to interpret raw gaze signals.

🔧 Data Construction Pipeline

Our gaze-video QA generation pipeline aligns egocentric videos with raw gaze trajectories through four key stages:

(1) Input: We start from egocentric video sources paired with raw gaze projections.

(2) Fixation Extraction: We first extract fixation moments across the entire video.
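
This page does not spell out the fixation detector; a common choice for turning raw gaze into fixations is dispersion-threshold identification (I-DT). The sketch below assumes that method and reuses the GazeSample/Fixation dataclasses from the sketch above; the dispersion and duration thresholds are illustrative, not the benchmark's settings.

```python
# Hypothetical fixation extraction via dispersion thresholding (I-DT).
# Reuses the GazeSample/Fixation dataclasses sketched above; thresholds are
# assumptions, not the benchmark's settings.
from typing import List

def extract_fixations(samples: List["GazeSample"],
                      max_dispersion: float = 0.03,   # assumed spatial threshold (normalized)
                      min_duration: float = 0.10      # assumed minimum fixation length (s)
                      ) -> List["Fixation"]:
    fixations: List["Fixation"] = []
    window: List["GazeSample"] = []

    def close(win):
        # Keep the window as a fixation only if it lasted long enough.
        if win and win[-1].t - win[0].t >= min_duration:
            fixations.append(Fixation(
                t_start=win[0].t, t_end=win[-1].t,
                x=sum(s.x for s in win) / len(win),
                y=sum(s.y for s in win) / len(win)))

    for s in samples:
        window.append(s)
        xs, ys = [p.x for p in window], [p.y for p in window]
        # Dispersion = horizontal extent + vertical extent of the window.
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
            close(window[:-1])   # the spread-out sample ends the fixation
            window = [s]         # start a new window at the current sample
    close(window)                # flush whatever remains at the end
    return fixations
```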

(3) Region-Specific Visual Prompting: Next, we divide each frame into field-of-view (FOV) and out-of-FOV regions and extract objects within the gaze area.
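
One simple way to realize this split is to treat the gaze area as a patch centered on the fixation point and everything else as the out-of-FOV region; the crop radius below and the downstream detector or VLM prompt are assumptions, not the pipeline's exact procedure.

```python
# Hypothetical region-splitting step for region-specific visual prompting.
# Crops a gaze-centered patch (the "gaze area") out of a frame; the crop
# radius and the downstream object-extraction call are illustrative assumptions.
import numpy as np

def split_gaze_region(frame: np.ndarray, gaze_x: float, gaze_y: float,
                      radius: float = 0.15):
    """frame: HxWx3 image; gaze_x, gaze_y: normalized gaze point in [0, 1].

    Returns (gaze_crop, out_of_view_mask): the patch around the fixation that
    would be prompted for the fixated object, and a boolean mask marking
    everything outside that patch as the out-of-FOV region.
    """
    h, w = frame.shape[:2]
    cx, cy = int(gaze_x * w), int(gaze_y * h)
    r = int(radius * min(h, w))
    x0, x1 = max(cx - r, 0), min(cx + r, w)
    y0, y1 = max(cy - r, 0), min(cy + r, h)

    gaze_crop = frame[y0:y1, x0:x1].copy()
    out_of_view_mask = np.ones((h, w), dtype=bool)
    out_of_view_mask[y0:y1, x0:x1] = False
    return gaze_crop, out_of_view_mask
```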

(4) Scanpath Construction & QA Generation: Finally, we construct scanpaths and generate streaming QA pairs.
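
To make the output concrete, the sketch below turns an object-labeled scanpath into one gaze-sequence style multiple-choice item. The real pipeline's templates and LLM-based QA generation differ; the question wording and distractor sampling here are illustrative only.

```python
# Hypothetical QA construction from an object-labeled scanpath. The actual
# pipeline uses its own templates and LLM prompting; this only illustrates
# a gaze-sequence style multiple-choice item.
import random
from typing import Dict, List

def make_gaze_sequence_qa(fixated_objects: List[str],
                          rng: random.Random) -> Dict:
    """fixated_objects: object labels in the order they were looked at."""
    if len(set(fixated_objects)) < 3:
        raise ValueError("need several distinct objects to build distractors")
    correct = " -> ".join(fixated_objects)
    options = {correct}
    while len(options) < 4:                  # add 3 shuffled-order distractors
        shuffled = fixated_objects[:]
        rng.shuffle(shuffled)
        options.add(" -> ".join(shuffled))
    choices = sorted(options)
    return {
        "question": "In what order did the user look at these objects so far?",
        "choices": choices,
        "answer": choices.index(correct),
    }

example = make_gaze_sequence_qa(["kettle", "mug", "faucet", "stove"],
                                random.Random(0))
```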

📈 Benchmark Comparison

Comparison of streaming video understanding benchmarks with StreamGaze.

StreamGaze is the first and only benchmark that combines gaze signals with proactive reasoning across all temporal dimensions in egocentric streaming videos.

📊 Main Results

Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs (e.g., GPT-4o, InternVL-3.5) and human performance. Current MLLMs struggle to:
• Leverage gaze signals for temporal reasoning
• Model user intention from gaze patterns
• Make proactive predictions based on observed gaze dynamics
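
For readers who want to probe this behavior informally, a rough sketch of a streaming, gaze-conditioned query is shown below. The model interface, gaze-overlaid frames, and timing convention are assumptions, not the benchmark's released evaluation code.

```python
# Rough sketch (assumed interfaces, not the benchmark's evaluation code) of a
# streaming query: at a question's timestamp the model is shown only the
# gaze-overlaid frames observed so far, then asked the question.
from typing import Callable, Dict, List, Sequence

def ask_streaming(model: Callable[[List, str], str],
                  frames_with_gaze: Sequence,   # gaze-overlaid frames, in temporal order
                  question: Dict,               # e.g. {"t": 12.0, "question": ..., "choices": [...]}
                  fps: float = 1.0) -> str:
    visible = list(frames_with_gaze[: int(question["t"] * fps)])  # no future frames
    prompt = (question["question"]
              + "\nOptions: " + "; ".join(question["choices"]))
    return model(visible, prompt)   # any MLLM wrapper taking (frames, text)
```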

BibTeX

@misc{lee2025streamgazegazeguidedtemporalreasoning,
  title={StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos},
  author={Daeun Lee and Subhojyoti Mukherjee and Branislav Kveton and Ryan A. Rossi and Viet Dac Lai and Seunghyun Yoon and Trung Bui and Franck Dernoncourt and Mohit Bansal},
  year={2025},
  eprint={2512.01707},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.01707}
}