EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

Seoul National University, National University of Singapore

Abstract

Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but also consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency across viewpoints, with results far worse than their single-view performance; (2) when naively finetuned with synchronized videos of both viewpoints, models show improved consistency but often underperform those trained on a single view. To improve this, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.

EgoExo-Con


We introduce EgoExo-Con, a benchmark comprising 491 synchronized ego-exo video pairs and 3,178 temporally bounded event queries, to evaluate whether models can provide consistent predictions across viewpoints, a key indicator of view-invariant video-language understanding. The benchmark focuses on two temporal understanding tasks: temporal verification and temporal grounding. Temporal verification is a binary QA task that asks whether a given event occurs within a specific video moment, while temporal grounding requires identifying the relevant video moment (start and end timestamps) corresponding to an event query. In both tasks, we pose the same event query over synchronized videos from the two viewpoints and check whether the tested models produce correct and consistent responses.
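
The sketch below illustrates one plausible way such scores could be computed: grounding correctness via temporal IoU at an assumed 0.5 threshold, and consistency counted only when the same query is answered correctly under both the ego and the exo clip. This is a minimal sketch under those assumptions, not the benchmark's exact evaluation code.

# Assumed scoring routine for EgoExo-Con-style evaluation; the exact metric may differ.

def iou_1d(pred, gt):
    """Temporal IoU between two (start, end) windows, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def grounding_correct(pred, gt, thresh=0.5):
    """A grounding prediction counts as correct if IoU meets an assumed 0.5 threshold."""
    return iou_1d(pred, gt) >= thresh

def score(results):
    """results: one dict per query with boolean correctness flags for each view,
    e.g. {"ego_correct": True, "exo_correct": False}."""
    n = len(results)
    ego_acc = sum(r["ego_correct"] for r in results) / n
    exo_acc = sum(r["exo_correct"] for r in results) / n
    # Consistency: the same query answered correctly under BOTH viewpoints.
    consistency = sum(r["ego_correct"] and r["exo_correct"] for r in results) / n
    return {"ego": ego_acc, "exo": exo_acc, "consistency": consistency}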

Result


We evaluate advanced closed-source models as well as open-source Video-LLMs, covering both general-purpose and time-aware variants. Our benchmark results reveal that all models, especially the open-source ones, struggle with cross-view consistency. They generally exhibit only a modest performance gap between individual ego and exo videos, yet achieve consistency scores barely over half their single-view performance in both tasks. This suggests that the relatively stable single-view performance may stem from view-specific biases rather than robust cross-view temporal understanding.

View-GRPO


Although synchronized videos depict the same content, the reasoning process often differs across viewpoints because of distinct focuses and perspectives. To address this, we propose a reinforcement learning (RL) framework that guides models toward developing viewpoint-specific reasoning while encouraging shared consistency. Rather than simply enforcing identical outputs, our approach explicitly promotes robust reasoning across viewpoints. We build on Group Relative Policy Optimization (GRPO), which is particularly well-suited because it relies on relative rewards within a group of sampled responses rather than absolute scores. The model is rewarded for producing step-by-step temporal reasoning together with accurate grounding predictions, with reasoning quality scored by LLM judges. We name the overall approach View-GRPO.
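
To make the training signal concrete, the snippet below sketches the group-relative advantage at the core of GRPO, paired with a combined reward that mixes the temporal IoU of the predicted moment with a 0-1 reasoning score from an LLM judge. The reward weights, the judge-score scale, and the group of sampled responses are illustrative assumptions, not the exact View-GRPO recipe.

import numpy as np

def combined_reward(pred_window, gt_window, reasoning_score, w_iou=0.7, w_reason=0.3):
    """Reward one sampled response: temporal IoU of the predicted moment plus a
    normalized LLM-judge reasoning score (weights are illustrative)."""
    inter = max(0.0, min(pred_window[1], gt_window[1]) - max(pred_window[0], gt_window[0]))
    union = max(pred_window[1], gt_window[1]) - min(pred_window[0], gt_window[0])
    iou = inter / union if union > 0 else 0.0
    return w_iou * iou + w_reason * reasoning_score

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO replaces a learned value baseline with group statistics: each sampled
    response is scored relative to the other samples drawn for the same prompt
    (here, the same video/query pair)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of 4 sampled responses for one clip, each with a predicted
# (start, end) window and a hypothetical judge score.
rewards = [combined_reward(p, (3.0, 8.0), s)
           for p, s in [((2.5, 8.5), 0.9), ((0.0, 4.0), 0.4),
                        ((3.2, 7.8), 0.8), ((10.0, 12.0), 0.1)]]
print(group_relative_advantages(rewards))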