EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

Minjoon Jung, Byoung-Tak Zhang, Lorenzo Torresani
Seoul National University · Northeastern University

Abstract

Video temporal grounding (VTG) takes an untrimmed video and a natural language query as input and localizes the temporal moment that best matches the query. While remarkable progress has been made, existing methods still rely on large-scale, task-specific datasets that require costly and labor-intensive manual annotation. In this paper, we introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal information from raw videos without any human-labeled data. The proposer generates query–moment pairs from raw videos, while the solver learns to ground them and provides feedback that improves the proposer in return. Through this self-reinforcing loop, driven entirely by reinforcement learning, the two agents mutually improve each other across iterations. Despite never seeing any human-labeled training data, EvoGround matches or outperforms most fully supervised models across multiple VTG benchmarks, while also demonstrating strong fine-grained video captioning capability.

Overview


The Self-evolution Loop of EvoGround. The proposer generates query (q)–moment (m) pairs from a given video, and the solver learns from these pairs and provides feedback signals through its moment predictions (m̂). As the process is repeated, the two agents mutually improve each other.

Method

Overview of EvoGround.

EvoGround consists of two agents: a proposer and a solver. The proposer identifies candidate temporal events from raw videos and generates corresponding query–moment pairs, while the solver learns to ground temporal moments using the generated data. Both agents are initialized from the same backbone and evolve solely through a self-reinforcing loop without any labeled data.
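To make the loop concrete, the sketch below gives one plausible rendering of a single evolution run in Python. It is a minimal sketch under stated assumptions: propose, train, and update are hypothetical method names, and the threshold schedule is illustrative, since the page describes the curriculum only qualitatively.

def self_evolve(videos, proposer, solver, iterations=3, delta=0.3):
    """Assumed sketch of one EvoGround co-evolution run over raw, unlabeled videos."""
    for _ in range(iterations):
        # 1) The proposer turns raw videos into candidate query-moment pairs.
        pairs = [pair for video in videos for pair in proposer.propose(video)]
        # 2) The solver trains on the generated pairs via RL and reports a
        #    per-pair timestamp-aware IoU against the proposed moments.
        tious = solver.train(pairs)
        # 3) Pairs the solver grounds above the solvability threshold count
        #    as learnable and reward the proposer (its feedback reward).
        proposer.update(pairs, rewards=[float(t >= delta) for t in tious])
        # 4) Curriculum (illustrative schedule): raise the solvability
        #    threshold, shifting from coarse matches to precise alignments.
        delta = min(delta + 0.1, 0.7)
    return proposer, solver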

The proposer is guided by three reward criteria: validity (format reward), consistency (consistency reward), and solvability (feedback reward). The consistency reward measures intra-consistency (how coherently a query aligns with frames within its moment) and inter-consistency (how discriminatively the query matches its own moment relative to others). The feedback reward uses the solver's accuracy (measured by timestamp-aware IoU) as a signal of whether generated pairs are learnable.
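A minimal sketch of how the two consistency terms could be scored, assuming the query and frames are embedded by some vision-language encoder with L2-normalized features; the helper name, the equal weighting, and the embedding source are assumptions, as the page only states what each term measures.

import numpy as np

def consistency_reward(query_emb, frame_embs, moment, alpha=0.5):
    """Hypothetical intra-/inter-consistency score for a proposed (query, moment) pair.

    query_emb:  (d,) L2-normalized embedding of the generated query
    frame_embs: (T, d) L2-normalized per-frame embeddings of the video
    moment:     (start, end) frame indices with start < end
    """
    sims = frame_embs @ query_emb                 # cosine similarity per frame
    start, end = moment
    inside = sims[start:end]
    outside = np.concatenate([sims[:start], sims[end:]])
    intra = float(inside.mean())                  # coherence within the moment
    # Discriminability: how much better the query matches its own moment
    # than the rest of the video.
    inter = intra - (float(outside.mean()) if outside.size else 0.0)
    return alpha * intra + (1 - alpha) * inter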

The solver is trained on proposer-generated query–moment pairs using a format reward and an accuracy reward (tIoU). EvoGround adopts GDPO as its RL optimizer, together with a curriculum that progressively raises the solvability threshold across iterations, shifting focus from coarse matches toward more precise temporal alignments.
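For reference, the solver's accuracy signal follows the standard temporal-IoU definition; the additive combination with the format reward in solver_reward below is an assumption, since the page only names the two components.

def tiou(pred, gt):
    """Timestamp-aware IoU between predicted and reference (start, end) spans, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def solver_reward(pred, gt, well_formatted, w_format=0.5, w_acc=0.5):
    """Assumed additive mix of the format reward and the tIoU accuracy reward."""
    return w_format * float(well_formatted) + w_acc * tiou(pred, gt)

For example, tiou((2.0, 7.5), (3.0, 8.0)) = 4.5 / 6.0 = 0.75, i.e., a prediction that would count as correct under R1@0.7.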

Results

Despite never seeing any manual annotations during training, EvoGround demonstrates strong performance across all benchmarks. It surpasses all SFT-based models and achieves first- or second-best performance among the RL-based models, which themselves rely on manually labeled data.

Charades-STA & ActivityNet-Captions

Method            | Charades-STA                 | ActivityNet-Captions
                  | R1@0.3  R1@0.5  R1@0.7  mIoU | R1@0.3  R1@0.5  R1@0.7  mIoU
SFT-based Models
TimeChat          |    -     32.2    13.4   32.2 |  36.2    20.2     9.5   21.8
VTimeLLM          |  51.0    27.5    11.4   31.2 |  44.0    27.8    14.3   30.4
Qwen2.5-VL        |  68.5    48.8    22.5   45.0 |  38.7    25.6    14.9   28.6
TimeSuite         |  69.9    48.7    24.0    -   |    -       -       -      -
VideoChat-Flash   |  74.5    53.1    27.6    -   |    -       -       -      -
RL-based Models
Time-R1           |  78.1    60.8    35.3   58.1 |  58.6    39.0    21.4   40.5
EvoGround (Ours)  |  76.4    58.6    33.6   52.5 |  61.6    42.5    25.0   42.8

TVGBench, ReXTime & E.T.Bench

Method            | TVGBench                     | ReXTime              | E.T.Bench
                  | R1@0.3  R1@0.5  R1@0.7  mIoU | R1@0.3  R1@0.5  mIoU | TVG F1
SFT-based Models
TimeChat          |  22.4    11.9     5.3    -   |  14.4     7.6   11.6 |  26.2
Qwen2.5-VL        |  28.1    19.5    10.5   20.4 |  16.3    11.2   13.3 |  46.6
TRACE             |  37.0    25.5    14.6    -   |    -       -      -  |    -
RL-based Models
VideoChat-R1.5    |  31.5    20.8    11.3   22.0 |  28.3    18.0   20.5 |  50.3
Time-R1           |  39.3    28.0    16.0   27.8 |  32.2    22.1   24.1 |  69.4
EvoGround (Ours)  |  42.1    28.5    15.3   29.4 |  33.5    22.4   25.5 |  69.8

On TemporalBench (fine-grained video captioning), EvoGround outperforms all prior models across every metric — including CIDEr and BLEU — despite never being explicitly trained for captioning. This suggests that the self-evolution loop fosters a broader and more grounded understanding of video content.

Analysis

Reward dynamics across iterations. As the proposer evolves, the solver's grounding accuracy rises correspondingly from one iteration to the next.


Improvements across iterations. GDPO consistently outperforms GRPO. Increasing the solvability threshold δ improves performance on longer videos and moments.


Data distributions across reward configurations. Each reward shapes the generated data differently — the consistency reward produces tighter moments, while the feedback reward enriches query descriptiveness.


Qualitative Results

Qualitative captioning example.

EvoGround generates detailed, temporally grounded descriptions of video content. The proposer produces event-level captions that accurately capture the sequence of actions and participants, closely matching the ground truth despite receiving no caption supervision.

BibTeX

@article{jung2025evoground,
  title     = {EvoGround: Self-Evolving Video Agents for Video Temporal Grounding},
  author    = {Jung, Minjoon and Zhang, Byoung-Tak and Torresani, Lorenzo},
  journal   = {arXiv preprint},
  year      = {2025}
}