Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning


Dazhao Du¹,², Jian Liu¹, Jialong Qin¹, Tao Han¹, Bohai Gu¹, Fangqi Zhu¹, Yujia Zhang², Eric Liu², Xi Chen², Song Guo¹,†

¹Hong Kong University of Science and Technology ²Tencent
†Corresponding Author


Abstract

Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model.

Key Results

+7.7 DyBench P-Acc (Qwen3-VL-8B)

+8.2 TimeBlind I-Acc (Qwen3-VL-8B)

+9.4 DyBench P-Acc (Qwen3-VL-4B)

3,014 DyBench videos (paired)

94% Task Router accuracy

Motivation

Figure: (a) A static-shortcut model answers correctly without tracking motion. (b) Accuracy on dynamic tasks strongly correlates with spatiotemporal sensitivity (Pearson r = −0.87).

Current Video LLMs often answer correctly by exploiting static shortcuts—single-frame cues and language priors—rather than tracking how events unfold over time. This problem becomes especially consequential in RL post-training: GRPO-style RL typically relies on correctness-only rewards, so if a single frame or a language-based guess is enough to answer a training question, the policy can receive high reward without tracking video dynamics.

Key Observation: A spatiotemporally sensitive model should respond predictably to controlled counterfactual transformations. If a video of an object moving right is horizontally flipped or temporally reversed, the model should change its answer; if the question asks about a static attribute (e.g., object presence or color), the answer should remain unchanged.
      

We call this property spatiotemporal sensitivity: equivariance for dynamic questions and invariance for static questions. Such a behavioral signature is difficult for static shortcut policies to satisfy consistently.
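One way to write this property down, as a sketch rather than the paper's own formulation: let f(v, q) denote the model's answer to question q on video v, and let T be the counterfactual transformation (horizontal flip or temporal reversal). The symbols f, v, q, and T are notation introduced here for illustration. The inequality in the first case is the label-free reading of "the answer should change" used in the text, not full equivariance in the group-theoretic sense.

```latex
\[
\begin{cases}
f(T(v),\, q) \neq f(v,\, q), & \text{if } q \text{ is a dynamic question (equivariance)},\\[4pt]
f(T(v),\, q) = f(v,\, q), & \text{if } q \text{ is a static question (invariance)}.
\end{cases}
\]
```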

Method: CRPO

CRPO is a dual-branch extension of GRPO that rewards the relation between answers to factual and counterfactual videos.

Figure: Overview of the CRPO framework. Left: dual-branch training with original and counterfactual videos. Right: Task Router using a text-only reasoning model to classify question type and select the appropriate counterfactual transformation.

🔀 Task Router

A text-only reasoning model (DeepSeek-R1) classifies each question as Spatial, Temporal, Spatiotemporal, or Static, and the router then selects the appropriate counterfactual transformation: horizontal flip or temporal reversal. The router runs offline before training, so it adds no cost to the training loop.
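The routing step can be illustrated with a minimal sketch. The four labels and the two transformations come from the description above; the specific label-to-transform table (in particular the entries for Spatiotemporal and Static), the function names, and the nested-list video representation are assumptions made for illustration, and the classifier call itself is not shown.

```python
# Hypothetical router-label -> transform table. Only the four labels and the
# two transforms come from the text; the pairing below is assumed.
TRANSFORM_FOR_TYPE = {
    "spatial": "hflip",
    "temporal": "reverse",
    "spatiotemporal": "reverse",  # assumption
    "static": "hflip",            # assumption; the answer is expected to stay the same
}


def horizontal_flip(frames):
    """Mirror each frame left-to-right; frames[t][y][x] is one pixel."""
    return [[row[::-1] for row in frame] for frame in frames]


def temporal_reverse(frames):
    """Play the clip backwards by reversing the frame order."""
    return frames[::-1]


def make_counterfactual(frames, question_type):
    """Return (counterfactual frames, transform name, whether the answer should change)."""
    label = question_type.lower()
    name = TRANSFORM_FOR_TYPE[label]
    transform = horizontal_flip if name == "hflip" else temporal_reverse
    expect_change = label != "static"
    return transform(frames), name, expect_change
```

Because the labels depend only on the question text, a table like this could be computed once and cached alongside the training set.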

🎬 Dual-Branch Training

For each training prompt, CRPO generates rollouts from both the original video and its counterfactual counterpart. Both branches contribute policy gradients, making the counterfactual branch a direct training signal rather than a passive diagnostic.
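A rough sketch of how both branches could feed the policy update, assuming the standard GRPO group-normalized advantage and assuming each branch is normalized within its own rollout group. The source does not state how the groups are formed or how per-rollout rewards are composed, so the function names and the per-branch normalization are illustrative only.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dual_branch_advantages(rewards_orig, rewards_cf):
    """Advantages for original-video rollouts followed by counterfactual rollouts.

    Assumption: each branch is normalized separately, then both sets of
    rollouts are used in the same policy-gradient step.
    """
    return group_advantages(rewards_orig) + group_advantages(rewards_cf)
```

With this layout the counterfactual rollouts carry their own non-zero advantages whenever their rewards differ, which is what makes the second branch a training signal rather than a diagnostic.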

🏆 Counterfactual Relation Reward (CRR)

CRR rewards the policy when answers across the two branches match the expected behavior: answer changes for dynamic questions (equivariance) and answer agreement for static questions (invariance). This cross-branch constraint is difficult for single-frame or language shortcuts to satisfy consistently.
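The rule described above can be sketched as a small reward function. The binary 0/1 reward value, the answer normalization, and the function name are assumptions; the source does not specify the reward magnitude or how rollouts from the two branches are paired.

```python
def counterfactual_relation_reward(answer_orig, answer_cf, is_dynamic):
    """Return 1.0 when the cross-branch answer relation matches the expectation.

    Dynamic question: the two answers should differ (equivariance).
    Static question: the two answers should agree (invariance).
    """
    changed = answer_orig.strip().upper() != answer_cf.strip().upper()
    return 1.0 if changed == is_dynamic else 0.0
```

A policy that ignores the video and always answers "A" receives this reward on static questions only, never on dynamic ones.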

❓ Null Option

A "None of the above" option is appended to all multiple-choice questions. This allows the counterfactual branch to express answer changes even when the transformed correct answer is not explicitly listed, without requiring counterfactual answer labels.
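A minimal sketch of appending the null option to a multiple-choice question. The letter-label format and the helper name are assumptions for illustration; only the option text and the fact that it is appended come from the description above.

```python
import string

NULL_OPTION = "None of the above"


def add_null_option(options):
    """Append the null option (once) and return letter-labelled choices."""
    choices = list(options)
    if NULL_OPTION not in choices:
        choices.append(NULL_OPTION)
    return [f"{letter}. {text}" for letter, text in zip(string.ascii_uppercase, choices)]
```

For a left/right question, the flipped video of an object moving up has no listed alternative, so the extra choice gives the counterfactual branch a valid way to change its answer.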

DyBench: A Paired Counterfactual Benchmark

Figure: DyBench examples for each task category (left) and dataset statistics including video duration distribution and source breakdown (right).

We introduce DyBench, a paired counterfactual benchmark of 1,507 video pairs (3,014 videos) targeting motion direction, event order, and the arrow of time — the three aspects of a video most readily masked by static shortcuts.

⏪ Reversible Dynamics

Asks whether a change happens forward or backward in time (e.g., opening vs. closing a door, a flower blooming vs. closing). Built by playing clips forward and backward.

3-way MCQ

➡️ Moving Direction

Asks the direction in which an object or actor moves (e.g., left vs. right). Built by horizontal flip and/or temporal reversal of tracking videos.

4-way MCQ

🔢 Event Sequence

Asks the order in which two events occur (e.g., pour milk then pour cereals vs. pour cereals then pour milk). Built by concatenating two action segments in both orders.

Binary MCQ
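As an illustration of the Event Sequence construction, the sketch below concatenates two action segments in both orders and records the matching answer for each. The dictionary layout, the option wording, and the function name are hypothetical; frames are represented as plain lists for simplicity.

```python
def build_event_sequence_pair(segment_a, segment_b, name_a, name_b):
    """Build two videos that differ only in the order of the two events."""
    options = [f"{name_a} then {name_b}", f"{name_b} then {name_a}"]
    forward = {"frames": segment_a + segment_b, "options": options, "answer": 0}
    swapped = {"frames": segment_b + segment_a, "options": options, "answer": 1}
    return forward, swapped
```

Both videos contain exactly the same frames, so any cue available from a single frame is identical across the pair and only the ordering distinguishes the two answers.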

Pair Accuracy (P-Acc): A pair is counted as correct only when the model answers both videos correctly. By construction, P-Acc cannot be inflated by static-shortcut policies that always return the same answer: such policies score zero on every pair.
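The difference between per-video accuracy and pair accuracy can be seen in a short sketch. The input layout (a list of prediction/gold tuples per pair) and the function name are assumptions for illustration, and scores are returned as fractions rather than percentages.

```python
def accuracy_and_pair_accuracy(pairs):
    """pairs: list of ((pred_1, gold_1), (pred_2, gold_2)), one entry per video pair."""
    videos_correct = 0
    pairs_correct = 0
    for (pred_1, gold_1), (pred_2, gold_2) in pairs:
        ok_1 = pred_1 == gold_1
        ok_2 = pred_2 == gold_2
        videos_correct += int(ok_1) + int(ok_2)
        pairs_correct += int(ok_1 and ok_2)  # both videos must be right
    acc = videos_correct / (2 * len(pairs))
    p_acc = pairs_correct / len(pairs)
    return acc, p_acc


# A fixed-answer policy on pairs whose two gold answers differ.
constant_policy = [(("A", "A"), ("A", "B")), (("A", "B"), ("A", "A"))]
print(accuracy_and_pair_accuracy(constant_policy))
```

In this toy case the fixed-answer policy is right on half of the individual videos but on none of the pairs.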
      

DyBench draws videos from diverse sources including Something-Something-v2, GOT-10k, Breakfast, 50Salads, and more, covering humans, objects, animals, and plants. All pairs undergo manual verification for static-content consistency and temporal minimality.

Main Results

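CRPO achieves the highest scores among the RL methods on DyBench, TimeBlind, and TempCompass at both model scales, while remaining competitive on the general video benchmarks VideoMME and MVBench. On Qwen3-VL-4B, T-GRPO stays ahead on VideoMME (63.4 vs. 63.0) and MVBench (70.3 vs. 68.6).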

Model	DyBench Acc	DyBench P-Acc	TimeBlind Acc	TimeBlind I-Acc	TempCompass Acc	VideoMME Acc	MVBench Acc
Proprietary Models
GPT-5.1	63.7	44.9	67.3	27.0	76.4	72.9	62.9
Gemini-3.1-Pro	82.2	71.7	77.2	45.5	74.5	74.3	73.6
Open-Source Models
LLaVA-OV-7B	48.4	19.6	56.5	7.8	60.5	57.6	54.7
InternVL3-8B	69.5	50.2	63.0	18.3	70.5	65.6	74.3
Qwen2.5-VL-7B	64.5	44.1	65.0	22.5	70.9	60.3	67.1
RL Post-Training on Qwen3-VL-4B
Qwen3-VL-4B (base)	65.1	45.4	66.2	26.5	71.6	62.2	67.5
+ GRPO	66.6	48.2	67.8	28.0	72.2	62.1	68.7
+ T-GRPO	68.6	49.8	67.9	29.0	73.5	63.4	70.3
+ ArrowRL	66.7	47.5	67.5	26.8	71.6	62.2	69.9
+ CRPO (Ours)	70.3 ↑5.2	54.8 ↑9.4	69.8 ↑3.6	31.7 ↑5.2	74.2 ↑2.6	63.0 ↑0.8	68.6 ↑1.1
RL Post-Training on Qwen3-VL-8B
Qwen3-VL-8B (base)	68.2	50.4	67.8	27.8	75.1	64.9	68.7
+ GRPO	69.4	52.4	69.8	30.5	75.5	64.5	69.1
+ T-GRPO	70.9	54.5	70.3	32.5	75.4	65.0	69.5
+ ArrowRL	69.6	52.2	69.3	29.3	74.9	65.1	69.7
+ CRPO (Ours)	72.5 ↑4.3	58.1 ↑7.7	71.7 ↑3.9	36.0 ↑8.2	77.4 ↑2.3	65.6 ↑0.7	69.7 ↑1.0

↑ indicates improvement over the corresponding base model. Rows marked "(Ours)" are our method.

Qualitative Results

Comparison of Qwen3-VL (base) and CRPO on counterfactual video pairs. CRPO correctly changes its answer when the video is transformed, while the base model often gives the same answer regardless of the transformation.


Figure: Moving direction example. Left pair: original video. Right pair: horizontally flipped video. The base model (red) gives the same wrong answer on both; CRPO (green) correctly changes its answer.


Figure: Reversible dynamics example. Left pair: original video (forward). Right pair: temporally reversed video. CRPO correctly changes its answer when the video is reversed, while the base model fails to distinguish the two.

Training Analysis

Training curves show that all CRPO reward components grow steadily throughout training, suggesting that the policy learns spatiotemporal sensitivity rather than converging to shortcuts.

Figure: Comparison of training curves across RL methods. CRPO achieves a significantly higher auxiliary (CRR) reward while maintaining competitive correctness reward.

Resources

📦 Dataset

🤗 DyBench: 3,014 paired counterfactual videos for spatiotemporal evaluation

🤖 Models

🤗 Qwen3-VL-8B-CRPO: CRPO fine-tuned Qwen3-VL-8B-Instruct

🤗 Qwen3-VL-4B-CRPO: CRPO fine-tuned Qwen3-VL-4B-Instruct

Contributions

CRPO Framework: A simple dual-branch RL framework for improving spatiotemporal sensitivity in Video LLMs. CRPO trains on both original and counterfactual branches and uses a Counterfactual Relation Reward (CRR) to reward equivariant or invariant answer relations, discouraging shortcut reliance without requiring counterfactual labels or costly spatiotemporal evidence annotations.

DyBench Benchmark: A 3,014-video paired counterfactual benchmark with strict pair accuracy, covering reversible dynamics, moving direction, and event sequence. Experiments show that CRPO improves results on spatiotemporal-sensitive evaluations such as DyBench and TimeBlind while maintaining competitive general video performance.

Analysis: Comprehensive ablation studies showing that the dual-branch optimization (not just auxiliary rewards) is essential, and that CRPO's gains cannot be explained by simply doubling data or rollouts. Training curve analysis reveals that CRR actively grows throughout training, indicating the policy learns to satisfy the counterfactual relation.

BibTeX

@article{du2026crpo,
  title={Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning},
  author={Du, Dazhao and Liu, Jian and Qin, Jialong and Han, Tao and Gu, Bohai and Zhu, Fangqi and Zhang, Yujia and Liu, Eric and Chen, Xi and Guo, Song},
  journal={arXiv preprint arXiv:2605.21988},
  year={2026},
  url={https://arxiv.org/abs/2605.21988}
}
