Learning Spatiotemporal Sensitivity in Video LLMs
via Counterfactual Reinforcement Learning
Abstract
Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model.
Motivation
Figure: (a) A static-shortcut model answers correctly without tracking motion. (b) Accuracy on dynamic tasks strongly correlates with spatiotemporal sensitivity (Pearson r = -0.87).
Current Video LLMs often answer correctly by exploiting static shortcuts (single-frame cues and language priors) rather than tracking how events unfold over time. This problem becomes especially consequential in RL post-training: GRPO-style RL typically relies on correctness-only rewards, so if a single frame or a language-based guess is enough to answer a training question, the policy can receive high reward without tracking video dynamics.
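We call the desired property spatiotemporal sensitivity. When the video is changed by a counterfactual transformation while the question stays fixed, the answer should change for dynamic questions (equivariance) and stay the same for static questions (invariance). Such a behavioral signature is difficult for static shortcut policies to satisfy consistently.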
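One way to write this property down is sketched below. The notation is an assumption introduced here for illustration and is not taken from the source: $f(v, q)$ is the model's answer to question $q$ on video $v$, $\tau$ is a counterfactual transformation (horizontal flip or temporal reversal), and $\tau_q$ is the map that $\tau$ induces on answers (for example, swapping "left" and "right").

```latex
\begin{aligned}
\text{dynamic } q \text{ (equivariance):} \quad
  & f(\tau(v), q) = \tau_q\big(f(v, q)\big) \neq f(v, q) \\
\text{static } q \text{ (invariance):} \quad
  & f(\tau(v), q) = f(v, q)
\end{aligned}
```

A policy that ignores the video, or reads only a frame whose content is unaffected by $\tau$, returns the same answer on both sides and therefore cannot satisfy the first line.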
Method: CRPO
CRPO is a dual-branch extension of GRPO that rewards the relation between answers to factual and counterfactual videos.
Figure: Overview of the CRPO framework. Left: dual-branch training with original and counterfactual videos. Right: Task Router using a text-only reasoning model to classify question type and select the appropriate counterfactual transformation.
Task Router
A text-only reasoning model (DeepSeek-R1) classifies each question as Spatial, Temporal, Spatiotemporal, or Static and selects the appropriate counterfactual transformation: horizontal flip or temporal reversal. The router runs offline before training, so it adds no cost to the training loop.
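The two transformations themselves are simple to state. The following is a minimal sketch, assuming a video is represented as a nested list of frames, rows, and pixels; the function names and the `TRANSFORMS` lookup are hypothetical and not part of the released code. Which question type is routed to which transformation is decided by the router and is not modeled here.

```python
def horizontal_flip(video):
    """Mirror every frame left-to-right (reverse each pixel row)."""
    return [[row[::-1] for row in frame] for frame in video]


def temporal_reverse(video):
    """Play the clip backward by reversing the order of frames."""
    return video[::-1]


# Hypothetical lookup from a transformation name to its function.
TRANSFORMS = {"hflip": horizontal_flip, "treverse": temporal_reverse}


def make_counterfactual(video, transform):
    """Apply the named transformation and return the counterfactual video."""
    return TRANSFORMS[transform](video)


# Toy clip: 3 frames, each with one row of two pixels.
clip = [
    [[0, 1]],
    [[2, 3]],
    [[4, 5]],
]
flipped = make_counterfactual(clip, "hflip")
reversed_clip = make_counterfactual(clip, "treverse")
```

Both transformations keep the set of pixel values intact and are their own inverses, so applying either one twice recovers the original clip.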
Dual-Branch Training
For each training prompt, CRPO generates rollouts from both the original video and its counterfactual counterpart. Both branches contribute policy gradients, making the counterfactual branch a direct training signal rather than a passive diagnostic.
Counterfactual Relation Reward (CRR)
CRR rewards the policy when answers across the two branches match the expected behavior: answer changes for dynamic questions (equivariance) and answer agreement for static questions (invariance). This cross-branch constraint is difficult for single-frame or language shortcuts to satisfy consistently.
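The relation check described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the reward is taken to be binary, answers are compared as normalized option letters, and the function name is hypothetical. The actual reward may be scaled or combined with other terms.

```python
def counterfactual_relation_reward(answer_orig, answer_cf, is_dynamic):
    """Return 1.0 when the cross-branch relation matches the question type.

    Assumption: dynamic questions expect the answer to change between the
    original and counterfactual video; static questions expect agreement.
    """
    changed = answer_orig.strip().upper() != answer_cf.strip().upper()
    return float(changed == is_dynamic)


# Dynamic question, answer changes across branches: rewarded.
r_dynamic_changed = counterfactual_relation_reward("A", "B", is_dynamic=True)
# Dynamic question, same answer on both branches: not rewarded.
r_dynamic_same = counterfactual_relation_reward("A", "a", is_dynamic=True)
# Static question, same answer on both branches: rewarded.
r_static_same = counterfactual_relation_reward("C", "C", is_dynamic=False)
```

A policy that answers from a single frame or from the question text alone produces the same answer on both branches, so under this sketch it earns the reward only on static questions.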
Null Option
A "None of the above" option is appended to all multiple-choice questions. This allows the counterfactual branch to express answer changes even when the transformed correct answer is not explicitly listed, without requiring counterfactual answer labels.
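As a rough illustration of the null option, the sketch below appends "None of the above" to an option list and renders a lettered prompt. The helper names and the prompt layout are assumptions made for this example; the source does not specify the exact prompt format.

```python
import string

NULL_OPTION = "None of the above"


def append_null_option(options):
    """Return a copy of the options with the null option added once."""
    options = list(options)
    if NULL_OPTION not in options:
        options.append(NULL_OPTION)
    return options


def format_mcq(question, options):
    """Render a multiple-choice prompt with lettered options (A, B, C, ...)."""
    lines = [question]
    for letter, text in zip(string.ascii_uppercase, append_null_option(options)):
        lines.append(f"{letter}. {text}")
    return "\n".join(lines)


# After a horizontal flip the true direction becomes "Right", which is not
# listed, so the counterfactual branch can still change its answer via D.
prompt = format_mcq("Which way does the car move?", ["Left", "Up", "Down"])
```

In this toy question the original answer is "Left"; on the flipped video the correct direction is absent from the list, and the null option gives the model a valid way to move away from "Left" without anyone having to write a counterfactual label.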
DyBench: A Paired Counterfactual Benchmark
Figure: DyBench examples for each task category (left) and dataset statistics including video duration distribution and source breakdown (right).
We introduce DyBench, a paired counterfactual benchmark of 1,507 video pairs (3,014 videos) targeting motion direction, event order, and the arrow of time, the three aspects of a video most readily masked by static shortcuts.
Reversible Dynamics
Asks whether a change happens forward or backward in time (e.g., opening vs. closing a door, a flower blooming vs. closing). Built by playing clips forward and backward.
Format: 3-way MCQ
Moving Direction
Asks the direction in which an object or actor moves (e.g., left vs. right). Built by horizontal flip and/or temporal reversal of tracking videos.
Format: 4-way MCQ
Event Sequence
Asks the order in which two events occur (e.g., pour milk then pour cereals vs. pour cereals then pour milk). Built by concatenating two action segments in both orders.
Format: Binary MCQ
DyBench draws videos from diverse sources including Something-Something-v2, GOT-10k, Breakfast, 50Salads, and more, covering humans, objects, animals, and plants. All pairs undergo manual verification for static-content consistency and temporal minimality.
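The Event Sequence construction can be illustrated with a toy example. This is a sketch only: frames are stood in for by string labels, and the function name is hypothetical. It shows why such a pair resists single-frame shortcuts, since both videos contain exactly the same frames and differ only in order.

```python
def event_sequence_pair(segment_a, segment_b):
    """Concatenate two action segments in both orders to form a pair."""
    return segment_a + segment_b, segment_b + segment_a


# Frame labels stand in for real frames.
pour_milk = ["milk_0", "milk_1"]
pour_cereals = ["cereals_0", "cereals_1"]

milk_first, cereals_first = event_sequence_pair(pour_milk, pour_cereals)

# Same static content (identical multiset of frames), different order.
same_static_content = sorted(milk_first) == sorted(cereals_first)
different_order = milk_first != cereals_first
```

Any model that scores frames independently sees identical evidence in the two videos, so it cannot give the two different answers that the pair requires.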
Main Results
CRPO outperforms all RL baselines on every spatiotemporal-sensitive benchmark while maintaining competitive general video performance.
| Model | DyBench Acc | DyBench P-Acc | TimeBlind Acc | TimeBlind I-Acc | TempCompass Acc | VideoMME Acc | MVBench Acc |
|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | |
| GPT-5.1 | 63.7 | 44.9 | 67.3 | 27.0 | 76.4 | 72.9 | 62.9 |
| Gemini-3.1-Pro | 82.2 | 71.7 | 77.2 | 45.5 | 74.5 | 74.3 | 73.6 |
| Open-Source Models | | | | | | | |
| LLaVA-OV-7B | 48.4 | 19.6 | 56.5 | 7.8 | 60.5 | 57.6 | 54.7 |
| InternVL3-8B | 69.5 | 50.2 | 63.0 | 18.3 | 70.5 | 65.6 | 74.3 |
| Qwen2.5-VL-7B | 64.5 | 44.1 | 65.0 | 22.5 | 70.9 | 60.3 | 67.1 |
| RL Post-Training on Qwen3-VL-4B | | | | | | | |
| Qwen3-VL-4B (base) | 65.1 | 45.4 | 66.2 | 26.5 | 71.6 | 62.2 | 67.5 |
| + GRPO | 66.6 | 48.2 | 67.8 | 28.0 | 72.2 | 62.1 | 68.7 |
| + T-GRPO | 68.6 | 49.8 | 67.9 | 29.0 | 73.5 | 63.4 | 70.3 |
| + ArrowRL | 66.7 | 47.5 | 67.5 | 26.8 | 71.6 | 62.2 | 69.9 |
| + CRPO (Ours) | 70.3 ↑5.2 | 54.8 ↑9.4 | 69.8 ↑3.6 | 31.7 ↑5.2 | 74.2 ↑2.6 | 63.0 ↑0.8 | 68.6 ↑1.1 |
| RL Post-Training on Qwen3-VL-8B | | | | | | | |
| Qwen3-VL-8B (base) | 68.2 | 50.4 | 67.8 | 27.8 | 75.1 | 64.9 | 68.7 |
| + GRPO | 69.4 | 52.4 | 69.8 | 30.5 | 75.5 | 64.5 | 69.1 |
| + T-GRPO | 70.9 | 54.5 | 70.3 | 32.5 | 75.4 | 65.0 | 69.5 |
| + ArrowRL | 69.6 | 52.2 | 69.3 | 29.3 | 74.9 | 65.1 | 69.7 |
| + CRPO (Ours) | 72.5 ↑4.3 | 58.1 ↑7.7 | 71.7 ↑3.9 | 36.0 ↑8.2 | 77.4 ↑2.3 | 65.6 ↑0.7 | 69.7 ↑1.0 |
↑ indicates improvement over the corresponding base model. Rows marked (Ours) are our method.
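The gap between Acc and P-Acc in the DyBench columns is easier to read with a small example. The source describes P-Acc only as a strict pair-accuracy metric; the sketch below assumes that a pair counts as correct only when both of its videos are answered correctly. The function names and the toy data are hypothetical.

```python
def video_accuracy(pair_preds, pair_labels):
    """Fraction of individual videos answered correctly."""
    hits = total = 0
    for preds, labels in zip(pair_preds, pair_labels):
        for p, y in zip(preds, labels):
            hits += p == y
            total += 1
    return hits / total


def pair_accuracy(pair_preds, pair_labels):
    """Fraction of pairs where BOTH videos are answered correctly (assumed)."""
    hits = sum(preds == labels for preds, labels in zip(pair_preds, pair_labels))
    return hits / len(pair_labels)


# Each tuple is (answer on original video, answer on counterfactual video).
labels = [("A", "B"), ("B", "A"), ("A", "B"), ("B", "A")]
# A fixed-answer shortcut that always says "A", whatever the video shows.
fixed_preds = [("A", "A")] * len(labels)

acc = video_accuracy(fixed_preds, labels)
p_acc = pair_accuracy(fixed_preds, labels)
```

Under this assumed definition, the fixed-answer policy is right on half of the individual videos but on none of the pairs, which is the kind of inflation a paired metric is meant to remove.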
Qualitative Results
Comparison of Qwen3-VL (base) and CRPO on counterfactual video pairs. CRPO correctly changes its answer when the video is transformed, while the base model often gives the same answer regardless of the transformation.
Figure: Moving direction example. Left pair: original video. Right pair: horizontally flipped video. The base model (red) gives the same wrong answer on both; CRPO (green) correctly changes its answer.
Figure: Reversible dynamics example. Left pair: original video (forward). Right pair: temporally reversed video. CRPO correctly changes its answer when the video is reversed, while the base model fails to distinguish the two.
Training Analysis
Training curves show that all CRPO reward components grow steadily throughout training, which suggests the policy is learning spatiotemporal sensitivity rather than converging to shortcuts.
Figure: Comparison of training curves across RL methods. CRPO achieves a significantly higher auxiliary (CRR) reward while maintaining competitive correctness reward.
BibTeX
@article{du2026crpo,
title={Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning},
author={Du, Dazhao and Liu, Jian and Qin, Jialong and Han, Tao and Gu, Bohai and Zhu, Fangqi and Zhang, Yujia and Liu, Eric and Chen, Xi and Guo, Song},
journal={arXiv preprint arXiv:2605.21988},
year={2026},
url={https://arxiv.org/abs/2605.21988}
}