R2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
Ye Liu¹,³* Jixuan He²,† Wanhua Li³ Junsik Kim³ Donglai Wei³ Hanspeter Pfister³ Chang Wen Chen¹,‡

Abstract
Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information under different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning (R2-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight R2 Block containing only 1.5% of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, R2 Block recurrently aggregates spatial features from earlier layers, then refines temporal correlation conditioning on the given query, resulting in a coarse-to-fine scheme. R2-Tuning achieves state-of-the-art performance across three VTG tasks (i.e., moment retrieval, highlight detection, and video summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional backbone, demonstrating the significance and effectiveness of the proposed scheme. Our code is available at https://github.com/yeliudev/R2-Tuning.
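The reversed, recurrent traversal described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' R2 Block: the real block is learned, whereas everything here is parameter-free and hypothetical, including the function name `reversed_recurrent_pool`, the fixed `gate` blending factor, the dot-product attention used for spatial pooling and temporal refinement, and the tensor shapes. It only shows the control flow: start from the last layer, walk back toward earlier layers, pool patches with the query, and refine across frames at each step.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reversed_recurrent_pool(layer_feats, query, gate=0.5):
    """Toy stand-in for a reversed recurrent aggregation (assumption, not the paper's block).

    layer_feats: list of arrays shaped (T, P, D), one per encoder layer,
                 ordered from the first layer to the last.
                 T = frames, P = patches per frame, D = channels.
    query:       array shaped (D,), a pooled text embedding.
    gate:        hypothetical fixed blending weight between old state and new input.
    Returns an array shaped (T, D) of frame-level features.
    """
    state = None
    for feats in reversed(layer_feats):  # last layer first, then earlier layers
        dim = feats.shape[-1]
        # Query-conditioned spatial pooling: weight each patch by its
        # similarity to the query, then sum patches within every frame.
        patch_attn = softmax(feats @ query / np.sqrt(dim), axis=1)  # (T, P)
        pooled = (patch_attn[..., None] * feats).sum(axis=1)        # (T, D)
        # Recurrent update: blend the running state with the new layer.
        state = pooled if state is None else gate * state + (1.0 - gate) * pooled
        # Temporal refinement: plain self-attention across frames.
        frame_attn = softmax(state @ state.T / np.sqrt(dim), axis=-1)  # (T, T)
        state = frame_attn @ state
    return state


# Usage with random stand-in features: 4 layers, 5 frames, 3 patches, 8 channels.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((5, 3, 8)) for _ in range(4)]
text_query = rng.standard_normal(8)
out = reversed_recurrent_pool(layers, text_query)
print(out.shape)  # (5, 8)
```

In a learned version, the fixed `gate` and the raw dot products would be replaced by trainable projections, which is where the small share of tunable parameters mentioned in the abstract would live.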
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| highlight-detection-on-qvhighlights | R^2-Tuning | Hit@1: 64.20, mAP: 40.75 |
| moment-retrieval-on-qvhighlights | R^2-Tuning | |