Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
AJ Piergiovanni Weicheng Kuo Anelia Angelova

Abstract
We present a simple approach that turns a ViT encoder into an efficient video model that works seamlessly with both image and video inputs. By sparsely sampling the inputs, the model can train and run inference on both input types. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results, and the code will be open-sourced.
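To make the idea of sparse input sampling concrete, here is a minimal NumPy sketch of cutting a few strided "tubes" out of a video and flattening each one into a token. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_sparse_tubes`, the 8x8x8 tube size, the (16, 32, 32) stride, and the choice to treat an image as a one-frame video padded by repeating that frame are all hypothetical choices made for this example.

```python
import numpy as np


def sample_sparse_tubes(video, tube_shape, stride):
    """Cut tubes from a (T, H, W, C) array at the given stride.

    Returns an array of shape (num_tubes, t * h * w * C), one flattened
    token per tube. All sizes are illustrative assumptions.
    """
    t, h, w = tube_shape
    st, sh, sw = stride

    # Assumption of this sketch: an input with fewer than t frames
    # (for example a single image) is padded by repeating its last frame,
    # so image and video tokens end up with the same dimensionality.
    if video.shape[0] < t:
        pad = t - video.shape[0]
        video = np.pad(video, ((0, pad), (0, 0), (0, 0), (0, 0)), mode="edge")

    T, H, W, _ = video.shape
    tokens = []
    # A stride larger than the tube size leaves gaps between tubes,
    # which is what makes the sampling sparse.
    for t0 in range(0, T - t + 1, st):
        for y0 in range(0, H - h + 1, sh):
            for x0 in range(0, W - w + 1, sw):
                tube = video[t0:t0 + t, y0:y0 + h, x0:x0 + w, :]
                tokens.append(tube.reshape(-1))
    return np.stack(tokens)


video = np.zeros((32, 224, 224, 3), dtype=np.float32)  # 32-frame clip
image = np.zeros((1, 224, 224, 3), dtype=np.float32)   # single image

video_tokens = sample_sparse_tubes(video, tube_shape=(8, 8, 8), stride=(16, 32, 32))
image_tokens = sample_sparse_tubes(image, tube_shape=(8, 8, 8), stride=(16, 32, 32))

print(video_tokens.shape)  # (98, 1536)
print(image_tokens.shape)  # (49, 1536)
```

With these illustrative settings the clip yields 98 tokens, whereas dense, non-overlapping 8x8x8 tubes over the same clip would yield 4 x 28 x 28 = 3136. Because image and video tokens share the same dimensionality here, a single projection layer and encoder could in principle consume either input.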
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| action-classification-on-charades | TubeViT-L | mAP: 66.2 |
| action-classification-on-kinetics-400 | TubeViT-L (ImageNet-1k) | Acc@1: 90.2; Acc@5: 98.6; FLOPs (G) x views: 95300x4x3; Parameters (M): 307 |
| action-classification-on-kinetics-400 | TubeViT-H (ImageNet-1k) | Acc@1: 90.9; Acc@5: 98.9; FLOPs (G) x views: 176400x4x3; Parameters (M): 632 |
| action-classification-on-kinetics-400 | TubeViT-B (ImageNet-1k) | Acc@1: 88.6; Acc@5: 97.6; FLOPs (G) x views: 8700x3x4; Parameters (M): 86 |
| action-classification-on-kinetics-600 | TubeViT-L | Top-1 Accuracy: 91.5; Top-5 Accuracy: 98.7 |
| action-classification-on-kinetics-600 | TubeViT-B | Top-1 Accuracy: 90.9; Top-5 Accuracy: 97.3 |
| action-classification-on-kinetics-600 | TubeViT-H | Top-1 Accuracy: 91.8; Top-5 Accuracy: 98.9 |
| action-classification-on-kinetics-700 | TubeViT-L | Top-1 Accuracy: 83.8; Top-5 Accuracy: 96.6 |
| action-recognition-in-videos-on-something | TubeViT-L | Top-1 Accuracy: 76.1; Top-5 Accuracy: 95.2 |