Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, Ross Girshick

Abstract
To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank---supportive information extracted over the entire span of a video---to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.
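The abstract describes relating short-clip features to supportive information precomputed over the whole video. The sketch below shows one plausible reading in PyTorch: an attention-style operator in which pooled clip features query a bank of whole-video features, and the attended context is fused back with the clip features. The class name, the linear projections, and the concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FeatureBankOperator(nn.Module):
    """Minimal sketch of an attention-style feature bank operator:
    short-term clip features attend over a long-term feature bank.
    Layer choices here are assumptions for illustration."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, short_term, bank):
        # short_term: (N, dim) pooled features from the current 2-5 s clip
        # bank:       (M, dim) features precomputed over the entire video
        q = self.query(short_term)                             # (N, dim)
        k = self.key(bank)                                     # (M, dim)
        v = self.value(bank)                                   # (M, dim)
        attn = torch.softmax(q @ k.t() * self.scale, dim=-1)   # (N, M)
        long_term = attn @ v                                   # (N, dim)
        # Fuse clip features with long-term context (assumed fusion)
        return torch.cat([short_term, long_term], dim=-1)      # (N, 2*dim)

# Usage sketch: 16 clip features querying a bank of 128 video-level features
fbo = FeatureBankOperator(dim=2048)
out = fbo(torch.randn(16, 2048), torch.randn(128, 2048))  # -> (16, 4096)
```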
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-charades | LFB | mAP: 42.5 |
| action-recognition-in-videos-on-ava-v21 | LFB (Kinetics-400 pretraining) | mAP (Val): 27.7 |
| egocentric-activity-recognition-on-epic-1 | LFB Max | Actions Top-1 (S1): 32.70; Actions Top-1 (S2): 21.2 |