Command Palette
Search for a command to run...
Haoqi Fan Bo Xiong Karttikeya Mangalam Yanghao Li Zhicheng Yan Jitendra Malik Christoph Feichtenhofer

摘要
我们提出了多尺度视觉Transformer(Multiscale Vision Transformers, MViT),用于视频与图像识别任务,其核心思想是将多尺度特征层次结构这一经典理念与Transformer模型相结合。MViT采用多个通道-分辨率尺度阶段,从输入分辨率和较小的通道维度出发,逐级提升通道容量的同时逐步降低空间分辨率,从而构建出一个多层次的特征金字塔。在浅层中,模型以高空间分辨率处理简单、低层次的视觉信息;而在深层,则以较低的空间分辨率处理更复杂、高维的特征表示。我们针对多种视频识别任务对这一基础架构先验进行了评估,结果表明,该模型在性能上超越了依赖大规模外部预训练的同期视觉Transformer方法,且在计算量和参数量方面仅为其1/5至1/10,效率显著更高。此外,我们进一步移除了时间维度,将该模型应用于图像分类任务,其表现亦优于此前的视觉Transformer方法。代码已开源,地址为:https://github.com/facebookresearch/SlowFast
代码仓库
基准测试
| 基准 | 方法 | 指标 |
|---|---|---|
| action-classification-on-charades | MViT-B, 32x3 (Kinetics-400 pretraining) | MAP: 44.3 |
| action-classification-on-charades | MViT-B-24, 32x3 (Kinetics-600 pretraining) | MAP: 47.7 |
| action-classification-on-charades | MViT-B, 32x3 (Kinetics-600 pretraining) | MAP: 47.1 |
| action-classification-on-charades | MViT-B, 16x4 (Kinetics-600 pretraining) | MAP: 43.9 |
| action-classification-on-charades | MViT-B-24, 32x3 (Kinetics-400 pretraining) | MAP: 46.3 |
| action-classification-on-charades | MViT-B, 16x4 (Kinetics-400 pretraining) | MAP: 40 |
| action-classification-on-kinetics-400 | MViT-B, 32x3 | Acc@1: 80.2 Acc@5: 94.4 |
| action-classification-on-kinetics-400 | MViT-B, 16x4 | Acc@1: 78.4 Acc@5: 93.5 |
| action-classification-on-kinetics-400 | MViT-B, 64x3 | Acc@1: 81.2 Acc@5: 95.1 |
| action-classification-on-kinetics-400 | MViT-S | Acc@1: 76 Acc@5: 92.1 |
| action-classification-on-kinetics-600 | MViT-B, 16x4 | Top-1 Accuracy: 82.1 Top-5 Accuracy: 95.7 |
| action-classification-on-kinetics-600 | MViT-B-24, 32x3 | Top-1 Accuracy: 83.8 Top-5 Accuracy: 96.3 |
| action-classification-on-kinetics-600 | MViT-B, 32x3 | Top-1 Accuracy: 83.4 Top-5 Accuracy: 96.3 |
| action-recognition-in-videos-on-something | MViT-B, 32x3(Kinetics600 pretrain) | GFLOPs: 170x3 Parameters: 36.6 Top-1 Accuracy: 67.8 Top-5 Accuracy: 91.3 |
| action-recognition-in-videos-on-something | MViT-B, 16x4 | Top-1 Accuracy: 66.2 Top-5 Accuracy: 90.2 |
| action-recognition-in-videos-on-something | MViT-B-24, 32x3 | GFLOPs: 236x3 Parameters: 53.2M Top-1 Accuracy: 68.7 Top-5 Accuracy: 91.5 |
| action-recognition-on-ava-v2-2 | MViT-B, 64x3 (Kinetics-400 pretraining) | mAP: 27.3 |
| action-recognition-on-ava-v2-2 | MViT-B, 16x4 (Kinetics-600 pretraining) | mAP: 26.1 |
| action-recognition-on-ava-v2-2 | MViT-B-24, 32x3 (Kinetics-600 pretraining) | mAP: 28.7 |
| action-recognition-on-ava-v2-2 | MViT-B, 32x3 (Kinetics-400 pretraining) | mAP: 26.8 |
| action-recognition-on-ava-v2-2 | MViT-B, 16x4 (Kinetics-400 pretraining) | mAP: 24.5 |
| action-recognition-on-ava-v2-2 | MViT-B, 32x3 (Kinetics-500 pretraining) | mAP: 27.5 |
| image-classification-on-imagenet | MViT-B-24 | GFLOPs: 32.7 Number of params: 72.9M Top 1 Accuracy: 84.8% |
| image-classification-on-imagenet | MViT-B-16 | GFLOPs: 7.8 Number of params: 37M Top 1 Accuracy: 83.0% |