Command Palette

Search for a command to run...

4 个月前

多尺度视觉Transformer

Haoqi Fan Bo Xiong Karttikeya Mangalam Yanghao Li Zhicheng Yan Jitendra Malik Christoph Feichtenhofer

多尺度视觉Transformer

摘要

我们提出了多尺度视觉Transformer(Multiscale Vision Transformers, MViT),用于视频与图像识别任务,其核心思想是将多尺度特征层次结构这一经典理念与Transformer模型相结合。MViT采用多个通道-分辨率尺度阶段,从输入分辨率和较小的通道维度出发,逐级提升通道容量的同时逐步降低空间分辨率,从而构建出一个多层次的特征金字塔。在浅层中,模型以高空间分辨率处理简单、低层次的视觉信息;而在深层,则以较低的空间分辨率处理更复杂、高维的特征表示。我们针对多种视频识别任务对这一基础架构先验进行了评估,结果表明,该模型在性能上超越了依赖大规模外部预训练的同期视觉Transformer方法,且在计算量和参数量方面仅为其1/5至1/10,效率显著更高。此外,我们进一步移除了时间维度,将该模型应用于图像分类任务,其表现亦优于此前的视觉Transformer方法。代码已开源,地址为:https://github.com/facebookresearch/SlowFast

代码仓库

junweiliang/multitrain
pytorch
GitHub 中提及
facebookresearch/SlowFast
官方
pytorch
GitHub 中提及
facebookresearch/pytorchvideo
pytorch
GitHub 中提及
wangjk666/stts
pytorch
GitHub 中提及
rohanshad/cmr_transformer
pytorch
GitHub 中提及
facebookresearch/hiera
pytorch
GitHub 中提及

基准测试

基准方法指标
action-classification-on-charadesMViT-B, 32x3 (Kinetics-400 pretraining)
MAP: 44.3
action-classification-on-charadesMViT-B-24, 32x3 (Kinetics-600 pretraining)
MAP: 47.7
action-classification-on-charadesMViT-B, 32x3 (Kinetics-600 pretraining)
MAP: 47.1
action-classification-on-charadesMViT-B, 16x4 (Kinetics-600 pretraining)
MAP: 43.9
action-classification-on-charadesMViT-B-24, 32x3 (Kinetics-400 pretraining)
MAP: 46.3
action-classification-on-charadesMViT-B, 16x4 (Kinetics-400 pretraining)
MAP: 40
action-classification-on-kinetics-400MViT-B, 32x3
Acc@1: 80.2
Acc@5: 94.4
action-classification-on-kinetics-400MViT-B, 16x4
Acc@1: 78.4
Acc@5: 93.5
action-classification-on-kinetics-400MViT-B, 64x3
Acc@1: 81.2
Acc@5: 95.1
action-classification-on-kinetics-400MViT-S
Acc@1: 76
Acc@5: 92.1
action-classification-on-kinetics-600MViT-B, 16x4
Top-1 Accuracy: 82.1
Top-5 Accuracy: 95.7
action-classification-on-kinetics-600MViT-B-24, 32x3
Top-1 Accuracy: 83.8
Top-5 Accuracy: 96.3
action-classification-on-kinetics-600MViT-B, 32x3
Top-1 Accuracy: 83.4
Top-5 Accuracy: 96.3
action-recognition-in-videos-on-somethingMViT-B, 32x3(Kinetics600 pretrain)
GFLOPs: 170x3
Parameters: 36.6
Top-1 Accuracy: 67.8
Top-5 Accuracy: 91.3
action-recognition-in-videos-on-somethingMViT-B, 16x4
Top-1 Accuracy: 66.2
Top-5 Accuracy: 90.2
action-recognition-in-videos-on-somethingMViT-B-24, 32x3
GFLOPs: 236x3
Parameters: 53.2M
Top-1 Accuracy: 68.7
Top-5 Accuracy: 91.5
action-recognition-on-ava-v2-2MViT-B, 64x3 (Kinetics-400 pretraining)
mAP: 27.3
action-recognition-on-ava-v2-2MViT-B, 16x4 (Kinetics-600 pretraining)
mAP: 26.1
action-recognition-on-ava-v2-2MViT-B-24, 32x3 (Kinetics-600 pretraining)
mAP: 28.7
action-recognition-on-ava-v2-2MViT-B, 32x3 (Kinetics-400 pretraining)
mAP: 26.8
action-recognition-on-ava-v2-2MViT-B, 16x4 (Kinetics-400 pretraining)
mAP: 24.5
action-recognition-on-ava-v2-2MViT-B, 32x3 (Kinetics-500 pretraining)
mAP: 27.5
image-classification-on-imagenetMViT-B-24
GFLOPs: 32.7
Number of params: 72.9M
Top 1 Accuracy: 84.8%
image-classification-on-imagenetMViT-B-16
GFLOPs: 7.8
Number of params: 37M
Top 1 Accuracy: 83.0%

用 AI 构建 AI

从想法到上线——通过免费 AI 协同编程、开箱即用的环境和市场最优价格的 GPU 加速您的 AI 开发

AI 协同编程
即用型 GPU
最优价格
立即开始

Hyper Newsletters

订阅我们的最新资讯
我们会在北京时间 每周一的上午九点 向您的邮箱投递本周内的最新更新
邮件发送服务由 MailChimp 提供
多尺度视觉Transformer | 论文 | HyperAI超神经