João Carreira†, Andrew Zisserman†,*

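Abstract
The paucity of videos in current action classification datasets (such as UCF-101 and HMDB-51) has made it difficult to identify good video architectures, because most methods obtain similar performance on the existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data than existing datasets, with 400 human action classes and over 400 clips per class, collected from realistic, challenging YouTube videos. We analyze how current architectures fare on action classification on this dataset, and how much their performance on the smaller benchmark datasets improves after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: the filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, which makes it possible to learn seamless spatio-temporal feature extractors from video while reusing successful ImageNet architecture designs and their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve on the state of the art in action classification, reaching 80.9% on HMDB-51 and 98.0% on UCF-101.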
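The inflation idea described in the abstract can be illustrated with a small sketch. This is a minimal pure-Python illustration, not the authors' implementation. It assumes one common way of reusing 2D parameters: repeat each 2D kernel along the time axis and divide by the temporal size, so that a video made of one repeated frame gives the same filter response as the original 2D filter on that frame. The function names (`inflate_kernel`, `conv2d_response`, `conv3d_response`) and the numeric values are hypothetical.

```python
def inflate_kernel(kernel_2d, time_size):
    """Inflate a k_h x k_w kernel into a time_size x k_h x k_w kernel.

    Every temporal slice is the 2D kernel divided by time_size, so the
    slices sum back to the original 2D kernel.
    """
    if time_size < 1:
        raise ValueError("time_size must be >= 1")
    return [
        [[w / time_size for w in row] for row in kernel_2d]
        for _ in range(time_size)
    ]


def conv2d_response(kernel_2d, patch):
    """Response of a 2D kernel on a single image patch of the same shape."""
    return sum(
        w * x
        for k_row, p_row in zip(kernel_2d, patch)
        for w, x in zip(k_row, p_row)
    )


def conv3d_response(kernel_3d, clip):
    """Response of a 3D kernel on a video patch given as a list of frames."""
    return sum(conv2d_response(k, frame) for k, frame in zip(kernel_3d, clip))


# Hypothetical 3x3 kernel and image patch, for illustration only.
kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 0.0], [4.0, 1.0, 2.0], [5.0, 9.0, 2.0]]

# A static video: the same patch repeated in every frame.
time_size = 4
static_clip = [patch] * time_size

inflated = inflate_kernel(kernel, time_size)
response_2d = conv2d_response(kernel, patch)
response_3d = conv3d_response(inflated, static_clip)

# On a static clip the inflated filter matches the 2D filter.
print(response_2d, response_3d)
```

Dividing by the temporal size is what keeps the response scale unchanged, so under this assumption the layers that follow an inflated filter see activations in the range they saw during 2D pre-training.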
Code repositories
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-classification-on-charades | I3D | MAP: 32.9 |
| action-classification-on-kinetics-400 | I3D | Acc@1: 71.1 Acc@5: 89.3 |
| action-classification-on-moments-in-time | I3D | Top 1 Accuracy: 29.51% Top 5 Accuracy: 56.06% |
| action-classification-on-toyota-smarthome | I3D | CS: 53.4 CV1: 34.9 CV2: 45.1 |
| action-recognition-in-videos-on-hmdb-51 | Flow-I3D (Kinetics pre-training) | Average accuracy of 3 splits: 77.3 |
| action-recognition-in-videos-on-hmdb-51 | Two-Stream I3D | Average accuracy of 3 splits: 80.9 |
| action-recognition-in-videos-on-hmdb-51 | Two-Stream I3D (ImageNet+Kinetics pre-training) | Average accuracy of 3 splits: 80.7 |
| action-recognition-in-videos-on-hmdb-51 | RGB-I3D (Kinetics pre-training) | Average accuracy of 3 splits: 74.3 |
| action-recognition-in-videos-on-hmdb-51 | Flow-I3D (ImageNet+Kinetics pre-training) | Average accuracy of 3 splits: 77.1 |
| action-recognition-in-videos-on-hmdb-51 | RGB-I3D (ImageNet+Kinetics pre-training) | Average accuracy of 3 splits: 74.8 |
| action-recognition-in-videos-on-ucf101 | Two-Stream I3D (Kinetics pre-training) | 3-fold Accuracy: 97.8 |
| action-recognition-in-videos-on-ucf101 | Flow-I3D (ImageNet+Kinetics pre-training) | 3-fold Accuracy: 96.7 |
| action-recognition-in-videos-on-ucf101 | RGB-I3D (Kinetics pre-training) | 3-fold Accuracy: 95.1 |
| action-recognition-in-videos-on-ucf101 | Two-Stream I3D | 3-fold Accuracy: 93.4 |
| action-recognition-in-videos-on-ucf101 | Two-Stream I3D (ImageNet+Kinetics pre-training) | 3-fold Accuracy: 98.0 |
| action-recognition-in-videos-on-ucf101 | RGB-I3D (ImageNet+Kinetics pre-training) | 3-fold Accuracy: 95.6 |
| action-recognition-in-videos-on-ucf101 | Flow-I3D (Kinetics pre-training) | 3-fold Accuracy: 96.5 |
| hand-gesture-recognition-on-egogesture-1 | I3D | Accuracy: 92.78 |
| hand-gesture-recognition-on-viva-hand-1 | I3D | Accuracy: 83.1 |
| skeleton-based-action-recognition-on-j-hmdb | I3D | Accuracy (RGB+pose): 84.1 |
| video-object-tracking-on-cater | I3D-50 + LSTM | L1: 1.2 Top 1 Accuracy: 60.2 Top 5 Accuracy: 81.8 |