Training data-efficient image transformers & distillation through attention
Hugo Touvron Matthieu Cord Matthijs Douze Francisco Massa Alexandre Sablayrolles Hervé Jégou

Abstract
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption.

In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data.

More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
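The abstract's key mechanism is a learned distillation token that is appended to the patch sequence alongside the class token, so the student interacts with the teacher's signal through self-attention. The sketch below illustrates that idea in PyTorch; the class and function names are illustrative, not the paper's official implementation, and the hard-label distillation loss shown is one of the variants described in the paper (equal weighting of the true-label and teacher-label objectives is an assumption for simplicity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillableTokens(nn.Module):
    """Minimal sketch (hypothetical names): prepend a class token and a
    distillation token to the patch embeddings, as in DeiT."""
    def __init__(self, embed_dim=192, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)       # classifier on class token
        self.head_dist = nn.Linear(embed_dim, num_classes)  # classifier on distillation token

    def add_tokens(self, patches):
        # patches: (B, N, D) patch embeddings; output: (B, N+2, D)
        b = patches.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.dist_token.expand(b, -1, -1)
        return torch.cat([cls, dist, patches], dim=1)

def hard_distillation_loss(logits_cls, logits_dist, labels, teacher_logits):
    """Hard-label distillation: the distillation head is trained to match
    the teacher's argmax prediction; the class head uses the true labels."""
    teacher_labels = teacher_logits.argmax(dim=-1)
    return 0.5 * F.cross_entropy(logits_cls, labels) \
         + 0.5 * F.cross_entropy(logits_dist, teacher_labels)
```

After the transformer blocks, the class-token and distillation-token outputs feed the two heads; at inference the paper fuses both predictions.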
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| document-image-classification-on-rvl-cdip | DeiT-B | Accuracy: 90.32% Parameters: 87M |
| document-layout-analysis-on-publaynet-val | DeiT-B | Figure: 0.957 List: 0.921 Overall: 0.932 Table: 0.972 Text: 0.934 Title: 0.874 |
| efficient-vits-on-imagenet-1k-with-deit-s | Base (DeiT-S) | GFLOPs: 4.6 Top 1 Accuracy: 79.8 |
| efficient-vits-on-imagenet-1k-with-deit-t | Base (DeiT-T) | GFLOPs: 1.2 Top 1 Accuracy: 72.2 |
| fine-grained-image-classification-on-oxford | DeiT-B | Accuracy: 98.8% PARAMS: 86M |
| fine-grained-image-classification-on-stanford | DeiT-B | Accuracy: 93.3% PARAMS: 86M |
| image-classification-on-cifar-10 | DeiT-B | Percentage correct: 99.1 |
| image-classification-on-cifar-100 | DeiT-B | PARAMS: 86M Percentage correct: 90.8 |
| image-classification-on-flowers-102 | DeiT-B | Accuracy: 98.8% PARAMS: 86M |
| image-classification-on-imagenet | DeiT-B | Number of params: 86M Top 1 Accuracy: 84.2% |
| image-classification-on-imagenet | DeiT-B 384 | Number of params: 87M Top 1 Accuracy: 85.2% |
| image-classification-on-imagenet | DeiT-Ti | Number of params: 5M Top 1 Accuracy: 76.6% |
| image-classification-on-imagenet | DeiT-S | Number of params: 22M Top 1 Accuracy: 82.6% |
| image-classification-on-imagenet-real | DeiT-Ti | Accuracy: 82.1% Params: 5M |
| image-classification-on-imagenet-real | DeiT-B | Accuracy: 88.7% Params: 86M |
| image-classification-on-imagenet-real | DeiT-S | Accuracy: 86.8% Params: 22M |
| image-classification-on-imagenet-real | DeiT-B-384 | Accuracy: 89.3% Params: 86M |
| image-classification-on-inaturalist-2018 | DeiT-B | Top-1 Accuracy: 79.5% |