Abstract
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability, since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative that leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art (SOTA) image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 existing computer vision datasets, spanning tasks such as optical character recognition (OCR), action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with fully supervised baselines without any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot, without using any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
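The pre-training task described in the abstract is contrastive: given a batch of N (image, text) pairs, the model must identify which of the N×N possible pairings actually occurred, with the true pairs on the diagonal of a similarity matrix. A minimal PyTorch sketch of this symmetric objective (the encoder outputs and the fixed temperature value are illustrative; the paper learns the temperature as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_features: torch.Tensor,
                    text_features: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N aligned (image, text) pairs.

    The N x N matrix of cosine similarities has the true pairings on its
    diagonal; each image must pick out its caption, and vice versa.
    """
    # L2-normalize so dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) scaled pairwise similarity matrix
    logits = image_features @ text_features.t() / temperature

    # Row i of the batch pairs image i with caption i
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions, averaged
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```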
Code Repositories
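The official implementation and pre-trained weights referenced in the abstract are at https://github.com/OpenAI/CLIP. A minimal caption-matching example with that package (the API follows the repository README; the image path and candidate captions are placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # downloads weights on first use

# Placeholder inputs: any image file and a few candidate captions
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    # logits_per_image: (1, 3) similarity of the image to each caption
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # highest probability on the best-matching caption
```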
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-recognition-on-rareact | CLIP | mWAP: 40.7 |
| few-shot-image-classification-on-imagenet-0 | CLIP (ViT-B/32) | Accuracy: 63.2% |
| few-shot-image-classification-on-imagenet-0 | CLIP (ResNet-50) | Accuracy: 59.6% |
| hateful-meme-classification-on-harm-p | CLIP | Accuracy: 80.6 F1: 80.3 |
| hateful-meme-classification-on-pridemm | CLIP (fine-tuned) | Accuracy: 72.4 F1: 72.3 |
| image-classification-on-objectnet | CLIP | Top-1 Accuracy: 72.3 |
| image-classification-on-omnibenchmark | CLIP-RN50 | Average Top-1 Accuracy: 42.1 |
| image-to-text-retrieval-on-coco | CLIP (zero-shot) | Recall@1: 58.4 Recall@5: 81.5 Recall@10: 88.1 |
| long-tail-learning-on-coco-mlt | CLIP (ViT-B/16) | Average mAP: 60.17 |
| long-tail-learning-on-coco-mlt | CLIP (ResNet-50) | Average mAP: 56.19 |
| long-tail-learning-on-voc-mlt | CLIP (ViT-B/16) | Average mAP: 85.77 |
| long-tail-learning-on-voc-mlt | CLIP (ResNet-50) | Average mAP: 84.30 |
| meme-classification-on-hateful-memes | CLIP (zero-shot) | ROC-AUC: 0.661 |
| meme-classification-on-multioff | CLIP | Accuracy: 62.4 F1: 48.1 |
| object-categorization-on-grit | CLIP | Categorization (ablation): 48.1 |
| object-recognition-on-shape-bias | CLIP (ViT-B) | shape bias: 79.9 |
| open-vocabulary-attribute-detection-on-ovad-1 | CLIP (ViT-B/16) | mean average precision: 16.6 |
| prompt-engineering-on-caltech-101 | CLIP | Harmonic mean: 95.40 |
| prompt-engineering-on-dtd | CLIP | Harmonic mean: 56.37 |
| prompt-engineering-on-eurosat | CLIP | Harmonic mean: 60.03 |
| prompt-engineering-on-fgvc-aircraft | CLIP | Harmonic mean: 31.09 |
| prompt-engineering-on-imagenet | CLIP | Harmonic mean: 70.22 |
| prompt-engineering-on-imagenet-a | CLIP | Top-1 accuracy %: 47.77 |
| prompt-engineering-on-imagenet-r | CLIP | Top-1 accuracy %: 73.96 |
| prompt-engineering-on-imagenet-s | CLIP | Top-1 accuracy %: 46.15 |
| prompt-engineering-on-imagenet-v2 | CLIP | Top-1 accuracy %: 60.83 |
| prompt-engineering-on-oxford-102-flower | CLIP | Harmonic mean: 74.83 |
| prompt-engineering-on-oxford-iiit-pet-dataset | CLIP | Harmonic mean: 94.12 |
| prompt-engineering-on-stanford-cars-1 | CLIP | Harmonic mean: 68.65 |
| prompt-engineering-on-sun397 | CLIP | Harmonic mean: 72.23 |
| prompt-engineering-on-ucf101 | CLIP | Harmonic mean: 73.85 |
| semi-supervised-image-classification-on-16 | CLIP (ResNet-50) | ImageNet Top-1 Accuracy: 40% |
| text-based-person-retrieval-with-noisy | CLIP-C | Rank-1: 66.41 Rank-5: 85.15 Rank-10: 90.89 mAP: 59.36 mINP: 43.02 |
| text-based-person-retrieval-with-noisy-1 | CLIP-C | Rank-1: 55.25 Rank-5: 74.76 Rank-10: 81.32 mAP: 31.09 mINP: 4.94 |
| text-based-person-retrieval-with-noisy-2 | CLIP-C | Rank-1: 54.45 Rank-5: 77.80 Rank-10: 86.70 mAP: 42.58 mINP: 21.38 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | CLIP | Image-to-text R@1: 58.4 Image-to-text R@5: 81.5 Image-to-text R@10: 88.1 Text-to-image R@1: 37.8 Text-to-image R@5: 62.4 Text-to-image R@10: 72.2 |
| zero-shot-cross-modal-retrieval-on-flickr30k | CLIP | Image-to-text R@1: 88.0 Image-to-text R@5: 98.7 Image-to-text R@10: 99.4 Text-to-image R@1: 68.7 Text-to-image R@5: 90.6 Text-to-image R@10: 95.2 |
| zero-shot-learning-on-coco-mlt | CLIP (ViT-B/16) | Average mAP: 60.17 |
| zero-shot-learning-on-coco-mlt | CLIP (ResNet-50) | Average mAP: 56.19 |
| zero-shot-learning-on-voc-mlt | CLIP (ViT-B/16) | Average mAP: 85.77 |
| zero-shot-learning-on-voc-mlt | CLIP (ResNet-50) | Average mAP: 84.30 |
| zero-shot-transfer-image-classification-on | CLIP | Accuracy: 98.4 |
| zero-shot-transfer-image-classification-on-1 | CLIP (ViT-L/14-336px) | Accuracy (Private): 76.2 |
| zero-shot-transfer-image-classification-on-1 | CLIP | Accuracy (Public): 31.3 |
| zero-shot-transfer-image-classification-on-1 | CLIP (ResNet-50) | Accuracy (Private): 59.6 |
| zero-shot-transfer-image-classification-on-2 | CLIP | Accuracy: 58.5 |
| zero-shot-transfer-image-classification-on-3 | CLIP | Accuracy (Private): 70.1 Accuracy (Public): - |
| zero-shot-transfer-image-classification-on-4 | CLIP | Accuracy: 88.9 |
| zero-shot-transfer-image-classification-on-5 | CLIP | Accuracy (Private): 77.2 Accuracy (Public): - |
| zero-shot-transfer-image-classification-on-6 | CLIP | Accuracy (Private): 72.3 Accuracy (Public): - |
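The zero-shot classification entries above are produced without any dataset-specific training: each class name is wrapped in a natural-language template such as "a photo of a {label}", encoded by the text encoder, and the image is assigned to the class whose prompt embedding is most similar. A sketch of this procedure with the same `clip` package as above (the label list, template, and image path are illustrative; real benchmarks use each dataset's own class names):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative label set; benchmark runs substitute the dataset's class names
labels = ["airplane", "dog", "flower", "car"]
prompts = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)

    # Cosine similarity between the image and every class prompt
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(labels[similarity.argmax().item()])  # predicted class
```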