WIT image-text Dataset

WIT, short for Wikipedia-based Image Text, is a large multimodal, multilingual dataset. It is a curated collection of 37.6 million entity-rich image-text examples, covering 11.5 million unique images across 108 Wikipedia languages. Its scale makes it suitable as a pre-training dataset for multimodal machine learning models.
WIT has four unique advantages:
- At the time of its release, WIT was the largest multimodal dataset by number of image-text examples.
- Over 100 languages are covered (with at least 12,000 examples per language), and cross-lingual text is provided for many images.
- Relative to previous datasets, WIT represents a more diverse set of concepts and real-world entities.
- WIT provides a very challenging real-world test set.
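
To make the cross-lingual point concrete, here is a minimal sketch of grouping image-text examples by image and keeping the images that have text in more than one language. The records and the field names (`image_id`, `language`, `text`) are hypothetical and chosen for illustration; they are not the actual WIT schema.

```python
from collections import Counter, defaultdict

# Hypothetical records: field names and values are illustrative only,
# not the actual WIT schema.
examples = [
    {"image_id": "img_001", "language": "en", "text": "A lighthouse at dusk."},
    {"image_id": "img_001", "language": "fr", "text": "Un phare au crépuscule."},
    {"image_id": "img_002", "language": "en", "text": "A mountain lake."},
    {"image_id": "img_001", "language": "de", "text": "Ein Leuchtturm in der Abenddämmerung."},
]


def group_by_image(records):
    """Map each image id to a {language: text} dictionary."""
    grouped = defaultdict(dict)
    for record in records:
        grouped[record["image_id"]][record["language"]] = record["text"]
    return dict(grouped)


def cross_lingual_images(grouped):
    """Keep only images that have text in more than one language."""
    return {image_id: texts for image_id, texts in grouped.items() if len(texts) > 1}


grouped = group_by_image(examples)
multilingual = cross_lingual_images(grouped)

# Count examples per language, e.g. to check a per-language minimum.
examples_per_language = Counter(record["language"] for record in examples)
```

The same per-language count could be compared against a threshold, such as the 12,000 examples per language mentioned above, once real data is loaded in place of the toy records.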