WIT image-text Dataset

Date: 4 years ago
Size: 25.2 GB
Organization:
Publish URL: github.com
Paper URL: arxiv.org
License: Other
Tags: Multimodal, Natural Language Processing

WIT, short for Wikipedia-based Image Text, is a large multimodal and multilingual dataset. The dataset consists of a curated collection of 37.6 million entity-rich image-text examples, including 11.5 million unique images in 108 Wikipedia languages. The scale of the dataset allows it to be used as a pre-training dataset for multimodal machine learning models. WIT has four unique advantages:

- At the time of its publication, WIT was the largest multimodal dataset by number of image-text examples.
- It covers over 100 languages, each with at least 12,000 examples, and provides cross-lingual text for many images.
- Compared with previous datasets, WIT represents a more diverse set of concepts and real-world entities.
- WIT provides a very challenging real-world test set.
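The per-language and unique-image figures above can be checked on a local copy with a short script. The following is a minimal sketch, assuming the data is stored as tab-separated text with a header row; the column names `language` and `image_url` and the sample rows are assumptions for illustration and are not stated on this page.

```python
import csv
import io
from collections import Counter


def summarize(tsv_file, language_column="language", image_column="image_url"):
    """Count examples per language and unique images in one pass.

    `tsv_file` is any text file-like object with a tab-separated header row.
    The default column names are assumptions, not confirmed by this page.
    """
    per_language = Counter()
    images = set()
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        per_language[row[language_column]] += 1
        images.add(row[image_column])
    return per_language, len(images)


def languages_below(per_language, minimum):
    """Return the languages that have fewer than `minimum` examples."""
    return sorted(lang for lang, n in per_language.items() if n < minimum)


# Tiny in-memory stand-in for a data shard (made-up rows, not real WIT data).
sample = io.StringIO(
    "language\timage_url\tcaption\n"
    "en\thttp://example.org/a.jpg\tA bridge\n"
    "en\thttp://example.org/b.jpg\tA tower\n"
    "de\thttp://example.org/a.jpg\tEine Bruecke\n"
)

per_language, unique_images = summarize(sample)
print(dict(per_language), unique_images)
print(languages_below(per_language, minimum=2))
```

On real data, the same call would take an opened file instead of the in-memory sample, and `minimum=12000` would test the per-language floor quoted above. The same image URL appearing under two languages, as in the sample, is what cross-lingual text for one image looks like in this layout.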

WIT.torrent

Seeding: 1 | Downloading: 0 | Completed: 611 | Total Downloads: 809

WIT/
- README.md
  1.41 KB
- README.txt
  2.82 KB

This dataset is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.

Related Datasets

DRACO Cross-Disciplinary Deep Research Benchmark Dataset

13 days ago

Student Mental Health and Burnout Dataset

23 days ago

Groundsource Global Flood Events Dataset

2 months ago

TransPhy3D Transparent Reflection Synthesis Video Dataset

3 months ago

GroundingME Complex Scene Understanding Evaluation Dataset

4 months ago

MCIF Multimodal Cross-Language Instruction Following Dataset

4 months ago

MCD-rPPG Multi-Camera Remote Photoplethysmography Dataset

4 months ago
