MLDR Multilingual Document Retrieval Dataset
MLDR (Multilingual Long-Document Retrieval) is a multilingual long-document retrieval dataset built from Wikipedia, Wudao, and the multilingual mC4 corpus. It is intended to support research and development on cross-language long-text retrieval. It covers 13 typologically diverse languages: Arabic (ar), German (de), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Korean (ko), Portuguese (pt), Russian (ru), Thai (th), and Chinese (zh).
Features and advantages:
- Wide multilingual coverage: The 13 languages span several language families (such as Indo-European, Sino-Tibetan, and Afro-Asiatic).
- Long documents: The average document length is 4,737 words, which matches the long-text processing needs of real-world scenarios.
- Standardized construction: Queries are generated with GPT-3.5 so that each query is strongly relevant to the content of its document.
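
The language coverage and the average-length statistic described above can be made concrete with a small sketch. It is only an illustration, not part of the dataset tooling: the names `MLDR_LANGUAGES`, `group_by_family`, and `mean_word_count` are hypothetical, the family labels are standard linguistic classifications added here, and counting whitespace-separated tokens is an assumed definition of "word" that is only a rough proxy for Japanese, Thai, and Chinese.

```python
from collections import defaultdict

# Language code -> (language name, language family).
# Codes and names are taken from the dataset description; the family
# labels are standard classifications added for illustration.
MLDR_LANGUAGES = {
    "ar": ("Arabic", "Afro-Asiatic"),
    "de": ("German", "Indo-European"),
    "en": ("English", "Indo-European"),
    "es": ("Spanish", "Indo-European"),
    "fr": ("French", "Indo-European"),
    "hi": ("Hindi", "Indo-European"),
    "it": ("Italian", "Indo-European"),
    "ja": ("Japanese", "Japonic"),
    "ko": ("Korean", "Koreanic"),
    "pt": ("Portuguese", "Indo-European"),
    "ru": ("Russian", "Indo-European"),
    "th": ("Thai", "Kra-Dai"),
    "zh": ("Chinese", "Sino-Tibetan"),
}


def group_by_family(languages):
    """Return a dict mapping each language family to its language codes."""
    families = defaultdict(list)
    for code, (_name, family) in languages.items():
        families[family].append(code)
    return dict(families)


def mean_word_count(documents):
    """Average number of whitespace-separated tokens per document.

    Assumption: "word" means a whitespace-delimited token, which
    undercounts in languages written without spaces (ja, th, zh).
    """
    if not documents:
        raise ValueError("documents must not be empty")
    return sum(len(doc.split()) for doc in documents) / len(documents)


if __name__ == "__main__":
    for family, codes in sorted(group_by_family(MLDR_LANGUAGES).items()):
        print(f"{family}: {', '.join(codes)}")
```

Calling `mean_word_count` on a list of document strings from one language subset gives a quick check of how long the documents are under this tokenization; the result depends on the word definition used and need not reproduce the 4,737 figure.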