Date

3 months ago

Paper URL

License

Apache 2.0

Tags

Image Processing

Document Understanding

MDPBench is a benchmark dataset for parsing multilingual digital and photographed documents. It accompanies the paper "MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios" and aims to evaluate and improve models' ability to parse multilingual documents in complex real-world scenarios. The dataset contains 3,400 document images covering 17 languages, including Simplified Chinese, Traditional Chinese, English, Arabic, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Thai, and Vietnamese. To ensure high-quality annotations, the images went through a rigorous pipeline of expert-model annotation, manual correction, and manual verification.

Citation

@misc{li2026mdpbenchbenchmarkmultilingualdocument,
  title={MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios},
  author={Zhang Li and Zhibo Lin and Qiang Liu and Ziyang Zhang and Shuo Zhang and Zidun Guo and Jiajun Song and Jiarui Zhang and Xiang Bai and Yuliang Liu},
  year={2026},
  eprint={2603.28130},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.28130},
}
