Noisy Medical Document Image Dataset
License: CC BY-SA 4.0
Noisy Medical Document is a dataset of noise-augmented medical document images for OCR and medical document understanding tasks. It simulates the noise that degrades scanned documents in real medical settings, with the goal of making OCR and document understanding models more robust and better able to generalize to real-world inputs. Intended research and engineering uses include optical character recognition (OCR), intelligent document analysis, medical information extraction, fine-tuning of document models such as LayoutLM, multimodal model evaluation, and medical natural language processing. The dataset contains 1,000 high-fidelity synthetic medical document images (500 hospital bills and 500 discharge summaries), along with complete structured JSON annotations. All images are synthetic and comply with HIPAA privacy and security standards.
Dataset composition
- Hospital bills: 500 images, including itemized charges, CPT codes, insurance adjustments, and financial summaries.
- Discharge summaries: 500 images, including history of present illness (HPI), hospital course, laboratory results, medication records, follow-up instructions, and an electronic physician signature.
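As a rough illustration of how the images and their JSON annotations might be consumed together, the sketch below pairs each image with an annotation file of the same name. The directory layout, the file naming (for example `bill_0001.png` next to `bill_0001.json`), and the helper name `pair_images_with_annotations` are assumptions made for this example and are not confirmed by the dataset description; check the downloaded archive and adjust accordingly.

```python
import json
from pathlib import Path

# Assumed layout (hypothetical): every image has a sibling JSON file with the
# same stem, e.g. bill_0001.png and bill_0001.json.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def pair_images_with_annotations(root):
    """Return a list of (image_path, annotation_dict) pairs found under root."""
    pairs = []
    for image_path in sorted(Path(root).rglob("*")):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # not an image file
        annotation_path = image_path.with_suffix(".json")
        if not annotation_path.is_file():
            continue  # skip images that have no matching annotation
        with annotation_path.open(encoding="utf-8") as handle:
            pairs.append((image_path, json.load(handle)))
    return pairs
```

Calling `pair_images_with_annotations("path/to/dataset")` would then yield pairs that can be fed to an OCR evaluation loop or a fine-tuning data loader. Images without a matching JSON file are skipped silently, which is a design choice for this sketch rather than a property of the dataset.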
Citation
https://doi.org/10.34740/kaggle/dsv/16402426
@dataset{noisy_medical_docs_2026,
title={Noisy Medical Document Images – Hospital Bills & Discharge Summaries},
author={Devkumar Patel},
year={2026},
publisher={Kaggle},
url={https://www.kaggle.com/datasets/devp1866/noisy-medical-document-images-ocr}
}