Noisy Medical Document Image Dataset

Date

2 hours ago

Publish URL

www.kaggle.com

License

CC BY-SA 4.0

Noisy Medical Document is a dataset of noise-augmented medical document images for OCR and medical document understanding tasks. It simulates the complex noise that arises when documents are scanned in real medical settings, with the aim of improving the robustness and generalization of OCR and document understanding models in real-world environments. It is suited to research and engineering tasks such as optical character recognition (OCR), intelligent document analysis, medical information extraction, fine-tuning of document models such as LayoutLM, multimodal model evaluation, and medical natural language processing.

The dataset contains 1,000 high-fidelity synthetic medical document images: 500 hospital bills and 500 discharge summaries, along with complete structured JSON annotations. All images are synthetic and comply with HIPAA privacy and security standards.

Dataset composition

  • Hospital bills: 500 bills, including itemized charges, CPT codes, insurance adjustments, and financial summaries.
  • Discharge summaries: 500 pages, including history of present illness (HPI), hospital course, laboratory results, medication records, follow-up instructions, and an electronic physician signature.
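
Since each image comes with a JSON annotation, a typical first step is to pair the two before feeding them to an OCR or LayoutLM-style pipeline. The sketch below is only an illustration: this page does not describe the archive layout or the annotation schema, so it assumes (hypothetically) that every image has a JSON file with the same base name somewhere under the same root folder. The function name `pair_images_with_annotations` and the accepted image suffixes are likewise assumptions, not part of the dataset.

```python
import json
from pathlib import Path

# Assumed image formats; the actual dataset may use a different one.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def pair_images_with_annotations(root):
    """Return (image_path, annotation_dict) pairs matched by file stem.

    Assumes each image has a JSON annotation sharing its base name
    (a hypothetical layout, not confirmed by the dataset page).
    """
    root = Path(root)
    # Index every JSON file under root by its stem (name without suffix).
    annotations = {p.stem: p for p in root.rglob("*.json")}
    pairs = []
    for image_path in sorted(root.rglob("*")):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        annotation_path = annotations.get(image_path.stem)
        if annotation_path is None:
            continue  # skip images that have no matching annotation
        with annotation_path.open(encoding="utf-8") as fh:
            pairs.append((image_path, json.load(fh)))
    return pairs
```

After downloading and extracting the archive, calling `pair_images_with_annotations("path/to/extracted/dataset")` would return the matched pairs. If the real layout differs (for example, a single annotation file covering all images), the matching step would need to be adapted.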

Citation

https://doi.org/10.34740/kaggle/dsv/16402426

@dataset{noisy_medical_docs_2026,
title={Noisy Medical Document Images – Hospital Bills & Discharge Summaries},
author={Devkumar Patel},
year={2026},
publisher={Kaggle},
url={https://www.kaggle.com/datasets/devp1866/noisy-medical-document-images-ocr}
}
