HyperAI


Dataset Compilation | From Medical Imaging/Clinical Data to Cell Atlases/Medical Q&A, 10 Major Datasets Covering Multiple Disease Scenarios


As AI rapidly penetrates the medical field, high-quality datasets are becoming the core foundation for improving model performance and bringing applications into practice. From medical image recognition to clinical decision support to the analysis of biological mechanisms, the type, scale, and annotation accuracy of the data directly determine the upper limit of a model's capabilities and the boundaries of its application.

From an overall development perspective, medical datasets are evolving in parallel toward multimodality and refinement. On the one hand, medical imaging data such as X-rays, CT scans, and MRI scans remain the mainstream; these data have standardized structures and clear annotations, making them well suited for training and evaluating computer vision models. On the other hand, more complex data types, including clinical indicators, disease risk prediction, drug response, and even single-cell sequencing, are growing rapidly, pushing AI from image recognition toward deeper assisted diagnosis and life science research.

The 10 medical datasets selected in this article represent one facet of this trend. They span different disease scenarios and research directions, including both imaging and clinical data for specific diseases as well as more cutting-edge bioinformatics and drug-related research.

A systematic review of these datasets reveals that standardized and structured data remain fundamental to model training and evaluation, while the ability to fuse cross-modal and multi-source data is becoming a key factor influencing model performance and generalization ability. In-depth analysis of these data resources also helps to further understand the current development priorities and evolutionary direction of medical AI.

HyperAI has long gathered and organized datasets from multiple fields. It not only provides high-quality open-source datasets covering medical imaging, clinical data, and bioinformatics, but also offers researchers and developers worldwide a unified portal for discovering and using data across tasks and domains such as embodied intelligence, autonomous driving, OCR, multimodal understanding, and intelligent question answering.

More high-quality datasets:

https://hyper.ai/datasets

Historical Pandemic & Epidemic global historical epidemic dataset

* Use online:

https://go.hyper.ai/WW6gh

The Historical Pandemic and Epidemic Dataset covers major pandemic events in global history and is designed to provide an analysis-ready resource. It contains 50 major pandemic events, from the Antonine Plague of 165 AD to COVID-19 and monkeypox in 2023, covering all eras, regions, and pathogen types.

Lung Cancer Clinical lung cancer clinical dataset

* Use online:

https://go.hyper.ai/0YW09

Lung Cancer Clinical is a clinical dataset containing 1,500 patient records spanning from 2015 to 2025, covering 60 countries across all six regions of the World Health Organization (WHO).

This dataset provides detailed clinical, demographic, lifestyle, genetic, and diagnostic information on lung cancer. The data is sourced from the WHO Fact Sheet and GLOBOCAN 2020 global cancer statistics, and is suitable for exploratory data analysis (EDA), machine learning classification, survival analysis, geographic trend analysis, and public health research.

Adverse Drug Reaction Simulated adverse drug reaction dataset

* Use online:

https://go.hyper.ai/hJg6S

This dataset mimics pharmacovigilance reports of adverse drug reactions (ADRs) and is designed to support research, machine learning experiments, and algorithm development in drug safety monitoring. Its Individual Case Safety Reports (ICSRs) are artificially generated, inspired by real-world pharmacovigilance systems such as FDA FAERS and EMA EudraVigilance.

This dataset particularly highlights the rarity and imbalance of severe ADRs: most reports are mild reactions, while severe and fatal outcomes are relatively rare (severe/fatal cases together account for approximately 4–5% of the total), reflecting the underreporting and severity distribution bias common in post-market surveillance.
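This skew matters in practice: a classifier trained naively on such data will mostly predict "mild." A minimal sketch of one common remedy, inverse-frequency class weights, is shown below; the label names and counts are illustrative assumptions, not the dataset's actual values.

```python
from collections import Counter

# Hypothetical severity labels mimicking the reported skew: the vast
# majority mild, severe/fatal only a few percent of the total.
labels = ["mild"] * 930 + ["moderate"] * 25 + ["severe"] * 40 + ["fatal"] * 5

counts = Counter(labels)
n, k = len(labels), len(counts)

# Inverse-frequency weights n / (k * count): rare classes (severe, fatal)
# get proportionally larger weight during training.
weights = {cls: n / (k * c) for cls, c in counts.items()}

print(counts)
print({cls: round(w, 2) for cls, w in weights.items()})
```

Such weights can be passed to most classifiers (e.g. as `class_weight` in scikit-learn estimators) so that rare severe cases contribute more to the loss.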

Pan-Cancer scRNA-Seq cancer single-cell transcriptome atlas dataset

* Use online:

https://go.hyper.ai/X0FCx

This dataset contains transcriptome expression data from 7,930 single cells, covering three different biological states: healthy immune baseline, liquid tumor (myeloid leukemia), and solid tumor microenvironment (melanoma). It aims to build a cross-cohort integrated single-cell analysis benchmark for algorithm performance evaluation and methodological comparison, multi-cohort batch effect correction, immune exhaustion state analysis, and cross-tumor-type biomarker mining.

THINGS-fMRI functional magnetic resonance imaging dataset

* Use online:

https://go.hyper.ai/KYaOn

THINGS-fMRI is a high-density functional magnetic resonance imaging (fMRI) dataset for object cognition research, released by the National Institute of Mental Health at the National Institutes of Health (NIH), the Max Planck Institute for Human Cognitive and Brain Sciences in Germany, and the University of Giessen Medical School, among other institutions. It aims to systematically characterize the human brain's visual and semantic representation of real-world objects.

This dataset belongs to THINGS-data and contains 1,854 object concepts and 26,107 manually selected and labeled images of objects in natural scenes. In the fMRI experiment, subjects viewed object images from the THINGS image database during the scan, while whole-brain BOLD signals were recorded to analyze the spatial representation distribution of objects in the brain. 

Three participants completed 12 scanning sessions, viewing a total of 8,740 unique images covering 720 object categories. The images were presented rapidly and sequentially, with participants maintaining central fixation. An anomaly detection task ensured attentional engagement, and some images were presented repeatedly in different sessions to support representation stability and reproducibility analysis. 

In addition to task-oriented functional data, the dataset also provides rich structural and auxiliary scanning information, including high-resolution T1/T2 structural images, vascular imaging (TOF, T2*), field maps, functional localizer experiments, retinotopic mapping data, and resting-state functional connectivity data, supporting multi-level brain function modeling.

THINGS-MEG magnetoencephalography (MEG) dataset

* Use online:

https://go.hyper.ai/VdJ6F

THINGS-MEG is a magnetoencephalography (MEG) dataset for object cognition research, released by the National Institute of Mental Health at the National Institutes of Health (NIH), the Max Planck Institute for Human Cognitive and Brain Sciences in Germany, and the University of Giessen Medical School, among other institutions. It records millisecond-level electromagnetic brain activity while subjects view images of objects, and is used to analyze the temporal dynamics of object processing.

This dataset belongs to THINGS-data. In the MEG experiment, participants viewed a representative subset of THINGS images. The experiment consisted of 12 independent sessions (N=4 participants), containing 22,448 unique images covering all 1,854 object categories. The images were presented rapidly and sequentially (with an average interval of approximately 1.5 ± 0.2 seconds), requiring participants to maintain central fixation throughout.

THINGS-EEG EEG dataset

* Use online:

https://go.hyper.ai/IVwu6

THINGS-EEG is an electroencephalogram (EEG) dataset for object cognition research, released by the National Institute of Mental Health at the National Institutes of Health (NIH), the Max Planck Institute for Human Cognitive and Brain Sciences in Germany, and the University of Giessen Medical School, among other institutions. It records the EEG activity of 50 subjects while they view images of objects, and is used to analyze the temporal dynamics and cognitive representations of object processing.

This dataset belongs to THINGS-data. In the experiment, participants viewed a representative subset of stimuli from the THINGS image database, containing 22,248 images covering 1,854 object concepts. The images were presented in a rapid serial visual presentation (RSVP) paradigm, with participants maintaining central fixation. Some images were presented repeatedly to analyze the stability of neural representations.

Health & Lifestyle healthy lifestyle dataset

* Use online:

https://go.hyper.ai/PyiDm

Health & Lifestyle is a healthy lifestyle dataset released in 2025. It aims to explore the relationship between lifestyle factors and individual health status, and to provide an experimental basis for health prediction modeling, cluster analysis, and data mining.

This dataset contains 100,000 individual records provided in CSV format, covering a wide range of information from demographics to health status and lifestyle habits. The data contains no real personal information; all values are artificially synthesized while maintaining statistical consistency with real-world distributions.
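For a CSV dataset like this, the typical first steps are loading it with pandas and computing summary statistics and lifestyle-factor breakdowns. The sketch below uses an in-memory table with assumed column names ("age", "smoker", "bmi"); the actual file's schema may differ, so treat it as an illustration of the workflow rather than the dataset's real layout.

```python
import pandas as pd

# Stand-in for pd.read_csv("health_lifestyle.csv") — file name and
# columns are assumptions for illustration.
df = pd.DataFrame({
    "age":    [34, 51, 42, 67, 29, 58],
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "bmi":    [27.1, 22.4, 30.8, 25.0, 21.7, 29.3],
})

# Summary statistics for numeric columns.
print(df.describe())

# Mean BMI broken down by a lifestyle factor.
print(df.groupby("smoker")["bmi"].mean())
```

The same `groupby` pattern extends to any categorical lifestyle column, which is the basis of the exploratory analysis and clustering uses the dataset is aimed at.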

MedQA medical text question answering dataset

* Use online:

https://go.hyper.ai/CyIG3

MedQA is an open-source dataset for the medical field, developed by a research team from MIT and Huazhong University of Science and Technology, that simulates the style of the United States Medical Licensing Examination (USMLE).

This dataset, collected from professional medical examinations, covers English, Simplified Chinese, and Traditional Chinese, containing 12,723, 34,251, and 14,123 questions respectively. It aims to evaluate a model's ability to understand and apply medical knowledge. In addition to the question data, the team collected and released a large-scale corpus of medical textbooks, from which reading comprehension models can obtain the knowledge needed to answer the questions. The dataset is divided into training, development, and test sets for model training, validation, and testing, respectively.
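Benchmarks in this style are usually scored by exact match on the chosen option letter. A minimal scoring sketch is below; the field names ("question", "options", "answer_idx") are assumptions modeled on common QA-dataset layouts, not necessarily MedQA's exact schema.

```python
# Toy USMLE-style items; real data would be loaded from the dataset files.
examples = [
    {"question": "q1", "options": {"A": "o1", "B": "o2"}, "answer_idx": "A"},
    {"question": "q2", "options": {"A": "o3", "B": "o4"}, "answer_idx": "B"},
]

def accuracy(predictions, examples):
    """Fraction of items where the predicted option letter matches the key."""
    correct = sum(p == ex["answer_idx"] for p, ex in zip(predictions, examples))
    return correct / len(examples)

print(accuracy(["A", "A"], examples))  # one of two correct -> 0.5
```

Splitting evaluation this way over the dataset's dev and test sets is the usual protocol: tune on dev, report once on test.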

JMED Chinese real-world medical dataset

* Use online:

https://hyper.ai/datasets/20490

The JMED dataset is a new dataset built by the Citrus team in 2025, based on real-world medical data distributions.

This dataset originates from anonymized doctor-patient dialogues at JD Health Internet Hospital, filtered to retain consultations that follow standardized diagnostic workflows. The initial version contains 1,000 high-quality clinical records, covering all age groups (0–90 years) and multiple specialties. Each question includes 21 answer options, one of which is "None of the above." This design significantly increases the complexity and difficulty of distinguishing the correct answer, thus providing a more rigorous evaluation framework.
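The effect of the 21-option design on difficulty can be made concrete with the random-guess baseline, sketched below under the assumption that each option is equally likely a priori.

```python
# Chance accuracy for one JMED question with 21 options
# (including "None of the above").
n_options = 21
chance = 1 / n_options
print(f"{chance:.2%}")  # ≈ 4.76%

# Compare with a typical 4-option medical QA item.
chance_4 = 1 / 4
print(f"{chance_4:.2%}")  # 25.00%
```

Dropping the guessing floor from 25% to under 5% means a model's score reflects genuine discrimination among distractors rather than lucky guesses.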

Compared with existing medical QA datasets, JMED has three main advantages. First, it more accurately reflects the ambiguity of patient symptom descriptions and the dynamic nature of clinical diagnosis in real scenarios. Second, the expanded answer options require enhanced reasoning capabilities to identify the correct answer among numerous distractors. Finally, by leveraging the large volume of consultation data from JD Health, data matching the distribution characteristics of real patients can be continuously generated.