A curated list of awesome healthcare datasets for machine learning, research, and exploration.
- Clinical Data
- Imaging Data
- Omics Data
- Biomedical Knowledge Graphs and Ontologies
- Public Health Data
- Biomedical Literature
- Wearable and Sensor Data
- Social Determinants of Health (SDOH)
- Synthetic Data
- Miscellaneous
- Contribute
- License
- MIMIC-III Clinical Database - Deidentified health data from ~40,000 critical care patients. Requires data use agreement and training.
- MIMIC-IV - Successor to MIMIC-III, covering 2008-2019. Requires data use agreement and training.
- eICU Collaborative Research Database - Multi-center ICU data from across the US. Requires data use agreement and training.
- AmsterdamUMCdb - Deidentified health data from Amsterdam University Medical Center.
- HiRID - High time-resolution ICU data from Bern University Hospital, Switzerland (Inselspital). Requires Credentialed Access.
- Medical Information Mart for Intensive Care (MIMIC) - IV Emergency Department (MIMIC-IV-ED)
- BH-CHPR, Beth Israel - Emergency department visit data with classification labels, focus on reducing ED utilization.
- MIMIC-IV-ED - Emergency department data from MIMIC-IV. Requires data use agreement and training.
- AMR-UTI - Antimicrobial Resistance in Urinary Tract Infections.
- Abdominal and Direct Fetal ECG Database - Multichannel fetal ECG recordings.
- The Pediatric Epilepsy Research Consortium (PERC) Data - Data from multicenter observational studies on children with epilepsy. (Contact PERC for access)
- Nationwide Emergency Department Sample (NEDS) - Large, publicly-available all-payer ED database (US). Available for purchase.
- National Emergency Department Samples - The Healthcare Cost and Utilization Project.
- MIMIC-IV-Note - Deidentified clinical notes from MIMIC-IV. Requires data use agreement and training.
- i2b2/n2c2 NLP Research Data Sets - Several datasets of deidentified clinical notes with annotations for various NLP tasks (e.g., de-identification, relation extraction). Requires data use agreement.
- mtsamples - A large collection of transcribed medical sample reports.
- THYME corpus - Clinical notes with temporal annotations. Includes colon cancer, brain tumor, and epilepsy corpora.
- MIMIC-III Waveform Database - Waveform data matched to MIMIC-III. Requires data use agreement and training.
- MIMIC-IV Waveform Database Matched Subset
- MIMIC-IV-ECG - Diagnostic ECG data from MIMIC-IV.
- PTB-XL: A large publicly available electrocardiography dataset
- PhysioNet - Contains numerous other waveform databases (ECG, EEG, etc.) beyond MIMIC.
- OpenPrescribing - Prescribing data from GPs in England.
- FDA Adverse Event Reporting System (FAERS) - Captures adverse drug reactions.
- TCIA (The Cancer Imaging Archive) - Excellent resource for cancer imaging.
- Chest X-Ray Dataset - Pneumonia detection.
- RSNA Intracranial Hemorrhage Detection - Head CT scans with hemorrhage labels.
- MICCAI 2015 Challenge on Multimodal Brain Tumor Segmentation - Brain tumor segmentation.
- Non-Small Cell Lung Cancer CT Scan Dataset
- PROSTATEx - Prostate MRI.
- MosMedData: Chest CT Scans with COVID-19 Related Findings
- LUng Nodule Analysis (LUNA16) - Lung nodules.
- NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories
- DeepLesion - CT images with lesions.
- Medical Segmentation Decathlon Datasets
- dHCP 2nd data release -- sourcedata - Developmental Human Connectome Project.
- dHCP 2nd data release -- fMRI pipeline
- PADCHEST_SJ - Chest X-rays (Spanish labels).
- MURA (musculoskeletal radiographs) - Stanford bone X-rays.
- National COVID-19 Chest Image Database (NCCID) - UK COVID-19 imaging.
- International Neuroimaging Data-Sharing Initiative (INDI)
- Open Access Series of Imaging Studies (OASIS) - Brain MRI.
- BossDB Open Neuroimagery Datasets
- NYU Langone & FAIR FastMRI Dataset - Knee MRIs.
- The Human Connectome Project
- RadGraph - Radiology report entities/relations.
- RadNLI - Radiology report inference.
- RadQA - Radiology report QA.
- UK Biobank Brain Imaging - Requires application and approval.
- ADNI (Alzheimer's Disease Neuroimaging Initiative) - Requires application and approval.
- The Cancer Genome Atlas (TCGA) Clinical Data Resource imaging data - TCIA hosts imaging data associated with TCGA subjects.
- Labeled Optical Coherence Tomography - Retinal OCT.
- e_ophtha - A database of retinal images used for the detection of diabetic retinopathy.
- DIARETDB0 and DIARETDB1 - Diabetic retinopathy databases and evaluation protocols.
- MESSIDOR and Base de Datos Oftalmologica de la Region de Murcia (BDOR-Murcia) - Datasets for computer-aided diagnosis of diabetic retinopathy.
- STARE (Structured Analysis of the Retina) - Retinal images with vessel segmentations and disease labels.
- DRIVE: Digital Retinal Images for Vessel Extraction - Retinal images for vessel segmentation.
- RFMiD - Retinal Fundus Multi-disease Image Dataset.
- A-Fundus - Pediatric fundus images.
- ISIC Archive (International Skin Imaging Collaboration) - A large collection of dermoscopic images of skin lesions.
- HAM10000 dataset - "Human Against Machine with 10000 training images" - dermoscopic images.
- DermNet NZ - Primarily an informational resource with a vast image library; images are intended for educational use and may carry copyright restrictions.
- PAD-UFES-20 - Skin lesion images with clinical data.
- CAMELYON17 breast cancer - Lymph node metastasis.
- PatchCamelyon (PCam) - A benchmark dataset for machine learning, derived from Camelyon16.
- Computational Precision Medicine - Gigapixel pathology images from the University of Pittsburgh.
- The Cancer Genome Atlas (TCGA) - Includes histopathology images alongside genomic data. Requires data use agreement and, for some data, IRB approval.
- PANDA challenge - Prostate cancer grade assessment.
- Cell Painting Gallery - Drug discovery.
- Allen Cell Imaging Collections - 3D cell imaging.
- BBBC (Broad Bioimage Benchmark Collection) - A collection of freely available, high-quality, biological image datasets.
- MIMIC-IV-ECHO - Echocardiogram data from MIMIC-IV.
- EchoNet-Dynamic - Echocardiogram videos with ejection fraction measurements.
- TCGA (The Cancer Genome Atlas) - Requires data use agreement.
- 1000 Genomes Project
- Genome Aggregation Database
- Genome in a Bottle on AWS - Reference genomes.
- GDC (Genomic Data Commons) - Requires data use agreement.
- UK Biobank - Requires an extensive application process.
- GTEx (Genotype-Tissue Expression)
- Gene Expression Omnibus (GEO)
- ArrayExpress
- Expression Atlas - A curated resource that provides information on gene expression across species and biological conditions.
- Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3)
- Protein Data Bank (PDB) - Protein structures.
- Human Protein Atlas
- UniProt - Protein sequences and annotations.
- MassIVE (https://massive.ucsd.edu/) and ProteomeXchange (http://www.proteomexchange.org/) - Repositories for mass spectrometry proteomics data.
- Human Metabolome Database
- Metabolomics Workbench - A public repository for metabolomics data and metadata.
- cBioPortal - Cancer genomics.
- LinkedOmics - A web portal that integrates TCGA data with CPTAC proteomic data.
- PRECOG - PREdiction of Clinical Outcomes from Genomic profiles
- Cancer Cell Line Encyclopedia (CCLE)
- CoMMpass from the Multiple Myeloma Research Foundation
- NIH NCBI Sequence Read Archive (SRA) on AWS
- Basic Local Alignment Search Tool (BLAST) Databases
- Encyclopedia of DNA Elements (ENCODE)
- Tox21
- CTRP (Cancer Therapeutics Response Portal)
- PharmGKB
- GDSC (Genomics of Drug Sensitivity in Cancer) - A large-scale database of drug sensitivity in cancer cell lines.
- UMLS (Unified Medical Language System)
- SNOMED CT - Requires a license, free in some countries.
- LOINC (Logical Observation Identifiers Names and Codes)
- MeSH (Medical Subject Headings)
- ICD-10 (International Classification of Diseases, 10th Revision)
- ICD-9 (International Classification of Diseases, 9th Revision)
- CPT (Current Procedural Terminology) - Requires a license.
- Medical Dictionary for Regulatory Activities Terminology (MedDRA) - Requires a license.
- International Classification of Diseases for Oncology
- RxNorm
- DrugBank
- RxMix
- RxTerms
- DailyMed
- PubChem
- SIDER - Drug side effects.
- STITCH - Chemical-protein interactions.
- ZINC - Compounds for virtual screening.
- Orphanet Rare Disease Ontology
- GWAS Catalog
- Gene Ontology
- Disease Ontology
- Genetic and Rare Diseases
- Online Mendelian Inheritance in Man
- DisGeNET
- ClinVar - A public archive of reports of the relationships among human variations and phenotypes.
- HGNC (HUGO Gene Nomenclature Committee) - Provides approved human gene nomenclature.
- Reactome
- UBERON - Cross-species anatomy ontology.
- Kyoto Encyclopedia of Genes and Genomes
- Open Targets
- STRING - A database of known and predicted protein-protein interactions.
- BioGRID - A database of protein and genetic interactions.
- IntAct - A molecular interaction database.
- Global Health Observatory (GHO)
- World Bank Health Data
- Global Burden of Disease (GBD)
- OECD Health Statistics
- Humanitarian Data Exchange
- Institute for Health Metrics and Evaluation (IHME) - Provides access to many additional datasets related to global health.
- Medicare.gov Data
- HealthData.gov
- Medicare Provider Utilization and Payment Data
- National Health and Nutrition Examination Survey (NHANES)
- SEER (Surveillance, Epidemiology, and End Results Program) - Cancer statistics. Requires data use agreement.
- Behavioral Risk Factor Surveillance System (BRFSS) - Health-related telephone surveys.
- Youth Risk Behavior Surveillance System (YRBSS) - Monitors health-risk behaviors among youth.
- All of Us Research Program
- Canadian Open Neuroscience Platform (CONP)
- Pharmaceuticals and Medical Devices Agency (PMDA), Japan
- European Medicines Agency (EMA)
- Kaiser Permanente Research Bank - Data from Kaiser Permanente, a large integrated healthcare delivery system (requires application and proposal).
- Truven Health MarketScan Databases - Commercial claims and EMR data. (Requires purchase.)
- Optum Clinformatics Data Mart - Commercial claims and EMR data. (Requires purchase; academic subscriptions available.)
- National Inpatient Sample (NIS) - Largest all-payer inpatient care database in the US. (Available for purchase.)
- National Ambulatory Medical Care Survey (NAMCS) and National Hospital Ambulatory Medical Care Survey (NHAMCS) - Provide data on ambulatory care visits.
- PubMed Central Open Access Subset
- CORD-19 - COVID-19 papers.
- LitCovid - COVID-19 literature.
- PubMed
- Europe PMC
- Microsoft Academic Graph
- Semantic Scholar Open Research Corpus
- BioASQ - Biomedical semantic indexing and question answering challenges.
- ChemProt - Chemical-protein interactions extracted from literature.
- DDI (Drug-Drug Interaction) Extraction - Drug-drug interaction extraction from biomedical texts.
- PhysioNet - (See also Waveform Data) Contains numerous datasets from wearable sensors.
- UCI Machine Learning Repository - Contains several smaller datasets related to wearable sensor data, activity recognition, etc.
- The BIDMC PPG and respiration dataset - Photoplethysmogram (PPG), respiration, and other physiological signals.
- VitalDB - Vital signs data collected during surgeries.
- mHealth Dataset - Body motion and vital signs data.
- Opportunity Dataset - Daily living activity recognition data.
- American Community Survey (ACS) - Provides detailed demographic, socioeconomic, and housing data at various geographic levels.
- Area Health Resources Files (AHRF) - County-level data on healthcare resources, demographics, and social determinants.
- County Health Rankings & Roadmaps - Provides rankings and data on various health factors and outcomes at the county level.
- USDA Food Environment Atlas - Data on food access, food prices, and local food systems.
- Robert Wood Johnson Foundation (RWJF) Data Hub - Curated datasets related to health equity and social determinants.
- SynPUF - Medicare synthetic data.
- Synthea - A synthetic patient generator that models the medical history of US patients.
- MedGAN - Generating Multi-label Discrete Patient Records using Generative Adversarial Networks.
- CorGAN - Correlation-Capturing Convolutional Generative Adversarial Networks for Generating Synthetic Healthcare Time-Series Data.
- Human Mortality Database
- OpenNeuro - Neuroimaging data.
- IBL Neuropixels Reproducible Ephys Data on AWS (https://registry.opendata.aws/ibl-reproducible-ephys/)
- Human Cell Atlas
- Refgenie reference genome assets.
- Open Bioinformatics Reference Data for Galaxy.
- OpenCell on AWS.
This list is released into the public domain. See the license file for more details.