geniusrise/awesome-healthcare-datasets

Awesome Healthcare Datasets

A curated list of awesome healthcare datasets for machine learning, research, and exploration.

Contents

Clinical Data

General EHR/ICU Data

  1. MIMIC-III Clinical Database - Deidentified health data from ~40,000 critical care patients. Requires data use agreement and training.
  2. MIMIC-IV - Successor to MIMIC-III, covering 2008-2019. Requires data use agreement and training.
  3. eICU Collaborative Research Database - Multi-center ICU data from across the US. Requires data use agreement and training.
  4. AmsterdamUMCdb - Deidentified health data from Amsterdam University Medical Center.
  5. HiRID - High time-resolution ICU data from Bern University Hospital, Switzerland (Inselspital). Requires Credentialed Access.
  6. Medical Information Mart for Intensive Care IV, Emergency Department (MIMIC-IV-ED)
  7. BH-CHPR, Beth Israel - Emergency department visit data with classification labels, focus on reducing ED utilization.

Specific Conditions/Cohorts

  1. MIMIC-IV-ED - Emergency department data from MIMIC-IV. Requires data use agreement and training.
  2. AMR-UTI - Antimicrobial Resistance in Urinary Tract Infections.
  3. Abdominal and Direct Fetal ECG Database - Multichannel fetal ECG recordings.
  4. The Pediatric Epilepsy Research Consortium (PERC) Data - Data from multicenter observational studies on children with epilepsy. (Contact PERC for access)
  5. Nationwide Emergency Department Sample (NEDS) - Large, publicly-available all-payer ED database (US). Available for purchase.
  6. National Emergency Department Samples - From the Healthcare Cost and Utilization Project (HCUP).

Clinical Notes/Text

  1. MIMIC-IV-Note - Deidentified clinical notes from MIMIC-IV. Requires data use agreement and training.
  2. i2b2/n2c2 NLP Research Data Sets - Several datasets of deidentified clinical notes with annotations for various NLP tasks (e.g., de-identification, relation extraction). Requires data use agreement.
  3. mtsamples - A large collection of transcribed medical sample reports.
  4. THYME corpus - Clinical notes with temporal annotations. Includes colon cancer, brain tumor, and epilepsy corpora.

Waveform Data

  1. MIMIC-III Waveform Database - Waveform data matched to MIMIC-III. Requires data use agreement and training.
  2. MIMIC-IV Waveform Database Matched Subset
  3. MIMIC-IV-ECG - Diagnostic ECG data from MIMIC-IV.
  4. PTB-XL: A large publicly available electrocardiography dataset
  5. PhysioNet - Contains numerous other waveform databases (ECG, EEG, etc.) beyond MIMIC.

Prescription Data

  1. OpenPrescribing - Prescribing data from GPs in England.
  2. FDA Adverse Event Reporting System (FAERS) - Captures reports of adverse drug reactions.

Imaging Data

Radiology (X-ray, CT, MRI)

  1. TCIA (The Cancer Imaging Archive) - Excellent resource for cancer imaging.
  2. Chest X-Ray Dataset - Pneumonia detection.
  3. RSNA Intracranial Hemorrhage Detection - Head CT scans with hemorrhage labels.
  4. MICCAI 2015 Challenge on Multimodal Brain Tumor Segmentation - Brain tumor segmentation.
  5. Non-Small Cell Lung Cancer CT Scan Dataset
  6. PROSTATEx - Prostate MRI.
  7. MosMedData: Chest CT Scans with COVID-19 Related Findings
  8. LUng Nodule Analysis (LUNA16) - Lung nodules.
  9. NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories
  10. DeepLesion - CT images with lesions.
  11. Medical Segmentation Decathlon Datasets
  12. dHCP 2nd data release -- sourcedata - Developmental Human Connectome Project.
  13. dHCP 2nd data release -- fMRI pipeline
  14. PADCHEST_SJ - Chest X-rays (Spanish labels).
  15. MURA (musculoskeletal radiographs) - Stanford bone X-rays.
  16. National COVID-19 Chest Image Database (NCCID) - UK COVID-19 imaging.
  17. International Neuroimaging Data-Sharing Initiative (INDI)
  18. Open Access Series of Imaging Studies (OASIS) - Brain MRI.
  19. BossDB Open Neuroimagery Datasets
  20. NYU Langone & FAIR FastMRI Dataset - Knee MRIs.
  21. The Human Connectome Project
  22. RadGraph - Radiology report entities/relations.
  23. RadNLI - Radiology report inference.
  24. RadQA - Radiology report QA.
  25. UK Biobank Brain Imaging - Requires application and approval.
  26. ADNI (Alzheimer's Disease Neuroimaging Initiative) - Requires application and approval.
  27. The Cancer Genome Atlas (TCGA) Clinical Data Resource imaging data - TCIA hosts the imaging data associated with TCGA subjects.

Ophthalmology

  1. Labeled Optical Coherence Tomography - Retinal OCT.
  2. e_ophtha - A database of retinal images used for the detection of diabetic retinopathy.
  3. DIARETDB0 and DIARETDB1 - Diabetic retinopathy databases and evaluation protocols.
  4. MESSIDOR and Base de Datos Oftalmologica de la Region de Murcia (BDOR-Murcia) - Datasets for computer-aided diagnosis of diabetic retinopathy.
  5. STARE (Structured Analysis of the Retina) - Retinal images with vessel segmentations and disease labels.
  6. DRIVE: Digital Retinal Images for Vessel Extraction - Retinal images for vessel segmentation.
  7. RFMiD - Retinal Fundus Multi-disease Image Dataset.
  8. A-Fundus - Pediatric fundus images.

Dermatology

  1. ISIC Archive (International Skin Imaging Collaboration) - A large collection of dermoscopic images of skin lesions.
  2. HAM10000 dataset - "Human Against Machine with 10000 training images" - dermoscopic images.
  3. DermNet NZ - While primarily a resource for information, DermNet NZ has a vast image library, though it is primarily for educational use and may have copyright restrictions.
  4. PAD-UFES-20 - Skin lesion images with clinical data.

Pathology

  1. CAMELYON17 breast cancer - Lymph node metastasis.
  2. PatchCamelyon (PCam) - A benchmark dataset for machine learning, derived from Camelyon16.
  3. Computational Precision Medicine - Gigapixel pathology images from the University of Pittsburgh.
  4. The Cancer Genome Atlas (TCGA) - Includes histopathology images alongside genomic data. Requires data use agreement and, for some data, IRB approval.
  5. PANDA challenge - Prostate cancer grade assessment.

Microscopy

  1. Cell Painting Gallery - Drug discovery.
  2. Allen Cell Imaging Collections - 3D cell imaging.
  3. BBBC (Broad Bioimage Benchmark Collection) - A collection of freely available, high-quality, biological image datasets.

Dental

  1. A multimodal dental dataset facilitating machine learning research and clinic services

Other Imaging Modalities

  1. MIMIC-IV-ECHO - Echocardiogram data from MIMIC-IV.
  2. EchoNet-Dynamic - Echocardiogram videos with ejection fraction measurements.

Omics Data

Genomics

  1. TCGA (The Cancer Genome Atlas) - Requires data use agreement.
  2. 1000 Genomes Project
  3. Genome Aggregation Database
  4. Genome in a Bottle on AWS - Reference genomes.
  5. GDC (Genomic Data Commons) - Requires data use agreement.
  6. UK Biobank - Requires an extensive application process.

Transcriptomics

  1. GTEx (Genotype-Tissue Expression)
  2. Gene Expression Omnibus (GEO)
  3. ArrayExpress
  4. Expression Atlas - A curated resource that provides information on gene expression across species and biological conditions.

Proteomics

  1. Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3)
  2. Protein Data Bank (PDB) - Protein structures.
  3. Human Protein Atlas
  4. UniProt - Protein sequences and annotations.
  5. MassIVE (https://massive.ucsd.edu/) and ProteomeXchange (http://www.proteomexchange.org/) - Repositories for mass spectrometry proteomics data.

Metabolomics

  1. Human Metabolome Database
  2. Metabolomics Workbench - A public repository for metabolomics data and metadata.

Multi-omics

  1. cBioPortal - Cancer genomics.
  2. LinkedOmics - A web portal that integrates TCGA data with CPTAC proteomic data.
  3. PRECOG - PREdiction of Clinical Outcomes from Genomic profiles

Pharmacogenomics

  1. Cancer Cell Line Encyclopedia (CCLE)
  2. CoMMpass from the Multiple Myeloma Research Foundation
  3. NIH NCBI Sequence Read Archive (SRA) on AWS
  4. Basic Local Alignment Sequences Tool (BLAST) Databases
  5. Encyclopedia of DNA Elements (ENCODE)
  6. Tox21
  7. CTRP (Cancer Therapeutics Response Portal)
  8. PharmGKB
  9. GDSC (Genomics of Drug Sensitivity in Cancer) - A large-scale database of drug sensitivity in cancer cell lines.

Biomedical Knowledge Graphs and Ontologies

General Medical Terminologies

  1. UMLS (Unified Medical Language System)
  2. SNOMED CT - Requires a license, free in some countries.
  3. LOINC (Logical Observation Identifiers Names and Codes)
  4. MeSH (Medical Subject Headings)
  5. ICD-10 (International Classification of Diseases, 10th Revision)
  6. ICD-9 (International Classification of Diseases, 9th Revision)
  7. CPT (Current Procedural Terminology) - Requires a license.
  8. Medical Dictionary for Regulatory Activities Terminology (MedDRA) - Requires a license.
  9. International Classification of Diseases for Oncology

Drug and Chemical Information

  1. RxNorm
  2. DrugBank
  3. RxMix
  4. RxTerms
  5. Dailymed
  6. PubChem
  7. ChEMBL
  8. SIDER - Drug side effects.
  9. STITCH - Chemical-protein interactions.
  10. ZINC - Compounds for virtual screening.

Disease and Gene Information

  1. Orphanet Rare Disease Ontology
  2. GWAS Catalog
  3. Gene Ontology
  4. Disease Ontology
  5. Genetic and Rare Diseases
  6. Online Mendelian Inheritance in Man
  7. DisGeNET
  8. ClinVar - A public archive of reports of the relationships among human variations and phenotypes.
  9. HGNC (HUGO Gene Nomenclature Committee) - Provides approved human gene nomenclature.

Pathway and Interaction Databases

  1. Reactome
  2. Experimental Factor Ontology (EFO)
  3. UBERON - Anatomy ontology.
  4. Kyoto Encyclopedia of Genes and Genomes
  5. Open Targets
  6. STRING - A database of known and predicted protein-protein interactions.
  7. BioGRID - A database of protein and genetic interactions.
  8. IntAct - A molecular interaction database.

Public Health Data

Global Health

  1. Global Health Observatory (GHO)
  2. World Bank Health Data
  3. Global Burden of Disease (GBD)
  4. UNICEF Data
  5. OECD Health Statistics
  6. Humanitarian Data Exchange
  7. Institute for Health Metrics and Evaluation (IHME) - Provides access to many more datasets related to global health.

US-Specific Public Health

  1. CDC WONDER
  2. Medicare.gov Data
  3. HealthData.gov
  4. Medicare Provider Utilization and Payment Data
  5. National Health and Nutrition Examination Survey (NHANES)
  6. SEER (Surveillance, Epidemiology, and End Results Program) - Cancer statistics. Requires data use agreement.
  7. Behavioral Risk Factor Surveillance System (BRFSS) - Health-related telephone surveys.
  8. Youth Risk Behavior Surveillance System (YRBSS) - Monitors health-risk behaviors among youth.

Health Systems and Policy

  1. All of Us Research Program
  2. Canadian Open Neuroscience Platform (CONP)
  3. Pharmaceuticals and Medical Devices Agency (PMDA), Japan
  4. European Medicines Agency (EMA)
  5. Kaiser Permanente Research Bank - Data from Kaiser Permanente, a large integrated healthcare delivery system (requires application and proposal).
  6. Truven Health MarketScan Databases - Commercial claims and EMR data. (Requires purchase.)
  7. Optum Clinformatics Data Mart - Commercial claims and EMR data. (Requires purchase; academic subscriptions available.)
  8. National Inpatient Sample (NIS) - Largest all-payer inpatient care database in the US. (Available for purchase.)
  9. National Ambulatory Medical Care Survey (NAMCS) and National Hospital Ambulatory Medical Care Survey (NHAMCS) - Provide data on ambulatory care visits.

Biomedical Literature

Article Databases and Collections

  1. PubMed Central Open Access Subset
  2. CORD-19 - COVID-19 papers.
  3. LitCovid - COVID-19 literature.
  4. PubMed
  5. Europe PMC
  6. Microsoft Academic Graph
  7. Semantic Scholar Open Research Corpus

Literature-Based Datasets

  1. BioASQ - Biomedical semantic indexing and question answering challenges.
  2. ChemProt - Chemical-protein interactions extracted from literature.
  3. DDI (Drug-Drug Interaction) Extraction - Drug-drug interaction extraction from biomedical texts.

Wearable and Sensor Data

  1. PhysioNet - (See also Waveform Data) Contains numerous datasets from wearable sensors.
  2. UCI Machine Learning Repository - Contains several smaller datasets related to wearable sensor data, activity recognition, etc.
  3. The BIDMC PPG and respiration dataset - Photoplethysmogram (PPG), respiration, and other physiological signals.
  4. VitalDB - Vital signs data collected during surgeries.
  5. mHealth Dataset - Body motion and vital signs data.
  6. Opportunity Dataset - Daily living activity recognition data.

Social Determinants of Health (SDOH)

  1. American Community Survey (ACS) - Provides detailed demographic, socioeconomic, and housing data at various geographic levels.
  2. Area Health Resources Files (AHRF) - County-level data on healthcare resources, demographics, and social determinants.
  3. County Health Rankings & Roadmaps - Provides rankings and data on various health factors and outcomes at the county level.
  4. USDA Food Environment Atlas - Data on food access, food prices, and local food systems.
  5. Robert Wood Johnson Foundation (RWJF) Data Hub - Curated datasets related to health equity and social determinants.

Synthetic Data

  1. Synpuf - Medicare synthetic data.
  2. Synthea - A synthetic patient generator that models the medical history of US patients.
  3. MedGAN - Generating Multi-label Discrete Patient Records using Generative Adversarial Networks.
  4. CorGAN - Correlation-Capturing Convolutional Generative Adversarial Networks for Generating Synthetic Healthcare Time-Series Data.

Miscellaneous

  1. Human Mortality Database
  2. OpenNeuro - Neuroimaging data.
  3. IBL Neuropixels Reproducible Ephys Data on AWS (https://registry.opendata.aws/ibl-reproducible-ephys/)
  4. Human Cell Atlas
  5. Refgenie reference genome assets.
  6. Open Bioinformatics Reference Data for Galaxy.
  7. OpenCell on AWS.

License

CC0

This list is released into the public domain. See the license file for more details.