Datasets for Medicine #31

cyrilzakka · 2025-01-25T17:02:18Z

We're aiming to curate / create a large-scale dataset of high-quality (patient presentation, diagnosis/next care step) pairs for medicine. The goal is to have pairs that are verifiable.

Given these pairs, one can:

- Distill synthetic reasoning traces from DeepSeek-R1 (good for SFT on smol models)
- Feed into the GRPO pipeline to bootstrap base models (like R1-Zero) or combine with RLHF (R1).

Let's use this issue to gather public datasets that are suitable to start with!
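
For context on why verifiable pairs matter for GRPO: the method samples a group of completions per prompt, scores each one with a reward, and normalizes the rewards within the group to obtain advantages. Below is a minimal sketch of that normalization step, assuming binary correctness rewards. The function name, the `eps` term, and the use of the population standard deviation are illustrative choices, not taken from any particular implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    # eps keeps the division defined when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one case; only the second matched the reference diagnosis.
advantages = group_relative_advantages([0.0, 1.0, 0.0, 0.0])
```

Note that a group in which every completion receives the same reward yields all-zero advantages, so cases that are always or never answered correctly contribute no learning signal under this scheme.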

Rajatavaa · 2025-01-25T18:11:37Z

A reasoning SFT o1 dataset - https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT
Hair problems and diagnosis - https://huggingface.co/datasets/Amod/hair_medical_sit
Dataset with instruction and diagnosis - https://huggingface.co/datasets/mamachang/medical
Explicit reasoning dataset - https://huggingface.co/datasets/PJMixers-Dev/FreedomIntelligence_medical-o1-reasoning-SFT-CustomShareGPT

ATaylorAerospace · 2025-01-27T01:52:14Z

The following is a comprehensive listing of medical datasets that can be used as a foundation:

Practical Guide for Medical Data

mkieffer1107 · 2025-01-27T03:12:29Z

I'm super excited to start training clinical reasoning LLMs since the clinical domain seems to bridge the gap between the objective and subjective. So I was pretty happy to see the TRL GRPO implementation come out!

I know that the focus right now is on simple, verifiable answers with binary rewards, so I'm not really sure what'll end up being used. It might be cool to figure out how to incorporate some sort of continuous ROUGE / BERTScore reward down the line since there are so many clinical NLP datasets devoted to this (note generation, summarization, etc.), but I admit that doesn't sound as elegant. It might even be simpler to grade free-text responses of this type with another LLM as a judge, awarding pass or fail grades.
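
To make the binary versus continuous distinction concrete, here is a rough sketch of both kinds of reward in plain Python. Everything here is an assumption for illustration: the `Answer: X` output convention, the function names, and the unigram-overlap F1, which is only a crude stand-in for ROUGE-1 and not the real metric. Adapting these to a specific trainer's reward-function signature is left out on purpose.

```python
import re
from collections import Counter

# Assumed output convention: the completion ends with something like "Answer: C".
ANSWER_RE = re.compile(r"(?i:answer)\s*:\s*\(?([A-E])\b")

def binary_choice_reward(completion, gold_letter):
    """1.0 if the last 'Answer: X' in the completion matches the gold option, else 0.0."""
    matches = ANSWER_RE.findall(completion)
    return 1.0 if matches and matches[-1] == gold_letter else 0.0

def unigram_f1_reward(completion, reference):
    """Continuous reward in [0, 1]: F1 over lowercase word overlap with the reference."""
    pred = Counter(re.findall(r"\w+", completion.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((pred & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(binary_choice_reward("Findings suggest appendicitis. Answer: C", "C"))
print(unigram_f1_reward("acute pancreatitis", "acute appendicitis"))
```

The binary reward gives no credit to a completion that never states an answer in the expected form, which is the behavior one would want if format compliance is part of the task. The overlap reward gives partial credit, but it can be gamed by long outputs that mention many plausible terms, which is one reason an LLM judge with pass/fail grades might end up simpler.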

Some of the datasets I've included, namely MIMIC and n2c2, are a bit more traditional and harder to access, but are probably richer data sources. In addition, some aren't easy to work with out of the box, as they're annotated for traditional tasks like NER and relation extraction. I don't like the idea of creating dataset-specific reward functions for parsing output, but idk what else to do (maybe have another LLM transform the first model's output into structured output, but this sounds complicated). Simple QA data is obviously the easiest to work with, but I'm not entirely convinced that QA alone will lead to the sort of complex clinical reasoning we're looking for (I hope it does!).
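
As a sketch of what a reward over structured output could look like for the NER-style datasets, the function below scores a JSON list of entities against gold annotations with set-level F1. The JSON schema (`text` and `label` keys), the label names, and the function name are all hypothetical; none of the listed datasets prescribes this format, and real annotations also carry character spans that this ignores.

```python
import json

def entity_f1_reward(completion, gold_entities):
    """Score a JSON list of {"text", "label"} objects against gold (text, label) pairs."""
    try:
        parsed = json.loads(completion)
        predicted = {(e["text"].strip().lower(), e["label"]) for e in parsed}
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        return 0.0  # unparseable or malformed output earns nothing
    gold = {(text.strip().lower(), label) for text, label in gold_entities}
    if not predicted or not gold:
        return 1.0 if predicted == gold else 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: two of three gold entities recovered.
output = '[{"text": "Metformin", "label": "Drug"}, {"text": "nausea", "label": "ADE"}]'
gold = [("metformin", "Drug"), ("nausea", "ADE"), ("lisinopril", "Drug")]
print(entity_f1_reward(output, gold))
```

If the model were asked to emit the same JSON schema for every dataset, one function like this could be shared across tasks, with only the label inventory changing per dataset.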

These are some nice sources I've used to find the datasets below:
https://huggingface.co/bigbio
https://huggingface.co/medalpaca
https://huggingface.co/clinicalnlplab
https://github.com/openmedlab/Awesome-Medical-Dataset/tree/main?tab=readme-ov-file#text-dataset
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/

I kind of ordered these in terms of what I believe to be increasing complexity of verification.

Dataset	Task	License
PubMedQA	Multiple-choice QA and long-form explanatory answers	MIT
MedQA	Multiple-choice QA	?
MedMCQA	Multiple-choice QA	MIT
HEAD-QA	Multiple-choice QA	MIT
MTSample	Classification	?
BioNLI	NLI / Classification	CC BY-NC
MedNLI	NLI / Classification	PhysioNet Credentialed Health Data License 1.5.0
MIMIC-IV-ICD	Multi-label Classification	PhysioNet Credentialed Health Data License 1.5.0
HoC	Multi-label Classification	GPL-3.0
EmrQA	Passage extraction	i2b2 / n2c2 license
2018 n2c2	Multi-label Classification, Relation extraction	i2b2 / n2c2 license
KD-DTI	Relation Extraction	?
DDI	Relation Extraction	CC BY-NC
2009 n2c2	NER and relation extraction	i2b2 / n2c2 license
BC5CDR	NER and relation extraction	Public Domain Mark 1.0
2010 n2c2	NER, assertion classification, relation extraction	i2b2 / n2c2 license
2011 n2c2	Coreference resolution	i2b2 / n2c2 license
MIMIC-CDM	Clinical decision-making	PhysioNet Credentialed Health Data License 1.5.0

PubMedQA

summary: A biomedical QA dataset focused on reasoning over PubMed abstracts to answer research questions with "yes/no/maybe" answers
task: Multiple-choice QA and long-form explanatory answers
format:
- input: Text (question and PubMed abstract without conclusion)
- output: Classification label ("yes", "no", or "maybe") and text (long-form conclusion)
paper: https://arxiv.org/abs/1909.06146
data:
license: MIT

MedQA

summary: A multiple-choice QA dataset derived from medical licensing examinations in the US, China, and Taiwan
task: Multiple-choice QA
format:
- input: Text (medical exam question and multiple choice options)
- output: Classification label (correct answer option)
paper: https://www.mdpi.com/2076-3417/11/14/6421
data:
license: ?

MedMCQA

summary: A large-scale multiple-choice QA dataset with over 194k medical entrance exam questions covering 21 subjects
task: Multiple-choice QA
format:
- input: Text (question and multiple choice options)
- output: Classification label(s) (correct answer option(s))
paper: https://proceedings.mlr.press/v174/pal22a.html
data:
license: MIT

HEAD-QA

summary: A multiple-choice QA dataset in Spanish from real medical exams, covering six topics: medicine, pharmacology, nursing, psychology, biology, and chemistry
task: Multiple-choice QA
format:
- input: Text (question and multiple choice options)
- output: Classification label (correct answer option)
paper: https://arxiv.org/abs/1906.04701
data:
- https://huggingface.co/datasets/dvilares/head_qa
- https://aghie.github.io/head-qa/
license: MIT

MTSample

summary: A text classification dataset for predicting which medical specialty a clinical transcription belongs to
task: Classification
format:
- input: Text (clinical transcription)
- output: Classification label (medical specialty)
paper: https://mtsamples.com/
data:
- https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions
- https://github.com/JunaidFarooqZargar/clinical_text_classification
license: ?

BioNLI

summary: A natural language inference (NLI) dataset for biomedical texts, testing models' ability to infer entailment or contradiction between statements
task: NLI / Classification
format:
- input: Text pair (premise and hypothesis sentences from biomedical literature)
- output: Classification label ("entailment", "contradiction", or "neutral")
paper: https://arxiv.org/abs/2210.14814
data:
- https://github.com/StonyBrookNLP/BioNLI
license: CC BY-NC

MedNLI

summary: An NLI dataset specifically designed for clinical domains using patient records annotated with entailment labels by clinicians
task: NLI / Classification
format:
- input: Text pair (premise and hypothesis sentences from clinical notes)
- output: Classification label ("entailment", "contradiction", or "neutral")
paper: https://arxiv.org/abs/1808.06752
data:
- https://jgc128.github.io/mednli/
license: PhysioNet Credentialed Health Data License 1.5.0

MIMIC-IV-ICD

summary: A benchmark dataset from MIMIC-IV for multi-label classification of medical codes (ICD-9 and ICD-10) from clinical discharge summaries
task: Multi-label classification
format:
- input: Free-text discharge summaries from electronic health records (EHRs)
- output: Multi-label annotations with ICD-9 or ICD-10 codes indicating diagnoses and procedures
paper: https://arxiv.org/pdf/2304.13998
data:
- https://physionet.org/content/mimiciv/2.2/
- https://github.com/thomasnguyen92/MIMIC-IV-ICD-data-processing
license: PhysioNet Credentialed Health Data License 1.5.0
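
One way a multi-label coding task like this could be turned into a graded reward is to compare code sets at two granularities: the full code and the three-character category. The sketch below assumes the model outputs a comma-separated list of codes. The function names, the output convention, and the 50/50 weighting are illustrative assumptions, and the three-character prefix is only a category for ICD-10 and most ICD-9 diagnosis codes, not for ICD-9 procedure codes.

```python
def normalize_code(code):
    """Uppercase and drop dots/whitespace so 'k35.80' and 'K3580' compare equal."""
    return code.strip().upper().replace(".", "")

def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as a perfect match."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def icd_reward(completion, gold_codes, category_weight=0.5):
    """Blend exact-code overlap with overlap at the 3-character category level."""
    predicted = {normalize_code(c) for c in completion.split(",") if c.strip()}
    gold = {normalize_code(c) for c in gold_codes}
    exact = jaccard(predicted, gold)
    category = jaccard({c[:3] for c in predicted}, {c[:3] for c in gold})
    return (1 - category_weight) * exact + category_weight * category

print(icd_reward("K35.2", ["K35.80"]))  # right category, wrong specific code
```

The category term gives partial credit for landing in the right diagnostic family, which may be a gentler signal than all-or-nothing matching when a discharge summary carries many codes.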

HoC

summary: A text classification dataset for identifying cancer hallmarks in biomedical literature
task: Multi-label Classification
format:
- input: Text (biomedical literature excerpt)
- output: Multi-label classification (cancer hallmark labels)
paper: https://academic.oup.com/bioinformatics/article/32/3/432/1743783
data:
- https://github.com/sb895/Hallmarks-of-Cancer
license: GPL-3.0

EmrQA

summary: A clinical QA dataset generated from Electronic Medical Records (EMRs) using semi-automated methods
task: Passage extraction
format:
- input: Text (clinical notes and question)
- output: Text (extractive answer span from input)
paper: https://arxiv.org/abs/1809.00732
data:
- https://github.com/panushri25/emrQA
license: i2b2 / n2c2 license
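
Because the answers here are extractive, a reward can check that the model's answer is actually a verbatim span of the note before comparing it to the gold answer. The following is a sketch under assumptions: the function names, the whitespace and case normalization, and the 0.5 partial credit for overlapping spans are all illustrative, and the example note is made up rather than taken from the dataset.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so line wrapping does not affect matching."""
    return re.sub(r"\s+", " ", text).strip().lower()

def extractive_reward(completion, context, gold_answers):
    """0.0 unless the answer is a span of the context; 1.0 on exact match with a gold span."""
    answer = normalize(completion)
    if not answer or answer not in normalize(context):
        return 0.0  # paraphrased or hallucinated: not a verbatim span of the note
    golds = {normalize(g) for g in gold_answers}
    if answer in golds:
        return 1.0
    # Partial credit (arbitrary value) when the answer contains, or is contained in, a gold span.
    if any(answer in g or g in answer for g in golds):
        return 0.5
    return 0.0

note = "Patient was started on metformin 500 mg\ntwice daily for type 2 diabetes."
print(extractive_reward("Metformin 500 mg twice daily", note, ["metformin 500 mg twice daily"]))
```

The span check costs nothing beyond the input itself, so it works even for questions without gold answers as a format-compliance reward.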

2018 n2c2

summary:
- Track 1: A dataset for patient cohort selection based on clinical narratives
- Track 2: A dataset for extracting medication-related entities and their relations from clinical narratives
task:
- Track 1: Multi-label classification
- Track 2: Relation extraction
format:
- input: Text (longitudinal clinical narratives)
- output:
  - Track 1: Binary classification labels (met/not met) for 13 selection criteria
  - Track 2: Named entity labels (medication concepts, ADEs) with their spans, and relation labels between entities
data:
license: i2b2 / n2c2 license

KD-DTI

summary: A dataset for discovering drug-target interaction (DTI) knowledge from biomedical literature
task: Relation Extraction
format:
- input: Text (full-text biomedical articles)
- output: Text triplet (drug-target-interaction)
paper: https://arxiv.org/abs/2109.13187
data:
- https://github.com/bert-nmt/BERT-DTI
license: ?

DDI

summary: A dataset for extracting drug-drug interactions from biomedical texts, focusing on pharmacological relationships
task: Relation Extraction
format:
- input: Text (biomedical text with drug mentions)
- output: List of drug entities and interactions
paper: https://www.sciencedirect.com/science/article/pii/S1532046413001123
data:
- https://huggingface.co/datasets/bigbio/ddi_corpus
- https://github.com/isegura/DDICorpus
license: CC BY-NC

2009 n2c2

summary: A dataset for extracting medication information from clinical discharge summaries
task: NER and relation extraction
format:
- input: Text (clinical discharge summaries)
- output: Medications, dosages, modes of administration, frequency, and reasons for administration
data:
- https://portal.dbmi.hms.harvard.edu/projects/n2c2-2009/
- https://huggingface.co/datasets/bigbio/n2c2_2009
license: i2b2 / n2c2 license

BC5CDR

summary: A corpus for chemical-disease relation (CDR) extraction from biomedical literature
task: NER and relation extraction
format:
- input: Text (PubMed abstracts)
- output: Named entity labels (chemicals and diseases) with their spans, and relation labels (chemical-induced disease relations)
paper: https://academic.oup.com/database/article/doi/10.1093/database/baw068/2630414
data:
license: Public Domain Mark 1.0

2010 n2c2

summary: A corpus of de-identified clinical records for concept extraction, assertion classification, and relation classification
task: NER, assertion classification, relation extraction
format:
- input: Text (clinical discharge summaries)
- output: Named entity labels (medical problems, treatments, tests) with their spans, assertion labels for each entity, and relation labels between entities
data:
- https://portal.dbmi.hms.harvard.edu/projects/n2c2-2010/
- https://huggingface.co/datasets/bigbio/n2c2_2010
license: i2b2 / n2c2 license

2011 n2c2

summary: A dataset of de-identified clinical records for coreference resolution
task: Coreference resolution
format:
- input: Text (clinical discharge summaries)
- output: Coreference chains for medical concepts
data:
- https://portal.dbmi.hms.harvard.edu/projects/n2c2-2011/
- https://huggingface.co/datasets/bigbio/n2c2_2011
license: i2b2 / n2c2 license

MIMIC-CDM

summary: A curated dataset derived from MIMIC-IV to evaluate large language models (LLMs) on clinical decision-making tasks for abdominal pathologies
task: Clinical decision-making
format:
- input: Structured patient data, including history of present illness (HPI), physical examination findings, laboratory results, radiology reports, and discharge summaries
- output: Predicted diagnosis and treatment plan procedures
paper: https://www.nature.com/articles/s41591-024-03097-1
data:
license: PhysioNet Credentialed Health Data License 1.5.0
