Datasets for Medicine #31

Open
cyrilzakka opened this issue Jan 25, 2025 · 3 comments

@cyrilzakka

We're aiming to curate / create a large-scale dataset of high-quality (patient presentation, diagnosis/next care step) pairs for medicine. The goal is to have pairs that are verifiable.

Given these pairs, one can:

  • Distill synthetic reasoning traces from DeepSeek-R1 (good for SFT on smol models)
  • Feed into the GRPO pipeline to bootstrap base models (like R1-Zero) or combine with RLHF (R1).

Let's use this issue to gather public datasets that are suitable to start with!
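To make "verifiable pair" concrete, here is a minimal sketch of how one such pair could be stored, rendered as a prompt, and checked. Everything in it is an assumption for illustration: the `ClinicalPair` class, the `Answer:` output convention, and the example case are hypothetical and not taken from any dataset or pipeline mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClinicalPair:
    presentation: str  # free-text patient presentation
    answer: str        # verifiable target: a diagnosis or next care step


def to_prompt(pair: ClinicalPair) -> str:
    """Render the presentation as a prompt that asks for one final answer line."""
    return (
        "Patient presentation:\n"
        f"{pair.presentation}\n\n"
        "Give the most likely diagnosis or next care step on a final line "
        "starting with 'Answer:'."
    )


def is_correct(completion: str, pair: ClinicalPair) -> bool:
    """Binary check: does the last 'Answer:' line match the reference exactly?"""
    for line in reversed(completion.strip().splitlines()):
        line = line.strip()
        if line.lower().startswith("answer:"):
            predicted = line.split(":", 1)[1].strip().lower()
            return predicted == pair.answer.strip().lower()
    return False  # no answer line found


# Made-up example case, for illustration only.
pair = ClinicalPair(
    presentation="24-year-old with fever, right lower quadrant pain, and rebound tenderness.",
    answer="Acute appendicitis",
)
print(to_prompt(pair))
print(is_correct("Reasoning...\nAnswer: acute appendicitis", pair))
```

Exact string matching is the strictest possible check and will reject valid synonyms, so real diagnosis labels would probably need some normalization on top of this.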

@Rajatavaa

Rajatavaa commented Jan 25, 2025

A reasoning SFT o1 dataset - https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT
Hair problems and diagnosis - https://huggingface.co/datasets/Amod/hair_medical_sit
Dataset with instruction and diagnosis - https://huggingface.co/datasets/mamachang/medical
Explicit reasoning dataset - https://huggingface.co/datasets/PJMixers-Dev/FreedomIntelligence_medical-o1-reasoning-SFT-CustomShareGPT

@ATaylorAerospace

ATaylorAerospace commented Jan 27, 2025

The following is a comprehensive listing of medical datasets that can be used as a foundation for medicine.

Practical Guide for Medical Data

@mkieffer1107

I'm super excited to start training clinical reasoning LLMs since the clinical domain seems to bridge the gap between the objective and subjective. So I was pretty happy to see the TRL GRPO implementation come out!

I know that the focus right now is on simple, verifiable answers with binary rewards, so I'm not really sure what'll end up being used. It might be cool to figure out how to incorporate some sort of continuous ROUGE / BERTScore reward down the line, since there are so many clinical NLP datasets devoted to this (note generation, summarization, etc.), but I admit that doesn't sound as elegant. It might even be simpler to grade free-text responses of this type with another LLM as a judge, awarding pass or fail grades.
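As a rough sketch of the two reward styles contrasted above: a binary reward for multiple-choice answers and a continuous overlap score for free text. The function names and the "Answer: X" convention are assumptions, and `unigram_f1` is only a simplified unigram-overlap stand-in, not the actual ROUGE or BERTScore metric.

```python
import re
from collections import Counter
from typing import Optional


def extract_choice(completion: str) -> Optional[str]:
    """Return the last option letter that follows the word 'answer', if any.

    Limitation: matching is case-insensitive, so 'answer: a patient ...'
    would be read as option A.
    """
    pattern = r"answer\s*(?:is)?\s*[:\-]?\s*\(?([A-E])\)?\b"
    matches = re.findall(pattern, completion, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else None


def binary_reward(completion: str, gold_choice: str) -> float:
    """1.0 if the extracted option letter equals the gold letter, else 0.0."""
    return 1.0 if extract_choice(completion) == gold_choice.upper() else 0.0


def tokenize(text: str) -> list:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """Continuous reward in [0, 1]: F1 over overlapping unigram counts."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(binary_reward("Thinking it through... Answer: B", "B"))
print(unigram_f1("patient has pneumonia", "the patient has pneumonia"))
```

The binary reward is hard to game but gives no partial credit; the overlap score gives partial credit but can be inflated by copying common words, which is one reason an LLM judge might end up simpler for free-text tasks.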

Some of the datasets I've included, namely MIMIC and n2c2, are a bit more traditional and harder to access, but are probably richer data sources. In addition, some aren't easy to work with out of the box, as they're annotated for traditional tasks such as NER and relation extraction. I don't like the idea of creating dataset-specific reward functions for parsing output, but idk what else to do (maybe have another LLM transform the first model's output into structured output, but this sounds complicated). Simple QA data is obviously the easiest to work with, but I'm not entirely convinced that QA alone will lead to the sort of complex clinical reasoning we're looking for (I hope it does!).
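One way the "structured output" idea could look for an NER-style dataset: ask the model to end with a JSON list of entities and score it with set-level F1 against the gold annotations. This is only a sketch under assumptions. The JSON schema (`text` and `type` keys), the function names, and the example entities are all hypothetical, and matching here is on entity text and type only, not on character spans.

```python
import json
from typing import Set, Tuple

Entity = Tuple[str, str]  # (entity text, entity type), both lowercased


def parse_entities(completion: str) -> Set[Entity]:
    """Parse the outermost JSON array in the completion into an entity set.

    Returns an empty set if no well-formed JSON array is found.
    """
    start, end = completion.find("["), completion.rfind("]")
    if start == -1 or end <= start:
        return set()
    try:
        items = json.loads(completion[start:end + 1])
    except json.JSONDecodeError:
        return set()
    entities = set()
    for item in items:
        if isinstance(item, dict) and "text" in item and "type" in item:
            entities.add((str(item["text"]).strip().lower(),
                          str(item["type"]).strip().lower()))
    return entities


def entity_f1(predicted: Set[Entity], gold: Set[Entity]) -> float:
    """Set-level F1 between predicted and gold entities."""
    if not predicted or not gold:
        return 1.0 if predicted == gold else 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example, for illustration only.
gold = {("chest pain", "problem"), ("aspirin", "treatment")}
completion = ('Entities: [{"text": "Chest pain", "type": "problem"}, '
              '{"text": "ECG", "type": "test"}]')
print(entity_f1(parse_entities(completion), gold))
```

The parser is the only dataset-agnostic part; the scoring function would still differ between, say, relation extraction and coreference, so this reduces rather than removes the per-dataset work.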

These are some nice sources I've used to find the datasets below:
https://huggingface.co/bigbio
https://huggingface.co/medalpaca
https://huggingface.co/clinicalnlplab
https://github.com/openmedlab/Awesome-Medical-Dataset/tree/main?tab=readme-ov-file#text-dataset
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/

I've roughly ordered these by what I believe to be increasing complexity of verification.

| Dataset | Task | License |
|---|---|---|
| PubMedQA | Multiple-choice QA and long-form explanatory answers | MIT |
| MedQA | Multiple-choice QA | ? |
| MedMCQA | Multiple-choice QA | MIT |
| HEAD-QA | Multiple-choice QA | MIT |
| MTSample | Classification | ? |
| BioNLI | NLI / Classification | CC BY-NC |
| MedNLI | NLI / Classification | PhysioNet Credentialed Health Data License 1.5.0 |
| MIMIC-IV-ICD | Multi-label Classification | PhysioNet Credentialed Health Data License 1.5.0 |
| HoC | Multi-label Classification | GPL-3.0 |
| EmrQA | Passage extraction | i2b2 / n2c2 license |
| 2018 n2c2 | Multi-label Classification, Relation extraction | i2b2 / n2c2 license |
| KD-DTI | Relation Extraction | ? |
| DDI | Relation Extraction | CC BY-NC |
| 2009 n2c2 | NER and relation extraction | i2b2 / n2c2 license |
| BC5CDR | NER and relation extraction | Public Domain Mark 1.0 |
| 2010 n2c2 | NER, assertion classification, relation extraction | i2b2 / n2c2 license |
| 2011 n2c2 | Coreference resolution | i2b2 / n2c2 license |
| MIMIC-CDM | Clinical decision-making | PhysioNet Credentialed Health Data License 1.5.0 |

PubMedQA

MedQA

MedMCQA

HEAD-QA

MTSample

BioNLI

  • summary: A natural language inference (NLI) dataset for biomedical texts, testing models' ability to infer entailment or contradiction between statements
  • task: NLI / Classification
  • format:
    • input: Text pair (premise and hypothesis sentences from biomedical literature)
    • output: Classification label ("entailment", "contradiction", or "neutral")
  • paper: https://arxiv.org/abs/2210.14814
  • data:
  • license: CC BY-NC

MedNLI

  • summary: An NLI dataset specifically designed for clinical domains using patient records annotated with entailment labels by clinicians
  • task: NLI / Classification
  • format:
    • input: Text pair (premise and hypothesis sentences from clinical notes)
    • output: Classification label ("entailment", "contradiction", or "neutral")
  • paper: https://arxiv.org/abs/1808.06752
  • data:
  • license: PhysioNet Credentialed Health Data License 1.5.0
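For the NLI datasets, the three-way label makes verification close to trivial. A minimal sketch, assuming the model states its label somewhere in free text (the function names are hypothetical):

```python
import re
from typing import Optional

LABELS = ("entailment", "contradiction", "neutral")


def extract_label(completion: str) -> Optional[str]:
    """Return the label only if exactly one distinct label is mentioned."""
    text = completion.lower()
    found = {label for label in LABELS if re.search(rf"\b{label}\b", text)}
    return found.pop() if len(found) == 1 else None


def nli_reward(completion: str, gold_label: str) -> float:
    """1.0 for an unambiguous, correct label; 0.0 otherwise."""
    return 1.0 if extract_label(completion) == gold_label.lower() else 0.0


print(nli_reward("The premise supports the hypothesis, so: entailment", "entailment"))
print(nli_reward("Could be entailment or contradiction", "entailment"))
```

Rejecting completions that mention more than one label is a deliberate choice here: otherwise a model could collect reward by listing every label. The trade-off is that reasoning which names a label in order to rule it out also scores zero, so a stricter output format (for example a final answer line) may work better in practice.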

MIMIC-IV-ICD

HoC

EmrQA

  • summary: A clinical QA dataset generated from Electronic Medical Records (EMRs) using semi-automated methods
  • task: Passage extraction
  • format:
    • input: Text (clinical notes and question)
    • output: Text (extractive answer span from input)
  • paper: https://arxiv.org/abs/1809.00732
  • data:
  • license: i2b2 / n2c2 license
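Because the expected output is a span copied from the input, an extractive answer can be verified in two steps: check that the prediction actually occurs in the note, then compare it with the gold span. The sketch below assumes whitespace- and punctuation-insensitive matching; the function names and the example note are made up for illustration and are not from EmrQA.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def span_reward(prediction: str, gold_span: str, note: str) -> float:
    """1.0 if the prediction is grounded in the note and equals the gold span."""
    pred = normalize(prediction)
    # Pad with spaces so matches respect token boundaries.
    if not pred or f" {pred} " not in f" {normalize(note)} ":
        return 0.0  # empty, or not copied from the note
    return 1.0 if pred == normalize(gold_span) else 0.0


# Made-up example, for illustration only.
note = "Patient was started on metformin 500 mg twice daily for type 2 diabetes."
gold = "metformin 500 mg twice daily"
print(span_reward("Metformin 500 mg twice daily.", gold, note))
print(span_reward("insulin", gold, note))
```

Exact match is strict; a partial-credit variant could replace the final comparison with a token-overlap score while keeping the grounding check.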

2018 n2c2

KD-DTI

DDI

2009 n2c2

BC5CDR

2010 n2c2

  • summary: A corpus of de-identified clinical records for concept extraction, assertion classification, and relation classification
  • task: NER, assertion classification, relation extraction
  • format:
    • input: Text (clinical discharge summaries)
    • output: Named entity labels (medical problems, treatments, tests) with their spans, assertion labels for each entity, and relation labels between entities
  • data:
  • license: i2b2 / n2c2 license

2011 n2c2

MIMIC-CDM
