Datasets for Medicine #31
An o1-style reasoning SFT dataset: https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT
The following is a comprehensive listing of medical datasets that can be used as a foundation for medicine.
I'm super excited to start training clinical reasoning LLMs, since the clinical domain seems to bridge the gap between the objective and the subjective, so I was pretty happy to see the TRL GRPO implementation come out!

I know the focus right now is on simple, verifiable answers with binary rewards, so I'm not really sure what'll end up being used. It might be cool to figure out how to incorporate some sort of continuous ROUGE / BERTScore reward down the line, since there are so many clinical NLP datasets devoted to free-text tasks (note generation, summarization, etc.), but I admit that doesn't sound as elegant. It might even be simpler to grade free-text responses of this type with another LLM as a judge, awarding pass or fail grades.

Some of the datasets I've included, namely MIMIC and n2c2, are a bit more traditional and harder to access, but are probably richer data sources. In addition, some aren't easy to work with out of the box, as they're annotated for traditional NER, relation extraction, and similar tasks. I don't like the idea of creating dataset-specific reward functions for parsing output, but I don't know what else to do (maybe have another LLM transform the first model's output into a structured format, but that sounds complicated).

Simple QA data is obviously the easiest to work with, but I'm not entirely convinced that QA alone will lead to the sort of complex clinical reasoning we're looking for (I hope it does!).

These are some nice sources I've used to find the datasets below: I've roughly ordered the datasets in terms of what I believe to be increasing complexity of verification.
- PubMedQA
- MedQA
- MedMCQA
- HEAD-QA
- MTSample
- BioNLI
- MedNLI
- MIMIC-IV-ICD
- HoC
- EmrQA
- 2018 n2c2
- KD-DTI
- DDI
- 2009 n2c2
- BC5CDR
- 2010 n2c2
- 2011 n2c2
- MIMIC-CDM
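To make the reward options above concrete, here is a minimal sketch of the two styles discussed: a binary exact-match reward for verifiable QA, and a continuous overlap reward for free-text tasks. This is plain Python with no TRL dependency; the function names, the `Answer: X` output convention, and the use of unigram F1 as a crude stand-in for ROUGE/BERTScore are all my assumptions, not anything prescribed by these datasets.

```python
import re
from collections import Counter


def exact_match_reward(completions, answers):
    """Binary reward for verifiable QA: 1.0 if the completion ends with a
    matching 'Answer: X' line, else 0.0. The output format is an assumption."""
    rewards = []
    for completion, gold in zip(completions, answers):
        m = re.search(r"Answer:\s*([A-Za-z0-9]+)\s*$", completion)
        pred = m.group(1) if m else None
        rewards.append(1.0 if pred == gold else 0.0)
    return rewards


def overlap_f1(pred, ref):
    """Unigram-overlap F1: a crude, dependency-free stand-in for ROUGE."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def overlap_reward(completions, references):
    """Continuous reward in [0, 1] for free-text tasks like note generation."""
    return [overlap_f1(c, ref) for c, ref in zip(completions, references)]
```

Either function could slot in as a reward callable for a GRPO-style trainer, since both just map a batch of completions to a list of floats.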
We're aiming to curate / create a large-scale dataset of high-quality (patient presentation, diagnosis/next care step) pairs for medicine. The goal is to have pairs that are verifiable.
Given these pairs, one can:
- Distill synthetic reasoning traces from DeepSeek-R1 (good for SFT on smol models)
- Feed into the GRPO pipeline to bootstrap base models (like R1-Zero) or combine with RLHF (R1).
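As a rough illustration of both uses, here is a sketch of turning such pairs into prompts and rejection-sampling distilled traces by their final answer. The record shape, prompt template, and `Final: <answer>` convention are hypothetical placeholders, not a format any of these datasets actually use.

```python
import re

# Hypothetical record shape: {"presentation": ..., "diagnosis": ...}
PROMPT_TEMPLATE = (
    "Given the patient presentation below, state the most likely "
    "diagnosis or next care step.\n\n"
    "Presentation: {presentation}\n\n"
    "End your response with a line of the form 'Final: <answer>'."
)


def to_prompt(record):
    """Turn a (presentation, diagnosis) pair into a GRPO/SFT prompt."""
    return PROMPT_TEMPLATE.format(presentation=record["presentation"])


def final_answer(completion):
    """Extract the model's final answer, or None if the format wasn't followed."""
    m = re.search(r"Final:\s*(.+)", completion)
    return m.group(1).strip().lower() if m else None


def keep_for_sft(record, trace):
    """Rejection-sample distilled reasoning traces: keep a trace only if its
    final answer matches the gold label (case-insensitive exact match)."""
    return final_answer(trace) == record["diagnosis"].strip().lower()
```

The same `final_answer` extractor doubles as the verifier for a binary GRPO reward, which is what makes verifiable pairs useful for both routes.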
Let's use this issue to gather public datasets that are suitable to start with!