This research study focuses on the challenges posed by hallucinations in large language models (LLMs) in the medical domain. We evaluate the reasoning abilities of popular open-source medical LLMs on Med-HALT (Medical Domain Hallucination Test), a benchmark and dataset designed specifically to evaluate hallucinations in the medical domain. The benchmark provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities.
Datasets are hosted on Hugging Face Datasets.
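If you want to pull the data programmatically, the snippet below is a minimal sketch using the `datasets` library; the repository id and config name are assumptions, so check the Hugging Face Hub for the exact identifiers used by this project.

```python
from datasets import load_dataset

# Assumed repository id and config name -- verify on the Hugging Face Hub.
med_halt = load_dataset("openlifescienceai/Med-HALT", "reasoning_FCT")

split = next(iter(med_halt))  # take whichever split is available
print(med_halt[split][0])     # inspect a single example
```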
Run inference with `python bio_llm.py <test_name> <batch> <model_name>`; an example invocation follows the argument descriptions below.
- `test_name` can be `FCT`, `Nota`, or `fake`, depending on which test we want to run
- `batch` is the batch size; we use batch sizes such as 4, 8, and 12
- `model_name` can be `asclepius`, `alpacare`, `biomistral`, or `pmc_llama`, depending on which model we want to run inference for
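For instance, running the FCT test for the `biomistral` model with a batch size of 8 would look like:

```bash
python bio_llm.py FCT 8 biomistral
```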
Find the evaluation notebooks here: https://github.com/abhilash-neog/FactCheckingBioLLMs/tree/Evaluation_Codes
We sincerely thank the authors of the following open-source projects: