Data providers are creating synthetic versions of their datasets for a variety of purposes. Below we list some synthetic datasets related to health research.
Clinical Practice Research Datalink (CPRD)
- CPRD offer four synthetic datasets1 based on their two separate primary care databases, Aurum and GOLD2.
- Accessing these synthetic datasets requires paying a fee and completing a data license agreement.
- The four datasets:
  - CPRD cardiovascular disease synthetic dataset based on CPRD Aurum database (high fidelity)
  - CPRD COVID-19 symptoms and risk factors synthetic dataset based on CPRD Aurum database (high fidelity)
  - CPRD Aurum sample dataset based on CPRD Aurum database (medium fidelity)
  - CPRD GOLD sample dataset based on CPRD GOLD database (medium fidelity)
UK Biobank
- UK Biobank3 offer one low-fidelity dataset4, similar in size and structure to the real dataset, with values generated at random.
- There appears to be no cost for this dataset, and it can be downloaded from the website directly.
NHS
- The NHSE Data Science Case Studies5 offer code and methods for generating synthetic data.
- NHS Digital ran an Artificial data pilot6, creating three synthetic datasets based on Hospital Episode Statistics (from NHS hospitals across England).
- NHS England share a synthetic version of A&E data7.
- There appears to be no cost for these datasets, and they can be downloaded from the websites directly.