Inventory of synthetic datasets

Data providers are creating synthetic versions of their datasets for a variety of purposes. Below we list some synthetic datasets related to health research.

Clinical Practice Research Datalink (CPRD)

  • CPRD offer four synthetic datasets [1] based on their two separate primary care databases, Aurum and GOLD [2].
  • To access these synthetic datasets, a fee must be paid and a data license agreement completed.
  • The four datasets:
    • CPRD cardiovascular disease synthetic dataset based on CPRD Aurum database (high fidelity)
    • CPRD COVID-19 symptoms and risk factors synthetic dataset based on CPRD Aurum database (high fidelity)
    • CPRD Aurum sample dataset based on CPRD Aurum database (medium fidelity)
    • CPRD GOLD sample dataset based on CPRD GOLD database (medium fidelity)

UK Biobank

  • UK Biobank [3] offer one low fidelity dataset [4], of similar size and structure to the real dataset, with values generated at random.
  • There appears to be no cost for this dataset, and it can be downloaded from the website directly.
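
To illustrate what "low fidelity" means here (same structure as the real data, values generated at random), the following is a minimal sketch. It is not UK Biobank's actual generation method: the schema, column names, and value ranges are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical schema: column names, kinds, and ranges are illustrative
# assumptions, not taken from any real dataset.
SCHEMA = {
    "participant_id": ("id", None),
    "age": ("int", (40, 69)),
    "sex": ("category", ["Female", "Male"]),
    "bmi": ("float", (15.0, 45.0)),
}


def generate_low_fidelity(schema, n_rows, seed=0):
    """Return n_rows records that match the schema, with random values."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for i in range(n_rows):
        row = {}
        for name, (kind, spec) in schema.items():
            if kind == "id":
                row[name] = i + 1  # sequential identifier
            elif kind == "int":
                row[name] = rng.randint(spec[0], spec[1])
            elif kind == "float":
                row[name] = round(rng.uniform(spec[0], spec[1]), 1)
            elif kind == "category":
                row[name] = rng.choice(spec)
            else:
                raise ValueError(f"unknown column kind: {kind}")
        rows.append(row)
    return rows


rows = generate_low_fidelity(SCHEMA, n_rows=5)
for row in rows:
    print(row)
```

In this sketch each column is drawn independently, so no relationships between variables are preserved. Data of this kind can be useful for testing code against the expected structure, but not for drawing analytical conclusions.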

NHS

  • The NHSE Data Science Case Studies [5] offer code and methods to generate synthetic data.
  • NHS Digital ran an Artificial Data pilot [6], creating three synthetic datasets based on Hospital Episode Statistics (from NHS hospitals across England).
  • NHS England share a synthetic version of A&E data [7].
  • There appears to be no cost for these datasets, and they can be downloaded from the websites directly.

Footnotes

  1. https://cprd.com/synthetic-data

  2. https://cprd.com/primary-care-data-public-health-research

  3. https://www.ukbiobank.ac.uk

  4. https://biobank.ndph.ox.ac.uk/ukb/exinfo.cgi?src=UKB_Synthetic_Dataset.html

  5. https://nhsengland.github.io/DataScience-CaseStudies

  6. https://digital.nhs.uk/services/artificial-data

  7. https://data.england.nhs.uk/dataset/a-e-synthetic-data