awesome-chemistry-datasets

text datasets

  • BioRxiv XML - Bulk access to the full text of bioRxiv articles for the purposes of text and data mining (TDM) is available via a dedicated Amazon S3 resource.
  • ChemTables: 788 chemical patent tables labeled by content type, built for semantic classification of table types. Licensed under CC BY-NC 3.0.
  • Europe PMC - Bulk download of full text and supplementary information (SI) for more than 5 million articles.
  • IUPAC Gold Book
  • LibreTexts: Open-access chemistry textbooks.
  • MedRxiv XML - Text and data mining is possible via dedicated Amazon S3 resource.
  • NLM literature archive: NLM LitArch (NLM Literature Archive) is a digital archive for books, documents, and articles in the fields of life science, medicine, and healthcare at the National Institutes of Health. Also accessible via NCBI bookshelf.
  • OpenStax: Free textbooks, including Chemistry 2e, which is released under CC BY 4.0.
  • PubChemSTM: 281K chemical structure and text pairs
  • PubMed Central: free full-text archive
  • PubMed: abstracts and outlinks
  • S2ORC: The Semantic Scholar Open Research Corpus. 81.1M English-language academic papers spanning many academic disciplines (the largest publicly available collection of machine-readable academic text). Released under CC BY-NC 4.0.

structures

ml structure-property benchmark datasets

  • ACNet: a benchmark for Activity Cliff Prediction, 400K Matched Molecular Pairs (MMPs) against 190 targets, including over 20K MMP-cliffs and 380K non-AC MMPs from ChEMBL (version 28).
  • AqSolDB: Curation of nine open-source datasets on aqueous solubility. The authors also assigned reliability groups.
  • BindingDB: molecular recognition database containing 2.6M data points for 1.1M compounds and 8.10K targets (as of Feb 2023)
  • ChEBI-20: 33,010 molecule-description pairs (for molecule captioning task)
  • ESOL: Water solubility data (log solubility in mol per litre) for common organic small molecules.
  • Flashpoint: Sun et al. collected a dataset of the flashpoints of 10575 molecules from academic papers, the Gelest chemical catalogue, the DIPPR database, Lange's Handbook of Chemistry, the Hazardous Chemicals Handbook, and the PubChem database.
  • FreeSolv: Experimental and Calculated Small Molecule Hydration Free Energies
  • Harvard OPV: "experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of geometries, each with quantum chemical results using a variety of density functionals and basis sets"
  • Hydrogen Storage Materials Database: data on hydrides for hydrogen storage (information such as chemical formula and hydrogen capacity)
  • ILThermo: thermodynamic and transport properties of pure ionic liquids and mixtures of them.
  • Leffingwell Odor Dataset: 3523 molecules associated with expert-labeled odor descriptors from the Leffingwell PMP 2001 database
  • Limiting activity coefficients: for different solvent/solute pairs, used to train a SMILES-based transformer.
  • Lipophilicity: Experimental results for the octanol/water distribution coefficient (logD at pH 7.4).
  • MoleculeNet - Benchmark suite that contains multiple datasets listed here
  • OCHEM: On Feb 17, 2023, OCHEM contained 3774118 records for 689 properties (with at least 50 records) collected from 20609 sources (users are granted a Creative Commons CC BY 4.0 license to submitted data)
  • Papyrus: A large scale curated dataset aimed at bioactivity predictions. Contains multiple large publicly available datasets such as ChEMBL and ExCAPE-DB combined with smaller datasets.
  • Photoswitch Dataset: Curated dataset of 405 photoswitch molecules.
  • QM Datasets: QM7, QM7b, QM8, QM9, MD Trajectories
  • SolProp: Database of 1 million solvent/solute COSMO-RS calculations and 10145 experimental solvation free energies (originally published as part of this paper).
  • SOMAS: Experimental and calculated solubilities for small molecules. Originally proposed for the design of redox-flow batteries.
  • Therapeutic Data Commons: ML tasks that cover small molecules and biologics, including antibodies, peptides, miRNAs, and gene editing therapies.
  • ThermoML Archive: experimental thermophysical and thermochemical property data (in ThermoML XML format)
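
Several solubility datasets above (e.g., ESOL) report log solubility in mol per litre. As a minimal sketch, assuming the values are base-10 logarithms, the conversion to a mass concentration could look like the following. The function name and the example compound are hypothetical and not taken from any of the listed datasets.

```python
def log_s_to_grams_per_litre(log_s: float, molar_mass: float) -> float:
    """Convert log10 solubility (mol/L) to a mass concentration (g/L).

    Assumes log_s is a base-10 logarithm of molar solubility and
    molar_mass is given in g/mol.
    """
    molar_solubility = 10.0 ** log_s  # mol/L
    return molar_solubility * molar_mass  # g/L


# Hypothetical compound: logS = -2.0, molar mass = 100.0 g/mol
example = log_s_to_grams_per_litre(-2.0, 100.0)
print(f"{example:.3f} g/L")
```

Check each dataset's documentation for the logarithm base and units before converting, since conventions can differ between sources.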

Target identification data

  • Open Targets: a large-scale resource that uses human genetics and genomics data for systematic drug target identification and prioritization.

  • Probes & Drugs Portal: an interactive, open data resource for chemical biology. Provides an overview of libraries of bioactive compounds (e.g., ChEMBL, Guide to PHARMACOLOGY), including commercial screening libraries.

  • Guide to PHARMACOLOGY: an expert-curated resource of ligand-activity-target relationships. It includes activity records even when the bioactivity value is unknown (licensed under CC BY-SA 4.0).

reactions

  • USPTO: Reactions extracted by text mining from United States patents published between 1976 and September 2016.

high-throughput screening data

  • Dreher-Doyle: yields and conditions for 3955 Pd-catalysed Buchwald–Hartwig C–N cross-couplings
  • Perera: yields and conditions for 5760 Pd-catalysed Suzuki–Miyaura C–C cross-couplings

eln data

related list