- BioRxiv XML - Bulk access to the full text of bioRxiv articles for the purposes of text and data mining (TDM) is available via a dedicated Amazon S3 resource.
- Europe PMC - Bulk download of full text and supplementary information for more than 5 million articles.
- IUPAC Gold Book
- LibreTexts: Open-access chemistry textbooks.
- MedRxiv XML - Bulk access for text and data mining (TDM) is available via a dedicated Amazon S3 resource.
- NLM Literature Archive (NLM LitArch): a digital archive of books, documents, and articles in the fields of life science, medicine, and healthcare at the National Institutes of Health. Also accessible via NCBI Bookshelf.
- OpenStax: Free textbooks, including Chemistry 2e, which is released under CC BY 4.0.
- PubChemSTM: 281K chemical structure and text pairs
- PubMed Central: free full-text archive
- PubMed: abstracts and outlinks
- S2ORC: The Semantic Scholar Open Research Corpus. 81.1M English-language academic papers spanning many academic disciplines (described as the largest publicly available collection of machine-readable academic text). Released under CC BY-NC 4.0.
- ChemTables: 788 chemical patent tables with labels of their content type. Built for semantic classification of table type. Licensed under CC BY-NC 3.0.
- Crystallography Open Database: open-access collection of crystal structures of organic, inorganic, and metal-organic compounds and minerals, excluding biopolymers. SMILES strings have also been derived for some compounds.
- Enamine HTS collection: 1 930 980 diverse screening compounds (37 billion molecules in 2D and 4.5 billion in 3D)
- nCov-Group Data Repository: SMILES, fingerprints, descriptors, and images of millions of compounds.
- zinc20: ZINC20 library prepared for Deep Docking-accelerated virtual screening
- zinc22: commercially available compounds for virtual screening
- ACNet: a benchmark for Activity Cliff Prediction, 400K Matched Molecular Pairs (MMPs) against 190 targets, including over 20K MMP-cliffs and 380K non-AC MMPs from ChEMBL (version 28).
- AqSolDB: Curation of nine open-source datasets on aqueous solubility. The authors also assigned reliability groups.
- BindingDB: molecular recognition database containing 2.6M data points for 1.1M compounds and 8.10K targets (as of Feb 2023)
- ChEBI-20: 33,010 molecule-description pairs (for molecule captioning task)
- ESOL: Water solubility data (log solubility in mol per litre) for common organic small molecules.
- Flashpoint: Sun et al. collected a dataset of the flashpoints of 10575 molecules from academic papers, the Gelest chemical catalogue, the DIPPR database, Lange's Handbook of Chemistry, the Hazardous Chemicals Handbook, and the PubChem database.
- FreeSolv: Experimental and Calculated Small Molecule Hydration Free Energies
- Harvard OPV: "experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of geometries, each with quantum chemical results using a variety of density functionals and basis sets"
- Hydrogen Storage Materials Database: data on hydrides for hydrogen storage (information such as chemical formula and hydrogen capacity)
- ILThermo: thermodynamic and transport properties of pure ionic liquids and their mixtures.
- Leffingwell Odor Dataset: 3523 molecules associated with expert-labeled odor descriptors from the Leffingwell PMP 2001 database
- Limiting activity coefficients: for different solvent/solute pairs, used to train a SMILES-based transformer.
- Lipophilicity: Experimental results for the octanol/water distribution coefficient (logD at pH 7.4).
- MoleculeNet - Benchmark suite that contains multiple datasets listed here
- OCHEM: As of Feb 17 2023, OCHEM contained 3774118 records for 689 properties (each with at least 50 records) collected from 20609 sources (users are granted a Creative Commons CC BY 4.0 license to submitted data).
- Papyrus: A large-scale curated dataset aimed at bioactivity prediction. Combines multiple large publicly available datasets, such as ChEMBL and ExCAPE-DB, with smaller datasets.
- Photoswitch Dataset: Curated dataset of 405 photoswitch molecules.
- QM Datasets: QM7, QM7b, QM8, QM9, MD Trajectories
- SolProp: Database of 1 million solvent/solute COSMO-RS calculations and 10145 experimental solvation free energies (originally published as part of this paper).
- SOMAS: Experimental and calculated solubilities for small molecules. Originally proposed for the design of redox-flow batteries.
- Therapeutic Data Commons: ML tasks that cover small molecules and biologics, including antibodies, peptides, miRNAs, and gene editing therapies.
- ThermoML Archive: experimental thermophysical and thermochemical property data (in ThermoML XML format)
- USPTO: Reactions extracted by text-mining from United States patents published between 1976 and September 2016.
- Dreher-Doyle: yields and conditions for 3955 Pd-catalysed Buchwald–Hartwig C–N cross-couplings
- Perera: yields and conditions for 5760 Pd-catalysed Suzuki–Miyaura C–C cross-couplings
- porous materials AI gym: open data sets for machine learning pertaining to porous materials.