OngLai: An Algorithm to Classify Homologous Series

Introduction

Homologous series are groups of chemical compounds sharing the same core structure(s) and different numbers of repeating units (RU) connected end-to-end.
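As a toy illustration of this definition (not part of OngLai itself), the linear primary alcohols can be viewed as one series with a hydroxyl-bearing core and CH2 as the repeating unit. The sketch below builds their SMILES and molecular formulae with plain string operations; the function names are hypothetical and chosen only for this example.

```python
def alcohol_smiles(n_carbons: int) -> str:
    """SMILES of the linear primary alcohol with n_carbons carbon atoms."""
    if n_carbons < 1:
        raise ValueError("need at least one carbon atom")
    # Each additional 'C' extends the chain by one CH2 unit.
    return "C" * n_carbons + "O"


def alcohol_formula(n_carbons: int) -> str:
    """Molecular formula C(n)H(2n+2)O of the same alcohol."""
    carbons = "C" if n_carbons == 1 else f"C{n_carbons}"
    return f"{carbons}H{2 * n_carbons + 2}O"


# Methanol, ethanol, 1-propanol, 1-butanol: neighbours differ by one CH2.
for n in range(1, 5):
    print(n, alcohol_smiles(n), alcohol_formula(n))
```

Real datasets mix many cores and chain lengths, which is why the classification relies on substructure matching with the RDKit rather than on string patterns like these.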

OngLai is an open-source algorithm that classifies homologous series within compound datasets provided as SMILES. It is implemented using the RDKit.

For example, the series below were classified in COCONUT and the NORMAN Suspect List Exchange (NORMAN-SLE), datasets containing natural products and environmental chemicals, respectively.

CH2 Repeating Unit:

CF2 Repeating Unit:

Requirements

The algorithm requires RDKit to be installed via conda-forge.

$ conda create -c conda-forge -n my-rdkit-env rdkit
$ conda activate my-rdkit-env

Installation

$ git clone https://github.com/adelenelai/onglai-classify-homologues
$ cd onglai-classify-homologues
$ pip install -e .

Note that installing the package with pip is not sufficient on its own: the repository must also be cloned from GitHub because the algorithm runs as a script (see below).

Usage

Run:

$ python nextgen_classify_homols.py [-in <arg>] [-sep <arg>] [-s <arg>] [-n <arg>] [-ru <arg>] [-min <arg>] [-max <arg>] [-f <arg>] 2>log

Flag	Description
-in --input_csv	path to input CSV containing SMILES and Name columns
-sep --separator	delimiter for input CSV. Default is comma i.e., ','
-s --smiles	name of column containing SMILES. Default is SMILES
-n --names	name of column containing Names. Default is Name
-ru --repeatingunits	chemical RU as SMARTS, enclosed within quotation marks. Default is CH2, i.e., '[#6&H2]'
-min --min_RU_in	minimum length of RU chain. Default is 3
-max --max__RU_in	maximum length of RU chain. Default is 30
-f --frag_steps	number of times to fragment molecules to obtain cores. Default is 2

Try:

$ python nextgen_classify_homols.py -in ../../tests/test1_23.csv -s SMILES -n Name -ru '[#6&H2]' -min 3 -max 30 -f 2 2>log
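The same call can also be driven from Python, for example when classifying several inputs in a loop. The following is a minimal sketch, assuming it is run from the directory that contains nextgen_classify_homols.py. The helper names build_command and run_classification and their keyword parameters are hypothetical; only the short flags and their defaults are taken from the table above.

```python
import subprocess
import sys


def build_command(input_csv, smiles="SMILES", names="Name", ru="[#6&H2]",
                  min_ru=3, max_ru=30, frag_steps=2, sep=","):
    """Return the argument list for one run of nextgen_classify_homols.py."""
    return [
        sys.executable, "nextgen_classify_homols.py",
        "-in", str(input_csv),
        "-sep", sep,
        "-s", smiles,
        "-n", names,
        "-ru", ru,  # list form bypasses the shell, so no extra quoting
        "-min", str(min_ru),
        "-max", str(max_ru),
        "-f", str(frag_steps),
    ]


def run_classification(input_csv, log_path="log", **options):
    """Run the script and write stderr to log_path, mirroring `2>log`."""
    with open(log_path, "w") as log:
        return subprocess.run(build_command(input_csv, **options),
                              stderr=log, check=True)


# Show the command that would be executed for the bundled test file.
print(" ".join(build_command("../../tests/test1_23.csv")))
```

Calling run_classification("../../tests/test1_23.csv") would then start the classification itself; with check=True, a non-zero exit status raises subprocess.CalledProcessError.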

Successful classification will generate an output directory containing the following files:

A TXT file containing the summary of classification results and explanation of outputs (series_no codes)
A CSV file containing 8 columns: series_no, cpd_name, CanoSmiles_FinalCores, SMILES, InChI, InChIKey, molecular_formula and monoisotopic_mass. The first column series_no contains the results of the homologous series classification. CanoSmiles_FinalCores indicates the common core shared by all members within a given series. The remaining columns contain information calculated based on the SMILES.
A TXT file listing unparseable SMILES that were removed (empty if all SMILES were parsed successfully)
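To inspect the results, the output CSV can be grouped by its series_no column. The sketch below uses only the Python standard library and relies on the series_no and cpd_name column names listed above. The function name group_by_series is hypothetical, and the series_no values are treated as opaque strings because their codes are explained in the summary TXT file.

```python
import csv
from collections import defaultdict


def group_by_series(csv_file):
    """Map each series_no value to the list of cpd_name entries it contains."""
    series = defaultdict(list)
    for row in csv.DictReader(csv_file):
        series[row["series_no"]].append(row["cpd_name"])
    return dict(series)


# Usage (the path is a placeholder for the CSV in your output directory):
# with open("path/to/output.csv", newline="") as handle:
#     for series_no, members in group_by_series(handle).items():
#         print(series_no, len(members))
```

If the output was written with a delimiter other than a comma, pass it to csv.DictReader through its delimiter argument.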

Reproducing Classification described in Lai et al.

The classification uses the default settings described above. The commands below run on the sample datasets provided in input/. The full datasets have been archived on Zenodo; to classify them, amend -in accordingly.

#activate your rdkit environment

#NORMAN-SLE
$ python nextgen_classify_homols.py -in ../../input/pubchem_norman_sle_tree_parentcid_98116_2022-03-21_from115115_trial.csv -s isosmiles -n cmpdname 2>log

#PubChemLite
$ python nextgen_classify_homols.py -in ../../input/PubChemLite_exposomics_20220225_trial.csv -n CompoundName 2>log

#COCONUT
$ python nextgen_classify_homols.py -in ../../input/COCONUT_DB_2021-11_trial.txt 2>log

References and Links

Lai, A., Schaub, J., Steinbeck, C. et al. An algorithm to classify homologous series within compound datasets. J Cheminform 14, 85 (2022). https://doi.org/10.1186/s13321-022-00663-y
Poster presented at the 17th German Cheminformatics Conference, Garmisch-Partenkirchen, Germany (May 8-10, 2022)

Acknowledgements

Steffen Neumann, Charles Tapley-Hoyt, Kohulan Rajan, Mahnoor Zulfiqar, Anjana Elapavalore, Zhanyun Wang, Christos Nicolaou, Maximilian Beckers, Greg Landrum, Paolo Tosco. (and Kohulan for the logo :))

License

This project is licensed under Apache 2.0 - see LICENSE for details.

Our Research Groups

Environmental Cheminformatics Group at the
