Homologous series are groups of chemical compounds sharing the same core structure(s) and different numbers of repeating units (RU) connected end-to-end.
This is an open-source algorithm to classify homologous series within compound datasets provided as SMILES, implemented using the RDKit.
For example, these series were classified in COCONUT and the NORMAN Suspect List Exchange, datasets containing natural products and environmental chemicals respectively.
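To make the definition concrete, here is a minimal sketch that builds SMILES for a simple homologous series by string concatenation. The function name `alkanol_series` is hypothetical and is not part of this package; the sketch only illustrates the idea of a fixed core (here methanol, `CO`) extended by a varying number of CH2 repeating units. It does not show how the algorithm detects series.

```python
def alkanol_series(min_units: int, max_units: int) -> list[str]:
    """Build SMILES for straight-chain primary alcohols.

    Each member consists of the core "CO" (methanol) with n CH2
    repeating units inserted, for n from min_units to max_units.
    Hypothetical helper for illustration only.
    """
    # In SMILES, each extra aliphatic "C" in a straight chain adds one CH2 unit.
    return ["C" + "C" * n + "O" for n in range(min_units, max_units + 1)]


if __name__ == "__main__":
    for smiles in alkanol_series(1, 3):
        print(smiles)
```

Members of such a series differ only in the number of repeating units, which is the property the classification relies on.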
The algorithm requires RDKit to be installed via conda-forge.
$ conda create -c conda-forge -n my-rdkit-env rdkit
$ conda activate my-rdkit-env
$ git clone https://github.com/adelenelai/onglai-classify-homologues
$ cd onglai-classify-homologues
$ pip install -e .
Note that pip installing the package is not enough; in addition, the repo must be cloned from GitHub because the algorithm runs as a script (see below).
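Before running the script, it can help to confirm that RDKit is visible from the active environment. The following is a minimal sketch using only the Python standard library; the helper name `rdkit_available` is an assumption made for illustration and is not provided by this package.

```python
import importlib.util


def rdkit_available() -> bool:
    """Return True if the rdkit package can be located in the active environment."""
    # find_spec returns None when a top-level package cannot be found.
    return importlib.util.find_spec("rdkit") is not None


if __name__ == "__main__":
    if rdkit_available():
        print("RDKit found")
    else:
        print("RDKit not found; activate the conda environment first")
```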
Run:
$ python nextgen_classify_homols.py [-in <arg>] [-sep <arg>] [-s <arg>] [-n <arg>] [-ru <arg>] [-min <arg>] [-max <arg>] [-f <arg>] 2>log
Flag | Description |
---|---|
-in --input_csv | path to input CSV containing SMILES and Name columns |
-sep --separator | delimiter for input CSV. Default is comma i.e., ',' |
-s --smiles | name of column containing SMILES. Default is SMILES |
-n --names | name of column containing Names. Default is Name |
-ru --repeatingunits | chemical RU as SMARTS, enclosed within speech marks. Default is CH2 i.e., '[#6&H2]' |
-min --min_RU_in | minimum length of RU chain, default is 3 |
-max --max__RU_in | maximum length of RU chain, default is 30 |
-f --frag_steps | no. times to fragment molecules to obtain cores, the default is 2 |
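The flags and defaults in the table can be mirrored with `argparse`, as in the sketch below. This is an illustration built from the table only, not the script's actual parser: the function name `build_parser`, the `dest` names `min_ru` and `max_ru`, and the `None` default for the input path are assumptions, and the long flag names are spelled exactly as they appear in the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser mirroring the documented flags and defaults."""
    p = argparse.ArgumentParser(description="Flags as listed in the table (sketch)")
    # The table states no default for the input path, so None is an assumption.
    p.add_argument("-in", "--input_csv", default=None,
                   help="path to input CSV containing SMILES and Name columns")
    p.add_argument("-sep", "--separator", default=",",
                   help="delimiter for input CSV")
    p.add_argument("-s", "--smiles", default="SMILES",
                   help="name of column containing SMILES")
    p.add_argument("-n", "--names", default="Name",
                   help="name of column containing Names")
    p.add_argument("-ru", "--repeatingunits", default="[#6&H2]",
                   help="chemical RU as SMARTS")
    # dest names below are hypothetical; long flags are copied from the table.
    p.add_argument("-min", "--min_RU_in", dest="min_ru", type=int, default=3,
                   help="minimum length of RU chain")
    p.add_argument("-max", "--max__RU_in", dest="max_ru", type=int, default=30,
                   help="maximum length of RU chain")
    p.add_argument("-f", "--frag_steps", type=int, default=2,
                   help="no. times to fragment molecules to obtain cores")
    return p


if __name__ == "__main__":
    args = build_parser().parse_args(["-in", "example.csv", "-min", "4"])
    print(args)
```

Any flag that is omitted falls back to the default listed in the table.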
Try:
$ python nextgen_classify_homols.py -in ../../tests/test1_23.csv -s SMILES -n Name -ru '[#6&H2]' -min 3 -max 30 -f 2 2>log
Successful classification will generate an `output` directory containing the following files:
- A TXT file containing the summary of classification results and explanation of outputs (series_no codes)
- A CSV file containing 8 columns: `series_no`, `cpd_name`, `CanoSmiles_FinalCores`, `SMILES`, `InChI`, `InChIKey`, `molecular_formula` and `monoisotopic_mass`. The first column `series_no` contains the results of the homologous series classification. `CanoSmiles_FinalCores` indicates the common core shared by all members within a given series. The remaining columns contain information calculated based on the `SMILES`.
- A TXT file of unparseable SMILES that were removed (empty if all SMILES were parsed OK)
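As a rough sketch of how the output CSV might be inspected, the snippet below groups compound names by `series_no` using only the standard library. The helper name `group_by_series` and the sample rows are made up for illustration (fields not needed here are left empty), and the meaning of particular `series_no` codes should be taken from the summary TXT file rather than from this example.

```python
import csv
import io
from collections import defaultdict


def group_by_series(csv_file) -> dict[str, list[str]]:
    """Map each series_no value to the list of cpd_name entries carrying it."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["series_no"]].append(row["cpd_name"])
    return dict(groups)


# Made-up rows using the 8 documented column names; not real results.
sample = io.StringIO(
    "series_no,cpd_name,CanoSmiles_FinalCores,SMILES,InChI,InChIKey,"
    "molecular_formula,monoisotopic_mass\n"
    "1,butan-1-ol,,CCCCO,,,,\n"
    "1,pentan-1-ol,,CCCCCO,,,,\n"
    "2,hexanoic acid,,CCCCCC(=O)O,,,,\n"
)

groups = group_by_series(sample)

if __name__ == "__main__":
    for series_no, names in groups.items():
        print(series_no, names)
```

To use it on real results, pass an open file handle for the CSV in the output directory instead of the in-memory sample.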
To reproduce the classification using default settings as described above, run the commands below. They use the sample datasets provided in `input/`; the full datasets have been archived on Zenodo (amend `-in` accordingly to classify the full datasets).
#activate your rdkit environment
#NORMAN-SLE
$ python nextgen_classify_homols.py -in ../../input/pubchem_norman_sle_tree_parentcid_98116_2022-03-21_from115115_trial.csv -s isosmiles -n cmpdname 2>log
#PubChemLite
$ python nextgen_classify_homols.py -in ../../input/PubChemLite_exposomics_20220225_trial.csv -n CompoundName 2>log
#COCONUT
$ python nextgen_classify_homols.py -in ../../input/COCONUT_DB_2021-11_trial.txt 2>log
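If the three runs are to be scripted, the commands above can be assembled programmatically. The sketch below only builds and prints the argument lists from the flags shown above; the names `DATASETS` and `build_command` are hypothetical and not part of this package.

```python
import shlex

# Per-dataset arguments copied from the commands above.
DATASETS = {
    "NORMAN-SLE": [
        "-in",
        "../../input/pubchem_norman_sle_tree_parentcid_98116_2022-03-21_from115115_trial.csv",
        "-s", "isosmiles",
        "-n", "cmpdname",
    ],
    "PubChemLite": [
        "-in", "../../input/PubChemLite_exposomics_20220225_trial.csv",
        "-n", "CompoundName",
    ],
    "COCONUT": ["-in", "../../input/COCONUT_DB_2021-11_trial.txt"],
}


def build_command(extra_args, script="nextgen_classify_homols.py"):
    """Return the argument list for one run of the classification script."""
    return ["python", script, *extra_args]


if __name__ == "__main__":
    for name, extra_args in DATASETS.items():
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print(f"# {name}")
        print(shlex.join(build_command(extra_args)))
```

Each list could be passed to `subprocess.run` from the directory containing the script, with standard error redirected to a log file as in the commands above.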
- Lai, A., Schaub, J., Steinbeck, C. et al. An algorithm to classify homologous series within compound datasets. J Cheminform 14, 85 (2022). https://doi.org/10.1186/s13321-022-00663-y
- Poster presented at the 17th German Cheminformatics Conference, Garmisch-Partenkirchen, Germany (May 8-10, 2022)
Steffen Neumann, Charles Tapley-Hoyt, Kohulan Rajan, Mahnoor Zulfiqar, Anjana Elapavalore, Zhanyun Wang, Christos Nicolaou, Maximilian Beckers, Greg Landrum, Paolo Tosco. (and Kohulan for the logo :))
This project is licensed under Apache 2.0 - see LICENSE for details.