There are many types of chemical fingerprint for describing the molecule provided by different tools, such as RDKit, CDK and OpenBabel. This package aims to summarize them all.
-
Clone the repo and navigate to it.
-
Create a predefined Python conda environment by
conda env create -f env/pyfingerprint_env.yml
. -
Activate conda environment and install with pip.
conda activate pyfingerprint pip install git+https://github.com/hcji/PyFingerprint.git
import numpy as np
from PyFingerprint.fingerprint import get_fingerprint, get_fingerprints
cdktypes = ['standard', 'extended', 'graph', 'maccs', 'pubchem', 'estate', 'hybridization', 'lingo',
'klekota-roth', 'shortestpath', 'cdk-substructure', 'circular', 'cdk-atompairs']
rdktypes = ['rdkit', 'morgan', 'rdk-maccs', 'topological-torsion', 'avalon', 'atom-pair', 'rdk-descriptor']
babeltypes = ['fp2', 'fp3', 'fp4', 'spectrophore']
vectypes = ['mol2vec']
smi = 'CCCCN'
output = {}
for f in cdktypes:
output[f] = get_fingerprint(smi, f)
for f in rdktypes:
output[f] = get_fingerprint(smi, f)
for f in babeltypes:
output[f] = get_fingerprint(smi, f)
for f in vectypes:
output[f] = get_fingerprint(smi, f)
output_np = output.copy()
for k, fp in output.items():
output_np[k] = fp.to_numpy()
smlist = ['CCCCC', 'CCCCN', 'CCCCO']
output = {}
for f in cdktypes:
output[f] = get_fingerprints(smlist, f)
for f in rdktypes:
output[f] = get_fingerprints(smlist, f)
for f in babeltypes:
output[f] = get_fingerprints(smlist, f)
for f in vectypes:
output[f] = get_fingerprints(smlist, f)
output_np = output.copy()
for k, fps in output.items():
output_np[k] = np.array([fp.to_numpy() for fp in fps])
@article{doi:10.1021/acs.analchem.0c01450,
author = {Ji, Hongchao and Deng, Hanzi and Lu, Hongmei and Zhang, Zhimin},
title = {Predicting a Molecular Fingerprint from an Electron Ionization Mass Spectrum with Deep Neural Networks},
journal = {Analytical Chemistry},
volume = {92},
number = {13},
pages = {8649-8653},
year = {2020},
doi = {10.1021/acs.analchem.0c01450},
note ={PMID: 32584545},
URL = {
https://doi.org/10.1021/acs.analchem.0c01450
}}
**standard**: Considers paths of a given length. These are hashed fingerprints, with a default length of 1024.
**extended**: Similar to the standard type, but takes rings and atomic properties into account into account.
**graph**: Similar to the standard type by simply considers connectivity.
**hybridization**: Similar to the standard type, but only consider hybridization state.
**estate**: 79 bit fingerprints corresponding to the E-State atom types described by Hall and Kier.
**cdk-atompairs**: CDK's implementation of the atompairs fingerprint
**cdk-substructure**: CDK substructure fingerprint, basically identical to openbabel's fp4.
**pubchem**: 881 bit fingerprints defined by PubChem.
**klekota-roth**: 4860 bit fingerprint defined by Klekota and Roth.
**shortestpath**: A fingerprint based on the shortest paths between pairs of atoms and takes into account ring systems, charges etc.
**rdk-descriptor**: Various molecular descriptors implemented and calculated by RDKit.
**circular**: An implementation of the ECFP6 fingerprint.
**lingo**: An implementation of the LINGO fingerprint.
**rdkit**: Another implementation of a Daylight-like fingerprint by RDKit.
**maccs**: The popular 166 bit MACCS keys described by MDL.
**avalon**: Substructure or similarity Avalon fingerprint.
**atom-pair**: RDKit Atom-Pair fingerprint.
**topological-torsion**: RDKit Topological-Torsion Fingerprint.
**morgan**: RDKit Morgan fingerprint.
**fp2**: OpenBabel FP2 fingerprint, which indexes small molecule fragments based on linear segments of up to 7 atoms in length.
**fp3**: OpenBabel FP3 fingerprint, which is a fingerprint method created from a set of SMARTS patterns defining functional groups.
**fp4**: OpenBabel FP4 fingerprint, which is a fingerprint method created from a set of SMARTS patterns defining functional groups.
**spectrophore** Openbabel implementation of the spectrophore fingerprint (https://github.com/silicos-it/spectrophore).
**mol2vec**: Unsupervised machine learning approach for mulecule representation.