This provides a basic registration system which can be used either as a Python library or via a command-line interface.
Basic operations:
initdb
: resets the database. Note that this destroys any information already in the database, so use it with care.
register
: standardizes the input molecule, calculates a hash for it, and adds the molecule to the database if it's not already there. Returns the molregno (registry ID) of the newly registered molecule.
query
: takes a molecule as input and checks whether a matching molecule is registered. Returns the molregnos (registry IDs) of the matching molecule(s), if any.
retrieve
: takes one or more IDs and returns the registered structures for them.
[1] J. Chem. Inf. Model. 2024, XXXX, XXX, XXX-XXX (ASAP article): https://pubs.acs.org/doi/10.1021/acs.jcim.4c01133
Assuming that you have conda (or mamba or something equivalent) installed, you can install lwreg directly from this GitHub repo by first creating a conda environment with all the dependencies installed:
% conda env create --name py311_lwreg --file=https://raw.githubusercontent.com/rinikerlab/lightweight-registration/main/environment.yml
If you have mamba installed, you can run this instead (it will run faster):
% mamba env create --name py311_lwreg --file=https://raw.githubusercontent.com/rinikerlab/lightweight-registration/main/environment.yml
You can then activate the new environment and install lwreg:
% conda activate py311_lwreg
% python -m pip install git+https://github.com/rinikerlab/lightweight-registration
You can then verify that the install worked by doing:
% lwreg --help
If you want to use PostgreSQL as the database for lwreg, you will also need to install the Python connector for PostgreSQL:
% conda install -c conda-forge psycopg2
Please look at the INSTALL.md file.
- rdkit v2023.03.1 or later
- click
- psycopg2 (only if you want to use a postgresql database)
After installing the dependencies (above) and checking out this repo, run this command in this directory:
pip install --editable .
docker build -t lwreg .
# Run Jupyter notebook on the docker container
docker run -i -t -p 8888:8888 lwreg /bin/bash -c "\
apt update && apt install libtiff5 -y && \
pip install notebook && \
jupyter notebook \
--notebook-dir=/lw-reg --ip='*' --port=8888 \
--no-browser --allow-root"
% lwreg initdb --confirm=yes
% lwreg register --smiles CCOCC
1
% lwreg register --smiles CCOCCC
2
% lwreg register --smiles CCNCCC
3
% lwreg register --smiles CCOCCC
ERROR:root:Compound already registered
% lwreg query --smiles CCOCCC
2
% lwreg retrieve --id 2
(2, '\n RDKit 2D\n\n 0 0 0 0 0 0 0 0 0 0999 V3000\nM V30 BEGIN CTAB\nM V30 COUNTS 6 5 0 0 0\nM V30 BEGIN ATOM\nM V30 1 C 0.000000 0.000000 0.000000 0\nM V30 2 C 1.299038 0.750000 0.000000 0\nM V30 3 O 2.598076 -0.000000 0.000000 0\nM V30 4 C 3.897114 0.750000 0.000000 0\nM V30 5 C 5.196152 -0.000000 0.000000 0\nM V30 6 C 6.495191 0.750000 0.000000 0\nM V30 END ATOM\nM V30 BEGIN BOND\nM V30 1 1 1 2\nM V30 2 1 2 3\nM V30 3 1 3 4\nM V30 4 1 4 5\nM V30 5 1 5 6\nM V30 END BOND\nM V30 END CTAB\nM END\n', 'mol')
>>> import lwreg
>>> from lwreg import utils
>>> lwreg.set_default_config(utils.defaultConfig()) # you generally will want to provide more information about the database
>>> lwreg.initdb()
This will destroy any existing information in the registration database.
are you sure? [yes/no]: yes
True
>>> lwreg.register(smiles='CCO')
1
>>> lwreg.register(smiles='CCOC')
2
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('CCOCC')
>>> lwreg.register(mol=m)
3
>>> lwreg.register(mol=m)
---------------------------------------------------------------------------
IntegrityError Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 lwreg.register(mol=m)
... DETAILS REMOVED ...
IntegrityError: UNIQUE constraint failed: hashes.fullhash
>>> lwreg.query(smiles='CCOC')
[2]
>>> lwreg.query(smiles='CCOCC')
[3]
>>> lwreg.query(smiles='CCOCO')
[]
>>> lwreg.retrieve(id=2)
{2: ('\n RDKit 2D\n\n 0 0 0 0 0 0 0 0 0 0999 V3000\nM V30 BEGIN CTAB\nM V30 COUNTS 4 3 0 0 0\nM V30 BEGIN ATOM\nM V30 1 C 0.000000 0.000000 0.000000 0\nM V30 2 C 1.299038 0.750000 0.000000 0\nM V30 3 O 2.598076 -0.000000 0.000000 0\nM V30 4 C 3.897114 0.750000 0.000000 0\nM V30 END ATOM\nM V30 BEGIN BOND\nM V30 1 1 1 2\nM V30 2 1 2 3\nM V30 3 1 3 4\nM V30 END BOND\nM V30 END CTAB\nM END\n',
'mol')}
>>> lwreg.retrieve(ids=[2,3])
{2: ('\n RDKit 2D\n\n 0 0 0 0 0 0 0 0 0 0999 V3000\nM V30 BEGIN CTAB\nM V30 COUNTS 4 3 0 0 0\nM V30 BEGIN ATOM\nM V30 1 C 0.000000 0.000000 0.000000 0\nM V30 2 C 1.299038 0.750000 0.000000 0\nM V30 3 O 2.598076 -0.000000 0.000000 0\nM V30 4 C 3.897114 0.750000 0.000000 0\nM V30 END ATOM\nM V30 BEGIN BOND\nM V30 1 1 1 2\nM V30 2 1 2 3\nM V30 3 1 3 4\nM V30 END BOND\nM V30 END CTAB\nM END\n',
'mol'),
3: ('\n RDKit 2D\n\n 0 0 0 0 0 0 0 0 0 0999 V3000\nM V30 BEGIN CTAB\nM V30 COUNTS 5 4 0 0 0\nM V30 BEGIN ATOM\nM V30 1 C 0.000000 0.000000 0.000000 0\nM V30 2 C 1.299038 0.750000 0.000000 0\nM V30 3 O 2.598076 -0.000000 0.000000 0\nM V30 4 C 3.897114 0.750000 0.000000 0\nM V30 5 C 5.196152 -0.000000 0.000000 0\nM V30 END ATOM\nM V30 BEGIN BOND\nM V30 1 1 1 2\nM V30 2 1 2 3\nM V30 3 1 3 4\nM V30 4 1 4 5\nM V30 END BOND\nM V30 END CTAB\nM END\n',
'mol')}
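The mol blocks returned by retrieve() are plain V3000 text, so simple facts can be read from them without a chemistry toolkit. The sketch below pulls the atom and bond counts from the COUNTS line. The helper name v3000_counts is hypothetical and not part of lwreg, and the mol block is abbreviated to the lines the helper needs; for real work you would normally pass the mol block to RDKit's Chem.MolFromMolBlock() instead.

```python
def v3000_counts(molblock):
    """Return (num_atoms, num_bonds) from the COUNTS line of a V3000 mol block.

    Hypothetical helper for illustration; not part of lwreg.
    """
    for line in molblock.splitlines():
        fields = line.split()
        # A COUNTS line looks like: M  V30 COUNTS <atoms> <bonds> ...
        if fields[:3] == ["M", "V30", "COUNTS"]:
            return int(fields[3]), int(fields[4])
    raise ValueError("no V3000 COUNTS line found")


# Shaped like the return value of lwreg.retrieve(): {molregno: (molblock, 'mol')}
# The mol block here is abbreviated, not complete output.
retrieved = {
    2: ("\n  RDKit          2D\n\n  0  0  0  0  0  0  0  0  0  0999 V3000\n"
        "M  V30 BEGIN CTAB\nM  V30 COUNTS 4 3 0 0 0\nM  V30 END CTAB\nM  END\n",
        "mol"),
}

for molregno, (molblock, fmt) in retrieved.items():
    print(molregno, fmt, v3000_counts(molblock))  # 2 mol (4, 3)
```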
When using the Python API you have extensive control over the standardization and validation operations which are performed on the molecule.
Start with a couple of examples showing what the 'fragment' and 'charge' built-in standardizers do:
>>> config = utils.defaultConfig()
>>> config['standardization'] = 'fragment'
>>> Chem.MolToSmiles(lwreg.standardize_mol(Chem.MolFromSmiles('CC[O-].[Na+]'),config=config))
'CC[O-]'
>>> config['standardization'] = 'charge'
>>> Chem.MolToSmiles(lwreg.standardize_mol(Chem.MolFromSmiles('CC[O-].[Na+]'),config=config))
'CCO'
Now define a custom filter which rejects (by returning None) molecules which have a net charge and then use that:
>>> def reject_charged_molecules(mol):
...     if Chem.GetFormalCharge(mol):
...         return None
...     return mol
...
>>> config['standardization'] = reject_charged_molecules
>>> Chem.MolToSmiles(lwreg.standardize_mol(Chem.MolFromSmiles('CC[O-].[Na+]'),config=config))
'CC[O-].[Na+]'
Here's an example which fails:
>>> lwreg.standardize_mol(Chem.MolFromSmiles('CC[O-]'),config=config) is None
True
We can chain standardization/filtering operations together by providing a list. The individual operations are run in order. Here's an example where we attempt to neutralise the molecule by finding the charge parent and then apply our reject_charged_molecules filter:
>>> config['standardization'] = ['charge',reject_charged_molecules]
>>> lwreg.standardize_mol(Chem.MolFromSmiles('CC[O-]'),config=config) is None
False
>>> lwreg.standardize_mol(Chem.MolFromSmiles('CC[N+](C)(C)C'),config=config) is None
True
That last one failed because the quaternary nitrogen can't be neutralized.
A collection of other standardizers/filters is available in the module lwreg.standardization_lib.
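To make the chaining semantics concrete, here is a toolkit-free sketch of running a list of operations in order, where any step may reject the input by returning None. It is not lwreg's implementation: run_chain, the dictionary-based stand-in "molecules", and the decision to stop at the first None are all assumptions made for illustration.

```python
def run_chain(mol, operations):
    """Apply each operation in order; stop as soon as one returns None."""
    for op in operations:
        mol = op(mol)
        if mol is None:
            return None
    return mol


# Stand-in "molecules": dicts carrying a net charge and a flag saying
# whether that charge can be removed by (de)protonation.
def neutralize(mol):
    """Toy counterpart of the 'charge' standardizer."""
    if mol["charge"] and mol["neutralizable"]:
        return {**mol, "charge": 0}
    return mol


def reject_charged(mol):
    """Toy counterpart of reject_charged_molecules: filter by returning None."""
    return None if mol["charge"] else mol


ethoxide = {"name": "ethoxide", "charge": -1, "neutralizable": True}
quat_ammonium = {"name": "quaternary ammonium", "charge": 1, "neutralizable": False}

print(run_chain(ethoxide, [neutralize, reject_charged]) is None)       # False
print(run_chain(quat_ammonium, [neutralize, reject_charged]) is None)  # True
print(run_chain(ethoxide, [reject_charged]) is None)                   # True
```

The order of the list matters: placing the filter before the neutralization step would reject the ethoxide stand-in before it had a chance to be neutralized.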
When the configuration option registerConformers is set to True, lwreg expects that the compounds to be registered will have an associated conformer. The conformers are tracked in a different table than the molecule topologies, and the expectation is that every molecule registered will have a conformer (it's an error if they don't). It is possible to register multiple conformers for a single molecular structure (topology).
Note that once a database is created in registerConformers mode, it probably should always be used in that mode.
- register() and bulk_register() require molecules to have associated conformers. Both return (molregno, conf_id) tuples instead of just molregnos.
- query(): if called with the ids argument, this returns all of the conformers for the supplied molregnos as (molregno, conf_id) tuples. If called with a molecule, the conformer of the molecule is hashed and looked up in the conformers table; this returns a list of (molregno, conf_id) tuples.
- retrieve(): if called with (molregno, conf_id) tuple(s), this returns a dictionary of (molblock, 'mol') tuples with (molregno, conf_id) tuples as keys, where the molblocks contain the coordinates of the registered conformers.
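Because results in this mode are keyed by (molregno, conf_id) tuples, it is often convenient to regroup them by molecule. The following is a small sketch of that regrouping; group_by_molregno is a hypothetical helper, and the dictionary contents are dummy placeholders shaped like the result described above rather than real lwreg output.

```python
from collections import defaultdict


def group_by_molregno(conformer_results):
    """Group a {(molregno, conf_id): (molblock, 'mol')} dict by molregno."""
    grouped = defaultdict(dict)
    for (molregno, conf_id), (molblock, _fmt) in conformer_results.items():
        grouped[molregno][conf_id] = molblock
    return dict(grouped)


# Placeholder data; the mol block strings are dummies.
results = {
    (1, 1): ("<molblock for conformer 1>", "mol"),
    (1, 2): ("<molblock for conformer 2>", "mol"),
    (2, 3): ("<molblock for conformer 3>", "mol"),
}

grouped = group_by_molregno(results)
print(sorted(grouped))     # [1, 2]
print(sorted(grouped[1]))  # [1, 2]
```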
Just as molecular hashes are used to recognize when two molecules are the same, lwreg uses a hashing scheme to detect when two conformers are the same. The algorithm for this is simple: the atomic positions are converted into strings (rounding the floating point values to a fixed, but configurable, number of digits), the position strings are sorted, and they are then combined into a single string. A SHA256 hash of this string is generated to give the final conformer hash.
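The steps above can be sketched with the standard library alone. This illustrates the described scheme and is not lwreg's actual code: the function name, the default of 3 digits, the string format used for each position, the separators, and the normalization of negative zero are all assumptions, so the resulting hashes will not match those stored by lwreg.

```python
import hashlib


def conformer_hash_sketch(positions, num_digits=3):
    """Hash a conformer from its atomic positions.

    positions: iterable of (x, y, z) floats.
    num_digits: number of decimals kept when rounding (assumed default).
    """
    # 1. Convert each position to a string with a fixed number of digits.
    #    Adding 0.0 turns -0.0 into 0.0 so both format identically
    #    (a choice made in this sketch).
    pos_strings = [
        ",".join(f"{round(c, num_digits) + 0.0:.{num_digits}f}" for c in pos)
        for pos in positions
    ]
    # 2. Sort the position strings so atom order does not matter.
    pos_strings.sort()
    # 3. Combine into a single string and take its SHA256 hash.
    combined = ";".join(pos_strings)
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


conf_a = [(0.0, 0.0, 0.0), (1.299038, 0.75, 0.0), (2.598076, -0.0, 0.0)]
conf_b = list(reversed(conf_a))                      # same atoms, different order
conf_c = [(x + 0.0001, y, z) for x, y, z in conf_a]  # shift below the rounding precision
conf_d = [(x + 0.1, y, z) for x, y, z in conf_a]     # a genuinely different geometry

h = conformer_hash_sketch
print(h(conf_a) == h(conf_b))  # True
print(h(conf_a) == h(conf_c))  # True
print(h(conf_a) == h(conf_d))  # False
```

One consequence of rounding is worth keeping in mind: two nearly identical coordinates that happen to fall on either side of a rounding boundary will still produce different hashes.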
If registering a multi-conformer molecule, it is most efficient to call register_multiple_conformers(). That only does the work of standardizing the molecule and calculating the molecule hash once.
Here's the SQL to create the base lwreg tables in sqlite:
create table registration_metadata (key text, value text);
create table hashes (molregno integer primary key, fullhash text unique,
formula text, canonical_smiles text, no_stereo_smiles text,
tautomer_hash text, no_stereo_tautomer_hash text, "escape" text, sgroup_data text, rdkitVersion text);
create table orig_data (molregno integer primary key, data text, datatype text, timestamp DATETIME DEFAULT CURRENT_TIMESTAMP);
create table molblocks (molregno integer primary key, molblock text, standardization text);
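The unique constraint on fullhash in the hashes table is what rejects duplicate registrations, as in the IntegrityError shown in the Python session earlier. The sketch below demonstrates that behavior with Python's built-in sqlite3 module and an in-memory database. The insert_hash helper and the hash strings are made-up placeholders; lwreg computes real hashes and manages these tables itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same columns as the hashes table above.
conn.execute(
    """create table hashes (molregno integer primary key, fullhash text unique,
       formula text, canonical_smiles text, no_stereo_smiles text,
       tautomer_hash text, no_stereo_tautomer_hash text, "escape" text,
       sgroup_data text, rdkitVersion text)"""
)


def insert_hash(conn, fullhash, canonical_smiles):
    """Insert a row and return the molregno that sqlite assigned."""
    cur = conn.execute(
        "insert into hashes (fullhash, canonical_smiles) values (?, ?)",
        (fullhash, canonical_smiles),
    )
    return cur.lastrowid


# Placeholder hash strings, not real lwreg hashes.
print(insert_hash(conn, "placeholder-hash-A", "CCO"))   # 1
print(insert_hash(conn, "placeholder-hash-B", "CCOC"))  # 2

try:
    insert_hash(conn, "placeholder-hash-B", "CCOC")
except sqlite3.IntegrityError as exc:
    print("duplicate rejected:", exc)
```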
Here's the SQL to create the conformers table in sqlite when registerConformers is set:
create table conformers (conf_id integer primary key, molregno integer not null,
conformer_hash text not null unique, molblock text);
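In this table molregno is not unique, which is what allows several conformers to be stored for one molecular structure, while conformer_hash is unique. The sketch below uses Python's built-in sqlite3 module to show a lookup of the (molregno, conf_id) pairs for one molregno. The conformers_for helper and all row values are placeholders for illustration; this is not how lwreg itself issues its queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """create table conformers (conf_id integer primary key, molregno integer not null,
       conformer_hash text not null unique, molblock text)"""
)

# Placeholder rows: two conformers of molregno 1 and one of molregno 2.
rows = [
    (1, "placeholder-conf-hash-A", "<molblock A>"),
    (1, "placeholder-conf-hash-B", "<molblock B>"),
    (2, "placeholder-conf-hash-C", "<molblock C>"),
]
conn.executemany(
    "insert into conformers (molregno, conformer_hash, molblock) values (?, ?, ?)",
    rows,
)


def conformers_for(conn, molregno):
    """Return the (molregno, conf_id) tuples stored for one molregno."""
    cur = conn.execute(
        "select molregno, conf_id from conformers where molregno = ? order by conf_id",
        (molregno,),
    )
    return cur.fetchall()


print(conformers_for(conn, 1))  # [(1, 1), (1, 2)]
print(conformers_for(conn, 2))  # [(2, 3)]
```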