In this chapter we will learn the basics of reading molecules with RDKit.
Simplified molecular input line entry system(SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings. More detials are described in SMILES Tutorial. For example c1ccccc1 means that there are six aromatic carbon atoms and has a loop structure which is connected with start and end, you know it means benzene.
Now that we know that SMILES represents a molecule, let’s load SMILES and draw a molecule. First, load the Chem class from the RDKit library. The second line is the setting for drawing the structure on Jupyter Notebook.
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
RDKit has a method called MolFromSmiles to read SMILES strings, so use this to read molecules.
mol = Chem.MolFromSmiles("c1ccccc1")
Then draw the structure by simply evaluating the mol and the structure will be displayed.
mol
Molecular structure will be drawn like following figure.
The SMILES notation expresses the same thing as the method of connecting atoms by lines as shown above (structural formula) and the SMILES notation. Structural formulas are easy for humans to understand, but SMILES has the advantage that it can be expressed with a smaller amount of data because it is expressed with ASCII strings.
Note
|
Being able to represent a character string means that a new chemical structure can be generated by applying the character string generation algorithm. This is discussed in detail in Chapter 12. |
There are several ways to store multiple compounds in one file, but it is common to use the sd file format. Because it is an sd file, the file extension is often .sdf.
A format for molecular expression developed by MDL is MOL format. An SD file is an extension of this MOL format. Specifically, multiple molecules can be handled by separating the MOL format expression with the line $$$$.
MOL format is a big difference from SMILES in that it can store 3D coordinates of molecules and can express not only 2D but also 3D structure.
Download the structural data of topoisomerase II inhibition test (CHEMBL669726) at ChEMBL referring to Chapter 4 in sdf file format.
- NOTE
Specially, open the link page and input 'CHEMBL66926' to search form then search results will be appeared. Then click compounds tab, select all and down load as SDF. File download will start and get file as compressed gzip format. Extract the file with guzip command or using an appropriate soft then rename the file to ch05_compounds.sdf.
To read the sdf file with RDKit, use the method called SDMolSupplier. Note that since multiple compounds are handled, they are stored in a variable called mols instead of mol. There is no rule on what variables to use, but it’s a good idea to reduce the possible mistakes by giving variable names that are easy to understand.
mols = Chem.SDMolSupplier("ch05_compounds.sdf")
Check how many molecules have been loaded. Use len to count numbers.
len(mols)
Total 34 molecules.
You can draw molecules one by one using a for loop, but RDKit provides a method to draw multiple molecules at once, so this time use that MolsToGridImage method. To change the number of molecules arranged in one line, specify with the molsPerRow option
Draw.MolsToGridImage(mols)
In a compound optimization project for drug discovery, you may want to change the properties of a compound without changing the shape of the molecule. In such cases, compounds with better properties may be obtained by replacing the atomic species such as carbon, nitrogen, sulfur, and oxygen that form the aromatic ring, but in this way, replacing heteroatoms (atoms other than hydrogen) The approach is called heteroshuffling.
By performing hetero-shuffling, it is possible to expect effects such as improving the dynamics by changing the physical properties while maintaining the activity, improving the activity itself, and avoiding patent claims.
Pfizer’s Sildenafil and GSK Link:https://www.ebi.ac.uk/chembl/beta/compound_report_card/CHEMBL1520/[Vardinafil] are well-known examples where slight structural differences can affect selectivity and pharmacokinetics.
Comparing the two structures, they are very similar, with the only difference being the arrangement of the nitrogen atoms in the central ring structure. Both molecules inhibit the same target protein but have different activity and pharmacokinetics.
Here is the code that generates the image above. Instead of just applying Draw.MolsToGridImage Note that the alignment is based on the Core structure, and that the Draw.MolToGridImage option is given a legend and the molecule name is displayed.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import rdDepictor
from rdkit.Chem import rdFMCS
from rdkit.Chem import TemplateAlign
IPythonConsole.ipython_useSVG = True
rdDepictor.SetPreferCoordGen(True)
sildenafil = Chem.MolFromSmiles('CCCC1=NN(C)C2=C1NC(=NC2=O)C1=C(OCC)C=CC(=C1)S(=O)(=O)N1CCN(C)CC1')
vardenafil = Chem.MolFromSmiles('CCCC1=NC(C)=C2N1NC(=NC2=O)C1=C(OCC)C=CC(=C1)S(=O)(=O)N1CCN(CC)CC1')
rdDepictor.Compute2DCoords(sildenafil)
rdDepictor.Compute2DCoords(vardenafil)
res = rdFMCS.FindMCS([sildenafil, vardenafil], completeRingsOnly=True, atomCompare=rdFMCS.AtomCompare.CompareAny)
MCS = Chem.MolFromSmarts(res.smartsString)
rdDepictor.Compute2DCoords(MCS)
TemplateAlign.AlignMolToTemplate2D(sildenafil, MCS)
TemplateAlign.AlignMolToTemplate2D(vardenafil, MCS)
Draw.MolsToGridImage([sildenafil, vardenafil], legends=['sildenafil', 'vardenafil'])
Define a class called HeteroShuffle to generate heteroshuffled molecules. To create an object, give the molecule you want to shuffle and the substructure you want to convert (Core). In the code in the class, first, the molecule is cut at the core and divided into the core and the rest. Only Aromatic atoms in Core that have no substituent are candidates for substitution. Make_connector is a method that generates a reaction object to reconnect the shuffled Core and non-Core parts. Using the reaction object created by this method, the molecule is reconstructed by re_construct_mol.
To construct possible atom combinations, give itertools.product the atomic numbers of the candidate atoms (C, S, N, O) and the number of atoms in the ring, target_atomic_nums. After that, all possible combinations are given here because those that cannot be generated as molecules are excluded.
import copy
import itertools
from rdkit import Chem
from rdkit.Chem import AllChem
class HeteroShuffle():
def __init__(self, mol, query):
self.mol = mol
self.query = query
self.subs = Chem.ReplaceCore(self.mol, self.query)
self.core = Chem.ReplaceSidechains(self.mol, self.query)
self.target_atomic_nums = [6, 7, 8, 16]
def make_connectors(self):
n = len(Chem.MolToSmiles(self.subs).split('.'))
map_no = n+1
self.rxn_dict = {}
for i in range(n):
self.rxn_dict[i+1] = AllChem.ReactionFromSmarts('[{0}*][*:{1}].[{0}*][*:{2}]>>[*:{1}][*:{2}]'.format(i+1, map_no, map_no+1))
return self.rxn_dict
def re_construct_mol(self, core):
'''
re construct mols from given substructures and core
'''
keys = self.rxn_dict.keys()
ps = [[core]]
for key in keys:
ps = self.rxn_dict[key].RunReactants([ps[0][0], self.subs])
mol = ps[0][0]
try:
smi = Chem.MolToSmiles(mol)
mol = Chem.MolFromSmiles(smi)
Chem.SanitizeMol(mol)
return mol
except:
return None
def get_target_atoms (self):
'''
get target atoms for replace
target atoms means atoms which don't have anyatom (*) in neighbors
'''
atoms = []
for atom in self.core.GetAromaticAtoms():
neighbors = [a.GetSymbol() for a in atom.GetNeighbors()]
if '*' not in neighbors and atom.GetSymbol() != '*':
atoms.append(atom)
print(len(atoms))
return atoms
def generate_mols(self):
atoms = self.get_target_atoms()
idxs = [atom.GetIdx() for atom in atoms]
combinations = itertools.product(self.target_atomic_nums, repeat=len(idxs))
smiles_set = set()
self.make_connectors()
for combination in combinations:
target = copy.deepcopy(self.core)
#print(Chem.MolToSmiles(target))
for i, idx in enumerate(idxs):
target.GetAtomWithIdx(idx).SetAtomicNum(combination[i])
smi = Chem.MolToSmiles(target)
#smi = smi.replace('sH','s').replace('oH','o').replace('cH3','c')
#print('rep '+smi)
target = Chem.MolFromSmiles(smi)
if target is not None:
n_attachment = len([atom for atom in target.GetAtoms() if atom.GetAtomicNum() == 0])
n_aromatic_atoms = len(list(target.GetAromaticAtoms()))
if target.GetNumAtoms() - n_attachment == n_aromatic_atoms:
try:
mol = self.re_construct_mol(target)
if check_mol(mol):
smiles_set.add(Chem.MolToSmiles(mol))
except:
pass
mols = [Chem.MolFromSmiles(smi) for smi in smiles_set]
return mols
The function called check_mol used in the above code is used to avoid 6-membered ring structure such as c1coooo1 because it is also determined to be Aromatic. O and S are allowed only for 5-membered heteroaromatic rings.
def check_mol(mol):
arom_atoms = mol.GetAromaticAtoms()
symbols = [atom.GetSymbol() for atom in arom_atoms if not atom.IsInRingSize(5)]
if not symbols:
return True
elif 'O' in symbols or 'S' in symbols:
return False
else:
return True
Use the function.
# Gefitinib
mol1 = Chem.MolFromSmiles('COC1=C(C=C2C(=C1)N=CN=C2NC3=CC(=C(C=C3)F)Cl)OCCCN4CCOCC4')
core1 = Chem.MolFromSmiles('c1ccc2c(c1)cncn2')
# Oxaprozin
mol2 = Chem.MolFromSmiles('OC(=O)CCC1=NC(=C(O1)C1=CC=CC=C1)C1=CC=CC=C1')
core2 = Chem.MolFromSmiles('c1cnco1')
Original molecule.
ht=HeteroSuffle(mol1, core1)
res=ht.generate_mols()
print(len(res))
Draw.MolsToGridImage(res, molsPerRow=5)
Part of the conversion result when Gefitinib is input. A molecule in which the atoms forming the aromatic ring have changed from the original compound is output. Also, only the quinazoline part specified in Core has been converted.
ht=HeteroSuffle(mol2, core2)
res=ht.generate_mols()
print(len(res))
Draw.MolsToGridImage(res, molsPerRow=5)
Conversion result when Oxaprozin is input. This has a 5-member ring structure called oxazole. Some aromatic rings that form a 5-membered ring contain nitrogen and oxygen, such as thiophene and furan. In the following example, a molecule containing S and O in the constituent atoms of the 5-membered ring is also output.
How about that. An example of two molecules is given. In the first example, Gefitinib, the aromatic ring constituting the molecule is quinazoline and benzene. Quinazoline is a structure in which two 6-membered rings, benzene and pyrimidine, are fused. Candidate atoms for forming aromatic rings based on 6-membered rings are carbon and nitrogen. (Oxygen and sulfur are also candidates if we consider charged ones such as pyrylium ions, but such structures are rarely used in Drug Design, so they are excluded from this explanation. Link:https://en.wikipedia.org/wiki/%E8%A4%87%E7%B4%A0%E7%92%B0%E5%BC%8F%E5%8C%96%E5%90%88%E7%89%A9[Description of heterocyclic compound])
Oxaprozin has oxazole. Candidate atoms for forming a 5-membered aromatic ring include carbon, nitrogen, sulfur and oxygen. This was introduced as an example for such a molecule. In each case, the above code produces a heteroatom shuffled
A little more about hetero shuffling
In the article J. Med. Chem. 2012, 55, 11, 5151-5164 is described the effect of nitrogen shuffling for PIM-1 kinase inhibitor project with Fragment Molecular Orbital method which is a method of quantum chemistry. And another article J. Chem. Inf. Model. 2019, 59, 1, 149-158 described mechanism of the stackibng between Asp-Arg salt bridge and hetero rings with quantum chemistry calclation. The approach seems to be good indicator for substituents design.
Also, an example of hetero shuffling for improving the bio availability is ink:https://dx.doi.org/10.1021/jm101027s[J. Med. Chem. 2011, 54, 8, 3076-3080]