To see how data is distributed, it is common to map it into a suitable space. In chemoinformatics this is called chemical space: an arrangement of compounds in an n-dimensional space under some metric. For human interpretation, a 2D or 3D projection is usually used. Various metrics (i.e., similarity measures) have been proposed, and in practice one chooses a distance that characterizes the compounds well.
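As a concrete example of such a distance, the following minimal sketch (not part of the workflow below) computes the Tanimoto similarity between two compounds from their Morgan fingerprints with RDKit; the two SMILES strings are arbitrary illustrative molecules (aspirin and caffeine), not compounds from the dataset used later.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol1 = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")      # aspirin
mol2 = Chem.MolFromSmiles("Cn1cnc2c1c(=O)n(C)c(=O)n2C")  # caffeine

# Morgan (circular) fingerprints of radius 2
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, 2)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, 2)

# Tanimoto similarity in [0, 1]; 1 minus this value can serve as a distance
print(DataStructs.TanimotoSimilarity(fp1, fp2))
One minus this similarity is exactly the Tanimoto distance used later in this chapter for clustering.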
Here we will visualize which pharmaceutical companies are developing which compounds as orexin receptor antagonists, a well-known class of sleep drug targets. See Chapter 4 for how to download the data. We used the data from the 10 papers in the table below.
There are two main things we want to know:

- Did any companies develop similar compounds?
- Did Merck optimize only a single scaffold, or multiple scaffolds?
| Doc ID | Pharma |
| ------ | ------ |
| CHEMBL3098111 | Merck |
| CHEMBL3867477 | Merck |
| CHEMBL2380240 | Rottapharm |
| CHEMBL3352684 | Merck |
| CHEMBL3769367 | Merck |
| CHEMBL3526050 | Actelion |
| CHEMBL3112474 | Actelion |
| CHEMBL3739366 | Heptares |
| CHEMBL3739395 | Actelion |
| CHEMBL3351489 | Eisai |
We use ggplot as the plotting library and principal component analysis (PCA) to project the compounds so that similar ones end up close together. First, import the necessary libraries.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Draw
import numpy as np
import pandas as pd
from ggplot import *
from sklearn.decomposition import PCA
import os
Load the downloaded SDF files and create a fingerprint for each compound, keeping track of which company and document ID each one belongs to. If anything is unclear, check Chapter 6.
oxrs = [("CHEMBL3098111", "Merck"), ("CHEMBL3867477", "Merck"),
        ("CHEMBL2380240", "Rottapharm"), ("CHEMBL3352684", "Merck"),
        ("CHEMBL3769367", "Merck"), ("CHEMBL3526050", "Actelion"),
        ("CHEMBL3112474", "Actelion"), ("CHEMBL3739366", "Heptares"),
        ("CHEMBL3739395", "Actelion"), ("CHEMBL3351489", "Eisai")]

fps = []
docs = []
companies = []
for cid, company in oxrs:
    sdf_file = os.path.join("ch08", cid + ".sdf")
    mols = Chem.SDMolSupplier(sdf_file)
    for mol in mols:
        if mol is not None:
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
            arr = np.zeros((1,))
            DataStructs.ConvertToNumpyArray(fp, arr)
            docs.append(cid)
            companies.append(company)
            fps.append(arr)
fps = np.array(fps)
companies = np.array(companies)
docs = np.array(docs)
Checking the shape of the fingerprint array, we see that 293 compounds were obtained from the 10 papers.
fps.shape
# (293, 2048)
You are now ready for principal component analysis. The number of principal components is specified with n_components; since we want a two-dimensional scatter plot, we set it to 2.
pca = PCA(n_components=2)
x = pca.fit_transform(fps)
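As a quick sanity check (an extra step, not required for the plot), you can see how much of the fingerprint variance the two components actually capture; with high-dimensional fingerprints this fraction is typically small.
print(pca.explained_variance_ratio_)        # variance captured by each component
print(pca.explained_variance_ratio_.sum())  # total captured by the 2D projection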
Now draw the plot. Since we want to color the points by label, we attach two attributes, COMPANY and DOCID, to the data frame.
d = pd.DataFrame(x)
d.columns = ["PCA1", "PCA2"]
d["DOCID"] = docs
d["COMPANY"] = companies
g = ggplot(aes(x="PCA1", y="PCA2", color="COMPANY"), data=d) + geom_point() + xlab("X") + ylab("Y")
g
You can now see which compounds each pharmaceutical company has optimized. Merck, Actelion, Eisai and Heptares seem to have optimized similar compounds, as there is an overlapping area in the center of the chemical space. It is interesting to ask whether Actelion successfully expanded in a unique direction (lower left), or simply failed to push into the crowded "red ocean" at the center.
Merck also seems to have optimized a variety of scaffolds. Whether these series were pursued simultaneously or some were run as backups is unknown, but it is clear that multiple scaffold optimizations were underway, so the target must have been attractive. Indeed, suvorexant was eventually launched.
This chapter uses published paper data, but in a real-world setting such analysis is not done on papers: by the time a company publishes, the project is usually over (whether it succeeded in reaching the clinic or was closed). In practice, the analysis is performed on patent data.

Based on such analyses, the experience of medicinal chemists, and insight into these companies, a project proceeds by inferring competitors' situations while believing in its own chances of success.
t-SNE is said to give better resolution than PCA and to be closer to a medicinal chemist's intuition. In scikit-learn, you only need to swap PCA for TSNE.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
tx = tsne.fit_transform(fps)
As the plot shows, the compounds are better separated than with PCA.
d = pd.DataFrame(tx)
d.columns = ["tSNE1", "tSNE2"]
d["DOCID"] = docs
d["COMPANY"] = companies
g = ggplot(aes(x="tSNE1", y="tSNE2", color="COMPANY"), data=d) + geom_point() + xlab("X") + ylab("Y")
g
There are many dimensionality-reduction methods besides the PCA and t-SNE introduced here, so they are worth exploring.
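As one example, UMAP follows the same fit_transform pattern as t-SNE; this is a minimal sketch assuming the umap-learn package is installed (it is not used elsewhere in this chapter).
import umap

# Assumes `fps` is the fingerprint matrix built above
reducer = umap.UMAP(n_components=2, random_state=0)
ux = reducer.fit_transform(fps)  # shape (n_compounds, 2), analogous to the t-SNE result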
In the case above, the mapping worked out neatly because we could label the dimensionality-reduced fingerprints (PCA or t-SNE) with company names. But such labels are not always available. In those cases, clustering can be used to divide the compounds into appropriate groups.

Clustering methods include k-means and hierarchical clustering, but these cannot determine the appropriate number or size of clusters automatically. Here we use HDBSCAN, a density-based clustering method, in the hope that the assigned cluster labels will correspond roughly to the company names.
Installation can be done with conda or pip. We use seaborn for visualization. The data-loading code is omitted since it is the same as above, but note that RDLogger.DisableLog('rdApp.*') is inserted to suppress RDKit warnings during execution (see the Jupyter notebook).
pip install hdbscan
# or
conda install -c conda-forge hdbscan
conda install -c conda-forge seaborn
Right after installation, import everything we need.
%matplotlib inline
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Draw
from rdkit import RDLogger
from sklearn.manifold import TSNE
from hdbscan import HDBSCAN
## The following packages will be used later
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import RandomForestRegressor
from mlinsights.mlmodel import PredictableTSNE

sns.set_context('poster')
sns.set_style('white')
sns.set_color_codes()
plot_kwds = {'alpha': 0.5, 's': 80, 'linewidths': 0}
RDLogger.DisableLog('rdApp.*')
seed = 794
oxrs = [("CHEMBL3098111", "Merck"), ("CHEMBL3867477", "Merck"), ("CHEMBL2380240", "Rottapharm"),
        ("CHEMBL3352684", "Merck"), ("CHEMBL3769367", "Merck"), ("CHEMBL3526050", "Actelion"),
        ("CHEMBL3112474", "Actelion"), ("CHEMBL3739366", "Heptares"), ("CHEMBL3739395", "Actelion"),
        ("CHEMBL3351489", "Eisai")]

fps = []
docs = []
companies = []
mol_list = []
for cid, company in oxrs:
    sdf_file = os.path.join("ch08", cid + ".sdf")
    mols = Chem.SDMolSupplier(sdf_file)
    for mol in mols:
        if mol is not None:
            mol_list.append(mol)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
            arr = np.zeros((1,))
            DataStructs.ConvertToNumpyArray(fp, arr)
            docs.append(cid)
            companies.append(company)
            fps.append(arr)
fps = np.array(fps)
companies = np.array(companies)
docs = np.array(docs)

trainIDX, testIDX = train_test_split(range(len(fps)), random_state=seed)
First, let's look at the chemical space with the company labels using t-SNE.
tsne = TSNE(random_state=seed)
res = tsne.fit_transform(fps)
plt.clf()
plt.figure(figsize=(12, 6))
sns.scatterplot(x=res[:, 0], y=res[:, 1], hue=companies, **plot_kwds)
Now that we have confirmed that the companies separate cleanly, let's run HDBSCAN and visualize the results.

HDBSCAN follows the scikit-learn API: create an object and call fit, and you are done. It provides many metrics by default, such as Euclidean and cosine distance. Here, since we are measuring distances between compounds, we use the Tanimoto distance, which must be implemented by yourself (tanimoto_dist below). To use such a user-defined function, pass metric='pyfunc' and the function itself via the func argument.
def tanimoto_dist(ar1, ar2):
    # numerator: on-bits shared by both fingerprints
    a = np.dot(ar1, ar2)
    # denominator: on-bits present in either fingerprint (elementwise OR for 0/1 arrays)
    b = ar1 + ar2 - ar1 * ar2
    return 1 - a / np.sum(b)

clusterer = HDBSCAN(algorithm='best', min_samples=5, metric='pyfunc', func=tanimoto_dist)
clusterer.fit(fps)
After fitting, the cluster labels are stored in the labels_ attribute of the clusterer object, so let's plot with them. Here we set min_samples=5, i.e., the minimum number of compounds required for cluster formation is 5. Increasing or decreasing this parameter produces different output, so some trial and error may be worthwhile. min_cluster_size defines the smallest allowed cluster; groups smaller than this are treated as noise. min_samples is the number of neighbors a point needs near a core (cluster center): the larger it is, the more conservative the clustering (only points with many neighbors are clustered). There is no single right or wrong answer in clustering; what matters is whether the result looks reasonable to the medicinal chemists who will use it in practice. It is therefore best to find a good balance by reviewing the results together with the medicinal chemists driving the project. If you are interested, see the official documentation for more examples.
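As a sketch of that trial and error (the specific values here are arbitrary choices, not recommendations), you can sweep min_cluster_size and see how the number of clusters and noise points changes:
# Try several minimum cluster sizes and report clusters / noise counts
for mcs in (5, 10, 20):
    c = HDBSCAN(algorithm='best', min_samples=5, min_cluster_size=mcs,
                metric='pyfunc', func=tanimoto_dist)
    c.fit(fps)
    n_clusters = c.labels_.max() + 1       # labels run 0..k-1; -1 marks noise
    n_noise = (c.labels_ == -1).sum()
    print(mcs, n_clusters, n_noise)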
- The gray points in the plot are those that were judged not to belong to any cluster.
plt.clf()
plt.figure(figsize=(12, 6))
palette = sns.color_palette()
cluster_colors = [sns.desaturate(palette[col], sat) if col >= 0 else (0.5, 0.5, 0.5)
                  for col, sat in zip(clusterer.labels_, clusterer.probabilities_)]
plt.scatter(res[:, 0], res[:, 1], c=cluster_colors, **plot_kwds)
You can see that the clusters roughly correspond to the groups separated by t-SNE. HDBSCAN is thus useful as a density-based way of forming clusters without fixing the number of clusters in advance.
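To check that correspondence more directly (an extra step, not in the original workflow), you can cross-tabulate the HDBSCAN labels against the company names with pandas:
import pandas as pd

# Rows: cluster labels (-1 = noise); columns: companies
print(pd.crosstab(clusterer.labels_, companies))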
By now we have learned how to map compounds into a low-dimensional chemical space based on their features.

When mapping with PCA or t-SNE, a natural question is: where would the next compound land in this space? Likewise, when expanding an HTS library to fill previously unexplored chemical space, you will want to compare candidates against the existing space.
Scikit-learn's PCA has a transform() method, so a new compound can be mapped into an existing principal-component space. t-SNE, on the other hand, has fit_transform() but no transform(); mapping a new compound means recomputing the embedding over all compounds. Is there a method-agnostic way to map new data into an existing chemical space? The idea is to build a prediction model for the space itself, just as one builds any other prediction model: use the training data to learn a mapping from features to the low-dimensional coordinates. New compounds can then be projected into the existing space by prediction. The package that implements this is mlinsights, which can be installed with pip.
pip install mlinsights
mlinsights implements several methods, but here we focus on a case study using PredictableTSNE. The data set is split randomly into training and test sets, t-SNE is run on the training data, its output coordinates are learned by two regressors (RandomForest and GaussianProcessRegressor), and the test data is mapped using those models.
trainFP = [fps[i] for i in trainIDX]
train_mol = [mol_list[i] for i in trainIDX]
testFP = [fps[i] for i in testIDX]
test_mol = [mol_list[i] for i in testIDX]
allFP = trainFP + testFP
tsne_ref = TSNE(random_state=seed)
res = tsne_ref.fit_transform(allFP)
plt.clf()
plt.figure(figsize=(12, 6))
sns.scatterplot(x=res[:, 0], y=res[:, 1],
                hue=['train'] * len(trainFP) + ['test'] * len(testFP))
Because the split is random, the training and test compounds are sampled evenly across the space.
PredictableTSNE works like any scikit-learn estimator: create an object, fit, and transform new data to reduce its dimensionality. The t-SNE (or PCA) dimensionality-reduction object is passed as transformer, and the regressor that learns a model from its output as estimator. We set keep_tsne_outputs=True because we will overlay the training and test data later.

One thing to note is that fit requires a y argument even though it is not used. In the code below, the index is passed as a dummy value.
rfr = RandomForestRegressor(random_state=seed)
tsne1 = TSNE(random_state=seed)
pred_tsne_rfr = PredictableTSNE(transformer=tsne1, estimator=rfr, keep_tsne_outputs=True)
pred_tsne_rfr.fit(trainFP, list(range(len(trainFP))))
pred1 = pred_tsne_rfr.transform(testFP)

plt.clf()
plt.figure(figsize=(12, 6))
plt.scatter(pred_tsne_rfr.tsne_outputs_[:, 0], pred_tsne_rfr.tsne_outputs_[:, 1], c='blue', alpha=0.5)
plt.scatter(pred1[:, 0], pred1[:, 1], c='red', alpha=0.5)
The red points, the mapped test data, fit surprisingly well. Next, we do the same with a Gaussian process.
gpr = GaussianProcessRegressor(random_state=seed)
tsne2 = TSNE(random_state=seed)
pred_tsne_gpr = PredictableTSNE(transformer=tsne2, estimator=gpr, keep_tsne_outputs=True)
pred_tsne_gpr.fit(trainFP, list(range(len(trainFP))))
pred2 = pred_tsne_gpr.transform(testFP)

plt.clf()
plt.figure(figsize=(12, 6))
plt.scatter(pred_tsne_gpr.tsne_outputs_[:, 0], pred_tsne_gpr.tsne_outputs_[:, 1], c='blue', alpha=0.5)
plt.scatter(pred2[:, 0], pred2[:, 1], c='red', alpha=0.5)
The GP model did not perform as well. PredictableTSNE can be a useful tool for mapping new compounds into an existing space, but as this example shows, the result can depend heavily on the estimator. Used without proper validation at project launch it risks misleading the medicinal chemists, but it is worth knowing as an approach.
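One simple way to make that model dependence concrete (an extra check, not from the original text) is to measure how closely each estimator reproduces the t-SNE coordinates it was trained on, using only the transform() method and tsne_outputs_ attribute shown above:
# Mean squared error between predicted and actual t-SNE coordinates on the training set
for name, model in [("RandomForest", pred_tsne_rfr), ("GaussianProcess", pred_tsne_gpr)]:
    pred_train = model.transform(trainFP)
    mse = np.mean((pred_train - model.tsne_outputs_) ** 2)
    print(name, mse)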
Tip: If you are using a newer version of pandas, you may get an error when importing ggplot. The solution from the linked GitHub issue is to edit the installed ggplot package: in ggplot/utils.py, change pd.tslib.Timestamp to pd.Timestamp, and in ggplot/stats/smoothers.py, change "from pandas.lib import Timestamp" to "from pandas import Timestamp".