hkmztrk/SMILESVecProteinRepresentation

About SMILESVec based Protein Representation

Here, we represent proteins using their interacting ligands. We use the SMILES representation of ligands and propose SMILESVec, a ligand representation built with the Word2vec model of Mikolov et al.

Each SMILES string is divided into overlapping subsequences that we call chemical words. Word2vec then learns a high-dimensional, real-valued vector for each chemical word. The SMILES vector is defined as the average of the vectors of its chemical words.
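The two steps above (splitting into chemical words, then averaging their vectors) can be sketched as follows. This is a minimal illustration, not the repository's code: the function names, the word length of 8, the handling of short strings, and the toy 2-dimensional embeddings are all assumptions made for the example.

```python
def chemical_words(smiles, word_len=8):
    """Split a SMILES string into overlapping substrings of length word_len.

    word_len=8 is an assumed default; strings shorter than word_len are
    returned as a single word (also an assumption).
    """
    if len(smiles) <= word_len:
        return [smiles]
    return [smiles[i:i + word_len] for i in range(len(smiles) - word_len + 1)]


def smiles_vector(smiles, embeddings, word_len=8):
    """Average the vectors of the chemical words that have an embedding.

    Returns None when none of the words is present in `embeddings`.
    """
    words = chemical_words(smiles, word_len)
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Toy embeddings invented purely for illustration.
toy = {"CC(=O)OC": [1.0, 0.0], "C(=O)OC1": [0.0, 1.0]}
vec = smiles_vector("CC(=O)OC1", toy)
```

Words without an embedding are skipped here rather than replaced by a zero vector; the actual scripts may treat them differently.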

We used the Gensim implementation to build the word embeddings.

Figure


Installation

Data

The "data" folder contains the input and output files.

The "source code" folder contains the Python source code.

Embedding files are provided here

Requirements

You'll need to install the following in order to run the code.

  • Python 2.7.x or Python 3.x
  • numpy
  • scikit-learn (imported as sklearn)
  • chembl_webresource_client
    • for dependency issues:
    • pip install --force-reinstall gevent==1.2.2
    • pip install --force-reinstall greenlet==0.4.12
  • pickle (part of the Python standard library)

To run the code, you have to place an embedding file under the utils folder inside the source folder.

You can use either drug.l8.chembl23.canon.ws20.txt or drug.l8.pubchem.canon.ws20.txt.
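If you want to inspect an embedding file yourself, a reader along these lines may help. It is only a sketch: it assumes the .txt files use the common word2vec text layout (one word per line followed by its vector components, optionally preceded by a "count dimension" header line), which this README does not confirm. The function name and the demo values are made up.

```python
import os
import tempfile


def load_text_embeddings(path):
    """Read 'word v1 v2 ...' lines into a dict mapping word -> list of floats.

    Assumes the word2vec text layout. A first line made of exactly two
    integers (vocabulary size and dimension) is treated as a header.
    """
    embeddings = {}
    with open(path) as handle:
        for line_no, line in enumerate(handle):
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue  # skip the optional header
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Demo on a throwaway file with made-up values.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("2 3\nCC(=O)OC 0.1 0.2 0.3\nC(=O)OC1 0.4 0.5 0.6\n")
demo = load_text_embeddings(tmp.name)
os.remove(tmp.name)
```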

Usage

get SMILESVec for given SMILES

For a list of SMILES strings, it outputs the corresponding SMILESVec vectors. The following command runs on the smiles_sample.txt file under the utils folder.

python getsmilesvec.py [embedding_file_name]
python getsmilesvec.py drug.l8.chembl23.canon.ws20.txt

Output: smiles.vec is a pickle file. Use pickle.load(open("smiles.vec", "rb")) to open it.
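A slightly more defensive way to open the output is sketched below for Python 3. The helper name load_pickle is hypothetical, and the structure of the object stored in smiles.vec is not documented here, so the demo round-trips a made-up list instead of the real file.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load a pickle file, opening it in binary mode as Python 3 requires."""
    with open(path, "rb") as handle:
        try:
            return pickle.load(handle)
        except UnicodeDecodeError:
            # A pickle written by Python 2 can contain byte strings that
            # Python 3 cannot decode as ASCII; retry and keep them as bytes.
            handle.seek(0)
            return pickle.load(handle, encoding="bytes")


# Round-trip demo with invented vectors (not real SMILESVec output).
toy_vectors = [[0.1, 0.2], [0.3, 0.4]]
with tempfile.NamedTemporaryFile(suffix=".vec", delete=False) as tmp:
    pickle.dump(toy_vectors, tmp)
loaded = load_pickle(tmp.name)
os.remove(tmp.name)
```

For the real output, call load_pickle("smiles.vec") from the folder where the script wrote the file. With encoding="bytes", any strings stored by Python 2 come back as bytes objects.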

get SMILESVec-based representation for given protein (UniProt ID)

For a list of UniProt IDs, it outputs the corresponding SMILESVec-based protein vectors. The following command runs on the prots_sample.txt file under the utils folder.

python getligprotvec.py [embedding_file_name]
python getligprotvec.py drug.l8.pubchem.canon.ws20.txt

Output: prot.vec is a pickle file. Use pickle.load(open("prot.vec", "rb")) to open it or, in Python 3:

import pickle

with open('prot.vec', 'rb') as f:
    prots = pickle.load(f, encoding='bytes')
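For intuition about what a ligand-based protein vector is, the sketch below combines the vectors of a protein's ligands by element-wise averaging. This README does not spell out the combination step, so the averaging, the function name protein_vector, and the toy numbers are assumptions for illustration rather than a description of getligprotvec.py.

```python
def protein_vector(ligand_vectors):
    """Element-wise mean of a protein's ligand vectors; None if there are none."""
    if not ligand_vectors:
        return None
    n = len(ligand_vectors)
    # zip(*...) groups the k-th component of every ligand vector together.
    return [sum(component) / n for component in zip(*ligand_vectors)]


# Three made-up 2-dimensional ligand vectors for one hypothetical protein.
ligands = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
prot = protein_vector(ligands)
```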

How to train your own embeddings of SMILES?

Please refer to the README here for detailed information and source code.

SMILESVec-based Protein Similarity for SCOP A-50

will be updated

For citation:

A novel methodology on distributed representations of proteins using their interacting ligands

@article{Ozturk2018Anovel,
author = {Öztürk, Hakime and Ozkirimli, Elif and Özgür, Arzucan},
title = {A novel methodology on distributed representations of proteins using their interacting ligands},
journal = {Bioinformatics},
volume = {34},
number = {13},
pages = {i295-i303},
year = {2018},
doi = {10.1093/bioinformatics/bty287},
URL = {http://dx.doi.org/10.1093/bioinformatics/bty287}
}
