About SMILESVec based Protein Representation

Here, we represent proteins using their interacting ligands. We use the SMILES representation of ligands and propose SMILESVec, a ligand representation built using the Word2Vec model by Mikolov et al.

Each SMILES string is divided into overlapping subsequences that we call chemical words. Word2Vec then learns a high-dimensional, real-valued vector for each of these chemical words. The SMILES vector (SMILESVec) is defined as the average of the vectors of its chemical words.

We used the Gensim implementation of Word2Vec to build the word embeddings.
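The chemical-word splitting and averaging described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names, the default word length of 8 (guessed from the "l8" in the embedding file names), the handling of SMILES strings shorter than one word, and the toy embeddings are all assumptions. Plain Python lists are used so the sketch runs without numpy.

```python
def chemical_words(smiles, word_len=8):
    """Split a SMILES string into overlapping substrings of length word_len.

    Assumption: a string shorter than word_len is kept as a single word.
    """
    if len(smiles) <= word_len:
        return [smiles]
    return [smiles[i:i + word_len] for i in range(len(smiles) - word_len + 1)]


def smiles_vector(smiles, embeddings, word_len=8):
    """Average the vectors of the chemical words that appear in `embeddings`.

    Words missing from `embeddings` are skipped (an assumption);
    returns None when no word has a vector.
    """
    words = chemical_words(smiles, word_len)
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Toy 2-dimensional embeddings, made up for illustration only.
toy_embeddings = {"CCO": [1.0, 0.0], "CON": [0.0, 1.0]}

print(chemical_words("CCON", word_len=3))                 # ['CCO', 'CON']
print(smiles_vector("CCON", toy_embeddings, word_len=3))  # [0.5, 0.5]
```

In practice the embeddings would come from one of the provided embedding files rather than a hand-written dictionary.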

Installation

Data

The "data" folder contains the input and output files.

The "source code" folder contains the Python source code.

Embedding files are provided here.

Requirements

You'll need to install the following in order to run the code.

Python 2.7.x or Python 3.x
numpy
scikit-learn (imported as sklearn)
chembl_webresource_client
- for dependency issues:
- pip install --force-reinstall gevent==1.2.2
- pip install --force-reinstall greenlet==0.4.12
pickle (part of the Python standard library, no installation needed)

To run the code, place an embedding file under the utils folder inside the source folder.

You can use either drug.l8.chembl23.canon.ws20.txt or drug.l8.pubchem.canon.ws20.txt.

Usage

get SMILESVec for given SMILES

For a list of SMILES strings, it outputs the corresponding SMILESVec vectors. The following command runs on the smiles_sample.txt file under the utils folder.

python getsmilesvec.py [embedding_file_name]
python getsmilesvec.py drug.l8.chembl23.canon.ws20.txt

Output: smiles.vec is a pickle file. Use pickle.load(open("smiles.vec", "rb")) to open it (pickle files must be opened in binary mode).
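A small loader along these lines may help when the same output file has to be read under both Python 2 and Python 3. It is only a sketch: load_vectors is a hypothetical helper name, and the structure of the object stored in smiles.vec is not described here, so the demo writes and reads back a toy list instead of the real file.

```python
import os
import pickle
import sys
import tempfile


def load_vectors(path):
    """Load a pickled object from `path`, opening the file in binary mode."""
    with open(path, "rb") as f:
        if sys.version_info[0] >= 3:
            # encoding="bytes" lets Python 3 read pickles written by Python 2.
            return pickle.load(f, encoding="bytes")
        return pickle.load(f)


# Demo with a toy pickle; replace demo_path with "smiles.vec" for real output.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vec")
with open(demo_path, "wb") as f:
    pickle.dump([[0.1, 0.2], [0.3, 0.4]], f)

loaded = load_vectors(demo_path)
print(type(loaded), len(loaded))
```

Inspecting type(loaded) first is a cheap way to find out whether the file holds a list, a dictionary, or an array before indexing into it.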

get SMILESVec-based representation for given protein (UniProt ID)

For a list of UniProt IDs, it outputs the corresponding SMILESVec-based protein vectors. The following command runs on the prots_sample.txt file under the utils folder.

python getligprotvec.py [embedding_file_name]
python getligprotvec.py drug.l8.pubchem.canon.ws20.txt

Output: prot.vec is a pickle file. Use pickle.load(open("prot.vec", "rb")) or, in Python 3,

with open('prot.vec', 'rb') as f:
    prots = pickle.load(f, encoding='bytes')

to open it.
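How the ligand vectors of a protein are combined into one protein vector is not spelled out in this README. One plausible way, assumed here purely for illustration, is to take the element-wise mean of the SMILESVec vectors of the protein's interacting ligands. The function name protein_vector and the toy vectors below are hypothetical and are not taken from getligprotvec.py.

```python
def protein_vector(ligand_vectors):
    """Element-wise mean of ligand vectors.

    None entries (ligands for which no vector could be built) are skipped;
    returns None if no usable vector remains.
    """
    vectors = [v for v in ligand_vectors if v is not None]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Toy 2-dimensional ligand vectors, made up for illustration only.
toy_ligand_vectors = [[1.0, 2.0], [3.0, 4.0], None]
print(protein_vector(toy_ligand_vectors))  # [2.0, 3.0]
```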

How to train your own embeddings of SMILES?

Please refer to the README here for detailed information and source code.

SMILESVec-based Protein Similarity for SCOP A-50

will be updated

For citation:

A novel methodology on distributed representations of proteins using their interacting ligands

@article{Ozturk2018Anovel,
author = {Öztürk, Hakime and Ozkirimli, Elif and Özgür, Arzucan},
title = {A novel methodology on distributed representations of proteins using their interacting ligands},
journal = {Bioinformatics},
volume = {34},
number = {13},
pages = {i295-i303},
year = {2018},
doi = {10.1093/bioinformatics/bty287},
URL = {http://dx.doi.org/10.1093/bioinformatics/bty287}

}
