
# ProteinCLIP

## Introduction and background

## Installation

To install ProteinCLIP, start by cloning this repository. Then create the requisite conda environment, activate it, and install ProteinCLIP in editable mode using pip. For example:

```bash
conda env create -f environment.yml
conda activate proteinclip
pip install -e ./
```

**Note:** we highly recommend the mamba package manager as an alternative to conda.

If you intend to train ProteinCLIP yourself, you will also need to download data files; all datasets we use can be found on Zenodo.

## Using ProteinCLIP

We provide pre-trained ProteinCLIP "adapter" models for the ESM2 family of models as well as ProtT5. These models live in the `pretrained` directory and can be loaded using the provided functions; see below for an example.

```python
import numpy as np

from proteinclip import model_utils

m = model_utils.load_proteinclip("esm", 33)  # For ESM2, 33-layer model

# Create a synthetic example
# Size corresponds to the embedding dimension of the "parent" protein language model
model_input = np.random.randn(1280)
# ProteinCLIP expects input to be unit-normalized
model_input /= np.linalg.norm(model_input)
x = m.predict(model_input)
print(x.shape)  # (128,)
print(np.linalg.norm(x))  # 1.0; ProteinCLIP produces unit-norm vectors
```
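
Because outputs are unit-norm, comparing two proteins in ProteinCLIP space reduces to a dot product. A minimal sketch continuing from the snippet above (the two random vectors stand in for real protein language model embeddings):

```python
# Two synthetic inputs standing in for real ESM2 embeddings.
a = np.random.randn(1280)
b = np.random.randn(1280)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# With unit-norm outputs, the dot product equals cosine similarity.
similarity = float(np.dot(m.predict(a), m.predict(b)))
print(similarity)  # In [-1, 1]; higher means more similar
```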

Pre-trained adapters are available for the following protein language models:

- ESM2, 36-layer: `model_utils.load_proteinclip("esm", 36)`
- ESM2, 33-layer: `model_utils.load_proteinclip("esm", 33)`
- ESM2, 30-layer: `model_utils.load_proteinclip("esm", 30)`
- ESM2, 12-layer: `model_utils.load_proteinclip("esm", 12)`
- ESM2, 6-layer: `model_utils.load_proteinclip("esm", 6)`
- ProtT5: `model_utils.load_proteinclip("t5")`

These models are stored in the ONNX format, so you can also write your own loaders. They are small and run forward inference quickly, even on CPU.
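
For example, a minimal sketch of a custom loader using the `onnxruntime` package (the file path, input name, and shape here are assumptions; inspect the files under `pretrained` for the actual details):

```python
import numpy as np
import onnxruntime as ort

# Hypothetical filename; see the pretrained/ directory for actual model files.
session = ort.InferenceSession("pretrained/esm2_33.onnx")

# Assumes a single model input; inspect the session to confirm names/shapes.
input_name = session.get_inputs()[0].name

# Unit-normalized input, matching the parent ESM2 33-layer embedding size.
x = np.random.randn(1280).astype(np.float32)
x /= np.linalg.norm(x)

# run() returns a list of output arrays; this assumes a single output.
(out,) = session.run(None, {input_name: x})
print(out.shape)
```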

## Example training commands

### Training ProteinCLIP

To train ProteinCLIP yourself, you can use the pre-computed embeddings we provide above, or you can compute your own embeddings and store them in HDF5 format as a mapping of UniProt ID -> embedding array; a sketch of this layout follows. Once you have a protein embedding file, pass it to the training script.
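
For illustration, a minimal sketch of writing embeddings in the expected key-to-array layout with `h5py` (the IDs and values here are placeholders; real embeddings would come from a protein language model):

```python
import h5py
import numpy as np

# Placeholder embeddings keyed by UniProt ID; in practice these come
# from a protein language model such as ESM2 or ProtT5.
embeddings = {
    "P12345": np.random.randn(1280).astype(np.float32),
    "Q67890": np.random.randn(1280).astype(np.float32),
}

# One dataset per UniProt ID, keyed by the ID itself.
with h5py.File("protein_embedding.hdf5", "w") as f:
    for uniprot_id, emb in embeddings.items():
        f.create_dataset(uniprot_id, data=emb)
```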

Example command:

```bash
python bin/train_protein_clip.py configs/clip_hparams.json /path/to/uniprot_sprot.dat.gz /path/to/protein_embedding.hdf5 --unitnorm -g text-embedding-3-large
```

Training should only take a couple of hours with pre-computed embeddings.

### Training protein-protein interaction classifier

We provide a training command to automatically train a protein-protein interaction classifier using the data splits provided by Bernett et al. (1). The input to this call is the directory of a ProteinCLIP training run from above; the relevant HDF5 protein embeddings are loaded, along with the CLIP architecture itself (as specified by the `--clipnum` argument).

Example command:

```bash
python bin/train_ppi.py configs/supervised_hparams.json -c ./protein_clip/version_0 --clipnum 1 -n ppi_classifier
```

Training should take a few minutes.

## References

(1) Bernett, J., Blumenthal, D. B., & List, M. (2024). Cracking the black box of deep sequence-based protein–protein interaction prediction. Briefings in Bioinformatics, 25(2), bbae076.