
cov2vec is a systematic effort to obtain SARS CoV-2 genome embeddings by encoding viral genomes with protein language models.


salvatoreloguercio/cov2vec

cov2vec is a systematic effort to obtain SARS-CoV-2 genome embeddings by encoding viral genomes with protein language models. The embeddings serve two purposes: specific applications (e.g. improving biomedically relevant ML tasks) and, globally, learning meaningful representations of the viral genome as an evolving, high-dimensional genomic manifold.

Input: called mutations from a viral sequence deposited in GISAID (currently read from GFF3 files provided by CNCB; support for outbreak.info and Nextstrain input is planned).

gff2fasta: converts called mutations back into a mutated sequence file in FASTA format by introducing the amino-acid-altering mutations (indels and missense mutations).
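The gff2fasta step can be pictured with a minimal sketch, which is not the repository's implementation. It assumes the mutations have already been parsed from GFF3 into simple records. The `AaMutation` record, its field names, and the `apply_mutations` and `to_fasta` helpers are hypothetical names introduced only for illustration, and the reference sequence is a toy string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AaMutation:
    """Hypothetical record for one amino-acid-altering mutation."""
    kind: str      # "missense", "deletion", or "insertion"
    pos: int       # 1-based position in the reference protein
    ref: str = ""  # reference residue(s) at pos (missense, deletion)
    alt: str = ""  # new residue (missense) or inserted residues (insertion)


def apply_mutations(ref_seq, mutations):
    """Return the mutated protein sequence.

    Mutations are applied from the highest position to the lowest, so an
    indel never shifts the reference coordinates of a mutation that is
    still waiting to be applied.
    """
    seq = list(ref_seq)
    for m in sorted(mutations, key=lambda m: m.pos, reverse=True):
        i = m.pos - 1
        if m.kind == "missense":
            if seq[i] != m.ref:
                raise ValueError(f"expected {m.ref} at {m.pos}, found {seq[i]}")
            seq[i] = m.alt
        elif m.kind == "deletion":
            found = "".join(seq[i:i + len(m.ref)])
            if found != m.ref:
                raise ValueError(f"expected {m.ref} at {m.pos}, found {found}")
            del seq[i:i + len(m.ref)]
        elif m.kind == "insertion":
            # Inserted residues are placed directly after position pos.
            seq[i + 1:i + 1] = list(m.alt)
        else:
            raise ValueError(f"unknown mutation kind: {m.kind}")
    return "".join(seq)


def to_fasta(header, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy_reference = "MKTAYIAKQR"  # toy sequence, not a real viral protein
    toy_mutations = [
        AaMutation("missense", 3, ref="T", alt="A"),
        AaMutation("deletion", 5, ref="YI"),
        AaMutation("insertion", 8, alt="GG"),
    ]
    print(to_fasta("toy_mutant", apply_mutations(toy_reference, toy_mutations)))
```

Checking the reference residue before each edit is a cheap guard against coordinate mismatches between the mutation calls and the reference sequence.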

fasta2vec: embeds the mutated sequences with a state-of-the-art protein language model (ESM-1b or ESM-1v) pre-trained on a large protein sequence corpus (UniRef50 and UniRef90, respectively).
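One practical detail when embedding viral proteins with these models: ESM-1b and ESM-1v accept at most 1024 tokens per input, two of which are the start and end tokens, so a protein longer than 1022 residues cannot be passed in a single call. How cov2vec handles long proteins is not stated here. The sketch below shows one possible approach, splitting a sequence into fixed-size windows; the `window_sequence` helper and its `overlap` parameter are assumptions made for illustration.

```python
MAX_RESIDUES = 1022  # 1024-token context minus the start and end tokens


def window_sequence(seq, max_len=MAX_RESIDUES, overlap=0):
    """Split a protein sequence into windows of at most max_len residues.

    Consecutive windows share `overlap` residues; the last window may be
    shorter than max_len.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break
        start += step
    return windows


if __name__ == "__main__":
    long_protein = "A" * 1273  # placeholder string, not a real sequence
    print([len(w) for w in window_sequence(long_protein)])
```

Each window would then be embedded separately, and the per-window results combined (for example by averaging) into one vector per protein.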

Output: per-protein and genome embedding for the input viral sequence.
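How the per-protein embeddings are combined into a genome embedding is not specified above. A minimal sketch, assuming mean pooling over residues for each protein and concatenation of the protein vectors in a fixed order for the genome, could look as follows. Both choices, and the `mean_pool` and `genome_embedding` helpers, are illustrative assumptions rather than the project's documented method, and the numbers are toy values.

```python
def mean_pool(residue_vectors):
    """Average per-residue vectors into one fixed-length protein vector."""
    if not residue_vectors:
        raise ValueError("need at least one residue vector")
    n = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(v[d] for v in residue_vectors) / n for d in range(dim)]


def genome_embedding(per_protein, protein_order):
    """Concatenate protein vectors in a fixed order (assumed scheme)."""
    genome = []
    for name in protein_order:
        genome.extend(per_protein[name])
    return genome


if __name__ == "__main__":
    # Toy 2-dimensional per-residue vectors; real model outputs are much larger.
    per_residue = {
        "S": [[1.0, 2.0], [3.0, 4.0]],
        "N": [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
    }
    per_protein = {name: mean_pool(vecs) for name, vecs in per_residue.items()}
    print(per_protein)
    print(genome_embedding(per_protein, ["S", "N"]))
```

A fixed protein order keeps each coordinate of the genome vector tied to the same protein across all genomes, which is what makes the vectors comparable.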
