Predict SARS-CoV-2 spike glycoprotein sequences via deep learning

Spheluo/SARS-CoV2-sequence-predictor

SARS-CoV-2 sequence predictor

COVID-19 is a serious global health problem, and producing vaccines against evolved variants of the SARS-CoV-2 virus is an important challenge. We therefore aim to predict the sequence evolution of the SARS-CoV-2 spike protein using deep learning. The pipeline, methods and code here were adapted from a flu-forecaster developed by Eric Ma. Some of the code was partially rewritten so that it can be run on Google Colab.

Data

The SARS-CoV-2 sequence data came from the NCBI Virus database (IRD). The search parameters were as follows:

  • Species: Severe acute respiratory syndrome coronavirus 2
  • Sequence Length: 1273
  • Nucleotide completeness: complete
  • Protein: Surface glycoprotein
  • Collection date: 2020/5/1 to 2021/12/31
  • Geographic region: North America
  • Isolation source: oronasopharynx
  • Host: Homo (humans)

Alternatively, download the sequence file "sequences_2020May_to_2021Dec.fasta" and place it in the folder "data_covid19" before running the code.
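As a quick check that the downloaded file matches the search parameters above, the sequences can be loaded and filtered by length. The following is a minimal sketch using only the Python standard library. The helper names `read_fasta` and `keep_full_length` are hypothetical and are not part of this repository, and the path in the usage note simply combines the folder and file names mentioned above.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def keep_full_length(records, length=1273):
    """Keep only records whose sequence has the expected spike protein length."""
    return [(h, s) for h, s in records if len(s) == length]
```

Usage would look like `records = keep_full_length(read_fasta("data_covid19/sequences_2020May_to_2021Dec.fasta"))`. Filtering on length again is redundant if the database query already enforced it, but it guards against a file assembled with different parameters.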

Structure

  1. Use variational autoencoders, a deep learning method, to learn a latent manifold on which sequence evolution is taking place.
  2. Simultaneously construct a genotype network of SARS-CoV-2 evolution.
    1. Nodes: SARS-CoV-2 protein sequences.
    2. Edges: Pairs of sequences that differ by one amino acid.
  3. Sanity checks:
    1. Plot the edit distance between randomly chosen pairs of protein sequences against their distance on the manifold. The relationship between the two should be approximately linear.
  4. Validation:
    1. MVP validation will be done with one round of "back testing": we hold out data from 2021/8/1 to 2021/12/31 and predict whether those sequences show up or not.
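The genotype network and the sanity check described above can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's implementation. All sequences are assumed to have the same length (1273 residues, per the search parameters), so Hamming distance is used as a stand-in for edit distance. The `latent` coordinates are assumed to come from the trained encoder, and Euclidean distance is assumed as the manifold distance. The function names are hypothetical.

```python
import math
import random
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def genotype_edges(sequences):
    """Edges of the genotype network: unique sequences one substitution apart."""
    unique = sorted(set(sequences))
    # Pairwise comparison is quadratic in the number of unique sequences.
    return [(a, b) for a, b in combinations(unique, 2) if hamming(a, b) == 1]


def distance_pairs(sequences, latent, n_pairs=1000, seed=0):
    """Sample random pairs; return their sequence and latent-space distances."""
    rng = random.Random(seed)
    edit, manifold = [], []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(sequences)), 2)
        edit.append(hamming(sequences[i], sequences[j]))
        manifold.append(math.dist(latent[i], latent[j]))  # Euclidean (assumed)
    return edit, manifold


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

A Pearson coefficient close to 1 between the two lists returned by `distance_pairs` would be consistent with the linear relationship the sanity check looks for. A scatter plot of the same two lists gives the visual version of that check.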