COVID-19 is a serious global health problem, and producing vaccines against evolved forms of the SARS-CoV-2 virus is an important challenge. We therefore aim to predict the sequence evolution of the SARS-CoV-2 spike protein using deep learning. The pipeline, methods, and code here were adapted from flu-forecaster, developed by Eric Ma. Parts of the code were rewritten so that it can be run on Google Colab.
The SARS-CoV-2 sequence data came from the NCBI Virus database (IRD). The search parameters were as follows:
- Species: Severe acute respiratory syndrome coronavirus 2
- Sequence Length: 1273
- Nucleotide completeness: complete
- Protein: Surface glycoprotein
- Collection date: 2020/5/1~2021/12/31
- Geographic region: North America
- Isolation source: oronasopharynx
- Host: Homo (humans)
Alternatively, download the sequence file "sequences_2020May_to_2021Dec.fasta" and place it in the folder "data_covid19" before executing the code.
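As a minimal sketch of reading the downloaded file, assuming a standard FASTA layout (header lines starting with `>`), the sequences can be loaded with plain Python. The helpers `read_fasta` and `load_spike_sequences` are hypothetical names used for illustration and are not part of the original pipeline; the length filter of 1273 mirrors the search parameter above.

```python
from pathlib import Path

SPIKE_LENGTH = 1273  # mirrors the "Sequence Length" search parameter


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def load_spike_sequences(path, length=SPIKE_LENGTH):
    """Read a FASTA file and keep only records of the expected length."""
    records = read_fasta(Path(path).read_text())
    return [(h, s) for h, s in records if len(s) == length]
```

For example, `load_spike_sequences("data_covid19/sequences_2020May_to_2021Dec.fasta")` would return the full-length spike records from the file named above.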
- Use variational autoencoders, a deep learning method, to learn a latent manifold on which sequence evolution is taking place.
- Simultaneously construct a genotype network of SARS-CoV-2 evolution.
  - Nodes: SARS-CoV-2 protein sequences.
  - Edges: Sequences differ by one amino acid.
- Sanity checks:
  - Plot the edit distance between random pairs of protein sequences against their distance on the manifold. The relationship between the two should be approximately linear.
- Validation:
  - MVP validation will be done with one round of "back testing": we hold out data from 2021/8/1 to 2021/12/31 and check whether the predicted sequences show up in it or not.
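The genotype network and the sanity check above can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline's actual implementation. All function names are hypothetical, and `embeddings` stands in for latent coordinates produced by a trained variational autoencoder. Because every sequence has the same length (1273), a single-edit difference is necessarily a substitution, so Hamming distance is used for the edges. For the sanity check it serves as a simple stand-in for edit distance.

```python
from itertools import combinations
import math
import random


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def genotype_edges(sequences):
    """Return pairs of unique sequences that differ by exactly one amino acid."""
    unique = sorted(set(sequences))
    # Pairwise comparison is quadratic in the number of unique sequences.
    return [(a, b) for a, b in combinations(unique, 2) if hamming(a, b) == 1]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # undefined (division by zero) if either list is constant


def distance_correlation(sequences, embeddings, n_pairs=1000, seed=0):
    """Correlate sequence distance with latent distance over random pairs."""
    rng = random.Random(seed)
    edit, latent = [], []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(sequences)), 2)
        edit.append(hamming(sequences[i], sequences[j]))
        latent.append(math.dist(embeddings[i], embeddings[j]))
    return pearson(edit, latent)
```

A correlation close to 1 from `distance_correlation` would be consistent with the expected linear relationship. A scatter plot of the two distance lists remains the more informative check.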