This repository provides a global phylogeny of SARS-CoV-2 data from GISAID, as well as reproducile methods with which to estimate it. It also provides code that can be used to estimate local phylogenies around your own sequences, please see below for more information.
We gratefully acknowledge the authors, originating and submitting laboratories of the sequences from GISAID’s EpiFlu™ Database on which this research is based. The list is detailed in this tab-separated file.
The latest tree with the associated code to reproduce that tree can always be obtained from the latest release of this repository here. Releases are named by the date (in Australian time) on which the sequences were downloaded from GISAID.
For convenience, you can also get the latest results via the following links:
- Latest global phylogeny with Transfer Bootstrap Supports
- Latest global phylogeny with Felsenstein Bootstrap Supports
- Acknowledgements file for those that upload to GISAID
- Latest version of the script to produce a global tree
Privacy rules around the alignments themselves mean that they cannot be released here. The alignments can be recreated by following the steps described below. If you are a GISAID member and would like a copy of the alignment, please email me and I'll share it with you.
Also, please note that the script to produce the global tree is continually modified as new data are released, so if you are looking at the master branch it may be the case that the latest code will not exactly reproduce the latest tree. If you would like to reproduce the latest tree, please head over to the latest release, which contains the latest tree and the version of the code used to produce it.
The trees contain more than 30,000 sequences. If you would like to visualise them, I recommend using Dendroscope. This is very fast, has a very quick search that allows you to find sequences of interest, and allows you to switch easily between cladograms, phylogenies, etc. The image above is a screenshot from Dendroscope.
Sequences are filtered out for a few reasons:
- Shorter than 28,000 bp after alignment
- Contains more than 1000 ambiguities after alignment
- Identified as being on a long branch by TreeShrink
- Identified as questionable in the excluded_sequences.tsv file
If your sequence is in GISAID, and was submitted before the date noted in the latest release of the repository, but it is not in the tree, then it was filtered for one of the above reasons.
The topology and branch lengths of the two trees are identical. In both cases, the topology is the best topology estimated by fasttree
using the -fastest
option, which I found to perform the best out of a wide range of potential approaches. The branch lengths represent substitutions per site.
The only difference between the two trees are the support values represented for each branch. One tree (ft_FBP.tree) contains standard bootstrap values (also known as Felsenstein Bootstrap Proportions, or FBP). The other is built using the Transfer Bootstrap Expectation (or TBE). You can read more about these two values and how they differe here and here.
The difference is fairly simple to describe. The FBP value represents the percentage of bootstrap trees in which a split appeared. The TBE value represents the proportion of taxa in a clade that are, on average, stable across bootstrap replicates. The TBE is always at least as high as the FBP. In trees of thousands of taxa (this tree has 10s of 1000s), FBP values can often be very close to zero simply because branches are very short and some taxa can move around a lot. The TBE is more informative in this case, because it measures how many taxa have to be moved to recover a branch in a tree, rather than just the presence/absence of that branch.
For most people this repository won't contain much of use. Instead, most people should use:
This repository exists for two reasons:
- To provide a global phylogeny of SARS-CoV-2 sequences that others might find useful
- To develop and provide reproducible best-practice analyses for estimating that tree
Figuring out the best approach for estimating a very large phylogeny like this is an open question. I have kept notes of methods I have tried and compared here. A big part of the reason for working on this is that my research group develops phylogenetic methods like IQ-TREE, PartitionFinder, and RWTY. We are always working on improving these methods (indeed, some of these methods are used internally by Nextstrain and PANGOLIN), and as such we are working hard to make sure we understand these data as much as possible.
If you want to dig deeper into how we estimate the trees, and/or contribute by trying other methods or making suggestions, please read on. Feel free to use these trees for your work, but please cite this repository, and more importantly all the relevant tools that are used to make the trees.
You should be able to re-run the scripts I provide here, provided you have access to a reasonably large server (e.g. ~500GB RAM and ~50 cores).
To start, you will need some version of conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/
Once you have that, getting all the pacakges you need is simple:
# 1. Clone the repo
git clone https://github.com/roblanf/sarscov2phylo``
# 2. change to the dir
cd sarscov2phylo/
# 3. download and install all the packages you need
conda env create -f environment.yml
# 4. acviate the conda environment with all the packages
conda activate sarscov2phylo
-
Get the latest GISAID data in a single fasta file, e.g.
gisaid.fa
, you can do this via the batch download feature on GISAID, though you will have to become a member to do this. -
Run this:
bash global_tree_gisaid.sh -i gisaid.fasta -o global.fa -t 20 -k 100
You can get more information on what is happening from the scripts themselves, and below.
-
Get the latest GISAID data in a single fasta file, e.g.
gisaid.fa
, you can do this via the batch download feature on GISAID, though you will have to become a member to do this. -
Collect your own sequences into a FASTA file, e.g.
myseqs.fa
-
Run this:
bash global_tree_gisaid_plus.sh -i gisaid.fasta -o global.fa -t 20 -k 100 -a myseqs.fa -d 2
To estimate a global phylogeny for the GISAID sequences, run this script with the following commandline:
bash global_tree_gisaid.sh -i gisaid.fasta -o global.fa -t 20 -k 100
Here's what the code does:
- Fixes known issues with GISAID sequences, include replacing spaces in sequences with N's, and illegal characters in names. (The EPI-ID is never changed, so if you are interested in a particular sequence, it's best to search for it with the EPI-ID)
- Removes sequences that I or others have determined to be questionable (these are documented in excluded_sequences.tsv)
- Trims the low-quality ends from every sequence in the input file (saved as
trimmed.fa
), following advice here: http://virological.org/t/issues-with-sars-cov-2-sequencing-data/473. - Makes an alignment of the 100 most dissimilar (but high-quality, namely no more than 10 ambiguous bases, filtered with
seqmagick
) sequences in the input, then filters sites with >10% gaps from this alignment usingesl-alimask
(saved asaln_k_filtered.fa
) - Creates a global alignment by aligning every sequence to
aln_k_filtered.fa
with MAFFT, parallelised with GNU parallel, and doing various format interconversions with the EASEL tools. - Filters sites from the alignment with >5 % gaps using
esl-alimask
. - Filters sequence from the alignment that are shorter than 28000 bp, or contain >1000 ambiguities using
esl-alimanip
- Estimates a global tree using
fasttree
with the-fastest
setting (why's this good?) - Uses
goalign
and GNUparallel
to make 100 bootstrap alignments, then make trees from these as in step 8. - Creates bootstrap support trees with standard bootstraps and the transfer bootstrap using
raxml-ng
. The latter measure is likely more appropriate for large SARS-CoV-2 trees (see above). - Removes sequences on very long branches from the tree using TreeShrink (these sequences are likely to be either of poor quality and/or poorly aligned, and are subsequently added to the list of excluded seuqences so they are not included in future iterations of the pipeline)
- Roots the trees with nw_utils using sequence hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05, following advice here
This is a slightly different problem from estimating a global tree. Instead of worrying about the global tree, we are now concerned with getting the best possible estimate of relationships around some focal sequence or sequences. To do this, the global_tree_gisaid_plus.sh
does the following:
- Follows steps 1-7 as above, with the exception that none of your focal sequences will be removed in the alignment filtering step. Some of their sites may be removed, but the sequence itself will be retained no matter its length or the number of N's it contains.
- Estimates a global tree using
rapidnj
, which is very good at getting good trees very quickly (why's this good?) - Uses
goalign
and GNUparallel
to estimate 100 bootstraprapidnj
trees. - For each of the focal sequences in your own data, the script then uses the
rapidnj
tree to select a sub-tree that contains at least 100 and at most 1000 of the most closely-related sequences, inbetween those limits, it will cut the tree at the first node that is at least-d
(a user-supplied option) substitutions from the focal sequence. - Re-estimates the local phylogeny around each focal sequence using
raxml-ng
with a GTR+I+G model of sequence evolution with 100 bootstraps. - Creates bootstrap support trees with standard bootstraps and the transfer bootstrap using
raxml-ng
. The latter measure is likely more appropriate for large SARS-CoV-2 trees. - Attempts to make PDFs of each local tree with
Inkscape
, though you will need to install this yourself as theconda
implementation of Inkscape seems to be dead right now.
The scripts rely on a suite of excellent tools. Here are many of them:
mafft
raxml-ng
iqtree
fasttree
rapidnj
- Easel tools like
esl-alimanip
- UCSC fasta tools like
faSomerecords
seqmagick
- GNU
parallel
conda
goalign
nw_utils