Releases: roblanf/sarscov2phylo
14-09-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo DOI: 10.5281/zenodo.3958883
If you publish papers that use this tree you must still follow the GISAID data sharing and attribution rules.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i [gisaid.fasta] -p [previous_iteration] -t 250
- [gisaid.fasta] is the FASTA file of high-coverage, complete raw sequences from GISAID up to and including the date in the title of the release, determined by the 'submission date' filter on GISAID.
- [previous_iteration] is the filepath of the previous release; it is used to provide the excluded_sequences.tsv and ft_SH.tree files as the starting points of the current iteration.
Filtering statistics
sequences downloaded from GISAID
71379
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 69034
Alignment length: 29903
Total # residues: 2057839570
Smallest: 29018
Largest: 29903
Average length: 29809.1
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 69034
Alignment length: 29903
Total # residues: 2047985731
Smallest: 28922
Largest: 29675
Average length: 29666.3
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 68844
Alignment length: 29903
Total # residues: 2042358563
Smallest: 28922
Largest: 29675
Average length: 29666.5
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 68844
Alignment length: 29660
Total # residues: 2037002535
Smallest: 28437
Largest: 29660
Average length: 29588.7
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 122241
#leaves: 68826
#dichotomies: 50928
#leaf labels: 68826
#inner labels: 48062
Number of new sequences added this iteration
3035 alignment_names_new.txt
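The counts above can be turned into per-stage losses with a little arithmetic. The following is only a sketch: the stage labels and the `dropped` helper are illustrative names, not part of the release scripts, and it assumes each leaf in the final tree corresponds to one sequence. The statistics do not state why the count falls between download and alignment, so that step is labelled generically.

```shell
#!/usr/bin/env bash
# Counts copied from the 14-09-20 filtering statistics above.
downloaded=71379   # sequences downloaded from GISAID
aligned=69034      # sequences in the global alignment
filtered=68844     # after filtering out short/ambiguous sequences
treeshrink=68826   # leaves remaining after TreeShrink

# Hypothetical helper: report how many sequences were lost between two stages.
dropped() {
    local label=$1 before=$2 after=$3
    printf '%s: %d removed (%d -> %d)\n' \
        "$label" "$((before - after))" "$before" "$after"
}

dropped "download to alignment" "$downloaded" "$aligned"
dropped "short/ambiguous filter" "$aligned" "$filtered"
dropped "TreeShrink" "$filtered" "$treeshrink"
```

For this release that gives 2345, 190 and 18 sequences removed at the three stages respectively.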
Notable changes to the scripts in this release
- Number of SPR moves reduced from 10 to 2, after some benchmarking to determine whether this was sensible. I'll keep an eye on it, but my analyses suggest that 1 round of SPR moves should be enough to find and fix any large issues in the tree, so 2 is still conservative.
Notable aspects of the trees
- None
12-09-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo DOI: 10.5281/zenodo.3958883
If you publish papers that use this tree you must still follow the GISAID data sharing and attribution rules.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i [gisaid.fasta] -p [previous_iteration] -t 250
- [gisaid.fasta] is the FASTA file of high-coverage, complete raw sequences from GISAID up to and including the date in the title of the release, determined by the 'submission date' filter on GISAID.
- [previous_iteration] is the filepath of the previous release; it is used to provide the excluded_sequences.tsv and ft_SH.tree files as the starting points of the current iteration.
Filtering statistics
sequences downloaded from GISAID
70577
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 68259
Alignment length: 29903
Total # residues: 2034721148
Smallest: 29018
Largest: 29903
Average length: 29808.8
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 68259
Alignment length: 29903
Total # residues: 2024989109
Smallest: 28922
Largest: 29675
Average length: 29666.3
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 68069
Alignment length: 29903
Total # residues: 2019361941
Smallest: 28922
Largest: 29675
Average length: 29666.4
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 68069
Alignment length: 29660
Total # residues: 2014032896
Smallest: 28437
Largest: 29660
Average length: 29588.1
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 120899
#leaves: 68019
#dichotomies: 50424
#leaf labels: 68019
#inner labels: 47601
Number of new sequences added this iteration
2647 alignment_names_new.txt
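The tree statistics for this release can be read a little further with simple arithmetic. This sketch assumes the usual convention that #nodes counts leaves plus internal nodes and that a dichotomy is an internal node with exactly two children; under that assumption the remainder are internal nodes that are not strictly bifurcating. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Values copied from the 12-09-20 tree statistics above.
nodes=120899
leaves=68019
dichotomies=50424

# Internal nodes are everything that is not a leaf (assumed convention).
internal=$((nodes - leaves))

# Internal nodes that are not dichotomies, i.e. not strictly bifurcating.
non_binary=$((internal - dichotomies))

echo "internal nodes: $internal"
echo "non-dichotomous internal nodes: $non_binary"
```

With these numbers the tree has 52880 internal nodes, of which 2456 are not dichotomies.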
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None
10-09-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo DOI: 10.5281/zenodo.3958883
If you publish papers that use this tree you must still follow the GISAID data sharing and attribution rules.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i [gisaid.fasta] -p [previous_iteration] -t 250
- [gisaid.fasta] is the FASTA file of high-coverage, complete raw sequences from GISAID up to and including the date in the title of the release, determined by the 'submission date' filter on GISAID.
- [previous_iteration] is the filepath of the previous release; it is used to provide the excluded_sequences.tsv and ft_SH.tree files as the starting points of the current iteration.
Filtering statistics
sequences downloaded from GISAID
67933
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 65650
Alignment length: 29903
Total # residues: 1957009665
Smallest: 29018
Largest: 29903
Average length: 29809.7
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 65650
Alignment length: 29903
Total # residues: 1947568720
Smallest: 28922
Largest: 29675
Average length: 29665.9
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 65460
Alignment length: 29903
Total # residues: 1941941552
Smallest: 28922
Largest: 29675
Average length: 29666.1
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 65460
Alignment length: 29661
Total # residues: 1936801553
Smallest: 28437
Largest: 29661
Average length: 29587.6
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 116487
#leaves: 65422
#dichotomies: 48748
#leaf labels: 65422
#inner labels: 46052
Number of new sequences added this iteration
1598 alignment_names_new.txt
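The trimming step reported above removes alignment columns in which more than 50% of the sequences have a gap, which is why the alignment length falls from 29903 to 29661. The toy sketch below illustrates that rule only; it is not the code used in the release. It assumes one aligned sequence per line with no FASTA headers, treats only `-` as a gap, and `trim_gappy_columns` is a made-up name.

```shell
#!/usr/bin/env bash
# Toy illustration: drop every column whose gap fraction exceeds 50%.
# Input: one aligned sequence per line (equal lengths), read from stdin.
trim_gappy_columns() {
    awk '
        {
            seqs[NR] = $0; n = NR; len = length($0)
            # Count gaps per column.
            for (i = 1; i <= len; i++)
                if (substr($0, i, 1) == "-") gaps[i]++
        }
        END {
            for (r = 1; r <= n; r++) {
                out = ""
                # Keep a column when gaps make up at most half of the rows.
                for (i = 1; i <= len; i++)
                    if (gaps[i] * 2 <= n) out = out substr(seqs[r], i, 1)
                print out
            }
        }'
}

# Column 2 is exactly 50% gaps (kept); column 4 is 75% gaps (removed).
printf 'ACG-T\nA-G-T\nACG-A\nA-GTT\n' | trim_gappy_columns
```

On the four toy sequences this prints `ACGT`, `A-GT`, `ACGA` and `A-GT`, one per line.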
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None
08-09-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo DOI: 10.5281/zenodo.3958883
If you publish papers that use this tree you must still follow the GISAID data sharing and attribution rules.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i [gisaid.fasta] -p [previous_iteration] -t 250
- [gisaid.fasta] is the FASTA file of high-coverage, complete raw sequences from GISAID up to and including the date in the title of the release, determined by the 'submission date' filter on GISAID.
- [previous_iteration] is the filepath of the previous release; it is used to provide the excluded_sequences.tsv and ft_SH.tree files as the starting points of the current iteration.
Filtering statistics
sequences downloaded from GISAID
67163
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 64994
Alignment length: 29903
Total # residues: 1937466703
Smallest: 29018
Largest: 29903
Average length: 29809.9
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 64994
Alignment length: 29903
Total # residues: 1928107701
Smallest: 28922
Largest: 29675
Average length: 29665.9
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 64804
Alignment length: 29903
Total # residues: 1922480533
Smallest: 28922
Largest: 29675
Average length: 29666.1
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 64804
Alignment length: 29661
Total # residues: 1917372682
Smallest: 28437
Largest: 29661
Average length: 29587.3
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 115234
#leaves: 64748
#dichotomies: 48199
#leaf labels: 64748
#inner labels: 45525
Number of new sequences added this iteration
203 alignment_names_new.txt
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None
06-09-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo DOI: 10.5281/zenodo.3958883
If you publish papers that use this tree you must still follow the GISAID data sharing and attribution rules.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i [gisaid.fasta] -p [previous_iteration] -t 250
- [gisaid.fasta] is the FASTA file of high-coverage, complete raw sequences from GISAID up to and including the date in the title of the release, determined by the 'submission date' filter on GISAID.
- [previous_iteration] is the filepath of the previous release; it is used to provide the excluded_sequences.tsv and ft_SH.tree files as the starting points of the current iteration.
Filtering statistics
sequences downloaded from GISAID
66966
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 64801
Alignment length: 29903
Total # residues: 1931706492
Smallest: 29018
Largest: 29903
Average length: 29809.8
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 64801
Alignment length: 29903
Total # residues: 1922380482
Smallest: 28922
Largest: 29675
Average length: 29665.9
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 64613
Alignment length: 29903
Total # residues: 1916812664
Smallest: 28922
Largest: 29675
Average length: 29666.1
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 64613
Alignment length: 29661
Total # residues: 1911717961
Smallest: 28437
Largest: 29661
Average length: 29587.2
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 114973
#leaves: 64606
#dichotomies: 48085
#leaf labels: 64606
#inner labels: 45419
Number of new sequences added this iteration
390 alignment_names_new.txt
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None
04-09-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo DOI: 10.5281/zenodo.3958883
If you publish papers that use this tree you must still follow the GISAID data sharing and attribution rules.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i [gisaid.fasta] -p [previous_iteration] -t 250
- [gisaid.fasta] is the FASTA file of high-coverage, complete raw sequences from GISAID up to and including the date in the title of the release, determined by the 'submission date' filter on GISAID.
- [previous_iteration] is the filepath of the previous release; it is used to provide the excluded_sequences.tsv and ft_SH.tree files as the starting points of the current iteration.
Filtering statistics
sequences downloaded from GISAID
66579
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 64479
Alignment length: 29903
Total # residues: 1922105980
Smallest: 29018
Largest: 29903
Average length: 29809.8
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 64479
Alignment length: 29903
Total # residues: 1912825265
Smallest: 28922
Largest: 29675
Average length: 29665.9
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 64291
Alignment length: 29903
Total # residues: 1907257447
Smallest: 28922
Largest: 29675
Average length: 29666.0
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 64291
Alignment length: 29661
Total # residues: 1902176878
Smallest: 28437
Largest: 29661
Average length: 29587.0
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 114331
#leaves: 64223
#dichotomies: 47841
#leaf labels: 64223
#inner labels: 45197
Number of new sequences added this iteration
4053 alignment_names_new.txt
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None
02-09-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo DOI: 10.5281/zenodo.3958883
If you publish papers that use this tree you must still follow the GISAID data sharing and attribution rules.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i [gisaid.fasta] -p [previous_iteration] -t 250
- [gisaid.fasta] is the FASTA file of high-coverage, complete raw sequences from GISAID up to and including the date in the title of the release, determined by the 'submission date' filter on GISAID.
- [previous_iteration] is the filepath of the previous release; it is used to provide the excluded_sequences.tsv and ft_SH.tree files as the starting points of the current iteration.
Filtering statistics
sequences downloaded from GISAID
63120
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 61032
Alignment length: 29903
Total # residues: 1820588175
Smallest: 29018
Largest: 29903
Average length: 29830.1
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 61032
Alignment length: 29903
Total # residues: 1810608002
Smallest: 28922
Largest: 29675
Average length: 29666.5
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 60754
Alignment length: 29903
Total # residues: 1802371557
Smallest: 28922
Largest: 29675
Average length: 29666.7
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 60754
Alignment length: 29661
Total # residues: 1798186405
Smallest: 28437
Largest: 29661
Average length: 29597.8
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 107870
#leaves: 60735
#dichotomies: 44969
#leaf labels: 60735
#inner labels: 42461
Number of new sequences added this iteration
1739 alignment_names_new.txt
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None
30-08-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo DOI: 10.5281/zenodo.3958883
If you publish papers that use this tree you must still follow the GISAID data sharing and attribution rules.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i [gisaid.fasta] -p [previous_iteration] -t 250
- [gisaid.fasta] is the FASTA file of high-coverage, complete raw sequences from GISAID up to and including the date in the title of the release, determined by the 'submission date' filter on GISAID.
- [previous_iteration] is the filepath of the previous release; it is used to provide the excluded_sequences.tsv and ft_SH.tree files as the starting points of the current iteration.
Filtering statistics
sequences downloaded from GISAID
61969
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 59968
Alignment length: 29903
Total # residues: 1788904459
Smallest: 29018
Largest: 29903
Average length: 29831.0
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 59968
Alignment length: 29903
Total # residues: 1779071662
Smallest: 28922
Largest: 29675
Average length: 29667.0
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 59692
Alignment length: 29903
Total # residues: 1770894567
Smallest: 28922
Largest: 29675
Average length: 29667.2
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 59692
Alignment length: 29661
Total # residues: 1766765393
Smallest: 28437
Largest: 29661
Average length: 29598.0
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 106041
#leaves: 59691
#dichotomies: 44219
#leaf labels: 59691
#inner labels: 41753
Number of new sequences added this iteration
149 alignment_names_new.txt
Notable changes to the scripts in this release
- I have further automated the process so that the excluded_sequences.tsv file auto-updates. This removes the last human interaction that was required to run the iterative script. The key difference now is that this file and the ft_SH.tree file are copied from the previous iteration (specified via -p) at the start of the analysis.
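As a rough illustration of the copy step described above, the sketch below seeds a working directory with the two files from a previous iteration. It is not the code in global_tree_gisaid_start_tree.sh: the function name `seed_from_previous` and its arguments are hypothetical, and the real script may handle missing files differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the two carried-over files from the previous
# iteration's directory into the current working directory.
seed_from_previous() {
    local previous=$1      # directory of the previous release (the -p value)
    local current=${2:-.}  # directory of the current iteration
    local f
    for f in excluded_sequences.tsv ft_SH.tree; do
        # Stop early if the previous iteration lacks a required file.
        if [ ! -f "$previous/$f" ]; then
            echo "missing $previous/$f" >&2
            return 1
        fi
        cp "$previous/$f" "$current/$f"
    done
}
```

Calling `seed_from_previous /path/to/previous_release .` before the analysis starts would place both files in the current directory, or fail with a message if either is absent.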
Notable aspects of the trees
- None
28-08-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 from GISAID data, including sequences deposited up to 28-August-2020. Zenodo DOI: 10.5281/zenodo.3958883
Please note - you cannot publish papers that use this tree without following the GISAID data sharing and attribution rules. These rules are important - they protect the data uploaders, and create trust in a global system of data sharing with potentially vast public health benefits. By building and maintaining trust we ensure that people keep sharing their data, and that the public health benefits keep flowing. I do not want the existence of this tree to be some kind of attribution laundering service (e.g. where people feel free to use the tree without following the GISAID data sharing rules), so please don't use it in that way. For example, if you are going to interpret other people's data from GISAID and publish the results, including by using this tree, you should get in touch with the people that submitted the data. The code in this repo is covered by the GNU license, and you can use that however you like.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i gisaid_hcov-19_2020_08_31_23.fasta -o global.fa -s ft_SH_26-08-20.tree -t 250
The raw sequence file contains all available SARS-CoV-2 genomes in GISAID available on the 28th of August 2020, determined by the 'submission date' filter on GISAID.
The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the tree itself here so that it can be easily downloaded without downloading the entire repo. The file 'ft_SH_26-8-20.tree' is the 'ft_SH.tree' file from the 26-8-20 release.
The lnL of the final tree is: -564605.798
Filtering statistics
sequences downloaded from GISAID
61880
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 59889
Alignment length: 29903
Total # residues: 1786554388
Smallest: 29018
Largest: 29903
Average length: 29831.1
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 59889
Alignment length: 29903
Total # residues: 1776728654
Smallest: 28922
Largest: 29675
Average length: 29667.0
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 59612
Alignment length: 29903
Total # residues: 1768521884
Smallest: 28922
Largest: 29675
Average length: 29667.2
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 59612
Alignment length: 29661
Total # residues: 1764399545
Smallest: 28437
Largest: 29661
Average length: 29598.1
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 105877
#leaves: 59600
#dichotomies: 44150
#leaf labels: 59600
#inner labels: 41689
Number of new sequences added this iteration
1097 alignment_names_new.txt
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None
26-08-20
Citation and reuse
Please cite this release as:
Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 from GISAID data, including sequences deposited up to 26-August-2020. Zenodo DOI: 10.5281/zenodo.3958883
Please note - you cannot publish papers that use this tree without following the GISAID data sharing and attribution rules. These rules are important - they protect the data uploaders, and create trust in a global system of data sharing with potentially vast public health benefits. By building and maintaining trust we ensure that people keep sharing their data, and that the public health benefits keep flowing. I do not want the existence of this tree to be some kind of attribution laundering service (e.g. where people feel free to use the tree without following the GISAID data sharing rules), so please don't use it in that way. For example, if you are going to interpret other people's data from GISAID and publish the results, including by using this tree, you should get in touch with the people that submitted the data. The code in this repo is covered by the GNU license, and you can use that however you like.
Details
The trees in this release were generated with the following command line:
bash global_tree_gisaid_start_tree.sh -i gisaid_hcov-19_2020_08_28_23.fasta -o global.fa -s ft_SH_24-08-20.tree -t 250
The raw sequence file contains all available SARS-CoV-2 genomes in GISAID available on the 26th of August 2020, determined by the 'submission date' filter on GISAID.
The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the tree itself here so that it can be easily downloaded without downloading the entire repo. The file 'ft_SH_24-8-20.tree' is the 'ft_SH.tree' file from the 24-8-20 release.
The lnL of the final tree is: -554592.549
Filtering statistics
sequences downloaded from GISAID
60881
//
alignment stats of global alignment
Alignment number: 1
Format: aligned FASTA
Number of sequences: 58971
Alignment length: 29903
Total # residues: 1759206607
Smallest: 29018
Largest: 29903
Average length: 29831.7
Average identity: 100%
//
alignment stats of global alignment after masking sites
Alignment number: 1
Format: aligned FASTA
Number of sequences: 58971
Alignment length: 29903
Total # residues: 1749488987
Smallest: 28922
Largest: 29675
Average length: 29666.9
Average identity: 100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number: 1
Format: aligned FASTA
Number of sequences: 58694
Alignment length: 29903
Total # residues: 1741282217
Smallest: 28922
Largest: 29675
Average length: 29667.1
Average identity: 100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number: 1
Format: aligned FASTA
Number of sequences: 58694
Alignment length: 29662
Total # residues: 1737280916
Smallest: 28437
Largest: 29662
Average length: 29599.0
Average identity: 100%
//
After filtering sequences with TreeShrink
Type: Phylogram
#nodes: 104219
#leaves: 58611
#dichotomies: 43512
#leaf labels: 58611
#inner labels: 41070
Number of new sequences added this iteration
5036 alignment_names_new.txt
Notable changes to the scripts in this release
- None
Notable aspects of the trees
- None