Releases: roblanf/sarscov2phylo

09-7-20

10 Jul 23:00

The trees in this release were generated with the following command line:

bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_07_09_04.fasta -o global.fa -t 34

The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 9th of July 2020, at 4PM Canberra (Australia) time.

The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.

Filtering statistics

sequences downloaded from GISAID
39366
//
alignment stats of global alignment
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 38950
Alignment length:    29903
Total # residues:    1162549765
Smallest:            29146
Largest:             29903
Average length:      29847.2
Average identity:    100%
//
alignment stats of global alignment after masking sites
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 38950
Alignment length:    29903
Total # residues:    1157204550
Smallest:            29096
Largest:             29718
Average length:      29710.0
Average identity:    100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 38792
Alignment length:    29903
Total # residues:    1152514216
Smallest:            29096
Largest:             29718
Average length:      29710.1
Average identity:    100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 38792
Alignment length:    29704
Total # residues:    1149590203
Smallest:            28492
Largest:             29704
Average length:      29634.7
Average identity:    100%
//
After filtering sequences with TreeShrink
Type:	Phylogram
#nodes:	69583
#leaves:	38699
#dichotomies:	29578
#leaf labels:	38699
#inner labels:	30882
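
The statistics above are easier to read as the number of sequences dropped at each step. The following is a minimal sketch, not part of the repository: it assumes the log keeps the layout shown above (blocks separated by `//`, a title on the first line of each block), and the names `LOG`, `sequence_counts` and `dropped_per_step` are hypothetical.

```python
import re

# Trimmed copy of the statistics above; only the lines the parser needs are kept.
LOG = """\
sequences downloaded from GISAID
39366
//
alignment stats of global alignment
Number of sequences: 38950
//
alignment stats of global alignment after masking sites
Number of sequences: 38950
//
alignment stats after filtering out short/ambiguous sequences
Number of sequences: 38792
//
alignment stats of global alignment after trimming sites that are >50% gaps
Number of sequences: 38792
//
After filtering sequences with TreeShrink
#leaves:\t38699
"""

# A count is either a bare integer line, or the value of one of these two keys.
COUNT_LINE = re.compile(r"(?:Number of sequences:|#leaves:)\s*(\d+)|(\d+)")


def sequence_counts(log_text):
    """Return [(step title, sequence count)] in the order the steps appear."""
    counts = []
    title = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line == "//":
            title = None  # the next non-empty line starts a new block
            continue
        if not line:
            continue
        if title is None:
            title = line  # first non-empty line of a block is its title
            continue
        match = COUNT_LINE.fullmatch(line)
        if match:
            counts.append((title, int(match.group(1) or match.group(2))))
    return counts


def dropped_per_step(counts):
    """Return [(step title, sequences lost relative to the previous step)]."""
    return [(title, prev - n) for (_, prev), (title, n) in zip(counts, counts[1:])]


for title, lost in dropped_per_step(sequence_counts(LOG)):
    print(f"{lost:6d}  {title}")
```

The same function also works on the full, untrimmed text, because lines such as `Alignment length:` or `#nodes:` do not match the pattern and are ignored.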

Notable changes to the scripts in this release

  • None

Notable aspects of the trees

  • None

30-6-20

02 Jul 22:43

The trees in this release were generated with the following command line:

bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_06_30_00.fasta -o global.fa -t 33

The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 30th of June 2020, at 9AM Canberra (Australia) time.

The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.

Filtering statistics

Note: these are now provided in the alignments.log file if you run the script.

sequences downloaded from GISAID
35083
//
alignment stats of global alignment
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 34708
Alignment length:    29903
Total # residues:    1036091622
Smallest:            29146
Largest:             29903
Average length:      29851.7
Average identity:    100%
//
alignment stats of global alignment after masking sites
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 34708
Alignment length:    29903
Total # residues:    1031170100
Smallest:            29032
Largest:             29718
Average length:      29709.9
Average identity:    100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 34101
Alignment length:    29903
Total # residues:    1013138509
Smallest:            29096
Largest:             29718
Average length:      29709.9
Average identity:    100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 34101
Alignment length:    29704
Total # residues:    1010509076
Smallest:            28492
Largest:             29704
Average length:      29632.8
Average identity:    100%
//
After filtering sequences with TreeShrink
Type:	Phylogram
#nodes:	61276
#leaves:	34097
#dichotomies:	26064
#leaf labels:	34097
#inner labels:	27177
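
The node counts above also indicate how many internal nodes are not strictly bifurcating. This is a small sketch under two assumptions that the release notes do not state: that `#nodes` counts leaves plus internal nodes, and that `#dichotomies` counts internal nodes with exactly two children. The function name is hypothetical.

```python
def non_binary_internal_nodes(nodes, leaves, dichotomies):
    """Internal nodes that do not have exactly two children.

    Assumes nodes = leaves + internal nodes, and that dichotomies
    counts internal nodes with exactly two children.
    """
    internal = nodes - leaves
    return internal - dichotomies


# Figures copied from the TreeShrink-filtered tree statistics above.
print(non_binary_internal_nodes(nodes=61276, leaves=34097, dichotomies=26064))
```

If those assumptions hold, the result is the number of polytomies in the tree (possibly including the root).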

Notable changes to the scripts in this release

  • The ML tree is now estimated in a slightly more rigorous way. Details are in the script.

  • Server time was limited, so only 66 bootstraps were done. This shouldn't affect anything.

Notable aspects of the trees

  • None

24-6-20

26 Jun 02:41

The trees in this release were generated with the following command line:

bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_06_24_01.fasta -o global.fa -t 33

The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 24th of June 2020, at 9AM Canberra (Australia) time.

The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.

Filtering statistics

Note: these are now provided in the alignments.log file if you run the script.

sequences downloaded from GISAID
34281
//
alignment stats of global alignment
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 33949
Alignment length:    29903
Total # residues:    1013441101
Smallest:            26389
Largest:             29903
Average length:      29851.9
Average identity:    100%
//
alignment stats of global alignment after masking sites
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 33949
Alignment length:    29903
Total # residues:    1008614206
Smallest:            26306
Largest:             29718
Average length:      29709.7
Average identity:    100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 33317
Alignment length:    29903
Total # residues:    989842599
Smallest:            29096
Largest:             29718
Average length:      29709.8
Average identity:    100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 33317
Alignment length:    29718
Total # residues:    988221131
Smallest:            28354
Largest:             29718
Average length:      29661.2
Average identity:    100%
//
After filtering sequences with TreeShrink
Type:	Phylogram
#nodes:	59268
#leaves:	33290
#dichotomies:	24897
#leaf labels:	33290
#inner labels:	25976

Notable changes to the scripts in this release

  • This and future releases only include the sequences marked by GISAID as 'complete' and 'high coverage'. This is to ameliorate issues we were having with low-quality sequences. These contained a lot of sequencing error, which caused long-branch attraction and made the phylogenies difficult to estimate reliably.

  • This and future releases include a third tree with a different support measure - the SH supports estimated by fasttree. These support values are correlated, though quite weakly, with the other values, and also tend to be higher than all of the other values. Nevertheless, they are very fast to calculate, so they may become the only support values offered in the future if the datasets get too large to estimate standard bootstrap values like FBP and TBE.

Notable aspects of the trees

  • There are a few long branches in the trees, so proceed with caution. It is not yet clear to me why TreeShrink fails to identify these branches.

14-6-20.2

19 Jun 06:48

This release is an update of the tree released on 14-6-20.

The code to produce the alignments and trees in this release is identical to that release, except for one detail. In this updated release the maximum likelihood tree has been improved by re-running it with the following command line:

fasttree -nosupport -nt -gamma global.fa

This differs from the previous release only in that it does not use fasttree's -fastest option.

I did this because manual examination of the tree from 14-6-20 showed a large and probably misplaced clade. Re-running the tree with the command line above improves the log likelihood substantially, from -382048.463 in the original release to -381841.805 in this updated release. In addition, the probably misplaced clade has moved back to what is likely the correct position at the root of the tree. Regardless, this tree has a better likelihood, so it should be preferred.
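
As a worked check of the size of that improvement, the difference between the two reported log likelihoods can be computed directly. This is only arithmetic on the two figures quoted above; the variable names are hypothetical. Both trees were estimated from the same alignment, so the less negative value indicates the better-fitting tree.

```python
# Log likelihoods quoted in the release notes above.
lnl_original = -382048.463  # 14-6-20 release, run with -fastest
lnl_updated = -381841.805   # this release, run without -fastest

# Positive delta means the updated tree fits the alignment better.
delta = lnl_updated - lnl_original
print(f"Improvement: {delta:.3f} log-likelihood units")
```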

I am exploring improvements to the tree estimation. Releases after 17-6-20 (which was started before this solution was figured out) will include these improvements, and I'll only release the tree from 17-6-20 if the clade misplaced on 14-6-20 is in the correct position.

The bootstrap values on the two trees below were mapped using exactly the same set of bootstrap trees produced in the original 14-6-20 release. And, just as for the 14-6-20 release, the trees were then run through TreeShrink to remove taxa on long branches.

14-6-20

16 Jun 01:22

The trees in this release were generated with the following command line:

bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_06_14_04.fasta -o global.fa -t 30

The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 14th of June 2020, at 9PM Canberra (Australia) time.

The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.

Filtering statistics

Note: these are now provided in the alignments.log file if you run the script.

sequences downloaded from GISAID
46304
//
alignment stats of global alignment
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 46249
Alignment length:    29903
Total # residues:    1367525290
Smallest:            64
Largest:             29903
Average length:      29568.8
Average identity:    100%
//
alignment stats of global alignment after masking sites
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 46249
Alignment length:    29903
Total # residues:    1360867101
Smallest:            64
Largest:             29718
Average length:      29424.8
Average identity:    100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 40627
Alignment length:    29903
Total # residues:    1207019620
Smallest:            28249
Largest:             29718
Average length:      29709.8
Average identity:    100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 40627
Alignment length:    29718
Total # residues:    1200334988
Smallest:            27973
Largest:             29718
Average length:      29545.3
Average identity:    100%
//
After filtering sequences with TreeShrink
Type:	Phylogram
#nodes:	74240
#leaves:	40585
#dichotomies:	32600
#leaf labels:	40585
#inner labels:	33653

Notable changes to the scripts in this release

  • All sequences identified by TreeShrink are now removed from excluded_sequences.tsv, since TreeShrink seemed to be iteratively removing more and more sequences.

Notable aspects of the trees

  • There are some long branches in the trees, and one clade in particular that appears to have a lot of mutations. I am unsure whether this clade (and the branch lengths leading to it) is legitimate. However, I have checked a handful of the sequences in this clade in the alignment (e.g. hCoV-19/Australia/VIC35/2020|EPI_ISL_419755|2020-03-10) and there is nothing obviously odd in the alignment itself. Some caution is warranted around this clade though.

12-6-20

14 Jun 04:08

The trees in this release were generated with the following command line:

bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_06_11_21.fasta -o global.fa -t 33

The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 12th of June 2020, at 9PM Canberra (Australia) time.

The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.

Filtering statistics

Note: these are now provided in the alignments.log file if you run the script.

alignment stats of global alignment
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 45495
Alignment length:    29903
Total # residues:    1344997236
Smallest:            64
Largest:             29903
Average length:      29563.6
Average identity:    100%
//
alignment stats of global alignment after masking sites
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 45495
Alignment length:    29903
Total # residues:    1338465117
Smallest:            64
Largest:             29718
Average length:      29420.0
Average identity:    100%
//
alignment stats after filtering out short/ambiguous sequences
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 39929
Alignment length:    29903
Total # residues:    1179699101
Smallest:            27973
Largest:             29718
Average length:      29544.9
Average identity:    100%
//
alignment stats of global alignment after trimming sites that are >50% gaps
Alignment number:    1
Format:              aligned FASTA
Number of sequences: 39929
Alignment length:    29718
Total # residues:    1179699101
Smallest:            27973
Largest:             29718
Average length:      29544.9
Average identity:    100%
//
After filtering sequences with TreeShrink
Type:	Phylogram
#nodes:	72867
#leaves:	39850
#dichotomies:	31971
#leaf labels:	39850
#inner labels:	33015

Notable changes to the scripts in this release

  • Alignments are now done by profile aligning direct to WH1, to preserve information on base positions
  • Trees now include polytomies
  • gotree is now used to calculate TBE and FBP values, because it can handle polytomies
  • Only sites that are >50% gaps are removed. This is done after changing N's to gaps in the alignment (see readme for details). In practice so far this means that only those sites which are first masked are then removed
  • The scripts were fairly extensively changed, so it is worth going through the readme because the methods are now quite different
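
The gap-trimming step described in the list above can be illustrated as follows. This is a minimal sketch of the idea only, not the repository's implementation (see the readme for that). It assumes that a site is removed only when strictly more than 50% of sequences have a gap there, and the function name and the toy alignment are hypothetical.

```python
def trim_gappy_sites(seqs, max_gap_fraction=0.5):
    """Convert N to gaps, then drop columns whose gap fraction exceeds the limit.

    seqs: dict mapping sequence name -> aligned sequence (all the same length).
    """
    # Treat N (either case) as missing data by turning it into a gap.
    masked = {
        name: s.replace("N", "-").replace("n", "-") for name, s in seqs.items()
    }
    n_seqs = len(masked)
    length = len(next(iter(masked.values())))

    # Keep a column unless MORE than max_gap_fraction of its characters are gaps.
    keep = [
        i
        for i in range(length)
        if sum(s[i] == "-" for s in masked.values()) / n_seqs <= max_gap_fraction
    ]
    return {name: "".join(s[i] for i in keep) for name, s in masked.items()}


# Toy alignment: the fourth column is 75% gaps once N is converted, so it is removed.
toy = {"a": "ACGTN", "b": "A-GNN", "c": "AC-NT", "d": "A--NT"}
for name, seq in trim_gappy_sites(toy).items():
    print(name, seq)
```

In the toy alignment, columns that are exactly 50% gaps are kept, which matches the ">50% gaps" wording above. Note that N characters in the retained columns stay converted to gaps in the output.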

Notable aspects of the trees

  • There are a couple of samples on long branches that were not removed by TreeShrink. Some caution is warranted with these samples.

8-6-20

09 Jun 02:18

The trees in this release were generated with the following command line:

bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_06_07_23.fasta -o global.fa -t 35 -k 100

The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 8th of June 2020, at 9AM Canberra (Australia) time.

The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.

Filtering statistics

  • 41449 sequences downloaded
  • Initial alignment retains 41274 sequences and 29750 sites
  • After filtering gappy sites, alignment retains 29748 sites
  • After filtering sequences on length and ambiguity, alignment retains 36286 sequences
  • After removing sequences on long branches with TreeShrink, final tree retains 36257 sequences

Notable changes to the scripts in this release

None

Notable aspects of the trees

None

6-6-20

07 Jun 04:36

The trees in this release were generated with the following command line:

bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_06_06_00.fasta -o global.fa -t 35 -k 100

The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 6th of June 2020, Australia time.

The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.

Description of filtering steps:

  • 39954 sequences downloaded
  • Initial alignment retains 39807 sequences and 29750 sites
  • After filtering gappy sites, alignment retains 29739 sites
  • After filtering sequences on length and ambiguity, alignment retains 34944 sequences
  • After removing sequences on long branches with TreeShrink, final tree retains 34915 sequences

Notable changes to the scripts in this release:

Notable aspects of the trees:

  • Note that one sequence on a long branch remains in this tree. Usually I would expect this sequence to be removed by TreeShrink. The name of the sequence is hCoV-19/Canada/QGLO-021/2020|EPI_ISL_459878|2020-03-27. It is not clear at this stage why this sequence is on a long branch.

3-6-20

04 Jun 09:56

The trees in this release were generated with the following command line:

bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_06_02_07.fasta -o global.fa -t 35 -k 100

The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 3rd of June 2020, Australia time.

The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.

2-6-20

03 Jun 10:46

The trees in this release were generated with the following command line:

bash global_tree_gisaid.sh -i gisaid_hcov-19_2020_06_02_07.fasta -o global.fa -t 35 -k 100

The raw sequence file contains all SARS-CoV-2 genomes available in GISAID on the 2nd of June 2020, Australia time.

The ZIP file contains the code necessary to reproduce the trees themselves, and the README in the zip file also describes the methods used in detail. I also include the trees themselves here so that they can be easily downloaded without downloading the entire repo.