Comparing changes

base repository: roblanf/sarscov2phylo
base: 09-7-20
head repository: roblanf/sarscov2phylo
compare: master

Commits on Jul 10, 2020

  1. Update excluded_sequences.tsv

    roblanf committed Jul 10, 2020
    6b29d92

Commits on Jul 12, 2020

  1. c0e9463

Commits on Jul 24, 2020

  1. 3dd06e6
  2. 6aca857
  3. add citation info

    roblanf committed Jul 24, 2020
    423efd1

Commits on Aug 3, 2020

  1. update acknowledgements

    roblanf committed Aug 3, 2020
    56b79e5
  2. update trees

    roblanf committed Aug 3, 2020
    c96e16e
  3. 10d6272

Commits on Aug 7, 2020

  1. 763e264
  2. compare IQ-TREE and fasttree

    details stil coming, but fasttree is still the best for now I think.
    roblanf committed Aug 7, 2020
    c95eb39
  3. update user info

    roblanf committed Aug 7, 2020
    09c207a
  4. Create .gitignore

    roblanf committed Aug 7, 2020
    7a8716c
  5. Create add_seqs_to_global_tree.sh

    This was a first attempt at this, but it lacks robustness, so I think I need to redesign it.
    roblanf committed Aug 7, 2020
    1fa8337

Commits on Aug 14, 2020

  1. update iqtreevsfasttree

    roblanf committed Aug 14, 2020
    fc708bd
  2. 5d07d0b
  3. Update iqtree_sequential.md

    roblanf committed Aug 14, 2020
    ac1a4e5

Commits on Aug 15, 2020

  1. first attempt at a full iterative approach

    something like online phylogenetics, or at least, a start.
    roblanf committed Aug 15, 2020
    de362fe
  2. Update iqtree_sequential.md

    roblanf committed Aug 15, 2020
    9695d90
  3. fix typo

    roblanf committed Aug 15, 2020
    e77790f
  4. 428d8c9
  5. 52d0a13

Commits on Aug 17, 2020

  1. user output for R script

    really it's for debugging....
    roblanf committed Aug 17, 2020
    a42b42a
  2. more debug

    roblanf committed Aug 17, 2020
    5d79bbe
  3. debugged?

    roblanf committed Aug 17, 2020
    14442bd
  4. update acknowledgements

    roblanf committed Aug 17, 2020
    5d3aaa7
  5. 412cc61
  6. Update README.md

    roblanf committed Aug 17, 2020
    d39ef54
  7. update tree

    roblanf committed Aug 17, 2020
    1343840
  8. update excluded sequences

    with sequences from 12/8/20
    roblanf committed Aug 17, 2020
    e3a48da
  9. 8dbc82a
  10. log fasttree progress

    in case things like lnL values are useful in future
    roblanf committed Aug 17, 2020
    df94cb7

Commits on Aug 18, 2020

  1. Update iqtree_vs_fasttree.md

    roblanf committed Aug 18, 2020
    400144d
  2. 06156ef

Commits on Aug 19, 2020

  1. Create LICENSE

    roblanf authored Aug 19, 2020

    696b9dc
  2. Merge pull request #14 from roblanf/add-license-1

    Create LICENSE
    roblanf authored Aug 19, 2020
    8f2033b

Commits on Aug 20, 2020

  1. Update excluded_sequences.tsv

    roblanf committed Aug 20, 2020
    a0a988a
  2. 0be50ee

Commits on Aug 21, 2020

  1. c854ccb
  2. Update excluded_sequences.tsv

    roblanf committed Aug 21, 2020
    175b74b
  3. Update ft_SH.tree

    roblanf committed Aug 21, 2020
    fe864ee
  4. 5ece9d2

Commits on Aug 22, 2020

  1. Update ft_SH.tree

    roblanf committed Aug 22, 2020
    b8e6868

Commits on Aug 28, 2020

  1. 4b033af
  2. Update excluded_sequences.tsv

    roblanf committed Aug 28, 2020
    b7b646e
  3. Update environment.yml

    running it on a new server made me realise I left out some R dependencies. Sorry to users...!
    roblanf committed Aug 28, 2020
    6bbf449
  4. Update ft_SH.tree

    roblanf committed Aug 28, 2020
    ba099fa
  5. Update excluded_sequences.tsv

    roblanf committed Aug 28, 2020
    aae469f

Commits on Aug 31, 2020

  1. update tree

    roblanf committed Aug 31, 2020
    047539c
  2. Update excluded_sequences.tsv

    roblanf committed Aug 31, 2020
    2c788f3

Commits on Sep 2, 2020

  1. update acknowlegements

    roblanf committed Sep 2, 2020
    343518a
Showing 47 changed files with 5,317 additions and 626 deletions.
  1. +3 −0 .gitignore
  2. +13 −0 .gitlab-ci.yml
  3. +674 −0 LICENSE
  4. +44 −15 README.md
  5. +80 −0 SPR_analysis.md
  6. BIN acknowledgements/gisaid_hcov-19_acknowledgement_tableJuly2nd_to_July9th.pdf
  7. BIN ...e19th2020_to_July2nd2020.pdf → gisaid_hcov-19_acknowledgement_table_2020-06-19_to_2020-07-02.pdf}
  8. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-07-02_to_2020-07-12.pdf
  9. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-07-12_to_2020-07-24.pdf
  10. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-07-24_to_2020-07-31.pdf
  11. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-07-31_to_2020-08-07.pdf
  12. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-08-07_to_2020-08-14.pdf
  13. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-08-14_to_2020-08-21.pdf
  14. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-08-21_to_2020-08-28.pdf
  15. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-08-28_to_2020-09-11.pdf
  16. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-09-11_to_2020-09-18.pdf
  17. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-09-18_to_2020-09-25.pdf
  18. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-09-25_to_2020-09-27.pdf
  19. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-09-28_to_2020-09-28.pdf
  20. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-09-29_to_2020-10-04.pdf
  21. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-10-05_to_2020-10-14.pdf
  22. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-10-15_to_2020-10-21.pdf
  23. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-10-22_to_2020-10-26.pdf
  24. +16 −0 ci/Dockerfile
  25. +7 −0 environment.yml
  26. +3,313 −509 excluded_sequences.tsv
  27. +170 −0 fasttreeOMP.md
  28. BIN ftOMP.xlsx
  29. +0 −1 ft_FBP.tree
  30. +1 −1 ft_SH.tree
  31. +0 −1 ft_TBE.tree
  32. +191 −0 iqtree_sequential.md
  33. +193 −0 iqtree_vs_fasttree.md
  34. +15 −0 onlinephylo.md
  35. BIN onlinephylo.xlsx
  36. +144 −0 scripts/QC.R
  37. +26 −0 scripts/add_seqs_to_global_tree.sh
  38. +40 −0 scripts/add_seqs_to_tree.sh
  39. +2 −2 scripts/clean_gisaid.sh
  40. +28 −0 scripts/clean_tree.R
  41. +1 −1 scripts/global_profile_alignment.sh
  42. +203 −0 scripts/global_tree_gisaid_start_tree.sh
  43. +23 −0 scripts/lsd2.sh
  44. +0 −91 scripts/tree_ft.sh
  45. +22 −0 scripts/update_excluded_seqs.R
  46. BIN spr_iterations.png
  47. +108 −5 tree_estimation2.md
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@

~$ft_models.xlsx
scripts/add_seqs_to_global_tree.sh
13 changes: 13 additions & 0 deletions .gitlab-ci.yml
@@ -0,0 +1,13 @@
stages:
- build_image
- test_env

build_image:
stage: build_image
script:
- docker build -t sarscov2phylo:$CI_COMMIT_SHORT_SHA -f ci/Dockerfile .

test_env:
stage: test_env
script:
- docker run sarscov2phylo:$CI_COMMIT_SHORT_SHA conda run -n sarscov2phylo esl-alimanip -h
674 changes: 674 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

59 changes: 44 additions & 15 deletions README.md
@@ -1,24 +1,49 @@
# Before you start - know that this code is not state of the art

This repository is no longer maintained. I leave it here for reference only (the rest of this readme is pretty much unchanged).

If you want to estimate your own trees of SARS-CoV-2, please look into:

* UShER and matOptimize: https://usher-wiki.readthedocs.io/en/latest/matOptimize.html
* MAPLE: https://github.com/NicolaDM/MAPLE
* Anything even better which might have come along since I wrote this (May 8th 2023)















# A global phylogeny of SARS-CoV-2

This repository provides a regularly-updated global phylogeny of SARS-CoV-2 data from GISAID, as well as reproducible methods with which to estimate it.
This repository used to provide a regularly-updated global phylogeny of SARS-CoV-2 data from GISAID, as well as reproducible methods with which to estimate it.

We gratefully acknowledge the authors, originating and submitting laboratories of the sequences from GISAID’s EpiFlu™ Database on which this research is based. The list is detailed in [this file](https://github.com/roblanf/sarscov2phylo/blob/master/acknowledgements/).

# Key files
# Can I use it and cite it?

Yes. Every new release from 22-July-20 to 13-November-20 is available at this DOI: [![DOI](https://zenodo.org/badge/260124648.svg)](https://zenodo.org/badge/latestdoi/260124648)

The latest tree with the associated code to reproduce that tree can always be obtained from the latest release of this repository [here](https://github.com/roblanf/sarscov2phylo/releases/latest). Releases are named by the date (in Australian time) on which the sequences were downloaded from GISAID.
Releases subsequent to 13-November-20 are available on the GISAID website. To obtain the trees and alignments from GISAID you'll need a membership, but this is free and new requests are processed quickly. Once you have a GISAID membership, follow these steps to get the tree and the corresponding alignment:

For convenience, you can also get the latest results via the following links:
1. Log in to GISAID using your GISAID login
2. In the top menu bar, click the "EpiCoV" tab
3. Click the "Audacity" tile (with a big rainbow tree on it, near the bottom) to get the tree
4. Click the "Downloads" link just under the "EpiCoV" tab, then choose the alignment you want to download

* [Latest global phylogeny with Transfer Bootstrap Supports](https://github.com/roblanf/sarscov2phylo/blob/master/ft_TBE.tree)
* [Latest global phylogeny with Felsenstein Bootstrap Supports](https://github.com/roblanf/sarscov2phylo/blob/master/ft_FBP.tree)
* [Latest global phylogeny with fasttree SH Supports](https://github.com/roblanf/sarscov2phylo/blob/master/ft_SH.tree)
* [Acknowledgements file for those that upload to GISAID](https://github.com/roblanf/sarscov2phylo/blob/master/acknowledgements/)
* [Latest version of the script to produce a global tree](https://github.com/roblanf/sarscov2phylo/blob/master/scripts/global_tree_gisaid.sh)
The alignment that is used to make the tree is the one labelled "MSA masked" followed by the date. Older alignments are in the 'Archive' box.

Privacy rules around the alignments themselves mean that they cannot be released here. The alignments can be recreated by following the steps described below. If you are a GISAID member and would like a copy of the alignment for any specific tree in the releases, please email me and I'll share it with you.
# Key files

Also, please note that the script to produce the global tree is continually modified as new data are released, so if you are looking at the master branch it may be the case that the latest code will not exactly reproduce the latest tree. If you would like to reproduce the latest tree, please head over to the [latest release](https://github.com/roblanf/sarscov2phylo/releases/latest), which contains the latest tree and the version of the code used to produce it.
The latest tree and alignment, with the associated code to reproduce that tree, can always be obtained from GISAID (see above).

![Part of global SARS-CoV-2 phylogeny](https://github.com/roblanf/sarscov2phylo/blob/master/tree_image.jpg)

@@ -38,7 +63,9 @@ Sequences are filtered out for a few reasons:

If your sequence is in GISAID, and was submitted before the date noted in the [latest release of the repository](https://github.com/roblanf/sarscov2phylo/releases/latest), but it is not in the tree, then it was filtered for one of the above reasons.

# Why are there three trees, and what are all the numbers?
# Why are there three trees in some releases, and what are all the numbers?

Up to and including 31-7-20, each release contains three trees. This section describes the differences between those trees. After 31-7-20 there is only one tree in each release.

The topology and branch lengths of the three trees are identical. In all cases, the topology is the best topology estimated by `fasttree` with options tuned specifically for this dataset, see [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md) and [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation2.md). The branch lengths represent substitutions per site. You will see that if you multiply the branch lengths by about 30,000 (which is roughly the length of the alignments) many of the branch lengths are close to integers. That's because there's very little variation in these sequences, meaning that many branches have some integer number of changes inferred on them.

@@ -89,6 +116,8 @@ conda env create -f environment.yml
conda activate sarscov2phylo
```

Finally you will need to compile and install the MP version of FastTree from here: http://www.microbesonline.org/fasttree/#Install
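
For reference, a minimal sketch of that compile step (it mirrors the commands used later in `fasttreeOMP.md`; where you put the resulting `FastTreeMP` binary is up to you):

```
wget http://www.microbesonline.org/fasttree/FastTree.c
gcc -DOPENMP -fopenmp -DUSE_DOUBLE -O3 -finline-functions -funroll-loops -Wall -o FastTreeMP FastTree.c -lm
```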

# Quickstart

1. Get the latest [GISAID data](https://www.gisaid.org/) in a single fasta file, e.g. `gisaid.fa`; you can do this via the batch download feature on [GISAID](https://www.gisaid.org/), though you will have to become a member to do this.
@@ -113,14 +142,14 @@ Here's what the code does:

6. Calculates and prints to `alignments.log` the stats of all alignments, for simple sanity checking.

7. Estimates a global ML tree from `global.fa`. This is done by using `fasttree` with settings determined empirically to be the best, which are constantly updated; see [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md) and [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation2.md). The scripts then use `goalign` to create 100 bootstrap alignments, followed by re-estimating all the ML trees with `fasttree` and the `-fastest` setting, using GNU `parallel` to manage parallelisation. FBP and TBE values are calculated with `gotree`, and SH values are calculated with `fasttree`. The resulting three trees are rooted with the sequence 'hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05' as suggested in [this preprint](https://www.biorxiv.org/content/10.1101/2020.04.17.046086v1), using [`nw_reroot`](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2887050/). This creates the files `global.fa_ft_TBE.tree`, `global.fa_ft_FBP.tree`, and `global.fa_ft_SH.tree`.
7. Estimates a global ML tree from `global.fa`. [NB: the methods here changed substantially subsequent to 31-7-20; please see previous releases for the previous methods]. This is done by using `fasttree` with settings determined empirically to be the best, which are constantly updated; see [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md), [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation2.md), and [here](https://github.com/roblanf/sarscov2phylo/blob/master/iqtree_sequential.md). The global tree is estimated in two steps. First, we start with the best tree from the previous release and *add* any new sequences to that tree using Maximum Parsimony in IQ-TREE (see [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md) and [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation2.md) for a demonstration that MP works extremely well with these data). Second, we further optimise that tree with a series of minimum evolution SPR moves and Maximum Likelihood NNI moves in `fasttree` (a rough sketch of this call is given below these steps). Details of the benchmarking behind these choices are [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md) and [here](https://github.com/roblanf/sarscov2phylo/blob/master/iqtree_sequential.md). We use `fasttree` to calculate SH supports on the branches of that tree. The resulting tree is rooted with the sequence 'hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05' as suggested in [this preprint](https://www.biorxiv.org/content/10.1101/2020.04.17.046086v1), using [`nw_reroot`](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2887050/). This creates the file `global.fa_ft_SH.tree`.

8. Removes sequences on very long branches from the tree using [`TreeShrink`](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4620-2). These sequences are likely to be either of poor quality and/or poorly aligned, so rather unreliable to interpret in a phylogeny with such limited variation. They are subsequently added to the list of excluded sequences so they are not included in future iterations of the pipeline. This creates the files `ft_TBE.tree`, `ft_FBP.tree`, and `ft_SH.tree`.
8. Removes sequences on very long branches from the tree using [`TreeShrink`](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4620-2). These sequences are likely to be either of poor quality and/or poorly aligned, so rather unreliable to interpret in a phylogeny with such limited variation. They are subsequently added to the list of excluded sequences so they are not included in future iterations of the pipeline. This creates the file `ft_SH.tree`.

9. Roots the final two trees with [`nw_reroot`](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2887050/) as in step 7.


# One last thing

The trees and alignments produced by the scripts here are checked regularly by eye. No matter how clever the software tools, there is no replacement for the human eye when it comes to spotting anomalies. If I find an anomaly, I go back and change the scripts to deal with it automatically (e.g. by removing a sequence or masking a site, changing alignment or tree building parameters, etc). Rest assured that each release contains the exact code that built the trees included in that release, and also that I would not publish a release with a known anomaly. Obviously I can't guarantee to find all of the problematic sequences or sites, but I try.


80 changes: 80 additions & 0 deletions SPR_analysis.md
@@ -0,0 +1,80 @@
# How many SPR moves do I need in fasttree?

This is a question I tried to address before, and I settled on a (very) conservatively high number of 10. Since then, I've been updating the global tree every two days by adding sequences with Maximum Parsimony and then updating the tree with ME and ML moves in fasttree. By far the slowest step is the ME SPR moves, and I currently do 10 rounds of range 1000 (i.e. the range is the whole tree).

In this document I'll just take a quick look at how much the tree changes in each of those 10 rounds of SPR moves over about 10 update cycles. To do that, I'll use the logs I recorded from fasttree and measure the Robinson-Foulds distance between successive trees across the 10 rounds of ME-SPR moves in each update. I started logging the fasttree output in the 14-08-20 release, so I'll start with that one here too.

Below is some clunky R code to get the information I want. And here's the result.

![SPR iterations plot](https://github.com/roblanf/sarscov2phylo/blob/master/spr_iterations.png)

There are a LOT of limitations here. RF distances are confounded for many reasons here, but they're the only thing fast enough to practically calculate. Still, the results show clearly that the amount that the tree changes tails off a lot with each iteration. Almost half of all the changes come in the first iteration.

My strong suspicion is that most of the changes at all stages are small changes near the tips of the trees (see also the previous analyses of that issue on this repo). And combined with this (highly flawed but still informative) analysis, I think it's sensible to drop the number of SPR rounds to 2.



```
library(readr)
library(phangorn)
library(parallel)
library(ggplot2)
# here's a function to get the distance I want from a fasttree log file
get_rfdists = function(filename){
    d = read_lines(filename, skip = 7, n_max = 11)
    e = read_delim(d, delim="\t", col_names = FALSE)
    e$RF = NA
    e$RFn = NA
    rownum = 2:11
    for(i in rownum){
        t1 = read.tree(text=e$X2[i-1])
        t2 = read.tree(text=e$X2[i])
        e$RF[i] = RF.dist(t1, t2, normalize = FALSE)
        e$RFn[i] = RF.dist(t1, t2, normalize = TRUE)
    }
    e$name = basename(dirname(filename))
    r1 = data.frame(name = e$name[2:11], dist = e$RF[2:11], dist.type = "rf.dist")
    r1$spr.iteration = 1:10
    r2 = data.frame(name = e$name[2:11], dist = e$RFn[2:11], dist.type = "rf.dist.norm")
    r2$spr.iteration = 1:10
    r = rbind(r1, r2)
    return(r)
}
# here are the log files
files = c("global_14-08-20/fasttree.log",
          "global_16-8-20/fasttree.log",
          "global_18-08-20/fasttree.log",
          "global_20-08-20/fasttree.log",
          "global_22-08-20/fasttree.log",
          "global_24-08-20/fasttree.log",
          "global_26-08-20/fasttree.log",
          "global_28-08-20/fasttree.log",
          "global_30-08-20/fasttree.log",
          "global_30-08-20/fasttree.log",
          "global_02-09-20/fasttree.log",
          "global_04-09-20/fasttree.log",
          "global_06-09-20/fasttree.log",
          "global_08-09-20/fasttree.log",
          "global_10-09-20/fasttree.log")
# run the function and bind the results
r = mclapply(files, get_rfdists, mc.cores = 8)
results = do.call(rbind, r)
# plot it out
ggplot(results, aes(x=spr.iteration, y=dist, colour=name)) + geom_point() + geom_line() + facet_wrap(facets = c("dist.type"), ncol = 1, scales = "free_y")
```
Binary files (the GISAID acknowledgements PDFs listed above) not shown.
16 changes: 16 additions & 0 deletions ci/Dockerfile
@@ -0,0 +1,16 @@
FROM continuumio/miniconda3:4.8.2

RUN conda update -n base -c defaults conda
COPY environment.yml /tmp/environment.yml

ARG USER_ID

RUN useradd -m --uid $USER_ID --shell /bin/bash user
RUN mkdir -p /work/sarscov2phylo
RUN chown user /work/sarscov2phylo
USER user

RUN conda env create -n sarscov2phylo -f /tmp/environment.yml

COPY scripts /work/sarscov2phylo/scripts
WORKDIR /work/sarscov2phylo
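
For local use outside CI, the `USER_ID` build argument lets the in-container user match your host UID (handy if you bind-mount data into the container); a hedged example, with an arbitrary image tag, mirroring the test command from `.gitlab-ci.yml`:

```
docker build --build-arg USER_ID=$(id -u) -t sarscov2phylo:local -f ci/Dockerfile .
docker run --rm sarscov2phylo:local conda run -n sarscov2phylo esl-alimanip -h
```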
7 changes: 7 additions & 0 deletions environment.yml
@@ -19,6 +19,13 @@ dependencies:
- parallel
- pip=19.3.1
- python=3.6
- r=3.6
- r-ape
- r-bms
- r-base=3.6.3
- r-ggplot2
- r-phangorn
- r-readr
- rapidnj
- snakemake-minimal=5.13
- seqmagick
3,822 changes: 3,313 additions & 509 deletions excluded_sequences.tsv

Large diffs are not rendered by default.

170 changes: 170 additions & 0 deletions fasttreeOMP.md
@@ -0,0 +1,170 @@
# Benchmarking fasttree OMP and veryfasttree (again)

Question: how much faster is fasttree OMP (compiled on my machine) than the non-OMP version I get from conda, and how many threads are optimal? And how does it compare to veryfasttree?

I looked at veryfasttree before (see veryfasttree.md on this repo), but at that point my trees were smaller, and I was limited by memory and CPUs, since I was running full bootstraps. Now my trees are much bigger, and I'm no longer doing full bootstraps, so it seems worth revisiting that analysis.

# Methods

This one's simple. First I get fasttree OMP and veryfasttree and compile them:

```
wget http://www.microbesonline.org/fasttree/FastTree.c
gcc -DOPENMP -fopenmp -DUSE_DOUBLE -O3 -finline-functions -funroll-loops -Wall -o FastTreeMP FastTree.c -lm
git clone https://github.com/citiususc/veryfasttree
cd veryfasttree    # configure and build inside the cloned directory
cmake .
make
make install
```

Now I re-run the analyses from 14-09-20 and time them, setting different numbers of threads like this:

```
export OMP_NUM_THREADS=1
/usr/bin/time -o 1threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 1threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 1thread.tree
```

I'll just do this for 1, 2, 5, 10, 20, 50, and 100 threads like this:

```
export OMP_NUM_THREADS=2
/usr/bin/time -o 2threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 2threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 2thread.tree
export OMP_NUM_THREADS=5
/usr/bin/time -o 5threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 5threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 5thread.tree
export OMP_NUM_THREADS=10
/usr/bin/time -o 10threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 10threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 10thread.tree
export OMP_NUM_THREADS=20
/usr/bin/time -o 20threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 20threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 20thread.tree
export OMP_NUM_THREADS=50
/usr/bin/time -o 50threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 50threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 50thread.tree
export OMP_NUM_THREADS=100
/usr/bin/time -o 100threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 100threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 100thread.tree
```

I noticed during these runs that 5 threads seemed to be going quickest, so I thought I'd also fire up everything from 2-10, because if you're going to optimise, you may as well do it right.


```
export OMP_NUM_THREADS=3
/usr/bin/time -o 3threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 3threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 3thread.tree
export OMP_NUM_THREADS=4
/usr/bin/time -o 4threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 4threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 4thread.tree
export OMP_NUM_THREADS=6
/usr/bin/time -o 6threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 6threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 6thread.tree
export OMP_NUM_THREADS=7
/usr/bin/time -o 7threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 7threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 7thread.tree
export OMP_NUM_THREADS=8
/usr/bin/time -o 8threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 8threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 8thread.tree
export OMP_NUM_THREADS=9
/usr/bin/time -o 9threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 9threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 9thread.tree
```

Now I'll run the same kinds of analysis for veryfasttree. But before I do that I'll run some analyses to find out which of veryfasttree's tricks work best on my new server:

```
/usr/bin/time -o vf1.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf1.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext SSE3 -fastexp 0 global.fa > vf1.tree
/usr/bin/time -o vf2.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf2.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext SSE3 -fastexp 2 global.fa > vf2.tree
/usr/bin/time -o vf3.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf3.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext AVX -fastexp 0 global.fa > vf3.tree
/usr/bin/time -o vf4.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf4.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext AVX -fastexp 2 global.fa > vf4.tree
/usr/bin/time -o vf5.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf5.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext AVX2 -fastexp 0 global.fa > vf5.tree
/usr/bin/time -o vf6.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf6.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext AVX2 -fastexp 2 global.fa > vf6.tree
```

Here's what we get for the execution times:

| run | -ext | -fastexp | time (h:m:s) |
|-----|------|----------|--------------|
| vf1 | SSE3 | 0        | 9:01:27      |
| vf2 | SSE3 | 2        | 9:04:37      |
| vf3 | AVX  | 0        | 8:50:10      |
| vf4 | AVX  | 2        | 8:44:37      |
| vf5 | AVX2 | 0        | 8:46:24      |
| vf6 | AVX2 | 2        | 8:44:15      |

So the final settings (AVX2 with fastexp 2) win by a meaningless whisker. Now let's run the thread-scaling analyses.

```
/usr/bin/time -o vf1threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf1threads.log -intree iqtree_seqsadded_mp.treefile -threads 1 -double-precision -ext AVX2 -fastexp 2 global.fa > vf1threads.tree
/usr/bin/time -o vf2threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf2threads.log -intree iqtree_seqsadded_mp.treefile -threads 2 -double-precision -ext AVX2 -fastexp 2 global.fa > vf2threads.tree
/usr/bin/time -o vf3threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf3threads.log -intree iqtree_seqsadded_mp.treefile -threads 3 -double-precision -ext AVX2 -fastexp 2 global.fa > vf3threads.tree
/usr/bin/time -o vf4threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf4threads.log -intree iqtree_seqsadded_mp.treefile -threads 4 -double-precision -ext AVX2 -fastexp 2 global.fa > vf4threads.tree
/usr/bin/time -o vf5threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf5threads.log -intree iqtree_seqsadded_mp.treefile -threads 5 -double-precision -ext AVX2 -fastexp 2 global.fa > vf5threads.tree
/usr/bin/time -o vf6threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf6threads.log -intree iqtree_seqsadded_mp.treefile -threads 6 -double-precision -ext AVX2 -fastexp 2 global.fa > vf6threads.tree
/usr/bin/time -o vf7threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf7threads.log -intree iqtree_seqsadded_mp.treefile -threads 7 -double-precision -ext AVX2 -fastexp 2 global.fa > vf7threads.tree
/usr/bin/time -o vf8threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf8threads.log -intree iqtree_seqsadded_mp.treefile -threads 8 -double-precision -ext AVX2 -fastexp 2 global.fa > vf8threads.tree
/usr/bin/time -o vf9threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf9threads.log -intree iqtree_seqsadded_mp.treefile -threads 9 -double-precision -ext AVX2 -fastexp 2 global.fa > vf9threads.tree
/usr/bin/time -o vf10threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf10threads.log -intree iqtree_seqsadded_mp.treefile -threads 10 -double-precision -ext AVX2 -fastexp 2 global.fa > vf10threads.tree
/usr/bin/time -o vf20threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf20threads.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext AVX2 -fastexp 2 global.fa > vf20threads.tree
/usr/bin/time -o vf50threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf50threads.log -intree iqtree_seqsadded_mp.treefile -threads 50 -double-precision -ext AVX2 -fastexp 2 global.fa > vf50threads.tree
/usr/bin/time -o vf100threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf100threads.log -intree iqtree_seqsadded_mp.treefile -threads 100 -double-precision -ext AVX2 -fastexp 2 global.fa > vf100threads.tree
```
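
To pull the wall-clock times out of all those `/usr/bin/time -v` outputs in one go, something like the following works (GNU time labels the field "Elapsed (wall clock) time"; adjust the globs to match the .txt names used above):

```
grep "Elapsed (wall clock)" vf*.txt *threads.txt
```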

## Results

TL;DR: veryfasttree is a lot slower than fasttree, and gives worse trees (!). FastTreeMP works fastest with 10 threads, which is a bit odd.

| wut | threads | time (h:m:s) | %CPU | efficiency (%) | lnL |
|--------------|---------|--------------|------|----------------|-------------|
| fasttree | 1 | 8:51:47 | 99 | 99.0 | -640122.657 |
| fasttree_omp | 1 | 9:06:28 | 99 | 99.0 | -640114.204 |
| fasttree_omp | 2 | 7:43:12 | 127 | 63.5 | -640114.204 |
| fasttree_omp | 3 | 6:52:53 | 144 | 48.0 | -640114.204 |
| fasttree_omp | 4 | 6:53:48 | 156 | 39.0 | -640114.204 |
| fasttree_omp | 5 | 6:55:20 | 168 | 33.6 | -640114.204 |
| fasttree_omp | 6 | 7:21:00 | 177 | 29.5 | -640114.204 |
| fasttree_omp | 7 | 7:14:02 | 190 | 27.1 | -640114.204 |
| fasttree_omp | 8 | 7:05:57 | 203 | 25.3 | -640114.204 |
| fasttree_omp | 9 | 7:20:56 | 211 | 23.4 | -640114.204 |
| fasttree_omp | 10 | 6:20:57 | 238 | 23.8 | -640114.204 |
| fasttree_omp | 20 | 7:10:58 | 347 | 17.3 | -640114.204 |
| fasttree_omp | 50 | 7:17:07 | 723 | 14.4 | -640114.204 |
| fasttree_omp | 100 | 7:15:18 | 1385 | 13.8 | -640114.204 |
| veryfasttree | 1 | 12:30:28 | 99 | 99.0 | -640671.012 |
| veryfasttree | 2 | 12:21:46 | 103 | 51.5 | -640671.012 |
| veryfasttree | 3 | 11:05:24 | 118 | 39.3 | -640670.422 |
| veryfasttree | 4 | 11:04:09 | 120 | 30.0 | -640669.088 |
| veryfasttree | 5 | 11:05:58 | 122 | 24.4 | -640669.088 |
| veryfasttree | 6 | 9:49:00 | 141 | 23.5 | -640669.088 |
| veryfasttree | 7 | 9:43:35 | 149 | 21.2 | -640669.089 |
| veryfasttree | 8 | 9:28:12 | 153 | 19.1 | -640669.088 |
| veryfasttree | 9 | 9:19:24 | 159 | 17.6 | -640669.088 |
| veryfasttree | 10 | 9:21:32 | 163 | 16.3 | -640668.681 |
| veryfasttree | 20 | 8:36:04 | 205 | 10.2 | -640655.452 |
| veryfasttree | 50 | 8:36:26 | 320 | 6.4 | -640626.579 |
| veryfasttree | 100 | 8:53:03 | 484 | 4.8 | -640626.579 |


The 10 threads thing doesn't make much sense. Up to 10 threads it all seems sensible - 3 threads is best, and it gets gradually worse the more you add (I assume because the cost of cross-talk outweighs the benefits of parallelisation). But then there's this jump at 10 threads.

To double check what might be happening here and to see what happens above 10 threads, I'm going to re-run 3, 8, 9, 10, 11, and 12 threads.

```
export OMP_NUM_THREADS=3
/usr/bin/time -o r3threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log r3threads.log -intree iqtree_seqsadded_mp.treefile global.fa > r3thread.tree
export OMP_NUM_THREADS=8
/usr/bin/time -o r8threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log r8threads.log -intree iqtree_seqsadded_mp.treefile global.fa > r8thread.tree
export OMP_NUM_THREADS=9
/usr/bin/time -o r9threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log r9threads.log -intree iqtree_seqsadded_mp.treefile global.fa > r9thread.tree
export OMP_NUM_THREADS=10
/usr/bin/time -o r10threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log r10threads.log -intree iqtree_seqsadded_mp.treefile global.fa > r10thread.tree
export OMP_NUM_THREADS=11
/usr/bin/time -o r11threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log r11threads.log -intree iqtree_seqsadded_mp.treefile global.fa > r11thread.tree
export OMP_NUM_THREADS=12
/usr/bin/time -o r12threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log r12threads.log -intree iqtree_seqsadded_mp.treefile global.fa > r12thread.tree
```

Binary file added ftOMP.xlsx
Binary file not shown.
1 change: 0 additions & 1 deletion ft_FBP.tree

This file was deleted.

2 changes: 1 addition & 1 deletion ft_SH.tree

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion ft_TBE.tree

This file was deleted.

191 changes: 191 additions & 0 deletions iqtree_sequential.md
@@ -0,0 +1,191 @@
# Moving to online phylogenetics

As the data keep growing in size, it makes less and less sense to start each analysis as if the previous analysis never happened. The current process does exactly that, estimating every tree from scratch, even if the dataset only increases by a few percent in size.

The previous analyses have shown that MP is an excellent method here, so in principle we could use this to place new sequences on a tree, and make some global and local improvements to that tree.

To do this, I wrote the script `global_tree_gisaid_start_tree.sh`. This script is very similar to the previous script in that it does the alignment from scratch. But this time, instead of using fasttree to make an NJ tree, which is then optimised with various ME and ML moves, we start with the tree from the previous iteration of the pipeline, then *add* the new sequences to it (there are various cleaning steps involved here, but the addition itself is done with MP in IQ-TREE), and finally optimise that tree with fasttree. It took me some time to get the steps of that pipeline together, but there are a number of questions that remain about the best way to optimise the starting tree.

The requirements are that it has to be *fast enough* to put out a new tree every day (hopefully), and that it gives us some assurance that global and local issues in the tree could, at least in principle, have been ironed out.

So, it seems sensible to include some global (i.e. SPRs with large radii) and local (lots of ML NNI) moves when optimising the tree. But I want to compare these (of course) against just estimating the tree from scratch as I usually would. Here are seven analyses which will allow me to answer a few questions that I set out below.

These analyses all use the tree from 31-7-20 as the starting tree (where it's used) and GISAID data from 11-8-20 as the alignment. When using a starting tree, I take the following steps:

1. Remove sequences from the tree that are not in the alignment (this can happen because sequences are filtered)
2. Add to the tree sequences in the alignment that are not already in the tree (this uses IQ-TREE and MP, and took about 1hr to add about 3K sequences)

That's it. The total dataset here is about 52K sequences.
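
As a rough illustration of step 1, here is a minimal R sketch for pruning tips that are no longer in the alignment (file names are hypothetical; the pipeline's own logic lives in the scripts added in this release):

```
library(ape)

tree = read.tree("previous_release.tree")                     # tree from the previous iteration
aln_names = names(read.FASTA("global.fa"))                    # sequence names in the current alignment
pruned = drop.tip(tree, setdiff(tree$tip.label, aln_names))   # drop tips with no sequence in the alignment
write.tree(pruned, "start_tree_pruned.tree")
```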


```
# 2 sprs, + ML NNIs
/usr/bin/time -o 1.mem.txt -v fasttree -nt -gamma -nni 0 -spr 2 -sprlength 200 -intree iqtree_seqsadded_mp.treefile $outputfasta > 1.tree
# 0 sprs, + ML NNIs
/usr/bin/time -o 2.mem.txt -v fasttree -nt -gamma -nni 0 -spr 0 -sprlength 200 -intree iqtree_seqsadded_mp.treefile $outputfasta > 2.tree
# 2 sprs, + ML NNIs, 5 rate categories
/usr/bin/time -o 3.mem.txt -v fasttree -nt -gamma -nni 0 -spr 2 -cat 5 -sprlength 200 -intree iqtree_seqsadded_mp.treefile $outputfasta > 3.tree
# 2 sprs, + ML NNIs, 5 rate categories, no gamma
/usr/bin/time -o 4.mem.txt -v fasttree -nt -nni 0 -spr 2 -cat 5 -sprlength 200 -intree iqtree_seqsadded_mp.treefile $outputfasta > 4.tree
# 2 sprs, + ML NNIs, 5 rate categories, no gamma, boot 100
/usr/bin/time -o 5.mem.txt -v fasttree -nt -nni 0 -spr 2 -cat 5 -boot 100 -sprlength 200 -intree iqtree_seqsadded_mp.treefile $outputfasta > 5.tree
# 2 sprs, + ML NNIs, 5 rate categories, no gamma, no boot
/usr/bin/time -o 6.mem.txt -v fasttree -nt -nni 0 -spr 2 -cat 5 -nosupport -sprlength 200 -intree iqtree_seqsadded_mp.treefile $outputfasta > 6.tree
# no input tree, otherwise default except for increased SPR length
/usr/bin/time -o 7.mem.txt -v fasttree -nt -gamma -sprlength 200 $outputfasta > 7.tree
```

## Q1: What gives the best tree?

For this we just compare analysis 1 and analysis 7. Analysis 1 uses the previous iteration as a starting tree, where the new sequences were added by IQ-TREE.

Start tree lnL: -496120.668
No start tree lnL: -496332.143

Unsurprisingly, the analysis with the starting tree is better. It's not surprising because almost all the splits in this tree were *already* highly optimised. Adding a few thousand (about 6%) new sequences doesn't change many of the existing splits. But I add 2 global SPR moves (with ME) followed by fasttree's default number of ML NNIs (30 in this case). Of note, the analysis converged after just 6 rounds of ML NNIs, suggesting that the MP placements of the new sequences were already very close to their final ML placements. That's encouraging. In addition, this is about 1 week of new data. So doing this every day should be very robust (and probably a little quicker).

Of note the timings were also very different:

Start tree time: 10.17 hrs
No start tree time: 14.35 hrs

## Q2: How much time does each round of SPRs add?

Comparing analysis 1 (10.17 hrs) and analysis 2 (6.08 hrs), it's clear that each round of SPRs adds about 2 hrs to the analysis. That's about 30% of the analysis time without SPRs, so it's very significant. We should check if it's worth it...

## Q3: Are the SPR rounds worth it?

One round of global SPR moves is certainly worth it. If we don't do that we'll have zero rounds of SPR moves, and that means that any serious issues in the tree (introduced for whatever reason) will be very hard if not impossible to fix. NNI moves just don't cut it in this scenario. But it's still useful to know how much the 2 rounds of SPR moves improved the final log likelihood (also bearing in mind that the 2 SPR moves are done in an ME framework, so could in principle make things worse if ME optima conflict with ML optima)...

2 SPRs lnL: -496120.668
0 SPRs lnL: -496239.083

Surprisingly (to me at least) the two rounds of SPR moves really seemed to help. Seems worth investigating whether 1, 2, or 3 rounds of SPR moves is best in the next analysis.

## Q4: How much time can we save?

Analyses 2-6 are all about investigating ways of saving time. By comparing them to each other we can get an idea of how much time can be saved by changing various settings.

First, comparing 1 and 3 tells us if we save time or change the likelihood by moving from 20 to 5 rate categories:

cat 20 lnL: -496120.668
cat 5 lnL: -498323.359

Oof. That made the lnL a LOT worse. In addition, the cat 5 analysis (number 3) still took 9.8 hrs, so it is a very marginal time saving. Not worth it.

Second, comparing 3 and 4 tells us if removing the -gamma flag speeds things up (it makes no difference to the tree, it just makes the log likelihoods comparable and scales the branch lengths a little, neither of which are all that important). Analysis 3 took 9.8 hrs, analysis 4 took 9.4 hrs, so it's about a 4% time saving to remove the -gamma flag. Not nothing, but not much.

Third, comparing 4 and 5 tells us if reducing the number of bootstraps from 1000 to 100 saves time. Analysis 4 took 9.4 hrs, and analysis 5 took 7.6 hrs. That's almost a 20% time saving. Seems worth it, particularly because 100 SH bootstraps is probably enough, and these numbers are very very imprecise anyway.

Finally, comparing 5 and 6 tells us how much more we can save by removing bootstraps all together. Analysis 5 took 7.6 hrs, and analysis 6 took 7.2 hrs. So it's not a huge saving, and *some* measure of uncertainty in the splits seems worth having for a little bit of time.

## Conclusions so far

The best settings are 20 rate categories, we can take or leave the `-gamma` flag, and we should use 100 bootstraps.

## Some more investigation

The remaining question seems to be how many rounds of ME SPRs is best. So let's run a few additional analyses for that. Let's compare the likelihood and execution time for 0-4 rounds of SPRs, and for good measure let's try three spr lengths - 100, 200, 500.

```
/usr/bin/time -o 11.mem.txt -v fasttree -nt -gamma -nni 0 -spr 0 -sprlength 100 -log 11.log -intree iqtree_seqsadded_mp.treefile global.fa > 11.tree
/usr/bin/time -o 12.mem.txt -v fasttree -nt -gamma -nni 0 -spr 0 -sprlength 200 -log 12.log -intree iqtree_seqsadded_mp.treefile global.fa > 12.tree
/usr/bin/time -o 13.mem.txt -v fasttree -nt -gamma -nni 0 -spr 0 -sprlength 500 -log 13.log -intree iqtree_seqsadded_mp.treefile global.fa > 13.tree
/usr/bin/time -o 21.mem.txt -v fasttree -nt -gamma -nni 0 -spr 1 -sprlength 100 -log 21.log -intree iqtree_seqsadded_mp.treefile global.fa > 21.tree
/usr/bin/time -o 22.mem.txt -v fasttree -nt -gamma -nni 0 -spr 1 -sprlength 200 -log 22.log -intree iqtree_seqsadded_mp.treefile global.fa > 22.tree
/usr/bin/time -o 23.mem.txt -v fasttree -nt -gamma -nni 0 -spr 1 -sprlength 500 -log 23.log -intree iqtree_seqsadded_mp.treefile global.fa > 23.tree
/usr/bin/time -o 31.mem.txt -v fasttree -nt -gamma -nni 0 -spr 2 -sprlength 100 -log 31.log -intree iqtree_seqsadded_mp.treefile global.fa > 31.tree
/usr/bin/time -o 32.mem.txt -v fasttree -nt -gamma -nni 0 -spr 2 -sprlength 200 -log 32.log -intree iqtree_seqsadded_mp.treefile global.fa > 32.tree
/usr/bin/time -o 33.mem.txt -v fasttree -nt -gamma -nni 0 -spr 2 -sprlength 500 -log 33.log -intree iqtree_seqsadded_mp.treefile global.fa > 33.tree
/usr/bin/time -o 41.mem.txt -v fasttree -nt -gamma -nni 0 -spr 3 -sprlength 100 -log 41.log -intree iqtree_seqsadded_mp.treefile global.fa > 41.tree
/usr/bin/time -o 42.mem.txt -v fasttree -nt -gamma -nni 0 -spr 3 -sprlength 200 -log 42.log -intree iqtree_seqsadded_mp.treefile global.fa > 42.tree
/usr/bin/time -o 43.mem.txt -v fasttree -nt -gamma -nni 0 -spr 3 -sprlength 500 -log 43.log -intree iqtree_seqsadded_mp.treefile global.fa > 43.tree
/usr/bin/time -o 51.mem.txt -v fasttree -nt -gamma -nni 0 -spr 4 -sprlength 100 -log 51.log -intree iqtree_seqsadded_mp.treefile global.fa > 51.tree
/usr/bin/time -o 52.mem.txt -v fasttree -nt -gamma -nni 0 -spr 4 -sprlength 200 -log 52.log -intree iqtree_seqsadded_mp.treefile global.fa > 52.tree
/usr/bin/time -o 53.mem.txt -v fasttree -nt -gamma -nni 0 -spr 4 -sprlength 500 -log 53.log -intree iqtree_seqsadded_mp.treefile global.fa > 53.tree
```

I'll save those in a text file called spr.txt and run them like this:

```
parallel -j 15 -k --bar :::: spr.txt
```

| ID | SPRmoves | SPRlen | time | lnL | delta hrs | delta lnL |
|----|----------|--------|----------|-------------|-----------|-----------|
| 11 | 0 | 100 | 25613.75 | -496239.08 | 0.02 | 0.00 |
| 12 | 0 | 200 | 25551.77 | -496239.08 | 0.01 | 0.00 |
| 13 | 0 | 500 | 25529.44 | -496239.08 | 0.00 | 0.00 |
| 21 | 1 | 100 | 35276.59 | -496098.941 | 2.71 | 140.14 |
| 22 | 1 | 200 | 35207.22 | -496098.941 | 2.69 | 140.14 |
| 23 | 1 | 500 | 35239.38 | -496098.941 | 2.70 | 140.14 |
| 31 | 2 | 100 | 38495.2 | -496120.668 | 3.60 | 118.41 |
| 32 | 2 | 200 | 38688.24 | -496120.668 | 3.66 | 118.41 |
| 33 | 2 | 500 | 38522.25 | -496120.668 | 3.61 | 118.41 |
| 41 | 3 | 100 | 40822.96 | -496063.816 | 4.25 | 175.27 |
| 42 | 3 | 200 | 41290.1 | -496063.816 | 4.38 | 175.27 |
| 43 | 3 | 500 | 41456.04 | -496063.816 | 4.42 | 175.27 |
| 51 | 4 | 100 | 43605.31 | -495982.737 | 5.02 | 256.35 |
| 52 | 4 | 200 | 43502.2 | -495982.737 | 4.99 | 256.35 |
| 53 | 4 | 500 | 43689.79 | -495982.737 | 5.04 | 256.35 |

The best bang for our buck is to do just one round of SPRs (140 likelihood units per round, vs. e.g. 64 per round with 4 rounds). We get better likelihoods with more SPRs (in general) but the time penalty starts to get large. At 4 rounds of SPRs we've added 5 hrs to our analysis (meaning we're pushing our luck to get out an update every day, particularly as the dataset grows a lot in size).

Of note, I should have realised that an SPR radius of 100 encompasses the entire tree, so there is no difference in likelihood between any of the different SPR lengths. Still, I'll keep it at a very large number (1000) in future, to make sure that the SPR phase allows for truly global moves even as the tree grows a lot.

So, to take a starting tree of ~50K SARS-CoV-2 sequences, add ~3K new sequences to it, and then optimise it in fasttree, the best commandline for that seems to be:

```
fasttree -nt -gamma -nni 0 -spr 1 -sprlength 1000 -boot 100 -intree iqtree_seqsadded_mp.treefile $outputfasta > 1.tree
```

Reasoning: I leave the gamma on there purely so that as things progress it's easy to compare the lnL of different runs; reduce bootstraps to 100, because the SH measures of support are potentially useful (though I'm not all that sure how to interpret them yet) but the additional precision of doing 1000 bootstraps is not justified; and use 1 SPR move of length 1000 because that seems to be a good balance between likelihood and execution time. Though it's also tempting to consider having 4 SPR moves, because for a few additional hours it seems like we can get a higher likelihood. However, when this analysis runs every day (rather than, as in this case, on 10 days' worth of new sequences), I think we have a solution.


There is one more thing I want to try, and that's whether we can make good improvements by running the 1 SPR analysis on the starting tree twice. The reason for this - it's often useful to interleave two algorithms (like SPR and NNI moves on trees) when optimising. So it's feasible that running an analysis with 1 SPR move followed by NNI moves twice, does a better job than any of the above, and could (let's see) give a nice balance of time and accuracy.

To do this, we just want to compare this:

```
/usr/bin/time -o 91.mem.txt -v fasttree -nt -gamma -nni 0 -spr 1 -sprlength 1000 -boot 100 -intree iqtree_seqsadded_mp.treefile $outputfasta > 91.tree
```

to this:

```
/usr/bin/time -o 92.mem.txt -v fasttree -nt -nni 0 -spr 1 -sprlength 1000 -nosupport -intree iqtree_seqsadded_mp.treefile $outputfasta > 92.tree
/usr/bin/time -o 93.mem.txt -v fasttree -nt -gamma -nni 0 -spr 1 -sprlength 1000 -boot 100 -intree 92.tree $outputfasta > 93.tree
```

Note that in the second version, we can skip the gamma optimisation of branch lengths and the bootstraps, so it will less than double the execution time. And in the second step of the second version, we use the tree from the first step as the starting tree.

Turns out this didn't really help much:

| Iterations | time | lnL |
|------------|----------|-------------|
| 1 | 22914.02 | -496098.941 |
| 2 | 39068.08 | -496038.076 |


I almost doubled the execution time, for a marginal gain in likelihood.

## Final conclusion

This is the best solution so far. Iterated once a day with the new sequences from that day.

```
/usr/bin/time -o 91.mem.txt -v fasttree -nt -gamma -nni 0 -spr 1 -sprlength 1000 -boot 100 -intree iqtree_seqsadded_mp.treefile $outputfasta > 91.tree
```

P.S. In the end I decided to see what would happen if I just cranked it to do 10 SPRs, and the results were surprising. The lnL was -494195.750, which is a sizeable improvement. And the time was 11hrs. It seems worth keeping this for now, even if the 10 rounds of SPRs are not always worth it.



193 changes: 193 additions & 0 deletions iqtree_vs_fasttree.md
@@ -0,0 +1,193 @@
# IQ-TREE or fasttree?

In the first benchmark of methods for SARS-CoV-2 phylogenetic analyses, fasttree was the clear winner, and I've been using it successfully for months. But IQ-TREE is certainly a more flexible piece of software, and what's more, we develop it. As luck would have it, we have funding from the Chan-Zuckerberg Initiative, and we used it to hire an *excellent* programmer, James Barbetti. The first thing James has been looking at is removing some of the bottlenecks that IQ-TREE had when analysing huge datasets.

The result so far is that we have a version of IQ-TREE that does lightning-fast neighbour joining trees. This is done using all sorts of methods, some of which come straight from the excellent work that went into developing the rapid-NJ algorithm used in `rapidnj`. Regardless, now's a good time to revisit whether we can get better trees from fasttree or IQ-TREE. Of note, I've spent months optimising the search in fasttree (see other .md files on this repo), to the point where I don't think I can make much in the way of additional gains.

So I'm not expecting IQ-TREE to beat fasttree first time. It will likely take a lot of tweaks. It may not be possible at all, because fasttree still does things IQ-TREE can't, like SPR moves, which at least in fasttree seem to be fairly crucial to getting good and accurate trees from SARS-CoV-2. But I still want to check.

The other reason for this is that IQ-TREE allows us to apply a much much wider range of models of sequence evolution, which I know from previous analyses fit the data way better than the JC+G or GTR+G models available in fasttree. In addition, we are developing a range of new models in IQ-TREE, specifically targeted at dealing with the various issues we see in SARS-CoV-2 sequencing data (see posts on virological that highlight these).

## The plan

I'll start by figuring out a good model of sequence evolution in IQ-TREE. Based on previous analyses, the best I've found is `JC+I+R3`. But this was on an old alignment with different filtering, so it needs to be revisited. And I didn't even check the JC component of this.

A few open questions:

1. What's the best rate matrix framework to use (e.g. JC, HKY, GTR, UNREST)?
2. What's the best distribution of rates across sites (e.g. +I+R2, +I+R4 etc)?
3. Do either 1 or 2 make much difference to the tree topology? (this is important - it's much much harder to optimise the complex models, so it may be better to spend a long time optimising the tree on a simple model like JC first, then do a few last rounds of optimisation with a complex model)
4. What tree search settings are best in IQ-TREE?
5. Does the best tree from IQ-TREE beat the best tree from fasttree (when both are analysed in IQ-TREE under whatever the best model seems to be)?


These questions suggest a simple approach:

1. Find the best rate distribution
2. Find the best rate matrix
3. Optimise in IQ-TREE with this model, and with plain JC model
4. Get the lnL of the best fasttree tree under this model
5. Compare fasttree to the IQ-TREE trees
6. Work on optimising IQ-TREE

Doing things in this order makes sense. The finer details of the tree topology tend not to make much difference for the model selection steps. And getting the rate distribution right tends to be more important than getting the rate matrix right. I'm not going to worry about running time here, because ultimately I can fix all the model parameters in IQ-TREE; I know they don't change much as the size of the dataset increases, so in practice my analyses should be much faster than the analyses here. In other words, at this stage I care about lnL values much more than I care about run time.

Of note, IQ-TREE uses a lot more memory than fasttree. So I'm very limited in how many options I can explore here. Typically I can just look at 4 options at a time. With that plan laid out, here goes.

I'll do all of the analyses on the latest global alignment from this analysis: https://github.com/roblanf/sarscov2phylo/releases/tag/11-7-20

Also of note, I am using the bleeding-edge IQ-TREE release with all sorts of undocumented options, which is version 2.0.8. For all analyses I'll set the minimum branch length to a very small number.

## 1. Find the best rate distribution

I know from previous analyses that the +G and +I+G models are very unlikely to be the best. So I'll just compare four models here that I think should capture a good range.

```
/usr/bin/time -o 1.1.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.1
/usr/bin/time -o 1.2.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R3 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.2
/usr/bin/time -o 1.3.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R4 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.3
/usr/bin/time -o 1.4.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R5 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.4
```

##### Results

Don't pay too much heed to execution time. We're working on speeding up bottlenecks.

JC
lnL: -407738.7183
BIC: 1464697.832
time: 6h:53m:34s

JC+I+R3
lnL: -388587.635
BIC: 1426447.16
time: 53h:37m:6s

JC+I+R4
lnL: -388425.334
BIC: 1426143.156
time: 105h:52m:55s

JC+I+R5
lnL: ? (it was converging on something a little below -388319.099658)
BIC: ?
time: >200h (!)

Although the final run didn't complete (I had to kill it to re-estimate the global tree, which was a week out of date...), it did reach a lnL of -388319.099658, which equates to a BIC that is also substantially better than the JC+I+R4 model.
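
As a sanity check on those BIC values: they follow the usual formula, BIC = k*ln(n) - 2*lnL, where n is the number of alignment columns and k is the number of free parameters (all 2*ntaxa - 3 branch lengths plus the free model parameters). A minimal sketch of the calculation, with illustrative placeholder values for k and n rather than the exact counts IQ-TREE used:

```
# BIC = k*ln(n) - 2*lnL; the k and n below are illustrative placeholders only
awk -v lnL=-388425.334 -v k=63000 -v n=30000 \
    'BEGIN { printf "BIC ~ %.1f\n", k * log(n) - 2 * lnL }'
```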

It's worth comparing the model parameters estimated by IQ-TREE (the +R5 model parameters never completed their final optimisation, but even before that the model would have had the best BIC score), just to see what's going on.

```
JC+I+R3: Proportion of invariable sites: 0.389; Site proportion and rates: (0.529,0.808) (0.074,5.739) (0.009,16.933)
JC+I+R4: Proportion of invariable sites: 0.335; Site proportion and rates: (0.558,0.659) (0.083,4.126) (0.023,10.607) (0.002,27.445)
JC+I+R5: Proportion of invariable sites: 0.305; Site proportion and rates: (0.366,0.494) (0.220,0.840) (0.084,4.024) (0.024,10.400) (0.002,27.193)
```

The rate categories show that roughly 30% of sites are modelled as invariant. Then about 37% of sites have a rate around 0.5, 22% around 0.8, 8% around 4, 2% around 10, and 0.2% around 27.

That last category represents around 60 sites that seem to have exceptionally fast rates relative to all the other sites. Modelling those sites properly (i.e. with a free rate model) could be important for correctly placing sequences in the tree.

Given that the I+R5 model gave the best BIC score, it seems prudent to use this model, but to fix the model parameters to those estimated here. They won't be the ML parameters (they never quite finished the final optimisation, and fixing them means they won't be updated when the tree changes); however, it's a safe bet from the BIC scores that they are better than those of the other models I tried, and a *lot* better than the JC model with no rate variation. This could end up being important for optimising the tree, since I currently use fasttree, which doesn't account for rate variation when optimising the topology (it adds it at the end, when optimising branch lengths).

Final note: the memory usage for the +I+R5 model is HUGE - 175GB at the maximum (which is where it sat most of the time)!

Postscript: it turns out IQ-TREE already has a potentially much faster algorithm for the model optimisation implemented. So let's compare it on the +R3 and +R5 models (the former so I can compare likelihoods, the latter to see if it's quick enough to be usable). The standard algorithm tends to perform better on most datasets, but it's slower; usually that's not an issue. The standard algorithm updates the rates one at a time, jiggling each rate parameter around to improve it while holding the others constant. The alternative BFGS algorithm tries out changes to all rate parameters at once. Let's see what happens...


```
/usr/bin/time -o 1.5.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R3 -fast -nt 4 -optalg 1-BFGS -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.5
/usr/bin/time -o 1.6.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R5 -fast -nt 4 -optalg 1-BFGS -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.6
```

OK, that was a surprise. My expectation was that the BFGS algorithm would be quicker but get substantially worse likelihoods. The truth was the opposite. I had to kill the runs because they took too long (>> 1 day, so I don't know the exact speed comparison; BFGS is certainly not more than twice as fast, and from the stage it had reached I'd estimate it's about the same speed). Speed aside, the likelihoods are a lot better, even with just the model parameters optimised on the initial JC NJ-R tree.

JC+I+R3: -375974.923841
JC+I+R5: -375490.305847

The difference between R3 and R5 is large - plenty to justify the two extra rate categories, I'd argue. The main take-home is to compare these likelihoods to those above. They are A LOT better. More than 10K log-likelihood units. For the SAME model. That's huge.

The model parameters:

JC+I+R3: Proportion of invariable sites: 0.616; Site proportion and rates: (0.426,0.269) (0.415,0.625) (0.159,3.936)
JC+I+R5: Proportion of invariable sites: 0.616; Site proportion and rates: (0.210,0.334) (0.219,0.342) (0.284,0.344) (0.202,1.456) (0.085,5.456)


These are very different from those estimated with the standard algorithm, and a lot better too (the likelihoods tell us that).

Then I realised I forgot one really important model: the plain old +I model. Not sure quite how I forgot this, but let's see how it goes. We don't need BFGS for this - it won't help here.


```
/usr/bin/time -o 1.8.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.8
```

JC+I: -396992.024799

The likelihood is a lot worse. But it did run quickly (about 5 hours). This may be worth considering in future.

So, let's continue with the +I+R5 model.


## 2. Find the best rate matrix

We can fix the rate parameters to the ones estimated in step 1 using:

`+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}`


I'll re-run the JC model too, to get an idea of how much time and memory we save by fixing the rate parameters rather than estimating them.

```
/usr/bin/time -o 2.1.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m "JC+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}" -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 2.1
/usr/bin/time -o 2.2.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m "HKY+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}" -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 2.2
/usr/bin/time -o 2.3.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m "GTR+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}" -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 2.3
/usr/bin/time -o 2.4.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m "12.12+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}" -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 2.4
```


lnLs (after first round of model optimisation on NJ-R ML distance tree)

JC+I+R5: -392823.153805
HKY+I+R5: -380249.398257
GTR+I+R5: -373470.588371
12.12+I+R5: didn't optimise - various numerical issues

So the GTR model is by far and away the best model. The parameters are:

Rate parameters: A-C: 0.14012 A-G: 0.66091 A-T: 0.12269 C-G: 0.09998 C-T: 2.61404 G-T: 1.00000
Base frequencies: A: 0.299 C: 0.184 G: 0.196 T: 0.321
Proportion of invariable sites: 0.616
Site proportion and rates: (0.077,0.280) (0.077,0.938) (0.077,1.823) (0.077,3.185) (0.077,6.795)


## 3. Optimise in IQ-TREE

Here I want to optimise the tree with the two different models.

The `-fast` flag does this: `-ninit 2 -n 2 -me 0.05`.

But I want to allow for more rounds of NNI than this, so I'll allow for 5 rounds and hard-code the other options.

I'll also set -me to 1.0, because I'm not really worried about optimising branch lengths at this point (and the rest of the model parameters are fixed).

```
/usr/bin/time -o 3.1.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m "JC+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}" -ninit 2 -n 5 -me 1.0 -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 3.1
/usr/bin/time -o 3.2.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m "GTR{0.14012,0.66091,0.12269,0.09998,2.61404}+F{0.299,0.184,0.196,0.321}+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}" -ninit 2 -n 5 -me 1.0 -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 3.2
```



15 changes: 15 additions & 0 deletions onlinephylo.md
@@ -0,0 +1,15 @@
# Comparing online to de-novo phylogenetics

Since 14-08-20 I have moved to online phylogenetics, and I've been logging the fasttree output as I go. The question I want to address here is how online phylogenetics (where I add new sequences every two days with MP, then update the resulting tree with ML) compares to standard phylogenetics (where the tree is re-estimated from scratch every time), in terms of the final likelihoods.
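
For context, the online update in each iteration boils down to something like the following (a condensed sketch of the relevant commands from `global_tree_gisaid_start_tree.sh`, added in this same update; the file names are the ones used there):

```
# place the new sequences on the previous iteration's tree with parsimony (1 thread is optimal here)
iqtree2 -seed 1729 -s global.fa -g input_tree_cleaned.tree -n 0 -m JC -fixbr -nt 1 -pre iqtree_seqsadded_mp
# then refine that tree with ML SPRs and branch length optimisation in fasttree
FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -intree iqtree_seqsadded_mp.treefile global.fa > ft_SH.tree
```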

To answer that, all I need to do is re-run every analysis since 14-08-20 with standard phylogenetics. I'll use the following almost default setup in `fasttree` to do that:

```
fasttree -nt -gamma -boot 100 -sprlength 1000 -log fasttree.log global.fa > global.tree
```

This means I can compare execution times (hence the 100 bootstraps to get the SH supports, matching what I do in the online phylogenetics) and log-likelihoods (hence the -gamma) between online and de novo phylogenetics (nobody calls it "de novo phylogenetics", but it makes sense here to distinguish it from online).
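
To pull the final likelihoods out for the comparison, something like this should do (a sketch only: the per-iteration directory layout is an assumption, and the exact label of the Gamma log-likelihood line in the fasttree log may differ):

```
# grab the last Gamma-related line (which carries the Gamma-based log-likelihood)
# from each iteration's fasttree log
for d in iteration_*/; do
    printf '%s\t' "$d"
    grep -i "gamma" "$d/fasttree.log" | tail -n 1
done
```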

# Results


Binary file added onlinephylo.xlsx
144 changes: 144 additions & 0 deletions scripts/QC.R
@@ -0,0 +1,144 @@
#!/usr/bin/env Rscript

# usage: Rscript old_tree new_tree threads

args = commandArgs(trailingOnly=TRUE)


# a little QC for the trees
library(ape)
library(phangorn)
library(parallel)
library(ggplot2)

random_nni_distance = function(n, tree){
    # get the distance between tree1 and itself when
    # the second copy has undergone n NNI moves
    tree2 = rNNI(tree, moves=n)
    return(list(n=n, dist=wRF.dist(tree, tree2)))
}

old = read.tree(args[1])
new = read.tree(args[2])
threads = as.numeric(args[3]) # make sure this is numeric for mc.cores

# first we get trees with only the tips they have in common
old_d = drop.tip(old, setdiff(old$tip.label, new$tip.label))
new_d = drop.tip(new, setdiff(new$tip.label, old$tip.label))


obs = wRF.dist(old_d, new_d)

# now we compare this to the same distance calculated at ever increasing numbers of NNI moves
# to get a rough idea of how many random NNI moves this is equivalent to
nnis = seq(from=0, to=10000, length.out=101)

# turn off warnings for this, then turn them back on:
options(warn=-1)
random_nni_dists = mclapply(nnis, random_nni_distance, tree=old_d, mc.cores=threads)
options(warn=0)

# now let's plot the results so we can see what's up
dists = as.data.frame(do.call(rbind, random_nni_dists))
dists$dist = as.numeric(as.character(dists$dist))
dists$n = as.numeric(as.character(dists$n))


thresh = dists$n[max(which(dists$dist<obs))]

p1 = ggplot(dists, aes(x=n, y = dist)) +
    geom_point() +
    geom_smooth() +
    geom_hline(yintercept = obs, size=1, colour='red', alpha = 0.5, linetype='dashed') +
    ggtitle("Comparison of current tree to tree from previous iteration",
            subtitle = paste(paste("Observed distance: ", obs, "(red line)"), paste("Equivalent to ~", thresh, ' random NNIs'), sep="\n"))


########
# phylo signal of lineages
# we do this by
# 1. getting the observed tree length of each lineage
# 2. doing the same for 100 random sets of the same number of tips of each lineage
d = read.csv("metadata.csv")

e = subset(d, subset = d$covv_accession_id %in% new$tip.label)
f = e[match(new$tip.label, e$covv_accession_id),]
f$covv_lineage=as.factor(f$covv_lineage)


random_treelen = function(tree, n){
    # choose n random tips and get the tree length
    t = keep.tip(tree, sample(tree$tip.label, n))
    return(sum(t$edge.length))
}

get_signal = function(trait, tree, data){
    # tree is a tree
    # data is a vector IN THE SAME ORDER as the tree tip labels
    # trait is the specific trait in the 'data' vector you want to measure

    obs_t = drop.tip(tree, which(data!=trait))
    n = length(obs_t$tip.label)

    obs_l = sum(obs_t$edge.length)

    rep_l = sort(replicate(99, random_treelen(tree, n)))

    rep_lt = summary(rep_l)

    # one-sided p value from the rank of the observed tree length:
    # small when the lineage's tree length is shorter than expected for
    # a random set of the same number of tips (i.e. the lineage clusters)
    p_value = (sum(obs_l>rep_l)+1)/(length(rep_l)+1)

    return(list(trait=as.character(trait),
                obs = as.numeric(obs_l),
                p = as.numeric(p_value),
                n_tips = as.numeric(n),
                rep_Min = as.numeric(rep_lt[1]),
                rep_1stQu = as.numeric(rep_lt[2]),
                rep_Median = as.numeric(rep_lt[3]),
                rep_Mean = as.numeric(rep_lt[4]),
                rep_3rdQu = as.numeric(rep_lt[5]),
                rep_Max = as.numeric(rep_lt[6])))
}

clustering = mclapply(levels(f$covv_lineage), get_signal, tree=new, data=f$covv_lineage, mc.cores=threads)
clustering = as.data.frame(do.call(rbind, clustering))
clustering$trait = as.character(clustering$trait)

num = 2:ncol(clustering)
clustering[num] <- sapply(clustering[num],as.numeric)

# only keep the ones where we can reasonably estimate clustering
# so 5 or more members of a lineage
clustering = subset(clustering, n_tips>4)

#### Write out a log file ####
dir.create("./QC")
write.csv(clustering, "./QC/lineage_clustering.csv")
write.csv(dists, "./QC/topological_distances_randomNNI.csv")

pdf("./QC/tree_comparison_plot.pdf")
print(p1)
dev.off()


sink("./QC/QC_report.md", append=TRUE)

cat("# QC Report for the phylogenetic tree\n")
cat("\n")
cat("## Comparison to previous iteration's tree\n")
cat(paste("wRF distance between previous and new iteration's trees:", obs, "\n"))
cat(paste("This is equivalent to a number of random NNIs equal to roughly:", thresh, "\n"))
cat("\n")

cat("## Clustering of Pangolin lineages\n")
cat(paste(sum(clustering$p<0.05), "out of", nrow(clustering), "Pangolin lineages show significant clustering\n"))
cat("\n")
cat("#### Summary of p values\n")

print(summary(clustering$p))
sink()
26 changes: 26 additions & 0 deletions scripts/add_seqs_to_global_tree.sh
@@ -0,0 +1,26 @@
# new approach

# the script should be almost the same as the global phylogeny script

# with the sole difference that a starting tree is given

# this tree should contain at least some of the same sequence names as given in the alignment

# the process here is fairly simple (I think...)

# 1. Use R to trim the input tree down to only sequences found in the final aligned fasta file (global.fa)

# get names from alignment
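# e.g. as in global_tree_gisaid_start_tree.sh (assuming the alignment is global.fa):
grep ">" global.fa | cut -c 2- > alignment_names.txt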


# 2. Use IQ-TREE with the tree from (1) as a starting tree, using MP to place the new sequences

# 3. Use the tree from 2 to input as a starting tree to fasttree


# Of course, this makes bootstraps redundant. So there are two options here:

# 1. Just use fasttree SH supports (seems fine for now)
# 2. Try IQ-TREE aLRT supports (issue: uses a lot of memory, possibly slow? let's see)


40 changes: 40 additions & 0 deletions scripts/add_seqs_to_tree.sh
@@ -0,0 +1,40 @@
#!/bin/bash

helpFunction()
{
echo "Input a tree and an alignment - remove sequences from the tree that are not in the alignment, then add sequences to the tree that are in the alignment but not the tree"
echo "Usage: $0 -i fasta_file_of_SARS-CoV-2_seqs -o output_file -t threads"
echo "\t-i Full path to input alignment of SARS-CoV-2 sequences"
echo "\t-s Full path to input starting tree of SARS-CoV-2 sequences"
echo "\t-o Full path to output tree, which will contain all sequences in the alignment"
echo "\t-t number of threads to use"
exit 1 # Exit script after printing help
}

while getopts "i:s:o:t:" opt
do
case "$opt" in
i ) inputfasta="$OPTARG" ;;
i ) inputtree="$OPTARG" ;;
o ) outputtree="$OPTARG" ;;
t ) threads="$OPTARG" ;;
? ) helpFunction ;; # Print helpFunction in case parameter is non-existent
esac
done

# Print helpFunction in case parameters are empty
if [ -z "$inputfasta" ] || [ -z "$inputtree" ] || [ -z "$outputtree" ] || [ -z "$threads" ]
then
echo "Some or all of the parameters are empty";
helpFunction
fi


# 1. Get the sequence names from the alignment

grep ">" $inputfasta | cut -c 2- > alignment_names.txt

# 2. Remove sequences from the tree that are not in the alignment
# this can happen e.g. if a sequence was removed from GISAID, or excluded because it was on a long branch
# or if the filtering conditions changed
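
# a minimal sketch of this step (untested here), reusing clean_tree.R from this repo,
# which writes input_tree_cleaned.tree and alignment_names_new.txt
DIR="$(cd "$(dirname "$0")" && pwd)"
Rscript $DIR/clean_tree.R $inputtree alignment_names.txt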

4 changes: 2 additions & 2 deletions scripts/clean_gisaid.sh
@@ -43,9 +43,9 @@ sed -i.bak '/^$/d' $input_seqs


# next we remove sequences in the excluded_sequences.tsv file
echo "Removing sequences in exluded_sequence.tsv"
echo "Removing sequences in excluded_sequence.tsv"
BASEDIR=$(dirname "$0")
exseq=$BASEDIR/../excluded_sequences.tsv
exseq=previous_iteration_files/excluded_sequences.tsv
cut -f1 $exseq | faSomeRecords $input_seqs /dev/stdin $input_seqs"tmp.fa" -exclude
rm $input_seqs
mv $input_seqs"tmp.fa" $input_seqs
28 changes: 28 additions & 0 deletions scripts/clean_tree.R
@@ -0,0 +1,28 @@
#!/usr/bin/env Rscript

# usage: Rscript clean_tree.R input_tree names_of_species_in_alignment.txt
# the names file should be one name per line

args = commandArgs(trailingOnly=TRUE)

library(ape)
library(readr)

print("Input tree")
print(args[1])
print("input names")
print(args[2])


t = read.tree(args[1])
a = readLines(args[2])

drop = t$tip.label[which(t$tip.label %in% a == FALSE)]

new = a[which(a %in% t$tip.label == FALSE)]

write_lines(new, "alignment_names_new.txt")

t2 = drop.tip(t, drop)

write.tree(t2, "input_tree_cleaned.tree")
2 changes: 1 addition & 1 deletion scripts/global_profile_alignment.sh
@@ -3,7 +3,7 @@
helpFunction()
{
echo "Align a set of sequences in parallel to WH1 seuqence NC_045512.2"
echo "Usage: $0 -i unaligned_fasta_path -o output_file -r reference_alignment -t threads"
echo "Usage: $0 -i unaligned_fasta_path -o output_file -t threads"
echo "\t-i Full path to unaligned fasta file of SARS-CoV-2 sequences"
echo "\t-o Output file path for aligned fasta file. New sequences will be *APPENDED* to this file"
echo "\t-t number of threads to use"
203 changes: 203 additions & 0 deletions scripts/global_tree_gisaid_start_tree.sh
@@ -0,0 +1,203 @@
#!/bin/bash

helpFunction()
{
echo "Make an ML phylogeny from a large FASTA file of GISAID sequences"
echo "Usage: $0 -i GISAID_fasta -o output_filename -s start_tree -t threads "
echo " -i Full path to unaligned fasta file of SARS-CoV-2 sequences from GISAID"
echo " -p Filpath to previous iteration to be updated (must contain an ft_SH.tree file and an excluded_sequences.tsv file)"
echo " -t number of threads to use"
exit 1 # Exit script after printing help
}

while getopts "i:p:t:" opt
do
case "$opt" in
i ) inputfasta="$OPTARG" ;;
p ) previous="$OPTARG" ;;
t ) threads="$OPTARG" ;;
? ) helpFunction ;; # Print helpFunction in case parameter is non-existent
esac
done


# Print helpFunction in case parameters are empty
if [ -z "$inputfasta" ] || [ -z "$previous" ] || [ -z "$threads" ]
then
echo "Some or all of the parameters are empty";
helpFunction
fi

DIR="$(cd "$(dirname "$0")" && pwd)"

inputdir=$(dirname $inputfasta)
outputfasta=global.fa
inputtree=previous_iteration_files/ft_SH.tree # this is copied over in a few lines

cd $inputdir

# first we copy over the previous iterations files for reproducibility
mkdir previous_iteration_files
cp $previous'/ft_SH.tree' previous_iteration_files/
cp $previous'/excluded_sequences.tsv' previous_iteration_files/

#check the copying worked
if [ -f "previous_iteration_files/ft_SH.tree" ]; then
echo "previous_iteration_files/ft_SH.tree exists."
else
echo "previous_iteration_files/ft_SH.tree does not exist. Something's wrong. Exiting."
exit 1
fi

if [ -f "previous_iteration_files/excluded_sequences.tsv" ]; then
echo "previous_iteration_files/excluded_sequences.tsv exists."
else
echo "previous_iteration_files/excluded_sequences.tsv does not exist. Something's wrong. Exiting."
exit 1
fi


# first we trim the sequences
echo ""
echo "Cleaning raw data"
echo ""

cleaned_gisaid=$inputfasta"_cleaned.fa"
bash $DIR/clean_gisaid.sh -i $inputfasta -o $cleaned_gisaid -t $threads


#### BUILD THE GLOBAL ALIGNMENT ######

echo ""
echo "Making global profile alignment"
echo ""
aln_global="$inputdir/aln_global_unmasked.fa"
bash $DIR/global_profile_alignment.sh -i $cleaned_gisaid -o $aln_global -t $threads


echo ""
echo "Masking alignment"
echo ""
aln_global_masked="$inputdir/aln_global_masked.fa"
bash $DIR/mask_alignment.sh -i $aln_global -o $aln_global_masked -t $threads


echo ""
echo "Filtering sequences that are shorter than 28000 bp and/or have >1000 ambiguities"
echo ""
aln_global_filtered="$inputdir/aln_global_filtered.fa"
esl-alimanip --lmin 28000 --xambig 1000 --informat afa --outformat afa --dna -o $aln_global_filtered $aln_global_masked

echo ""
echo "Removing sites that are >50% gaps, after converting N's to gaps"
echo ""

cp $aln_global_filtered tmp.aln
sed -i.bak '/^[^>]/s/N/-/g' tmp.aln
rm tmp.aln.bak

esl-alimask --gapthresh 0.5 --informat afa --outformat afa --dna -o $outputfasta -g tmp.aln

rm tmp.aln

echo "sequences downloaded from GISAID" >> alignments.log
echo $(grep ">" $inputfasta | wc -l) >> alignments.log
echo "//" >> alignments.log
echo "alignment stats of global alignment" >> alignments.log
esl-alistat $aln_global >> alignments.log
echo "alignment stats of global alignment after masking sites" >> alignments.log
esl-alistat $aln_global_masked >> alignments.log
echo "alignment stats after filtering out short/ambiguous sequences" >> alignments.log
esl-alistat $aln_global_filtered >> alignments.log
echo "alignment stats of global alignment after trimming sites that are >50% gaps" >> alignments.log
esl-alistat $outputfasta >> alignments.log


#### ESTIMATE THE GLOBAL TREE ######


# 1. Take the alignment and the tree, remove the irrelevant sequences with R, then add the new sequences with IQ-TREE

echo ""
echo "Removing unused sequences from input tree"
echo ""
grep ">" $outputfasta | cut -c 2- > alignment_names.txt
Rscript $DIR/clean_tree.R $inputtree alignment_names.txt

echo ""
echo "Adding new sequences to input tree with IQ-TREE"
echo ""
# get the latest IQ-TREE
wget https://github.com/iqtree/iqtree2/releases/download/v2.1.0/iqtree-2.1.0-Linux.tar.gz
tar -xvzf iqtree-2.1.0-Linux.tar.gz

# this just adds the new sequences with parsimony
# benchmarking shows that 1 thread is optimal
./iqtree-2.1.0-Linux/bin/iqtree2 -seed 1729 -s $outputfasta -g input_tree_cleaned.tree -n 0 -m JC -fixbr -nt 1 --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -pre iqtree_seqsadded_mp

echo ""
echo "Optimising tree with fasttree MP"
echo ""
# we have to do some contortions to set the optimal number of threads for fasttree, which is 3 (see fasttreeOMP.md)
env > old_env.txt
old_threads=$(grep -hoP '^OMP_NUM_THREADS=\K\d+' old_env.txt)
rm old_env.txt
export OMP_NUM_THREADS=3
FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log fasttree.log -intree iqtree_seqsadded_mp.treefile $outputfasta > $outputfasta'_ft_SH.tree'
if [ -n "$old_threads" ]; then
export OMP_NUM_THREADS=$old_threads
else
unset OMP_NUM_THREADS
fi

echo ""
echo "Cleaning tree with treeshrink"
echo ""
run_treeshrink.py -t $outputfasta'_ft_SH.tree' -q 0.05 -c -o treeshrink_SH

# now we update the excluded sequences file
echo ""
echo "Updating excluded sequences file"
echo ""
Rscript $DIR/update_excluded_seqs.R previous_iteration_files/excluded_sequences.tsv treeshrink_SH/global.fa_ft_SH_RS_0.05.txt

echo ""
echo "Re-rooting tree on hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05"
echo "see https://www.biorxiv.org/content/10.1101/2020.04.17.046086v1"
echo ""
nw_reroot 'treeshrink_SH/'$outputfasta'_ft_SH_0.05.tree' "'EPI_ISL_406801'" > ft_SH.tree


sed -i.bak "s/'//g" ft_SH.tree
rm ft_SH.tree.bak


echo "After filtering sequences with TreeShrink" >> alignments.log
nw_stats ft_SH.tree >> alignments.log

echo "//"
echo "Number of new sequences added this iteration" >> alignments.log
wc -l alignment_names_new.txt >> alignments.log

# run QC
Rscript $DIR/QC.R previous_iteration_files/ft_SH.tree ft_SH.tree $threads

# zip up for easy file transfer
xz -e -T $threads $outputfasta
xz -e -T $threads $aln_global
xz -e -T $threads $aln_global_filtered
xz -e -T $threads $aln_global_masked
# note: the unmasked alignment is $aln_global, already compressed above
xz -e -T $threads $inputfasta
xz -e -T $threads $inputfasta"_cleaned.fa"
xz -e -T $threads fasttree.log
xz -e -T $threads iqtree_seqsadded_mp.iqtree

rm goalign_amd64_linux
rm -rf iqtree-2.1.0-Linux/
rm iqtree_seqsadded_mp.uniqueseq.phy
rm iqtree_seqsadded_mp.parstree
rm iqtree-2.1.0-Linux.tar.gz
rm iqtree_seqsadded_mp.ckp.gz
rm metadata.csv
rm reference.fa
23 changes: 23 additions & 0 deletions scripts/lsd2.sh
@@ -0,0 +1,23 @@
# date with LSD2

git clone https://github.com/tothuhien/lsd2.git
cd lsd2/src/
make

cd ../..

# get the dates
nw_labels -I ft_SH.tree > labels.txt
cut -d '|' -f 3 labels.txt > dates.txt
paste labels.txt dates.txt > datefile.txt

# add number of lines to start of datefile.txt
l=$(wc -l datefile.txt | cut -d' ' -f1)

echo $l | cat - datefile.txt > temp && mv temp datefile.txt
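
# datefile.txt now looks something like this (illustrative):
#   <number of tips>
#   hCoV-19/Country/ID/2020|EPI_ISL_XXXXXX|2020-03-01<tab>2020-03-01
#   ...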

# lsd2
# -i input file
# -s sequence length
# -l 0.000000005 because this is the minimum brlen in the ft_SH.tree
./lsd2/bin/lsd2_unix -i ft_SH.tree -s 29000 -l 0.000000005 -d datefile.txt -v 1 -a 'b(2019-10-30,2019-12-17)'
91 changes: 0 additions & 91 deletions scripts/tree_ft.sh

This file was deleted.

22 changes: 22 additions & 0 deletions scripts/update_excluded_seqs.R
@@ -0,0 +1,22 @@
#!/usr/bin/env Rscript

# usage: Rscript update_excluded_seqs.R old_iteration_file treeshrink_file
args = commandArgs(trailingOnly=TRUE)

old_file = args[1]
old = read.delim(old_file, header=F)

treeshrink = args[2]

if (file.info(treeshrink)$size < 2) {
    new = old
} else {
    e = read.delim(treeshrink, header=F)
    e = t(e)
    e = e[complete.cases(e)]
    f = data.frame(cbind(e, "on long branch identified by treeshrink q=0.05"))
    names(f)=names(old)
    new = rbind(old, f)
}

write.table(new, file="excluded_sequences.tsv", sep="\t", col.names = F, row.names = F, quote=F)
Binary file added spr_iterations.png
113 changes: 108 additions & 5 deletions tree_estimation2.md
@@ -561,41 +561,144 @@ By mistake, I left out the `-noml` command in the last example. Turns out the ML
11.7.4 (after 3 rounds of SPRs of length 200): -330438
11.7.5 (after 3 rounds of SPRs of length 100): -330452




#### Iteration 12

A quick test of the latest version of IQ-TREE, which we've been working on making faster...


ME SPRs followed by ML NNIs

```
/usr/bin/time -o 12.1.1.mem.txt -v fasttree -nt -gamma -nosupport -sprlength 500 -nni 0 -spr 20 -refresh 0.8 -topm 1.5 -close 0.75 global.fa > 12.1.1.tree
/usr/bin/time -o 12.1.2.mem.txt -v fasttree -nt -gamma -nosupport -sprlength 50 -nni 0 -spr 20 -intree 12.1.1.tree global.fa > 12.1.2.tree
/usr/bin/time -o 12.1.3.mem.txt -v fasttree -nt -gamma -nosupport -sprlength 20 -intree 12.1.2.tree global.fa > 12.1.2.tree
wget https://github.com/iqtree/iqtree2/releases/download/v2.0.8/iqtree-2.0.8-Linux.tar.gz
tar -xvzf iqtree-2.0.8-Linux.tar.gz
/usr/bin/time -o iqtree2.08.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t BIONJ -pre BIONJ
/usr/bin/time -o iqtree2.08.NJ.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ -pre NJ
/usr/bin/time -o iqtree2.08.NJ-R.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -pre NJ-R
/usr/bin/time -o iqtree2.08.BIONJ-R.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t BIONJ-R -pre BIONJ-R
```

It works well, and fast too!!

Some results with -experimental, 4 cores, on soma (lots of memory). I show both the lnL of the starting NJ tree and the lnL of the final tree (after just two rounds of NNI). All with the JC model.

`-t BIONJ`
```
Percent of CPU this job got: 380%
Elapsed (wall clock) time (h:mm:ss or m:ss): 8:27:24
Maximum resident set size (kbytes): 50745644
Log-likelihood of BIONJ tree: -352742.080492
BEST SCORE FOUND : -347506.937
```

`-t NJ`
```
Percent of CPU this job got: 378%
Elapsed (wall clock) time (h:mm:ss or m:ss): 7:23:31
Maximum resident set size (kbytes): 50745304
Log-likelihood of NJ tree: -349888.823021
BEST SCORE FOUND : -346814.328
```

`-t NJ-R`
```
Percent of CPU this job got: 377%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:15:14
Maximum resident set size (kbytes): 50746932
Log-likelihood of NJ-R tree: -349910.476699
BEST SCORE FOUND : -346885.416
```

`-t BIONJ-R`
```
Percent of CPU this job got: 381%
Elapsed (wall clock) time (h:mm:ss or m:ss): 6:47:39
Maximum resident set size (kbytes): 50746768
Log-likelihood of BIONJ-R tree: -352887.323284
BEST SCORE FOUND : -347512.641
```

NJ-R provides the best balance of speed and accuracy, so I'll go with that for now, noting though that NJ gives very slightly higher likelihoods.


#### Iteration 13

Now I'll try to squeeze higher likelihoods out of IQ-TREE. A few obvious things to try first.


1. Reduce the minimum branch length (unlike fasttree, IQ-TREE doesn't have hard polytomies. Given that some polytomies in this tree have thousands of sequences, we probably want the minimum branch length to be at most 1/10000 of the branch length corresponding to a single substitution. The alignment is about 30,000 bases, so 1 substitution corresponds to a branch length of 1/30000, and 1/10000th of that is 0.000000003; see the quick check after this list. For convenience I'll just round this even further down to 0.0000000001)

2. Better models of rates (I'll move on to molecular evolution later)
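
A quick check of the branch length arithmetic in point 1 (same numbers as stated there):

```
# one substitution over ~30,000 sites, then 1/10,000 of that
awk 'BEGIN { printf "%.1e\n", 1/30000/10000 }'   # prints 3.3e-09
```

The runs: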

```
/usr/bin/time -o 13.1.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.1
/usr/bin/time -o 13.2.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+G -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.2
/usr/bin/time -o 13.3.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+G+I -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.3
/usr/bin/time -o 13.4.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R1 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.4
/usr/bin/time -o 13.5.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R2 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.5
/usr/bin/time -o 13.6.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R3 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.6
/usr/bin/time -o 13.7.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R4 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.7
/usr/bin/time -o 13.8.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+R4 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.8
```


Results:

13.1
lnL_NJ-R: -346219.079852
lnL: -345413.780
time: 4h:5m:0s
mem: 50746412

13.2
lnL_NJ-R: -334571.368294
lnL: -333702.914
time: 16h:45m:52s
mem: 105633656

13.3
lnl_NJ-R: -333649.112195
lnL: -332890.756
time: 19h:18m:2s
mem: 106700024

13.4
lnl_NJ-R: -337466.943109
lnL: -336651.617
time: 10h:47m:40s
mem: 56482076

13.5
lnl_NJ-R: -332073.220730
lnL: -331279.983
time: 17h:40m:5s
mem: 75205400

13.6
lnl_NJ-R: -331545.424656
lnL: -330704.610
time: 30h:30m:50s
mem: 95032788

At this point it's clear that IQ-TREE has the *potential* to do just as well as fasttree. What's required are a few things though. First, we need to compare the two directly in a fair way. Second, we need to move on to new alignments since there are substantial new masking methods for the latest alignments. Finally, we need a sensible way of asking whether the huge differences in likelihood we observe here are due to improvements in the tree topology, or just the model of evolution.

So, it's time to move on to a new setup.