Comparing changes

base repository: roblanf/sarscov2phylo
base: 09-7-20
head repository: roblanf/sarscov2phylo
compare: master

Commits on Jul 10, 2020

  1. Update excluded_sequences.tsv

    roblanf committed Jul 10, 2020
    6b29d92

Commits on Jul 12, 2020

  1. c0e9463

Commits on Jul 24, 2020

  1. 3dd06e6
  2. 6aca857
  3. add citation info

    roblanf committed Jul 24, 2020
    423efd1

Commits on Aug 3, 2020

  1. update acknowledgements

    roblanf committed Aug 3, 2020
    56b79e5
  2. update trees

    roblanf committed Aug 3, 2020
    c96e16e
  3. 10d6272

Commits on Aug 7, 2020

  1. 763e264
  2. compare IQ-TREE and fasttree

    details stil coming, but fasttree is still the best for now I think.
    roblanf committed Aug 7, 2020
    c95eb39
  3. update user info

    roblanf committed Aug 7, 2020
    09c207a
  4. Create .gitignore

    roblanf committed Aug 7, 2020
    7a8716c
  5. Create add_seqs_to_global_tree.sh

    This was a first attempt at this, but it lacks robustness, so I think I need to redesign it.
    roblanf committed Aug 7, 2020
    1fa8337

Commits on Aug 14, 2020

  1. update iqtreevsfasttree

    roblanf committed Aug 14, 2020
    fc708bd
  2. 5d07d0b
  3. Update iqtree_sequential.md

    roblanf committed Aug 14, 2020
    ac1a4e5

Commits on Aug 15, 2020

  1. first attempt at a full iterative approach

    something like online phylogenetics, or at least, a start.
    roblanf committed Aug 15, 2020
    de362fe
  2. Update iqtree_sequential.md

    roblanf committed Aug 15, 2020
    9695d90
  3. fix typo

    roblanf committed Aug 15, 2020
    e77790f
  4. 428d8c9
  5. 52d0a13

Commits on Aug 17, 2020

  1. user output for R script

    really it's for debugging....
    roblanf committed Aug 17, 2020
    a42b42a
  2. more debug

    roblanf committed Aug 17, 2020
    5d79bbe
  3. debugged?

    roblanf committed Aug 17, 2020
    14442bd
  4. update acknowledgements

    roblanf committed Aug 17, 2020
    5d3aaa7
  5. 412cc61
  6. Update README.md

    roblanf committed Aug 17, 2020
    d39ef54
  7. update tree

    roblanf committed Aug 17, 2020
    1343840
  8. update excluded sequences

    with sequences from 12/8/20
    roblanf committed Aug 17, 2020
    e3a48da
  9. 8dbc82a
  10. log fasttree progress

    in case things like lnL values are useful in future
    roblanf committed Aug 17, 2020
    df94cb7

Commits on Aug 18, 2020

  1. Update iqtree_vs_fasttree.md

    roblanf committed Aug 18, 2020
    400144d
  2. 06156ef

Commits on Aug 19, 2020

  1. Create LICENSE

    roblanf authored Aug 19, 2020

    696b9dc
  2. Merge pull request #14 from roblanf/add-license-1

    Create LICENSE
    roblanf authored Aug 19, 2020
    8f2033b

Commits on Aug 20, 2020

  1. Update excluded_sequences.tsv

    roblanf committed Aug 20, 2020
    a0a988a
  2. 0be50ee

Commits on Aug 21, 2020

  1. c854ccb
  2. Update excluded_sequences.tsv

    roblanf committed Aug 21, 2020
    175b74b
  3. Update ft_SH.tree

    roblanf committed Aug 21, 2020
    fe864ee
  4. 5ece9d2

Commits on Aug 22, 2020

  1. Update ft_SH.tree

    roblanf committed Aug 22, 2020
    b8e6868

Commits on Aug 28, 2020

  1. 4b033af
  2. Update excluded_sequences.tsv

    roblanf committed Aug 28, 2020
    b7b646e
  3. Update environment.yml

    running it on a new server made me realise I left out some R dependencies. Sorry to users...!
    roblanf committed Aug 28, 2020
    6bbf449
  4. Update ft_SH.tree

    roblanf committed Aug 28, 2020
    ba099fa
  5. Update excluded_sequences.tsv

    roblanf committed Aug 28, 2020
    aae469f

Commits on Aug 31, 2020

  1. update tree

    roblanf committed Aug 31, 2020
    047539c
  2. Update excluded_sequences.tsv

    roblanf committed Aug 31, 2020
    2c788f3

Commits on Sep 2, 2020

  1. update acknowlegements

    roblanf committed Sep 2, 2020
    343518a
Showing 47 changed files with 5,317 additions and 626 deletions.
  1. +3 −0 .gitignore
  2. +13 −0 .gitlab-ci.yml
  3. +674 −0 LICENSE
  4. +44 −15 README.md
  5. +80 −0 SPR_analysis.md
  6. BIN acknowledgements/gisaid_hcov-19_acknowledgement_tableJuly2nd_to_July9th.pdf
  7. BIN ...e19th2020_to_July2nd2020.pdf → gisaid_hcov-19_acknowledgement_table_2020-06-19_to_2020-07-02.pdf}
  8. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-07-02_to_2020-07-12.pdf
  9. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-07-12_to_2020-07-24.pdf
  10. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-07-24_to_2020-07-31.pdf
  11. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-07-31_to_2020-08-07.pdf
  12. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-08-07_to_2020-08-14.pdf
  13. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-08-14_to_2020-08-21.pdf
  14. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-08-21_to_2020-08-28.pdf
  15. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-08-28_to_2020-09-11.pdf
  16. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-09-11_to_2020-09-18.pdf
  17. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-09-18_to_2020-09-25.pdf
  18. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-09-25_to_2020-09-27.pdf
  19. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-09-28_to_2020-09-28.pdf
  20. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-09-29_to_2020-10-04.pdf
  21. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-10-05_to_2020-10-14.pdf
  22. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-10-15_to_2020-10-21.pdf
  23. BIN acknowledgements/gisaid_hcov-19_acknowledgement_table_2020-10-22_to_2020-10-26.pdf
  24. +16 −0 ci/Dockerfile
  25. +7 −0 environment.yml
  26. +3,313 −509 excluded_sequences.tsv
  27. +170 −0 fasttreeOMP.md
  28. BIN ftOMP.xlsx
  29. +0 −1 ft_FBP.tree
  30. +1 −1 ft_SH.tree
  31. +0 −1 ft_TBE.tree
  32. +191 −0 iqtree_sequential.md
  33. +193 −0 iqtree_vs_fasttree.md
  34. +15 −0 onlinephylo.md
  35. BIN onlinephylo.xlsx
  36. +144 −0 scripts/QC.R
  37. +26 −0 scripts/add_seqs_to_global_tree.sh
  38. +40 −0 scripts/add_seqs_to_tree.sh
  39. +2 −2 scripts/clean_gisaid.sh
  40. +28 −0 scripts/clean_tree.R
  41. +1 −1 scripts/global_profile_alignment.sh
  42. +203 −0 scripts/global_tree_gisaid_start_tree.sh
  43. +23 −0 scripts/lsd2.sh
  44. +0 −91 scripts/tree_ft.sh
  45. +22 −0 scripts/update_excluded_seqs.R
  46. BIN spr_iterations.png
  47. +108 −5 tree_estimation2.md
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@

~$ft_models.xlsx
scripts/add_seqs_to_global_tree.sh
13 changes: 13 additions & 0 deletions .gitlab-ci.yml
@@ -0,0 +1,13 @@
stages:
- build_image
- test_env

build_image:
stage: build_image
script:
- docker build -t sarscov2phylo:$CI_COMMIT_SHORT_SHA -f ci/Dockerfile .

test_env:
stage: test_env
script:
- docker run sarscov2phylo:$CI_COMMIT_SHORT_SHA conda run -n sarscov2phylo esl-alimanip -h
674 changes: 674 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

59 changes: 44 additions & 15 deletions README.md
@@ -1,24 +1,49 @@
# Before you start - know that this code is not state of the art

This repository is no longer maintained. I leave it here for reference only (the rest of this readme is pretty much unchanged).

If you want to estimate your own trees of SARS-CoV-2, please look into:

* UShER and matOptimize: https://usher-wiki.readthedocs.io/en/latest/matOptimize.html
* MAPLE: https://github.com/NicolaDM/MAPLE
* Anything even better which might have come along since I wrote this (May 8th 2023)















# A global phylogeny of SARS-CoV-2

This repository provides a regularly-updated global phylogeny of SARS-CoV-2 data from GISAID, as well as reproducible methods with which to estimate it.
This repository used to provide a regularly-updated global phylogeny of SARS-CoV-2 data from GISAID, as well as reproducible methods with which to estimate it.

We gratefully acknowledge the authors, originating and submitting laboratories of the sequences from GISAID’s EpiFlu™ Database on which this research is based. The list is detailed in [this file](https://github.com/roblanf/sarscov2phylo/blob/master/acknowledgements/).

# Key files
# Can I use it and cite it?

Yes. Every new release from 22-July-20 to 13-November-20 is available at this DOI: [![DOI](https://zenodo.org/badge/260124648.svg)](https://zenodo.org/badge/latestdoi/260124648)

The latest tree with the associated code to reproduce that tree can always be obtained from the latest release of this repository [here](https://github.com/roblanf/sarscov2phylo/releases/latest). Releases are named by the date (in Australian time) on which the sequences were downloaded from GISAID.
Releases subsequent to 13-November-20 are available on the GISAID website. To obtain the trees and alignments from GISAID you'll need a membership, but this is free and new requests are processed quickly. Once you have a GISAID membership, follow these steps to get the tree and the corresponding alignment:

For convenience, you can also get the latest results via the following links:
1. Log in to GISAID using your GISAID login
2. In the top menu bar, click the "EpiCoV" tab
3. Click the "Audacity" tile (with a big rainbow tree on it, near the bottom) to get the tree
4. Click the "Downloads" link just under the "EpiCoV" tab, then choose the alignment you want to download

* [Latest global phylogeny with Transfer Bootstrap Supports](https://github.com/roblanf/sarscov2phylo/blob/master/ft_TBE.tree)
* [Latest global phylogeny with Felsenstein Bootstrap Supports](https://github.com/roblanf/sarscov2phylo/blob/master/ft_FBP.tree)
* [Latest global phylogeny with fasttree SH Supports](https://github.com/roblanf/sarscov2phylo/blob/master/ft_SH.tree)
* [Acknowledgements file for those that upload to GISAID](https://github.com/roblanf/sarscov2phylo/blob/master/acknowledgements/)
* [Latest version of the script to produce a global tree](https://github.com/roblanf/sarscov2phylo/blob/master/scripts/global_tree_gisaid.sh)
The alignment that is used to make the tree is the one labelled "MSA masked" followed by the date. Older alignments are in the 'Archive' box.

Privacy rules around the alignments themselves mean that they cannot be released here. The alignments can be recreated by following the steps described below. If you are a GISAID member and would like a copy of the alignment for any specific tree in the releases, please email me and I'll share it with you.
# Key files

Also, please note that the script to produce the global tree is continually modified as new data are released, so if you are looking at the master branch it may be the case that the latest code will not exactly reproduce the latest tree. If you would like to reproduce the latest tree, please head over to the [latest release](https://github.com/roblanf/sarscov2phylo/releases/latest), which contains the latest tree and the version of the code used to produce it.
The latest tree and alignment, with the associated code to reproduce that tree, can always be obtained from GISAID (see above).

![Part of global SARS-CoV-2 phylogeny](https://github.com/roblanf/sarscov2phylo/blob/master/tree_image.jpg)

@@ -38,7 +63,9 @@ Sequences are filtered out for a few reasons:

If your sequence is in GISAID, and was submitted before the date noted in the [latest release of the repository](https://github.com/roblanf/sarscov2phylo/releases/latest), but it is not in the tree, then it was filtered for one of the above reasons.

# Why are there three trees, and what are all the numbers?
# Why are there three trees in some releases, and what are all the numbers?

Up to and including 31-7-20, each release contains three trees. This section describes the differences between those trees. After 31-7-20 there is only one tree in each release.

The topology and branch lengths of the three trees are identical. In all cases, the topology is the best topology estimated by `fasttree` with options tuned specifically for this dataset, see [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md) and [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation2.md). The branch lengths represent substitutions per site. You will see that if you multiply the branch lengths by about 30,000 (which is roughly the length of the alignments) many of the branch lengths are close to integers. That's because there's very little variation in these sequences, meaning that many branches have some integer number of changes inferred on them.

@@ -89,6 +116,8 @@ conda env create -f environment.yml
conda activate sarscov2phylo
```

Finally you will need to compile and install the MP version of FastTree from here: http://www.microbesonline.org/fasttree/#Install
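
For reference, a minimal sketch of that compile step (it mirrors the commands used later in `fasttreeOMP.md`; where you put the resulting `FastTreeMP` binary is up to you):

```
wget http://www.microbesonline.org/fasttree/FastTree.c
gcc -DOPENMP -fopenmp -DUSE_DOUBLE -O3 -finline-functions -funroll-loops -Wall -o FastTreeMP FastTree.c -lm
```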

# Quickstart

1. Get the latest [GISAID data](https://www.gisaid.org/) in a single fasta file, e.g. `gisaid.fa`; you can do this via the batch download feature on [GISAID](https://www.gisaid.org/), though you will have to become a member to do this.
@@ -113,14 +142,14 @@ Here's what the code does:

6. Calculates and prints to `alignments.log` the stats of all alignments, for simple sanity checking.

7. Estimates a global ML tree from `global.fa`. This is done by using `fasttree` with settings determined empirically to be the best, which are constantly updated; see [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md) and [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation2.md). The scripts then use `goalign` to create 100 bootstrap alignments, followed by re-estimating all the ML trees with `fasttree` and the `-fastest` setting, using GNU `parallel` to manage parallelisation. FBP and TBE values are calculated with `gotree`, and SH values are calculated with `fasttree`. The resulting three trees are rooted with the sequence 'hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05' as suggested in [this preprint](https://www.biorxiv.org/content/10.1101/2020.04.17.046086v1), using [`nw_reroot`](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2887050/). This creates the files `global.fa_ft_TBE.tree`, `global.fa_ft_FBP.tree`, and `global.fa_ft_SH.tree`.
7. Estimates a global ML tree from `global.fa`. [NB: the methods here changed substantially subsequent to 31-7-20; please see previous releases for the previous methods]. This is done by using `fasttree` with settings determined empirically to be the best, which are constantly updated; see [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md), [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation2.md), and [here](https://github.com/roblanf/sarscov2phylo/blob/master/iqtree_sequential.md). The global tree is estimated in two steps. First, we start with the best tree from the previous release and *add* any new sequences to that tree using Maximum Parsimony in IQ-TREE (see [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md) and [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation2.md) for a demonstration that MP works extremely well with these data). Second, we further optimise that tree with a series of minimum evolution SPR moves and Maximum Likelihood NNI moves in `fasttree` (a rough sketch of this call is given below these steps). Details of the benchmarking behind these choices are [here](https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md) and [here](https://github.com/roblanf/sarscov2phylo/blob/master/iqtree_sequential.md). We use `fasttree` to calculate SH supports on the branches of that tree. The resulting tree is rooted with the sequence 'hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05' as suggested in [this preprint](https://www.biorxiv.org/content/10.1101/2020.04.17.046086v1), using [`nw_reroot`](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2887050/). This creates the file `global.fa_ft_SH.tree`.

8. Removes sequences on very long branches from the tree using [`TreeShrink`](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4620-2). These sequences are likely to be either of poor quality and/or poorly aligned, so rather unreliable to interpret in a phylogeny with such limited variation. They are subsequently added to the list of excluded sequences so they are not included in future iterations of the pipeline. This creates the files `ft_TBE.tree`, `ft_FBP.tree`, and `ft_SH.tree`.
8. Removes sequences on very long branches from the tree using [`TreeShrink`](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4620-2). These sequences are likely to be either of poor quality and/or poorly aligned, so rather unreliable to interpret in a phylogeny with such limited variation. They are subsequently added to the list of excluded sequences so they are not included in future iterations of the pipeline. This creates the file `ft_SH.tree`.

9. Roots the final two trees with [`nw_reroot`](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2887050/) as in step 7.


# One last thing

The trees and alignments produced by the scripts here are checked regularly by eye. No matter how clever the software tools, there is no replacement for the human eye when it comes to spotting anomalies. If I find an anomaly, I go back and change the scripts to deal with it automatically (e.g. by removing a sequence or masking a site, changing alignment or tree building parameters, etc). Rest assured that each release contains the exact code that built the trees included in that release, and also that I would not publish a release with a known anomaly. Obviously I can't guarantee to find all of the problematic sequences or sites, but I try.


80 changes: 80 additions & 0 deletions SPR_analysis.md
@@ -0,0 +1,80 @@
# How many SPR moves do I need in fasttree?

This is a question I tried to address before, and I settled on a (very) conservatively high number of 10. Since then, I've been updating the global tree every two days by adding sequences with Maximum Parsimony and then updating the tree with ME and ML moves in fasttree. By far the slowest step is the ME SPR moves, and I currently do 10 rounds of range 1000 (i.e. the range is the whole tree).

In this document I'll just take a quick look at how much the tree changes in each of those 10 rounds of SPR moves over about 10 update cycles. To do that, I'll use the logs I recorded from fasttree and measure the Robinson-Foulds distance between successive trees across the 10 rounds of ME-SPR moves in each update. I started logging the fasttree output in the 14-08-20 release, so I'll start with that one here too.

Below is some clunky R code to get the information I want. And here's the result.

![SPR iterations plot](https://github.com/roblanf/sarscov2phylo/blob/master/spr_iterations.png)

There are a LOT of limitations here. RF distances are confounded for many reasons here, but they're the only thing fast enough to practically calculate. Still, the results show clearly that the amount that the tree changes tails off a lot with each iteration. Almost half of all the changes come in the first iteration.

My strong suspicion is that most of the changes at all stages are small changes near the tips of the trees (see also the previous analyses of that issue on this repo). And combined with this (highly flawed but still informative) analysis, I think it's sensible to drop the number of SPR rounds to 2.



```
library(readr)
library(phangorn)
library(parallel)
library(ggplot2)
# here's a function to get the distance I want from a fasttree log file
get_rfdists = function(filename){
    d = read_lines(filename, skip = 7, n_max = 11)
    e = read_delim(d, delim="\t", col_names = FALSE)
    e$RF = NA
    e$RFn = NA
    rownum = 2:11
    for(i in rownum){
        t1 = read.tree(text=e$X2[i-1])
        t2 = read.tree(text=e$X2[i])
        e$RF[i] = RF.dist(t1, t2, normalize = FALSE)
        e$RFn[i] = RF.dist(t1, t2, normalize = TRUE)
    }
    e$name = basename(dirname(filename))
    r1 = data.frame(name = e$name[2:11], dist = e$RF[2:11], dist.type = "rf.dist")
    r1$spr.iteration = 1:10
    r2 = data.frame(name = e$name[2:11], dist = e$RFn[2:11], dist.type = "rf.dist.norm")
    r2$spr.iteration = 1:10
    r = rbind(r1, r2)
    return(r)
}
# here are the log files
files = c("global_14-08-20/fasttree.log",
          "global_16-8-20/fasttree.log",
          "global_18-08-20/fasttree.log",
          "global_20-08-20/fasttree.log",
          "global_22-08-20/fasttree.log",
          "global_24-08-20/fasttree.log",
          "global_26-08-20/fasttree.log",
          "global_28-08-20/fasttree.log",
          "global_30-08-20/fasttree.log",
          "global_30-08-20/fasttree.log",
          "global_02-09-20/fasttree.log",
          "global_04-09-20/fasttree.log",
          "global_06-09-20/fasttree.log",
          "global_08-09-20/fasttree.log",
          "global_10-09-20/fasttree.log")
# run the function and bind the results
r = mclapply(files, get_rfdists, mc.cores = 8)
results = do.call(rbind, r)
# plot it out
ggplot(results, aes(x=spr.iteration, y=dist, colour=name)) + geom_point() + geom_line() + facet_wrap(facets = c("dist.type"), ncol = 1, scales = "free_y")
```
Binary files (the GISAID acknowledgements PDFs listed above) not shown.
16 changes: 16 additions & 0 deletions ci/Dockerfile
@@ -0,0 +1,16 @@
FROM continuumio/miniconda3:4.8.2

RUN conda update -n base -c defaults conda
COPY environment.yml /tmp/environment.yml

ARG USER_ID

RUN useradd -m --uid $USER_ID --shell /bin/bash user
RUN mkdir -p /work/sarscov2phylo
RUN chown user /work/sarscov2phylo
USER user

RUN conda env create -n sarscov2phylo -f /tmp/environment.yml

COPY scripts /work/sarscov2phylo/scripts
WORKDIR /work/sarscov2phylo
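
For local use outside CI, the `USER_ID` build argument lets the in-container user match your host UID (handy if you bind-mount data into the container); a hedged example, with an arbitrary image tag, mirroring the test command from `.gitlab-ci.yml`:

```
docker build --build-arg USER_ID=$(id -u) -t sarscov2phylo:local -f ci/Dockerfile .
docker run --rm sarscov2phylo:local conda run -n sarscov2phylo esl-alimanip -h
```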
7 changes: 7 additions & 0 deletions environment.yml
@@ -19,6 +19,13 @@ dependencies:
- parallel
- pip=19.3.1
- python=3.6
- r=3.6
- r-ape
- r-bms
- r-base=3.6.3
- r-ggplot2
- r-phangorn
- r-readr
- rapidnj
- snakemake-minimal=5.13
- seqmagick
3,822 changes: 3,313 additions & 509 deletions excluded_sequences.tsv

Large diffs are not rendered by default.

170 changes: 170 additions & 0 deletions fasttreeOMP.md
@@ -0,0 +1,170 @@
# Benchmarking fasttree OMP and veryfasttree (again)

Question: how much faster is fasttree OMP (compiled on my machine) than the non-OMP version I get from conda, and how many threads are optimal? And how does it compare to veryfasttree?

I looked at veryfasttree before (see veryfasttree.md on this repo), but at that point my trees were smaller, and I was limited by memory and CPUs, since I was running full bootstraps. Now my trees are much bigger, and I'm no longer doing full bootstraps, so it seems worth revisiting that analysis.

# Methods

This one's simple. First I get fasttree OMP and veryfasttree and compile them:

```
wget http://www.microbesonline.org/fasttree/FastTree.c
gcc -DOPENMP -fopenmp -DUSE_DOUBLE -O3 -finline-functions -funroll-loops -Wall -o FastTreeMP FastTree.c -lm
git clone https://github.com/citiususc/veryfasttree
cd veryfasttree    # configure and build inside the cloned directory
cmake .
make
make install
```

Now I re-run the analyses from 14-09-20 and time them, setting different numbers of threads like this:

```
export OMP_NUM_THREADS=1
/usr/bin/time -o 1threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 1threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 1thread.tree
```

I'll just do this for 1, 2, 5, 10, 20, 50, and 100 threads like this:

```
export OMP_NUM_THREADS=2
/usr/bin/time -o 2threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 2threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 2thread.tree
export OMP_NUM_THREADS=5
/usr/bin/time -o 5threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 5threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 5thread.tree
export OMP_NUM_THREADS=10
/usr/bin/time -o 10threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 10threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 10thread.tree
export OMP_NUM_THREADS=20
/usr/bin/time -o 20threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 20threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 20thread.tree
export OMP_NUM_THREADS=50
/usr/bin/time -o 50threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 50threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 50thread.tree
export OMP_NUM_THREADS=100
/usr/bin/time -o 100threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 100threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 100thread.tree
```

I noticed during these runs that 5 threads seemed to be going quickest, so I thought I'd also fire up everything from 2-10, because if you're going to optimise, you may as well do it right.


```
export OMP_NUM_THREADS=3
/usr/bin/time -o 3threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 3threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 3thread.tree
export OMP_NUM_THREADS=4
/usr/bin/time -o 4threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 4threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 4thread.tree
export OMP_NUM_THREADS=6
/usr/bin/time -o 6threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 6threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 6thread.tree
export OMP_NUM_THREADS=7
/usr/bin/time -o 7threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 7threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 7thread.tree
export OMP_NUM_THREADS=8
/usr/bin/time -o 8threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 8threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 8thread.tree
export OMP_NUM_THREADS=9
/usr/bin/time -o 9threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log 9threads.log -intree iqtree_seqsadded_mp.treefile global.fa > 9thread.tree
```

Now I'll run the same kinds of analysis for veryfasttree. But before I do that I'll run some analyses to find out which of veryfasttree's tricks work best on my new server:

```
/usr/bin/time -o vf1.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf1.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext SSE3 -fastexp 0 global.fa > vf1.tree
/usr/bin/time -o vf2.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf2.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext SSE3 -fastexp 2 global.fa > vf2.tree
/usr/bin/time -o vf3.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf3.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext AVX -fastexp 0 global.fa > vf3.tree
/usr/bin/time -o vf4.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf4.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext AVX -fastexp 2 global.fa > vf4.tree
/usr/bin/time -o vf5.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf5.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext AVX2 -fastexp 0 global.fa > vf5.tree
/usr/bin/time -o vf6.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf6.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext AVX2 -fastexp 2 global.fa > vf6.tree
```

Here's what we get for the execution times:

| run | -ext | -fastexp | time (h:m:s) |
|-----|------|----------|--------------|
| vf1 | SSE3 | 0        | 9:01:27      |
| vf2 | SSE3 | 2        | 9:04:37      |
| vf3 | AVX  | 0        | 8:50:10      |
| vf4 | AVX  | 2        | 8:44:37      |
| vf5 | AVX2 | 0        | 8:46:24      |
| vf6 | AVX2 | 2        | 8:44:15      |

So the final settings (AVX2 with fastexp 2) win by a meaningless whisker. Now let's run the thread-scaling analyses.

```
/usr/bin/time -o vf1threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf1threads.log -intree iqtree_seqsadded_mp.treefile -threads 1 -double-precision -ext AVX2 -fastexp 2 global.fa > vf1threads.tree
/usr/bin/time -o vf2threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf2threads.log -intree iqtree_seqsadded_mp.treefile -threads 2 -double-precision -ext AVX2 -fastexp 2 global.fa > vf2threads.tree
/usr/bin/time -o vf3threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf3threads.log -intree iqtree_seqsadded_mp.treefile -threads 3 -double-precision -ext AVX2 -fastexp 2 global.fa > vf3threads.tree
/usr/bin/time -o vf4threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf4threads.log -intree iqtree_seqsadded_mp.treefile -threads 4 -double-precision -ext AVX2 -fastexp 2 global.fa > vf4threads.tree
/usr/bin/time -o vf5threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf5threads.log -intree iqtree_seqsadded_mp.treefile -threads 5 -double-precision -ext AVX2 -fastexp 2 global.fa > vf5threads.tree
/usr/bin/time -o vf6threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf6threads.log -intree iqtree_seqsadded_mp.treefile -threads 6 -double-precision -ext AVX2 -fastexp 2 global.fa > vf6threads.tree
/usr/bin/time -o vf7threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf7threads.log -intree iqtree_seqsadded_mp.treefile -threads 7 -double-precision -ext AVX2 -fastexp 2 global.fa > vf7threads.tree
/usr/bin/time -o vf8threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf8threads.log -intree iqtree_seqsadded_mp.treefile -threads 8 -double-precision -ext AVX2 -fastexp 2 global.fa > vf8threads.tree
/usr/bin/time -o vf9threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf9threads.log -intree iqtree_seqsadded_mp.treefile -threads 9 -double-precision -ext AVX2 -fastexp 2 global.fa > vf9threads.tree
/usr/bin/time -o vf10threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf10threads.log -intree iqtree_seqsadded_mp.treefile -threads 10 -double-precision -ext AVX2 -fastexp 2 global.fa > vf10threads.tree
/usr/bin/time -o vf20threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf20threads.log -intree iqtree_seqsadded_mp.treefile -threads 20 -double-precision -ext AVX2 -fastexp 2 global.fa > vf20threads.tree
/usr/bin/time -o vf50threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf50threads.log -intree iqtree_seqsadded_mp.treefile -threads 50 -double-precision -ext AVX2 -fastexp 2 global.fa > vf50threads.tree
/usr/bin/time -o vf100threads.txt -v VeryFastTree -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log vf100threads.log -intree iqtree_seqsadded_mp.treefile -threads 100 -double-precision -ext AVX2 -fastexp 2 global.fa > vf100threads.tree
```
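
To pull the wall-clock times out of all those `/usr/bin/time -v` outputs in one go, something like the following works (GNU time labels the field "Elapsed (wall clock) time"; adjust the globs to match the .txt names used above):

```
grep "Elapsed (wall clock)" vf*.txt *threads.txt
```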

## Results

TL;DR: veryfasttree is a lot slower than fasttree, and gives worse trees (!). FastTreeMP works fastest with 10 threads, which is a bit odd.

| wut | threads | time (h:m:s) | %CPU | efficiency (%) | lnL |
|--------------|---------|--------------|------|----------------|-------------|
| fasttree | 1 | 8:51:47 | 99 | 99.0 | -640122.657 |
| fasttree_omp | 1 | 9:06:28 | 99 | 99.0 | -640114.204 |
| fasttree_omp | 2 | 7:43:12 | 127 | 63.5 | -640114.204 |
| fasttree_omp | 3 | 6:52:53 | 144 | 48.0 | -640114.204 |
| fasttree_omp | 4 | 6:53:48 | 156 | 39.0 | -640114.204 |
| fasttree_omp | 5 | 6:55:20 | 168 | 33.6 | -640114.204 |
| fasttree_omp | 6 | 7:21:00 | 177 | 29.5 | -640114.204 |
| fasttree_omp | 7 | 7:14:02 | 190 | 27.1 | -640114.204 |
| fasttree_omp | 8 | 7:05:57 | 203 | 25.3 | -640114.204 |
| fasttree_omp | 9 | 7:20:56 | 211 | 23.4 | -640114.204 |
| fasttree_omp | 10 | 6:20:57 | 238 | 23.8 | -640114.204 |
| fasttree_omp | 20 | 7:10:58 | 347 | 17.3 | -640114.204 |
| fasttree_omp | 50 | 7:17:07 | 723 | 14.4 | -640114.204 |
| fasttree_omp | 100 | 7:15:18 | 1385 | 13.8 | -640114.204 |
| veryfasttree | 1 | 12:30:28 | 99 | 99.0 | -640671.012 |
| veryfasttree | 2 | 12:21:46 | 103 | 51.5 | -640671.012 |
| veryfasttree | 3 | 11:05:24 | 118 | 39.3 | -640670.422 |
| veryfasttree | 4 | 11:04:09 | 120 | 30.0 | -640669.088 |
| veryfasttree | 5 | 11:05:58 | 122 | 24.4 | -640669.088 |
| veryfasttree | 6 | 9:49:00 | 141 | 23.5 | -640669.088 |
| veryfasttree | 7 | 9:43:35 | 149 | 21.2 | -640669.089 |
| veryfasttree | 8 | 9:28:12 | 153 | 19.1 | -640669.088 |
| veryfasttree | 9 | 9:19:24 | 159 | 17.6 | -640669.088 |
| veryfasttree | 10 | 9:21:32 | 163 | 16.3 | -640668.681 |
| veryfasttree | 20 | 8:36:04 | 205 | 10.2 | -640655.452 |
| veryfasttree | 50 | 8:36:26 | 320 | 6.4 | -640626.579 |
| veryfasttree | 100 | 8:53:03 | 484 | 4.8 | -640626.579 |


The 10 threads thing doesn't make much sense. Up to 10 threads it all seems sensible - 3 threads is best, and it gets gradually worse the more you add (I assume because the cost of cross-talk outweighs the benefits of parallelisation). But then there's this jump at 10 threads.

To double check what might be happening here and to see what happens above 10 threads, I'm going to re-run 3, 8, 9, 10, 11, and 12 threads.

```
export OMP_NUM_THREADS=3
/usr/bin/time -o r3threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log r3threads.log -intree iqtree_seqsadded_mp.treefile global.fa > r3thread.tree
export OMP_NUM_THREADS=8
/usr/bin/time -o r8threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log r8threads.log -intree iqtree_seqsadded_mp.treefile global.fa > r8thread.tree
export OMP_NUM_THREADS=9
/usr/bin/time -o r9threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log r9threads.log -intree iqtree_seqsadded_mp.treefile global.fa > r9thread.tree
export OMP_NUM_THREADS=10
/usr/bin/time -o r10threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log r10threads.log -intree iqtree_seqsadded_mp.treefile global.fa > r10thread.tree
export OMP_NUM_THREADS=11
/usr/bin/time -o r11threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log r11threads.log -intree iqtree_seqsadded_mp.treefile global.fa > r11thread.tree
export OMP_NUM_THREADS=12
/usr/bin/time -o r12threads.txt -v ../FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log r12threads.log -intree iqtree_seqsadded_mp.treefile global.fa > r12thread.tree
```

Binary file added ftOMP.xlsx
Binary file not shown.
1 change: 0 additions & 1 deletion ft_FBP.tree

This file was deleted.

2 changes: 1 addition & 1 deletion ft_SH.tree

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion ft_TBE.tree

This file was deleted.

191 changes: 191 additions & 0 deletions iqtree_sequential.md
@@ -0,0 +1,191 @@
# Moving to online phylogenetics

As the data keep growing in size, it makes less and less sense to start each analysis as if the previous analysis never happened. The current process does exactly that, estimating every tree from scratch, even if the dataset only increases by a few percent in size.

The previous analyses have shown that MP is an excellent method here, so in principle we could use this to place new sequences on a tree, and make some global and local improvements to that tree.

To do this, I wrote the script `global_tree_gisaid_start_tree.sh`. This script is very similar to the previous script in that it does the alignment from scratch. But this time, instead of using fasttree to make an NJ tree, which is then optimised with various ME and ML moves, we start with the tree from the previous iteration of the pipeline, then *add* the new sequences to it (there are various cleaning steps involved here, but the addition itself is done with MP in IQ-TREE), and finally optimise that tree with fasttree. It took me some time to get the steps of that pipeline together, but there are a number of questions that remain about the best way to optimise the starting tree.

The requirements are that it has to be *fast enough* to put out a new tree every day (hopefully), and that it gives us some assurance that global and local issues in the tree could, at least in principle, have been ironed out.

So, it seems sensible to include some global (i.e. SPRs with large radii) and local (lots of ML NNI) moves when optimising the tree. But I want to compare these (of course) against just estimating the tree from scratch as I usually would. Here are seven analyses which will allow me to answer a few questions that I set out below.

These analyses all use the tree from 31-7-20 as the starting tree (where it's used) and GISAID data from 11-8-20 as the alignment. When using a starting tree, I take the following steps:

1. Remove sequences from the tree that are not in the alignment (this can happen because sequences are filtered)
2. Add to the tree sequences in the alignment that are not already in the tree (this uses IQ-TREE and MP, and took about 1hr to add about 3K sequences)

That's it. The total dataset here is about 52K sequences.
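
As a rough illustration of step 1, here is a minimal R sketch for pruning tips that are no longer in the alignment (file names are hypothetical; the pipeline's own logic lives in the scripts added in this release):

```
library(ape)

tree = read.tree("previous_release.tree")                     # tree from the previous iteration
aln_names = names(read.FASTA("global.fa"))                    # sequence names in the current alignment
pruned = drop.tip(tree, setdiff(tree$tip.label, aln_names))   # drop tips with no sequence in the alignment
write.tree(pruned, "start_tree_pruned.tree")
```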


```
# 2 sprs, + ML NNIs
/usr/bin/time -o 1.mem.txt -v fasttree -nt -gamma -nni 0 -spr 2 -sprlength 200 -intree iqtree_seqsadded_mp.treefile $outputfasta > 1.tree
# 0 sprs, + ML NNIs
/usr/bin/time -o 2.mem.txt -v fasttree -nt -gamma -nni 0 -spr 0 -sprlength 200 -intree iqtree_seqsadded_mp.treefile $outputfasta > 2.tree
# 2 sprs, + ML NNIs, 5 rate categories
/usr/bin/time -o 3.mem.txt -v fasttree -nt -gamma -nni 0 -spr 2 -cat 5 -sprlength 200 -intree iqtree_seqsadded_mp.treefile $outputfasta > 3.tree
# 2 sprs, + ML NNIs, 5 rate categories, no gamma
/usr/bin/time -o 4.mem.txt -v fasttree -nt -nni 0 -spr 2 -cat 5 -sprlength 200 -intree iqtree_seqsadded_mp.treefile $outputfasta > 4.tree
# 2 sprs, + ML NNIs, 5 rate categories, no gamma, boot 100
/usr/bin/time -o 5.mem.txt -v fasttree -nt -nni 0 -spr 2 -cat 5 -boot 100 -sprlength 200 -intree iqtree_seqsadded_mp.treefile $outputfasta > 5.tree
# 2 sprs, + ML NNIs, 5 rate categories, no gamma, no boot
/usr/bin/time -o 6.mem.txt -v fasttree -nt -nni 0 -spr 2 -cat 5 -nosupport -sprlength 200 -intree iqtree_seqsadded_mp.treefile $outputfasta > 6.tree
# no input tree, otherwise default except for increased SPR length
/usr/bin/time -o 7.mem.txt -v fasttree -nt -gamma -sprlength 200 $outputfasta > 7.tree
```

## Q1: What gives the best tree?

For this we just compare analysis 1 and analysis 7. Analysis 1 uses the previous iteration as a starting tree, where the new sequences were added by IQ-TREE.

Start tree lnL: -496120.668
No start tree lnL: -496332.143

Unsurprisingly, the analysis with the starting tree is better. It's not surprising because almost all the splits in this tree were *already* highly optimised. Adding a few thousand (about 6%) new sequences doesn't change many of the existing splits. But I add 2 global SPR moves (with ME) followed by fasttree's default number of ML NNIs (30 in this case). Of note, the analysis converged after just 6 rounds of ML NNIs, suggesting that the MP placements of the new sequences were already very close to their final ML placements. That's encouraging. In addition, this is about 1 week of new data. So doing this every day should be very robust (and probably a little quicker).

Of note the timings were also very different:

Start tree time: 10.17 hrs
No start tree time: 14.35 hrs

## Q2: How much time does each round of SPRs add?

Comparing analysis 1 (10.17 hrs) and analysis 2 (6.08 hrs), it's clear that each round of SPRs adds about 2 hrs to the analysis. That's about 30% of the analysis time without SPRs, so it's very significant. We should check if it's worth it...

## Q3: Are the SPR rounds worth it?

One round of global SPR moves is certainly worth it. If we don't do that we'll have zero rounds of SPR moves, and that means that any serious issues in the tree (introduced for whatever reason) will be very hard if not impossible to fix. NNI moves just don't cut it in this scenario. But it's still useful to know how much the 2 rounds of SPR moves improved the final log likelihood (also bearing in mind that the 2 SPR moves are done in an ME framework, so could in principle make things worse if ME optima conflict with ML optima)...

2 SPRs lnL: -496120.668
0 SPRs lnL: -496239.083

Surprisingly (to me at least) the two rounds of SPR moves really seemed to help. Seems worth investigating whether 1, 2, or 3 rounds of SPR moves is best in the next analysis.

## Q4: How much time can we save?

Analyses 2-6 are all about investigating ways of saving time. By comparing them to each other we can get an idea of how much time can be saved by changing various settings.

First, comparing 1 and 3 tells us if we save time or change the likelihood by moving from 20 to 5 rate categories:

cat 20 lnL: -496120.668
cat 5 lnL: -498323.359

Oof. That made the lnL a LOT worse. In addition, the cat 5 analysis (number 3) still took 9.8 hrs, so it is a very marginal time saving. Not worth it.

Second, comparing 3 and 4 tells us if removing the -gamma flag speeds things up (it makes no difference to the tree, it just makes the log likelihoods comparable and scales the branch lengths a little, neither of which are all that important). Analysis 3 took 9.8 hrs, analysis 4 took 9.4 hrs, so it's about a 4% time saving to remove the -gamma flag. Not nothing, but not much.

Third, comparing 4 and 5 tells us if reducing the number of bootstraps from 1000 to 100 saves time. Analysis 4 took 9.4 hrs, and analysis 5 took 7.6 hrs. That's almost a 20% time saving. Seems worth it, particularly because 100 SH bootstraps is probably enough, and these numbers are very very imprecise anyway.

Finally, comparing 5 and 6 tells us how much more we can save by removing bootstraps all together. Analysis 5 took 7.6 hrs, and analysis 6 took 7.2 hrs. So it's not a huge saving, and *some* measure of uncertainty in the splits seems worth having for a little bit of time.

## Conclusions so far

The best settings are 20 rate categories, we can take or leave the `-gamma` flag, and we should use 100 bootstraps.

## Some more investigation

The remaining question seems to be how many rounds of ME SPRs is best. So let's run a few additional analyses for that. Let's compare the likelihood and execution time for 0-4 rounds of SPRs, and for good measure let's try three spr lengths - 100, 200, 500.

```
/usr/bin/time -o 11.mem.txt -v fasttree -nt -gamma -nni 0 -spr 0 -sprlength 100 -log 11.log -intree iqtree_seqsadded_mp.treefile global.fa > 11.tree
/usr/bin/time -o 12.mem.txt -v fasttree -nt -gamma -nni 0 -spr 0 -sprlength 200 -log 12.log -intree iqtree_seqsadded_mp.treefile global.fa > 12.tree
/usr/bin/time -o 13.mem.txt -v fasttree -nt -gamma -nni 0 -spr 0 -sprlength 500 -log 13.log -intree iqtree_seqsadded_mp.treefile global.fa > 13.tree
/usr/bin/time -o 21.mem.txt -v fasttree -nt -gamma -nni 0 -spr 1 -sprlength 100 -log 21.log -intree iqtree_seqsadded_mp.treefile global.fa > 21.tree
/usr/bin/time -o 22.mem.txt -v fasttree -nt -gamma -nni 0 -spr 1 -sprlength 200 -log 22.log -intree iqtree_seqsadded_mp.treefile global.fa > 22.tree
/usr/bin/time -o 23.mem.txt -v fasttree -nt -gamma -nni 0 -spr 1 -sprlength 500 -log 23.log -intree iqtree_seqsadded_mp.treefile global.fa > 23.tree
/usr/bin/time -o 31.mem.txt -v fasttree -nt -gamma -nni 0 -spr 2 -sprlength 100 -log 31.log -intree iqtree_seqsadded_mp.treefile global.fa > 31.tree
/usr/bin/time -o 32.mem.txt -v fasttree -nt -gamma -nni 0 -spr 2 -sprlength 200 -log 32.log -intree iqtree_seqsadded_mp.treefile global.fa > 32.tree
/usr/bin/time -o 33.mem.txt -v fasttree -nt -gamma -nni 0 -spr 2 -sprlength 500 -log 33.log -intree iqtree_seqsadded_mp.treefile global.fa > 33.tree
/usr/bin/time -o 41.mem.txt -v fasttree -nt -gamma -nni 0 -spr 3 -sprlength 100 -log 41.log -intree iqtree_seqsadded_mp.treefile global.fa > 41.tree
/usr/bin/time -o 42.mem.txt -v fasttree -nt -gamma -nni 0 -spr 3 -sprlength 200 -log 42.log -intree iqtree_seqsadded_mp.treefile global.fa > 42.tree
/usr/bin/time -o 43.mem.txt -v fasttree -nt -gamma -nni 0 -spr 3 -sprlength 500 -log 43.log -intree iqtree_seqsadded_mp.treefile global.fa > 43.tree
/usr/bin/time -o 51.mem.txt -v fasttree -nt -gamma -nni 0 -spr 4 -sprlength 100 -log 51.log -intree iqtree_seqsadded_mp.treefile global.fa > 51.tree
/usr/bin/time -o 52.mem.txt -v fasttree -nt -gamma -nni 0 -spr 4 -sprlength 200 -log 52.log -intree iqtree_seqsadded_mp.treefile global.fa > 52.tree
/usr/bin/time -o 53.mem.txt -v fasttree -nt -gamma -nni 0 -spr 4 -sprlength 500 -log 53.log -intree iqtree_seqsadded_mp.treefile global.fa > 53.tree
```

I'll save those in a text file called spr.txt and run them like this:

```
parallel -j 15 -k --bar :::: spr.txt
```

| ID | SPRmoves | SPRlen | time | lnL | delta hrs | delta lnL |
|----|----------|--------|----------|-------------|-----------|-----------|
| 11 | 0 | 100 | 25613.75 | -496239.08 | 0.02 | 0.00 |
| 12 | 0 | 200 | 25551.77 | -496239.08 | 0.01 | 0.00 |
| 13 | 0 | 500 | 25529.44 | -496239.08 | 0.00 | 0.00 |
| 21 | 1 | 100 | 35276.59 | -496098.941 | 2.71 | 140.14 |
| 22 | 1 | 200 | 35207.22 | -496098.941 | 2.69 | 140.14 |
| 23 | 1 | 500 | 35239.38 | -496098.941 | 2.70 | 140.14 |
| 31 | 2 | 100 | 38495.2 | -496120.668 | 3.60 | 118.41 |
| 32 | 2 | 200 | 38688.24 | -496120.668 | 3.66 | 118.41 |
| 33 | 2 | 500 | 38522.25 | -496120.668 | 3.61 | 118.41 |
| 41 | 3 | 100 | 40822.96 | -496063.816 | 4.25 | 175.27 |
| 42 | 3 | 200 | 41290.1 | -496063.816 | 4.38 | 175.27 |
| 43 | 3 | 500 | 41456.04 | -496063.816 | 4.42 | 175.27 |
| 51 | 4 | 100 | 43605.31 | -495982.737 | 5.02 | 256.35 |
| 52 | 4 | 200 | 43502.2 | -495982.737 | 4.99 | 256.35 |
| 53 | 4 | 500 | 43689.79 | -495982.737 | 5.04 | 256.35 |

The best bang for our buck is to do just one round of SPRs (140 likelihood units per round, vs. e.g. 64 per round with 4 rounds). We get better likelihoods with more SPRs (in general) but the time penalty starts to get large. At 4 rounds of SPRs we've added 5 hrs to our analysis (meaning we're pushing our luck to get out an update every day, particularly as the dataset grows a lot in size).

Of note, I should have realised that an SPR radius of 100 encompasses the entire tree, so there is no difference in likelihood between any of the different SPR lengths. Still, I'll keep it at a very large number (1000) in future, to make sure that the SPR phase allows for truly global moves even as the tree grows a lot.

So, to take a starting tree of ~50K SARS-CoV-2 sequences, add ~3K new sequences to it, and then optimise it in fasttree, the best commandline for that seems to be:

```
fasttree -nt -gamma -nni 0 -spr 1 -sprlength 1000 -boot 100 -intree iqtree_seqsadded_mp.treefile $outputfasta > 1.tree
```

Reasoning: I leave the gamma on there purely so that as things progress it's easy to compare the lnL of different runs; reduce bootstraps to 100, because the SH measures of support are potentially useful (though I'm not all that sure how to interpret them yet) but the additional precision of doing 1000 bootstraps is not justified; and use 1 SPR move of length 1000 because that seems to be a good balance between likelihood and execution time. Though it's also tempting to consider having 4 SPR moves, because for a few additional hours it seems like we can get a higher likelihood. However, when this analysis runs every day (rather than, as in this case, on 10 days' worth of new sequences), I think we have a solution.


There is one more thing I want to try, and that's whether we can make good improvements by running the 1 SPR analysis on the starting tree twice. The reason for this - it's often useful to interleave two algorithms (like SPR and NNI moves on trees) when optimising. So it's feasible that running an analysis with 1 SPR move followed by NNI moves twice, does a better job than any of the above, and could (let's see) give a nice balance of time and accuracy.

To do this, we just want to compare this:

```
/usr/bin/time -o 91.mem.txt -v fasttree -nt -gamma -nni 0 -spr 1 -sprlength 1000 -boot 100 -intree iqtree_seqsadded_mp.treefile $outputfasta > 91.tree
```

to this:

```
/usr/bin/time -o 92.mem.txt -v fasttree -nt -nni 0 -spr 1 -sprlength 1000 -nosupport -intree iqtree_seqsadded_mp.treefile $outputfasta > 92.tree
/usr/bin/time -o 93.mem.txt -v fasttree -nt -gamma -nni 0 -spr 1 -sprlength 1000 -boot 100 -intree 92.tree $outputfasta > 93.tree
```

Note that in the second version, we can skip the gamma optimisation of branch lengths and the bootstraps, so it will less than double the execution time. And in the second step of the second version, we use the tree from the first step as the starting tree.

Turns out this didn't really help much:

| Iterations | time | lnL |
|------------|----------|-------------|
| 1 | 22914.02 | -496098.941 |
| 2 | 39068.08 | -496038.076 |


I almost doubled the execution time, for a marginal gain in likelihood.

## Final conclusion

This is the best solution so far. Iterated once a day with the new sequences from that day.

```
/usr/bin/time -o 91.mem.txt -v fasttree -nt -gamma -nni 0 -spr 1 -sprlength 1000 -boot 100 -intree iqtree_seqsadded_mp.treefile $outputfasta > 91.tree
```

P.S. In the end I decided to see what would happen if I just cranked it to do 10 SPRs, and the results were surprising. The lnL was -494195.750, which is a sizeable improvement. And the time was 11hrs. It seems worth keeping this for now, even if the 10 rounds of SPRs are not always worth it.



193 changes: 193 additions & 0 deletions iqtree_vs_fasttree.md
@@ -0,0 +1,193 @@
# IQ-TREE or fasttree?

In the first benchmark of methods for SARS-CoV-2 phylogenetic analyses, fasttree was the clear winner, and I've been using it successfully for months. But IQ-TREE is certainly a more flexible piece of software, and what's more, we develop it. As luck would have it, we have funding from the Chan-Zuckerberg Initiative, and we used it to hire an *excellent* programmer, James Barbetti. The first thing James has been looking at is removing some of the bottlenecks that IQ-TREE had when analysing huge datasets.

The result so far is that we have a version of IQ-TREE that does lightning-fast neighbour joining trees. This is done using all sorts of methods, some of which come straight from the excellent work that went into developing the rapid-NJ algorithm used in `rapidnj`. Regardless, now's a good time to revisit whether we can get better trees from fasttree or IQ-TREE. Of note, I've spent months optimising the search in fasttree (see other .md files on this repo), to the point where I don't think I can make much in the way of additional gains.

So I'm not expecting IQ-TREE to beat fasttree first time. It will likely take a lot of tweaks. It may not be possible at all, because fasttree still does things IQ-TREE can't, like SPR moves, which at least in fasttree seem to be fairly crucial to getting good and accurate trees from SARS-CoV-2. But I still want to check.

The other reason for this is that IQ-TREE allows us to apply a much much wider range of models of sequence evolution, which I know from previous analyses fit the data way better than the JC+G or GTR+G models available in fasttree. In addition, we are developing a range of new models in IQ-TREE, specifically targeted at dealing with the various issues we see in SARS-CoV-2 sequencing data (see posts on virological that highlight these).

## The plan

I'll start by figuring out a good model of sequence evolution in IQ-TREE. Based on previous analyses, the best I've found is `JC+I+R3`. But this was on an old alignment with different filtering, so it needs to be revisited. And I didn't even check the JC component of this.

A few open questions:

1. What's the best rate matrix framework to use (e.g. JC, HKY, GTR, UNREST)?
2. What's the best distribution of rates across sites (e.g. +I+R2, +I+R4 etc)?
3. Do either 1 or 2 make much difference to the tree topology? (this is important - it's much much harder to optimise the complex models, so it may be better to spend a long time optimising the tree on a simple model like JC first, then do a few last rounds of optimisation with a complex model)
4. What tree search settings are best in IQ-TREE?
5. Does the best tree from IQ-TREE beat the best tree from fasttree (when both are analysed in IQ-TREE under whatever the best model seems to be)?


These questions suggest a simple approach:

1. Find the best rate distribution
2. Find the best rate matrix
3. Optimise in IQ-TREE with this model, and with plain JC model
4. Get the lnL of the best fasttree tree under this model
5. Compare fasttree to the IQ-TREE trees
6. Work on optimising IQ-TREE

Doing things in this order makes sense. The finer details of the tree topology tend not to make much difference for the model selection steps. And getting the rate distribution right tends to be more important than getting the rate matrix right. I'm not going to worry about running time here, because ultimately I can fix all the model parameters in IQ-TREE; I know they don't change much as the size of the dataset increases, so in practice my analyses should be much faster than the analyses here. In other words, at this stage I care about lnL values much more than I care about run time.

Of note, IQ-TREE uses a lot more memory than fasttree. So I'm very limited in how many options I can explore here. Typically I can just look at 4 options at a time. With that plan laid out, here goes.

I'll do all of the analyses on the latest global alignment from this analysis: https://github.com/roblanf/sarscov2phylo/releases/tag/11-7-20

Also of note, I am using the bleeding-edge IQ-TREE release with all sorts of undocumented options, which is version 2.0.8. For all analyses I'll set the minimum branch length to a very small number.

## 1. Find the best rate distribution

I know from previous analyses that the +G and +I+G models are very unlikely to be the best. So I'll just compare four models here that I think should capture a good range.

```
/usr/bin/time -o 1.1.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.1
/usr/bin/time -o 1.2.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R3 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.2
/usr/bin/time -o 1.3.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R4 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.3
/usr/bin/time -o 1.4.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R5 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.4
```

##### Results

Don't pay too much heed to execution time. We're working on speeding up bottlenecks.

JC
lnL: -407738.7183
BIC: 1464697.832
time: 6h:53m:34s

JC+I+R3
lnL: -388587.635
BIC: 1426447.16
time: 53h:37m:6s

JC+I+R4
lnL: -388425.334
BIC: 1426143.156
time: 105h:52m:55s

JC+I+R5
lnL: ? (it was converging on something a little below -388319.099658)
BIC: ?
time: >200h (!)

Although the final run didn't complete (I had to kill it to re-estimate the global tree, which was a week out of date...), it did reach a lnL of -388319.099658, which equates to a BIC that is also substantially better than the JC+I+R4 model.
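
As a sanity check on those BIC values: they follow the usual formula, BIC = k*ln(n) - 2*lnL, where n is the number of alignment columns and k is the number of free parameters (all 2*ntaxa - 3 branch lengths plus the free model parameters). A minimal sketch of the calculation, with illustrative placeholder values for k and n rather than the exact counts IQ-TREE used:

```
# BIC = k*ln(n) - 2*lnL; the k and n below are illustrative placeholders only
awk -v lnL=-388425.334 -v k=63000 -v n=30000 \
    'BEGIN { printf "BIC ~ %.1f\n", k * log(n) - 2 * lnL }'
```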

It's worth comparing the model parameters estimated by IQ-TREE (the +R5 model parameters never completed their final optimisation, but even before that the model would have had the best BIC score), just to see what's going on.

```
JC+I+R3: Proportion of invariable sites: 0.389; Site proportion and rates: (0.529,0.808) (0.074,5.739) (0.009,16.933)
JC+I+R4: Proportion of invariable sites: 0.335; Site proportion and rates: (0.558,0.659) (0.083,4.126) (0.023,10.607) (0.002,27.445)
JC+I+R5: Proportion of invariable sites: 0.305; Site proportion and rates: (0.366,0.494) (0.220,0.840) (0.084,4.024) (0.024,10.400) (0.002,27.193)
```

The rate categories show that roughly 30% of sites are modelled as invariant. Then about 37% of sites have a rate around 0.5, 22% around 0.8, 8% around 4, 2% around 10, and 0.2% around 27.

That last category represents around 60 sites that seem to have exceptionally fast rates relative to all the other sites. Modelling those sites properly (i.e. with a free rate model) could be important for correctly placing sequences in the tree.

Given that the I+R5 model gave the best BIC score, it seems prudent to use this model, but to fix the model parameters to those estimated here. They won't be the ML parameters (they never quite finished the final optimisation, and fixing them means they won't be updated when the tree changes); however, it's a safe bet from the BIC scores that they are better than those of the other models I tried, and a *lot* better than the JC model with no rate variation. This could end up being important for optimising the tree, since I currently use fasttree, which doesn't account for rate variation when optimising the topology (it adds it at the end, when optimising branch lengths).

Final note: the memory usage for the +I+R5 model is HUGE - 175GB at the maximum (which is where it sat most of the time)!

Postscript: it turns out IQ-TREE already has a potentially much faster algorithm for the model optimisation implemented. So let's compare it on the +R3 and +R5 models (the former so I can compare likelihoods, the latter to see if it's quick enough to be usable). The standard algorithm tends to perform better on most datasets, but it's slower; usually that's not an issue. The standard algorithm updates the rates one at a time, jiggling each rate parameter around to improve it while holding the others constant. The alternative BFGS algorithm tries out changes to all rate parameters at once. Let's see what happens...


```
/usr/bin/time -o 1.5.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R3 -fast -nt 4 -optalg 1-BFGS -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.5
/usr/bin/time -o 1.6.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R5 -fast -nt 4 -optalg 1-BFGS -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.6
```

OK, that was a surprise. My expectation was that the BFGS algorithm would be quicker but get substantially worse likelihoods. The truth was the opposite. I had to kill the runs because they took too long (>> 1 day, so I don't know the exact speed comparison; BFGS is certainly not more than twice as fast, and from the stage it had reached I'd estimate it's about the same speed). Speed aside, the likelihoods are a lot better, even with just the model parameters optimised on the initial JC NJ-R tree.

JC+I+R3: -375974.923841
JC+I+R5: -375490.305847

The difference between R3 and R5 is large - plenty to justify the two extra rate categories, I'd argue. The main take-home is to compare these likelihoods to those above. They are A LOT better. More than 10K log-likelihood units. For the SAME model. That's huge.

The model parameters:

JC+I+R3: Proportion of invariable sites: 0.616; Site proportion and rates: (0.426,0.269) (0.415,0.625) (0.159,3.936)
JC+I+R5: Proportion of invariable sites: 0.616; Site proportion and rates: (0.210,0.334) (0.219,0.342) (0.284,0.344) (0.202,1.456) (0.085,5.456)


These are very different from those estimated with the standard algorithm, and a lot better too (the likelihoods tell us that).

Then I realised I forgot one really important model: the plain old +I model. Not sure quite how I forgot this, but let's see how it goes. We don't need BFGS for this - it won't help here.


```
/usr/bin/time -o 1.8.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 1.8
```

JC+I: -396992.024799

The likelihood is a lot worse. But it did run quickly (about 5 hours). This may be worth considering in future.

So, let's continue with the +I+R5 model.


## 2. Find the best rate matrix

We can fix the rate parameters to the ones estimated in step 1 using:

`+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}`


I'll re-run the JC model too, to get an idea of how much time and memory we save by fixing the rate parameters rather than estimating them.

```
/usr/bin/time -o 2.1.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m "JC+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}" -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 2.1
/usr/bin/time -o 2.2.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m "HKY+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}" -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 2.2
/usr/bin/time -o 2.3.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m "GTR+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}" -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 2.3
/usr/bin/time -o 2.4.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m "12.12+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}" -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 2.4
```


lnLs (after first round of model optimisation on NJ-R ML distance tree)

JC+I+R5: -392823.153805
HKY+I+R5: -380249.398257
GTR+I+R5: -373470.588371
12.12+I+R5: didn't optimise - various numerical issues

So the GTR model is by far and away the best model. The parameters are:

Rate parameters: A-C: 0.14012 A-G: 0.66091 A-T: 0.12269 C-G: 0.09998 C-T: 2.61404 G-T: 1.00000
Base frequencies: A: 0.299 C: 0.184 G: 0.196 T: 0.321
Proportion of invariable sites: 0.616
Site proportion and rates: (0.077,0.280) (0.077,0.938) (0.077,1.823) (0.077,3.185) (0.077,6.795)


## 3. Optimise in IQ-TREE

Here I want to optimise the tree with the two different models.

The `-fast` flag does this: `-ninit 2 -n 2 -me 0.05`.

But I want to allow for more rounds of NNI than this, so I'll allow for 5 rounds and hard-code the other options.

I'll also set -me to 1.0, because I'm not really worried about optimising branch lengths at this point (and the rest of the model parameters are fixed).

```
/usr/bin/time -o 3.1.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m "JC+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}" -ninit 2 -n 5 -me 1.0 -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 3.1
/usr/bin/time -o 3.2.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m "GTR{0.14012,0.66091,0.12269,0.09998,2.61404}+F{0.299,0.184,0.196,0.321}+I{0.616}+R5{0.210,0.334,0.219,0.342,0.284,0.344,0.202,1.456,0.085,5.456}" -ninit 2 -n 5 -me 1.0 -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 3.2
```



15 changes: 15 additions & 0 deletions onlinephylo.md
@@ -0,0 +1,15 @@
# Comparing online to de-novo phylogenetics

Since 14-08-20 I have moved to online phylogenetics, and I've been logging the fasttree output as I go. The question I want to address here is how online phylogenetics (where I add new sequences every two days with MP, then update the resulting tree with ML) compares to standard phylogenetics (where the tree is re-estimated from scratch every time), in terms of the final likelihoods.
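
For context, the online update in each iteration boils down to something like the following (a condensed sketch of the relevant commands from `global_tree_gisaid_start_tree.sh`, added in this same update; the file names are the ones used there):

```
# place the new sequences on the previous iteration's tree with parsimony (1 thread is optimal here)
iqtree2 -seed 1729 -s global.fa -g input_tree_cleaned.tree -n 0 -m JC -fixbr -nt 1 -pre iqtree_seqsadded_mp
# then refine that tree with ML SPRs and branch length optimisation in fasttree
FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -intree iqtree_seqsadded_mp.treefile global.fa > ft_SH.tree
```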

To answer that, all I need to do is re-run every analysis since 14-08-20 with standard phylogenetics. I'll use the following almost default setup in `fasttree` to do that:

```
fasttree -nt -gamma -boot 100 -sprlength 1000 -log fasttree.log global.fa > global.tree
```

This means I can compare execution times (hence the 100 bootstraps to get the SH supports, matching what I do in the online phylogenetics) and log-likelihoods (hence the -gamma) between online and de novo phylogenetics (nobody calls it "de novo phylogenetics", but it makes sense here to distinguish it from online).
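
To pull the final likelihoods out for the comparison, something like this should do (a sketch only: the per-iteration directory layout is an assumption, and the exact label of the Gamma log-likelihood line in the fasttree log may differ):

```
# grab the last Gamma-related line (which carries the Gamma-based log-likelihood)
# from each iteration's fasttree log
for d in iteration_*/; do
    printf '%s\t' "$d"
    grep -i "gamma" "$d/fasttree.log" | tail -n 1
done
```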

# Results


Binary file added onlinephylo.xlsx
144 changes: 144 additions & 0 deletions scripts/QC.R
@@ -0,0 +1,144 @@
#!/usr/bin/env Rscript

# usage: Rscript old_tree new_tree threads

args = commandArgs(trailingOnly=TRUE)


# a little QC for the trees
library(ape)
library(phangorn)
library(parallel)
library(ggplot2)

random_nni_distance = function(n, tree){
    # get the distance between tree1 and itself when
    # the second copy has undergone n NNI moves
    tree2 = rNNI(tree, moves=n)
    return(list(n=n, dist=wRF.dist(tree, tree2)))
}

old = read.tree(args[1])
new = read.tree(args[2])
threads = as.numeric(args[3]) # make sure this is numeric for mc.cores

# first we get trees with only the tips they have in common
old_d = drop.tip(old, setdiff(old$tip.label, new$tip.label))
new_d = drop.tip(new, setdiff(new$tip.label, old$tip.label))


obs = wRF.dist(old_d, new_d)

# now we compare this to the same distance calculated at ever increasing numbers of NNI moves
# to get a rough idea of how many random NNI moves this is equivalent to
nnis = seq(from=0, to=10000, length.out=101)

# turn off warnings for this, then turn them back on:
options(warn=-1)
random_nni_dists = mclapply(nnis, random_nni_distance, tree=old_d, mc.cores=threads)
options(warn=0)

# now let's plot the results so we can see what's up
dists = as.data.frame(do.call(rbind, random_nni_dists))
dists$dist = as.numeric(as.character(dists$dist))
dists$n = as.numeric(as.character(dists$n))


thresh = dists$n[max(which(dists$dist<obs))]

p1 = ggplot(dists, aes(x=n, y = dist)) +
    geom_point() +
    geom_smooth() +
    geom_hline(yintercept = obs, size=1, colour='red', alpha = 0.5, linetype='dashed') +
    ggtitle("Comparison of current tree to tree from previous iteration",
            subtitle = paste(paste("Observed distance: ", obs, "(red line)"), paste("Equivalent to ~", thresh, ' random NNIs'), sep="\n"))


########
# phylo signal of lineages
# we do this by
# 1. getting the observed tree length of each lineage
# 2. doing the same for 100 random sets of the same number of tips of each lineage
d = read.csv("metadata.csv")

e = subset(d, subset = d$covv_accession_id %in% new$tip.label)
f = e[match(new$tip.label, e$covv_accession_id),]
f$covv_lineage=as.factor(f$covv_lineage)


random_treelen = function(tree, n){
    # choose n random tips and get the tree length
    t = keep.tip(tree, sample(tree$tip.label, n))
    return(sum(t$edge.length))
}

get_signal = function(trait, tree, data){
    # tree is a tree
    # data is a vector IN THE SAME ORDER as the tree tip labels
    # trait is the specific trait in the 'data' vector you want to measure

    obs_t = drop.tip(tree, which(data!=trait))
    n = length(obs_t$tip.label)

    obs_l = sum(obs_t$edge.length)

    rep_l = sort(replicate(99, random_treelen(tree, n)))

    rep_lt = summary(rep_l)

    # one-sided p value from the rank of the observed tree length:
    # small when the lineage's tree length is shorter than expected for
    # a random set of the same number of tips (i.e. the lineage clusters)
    p_value = (sum(obs_l>rep_l)+1)/(length(rep_l)+1)

    return(list(trait=as.character(trait),
                obs = as.numeric(obs_l),
                p = as.numeric(p_value),
                n_tips = as.numeric(n),
                rep_Min = as.numeric(rep_lt[1]),
                rep_1stQu = as.numeric(rep_lt[2]),
                rep_Median = as.numeric(rep_lt[3]),
                rep_Mean = as.numeric(rep_lt[4]),
                rep_3rdQu = as.numeric(rep_lt[5]),
                rep_Max = as.numeric(rep_lt[6])))
}

clustering = mclapply(levels(f$covv_lineage), get_signal, tree=new, data=f$covv_lineage, mc.cores=threads)
clustering = as.data.frame(do.call(rbind, clustering))
clustering$trait = as.character(clustering$trait)

num = 2:ncol(clustering)
clustering[num] <- sapply(clustering[num],as.numeric)

# only keep the ones where we can reasonably estimate clustering
# so 5 or more members of a lineage
clustering = subset(clustering, n_tips>4)

#### Write out a log file ####
dir.create("./QC")
write.csv(clustering, "./QC/lineage_clustering.csv")
write.csv(dists, "./QC/topological_distances_randomNNI.csv")

pdf("./QC/tree_comparison_plot.pdf")
print(p1)
dev.off()


sink("./QC/QC_report.md", append=TRUE)

cat("# QC Report for the phylogenetic tree\n")
cat("\n")
cat("## Comparison to previous iteration's tree\n")
cat(paste("wRF distance between previous and new iteration's trees:", obs, "\n"))
cat(paste("This is equivalent to a number of random NNIs equal to roughly:", thresh, "\n"))
cat("\n")

cat("## Clustering of Pangolin lineages\n")
cat(paste(sum(clustering$p<0.05), "out of", nrow(clustering), "Pangolin lineages show significant clustering\n"))
cat("\n")
cat("#### Summary of p values\n")

print(summary(clustering$p))
sink()
26 changes: 26 additions & 0 deletions scripts/add_seqs_to_global_tree.sh
@@ -0,0 +1,26 @@
# new approach

# the script should be almost the same as the global phylogeny script

# with the sole difference that a starting tree is given

# this tree should contain at least some of the same sequence names as given in the alignment

# the process here is fairly simple (I think...)

# 1. Use R to trim the input tree down to only sequences found in the final aligned fasta file (global.fa)

# get names from alignment
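# e.g. as in global_tree_gisaid_start_tree.sh (assuming the alignment is global.fa):
grep ">" global.fa | cut -c 2- > alignment_names.txt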


# 2. Use IQ-TREE with the tree from (1) as a starting tree, using MP to place the new sequences

# 3. Use the tree from 2 to input as a starting tree to fasttree


# Of course, this makes bootstraps redundant. So there are two options here:

# 1. Just use fasttree SH supports (seems fine for now)
# 2. Try IQ-TREE aLRT supports (issue: uses a lot of memory, possibly slow? let's see)


40 changes: 40 additions & 0 deletions scripts/add_seqs_to_tree.sh
@@ -0,0 +1,40 @@
#!/bin/bash

helpFunction()
{
echo "Input a tree and an alignment - remove sequences from the tree that are not in the alignment, then add sequences to the tree that are in the alignment but not the tree"
echo "Usage: $0 -i fasta_file_of_SARS-CoV-2_seqs -o output_file -t threads"
echo "\t-i Full path to input alignment of SARS-CoV-2 sequences"
echo "\t-s Full path to input starting tree of SARS-CoV-2 sequences"
echo "\t-o Full path to output tree, which will contain all sequences in the alignment"
echo "\t-t number of threads to use"
exit 1 # Exit script after printing help
}

while getopts "i:s:o:t:" opt
do
case "$opt" in
i ) inputfasta="$OPTARG" ;;
i ) inputtree="$OPTARG" ;;
o ) outputtree="$OPTARG" ;;
t ) threads="$OPTARG" ;;
? ) helpFunction ;; # Print helpFunction in case parameter is non-existent
esac
done

# Print helpFunction in case parameters are empty
if [ -z "$inputfasta" ] || [ -z "$inputtree" ] || [ -z "$outputtree" ] || [ -z "$threads" ]
then
echo "Some or all of the parameters are empty";
helpFunction
fi


# 1. Get the sequence names from the alignment

grep ">" $inputfasta | cut -c 2- > alignment_names.txt

# 2. Remove sequences from the tree that are not in the alignment
# this can happen e.g. if a sequence was removed from GISAID, or excluded because it was on a long branch
# or if the filtering conditions changed
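
# a minimal sketch of this step (untested here), reusing clean_tree.R from this repo,
# which writes input_tree_cleaned.tree and alignment_names_new.txt
DIR="$(cd "$(dirname "$0")" && pwd)"
Rscript $DIR/clean_tree.R $inputtree alignment_names.txt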

4 changes: 2 additions & 2 deletions scripts/clean_gisaid.sh
@@ -43,9 +43,9 @@ sed -i.bak '/^$/d' $input_seqs


# next we remove sequences in the excluded_sequences.tsv file
echo "Removing sequences in exluded_sequence.tsv"
echo "Removing sequences in excluded_sequence.tsv"
BASEDIR=$(dirname "$0")
exseq=$BASEDIR/../excluded_sequences.tsv
exseq=previous_iteration_files/excluded_sequences.tsv
cut -f1 $exseq | faSomeRecords $input_seqs /dev/stdin $input_seqs"tmp.fa" -exclude
rm $input_seqs
mv $input_seqs"tmp.fa" $input_seqs
28 changes: 28 additions & 0 deletions scripts/clean_tree.R
@@ -0,0 +1,28 @@
#!/usr/bin/env Rscript

# usage: Rscript clean_tree.R input_tree names_of_species_in_alignment.txt
# the names file should be one name per line

args = commandArgs(trailingOnly=TRUE)

library(ape)
library(readr)

print("Input tree")
print(args[1])
print("input names")
print(args[2])


t = read.tree(args[1])
a = readLines(args[2])

drop = t$tip.label[which(t$tip.label %in% a == FALSE)]

new = a[which(a %in% t$tip.label == FALSE)]

write_lines(new, "alignment_names_new.txt")

t2 = drop.tip(t, drop)

write.tree(t2, "input_tree_cleaned.tree")
2 changes: 1 addition & 1 deletion scripts/global_profile_alignment.sh
@@ -3,7 +3,7 @@
helpFunction()
{
echo "Align a set of sequences in parallel to WH1 seuqence NC_045512.2"
echo "Usage: $0 -i unaligned_fasta_path -o output_file -r reference_alignment -t threads"
echo "Usage: $0 -i unaligned_fasta_path -o output_file -t threads"
echo "\t-i Full path to unaligned fasta file of SARS-CoV-2 sequences"
echo "\t-o Output file path for aligned fasta file. New sequences will be *APPENDED* to this file"
echo "\t-t number of threads to use"
203 changes: 203 additions & 0 deletions scripts/global_tree_gisaid_start_tree.sh
@@ -0,0 +1,203 @@
#!/bin/bash

helpFunction()
{
echo "Make an ML phylogeny from a large FASTA file of GISAID sequences"
echo "Usage: $0 -i GISAID_fasta -o output_filename -s start_tree -t threads "
echo " -i Full path to unaligned fasta file of SARS-CoV-2 sequences from GISAID"
echo " -p Filpath to previous iteration to be updated (must contain an ft_SH.tree file and an excluded_sequences.tsv file)"
echo " -t number of threads to use"
exit 1 # Exit script after printing help
}

while getopts "i:p:t:" opt
do
case "$opt" in
i ) inputfasta="$OPTARG" ;;
p ) previous="$OPTARG" ;;
t ) threads="$OPTARG" ;;
? ) helpFunction ;; # Print helpFunction in case parameter is non-existent
esac
done


# Print helpFunction in case parameters are empty
if [ -z "$inputfasta" ] || [ -z "$previous" ] || [ -z "$threads" ]
then
echo "Some or all of the parameters are empty";
helpFunction
fi

DIR="$(cd "$(dirname "$0")" && pwd)"

inputdir=$(dirname $inputfasta)
outputfasta=global.fa
inputtree=previous_iteration_files/ft_SH.tree # this is copied over in a few lines

cd $inputdir

# first we copy over the previous iterations files for reproducibility
mkdir previous_iteration_files
cp $previous'/ft_SH.tree' previous_iteration_files/
cp $previous'/excluded_sequences.tsv' previous_iteration_files/

#check the copying worked
if [ -f "previous_iteration_files/ft_SH.tree" ]; then
echo "previous_iteration_files/ft_SH.tree exists."
else
echo "previous_iteration_files/ft_SH.tree does not exist. Something's wrong. Exiting."
exit 1
fi

if [ -f "previous_iteration_files/excluded_sequences.tsv" ]; then
echo "previous_iteration_files/excluded_sequences.tsv exists."
else
echo "previous_iteration_files/excluded_sequences.tsv does not exist. Something's wrong. Exiting."
exit 1
fi


# first we trim the sequences
echo ""
echo "Cleaning raw data"
echo ""

cleaned_gisaid=$inputfasta"_cleaned.fa"
bash $DIR/clean_gisaid.sh -i $inputfasta -o $cleaned_gisaid -t $threads


#### BUILD THE GLOBAL ALIGNMENT ######

echo ""
echo "Making global profile alignment"
echo ""
aln_global="$inputdir/aln_global_unmasked.fa"
bash $DIR/global_profile_alignment.sh -i $cleaned_gisaid -o $aln_global -t $threads


echo ""
echo "Masking alignment"
echo ""
aln_global_masked="$inputdir/aln_global_masked.fa"
bash $DIR/mask_alignment.sh -i $aln_global -o $aln_global_masked -t $threads


echo ""
echo "Filtering sequences that are shorter than 28000 bp and/or have >1000 ambiguities"
echo ""
aln_global_filtered="$inputdir/aln_global_filtered.fa"
esl-alimanip --lmin 28000 --xambig 1000 --informat afa --outformat afa --dna -o $aln_global_filtered $aln_global_masked

echo ""
echo "Removing sites that are >50% gaps, after converting N's to gaps"
echo ""

cp $aln_global_filtered tmp.aln
sed -i.bak '/^[^>]/s/N/-/g' tmp.aln
rm tmp.aln.bak

esl-alimask --gapthresh 0.5 --informat afa --outformat afa --dna -o $outputfasta -g tmp.aln

rm tmp.aln

echo "sequences downloaded from GISAID" >> alignments.log
echo $(grep ">" $inputfasta | wc -l) >> alignments.log
echo "//" >> alignments.log
echo "alignment stats of global alignment" >> alignments.log
esl-alistat $aln_global >> alignments.log
echo "alignment stats of global alignment after masking sites" >> alignments.log
esl-alistat $aln_global_masked >> alignments.log
echo "alignment stats after filtering out short/ambiguous sequences" >> alignments.log
esl-alistat $aln_global_filtered >> alignments.log
echo "alignment stats of global alignment after trimming sites that are >50% gaps" >> alignments.log
esl-alistat $outputfasta >> alignments.log


#### ESTIMATE THE GLOBAL TREE ######


# 1. Take the alignment and the tree, remove the irrelevant sequences with R, then add the new sequences with IQ-TREE

echo ""
echo "Removing unused sequences from input tree"
echo ""
grep ">" $outputfasta | cut -c 2- > alignment_names.txt
Rscript $DIR/clean_tree.R $inputtree alignment_names.txt

echo ""
echo "Adding new sequences to input tree with IQ-TREE"
echo ""
# get the latest IQ-TREE
wget https://github.com/iqtree/iqtree2/releases/download/v2.1.0/iqtree-2.1.0-Linux.tar.gz
tar -xvzf iqtree-2.1.0-Linux.tar.gz

# this just adds the new sequences with parsimony
# benchmarking shows that 1 thread is optimal
./iqtree-2.1.0-Linux/bin/iqtree2 -seed 1729 -s $outputfasta -g input_tree_cleaned.tree -n 0 -m JC -fixbr -nt 1 --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -pre iqtree_seqsadded_mp

echo ""
echo "Optimising tree with fasttree MP"
echo ""
# we have to do some contortions to set the optimal number of threads for fasttree, which is 3 (see fasttreeOMP.md)
env > old_env.txt
old_threads=$(grep -hoP '^OMP_NUM_THREADS=\K\d+' old_env.txt)
rm old_env.txt
export OMP_NUM_THREADS=3
FastTreeMP -nt -gamma -nni 0 -spr 2 -sprlength 1000 -boot 100 -log fasttree.log -intree iqtree_seqsadded_mp.treefile $outputfasta > $outputfasta'_ft_SH.tree'
if [ -n "$old_threads" ]; then
export OMP_NUM_THREADS=$old_threads
else
unset OMP_NUM_THREADS
fi

echo ""
echo "Cleaning tree with treeshrink"
echo ""
run_treeshrink.py -t $outputfasta'_ft_SH.tree' -q 0.05 -c -o treeshrink_SH

# now we update the excluded sequences file
echo ""
echo "Updating excluded sequences file"
echo ""
Rscript $DIR/update_excluded_seqs.R previous_iteration_files/excluded_sequences.tsv treeshrink_SH/global.fa_ft_SH_RS_0.05.txt

echo ""
echo "Re-rooting tree on hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05"
echo "see https://www.biorxiv.org/content/10.1101/2020.04.17.046086v1"
echo ""
nw_reroot 'treeshrink_SH/'$outputfasta'_ft_SH_0.05.tree' "'EPI_ISL_406801'" > ft_SH.tree


sed -i.bak "s/'//g" ft_SH.tree
rm ft_SH.tree.bak


echo "After filtering sequences with TreeShrink" >> alignments.log
nw_stats ft_SH.tree >> alignments.log

echo "//"
echo "Number of new sequences added this iteration" >> alignments.log
wc -l alignment_names_new.txt >> alignments.log

# run QC
Rscript $DIR/QC.R previous_iteration_files/ft_SH.tree ft_SH.tree $threads

# zip up for easy file transfer
xz -e -T $threads $outputfasta
xz -e -T $threads $aln_global
xz -e -T $threads $aln_global_filtered
xz -e -T $threads $aln_global_masked
# note: the unmasked alignment is $aln_global, already compressed above
xz -e -T $threads $inputfasta
xz -e -T $threads $inputfasta"_cleaned.fa"
xz -e -T $threads fasttree.log
xz -e -T $threads iqtree_seqsadded_mp.iqtree

rm goalign_amd64_linux
rm -rf iqtree-2.1.0-Linux/
rm iqtree_seqsadded_mp.uniqueseq.phy
rm iqtree_seqsadded_mp.parstree
rm iqtree-2.1.0-Linux.tar.gz
rm iqtree_seqsadded_mp.ckp.gz
rm metadata.csv
rm reference.fa
23 changes: 23 additions & 0 deletions scripts/lsd2.sh
@@ -0,0 +1,23 @@
# date with LSD2

git clone https://github.com/tothuhien/lsd2.git
cd lsd2/src/
make

cd ../..

# get the dates
nw_labels -I ft_SH.tree > labels.txt
cut -d '|' -f 3 labels.txt > dates.txt
paste labels.txt dates.txt > datefile.txt

# add number of lines to start of datefile.txt
l=$(wc -l datefile.txt | cut -d' ' -f1)

echo $l | cat - datefile.txt > temp && mv temp datefile.txt
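
# datefile.txt now looks something like this (illustrative):
#   <number of tips>
#   hCoV-19/Country/ID/2020|EPI_ISL_XXXXXX|2020-03-01<tab>2020-03-01
#   ...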

# lsd2
# -i input file
# -s sequence length
# -l 0.000000005 because this is the minimum brlen in the ft_SH.tree
./lsd2/bin/lsd2_unix -i ft_SH.tree -s 29000 -l 0.000000005 -d datefile.txt -v 1 -a 'b(2019-10-30,2019-12-17)'
91 changes: 0 additions & 91 deletions scripts/tree_ft.sh

This file was deleted.

22 changes: 22 additions & 0 deletions scripts/update_excluded_seqs.R
@@ -0,0 +1,22 @@
#!/usr/bin/env Rscript

# usage: Rscript update_excluded_seqs.R old_iteration_file treeshrink_file
args = commandArgs(trailingOnly=TRUE)

old_file = args[1]
old = read.delim(old_file, header=F)

treeshrink = args[2]

if (file.info(treeshrink)$size < 2) {
    new = old
} else {
    e = read.delim(treeshrink, header=F)
    e = t(e)
    e = e[complete.cases(e)]
    f = data.frame(cbind(e, "on long branch identified by treeshrink q=0.05"))
    names(f)=names(old)
    new = rbind(old, f)
}

write.table(new, file="excluded_sequences.tsv", sep="\t", col.names = F, row.names = F, quote=F)
Binary file added spr_iterations.png
113 changes: 108 additions & 5 deletions tree_estimation2.md
@@ -561,41 +561,144 @@ By mistake, I left out the `-noml` command in the last example. Turns out the ML
11.7.4 (after 3 rounds of SPRs of length 200): -330438
11.7.5 (after 3 rounds of SPRs of length 100): -330452




#### Iteration 12

A quick test of the latest version of IQ-TREE, which we've been working on making faster...


ME SPRs followed by ML NNIs

```
/usr/bin/time -o 12.1.1.mem.txt -v fasttree -nt -gamma -nosupport -sprlength 500 -nni 0 -spr 20 -refresh 0.8 -topm 1.5 -close 0.75 global.fa > 12.1.1.tree
/usr/bin/time -o 12.1.2.mem.txt -v fasttree -nt -gamma -nosupport -sprlength 50 -nni 0 -spr 20 -intree 12.1.1.tree global.fa > 12.1.2.tree
/usr/bin/time -o 12.1.3.mem.txt -v fasttree -nt -gamma -nosupport -sprlength 20 -intree 12.1.2.tree global.fa > 12.1.2.tree
wget https://github.com/iqtree/iqtree2/releases/download/v2.0.8/iqtree-2.0.8-Linux.tar.gz
tar -xvzf iqtree-2.0.8-Linux.tar.gz
/usr/bin/time -o iqtree2.08.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t BIONJ -pre BIONJ
/usr/bin/time -o iqtree2.08.NJ.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ -pre NJ
/usr/bin/time -o iqtree2.08.NJ-R.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -pre NJ-R
/usr/bin/time -o iqtree2.08.BIONJ-R.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t BIONJ-R -pre BIONJ-R
```

It works well, and fast too!!

Some results with -experimental, 4 cores, on soma (lots of memory). I show both the lnL of the starting NJ tree and the lnL of the final tree (after just two rounds of NNI). All with the JC model.

`-t BIONJ`
```
Percent of CPU this job got: 380%
Elapsed (wall clock) time (h:mm:ss or m:ss): 8:27:24
Maximum resident set size (kbytes): 50745644
Log-likelihood of BIONJ tree: -352742.080492
BEST SCORE FOUND : -347506.937
```

`-t NJ`
```
Percent of CPU this job got: 378%
Elapsed (wall clock) time (h:mm:ss or m:ss): 7:23:31
Maximum resident set size (kbytes): 50745304
Log-likelihood of NJ tree: -349888.823021
BEST SCORE FOUND : -346814.328
```

`-t NJ-R`
```
Percent of CPU this job got: 377%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:15:14
Maximum resident set size (kbytes): 50746932
Log-likelihood of NJ-R tree: -349910.476699
BEST SCORE FOUND : -346885.416
```

`-t BIONJ-R`
```
Percent of CPU this job got: 381%
Elapsed (wall clock) time (h:mm:ss or m:ss): 6:47:39
Maximum resident set size (kbytes): 50746768
Log-likelihood of BIONJ-R tree: -352887.323284
BEST SCORE FOUND : -347512.641
```

NJ-R provides the best balance of speed and accuracy, so I'll go with that for now, noting though that NJ gives very slightly higher likelihoods.


#### Iteration 13

Now I'll try to squeeze higher likelihoods out of IQ-TREE. A few obvious things to try first.


1. Reduce the minimum branch length (unlike fasttree, IQ-TREE doesn't have hard polytomies. Given that some polytomies in this tree have thousands of sequences, we probably want the minimum branch length to be at most 1/10000 of the branch length corresponding to a single substitution. The alignment is about 30,000 bases, so 1 substitution corresponds to a branch length of 1/30000, and 1/10000th of that is 0.000000003; see the quick check after this list. For convenience I'll just round this even further down to 0.0000000001)

2. Better models of rates (I'll move on to molecular evolution later)
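
A quick check of the branch length arithmetic in point 1 (same numbers as stated there):

```
# one substitution over ~30,000 sites, then 1/10,000 of that
awk 'BEGIN { printf "%.1e\n", 1/30000/10000 }'   # prints 3.3e-09
```

The runs: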

```
/usr/bin/time -o 13.1.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.1
/usr/bin/time -o 13.2.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+G -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.2
/usr/bin/time -o 13.3.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+G+I -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.3
/usr/bin/time -o 13.4.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R1 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.4
/usr/bin/time -o 13.5.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R2 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.5
/usr/bin/time -o 13.6.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R3 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.6
/usr/bin/time -o 13.7.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+I+R4 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.7
/usr/bin/time -o 13.8.mem.txt -v ./iqtree-2.0.8-Linux/bin/iqtree2 -s global.fa -m JC+R4 -fast -nt 4 -experimental --suppress-zero-distance --suppress-list-of-sequences --suppress-duplicate-sequence -t NJ-R --no-opt-gamma-inv -blmin 0.0000000001 -pre 13.8
```


Results:

13.1
lnL_NJ-R: -346219.079852
lnL: -345413.780
time: 4h:5m:0s
mem: 50746412

13.2
lnL_NJ-R: -334571.368294
lnL: -333702.914
time: 16h:45m:52s
mem: 105633656

13.3
lnl_NJ-R: -333649.112195
lnL: -332890.756
time: 19h:18m:2s
mem: 106700024

13.4
lnl_NJ-R: -337466.943109
lnL: -336651.617
time: 10h:47m:40s
mem: 56482076

13.5
lnl_NJ-R: -332073.220730
lnL: -331279.983
time: 17h:40m:5s
mem: 75205400

13.6
lnl_NJ-R: -331545.424656
lnL: -330704.610
time: 30h:30m:50s
mem: 95032788

At this point it's clear that IQ-TREE has the *potential* to do just as well as fasttree. What's required are a few things though. First, we need to compare the two directly in a fair way. Second, we need to move on to new alignments since there are substantial new masking methods for the latest alignments. Finally, we need a sensible way of asking whether the huge differences in likelihood we observe here are due to improvements in the tree topology, or just the model of evolution.

So, it's time to move on to a new setup.