Merge pull request #2 from johnne/devel

johnne authored Jan 10, 2023
2 parents 4f3398a + 381cc20 commit cd0c772
Showing 27 changed files with 2,376 additions and 235 deletions.
198 changes: 194 additions & 4 deletions README.md
@@ -29,10 +29,15 @@ conda activate ASV-clustering

There are several parameters that can be set for the different tools, but the
minimum required information is `rundir:`, which should be the name of a
subdirectory under `data/` that should contain:

1. `asv_seqs.fasta`, a fasta file with sequences for ASVs,
2. `asv_counts.tsv`, a tab-separated file with ASV ids as rows and sample names as columns, giving the counts of each ASV in the different samples,
3. `asv_taxa.tsv`, a tab-separated file with taxonomic assignments for ASVs.

Note that the `asv_taxa.tsv` file should have ranks in the header that match
the configuration settings for `split_rank` and `evaluation_rank` (`Family` and
`Species` by default, respectively).
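For illustration, a minimal `asv_taxa.tsv` could look like the following (tab-separated; the exact set of rank columns and the first column name are hypothetical here, as long as `split_rank` and `evaluation_rank` appear in the header):

```
ASV	Kingdom	Family	Species
asv1	Animalia	Noctuidae	Xestia c-nigrum
asv2	Animalia	Noctuidae	Agrotis exclamationis
```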

As an example, with the subdirectory `project1` under `data/` like so:

@@ -41,6 +46,7 @@
```
data/
├── project1
│ ├── asv_counts.tsv
│ ├── asv_seqs.fasta
│ ├── asv_taxa.tsv
```

you should set `rundir: project1`. This can be done either in a configuration
@@ -54,6 +60,190 @@ which you then point to with `--configfile <your-config-file>.yml` in the
snakemake call. Or you can set it directly with `--config rundir=project1` when
you start the workflow.

### Test run
You can try the workflow out by running it on a small test dataset under `data/test/`.
To do so, simply run:

```bash
snakemake --profile test
```

## Workflow overview
This workflow makes it easy to run OTU clustering with many different parameter
settings and then evaluate which settings work best for your data. `vsearch`
forms the basis of the workflow by creating pairwise alignments of the sequences
(often the most time-consuming step); several different clustering runs can then
be executed without having to re-create the alignments for each run.

### 1. Splitting the input
The workflow is set up to split the input data at the taxonomic rank of family
and perform clustering of sequences in each family in parallel. However, the
data can be split on any taxonomic rank, configurable via the `split_rank`
config parameter. By default the workflow runs on all unique taxa names at
rank `split_rank` found in the `data/{rundir}/asv_taxa.tsv` file, and lists
the taxa in the file `data/{rundir}/{split_rank}.txt`. You can create the latter
file yourself if you want, which will override the default behaviour and cause
the workflow to only run on the taxa listed. For example, if you list (one per
row) only your families of interest in `data/{rundir}/Family.txt` then the
workflow will only run with sequences belonging to those families.
Note that if you do not specify a file with ranks, _e.g._ `Family.txt`, the
workflow will read the `asv_taxa.tsv` file, identify taxa at `{split_rank}`, but
only output those that have at least one ASV with a total count > 0 in the counts
file **and** a sequence in the `asv_seqs.fasta` file.
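The default taxa listing described above can be sketched as follows (a minimal pure-Python illustration with hypothetical helper and variable names, not the workflow's actual code):

```python
def taxa_to_run(asv_taxa, counts, seqs, split_rank="Family"):
    """Return unique names at split_rank that have at least one ASV
    with total count > 0 and a sequence present in asv_seqs.fasta.

    asv_taxa: ASV id -> {rank: name}; counts: ASV id -> per-sample counts;
    seqs: ASV id -> sequence.
    """
    names = set()
    for asv, ranks in asv_taxa.items():
        # Require a sequence and a non-zero total count for the ASV.
        if asv in seqs and sum(counts.get(asv, [])) > 0:
            names.add(ranks[split_rank])
    return sorted(names)

asv_taxa = {
    "asv1": {"Family": "Noctuidae", "Species": "Xestia c-nigrum"},
    "asv2": {"Family": "Carabidae", "Species": "Carabus hortensis"},
}
counts = {"asv1": [5, 2], "asv2": [0, 0]}
seqs = {"asv1": "ACGT", "asv2": "ACCA"}
taxa = taxa_to_run(asv_taxa, counts, seqs)  # ['Noctuidae']
```

The resulting list would correspond to the contents of `data/{rundir}/{split_rank}.txt`.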

### 2. Filtering and formatting
Initially, the counts for each sequence are summed across samples and any
sequence with a sum of 0 is removed. The filtering step also ensures that the
taxonomic and sequence data line up, with no missing sequences in either.
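A minimal sketch of this filtering step (pure Python, hypothetical helper names; the workflow's own implementation may differ):

```python
def filter_asvs(counts, taxa, seqs):
    """Drop ASVs with zero total count and keep only ASVs present in all
    three inputs, so counts, taxonomy and sequences line up.

    counts: ASV id -> per-sample counts; taxa: ASV id -> taxonomy;
    seqs: ASV id -> sequence.
    """
    kept = {}
    for asv, sample_counts in counts.items():
        # Remove sequences whose counts sum to zero across samples.
        if sum(sample_counts) == 0:
            continue
        # Ensure no missing entries in the taxonomic or sequence data.
        if asv in taxa and asv in seqs:
            kept[asv] = sample_counts
    return kept

counts = {"asv1": [0, 0], "asv2": [3, 1], "asv3": [2, 2]}
taxa = {"asv1": "Noctuidae", "asv2": "Noctuidae", "asv3": "Carabidae"}
seqs = {"asv1": "ACGT", "asv2": "ACGA", "asv3": "ACGC"}
kept = filter_asvs(counts, taxa, seqs)  # asv1 dropped (zero total count)
```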

### 3. Aligning
Filtered sequences are then aligned per `split_rank` (Family by default) using
`vsearch` like so:

```bash
vsearch --usearch_global <input> --db <input> --self --userout <output-distance> \
--userfields query+target+id --maxaccepts 0 --maxrejects 0 --id {params.id} \
--iddef {params.iddef} --query_cov {params.query_cov} --threads {threads}
```

where `params.id=0.84`, `params.iddef=1`, `params.query_cov=0.9` are the default
settings (configurable via the config file).

### 4. OTU clustering
When sequences have been aligned and pairwise distances are available, the
workflow continues by clustering ASVs into OTUs using the tools listed in
`config["software"]`. By default the tools are:

- opticlust
- swarm
- dbotu3

#### 4.1 opticlust
The opticlust method is implemented in `mothur`. Here the optimal clustering of
sequences is found by iteratively moving sequences between OTU clusters in an
attempt to maximize the Matthews correlation coefficient (MCC), a metric
that combines true/false positives and negatives. In this context:
- true positives: sequences that are within a maximum pairwise distance of each
  other (defined by the `cutoff` parameter) and share an OTU cluster
- false positives: sequences further apart than the cutoff that share a cluster
- false negatives: sequences within the cutoff that are placed in different clusters
- true negatives: sequences further apart than the cutoff that are placed in different clusters

Opticlust takes as input a pairwise similarity matrix and a file with the sum of
counts across samples for each ASV.

#### 4.2 swarm
Swarm clusters sequences using a local linking threshold `d` (set to 1 by default)
which represents the maximum number of differences between two amplicons. The input
is a single fasta file where the total count of each sequence is suffixed
to the sequence ID.
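As a sketch of that input format (hypothetical helper; swarm's default abundance notation appends `_<count>` to the header, while `;size=<count>` is used with its `-z` flag):

```python
def to_swarm_fasta(seqs, totals):
    """Build fasta text with each ASV's total count suffixed to its id,
    e.g. '>asv1_42'. seqs: ASV id -> sequence; totals: ASV id -> count."""
    records = [f">{asv}_{totals[asv]}\n{seq}" for asv, seq in seqs.items()]
    return "\n".join(records) + "\n"

fasta = to_swarm_fasta({"asv1": "ACGT", "asv2": "ACGA"}, {"asv1": 42, "asv2": 7})
```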

#### 4.3 dbotu3
dbotu3 uses both sequences and their distribution across samples to cluster ASVs
into OTUs.

### 5. Generating OTU tables
While the output from the tools used here differs, the workflow generates a
standardized table called `asv_clusters.tsv` with ASVs and their corresponding
cluster for each combination of tool and taxa. This table has the following format:

| ASV | cluster |
|-------|----------|
| asv1 | cluster1 |
| asv2 | cluster1 |
| asv3 | cluster2 |



### 6. Workflow output

```bash
results/
├── opticlust
│   └── run1
│       ├── Taxa1
│       │   └── default
│       │       └── asv_clusters.tsv
│       ├── Taxa2
│       │   └── default
│       │       └── asv_clusters.tsv
│       └── Taxa3
│           └── default
│               └── asv_clusters.tsv
├── swarm
│   └── run1
│       ├── Taxa1
│       │   └── default
│       │       └── asv_clusters.tsv
│       ├── Taxa2
│       │   └── default
│       │       └── asv_clusters.tsv
│       └── Taxa3
│           └── default
│               └── asv_clusters.tsv
└── vsearch
    └── run1
        ├── Taxa1
        │   ├── asv_seqs.dist.gz
        │   └── asv_seqs.dist.reformat.gz
        ├── Taxa2
        │   ├── asv_seqs.dist.gz
        │   └── asv_seqs.dist.reformat.gz
        └── Taxa3
            ├── asv_seqs.dist.gz
            └── asv_seqs.dist.reformat.gz
```

In the example above, the configfile used contained:
```yaml
rundir: "run1"
software: ["swarm", "opticlust"]
run_name: "default"
```

You can see that each tool gets its own subdirectory under `results/`,
and that `rundir` (in this case "run1") in turn is created under each tool.
In addition, `run_name` is created as a subdirectory under each taxa directory.
This way the underlying data (ASV sequences, counts and taxonomic info) found in
the `data/<rundir>` directory can be used for any number of `run_name` settings.

### 7. Evaluation
True and false positives, as well as true and false negatives, can be evaluated
for each clustering result using the taxonomic assignments of ASVs. The `evaluation_rank`
config parameter (default = "Species") determines what to evaluate the clustering
against (whatever assignment is given at this rank is taken as the ground
truth for an ASV).

- Precision is calculated as: `precision = TP/(TP + FP)`
- Recall is calculated as: `recall = TP/(TP + FN)`

where

- True positives (TP) are counted as the number of times two ASVs belonging
to the same OTU cluster also share the same taxa at `evaluation_rank`
- False positives (FP) are calculated as: `FP = totalPositives - TP`
- False negatives (FN) are counted as the number of times ASVs that share
the same taxa at `evaluation_rank` are placed into different OTU clusters

The evaluation script used in this workflow begins by calculating the total number
of positives from a given dataframe. This is formulated as: `totalPositives = N * (N - 1) / 2`
where `N` is the total number of ASVs.
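The arithmetic can be sketched as follows (a minimal pure-Python illustration with hypothetical names, not the workflow's actual evaluation script; here positives are counted within each cluster, with `totalPositives = n * (n - 1) / 2` for a cluster of `n` ASVs, and summed):

```python
from itertools import combinations

def precision_recall(clusters, taxa):
    """clusters: ASV id -> OTU cluster; taxa: ASV id -> taxon at evaluation_rank."""
    tp = fp = fn = 0
    # Group ASVs by cluster: every pair within a cluster is a "positive".
    by_cluster = {}
    for asv, c in clusters.items():
        by_cluster.setdefault(c, []).append(asv)
    for members in by_cluster.values():
        n = len(members)
        total_positives = n * (n - 1) // 2  # all pairs within this cluster
        cluster_tp = sum(1 for a, b in combinations(members, 2) if taxa[a] == taxa[b])
        tp += cluster_tp
        fp += total_positives - cluster_tp  # FP = totalPositives - TP
    # False negatives: pairs sharing a taxon but split across clusters.
    by_taxon = {}
    for asv, t in taxa.items():
        by_taxon.setdefault(t, []).append(asv)
    for members in by_taxon.values():
        fn += sum(1 for a, b in combinations(members, 2) if clusters[a] != clusters[b])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

clusters = {"asv1": "otu1", "asv2": "otu1", "asv3": "otu2"}
taxa = {"asv1": "sp1", "asv2": "sp1", "asv3": "sp1"}
precision, recall = precision_recall(clusters, taxa)
```

In this toy example the single within-cluster pair shares a species (precision = 1.0), while the two pairs split across `otu1`/`otu2` lower the recall.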


## Benchmarking

### FINBOL

We benchmarked the workflow on the FINBOL database with the following parameters:

| tool | default | params1 | params2 | params3 | params4 | params5 | params6 | params7 | params8 | params9 | params10 | params11 | params12 |
|-----------|--------------------------|-------------|--------------|-------------|--------------|---------------------|-------------|-------------|-------------|-------------|-------------|-------------|------------|
| swarm | -d 1 --fastidious -b 3 | -d 2 -b 0 | -d 3 -b 0 | -d 4 -b 0 | -d 5 -b 0 | -d 4 -m 6 -p 3 -b 0 | -d 13 -b 0 | -d 15 -b 0 | -d 17 -b 0 | -d 20 -b 0 | -d 22 -b 0 | -d 23 -b 0 | -d 25 -b 0 |
| opticlust | cutoff=0.03 | cutoff=0.01 | cutoff=0.015 | cutoff=0.02 | cutoff=0.025 | cutoff=0.035 | cutoff=0.04 | cutoff=0.05 | cutoff=0.06 | cutoff=0.07 | cutoff=0.08 | cutoff=0.09 | cutoff=0.1 |

Our analysis showed that `params4` with opticlust gave the highest precision/recall.

## More references

- [Brandt et al 2021](https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13398)
53 changes: 53 additions & 0 deletions data/test/Family.txt
@@ -0,0 +1,53 @@
Lycosidae
Noctuidae
Tetragnathidae
Lasiocampidae
Tortricidae
Carabidae
Sarcophagidae
Sphingidae
Erebidae
Geometridae
Nymphalidae
Nolidae
Hesperiidae
Dytiscidae
Xyloryctidae
Calliphoridae
Araneidae
Megachilidae
Syrphidae
Salticidae
Tenebrionidae
Thomisidae
Cicadellidae
Chrysomelidae
Chironomidae
Curculionidae
Oecophoridae
Ichneumonidae
Theridiidae
Buprestidae
Hepialidae
Crabronidae
Largidae
Miturgidae
Sparassidae
Staphylinidae
Derodontidae
Linyphiidae
Formicidae
Coleophoridae
Braconidae
Miridae
Agromyzidae
Mycetophilidae
Lycaenidae
Fanniidae
Triozidae
Scytodidae
Anamidae
Aphididae
Encyrtidae
Psyllidae
Culicidae
