This repository contains a Snakemake workflow for taxonomic assignment, filtering and clustering of Amplicon Sequence Variants (ASVs).
Currently the following clustering tools are supported:
Software | Reference | Code |
---|---|---|
SWARM | Mahé et al. 2014 | GitHub |
OptiClust | Westcott & Schloss 2017 | GitHub |
dbOTU3 | Olesen et al. 2017 | GitHub |
The idea behind this workflow is to make it easy to run OTU clustering with many different parameter settings and then evaluate which settings work best for your data.
To install the workflow, either clone the repository by running:

```bash
git clone git@github.com:insect-biome-atlas/happ.git
cd happ
```

or visit the release page and download the latest release.
The software required to run the workflow can either be installed via pixi or with Conda.
Follow the instructions to install pixi on your system, then run:

```bash
pixi shell
```

from within the root of the repository. This will activate an interactive shell ready to use with the workflow.
If you prefer to use Conda, you can create a new environment with the required software by running:

```bash
conda env create -f environment.yml
```

Then activate the environment with:

```bash
conda activate happ
```
Tip

If you are running the workflow on a high performance computing system with the SLURM workload manager, use the configuration profile under `profiles/slurm`. If you're running on the Dardel HPC system at PDC specifically, use the configuration profile under `profiles/dardel`. See the README files in the respective subdirectories for more information.
Once you have activated the software environment (either with `pixi shell` or `conda activate happ` as described above), the basic syntax to run the workflow is:

```bash
snakemake --sdm <apptainer/conda> --configfile <path-to-your-configfile.yml> --profile <slurm/dardel/local> <additional-arguments>
```
The `--configfile` argument specifies the path to a configuration file in YAML format. The `--profile` argument specifies the configuration profile to use. The `--sdm` flag is short-hand for `--software-deployment-method` and specifies how Snakemake will handle rule-specific dependencies. See below for a description of each of these arguments.

The workflow can be configured using a configuration file in YAML format. A default config file is available at `config/config.yml`. It's recommended that you copy this file and make your changes in the copy, then pass it on the command line with `--configfile <path-to-your-configfile>`. See the Configuration section for more information on how to configure the workflow.
The workflow comes with three different configuration profiles. These are essentially subdirectories containing a `config.yaml` file which sets various Snakemake command line arguments and defines resources for running the workflow on different systems. The available profiles are:

- `local`: for running the workflow on a local machine (e.g. your laptop)
- `slurm`: for running the workflow on a system with the SLURM workload manager
- `dardel`: for running the workflow on the Dardel HPC system at PDC

To use a profile, simply add `--profile <profile-name>` to the Snakemake command line call. For example, to run the workflow on a system with SLURM you would run:
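```bash
snakemake --sdm apptainer --configfile <path-to-your-configfile.yml> --profile slurm
```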
Once you have installed the main software packages needed to run the workflow, you also have to specify how Snakemake will handle rule-specific software dependencies. This is controlled by the `--sdm` command line argument (short-hand for `--software-deployment-method`). We recommend using Apptainer if it is available on your system; to do so, add `--sdm apptainer` to the Snakemake command line call. Alternatively, you can use Conda, in which case you would use `--sdm conda`.
The most important parameters in the config file are the input files:

- `asv_seqs.fasta`: ASV sequences in FASTA format
- `asv_counts.tsv`: tab-separated file with counts of ASVs (rows) in samples (columns)
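For illustration, an `asv_counts.tsv` file with two samples could look like this (ASV and sample names are arbitrary):

```
ASV	sample1	sample2
ASV1	120	0
ASV2	13	5
```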
These files must be placed in a subdirectory under `data/` that is specified by the `rundir` parameter in the config file. The `rundir` parameter should be the name of the subdirectory containing the input files. The default config file contains:

```yaml
rundir: "test"
```

and the input files are placed under `data/test/`.
The `run_name:` parameter defines the name of the run and can be used to separate different runs of the workflow (e.g. with different settings on the same input data). The default is `run1`.

The `split_rank:` parameter specifies the taxonomic rank to split the ASVs by before running the ASV clustering tools. ASVs that do not share the same taxon at `split_rank` are never compared during clustering, so you should set this parameter to a relatively high rank in the taxonomy. The default is `Family`.
The `ranks:` parameter specifies the taxonomic ranks used in the taxonomic assignments; as noted below, these must match the ranks present in your taxonomic assignments file.
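Taken together, the top of a config file might look like the following sketch (the rank list shown is an assumption based on the ranks used in the examples in this document):

```yaml
rundir: "test"        # input files under data/test/
run_name: "run1"      # separates different runs on the same input
split_rank: "Family"  # ASVs are split by this rank before clustering
ranks:                # ranks expected in the taxonomic assignments
  - "Kingdom"
  - "Phylum"
  - "Class"
  - "Order"
  - "Family"
  - "Genus"
  - "Species"
  - "BOLD_bin"
```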
HAPP uses taxonomic information about your input ASVs to split sequences by taxonomic ranks, allowing for parallel clustering of ASVs. Therefore the ASVs need to be assigned to taxonomic ranks. This can be done using the workflow itself, or by providing a file with taxonomic assignments.
The workflow supports taxonomic assignment using the following tools:
- SINTAX
- EPA-NG (including assignments with GAPPA)
- SINTAX + EPA-NG (reassigning SINTAX assignments with EPA-NG)
- vsearch (implemented in QIIME2)
To assign taxonomy to your input ASVs you need to provide a reference database that works with SINTAX and/or with EPA-NG and GAPPA, depending on the tools you want to use. See below for instructions on how to obtain and configure these references.
Important

You will also need to make sure that the `split_rank` and `ranks` config parameters are set correctly to match the taxonomic ranks in your taxonomic assignments file.
A SINTAX-compatible reference database containing COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database is available on Figshare. From there, simply download the file `bold_clustered.sintax.fasta.gz`, unzip it and set the path to the file in the config file under the `sintax` section:

```yaml
sintax:
  ref: "/path/to/bold_clustered.sintax.fasta"
```
By default, SINTAX assignments are made using a cutoff of 0.8. This can be changed by setting the `cutoff` parameter in the config file under the `sintax` section.
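For example, to make the default cutoff explicit:

```yaml
sintax:
  ref: "/path/to/bold_clustered.sintax.fasta"
  cutoff: 0.8
```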
The phylogenetic placement/assignment tools EPA-NG and GAPPA require a reference tree, a multiple sequence alignment and a reference taxonomy file. A compatible reference that allows assignments to the classes Collembola, Diplura, Protura and Insecta is available to download from https://github.com/insect-biome-atlas/paper-bioinformatic-methods/tree/main/data/chesters_tree. The files you need are:

- chesters_new_outgroups_aligned.trim0.9.fasta (alignment)
- chesters_new_outgroups.nwk (tree)
- taxonomy.tsv (taxonomy)

Download these files and place them in a directory on your system, then set the paths to the files under the `epa-ng` section in the config file:

```yaml
epa-ng:
  tree: "/path/to/chesters_tree/chesters_new_outgroups.nwk"
  msa: "/path/to/chesters_new_outgroups_aligned.trim0.9.fasta"
  ref_taxonomy: "/path/to/chesters_tree/taxonomy.tsv"
```
EPA-NG and GAPPA can be further configured with the following parameters under the `epa-ng` section in the config file:

- `model:` the starting model used by RAxML-NG to evaluate the reference tree.
- `heuristic:` the preplacement heuristic used by EPA-NG. The default is `dyn-heur`, which corresponds to the default dynamic mode in EPA-NG. Also available are `baseball-heur` (as used by pplacer) and `no-heur` (no preplacement heuristic).
- `chunk_size:` the number of sequences to process at a time. Set this to a lower value if you run out of memory during the EPA-NG step.
GAPPA can be configured using the `gappa:` section nested under `epa-ng:`:

- `distribution_ratio:` determines how GAPPA handles edges with two possible annotations. If set to `-1` (default), the program will determine the ratio automatically from the 'distal length' specified per placement.
- `consensus_thresh:` when assigning taxonomy to missing labels in the reference, require this consensus threshold to assign a label. The default is `1`.
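Putting it together, the `epa-ng` section could look like the following sketch (the `model` string and `chunk_size` value are illustrative assumptions, not documented defaults):

```yaml
epa-ng:
  tree: "/path/to/chesters_tree/chesters_new_outgroups.nwk"
  msa: "/path/to/chesters_new_outgroups_aligned.trim0.9.fasta"
  ref_taxonomy: "/path/to/chesters_tree/taxonomy.tsv"
  model: "GTR+G"           # assumed starting model for RAxML-NG
  heuristic: "dyn-heur"    # default; alternatives: baseball-heur, no-heur
  chunk_size: 5000         # illustrative; lower this if EPA-NG runs out of memory
  gappa:
    distribution_ratio: -1 # default: determine the ratio automatically
    consensus_thresh: 1    # default
```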
To run the `vsearch` taxonomic assignment method (implemented in QIIME2) you need to provide paths to files compatible with QIIME2 under the `qiime2:` config section:

- `ref:` the path to a fasta file containing reference sequences for the taxonomic assignment with the QIIME2 `feature-classifier` plugin.
- `taxfile:` the path to a file containing the taxonomy for the reference sequences in the fasta file.
- `ranks:` a list of taxonomic ranks present in the reference.

See the QIIME2 docs for how to obtain and format these files.
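A sketch of the `qiime2` section, with hypothetical file paths:

```yaml
qiime2:
  ref: "/path/to/qiime2_ref_seqs.fasta"        # hypothetical path
  taxfile: "/path/to/qiime2_ref_taxonomy.tsv"  # hypothetical path
  ranks: ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]
```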
Tip

If you downloaded the COIDB SINTAX reference above you can leave the `ref:`, `taxfile:` and `ranks:` config parameters empty. HAPP will then generate the necessary files using the SINTAX reference.
If you want to run SINTAX and EPA-NG and use the latter to reassign SINTAX assignments, include `sintax+epa-ng` in the `taxtools` list in the config file:

```yaml
taxtools:
  - "sintax+epa-ng"
```

To use this as the taxonomic source for downstream steps, also set the `taxonomy_source:` parameter in the config file to `sintax+epa-ng`:

```yaml
taxonomy_source: "sintax+epa-ng"
```
We have found that SINTAX has high precision at low taxonomic ranks (e.g. species), meaning that when it makes an assignment, that assignment is usually correct. However, SINTAX is conservative and will often fail to make assignments at higher taxonomic ranks (e.g. order), leaving ASVs unclassified. EPA-NG, on the other hand, is less conservative and will often make correct assignments at higher ranks such as order. By combining the two methods we can take advantage of the strengths of both, using SINTAX to make assignments at low ranks and EPA-NG to update higher-level assignments where they exist.
The `reassign:` config section handles this behaviour and can be configured with the following parameters:

- `placeholder_rank:` the taxonomic rank at which both SINTAX and EPA-NG assignments must be identical in order for any updates to be made. The default is `Class`, which means that if both SINTAX and EPA-NG agree on the class-level assignment for an ASV, the EPA-NG assignment can be used to update the SINTAX assignment (if additional criteria are met, see below).
- `placeholder_taxa:` a list of taxa names at rank = `placeholder_rank` for which to make updates. This list should only contain taxa with sufficient coverage in the reference database used for EPA-NG.
- `reassign_ranks:` a list of taxonomic ranks at which to update assignments. The default is `["Order"]`, which means that assignments will be updated at the order level if all criteria are met.
- `downstream_ranks:` a list of taxonomic ranks below the `reassign_ranks`. Taxonomic assignments for ranks in this list will be prefixed with 'unclassified.' if the assignment at `reassign_ranks` is updated. The default is `["Family", "Genus", "Species", "BOLD_bin"]`.
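As a sketch, a `reassign:` section using the documented defaults might look like this (the `placeholder_taxa` list is an example only and should contain taxa well covered by your EPA-NG reference):

```yaml
reassign:
  placeholder_rank: "Class"
  placeholder_taxa: ["Insecta", "Collembola"]  # example taxa
  reassign_ranks: ["Order"]
  downstream_ranks: ["Family", "Genus", "Species", "BOLD_bin"]
```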
With default settings, this means that ASVs without an assignment at Order in SINTAX will be updated with the Order level EPA-NG assignment if the EPA-NG assignment is identical to the SINTAX assignment at Class. Downstream ranks will be left unchanged, unless they are also unassigned in which case they will be prefixed with 'unclassified.' followed by the updated assignment at the rank above.
If you already have taxonomic assignments for your ASVs, you can point to the file with taxonomic assignments in the config file with the `taxonomy_source:` parameter:

```yaml
taxonomy_source: "/path/to/asv_taxa.tsv"
```
This file should be a tab-separated file with the ASV id in the first column and subsequent columns with taxonomic assignments at different ranks. An example of such a file is shown below:
ASV | Kingdom | Phylum | Class | Order | Family | Genus | Species | BOLD_bin |
---|---|---|---|---|---|---|---|---|
ASV1 | Animalia | Arthropoda | Insecta | Coleoptera | Scarabaeidae | Melinopterus | Melinopterus punctatosulcatus | BOLD:AAJ8574 |
ASV2 | Animalia | Arthropoda | Insecta | Lepidoptera | Tortricidae | Dichrorampha | Dichrorampha sylvicolana | BOLD:AAL6971 |
If you do not want to run additional taxonomic assignment tools, you can set the `taxtools:` parameter to an empty list in the config file:

```yaml
taxtools: []
```
Once the ASVs have been clustered into OTU clusters, the workflow will attempt to assign a consensus taxonomy to each cluster. This is done by comparing the taxonomic assignments of the ASVs in the cluster, starting from the lowest taxonomic rank and moving up until a consensus is reached. A consensus is reached when the sum of the abundance-weighted assignments for a taxon is equal to or greater than a certain threshold (configurable by the `consensus_threshold` parameter in the config file; the default is 80%). The consensus taxonomy is then assigned to the cluster. The `consensus_ranks` parameter in the config file specifies the taxonomic ranks to use when assigning consensus taxonomy. The default is `["Family", "Genus", "Species"]`, which means that the workflow will only attempt to assign a consensus using these ranks.
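In config form, assuming these are top-level parameters as the text suggests, the documented defaults would be:

```yaml
consensus_threshold: 80
consensus_ranks: ["Family", "Genus", "Species"]
```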
Important

The ranks specified with `consensus_ranks:` must be present in the taxonomic assignments file used as input to the workflow. If a rank is missing from the taxonomic assignments file, the workflow will not be able to assign a consensus taxonomy at that rank. If you are using a custom taxonomic assignments file, make sure that it contains the ranks specified in the `consensus_ranks:` parameter.
As an example, if a cluster contains 3 ASVs with the following taxonomic assignments and total sum of counts across samples:
ASV | Family | Genus | Species | ASV_sum |
---|---|---|---|---|
ASV1 | Tenthredinidae | Pristiphora | Pristiphora mollis | 20 |
ASV2 | Tenthredinidae | Pristiphora | Pristiphora cincta | 60 |
ASV3 | Tenthredinidae | Pristiphora | Pristiphora leucopodia | 20 |
then at Species level the abundance-weighted taxonomic assignment is 60% Pristiphora cincta and 20% each for Pristiphora mollis and Pristiphora leucopodia. At an 80% consensus threshold we cannot assign a taxonomy at Species level to the cluster, so the algorithm moves up to Genus level, where we have 100% agreement for Pristiphora. The final consensus taxonomy for the cluster will then be:
Cluster | Family | Genus | Species |
---|---|---|---|
Cluster1 | Tenthredinidae | Pristiphora | unresolved.Pristiphora |
At `consensus_threshold: 60` we would have been able to assign taxonomy at the Species level and the cluster taxonomy would have been:
Cluster | Family | Genus | Species |
---|---|---|---|
Cluster1 | Tenthredinidae | Pristiphora | Pristiphora cincta |
The workflow supports optional preprocessing of the input data by filtering ASVs by length and/or removal of ASVs with in-frame stop codons. These steps are controlled by the `preprocess` section in the config file. The following parameters are available:
- `filter_length:` a boolean specifying whether to filter ASVs by length. The default is `False`.
- `min_length:` the minimum length of ASVs to keep. The default is `403`.
- `max_length:` the maximum length of ASVs to keep. The default is `418`.
- `filter_codons:` a boolean specifying whether to filter ASVs for in-frame stop codons. The default is `False`.
- `stop_codons:` a comma-separated list of stop codons to look for. The default is `"TAA,TAG"`.
- `start_position:` the position in the ASV sequence at which to start looking for stop codons. The default is `2`.
- `end_position:` the position in the ASV sequence at which to stop looking for stop codons. The default is `0`, which means that the entire sequence is checked.
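For example, to enable both filters with the documented default values:

```yaml
preprocess:
  filter_length: True
  min_length: 403
  max_length: 418
  filter_codons: True
  stop_codons: "TAA,TAG"
  start_position: 2
  end_position: 0   # 0 = check the entire sequence
```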
The workflow supports optional chimera removal using the uchime algorithm implemented in vsearch. Chimera detection can be run either in 'batchwise' mode, using the data under `rundir` directly, or in 'samplewise' mode, in which the `asv_counts.tsv` file is used to generate sample-specific fasta files containing sequences with a count > 0 in each sample. Sequences identified as non-chimeric are then passed downstream in the workflow and used as input for clustering.

Config parameters for the chimera removal part of the workflow are nested under the `chimera:` section in the config file:
- `remove_chimeras:` a boolean specifying whether to run chimera filtering or not. The default is `True`.
- `run_name:` a name for the chimera filtering run. This is used to separate different runs of the chimera filtering. The default is `chimera1`.
- `method:` the chimera filtering method to use. Can be either `batchwise` or `samplewise`. The default is `samplewise`.
- `algorithm:` the chimera detection algorithm to use. Can be either `uchime_denovo`, `uchime2_denovo` or `uchime3_denovo`. The default is `uchime_denovo`. The latter two require perfect matches between the ASV sequence and a chimeric model, whereas `uchime_denovo` does not.
- `min_samples_shared:` specific to the batchwise mode; specifies the number of samples in which chimeric ASVs have to be present together with their 'parents' (see the Uchime docs) in order to be marked as chimeric. The default is `1`.
- `min_frac_samples_shared:` similar to `min_samples_shared`, but instead of an absolute number requires that sequences are present with their parents in a fraction of the samples in which they are present.
- `min_chimeric_samples:` refers to the samplewise mode; determines in how many samples an ASV has to be marked as chimeric in order to be filtered as chimeric. Setting this value to `0` (the default) means that sequences have to be marked as chimeric in all samples where they are present in order to be filtered as chimeric.
- `min_frac_chimeric_samples:` analogous to `min_chimeric_samples` but specifies a fraction of samples instead of an absolute number.
The `dn`, `mindiffs`, `mindiv` and `minh` parameters are specific to how the chimeric score of sequences is calculated. Please see the Uchime docs for details.
In all algorithms, ASV sequences are first aligned to other ASVs that are above a certain abundance threshold. This so-called abundance skew threshold is by default set to 2.0 for the `uchime_denovo` and `uchime2_denovo` algorithms and to 16.0 for `uchime3_denovo`. However, this value can be overridden by explicitly setting `abskew` in the config file under `chimera:`.
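As a sketch, a `chimera:` section enabling samplewise filtering with the documented defaults, with `abskew` made explicit:

```yaml
chimera:
  remove_chimeras: True
  run_name: "chimera1"
  method: "samplewise"
  algorithm: "uchime_denovo"
  min_chimeric_samples: 0  # 0 = must be chimeric in all samples where present
  abskew: 2.0              # default for uchime_denovo/uchime2_denovo; 16.0 for uchime3_denovo
```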
The workflow can be configured to run one or more tools to cluster ASVs into OTUs. The `software` parameter takes a list of supported tool names: currently `swarm`, `opticlust` and `dbotu3`.
Below are specific parameters for tools used in the workflow.
Pairwise identities between ASVs are obtained by running vsearch with the `usearch_global` setting. This output is used as input by `opticlust` for clustering ASVs. Below are the parameters specific to vsearch:
- `id:` sets the minimum pairwise identity to report. The default is `0.84`, meaning that sequences with a pairwise identity below 84% are rejected.
- `iddef:` sets the identity definition used by vsearch. The default setting of `1` means that identity is calculated as (matching columns) / (alignment length). Refer to the vsearch manual for other settings.
- `query_cov:` sets the minimum aligned fraction of the query sequence. The default setting of `0.9` means that alignments are rejected if this fraction is less than 90%.
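In config form, assuming these parameters sit under a `vsearch:` section analogous to the tool sections below:

```yaml
vsearch:
  id: 0.84        # minimum pairwise identity to report
  iddef: 1        # identity = matching columns / alignment length
  query_cov: 0.9  # minimum aligned fraction of the query
```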
The opticlust method is implemented in `mothur`. Here the optimal clustering of sequences is found by iteratively moving sequences between OTU clusters in an attempt to maximize the Matthews Correlation Coefficient (MCC), a metric that combines true/false positives and negatives. In this context:

- true positives are sequences that are within a maximum pairwise distance from each other (defined by the `cutoff` parameter) and that share an OTU cluster,
- false positives are further in distance than the cutoff but share a cluster,
- false negatives are within the cutoff but are placed in different clusters, and
- true negatives are further in distance than the cutoff and are placed in different clusters.

Opticlust takes as input a pairwise similarity matrix and a file with the sum of counts across samples for each ASV. For a full description of opticlust parameters, see the manual. Below are the parameters specific to opticlust:
- `aligner:` specifies whether to use `vsearch` (default) or the built-in aligner in `mothur` for generating pairwise identities. Note that the latter has not been tested for this workflow.
- `delta:` sets the stable value for the metric in opticlust.
- `cutoff:` controls at what similarity threshold clusters are generated.
- `initialize:` sets the initial randomization for opticlust. The default is `singleton`, where each sequence is randomly assigned to its own OTU. The other accepted setting, `oneotu`, assigns all sequences to one OTU.
- `precision:` sets the floating point precision for opticlust.
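A sketch of the `opticlust` section; `aligner` and `initialize` show the documented defaults, while the remaining values are illustrative assumptions (mothur's own defaults):

```yaml
opticlust:
  aligner: "vsearch"
  delta: 0.0001           # illustrative stable value for the metric
  cutoff: 0.03            # illustrative; cluster at 97% similarity
  initialize: "singleton"
  precision: 100          # illustrative floating point precision
```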
Swarm clusters sequences using a local linking threshold `d`, which represents the maximum number of differences between two amplicons. The input is a single fasta file where the total count of each sequence is suffixed to the sequence ID.
For a full description of Swarm parameters, see the manual. Below are the parameters specific to Swarm:
- `differences:` specifies the maximum number of differences (mismatches) allowed between two ASVs, meaning that two ASVs will be grouped if they have `d` (or fewer) differences.
- `no-otu-breaking:` deactivates Swarm's built-in cluster refinement when using `d=1`.
- `fastidious:` specifies whether to perform a second clustering pass (when set to `True`) to reduce the number of small clusters when using `d=1`.
- `boundary:` defines the minimum abundance of what should be considered a large cluster when using `d=1`.
The parameters `match-reward`, `mismatch-penalty`, `gap-opening-penalty` and `gap-extension-penalty` allow control of Swarm's advanced pairwise alignment options.
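A sketch of the `swarm` section; `differences: 1` with `fastidious` enabled is a common setup, and the alignment scores shown are Swarm's own defaults, included for illustration:

```yaml
swarm:
  differences: 1
  no-otu-breaking: False
  fastidious: True          # only meaningful with d=1
  boundary: 3               # minimum abundance of a 'large' cluster
  match-reward: 5
  mismatch-penalty: 4
  gap-opening-penalty: 12
  gap-extension-penalty: 4
```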
dbotu3 uses both sequences and their distribution across samples to cluster ASVs into OTUs. It takes as input ASV sequences in fasta format and a table of counts.
For a full description of dbotu3 parameters, see the documentation. Below are the parameters specific to dbotu3:
- `dist:` sets the maximum allowed genetic dissimilarity between sequences.
- `abund:` sets the minimum fold difference for comparing two OTUs.
- `pval:` sets the minimum p-value for merging ASVs.
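A sketch of the `dbotu3` section, using dbotu3's own documented defaults as illustrative values:

```yaml
dbotu3:
  dist: 0.1      # maximum genetic dissimilarity between sequences
  abund: 10.0    # minimum fold difference for comparing two OTUs
  pval: 0.0005   # p-value threshold for merging ASVs
```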
The generated clusters can be filtered for noise such as nuclear mitochondrial DNA segments (NUMTs) and sequencing errors. This is done using taxonomic information, sequence similarity and abundance information. Below are the parameters specific to the noise filtering step:

- `run_name:` sets the name of the run for the noise filtering step. Similarly to the `chimera` section, this is used to separate different runs of the noise filtering.
- `split_rank:` sets the taxonomic rank to split the ASVs by before running the noise filtering.
- `assignment_rank:` sets the taxonomic rank to use for the noise filtering. ASVs unassigned or ambiguously assigned at this rank are considered noise and will be removed. To skip this step, set `assignment_rank: ""`.
- `max_target_seqs:` sets the maximum number of target sequences to return for the vsearch alignment step.
- `min_match:` sets the minimum threshold of sequence similarity (in %) for considering any OTU as an error of another.
- `n_closest:` sets the number of potential parent clusters to use in the comparison when identifying noise.
- `echo_min_overlap:` in the 'echo filter', sets the minimum fraction of the putative "noise" OTU samples that must also contain the putative authentic CO1 "parent" OTU.
- `echo_max_read_ratio:` sets the maximum value of the ratio between noise and parent reads.
- `echo_read_ratio_type:` sets the type of read ratio considered by the `echo_max_read_ratio` criterion. Can be either `max` or `mean`.
- `echo_require_corr:` specifies whether a significant correlation between noise and parent read numbers should be required before the putative OTU is removed.
- `evo_local_min_overlap:` in the 'local evo filter', sets the minimum fraction of the putative "noise" OTU samples that must also contain the putative authentic CO1 "parent" OTU.
- `dist_type_local:` in the 'local evo filter', specifies whether distances between representative ASVs should be based on unweighted amino acid distances (`dadn`) or amino acid distances weighted by biochemical similarities (`wdadn`).
- `dist_threshold_local:` in the 'local evo filter', sets the distance value above which an OTU is considered to represent noise.
- `dist_threshold_global:` in the 'global evo filter', sets the distance value above which an OTU is considered to represent noise.
- `abundance_cutoff_type:` specifies whether the abundance cutoff should be based on the sum of reads across samples in the dataset (`sum`) or the max number of reads (`max`).
- `abundance_cutoff:` sets the minimum number of reads required for an OTU to pass the abundance filter.
- `codon_table:` sets the codon table to use for translation when aligning sequences with pal2nal.
- `codon_start:` sets the reading frame start (1-based).
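Finally, a sketch of the noise filtering section (the section name and all values are assumptions for illustration; check the default config file for the actual layout and defaults):

```yaml
noise_filtering:             # assumed section name
  run_name: "noise1"         # example run name
  split_rank: "Family"
  assignment_rank: "Order"   # set to "" to skip this filter
  max_target_seqs: 10        # illustrative
  min_match: 84              # illustrative, in %
  n_closest: 10              # illustrative
  abundance_cutoff_type: "sum"
  abundance_cutoff: 3        # illustrative
  codon_table: 5             # illustrative: invertebrate mitochondrial code
  codon_start: 1             # reading frame start (1-based)
```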