A Nextflow pipeline for running the ARTIC network's fieldbioinformatics tools (https://github.com/artic-network/fieldbioinformatics), with a focus on ncov2019.
This version was forked from COG-UK and customized for CanCOGeN-VirusSeq by adding a dehosting step, switching the variant caller from ivar to freebayes, and adding further artifact filtering steps. This version is specialized for the Illumina workflow; Nanopore support is retained unchanged from the COG-UK version. This documentation focuses on the differences in functionality from the COG-UK version linked above.
This Nextflow pipeline automates the ARTIC network nCoV-2019 novel coronavirus bioinformatics protocol. It is being developed to aid the harmonisation of the analysis of sequencing data generated by the COG-UK project. It will turn SARS-CoV-2 sequencing data (Illumina or Nanopore) into consensus sequences and provide other helpful outputs to assist the project's sequencing centres with submitting data.
nextflow run /path/to/repo/ncov2019-artic-nf [-profile conda,singularity,docker,slurm,lsf] \
--illumina \
--prefix "output_file_prefix" \
--directory /path/to/reads \
--bed /path/to/resources/nCoV-2019_v3_fixed.bed \
--primer_pairs_tsv /path/to/resources/nCoV-2019_outer_primernames.tsv \
--ref /path/to/resources/nCoV-2019.reference.fasta \
--composite_ref /path/to/resources/composite_human_virus_reference.fasta \
--viral_contig_name MN908947.3 \
--cpus 8
The `composite_ref` and `viral_contig_name` options control the dehosting process. The composite reference genome should be created by merging the SARS-CoV-2 reference genome with the human reference genome, then indexing it with `bwa index`. The `primer_pairs_tsv` argument is a simple two-column tab-delimited file describing the outer pair of primers for each amplicon. This allows additional amplification artifact filtering.
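
The composite reference described above could be built along the following lines. This is a minimal sketch, not the authors' procedure: the function names `build_composite_ref` and `list_contigs` are hypothetical helpers, the input FASTA files are assumed to be uncompressed, and the indexing step only runs if `bwa` is on the `PATH`.

```shell
# Sketch: build a composite human + SARS-CoV-2 reference for dehosting.
# build_composite_ref and list_contigs are illustrative helpers,
# not part of the pipeline.

build_composite_ref() {
    human_fasta=$1   # human reference FASTA (assumed uncompressed)
    viral_fasta=$2   # SARS-CoV-2 reference FASTA
    out_fasta=$3     # composite FASTA to pass to --composite_ref

    # "awk 1" copies every line and guarantees a trailing newline, so the
    # first viral header cannot be glued onto the last human sequence line.
    awk 1 "$human_fasta" "$viral_fasta" > "$out_fasta"

    # Index the composite reference if bwa is installed.
    if command -v bwa >/dev/null 2>&1; then
        bwa index "$out_fasta"
    else
        echo "bwa not found; run 'bwa index $out_fasta' before use" >&2
    fi
}

# Print the contig names in a FASTA file (first word of each header),
# to check that the name given to --viral_contig_name is present.
list_contigs() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}
```

For example, `build_composite_ref human.fa nCoV-2019.reference.fasta composite_human_virus_reference.fasta` followed by `list_contigs composite_human_virus_reference.fasta` should list the human contigs and then the viral contig (here `human.fa` is a placeholder file name).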
An up-to-date version of Nextflow is required because the pipeline is written in DSL2. Following the instructions at https://www.nextflow.io/ to download and install Nextflow should get you a recent-enough version.
The repo contains `environment.yml` files which automatically build the correct conda env if `-profile conda` is specified in the command. Although you'll need `conda` installed, this is probably the easiest way to run this pipeline.
Common configuration options are set in `conf/base.config`. Workflow-specific configuration options are set in `conf/nanopore.config` and `conf/illumina.config`. They are described and set to sensible defaults (as suggested in the nCoV-2019 novel coronavirus bioinformatics protocol).
Use `--illumina` to run the Illumina workflow. Use `--directory` to point to an Illumina output directory, usually named something like `<date>_<machine_id>_<run_no>_<some_zeros>_<flowcell>`. The workflow will recursively grab all fastq files under this directory, so be sure that what you want is in there, and what you don't, isn't!
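
Because the search is recursive, it can help to preview which files sit under the run directory before launching. The sketch below is only an approximation: `list_fastqs` is a hypothetical helper, and the file name patterns it matches are an assumption, so the pipeline's own matching rules may differ.

```shell
# Sketch: list FASTQ files found recursively under a run directory.
# The name patterns are assumed; they are not taken from the pipeline.
list_fastqs() {
    run_dir=$1
    find "$run_dir" -type f \
        \( -name '*.fastq' -o -name '*.fastq.gz' \
           -o -name '*.fq' -o -name '*.fq.gz' \) | sort
}
```

Running `list_fastqs /path/to/reads` prints one path per line, which makes stray files from other runs easy to spot.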
Important config options are:
Option | Description |
---|---|
allowNoprimer | Allow reads that don't have primer sequence? Ligation prep = false, Nextera = true |
illuminaKeepLen | Length of Illumina reads to keep after primer trimming |
illuminaQualThreshold | Sliding window quality threshold for keeping reads after primer trimming (Illumina) |
mpileupDepth | Mpileup depth for ivar |
varFreqThreshold | Frequency threshold for variants |
varMinDepth | Minimum coverage depth to call variants |
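
One way to change these options without editing the files under `conf/` is a small override file. The sketch below assumes the options live in Nextflow's `params` scope. The helper name `write_overrides`, the file name, and every value shown are placeholders, not recommended settings.

```shell
# Sketch: write an override config file for a few of the options above.
# Assumes the options are defined in the "params" scope; the values are
# placeholders chosen only for illustration.
write_overrides() {
    out_file=$1
    cat > "$out_file" <<'EOF'
params {
    allowNoprimer = true
    illuminaKeepLen = 50
    varMinDepth = 10
}
EOF
}
```

After `write_overrides overrides.config`, the file can be supplied through Nextflow's `-c` option, for example by adding `-c overrides.config` to the `nextflow run` command alongside the other arguments.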
A subdirectory for each process in the workflow is created in `--outdir`.