ViCLaRA is a bioinformatics analysis pipeline for classification and reference guided assembly of segmented viruses from metagenomics reads obtained on Illumina platform.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible.
ViCLaRA runs on UNIX/LINUX systems. You will install Miniconda3 from here. Once Miniconda3 has been installed, proceed with pipeline installation
-
Install
nextflow
-
Install any of
Docker
,Singularity
,Podman
,Shifter
orCharliecloud
for full pipeline reproducibility (please only useConda
as a last resort; see docs) -
Download the pipeline
git clone https://github.com/ajodeh-juma/viclara.git
cd rvfvMETA
conda env create -n viclara-dev1.0 -f environment.yml
conda activate viclara-dev1.0
-
Optional: Test the installation on a minimal dataset bundled in the installation
nextflow run main.nf -profile test
For minimal pipeline options, use the --help
flag e.g.
nextflow run main.nf --help
To see all the options, use the --show_hidden_params
flag e.g.
nextflow run main.nf --help --show_hidden_params
A typical command for S segment is
nextflow run main.nf \
--input 'data/test/*_R{1,2}.fastq.gz' \
--segment S \
--outdir output-dir \
-work-dir work-dir \
If you intend to identify off-target reads in your samples, the pipeline uses Kraken2
to classify the reads.
One can create a standard Kraken2 database (which is quite large and requires ample memory when creating ) or a
custom database with organisms of interest. Once created, the database can be given as an option while running the pipeline
Typically you can create a database using the command:
# install the taxonony
DBNAME="/Users/jjuma/databases/kraken2/viral/"
kraken2-build --download-taxonomy --db $DBNAME
# install a reference library. In this case, `viral` which has all RefSeq complete viral genomes/proteins
kraken2-build --download-library viral --db $DBNAME
# build the database
kraken2-build --build --db $DBNAME
You can then run the pipeline as
nextflow run main.nf \
--input 'data/test/*_R{1,2}.fastq.gz' \
--segment S \
--database /Users/jjuma/databases/kraken2/viral/ \
--outdir output-dir \
-work-dir work-dir \
The pipeline offers several parameters including as highlighted:
Input/output options
--input [string] Input FastQ files.
--single_end [boolean] Specifies that the input is single-end reads.
--metadata [string] Input comma-separated values (csv) metadata file containing the columns 'sample_name' and 'Ct.
--outdir [string] The output directory where the results will be saved. [default: ./results]
--email [string] Email address for completion summary.
genome options
--segment [string] genomic segment of the virus. options are 'S', 'M' and 'L'
--filter_phix [boolean] Filter PhiX sequences
--host_fasta [string] Path to the host FASTA genome file. Not required if either --host_bwa_index or --host_bowtie2_index is
specified
--host_bwa_index [string] Path to host genome directory or tar.gz archive for pre-built BWA index.
--host_bowtie2_index [string] Path to host genome directory or tar.gz archive for pre-built BOWTIE2 index.
Read trimming options
--trimmer [string] Specifies the alignment algorithm to use - available options are 'fastp', 'trimmomatic'. [default: fastp]
--adapters [string] Path to FASTA adapters file
--illumina_clip [string] Instructs Trimmomatic apply a maximum mismatch count, how accurate the match between the two adapter ligated reads
reads must be for PE palindrome, and how accurate the match between any adapter etc. sequence must be against a
read [default: 2:30:10]
--leading [integer] Instructs Trimmomatic to cut bases off the start of a read, if below a threshold quality [default: 3]
--trailing [integer] Instructs Trimmomatic to cut bases off the end of a read, if below a threshold quality [default: 3]
--average_quality [integer] Instructs Trimmomatic or Fastp the average quality required in the sliding window [default: 20]
--min_length [integer] Instructs Trimmomatic or Fastp to drop the read if it is below a specified length [default: 50]
--window_size [integer] Instructs Trimmomatic to apply a sliding window (number of bases to average across) [default: 4]
--window_quality [integer] Instructs Trimmomatic to apply the average quality required in the sliding window [default: 20]
--qualified_quality_phred [integer] Instructs Fastp to apply the --qualified_quality_phred option [default: 30]
--unqualified_percent_limit [integer] Instructs Fastp to apply the --unqualified_percent_limit option [default: 10]
--skip_trimming [boolean] Skip the adapter trimming step.
--save_trimmed [boolean] Save the trimmed FastQ files in the results directory.
--save_trimmed_fail [boolean] Save failed trimmed reads.
Alignment options
--aligner [string] Specifies the alignment algorithm to use - available options are 'bwa', 'bowtie2'. [default: bwa]
--seq_center [string] Sequencing center information to be added to read group of BAM files.
--save_align_intermeds [boolean] Save the intermediate BAM files from the alignment step.
--min_mapped [integer] Minimum number of mapped reads to be used as threshold to drop low mapped samples [default: 200]
Variant calling options
--variant_caller [string] Specifies the variant calling algorithm to use - available options are 'bcftools', 'varscan'. [default:
bcftools]
--mpileup_depth [integer] SAMtools mpileup max per-file depth, avoids excessive memory usage
--min_base_quality [integer] Skip bases with baseQ/BAQ smaller than this value when performing variant calling
--min_coverage [integer] Skip positions with an overall read depth smaller than this value when performing variant calling
--min_allele_freq [number] Minimum allele frequency threshold for calling variant [default: 0.25]
--max_allele_freq [number] Maximum allele frequency threshold for calling variant [default: 0.75]
--varscan_strand_filter [boolean] Ignore variants with >90% support on one strand [default: true]
--save_mplieup [boolean] Save SAMtools mpileup output file
Process skipping options
--skip_multiqc [boolean] Skip MultiQC.
--skip_qc [boolean] Skip all QC steps except for MultiQC.
--skip_trimming [boolean] Skip quality trimming step.
--skip_kraken2 [boolean] Skip classification of reads using Kraken2.
--skip_markduplicates [boolean] Skip picard MarkDuplicates step.
--skip_alignment [boolean] Skip all of the alignment-based processes within the pipeline.
--skip_assembly [boolean] Skip De novo assembly step.
Reads classification options
--database [string] Path to the kraken2 database name
All the output results will be written to the results
directory if no --outdir
is not used. Masked and non-masked consensus genomes will be located in bcftools/consensus
See usage docs for all of the available options when running the pipeline.
By default, the pipeline currently performs the following:
- Sequencing quality control (
FastQC
) - Reads classification (
Kraken2
) - Quality control and preprocessing (
fastp
) or (trimmomatic
) - Reads alignment/mapping (
BWA
) or (Bowtie2
) - Alignment summary (
SAMtools
) - Call variants (
BCFtools
) or (Varscan2
) - Annotate variants (
SnpEff
) or (SnpSift
) - Genome coverage (
BEDTools
) - Visualization (
R
), (ggplot2
) - De novo assembly (
MetaSPAdes
) or (Trinity
) or (Tadpole
) - Overall pipeline run summaries (
MultiQC
)
Generating whole genome sequences of segmented viruses has largely depended on sequencing of partial gene sequences of the viruses. Here we implement a pipeline that can be adopted to other segmented viruses in order to assemble complete genomic sequences from RNA metagenomic sequencing. We implement this pipeline to generate complete genome sequences of Rift Valley fever virus, a tripartite virus having 3 segments - Small (S), Medium (M) and Large (L).
The pipeline comes bundled with reference genome and annotation, and the user only has to specify the segment to obtain full genome sequences. The pipeline calls variants
using 2 commonly used tools: bcftools
or varscan
and annotates the variants using SnpEff
and SnpSift
ViCLaRA was originally written by @ajodeh-juma with inspiration from the @nf-core team, particularly on viralrecon
We thank the following people for their extensive assistance in the development of this pipeline:
ViCLaRA is free software, licensed under GPLv3.
Please report any issues to the issues page.
If you wish to fix a bug or add new features to the software we welcome Pull Requests. We use GitHub Flow style development. Please fork the repo, make the change, then submit a Pull Request against out master branch, with details about what the change is and what it fixes/adds. We will then review your changes and merge them, or provide feedback on enhancements.
You can cite the nf-core
publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
In addition, references of tools and data used in this pipeline are as follows:
Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics (Oxford, England), 30(15), 2114–2120. https://doi.org/10.1093/bioinformatics/btu170
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357-359. Published 2012 Mar 4. doi:10.1038/nmeth.1923
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16)> 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27(21):2987-2993. doi:10.1093/bioinformatics/btr509
Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012 Mar;22(3):568-76. doi: 10.1101/gr.129684.111. Epub 2012 Feb 2. PMID: 22300766; PMCID: PMC3290792.
R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing,. https://www.R-project.org/
Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1), 257. https://doi.org/10.1186/s13059-019-1891-0
Cingolani, P., Platts, A., Wang, l., Coon, M., Nguyen, T., Wang, L., Land, S. J., Lu, X., & Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2), 80–92. https://doi.org/10.4161/fly.19695