
A pipeline based on EMU, a taxonomic profiler optimized for long 16S rRNA reads.


genomic-medicine-sweden/gms_16S


gms_16S


Introduction

gms_16S is a bioinformatics analysis pipeline built around the EMU tool.

This Nextflow pipeline uses FastQC, NanoPlot, MultiQC, Porechop_ABI, Longfilt, EMU, and Krona. FastQC and NanoPlot perform quality control, Porechop_ABI trims adapters (optional), Longfilt filters the fastq files so that only reads close to 1500 bp are kept (optional), EMU assigns taxonomic classifications to the 16S rRNA reads, and Krona visualises the result table from EMU. Because it is built with Nextflow, the pipeline is portable and reproducible across different computational infrastructures. It has been tested on Linux and on a Mac with an M1 chip (not recommended; it is quite slow). The pipeline enables microbial community analysis and offers insight into the diversity of the samples.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.
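
To make the optional length filter concrete, here is a minimal sketch of what keeping only reads within a length window means for a fastq file. It is not Longfilt and not the pipeline's implementation: the function name filter_fastq_by_length is hypothetical, the bounds are assumed to be inclusive, and the input is assumed to be plain four-line fastq records on standard input.

```shell
# Hypothetical illustration (not Longfilt): keep only the fastq records whose
# sequence length lies between min_len and max_len, both inclusive.
# Assumes four-line fastq records (header, sequence, "+", qualities) on stdin.
filter_fastq_by_length() {
  min_len="$1"
  max_len="$2"
  awk -v min="$min_len" -v max="$max_len" '
    NR % 4 == 1 { header = $0 }   # line 1 of a record: read header
    NR % 4 == 2 { seq = $0 }      # line 2: the sequence itself
    NR % 4 == 3 { plus = $0 }     # line 3: separator line
    NR % 4 == 0 {                 # line 4: qualities, record is complete
      n = length(seq)
      if (n >= min && n <= max) {
        print header "\n" seq "\n" plus "\n" $0
      }
    }'
}

# Example (file names are placeholders):
#   gzip -dc reads.fastq.gz | filter_fastq_by_length 1200 1800 > filtered.fastq
```

With the window 1200 to 1800 used in the Quick Start commands, full-length 16S reads of roughly 1500 bp pass, while much shorter or longer reads are dropped.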

Pipeline summary

Pipeline overview image

Roadmap/workflow. Only the Nanopore flow is available. Minor testing has been done for PacBio, and it seems to work. Short reads are not supported yet. MultiQC collects information only from FastQC, plus software versions and pipeline information.

Krona plot


Quick Start

  1. Install Nextflow (>=22.10.1)
  2. Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility. You can use Conda both to install Nextflow itself and to manage software within pipelines, but please use it within pipelines only as a last resort (see docs).
  3. Add your samples to an input file, e.g. sample_sheet.csv. See examples.
  4. gunzip all gzipped files in the database directory (assets/databases/emu_database)
  5. gunzip all gzipped files in the krona/taxonomy directory (assets/databases/krona/taxonomy)
  6. Run your command:
nextflow run main.nf \
  --input sample_sheet.csv \
  --outdir [absolute path]/gms_16S/results \
  --db /[absolute path]/gms_16S/assets/databases/emu_database \
  --seqtype map-ont \
   -profile singularity,test \
  --quality_filtering \
  --longread_qc_qualityfilter_minlength 1200 \
  --longread_qc_qualityfilter_maxlength 1800
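
Steps 4 and 5 can be sketched as a small shell helper. The function name gunzip_dir is hypothetical, and searching subdirectories as well is an assumption; the two directories are the ones named in the steps, relative to the repository root.

```shell
# Hypothetical helper (not part of the pipeline): decompress every .gz file
# found under the given directory, including subdirectories.
# gunzip replaces each .gz file with its decompressed version.
gunzip_dir() {
  find "$1" -type f -name '*.gz' -exec gunzip {} +
}

# Run from the repository root; a directory that does not exist is skipped.
for dir in assets/databases/emu_database assets/databases/krona/taxonomy; do
  if [ -d "$dir" ]; then
    gunzip_dir "$dir"
  fi
done
```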

Runs with Nanopore barcode directories

You can run with or without a sample sheet. If no sample sheet is used, the results are named according to the barcode. If a sample sheet is used, the results are named after what is in the second column of the sample sheet. Note that the --input flag is not needed when --merge_fastq_pass is defined.

Run without barcode sample sheet:

nextflow run main.nf \
  --outdir [absolute path]/gms_16S/results \
  --db /[absolute path]/gms_16S/assets/databases/emu_database \
  --seqtype map-ont \
   -profile singularity,test \
  --quality_filtering \
  --longread_qc_qualityfilter_minlength 1200 \
  --longread_qc_qualityfilter_maxlength 1800 \
  --merge_fastq_pass /[absolute path]/gms_16S/fastq_pass/

Run with barcode sample sheet:

nextflow run main.nf \
  --outdir /[absolute path to]/gms_16S/results \
  --db /[absolute path to database]/gms_16S/assets/databases/emu_database \
  --seqtype map-ont \
   -profile singularity,test \
  --quality_filtering \
  --longread_qc_qualityfilter_minlength 1200 \
  --longread_qc_qualityfilter_maxlength 1800 \
  --merge_fastq_pass /[absolute path to fastq_pass]/fastq_pass/ \
  --barcodes_samplesheet /[absolute path to barcode sample sheet]/sample_sheet_merge.csv

Sample sheets

There are two types of sample sheets that can be used: 1) If the fastq files are already concatenated/merged, i.e. the fastq files in the Nanopore barcode directories have already been concatenated, use --input. --input expects a .csv sample sheet with 3 columns (note the header names). It looks like this (see also the examples directory):

sample,fastq_1,fastq_2
SAMPLE_1,/absolute_path/gms_16S/assets/test_assets/medium_Mock_dil_1_2_BC1.fastq.gz,
SAMPLE_2,/absolute_path/gms_16S/assets/test_assets/medium_Mock_dil_1_2_BC3.fastq.gz,
  2) If the fastq files are still separated in their respective barcode folders, i.e. you have several fastq files for each sample and they are organized in barcode directories in a fastq_pass directory:
     a) If you do not want to create a sample sheet for the barcodes, use only the --merge_fastq_pass flag. The results will then be named according to the barcode folders.
     b) If you want your own sample names on the results, use --merge_fastq_pass in combination with --barcodes_samplesheet. This requires a barcode sample sheet, which is tab-separated. See the example file sample_sheet_merge.csv in examples for a demonstration.
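
For the first type of sample sheet (the one passed to --input), the file can be written by hand or generated. The sketch below is one possible way to generate it, not something the pipeline provides: the function name make_sample_sheet is hypothetical, and it assumes that all merged files sit in one directory, end in .fastq.gz, and that the file name without that suffix is an acceptable sample name. The fastq_2 column is left empty, as in the example above.

```shell
# Hypothetical helper (not part of the pipeline): print a three-column
# sample sheet for --input, one row per *.fastq.gz file in a directory.
make_sample_sheet() {
  # Resolve the directory to an absolute path, as used in the example sheet.
  abs_dir=$(cd "$1" && pwd)
  echo "sample,fastq_1,fastq_2"
  for f in "$abs_dir"/*.fastq.gz; do
    [ -e "$f" ] || continue               # no matching files: skip the literal glob
    sample=$(basename "$f" .fastq.gz)     # assumption: file name = sample name
    printf '%s,%s,\n' "$sample" "$f"      # fastq_2 stays empty
  done
}

# Example (the directory is a placeholder):
#   make_sample_sheet /path/to/merged_fastq > sample_sheet.csv
```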

Useful env variables

NXF_WORK = working directory. # If the work is spread out on different nodes,
                              # set this to a shared place.
                              # export NXF_WORK=/path/to/your/working/dir
APPTAINER_TMPDIR
NXF_SINGULARITY_CACHEDIR
APPTAINER_CACHEDIR
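
As a sketch of how these variables might be set together before a run, the helper below places all four under one base directory. The function name set_gms16s_dirs and the subdirectory names are hypothetical; only the variable names come from the list above.

```shell
# Hypothetical helper: point Nextflow and Apptainer/Singularity at
# directories below one shared base directory and create them.
set_gms16s_dirs() {
  base="$1"
  export NXF_WORK="$base/work"                         # Nextflow working directory
  export NXF_SINGULARITY_CACHEDIR="$base/singularity"  # where Nextflow keeps pulled images
  export APPTAINER_CACHEDIR="$base/apptainer_cache"    # Apptainer cache directory
  export APPTAINER_TMPDIR="$base/apptainer_tmp"        # Apptainer temporary build directory
  mkdir -p "$NXF_WORK" "$NXF_SINGULARITY_CACHEDIR" \
           "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"
}

# Example (the path is a placeholder; use a location visible to all nodes):
#   set_gms16s_dirs /path/to/shared/dir
```

Call the function in the same shell session that launches nextflow run, so the exported variables are visible to the pipeline.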

Credits

gms_16S was originally written by @fwa93.

This pipeline is not a formal nf-core pipeline but it partly uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

In addition, references for the tools and data used in this pipeline are as follows:

Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 2839>

Pipeline tools

  • FastQC

  • MultiQC

    Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354.>

Software packaging/containerisation tools

  • Anaconda

    Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

  • Bioconda

    Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018>

  • BioContainers

    da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-R>

  • Docker

  • Singularity

    Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; >

  • EMU

    Kristen D. Curry et al., “Emu: Species-Level Microbial Community Profiling of Full-Length 16S RRNA Oxford Nanopore Sequencing Data,” Nature Methods, June 30, 2022, 1–9, https://doi.org/10.1038/s41592-022-015>

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.