This repository was archived on 2024-04-18 as it was no longer in use or being developed
GMSMetaPost takes the output files of three classifiers from taxprofiler (kraken2
, centrifuge
, kaiju
).
- Calculates reads per million (RPM) for each classifier.
- Concatenates the different tsv files from each classifier into one.
- Filters RPM based on a threshold given by the user.
- Validate and remove missing taxonomic IDs.
- Download potential reference genomes from a local BlAST nt database (See below on how to install the local blast db).
- Coverage plots for potential significant findings.
For running the pipeline two metadata files are required: samples.csv
and samplesheet.csv
. Here are examples of how they may look like:
- samples.csv
sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta
SRR12875558_se-SRR12875558,SRR12875558,OXFORD_NANOPORE,assets/input/se/SRR12875558_1.fastq.gz,,
SRR12875570_pe-SRR12875570,SRR12875570,ILLUMINA,assets/input/pe/SRR12875570_1.fastq.gz,assets/input/pe/SRR12875570_2.fastq.gz,
- samplesheet.csv
sample,kraken2,centrifuge,kaiju
SRR12875570_pe-SRR12875570,assets/tsvs/SRR12875570_pe-SRR12875570-kraken2.kraken2.report.tsv,assets/tsvs/SRR12875570_pe-SRR12875570-centrifuge.report.tsv,assets/tsvs/SRR12875570_pe-SRR12875570-kaiju.tsv
SRR12875558_se-SRR12875558,assets/tsvs/SRR12875558_se-SRR12875558-kraken2.kraken2.report.tsv,assets/tsvs/SRR12875558_se-SRR12875558-centrifuge.report.tsv,assets/tsvs/SRR12875558_se-SRR12875558-kaiju.tsv
The path to samples.csv
must be defined in fastq_data
parameter in nextflow.config
and path to samplesheet.csv
must be defined in input
parameter.
At the current stage of this pipeline additional metadata files containing joined taxonomic profiling files are required. They should be located in the root of the directory defined by assets
parameter in nextflow.config
file. Their names (without the .tsv
suffix) must be identical to the sample names given in the two previously mentioned metadata files. E.g. if assets
parameter is assets
and one of the samples in samples.csv
or samplesheet.csv
is named SRR12875558_se-SRR12875558
there must be a joined taxonomic profiling file named assets/SRR12875558_se-SRR12875558.tsv
These temporary metadata files can look for example the following:
taxon_name rpm taxid taxonomic_rank classifier centrifuge_genome_size centrifuge_num_reads centrifuge_abundance kaiju_percent kraken2_percentage_fragments_covered kraken2_num_fragments_covered reads_count cami_taxid cami_rank cami_taxpath cami_taxpathsn cami_percentage
Ungulate tetraparvovirus 3 70062.2 1511916 species centrifuge 5444 1389.0 0.0 1542
Parvovirus YX-2010/CHN 45663.1 754189 leaf centrifuge 5444 851.0 0.0 1005
GMSMetaPost requires a local version of BLAST nt database. The easiest way to aqcuire it is by downloading with blast v.2.13.0
utility update_blastdb.pl
:
update_blastdb.pl \
--decompress \
nt
In addition the latest taxonomy database is required as well. It should be downloaded into the same directory:
wget --continue https://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz
rm taxdb.btd taxdb.bti
tar xvf taxdb.tar.gz
rm -f taxdb.tar.gz
It is also recommended to update regularly the local BLAST nt database with command:
update_blastdb.pl \
nt
Finally, the user should define the path to the database by either defining blast_db
parameter in nextflow.config
or on the command line.
GMSMetaPost can be run with command:
nextflow run main.nf \
--input assets/samplesheet.csv \
--outdir test \
-profile conda
conda
profile can also be replaced by singularity
or docker
.
-
If you get the following error message:
no space left on device
, make sure that that you are using:singularity.cacheDir = "/path/to/cache"
-
If you get the following error message for
singularity
:FATAL: While getting image info: error decoding image: invalid ObjectId in JSON
You are likely using an older version of singularity. You can update it with the command:
conda install -c conda-forge singularity
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x. In addition, references of tools and data used in this pipeline are as follows: