
Getting Started

Overview

After completing the installation and offline setup, users can specify and run their own samples through the metagenomics workflows. This wiki page details the sample specification steps necessary to run the v1.3 workflows.

Quick Start

First-time users are encouraged to read through all wiki pages and practice analyzing the provided example dataset before processing their own samples, but more advanced users can skip ahead and process their own samples through the following steps:

  • The workflows in v1.3 require an updated version of Singularity, as well as additional container images needed to execute the new and updated tools in v1.3. Please see the Install and Offline Setup instructions to install the new v1.3 source code and update previous versions of the software.
  • Check to make sure all necessary container images (ending in .sif) are present in the metagenomics/container_images directory
  • Check to make sure the adapters_combined_256_unique.fasta Trimmomatic adapter file and all of the needed taxonomic and functional databases are in the metagenomics/workflows/data directory
  • Move input files to the metagenomics/workflows/data directory
  • Set up the default or a custom config file as needed (i.e., change the file names and parameters throughout the config to process the samples). Note that custom config file options have been updated between v1.2 and v1.3. Please see my_custom_config.json or illumina_custom_config.json for examples of how to construct a custom config to run with v1.3.
  • Save the updated config file in the metagenomics/workflows/config directory
  • Run screen or something similar, since these workflows can take a long time to complete
  • Activate the metag environment
  • Navigate to the metagenomics/workflows directory
  • Set the Singularity bind path
  • Run a command to execute the snakemake rules

The following commands will run screen, activate the metag environment, navigate to the metagenomics/workflows directory, and set the Singularity bind path:

screen
conda activate metag 
cd metagenomics/workflows 
export SINGULARITY_BINDPATH="data:/tmp"
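If the terminal connection drops while a workflow is running, the screen session keeps running in the background, and it can be reattached with:

screen -r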

The following command runs a set of popular terminal snakemake rules with a custom config file and uses all available cores. Run this command if a custom config file was created and saved under a unique name in the metagenomics/workflows/config/ directory, all of these rules are included in that custom config, and the required download_offline.py flag options have been run for each workflow:

snakemake --cores --use-singularity --configfile=config/my_custom_config.json read_filtering_multiqc_workflow read_filtering_khmer_count_unique_kmers_workflow assembly_multiqc_workflow assembly_metaquast_workflow comparison_output_heatmap_plots_all_workflow tax_class_gather_workflow tax_class_visualize_krona_kaijureport_workflow tax_class_kraken2_workflow tax_class_bracken_workflow tax_class_krakenuniq_workflow functional_with_srst2_workflow functional_prokka_with_megahit_workflow functional_prokka_with_metaspades_workflow functional_abricate_with_megahit_workflow functional_abricate_with_metaspades_workflow

Note that --configfile=config/my_custom_config.json is used to specify the name of the custom config in the above command. Users should replace my_custom_config.json with the name of their custom config file.

The following command will run popular terminal snakemake rules with the metagenomics/workflows/config/default_workflowconfig.settings default config file (i.e., run this command if the default config was directly edited with the sample names and parameters):

snakemake --use-singularity read_filtering_multiqc_workflow read_filtering_khmer_count_unique_kmers_workflow assembly_multiqc_workflow assembly_metaquast_workflow comparison_output_heatmap_plots_all_workflow tax_class_gather_workflow tax_class_visualize_krona_kaijureport_workflow tax_class_kraken2_workflow tax_class_bracken_workflow tax_class_krakenuniq_workflow functional_with_srst2_workflow functional_prokka_with_megahit_workflow functional_prokka_with_metaspades_workflow functional_abricate_with_megahit_workflow functional_abricate_with_metaspades_workflow

Configuration Files

Users only need to specify their sample name(s) in one location within the configuration (config) file, and the other sample information is specified in default_workflowparams.settings. Note that these settings files do not need to be edited if a custom config is used.

Users may directly use and/or edit the default config file, or override the default settings by using their own custom config file. Example custom config files are provided for the standard Illumina FASTQ naming convention as well as for the abbreviated default example naming convention.

Default Settings Files

The metagenomics workflows and individual tools are executed through snakemake according to specifications in the following default settings files, which are located in the metagenomics/workflows/config/ directory:

  1. metagenomics/workflows/config/default_workflowparams.settings - This file specifies the container versions to use (corresponding to container images downloaded during the offline setup), the parameters to run with each container during workflow execution, parameters to build snakemake rules in each workflow, and file naming patterns for snakemake execution. The params file indicates what specific naming conventions snakemake should use to search for inputs that will successfully build and execute each workflow. Current parameters of interest include input file types, naming patterns, number of threads, and databases to use. For more information on this, please see the Workflow Architecture wiki page.

  2. metagenomics/workflows/config/default_workflowconfig.settings - This file specifies what samples to run and the configuration of how the workflows should be executed (a.k.a. the "config file"). The config file indicates what specific features snakemake should search for in file names. For example, the config file can tell snakemake to find samples with a specific name {sample} that were read filtered with a quality threshold {qual} of 30 to take through the taxonomic classification workflow, and to take samples with that same name {sample} that were read filtered with a quality threshold {qual} of 2 through the assembly workflow. Custom config files will override specified default settings in the default_workflowconfig.settings file. Alternatively, sample names and settings can be changed by directly editing the default_workflowconfig.settings default config file. A custom config file only needs to contain the information necessary to override specific default settings (e.g., sample name, quality trimming thresholds), whereas the default config contains a comprehensive list of all default settings. Custom config files may be helpful for organizational purposes because they can be named according to the specific samples and/or settings run, whereas directly editing the default config file might be beneficial for users who are initially learning how the workflows operate and prefer not to make a custom config file. A concrete illustration of the wildcards follows this list.
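As a concrete illustration of the {sample} and {qual} wildcards, a config that requests quality thresholds of 2 and 30 for the sample SRR606249_subset10 will lead snakemake to look for (or produce) trimmed read files such as SRR606249_subset10_1_reads_trim2_1.fq.gz and SRR606249_subset10_1_reads_trim30_1.fq.gz when building the downstream workflows.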

Default Config Files

The current default_workflowconfig.settings and default_workflowparams.settings files are set up to run paired-end FASTQ reads with the default naming convention of {SAMPLE}_1_reads.fq.gz and {SAMPLE}_2_reads.fq.gz. The example dataset distributed with these workflows is named SRR606249_subset10_1_reads.fq.gz and SRR606249_subset10_2_reads.fq.gz for this reason.

For snakemake to identify the intended input files, it looks for sample name(s) specified in the default_workflowconfig.settings file (i.e., SRR606249_subset10). The following example shows how the analysis of multiple paired-end FASTQ files can be executed simultaneously (starting at line #5 in default_workflowconfig.settings):

    "workflows" : {

        "samples_input_workflow" : {
            # Global parameter for the sample input
            "samples"    : ["SRR606249_subset10", "second_sample_here"]
        },

The rest of the sample name features are specified under the read_filtering workflow section, which starts at line #11 in the default_workflowparams.settings file.

For snakemake to find the forward reads of a sample, the remaining text features from the forward read file name (i.e. _1_reads.fq.gz) are specified as the pre_trimming_glob_pattern (line #18 in the default_workflowparams.settings file):

            "pre_trimming_glob_pattern"  : "*_1_reads.fq.gz",

For snakemake to identify a forward read's reverse paired-end mate, the following parameters are specified under post_trimming_pattern (lines #26-27):

            "reverse_pe_pattern_search"  : "1_",
            "reverse_pe_pattern_replace" : "2_",

Lastly, the correct file extension type is indicated under quality_trimming (line #40 in the default_workflowparams.settings file):

            "sample_file_ext" : ".fq.gz"

Custom Config Files

We recommend that users create a custom config .json file to process their samples. Custom config files can be named according to the sample or analysis being performed, which can help with organization and with recording the specific methods used to generate results. Example custom config files are available in the metagenomics/workflows/config directory, including my_custom_config.json and illumina_custom_config.json. Users can copy and edit illumina_custom_config.json if their samples follow the standard Illumina FASTQ naming convention, or my_custom_config.json if their FASTQ naming conventions follow the metagenomics workflow defaults. At a minimum, users should edit the provided custom config files with the name(s) of their sample(s) to be analyzed. Users may also want to edit parameters or databases in the custom config file.

Multiple parameters are included in the example custom config files, and users may want to reduce the number of parameters executed by deleting the unwanted ones. For example, read filtering is set to run at quality thresholds of both "2" and "30" in the example custom configs, but "2" can be deleted if users only want to process samples with an aggressive quality filtering threshold of "30".
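As a rough sketch, a pared-down custom config that overrides only the sample names and the quality threshold might look like the following (the structure mirrors the snippets shown on this page, but the key holding the quality thresholds is a hypothetical placeholder, so copy the real key name from my_custom_config.json):

    {
        "workflows" : {
            "samples_input_workflow" : {
                "samples" : ["Your_First_Sample", "Your_Second_Sample"]
            }
        },
        "read_filtering" : {
            "quality_trimming" : {
                # hypothetical key name -- copy the real one from my_custom_config.json
                "quality_thresholds" : ["30"]
            }
        }
    }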

Once a custom config file is prepared and saved with a unique name, it can be specified in an execution command with the flag --configfile followed by the location of the custom configuration file and the rules to execute. Remember to run the command from inside the metagenomics/workflows directory, activate the metag environment, and set the Singularity bind path prior to executing snakemake.

screen
conda activate metag 
cd metagenomics/workflows 
export SINGULARITY_BINDPATH="data:/tmp"
snakemake --use-singularity --configfile=<location_of_your_file> <rule(s) to execute>

The following command will run the read_filtering_pretrim_workflow rule with the my_custom_config.json custom config file:

snakemake --use-singularity --configfile=config/my_custom_config.json read_filtering_pretrim_workflow 

Illumina File Naming Convention

The v1.3 workflows have been configured with an option to process paired-end Illumina FASTQ files with standard Illumina naming conventions. In the following description of the standard Illumina FASTQ naming convention, * indicates any numeric value:

Forward reads: {Sample}_S*_L*_R1_001.fastq.gz 
Reverse reads: {Sample}_S*_L*_R2_001.fastq.gz 
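For example, a hypothetical sample named MySample from sample sheet position 1 and lane 1 would arrive as the following pair of files, and its entry in the "samples" list of the config file would simply be "MySample":

MySample_S1_L001_R1_001.fastq.gz
MySample_S1_L001_R2_001.fastq.gz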

To successfully process these Illumina filenames, the illumina_custom_config.json custom config file has been provided in the metagenomics/workflows/config directory. It is configured for snakemake to recognize and execute files with the standard Illumina naming convention, as described above. As with other data types, users will need to specify the name(s) of one or more sample(s) in their config file (starting at line #13 of illumina_custom_config.json):

    "workflows" : {

        "samples_input_workflow" : {
            "samples"    : ["First_Sample", "Second_Sample", "Third_Sample"]
        },

The rest of the Illumina name features are described under the read_filtering workflow section in default_workflowparams.settings, which does not need to be edited by users. For explanatory purposes, that process is described below.

For snakemake to find the forward reads of a sample, the remaining text features from the forward read filename are specified as the pre_trimming_glob_pattern (line #5 in illumina_custom_config.json):

            "pre_trimming_glob_pattern"  : "*_S*_L*_R1_001.fastq.gz",

For snakemake to identify a forward read's paired-end mate, the following parameters are specified under post_trimming_pattern (lines #6 and #7 in illumina_custom_config.json):

            "reverse_pe_pattern_search"  : "R1",
            "reverse_pe_pattern_replace" : "R2",

Finally, the correct file extension type is indicated under quality_trimming (line #10 in illumina_custom_config.json):

            "sample_file_ext" : ".fastq.gz"

Parameters can be edited in the illumina_custom_config.json file as needed. For example, the illumina_custom_config.json file is set up to run quality filtering at thresholds of both "2" and "30", but "2" could be deleted if users only want to process samples with an aggressive quality filtering threshold of "30".

Note: The subsequent wiki pages do not describe the execution of datasets with the standard Illumina naming convention, but we encourage the use of this new capability. As always, feel free to reach out to us through our issues page with any questions in the meantime!

Helpful Tips for Executing Workflows

Terminal Snakemake Rules

Due to the progressive nature of the metagenomics workflows and the inherent capabilities of snakemake, many of the rules and workflows build upon each other. As a result, there are terminal snakemake rules that may be called on the command line to instruct snakemake to automatically identify, build, and execute the intermediate rules needed to accomplish the terminal rule specified in the command. For instance, if the read filtering workflow is ready to run with raw paired-end reads (e.g., SRR606249_subset10_1_reads.fq.gz and SRR606249_subset10_2_reads.fq.gz), then the read_filtering_pretrim_workflow and read_filtering_posttrim_workflow rules will automatically be run if the read_filtering_multiqc_workflow snakemake rule is executed:

snakemake --use-singularity read_filtering_multiqc_workflow

This also holds true across workflows, given the presence of all dependencies (i.e., containers and/or databases) needed to generate all intermediate files, as well as the presence of all intermediate rules in the config file.

Below is a list of all the terminal rules for each workflow. By executing these terminal rules, data will be processed through all of the intermediate rules without explicitly calling them. We recommend reviewing all of these rules and their intermediate ones, since their applicability may vary depending on the intended scientific goal. There may be times when not all of the rules need to be called, and users may want to end the analysis with the output(s) of an intermediate rule.

The following commands call only the terminal rules needed to execute the entirety of each workflow with metagenomic tools (Note: this does not include the additional tools to evaluate isolate data):

Read Filtering

snakemake --use-singularity read_filtering_multiqc_workflow read_filtering_khmer_split_interleaved_reads_workflow read_filtering_khmer_count_unique_kmers_workflow read_filtering_fastq_to_fasta_workflow

Assembly

snakemake --use-singularity assembly_multiqc_workflow assembly_metaquast_workflow

Comparison

snakemake --use-singularity comparison_output_heatmap_plots_all_workflow

Taxonomic Classification with Reads

snakemake --use-singularity tax_class_gather_workflow tax_class_visualize_krona_kaijureport_workflow tax_class_visualize_krona_species_summary_workflow tax_class_add_taxonnames_workflow tax_class_kraken2_workflow tax_class_bracken_workflow tax_class_krakenuniq_workflow

Note: Depending on the goal, tax_class_visualize_krona_kaijureport_workflow can be replaced with tax_class_visualize_krona_kaijureport_filtered_workflow or tax_class_visualize_krona_kaijureport_filteredclass_workflow, or these rules can be used in tandem with one another.

Taxonomic Classification with Contigs (only available for kaiju and krona analyses)

snakemake --use-singularity tax_class_visualize_krona_kaijureport_contigs_workflow tax_class_visualize_krona_species_summary_contigs_workflow tax_class_add_taxonnames_to_contigs_workflow

Note: Depending on the goal, tax_class_visualize_krona_kaijureport_contigs_workflow can be replaced with tax_class_visualize_krona_kaijureport_filtered_contigs_workflow or tax_class_visualize_krona_kaijureport_filteredclass_contigs_workflow, or these rules can be used in tandem with one another.

Functional Inference

snakemake --use-singularity functional_with_srst2_workflow functional_prokka_with_megahit_workflow functional_prokka_with_metaspades_workflow functional_abricate_with_megahit_workflow functional_abricate_with_metaspades_workflow

Different Workflow Entry Points

The v1.3 metagenomics workflows also provide the capability for users to jump into the workflows with their own files at different entry points. For instance, if a user has their own assembled contigs (.fa) or quality-trimmed paired-end read files generated outside of these workflows, those files can still be processed by the snakemake rules. To do this, users should perform the exact same setup as normal (i.e., download all dependencies for offline execution, activate the metag environment, set the Singularity bind path, and modify the config files). All input files must also be named/renamed to match the file naming conventions and patterns indicated in the config file so that they are recognized by snakemake. For trimmed reads, samples should have the following naming patterns:

Forward reads: {sample}_1_reads_trim{quality_threshold}_1.fq.gz
Reverse reads: {sample}_1_reads_trim{quality_threshold}_2.fq.gz

For assembled contigs, sample(s) should have the following naming pattern:

{sample}_1_reads_trim{quality_threshold}.{assembler}.contigs.fa

For more information on naming conventions at different entry points, see the Workflow Architecture page. It is also important that the parameters in the sample name match those in the config file (e.g., a quality trim value of 30 in both the sample name and the config file).
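For example, a pair of externally trimmed read files could be renamed to match the expected pattern as follows (the input filenames are hypothetical, and the quality threshold of 30 must match the config file):

mv my_sample_R1.trimmed.fastq.gz my_sample_1_reads_trim30_1.fq.gz
mv my_sample_R2.trimmed.fastq.gz my_sample_1_reads_trim30_2.fq.gz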

If the analysis does not begin with raw reads, a user should also create two placeholder files that snakemake can use to build the naming patterns for execution. We recommend doing this with the touch command in the metagenomics/workflows/data directory.

touch {sample}_1_reads.fq.gz
touch {sample}_2_reads.fq.gz

It is important to remember that snakemake builds its workflow progressively, so it knows the order in which it should see the output from each workflow and rule. Because snakemake looks at the timestamp of each file, it will notice if the new placeholder files are more recent than the existing input files, and it will not execute. To resolve this, the timestamps of those files can be updated with the touch command as well:

touch <your_file(s)> 
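For example, with a hypothetical sample named my_sample and externally trimmed reads at a quality threshold of 30, touching the placeholder raw reads first and the trimmed files second leaves the timestamps in the order snakemake expects:

touch my_sample_1_reads.fq.gz my_sample_2_reads.fq.gz
touch my_sample_1_reads_trim30_1.fq.gz my_sample_1_reads_trim30_2.fq.gz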

If file naming patterns are consistent and timestamps are ordered correctly, then rules and workflows can be executed as normal. Snakemake will build and create all the intermediate files needed to run a command. Please see subsequent wiki pages to learn more about the available options and expected outputs for each workflow.

Organizing Final Output Files

Following the completion of all analyses, a final post-processing command has been incorporated to organize all datasets according to their sample name(s). The post_processing_move_samples_dir_workflow snakemake rule will create sub-directories in the data/ directory and move all files associated with each sample into its respective sub-directory.

This can be executed with the following command:

snakemake --use-singularity post_processing_move_samples_dir_workflow

Before executing this command, we recommend that users ensure they are completely finished with all analyses, because a workflow cannot be re-executed unless the data files are moved back into the metagenomics/workflows/data/ directory.
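If a sample does need to be re-analyzed after post processing, its files can be moved back by hand. A minimal sketch, assuming the sub-directory is named after the sample:

mv data/my_sample/* data/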

Aside from the organizational benefits of moving the files for each sample into their own directory, we are also exploring options for generating final reports that could be executed on all files located within a single directory.

A preliminary example of a final report can be found here, and the 0-summary-report.html file can be viewed after download: https://github.com/signaturescience/metagenomics/blob/master/documentation/example-summary-report
