Skip to content

A bacterial RNAseq pipeline, designed for analysis of Treponema pallidum data

License

Notifications You must be signed in to change notification settings

sanger-pathogens/TpRNAseq

Repository files navigation

Tp RNAseq

Nextflow run with docker run with singularity

Introduction

Tp RNAseq is a pipeline for quantifying RNAseq reads that map to features of a bacterial genome using Bowtie2 and HTSeq. It has been adapted for use with Treponema pallidum, but could also in principle be used to analyse data for other bacterial species.

The pipeline can be run on local machines and high-performance computing (HPC) clusters with minimal setup.

Pipeline summary

TpRNAseq_workflow

In its simplest usage, Tp RNAseq takes a sample manifest (CSV; see Generating a manifest), reference (fasta) and annotation (gff) as input. It then runs the following steps on .fastq.gz files provided in the sample manifest:

  1. Data combining: .fastq.gz files are combined based on sample or replicate structure.
  2. Basic QC: Adaptor removal and trimming of poor quality bases is performed by fastp. Quality is assessed with FASTQC pre- and post-fastp.
  3. Mapping: Reads are mapped to the given reference genome using Bowtie2. Bowtie2 index files are created if necessary.
  4. Duplicate removal (optional): Duplicate reads are removed using Picard's MarkDuplicates.
  5. Read filtering: Aligned reads are filtered with a user-defined filter (keeping all properly mapped read pairs by default). Additional filters are applied to generate strand-specific BAM files. A mapping quality (MAPQ score) filter can also be applied.
  6. Read quantification per feature: Performed on the user-defined filtered BAM by htseq-count. The generated count tables are combined across samples to generate a summary count table for convenient downstream analysis.
  7. Coverage analysis: Performed by bedtools and custom scripts on the strand-specific BAM files. Coverage plots will be generated per gene and intergenic region from the given annotation file (gff).

FASTQC, samtools stats and statistics from fastp and picard are collected and displayed in a convenient html report using multiQC.

The pipeline allows users to keep various intermediate files along the way. See Usage and Output for more details.

Requirements

Software

  • A Unix-like operating system environment (e.g. Linux, macOS, Windows with WSL) with Bash 3.2 or later.
  • Java 11 or later (OpenJDK) as a nextflow dependency
  • nextflow as workflow management system
  • Docker, Singularity/Apptainer, or conda for pipeline dependency management

Hardware (recommended)

  • >= 16GB RAM
  • >= 2 CPUs/cores
  • Enough space for images (< 10GB) and intermediate data

Dependency Installation

Getting started

Running on a personal computer

  1. Install the necessary dependencies

  2. Open a terminal.

  3. Run the pipeline:
    For example input, please see Generating a manifest.

    To run the pipeline with docker, use the -profile docker option:

    nextflow run sanger-pathogens/TpRNAseq \
        -profile docker \
        --manifest test_data/manifest.csv \
        --reference test_data/ref.fasta \
        --annotation test_data/ref.gff \
        --library_strandedness reverse
    

    Other profiles are also supported (docker, conda, singularity, apptainer).
    ⚠️ If no profile is specified the pipeline will run with a Sanger HPC-specific configuration.

    See usage for all available pipeline options.

Running on the farm (Sanger HPC clusters)

  1. Load nextflow and singularity modules:

    module load nextflow ISG/singularity
  2. Start the pipeline:
    For example input, please see Generating a manifest.

    Example:

    nextflow run sanger-pathogens/TpRNAseq \
        --manifest test_data/manifest.csv \
        --reference test_data/ref.fasta \
        --annotation test_data/ref.gff \
        --library_strandedness reverse

    It is good practice to submit a dedicated job for the nextflow master process (use the oversubscribed queue):

    bsub -o output.o -e error.e -q oversubscribed -R "select[mem>4000] rusage[mem=4000]" -M4000 \
        nextflow run sanger-pathogens/TpRNAseq \
            --manifest test_data/manifest.csv \
            --reference test_data/ref.fasta \
            --annotation test_data/ref.gff \
            --library_strandedness reverse

    See usage for all available pipeline options.

  3. Once your run has finished, check output in the directory supplied to the --outdir option (default ./results). Remember to clean up any intermediate files. To do this (assuming no other pipelines are running from the current working directory) run:

    rm -rf work .nextflow*

    Alternatively, use nextflow clean, which allows selective removal of these files.

Running on any other HPC cluster

This pipeline should be compatible with most HPC clusters, but will not have the necessary configuration files for those clusters hosted in this repository.

Nevertheless, your insitution may support an nf-core configuration profile. Documentation for the use of these profiles can be found here.

These profiles are supported out-of-the-box by this pipeline. You simply have to run the pipeline with the relevant profile. For example, to run the pipeline on the Cambridge HPC, run with -profile cambridge.

⚠️ If no profile is specified the pipeline will run with a Sanger HPC-specific configuration.

Managing image dependencies and running offline

With some profiles, namely singularity or apptainer, environment variables (NXF_SINGULARITY_CACHEDIR and NXF_APPTAINER_CACHEDIR) should be set to a directory path where a cache of images should be stored. See here for more information on nextflow environment variables and here for more information on how nextflow manages container dependencies. These variables should be set in a config file, e.g. ~/.bashrc, to allow the reuse of images in the cache (and avoid the pipeline downloading them whenever it is run in a new directory).

Should you need to run the pipeline offline, it is best to make use of pre-populated dependency caches. These can be created with any of the supported profiles (e.g. -profile docker or -profile conda) and involves running the pipeline once to completion with all the options that would be required for offline usage.

Additionally, the pipeline makes use of configuration files that are pulled from online sources. To avoid a dependence on these, you must download the configuration files in advance and specify their local paths using --nf_core_custom_config and --generic_config.

nf-core also provides nf-core download tool that allows you to easily download pipelines and singularity images for offline use. To use this, you will have to first install the nf-core tool.

Generating a manifest

Manifests supplied as an argument to --manifest, should be of of the following format:

ID,REP,R1,R2
sample,1,./test_data/inputs/sample_rep1_1.fastq.gz,./test_data/inputs/sample_rep2_2.fastq.gz
sample,2,./test_data/inputs/sample_rep2_1.fastq.gz,./test_data/inputs/sample_rep2_2.fastq.gz

Where column ID can be an arbitrary sample identifier, REP describes the replicate structure of the data, R1 is a .fastq.gz file of forward reads, R2 is the mate .fastq.gz file containing reverse reads. The above example shows entries for two replicates of one sample.

Usage

Pipeline-specific options are outlined in the help message below. Run the pipeline with --help to view this.

There are also a number of nextflow-specific options, which start with a single -. A summary of these options can be found at https://www.nextflow.io/docs/latest/cli.html and viewed on the command line by running nextflow -h. Of these, the most useful are probably:

  • -resume: to run the pipeline again while making use of data that has been generated previously (saving computation time and resources).
  • -r <tag/branch/revision_number>: to run a specific version of this pipeline to ensure reproducible/comparable results.
---------------------------------------------------------------------------------------


88888888888            8888888b.  888b    888        d8888                           
    888                888   Y88b 8888b   888       d88888                           
    888                888    888 88888b  888      d88P888                           
    888  88888b.       888   d88P 888Y88b 888     d88P 888 .d8888b   .d88b.   .d88888
    888  888 "88b      8888888P"  888 Y88b888    d88P  888 88K      d8P  Y8b d88" 888
    888  888  888      888 T88b   888  Y88888   d88P   888 "Y8888b. 88888888 888  888
    888  888 d88P      888  T88b  888   Y8888  d8888888888      X88 Y8b.     Y88b 888
    888  88888P"       888   T88b 888    Y888 d88P     888  88888P'  "Y8888   "Y88888
         888                                                                      888
         888                                                                      888
         888                                                                      888
                              ____          ,--r ~.       ,^"   `"w                   
               ____        x^      'w     |L.._    ^m    A^''w     V                  
              D    "W     [R``'w    '@   ,R   [.    [L  jR    K     K     m" ` `W     
             A0     [H    R    [H    [   R     0     @  R     [     [L   0      R     
            ["[L     0   R     [@     @ R      RH    [ A       K     W  A      A      
           #"  [      K,R     /L[     0R     ,R D    [R       #0     [ #      #       
         y^    ,%     '@    ,R   L    [     z"  [    [H      A [L     R      A       
          `  "   T_    'W ,^     0    [L_,.#     L   [H    zC   0     @     R         
                  ^~-.. <^        %_     ,4`     T,   " ^"x"    !L    0__.gR         
                                    ``" `          "----^"        Y.____x^"           
---------------------------------------------------------------------------------------

Typical pipeline command:

  nextflow run main.nf --manifest <manifest> --annotation <gff> --reference <fasta> --library_strandedness [reverse] --outdir [./results]

Input Options
  --manifest                      [string]  Path to a CSV manifest comprising 4 columns (ID, REP, R1, R2). ID is an arbitrary sample ID, REP describes replicate 
                                            structure, R1 and R2 columns contain paths to *.fastq.gz files. 
  --reference                     [string]  null
  --annotation                    [string]  Path to genome annotation in GFF format.
  --library_strandedness          [string]  Strandedness of the RNAseq library. Options: reverse, forward, none. (accepted: reverse, forward, none)

Output Options
  --outdir                        [string]  The output directory where the results will be saved. [default: ./results]
  --keep_combined_fastqs          [boolean] Keep the combined fastqs.
  --keep_trimmed_fastqs           [boolean] Keep the trimmed fastqs.
  --keep_sorted_bam               [boolean] Keep the sorted bam files generated from mapping.
  --keep_dedup_bam                [boolean] Keep the deduplicated bam files.
  --keep_filtered_bam             [boolean] Keep filtered bam files.

Data Combining Options
  --combine_level                 [string]  Combine fastqs at the specified level. `replicate` will combine fastqs that have the same sample and replicate 
                                            identifiers in the manifest. `sample` will combine fastqs that have the same sample identifiers in the manifest 
                                            (i.e. it will combine replicates of the same sample). `none` will perform no combining (sample identifiers must be 
                                            unique in this case). (accepted: replicate, sample, none) [default: none] 

QC Options
  --skip_trim                     [boolean] Skip trimming.
  --trimmer                       [string]  Software to use for trimming. (accepted: fastp) [default: fastp]
  --fastp_args                    [string]  Options and arguments that will be supplied to fastp to modify QC behaviour.
  --dedup                         [boolean] Remove duplicates (hopefully library prep artifacts) from bam file post-mapping.
  --min_mapping_quality           [integer] Remove alignments with MAPQ score smaller than given value. [default: 2]

Mapping Options
  --bowtie2_args                  [string]  Options and arguments that will be supplied to Bowtie2 to modify mapping behaviour. [default: --local 
                                            --very-sensitive-local --rdg 8,4 --rfg 8,4 --no-mixed] 

Counting Options
  --htseq_args                    [string]  Options and arguments that will be supplied to htseq-count to modify counting behaviour. [default: --type gene 
                                            --idattr locus_tag --nonunique none --secondary-alignments ignore] 
  --samtools_filter_args          [string]  Arguments supplied to samtools to will be used to filter alignments of interest for counting. [default: -f 
                                            2] 
  --annotate_feature_assignment   [boolean] Generate an annotated bam file where each read is assigned to the feature for which is has been counted.

Coverage Options
  --skip_strand_specific_analysis [boolean] Skip strand-specific coverage analysis.
  --pairwise                      [boolean] Generate pairwise sample comparison coverage plots.
  --coverage_window_size          [integer] Size of window or step over which to compute cumulative per base coverage (output as a wig file). [default: 
                                            100] 
  --coverage_context              [integer] Size of context around the annotated region (coverage_window_size base pairs either side) in the coverage plot. 
                                            [default: 100] 

Resource Request Options
  --max_cpus                      [integer] Maximum number of CPUs that can be requested for any single job. [default: 256]
  --max_memory                    [string]  Maximum amount of memory that can be requested for any single job. [default: 2.9 TB]
  --max_time                      [string]  Maximum amount of time that can be requested for any single job. [default: 30d]
  --max_retries                   [integer] Maximum number of retries before ingnoring process failure. [default: 2]
  --retry_strategy                [string]  Default retry strategy (used in the event that software fails to process data for an unexpected reason) 
                                            [default: ignore] 
  --submit_rate_limit             [string]  LSF-specific: Change the rate at which the pipeline will submit jobs on the cluster.
  --queue_size                    [string]  LSF-specific: Maximum number of jobs pipeline will submit jobs on the cluster at any given time.

Generic options
  --help                          [boolean] Display help text.
  --monochrome_logs               [boolean] Do not use coloured log outputs.
  --multiqc_config                [string]  Custom config file to supply to MultiQC.
  --tracedir                      [string]  Directory in which to save pipeline reporting information. [default: ./results/pipeline_info]

Config Inheritance Options
  --generic_config_base           [string]  Inherit configuration from a base URL. [default: 
                                            https://raw.githubusercontent.com/sanger-pathogens/nextflow-commons/] 
  --generic_config_version        [string]  Specify the version to inherit (will be appended to base URL). [default: master]
  --generic_config                [string]  Override the above with a generic config URL (or file path).
  --nf_core_custom_config_base    [string]  nf-core configuration base URL. [default: https://raw.githubusercontent.com/nf-core/configs/]
  --nf_core_custom_config_version [string]  Specify the version of nf-core configs to inherit (will be appended to base URL). [default: master]
  --nf_core_custom_config         [string]  Override the above with an nf-core config URL (or file path).
  --input                         [string]  Ignore: Parameter inherited but not used by this pipeline.

------------------------------------------------------

Output

The pipeline will generate a directory structure in the output directory specified by --outdir. Some directories may be missing depending on the data retention flags employed (See Usage). Note that parameter values are represented by {param_name} below.

results/
├── bowtie2_index
├── coverage
│   ├── plots
│   ├── wig_{coverage_window_size}
│   └── wig_raw
├── fastp
│   ├── trimmed_fastqs
│   └── reports
├── picard
│   └── metrics
├── filtered_bams
│   ├── minus_filter
│   ├── plus_filter
│   └── user_defined_filter
├── htseq
├── multiqc
│   ├── multiqc_report_data
│   ├── multiqc_report_plots
│   └── multiqc_report.html
├── pipeline_info
├── qc
│   ├── fastqc
│   │   ├── raw
│   │   └── trim
│   └── samtools_stats
├── sorted_bams
└── ref_index

The content of these folders is described in more detail below. Note that output files will be sample-specific if --combine_level sample is used, replicate-specific if --combine_level replicate, and input file-specific otherwise:

Directory / File Description
bowtie2_index Contains Bowtie2 index files (.bt2.*) for the given reference (--reference).
coverage/plots Contains per sample (or pairwise-sample if --pairwise) folders, containing strand-specific coverage plots for all gene level annotations in the given --gff and intergenic regions.
coverage/wig_{coverage_window_size} Contains *.wig files which summarize cumulative (per base) genome coverage over steps of the given --coverage_window_size.
coverage/wig_raw Contains *.wig files per base genome coverage.
fastp/trimmed_fastqs Contains trimmed reads in *.fastq.gz files.
Generated with --keep_trimmed_fastqs and if not --skip_trim.
fastp/reports Contains fastp report files in html and json format.
picard Contains deduplicated BAM files.
Generated with --keep_dedup_bam.
picard/metrics Contains deduplication metrics files generated by picard.
filtered_bams/(plus|minus)_filter Contains reverse (plus_filter) and forward (minus_filter) strand-specific BAM files, containing reads that mapped to each strand of the given --reference.
Generated with --keep_filter_bam.
filtered_bams/user_defined_filter Contains BAM files, containing reads that mapped to the given --reference and filtered using --samtools_filter_args.
Generated with --keep_filter_bam.
htseq Contains HTSeq-generated count data (tsv format) suitable for differential expression analysis. It also contains a summary across all samples gene_counts.tsv.
multiqc Contains a summary of all interesting QC metrics/statistics generated by the pipeline.
sorted_bams Contains raw sorted BAM files (prior to filtering).
Generated with --keep_sorted_bam.
qc/fastqc Contains FASTQC data and reports for both trimmed and raw *.fastq.gz files.
qc/samtools_stats Contains statistics on mapped and filtered reads.
ref_index Contains a copy of the reference fasta and fasta index *.fai.
Generated only if a reference *.fai could not be detected in the same directory as the input reference.
pipeline_info Contains an execution report (html), execution trace (tsv) - useful for identifying failed processes, and a pipeline flowchart (svg).

Known Issues

  • The --pairwise option will fail to generate pairwise coverage plots - this functionality will be fixed in a future release.

Credits

Tp RNAseq was produced by PAM informatics at the Wellcome Sanger Institute, in collaboration with Linda Grillova and Eli Carrami. Inspiration was taken, in places, from the nf-core pipelines.

Support

If you require any help running this pipeline, experience a bug, or would like to request a new feature, please post an issue.

Citations

This pipeline was developed to support the work of the following publications. If you use this pipeline in your work, please cite as appropriate:

<TODO - List of citations>

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

About

A bacterial RNAseq pipeline, designed for analysis of Treponema pallidum data

Resources

License

Stars

Watchers

Forks

Packages

No packages published