Nextflow pipeline for Salmon quantification of HSC bulk RNA-seq data followed by R/Bioconductor analysis

felixm3/RNA-seq

RNA-seq: Comparison of Gene Expression in Young vs Aged Hematopoietic Stem Cells (HSCs)

I performed this analysis in two parts, corresponding to the two notebooks here. First, I wrote a Nextflow pipeline to process the raw sequencing reads on a SLURM HPC cluster, using conda environments for the necessary bioinformatics tools. Second, I imported the resulting Salmon quantification files into R for differential expression analysis with several Bioconductor packages, including DESeq2.


The Nextflow script defines an RNA-seq pipeline that runs quality control and trimming on the raw sequencing reads (FASTQC and TRIMGALORE), builds a Salmon index (INDEX), and quantifies the trimmed reads against that index (QUANT). Its components are described below:

Overall Functionality:

The pipeline facilitates RNA-seq analysis by performing the following steps:

  1. FASTQC: Evaluates the quality of raw sequencing reads.
  2. TRIMGALORE: Trims adapter sequences and low-quality bases from reads.
  3. INDEX: Builds a Salmon index from the provided transcriptome reference and decoy sequences.
  4. QUANT: Quantifies transcript abundance with Salmon by mapping the trimmed reads against the index.
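The four steps above roughly correspond to four command-line invocations. The Python sketch below only assembles those commands as a dry run and is not the repository's Nextflow code. It assumes single-end reads, and the file names, the `results` directory layout, and the `build_commands` helper are hypothetical.

```python
from pathlib import Path


def build_commands(fastq, gentrome, decoys, outdir="results"):
    """Return one command line per pipeline step for a single-end FASTQ file.

    Nothing is executed; the commands are only assembled as argument lists.
    """
    fastq = Path(fastq)
    out = Path(outdir)
    sample = fastq.stem  # file name without the .fq extension
    index = out / "salmon_index"
    # Assumption: Trim Galore writes single-end output as <sample>_trimmed.fq
    trimmed = out / "trimmed" / f"{sample}_trimmed.fq"
    return {
        # Quality report for the raw reads
        "FASTQC": ["fastqc", "--outdir", str(out / "fastqc"), str(fastq)],
        # Adapter and quality trimming
        "TRIMGALORE": ["trim_galore", "--output_dir", str(out / "trimmed"), str(fastq)],
        # Decoy-aware index; in a real run this is built once, not per sample
        "INDEX": ["salmon", "index", "-t", str(gentrome), "-d", str(decoys),
                  "-i", str(index)],
        # "-l A" lets Salmon infer the library type; "-r" passes unpaired reads
        "QUANT": ["salmon", "quant", "-i", str(index), "-l", "A",
                  "-r", str(trimmed), "-o", str(out / "quant" / sample)],
    }


# Hypothetical file names, for illustration only
commands = build_commands("reads/young_1.fq", "gentrome.fa", "decoys.txt")
for step, cmd in commands.items():
    print(f"{step}: {' '.join(cmd)}")
```

In the actual pipeline, Nextflow wires the output of each process into the next, so the intermediate paths do not need to be spelled out by hand as they are here.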

Input Files Required to Run:

  • Raw sequencing reads: The pipeline expects fastq files (*.fq) located in the directory specified by params.reads.
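  • Gentrome reference file: The transcriptome concatenated with the genome, as used for decoy-aware Salmon indexing. Provided via params.gentrome.
  • Decoy sequences file: The list of decoy sequence names that Salmon's index step expects. Specified by params.decoys.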

Required Bioinformatics Tools:

  • FastQC: Quality control tool for raw sequencing reads.
  • Trim Galore: Tool for trimming adapter sequences and low-quality bases.
  • Salmon: Used for indexing the transcriptome and quantifying gene expression.
  • Nextflow: Workflow manager that runs the steps above.

Outputs:

  • FastQC Reports: Quality assessment reports for each sample.
  • Trimmed Reads: Trimmed and quality-filtered fastq files.
  • Salmon Index: Indexed transcriptome for subsequent quantification.
  • Quantification Results: Salmon quantification files (quant.sf) with transcript-level abundance estimates for each sample.

Configuration:

The nextflow.config file specifies parameters for the execution environment, including using the Slurm executor for job execution and configuring Conda environments for tool dependencies.


The R script imports the Salmon output and runs the differential expression analysis. Here's a breakdown of its components:

Overall Functionality:

  1. Data Preparation and Import: Uses tximport, AnnotationHub, ensembldb, and related packages to retrieve transcript and gene annotation, import the quantification files generated by SALMON, and build a sample table.
  2. Normalization and Transformation: Normalizes the counts and applies the regularized log (rlog) transformation to stabilize variance for downstream analysis.
  3. Visualization:
    • Generates sample distance matrix and visualizes it as a heatmap (pheatmap).
    • Conducts Principal Component Analysis (PCA) and generates a scatter plot (ggplot2).
    • Visualizes differential gene expression results using EnhancedVolcano and generates heatmaps for significant genes.
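To make the sample distance matrix concrete, here is a minimal Python sketch of the idea: each sample is a vector of transformed expression values, and the matrix holds the Euclidean distance between every pair of samples. The analysis itself is done in R; the sample names and values below are made up, and `sample_distances` is a hypothetical helper.

```python
import math


def sample_distances(expr):
    """Pairwise Euclidean distances between samples.

    expr maps a sample name to its list of transformed expression values,
    with genes in the same order for every sample.
    """
    names = list(expr)
    return {a: {b: math.dist(expr[a], expr[b]) for b in names} for a in names}


# Made-up toy values for two genes in three samples
toy_expr = {
    "young_1": [1.0, 2.0],
    "young_2": [1.0, 2.0],
    "aged_1": [4.0, 6.0],
}
distances = sample_distances(toy_expr)
for name, row in distances.items():
    print(name, row)
```

Samples with similar expression profiles end up with small distances, which is what the heatmap makes visible as blocks of similar samples.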

Input Files Required to Run:

  • Quantification Files: Generated by SALMON (quant.sf files) from the RNA-seq analysis for each sample.
  • Annotation Files: Obtained using AnnotationHub to retrieve Mus musculus transcript and gene information.
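The import step combines these two inputs: transcript-level estimates from quant.sf are collapsed to gene level using a transcript-to-gene mapping derived from the annotation. The Python sketch below shows only that summation on a made-up quant.sf excerpt. It is a simplified illustration rather than what tximport does in full (tximport also tracks abundances and transcript lengths), and the transcript and gene identifiers are placeholders.

```python
import csv
import io
from collections import defaultdict


def gene_counts(quant_sf, tx2gene):
    """Sum Salmon's NumReads over the transcripts that belong to each gene."""
    totals = defaultdict(float)
    for row in csv.DictReader(quant_sf, delimiter="\t"):
        gene = tx2gene.get(row["Name"])
        if gene is not None:  # transcripts missing from the mapping are skipped
            totals[gene] += float(row["NumReads"])
    return dict(totals)


# Made-up excerpt using the quant.sf column layout; IDs are placeholders
TOY_QUANT_SF = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1000\t800.0\t10.0\t10.0\n"
    "tx2\t1500\t1300.0\t20.0\t20.0\n"
    "tx3\t900\t700.0\t5.0\t5.0\n"
    "tx4\t500\t300.0\t1.0\t2.0\n"
)
TX2GENE = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

counts = gene_counts(io.StringIO(TOY_QUANT_SF), TX2GENE)
print(counts)
```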

R Packages and Bioinformatics Tools Used:

  • R Packages:
    • tximport, AnnotationHub, ensembldb, DESeq2, ggplot2, dplyr, pheatmap, RColorBrewer, biomaRt, EnhancedVolcano, etc.
  • Bioinformatics Tools:
    • Salmon: Used upstream, in the Nextflow pipeline, to quantify transcript abundance from the RNA-seq data.

Outputs Generated:

  • Quality Visualizations: Heatmaps depicting sample distances and differential gene expression.
  • Normalized Data: Data after the normalization and transformation steps.
  • Differential Expression Results: Log2 fold changes, p-values, adjusted p-values, and associated gene symbols.
  • Visualization Outputs: Scatter plots for PCA and volcano plots of the differentially expressed genes.
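As an illustration of how a differential expression table is typically reduced to the "significant genes" shown in heatmaps and volcano plots, the Python sketch below filters rows by adjusted p-value and absolute log2 fold change. The cutoffs (0.05 and 1.0), the gene symbols, and the values are assumptions chosen for the example, not results or settings from this analysis. The `log2FoldChange` and `padj` keys follow DESeq2's column names, and `None` stands in for a missing (NA) adjusted p-value.

```python
def significant_genes(results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Keep rows passing both cutoffs, sorted by adjusted p-value."""
    hits = [
        row for row in results
        if row["padj"] is not None  # rows without an adjusted p-value are dropped
        and row["padj"] < padj_cutoff
        and abs(row["log2FoldChange"]) >= lfc_cutoff
    ]
    return sorted(hits, key=lambda row: row["padj"])


# Made-up rows; symbols and numbers are placeholders, not real results
toy_results = [
    {"symbol": "GeneA", "log2FoldChange": 2.5, "padj": 0.001},
    {"symbol": "GeneB", "log2FoldChange": -1.8, "padj": 0.0004},
    {"symbol": "GeneC", "log2FoldChange": 0.3, "padj": 0.01},
    {"symbol": "GeneD", "log2FoldChange": 3.0, "padj": None},
]

hits = significant_genes(toy_results)
for row in hits:
    print(row["symbol"], row["log2FoldChange"], row["padj"])
```

Filtering on the adjusted p-value rather than the raw p-value accounts for the many genes tested at once. The fold-change cutoff is optional and depends on the question being asked.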

The dataset is from the NCBI Gene Expression Omnibus (GEO) repository.
