bulk-rnaseq

Simple workflow to quantify gene-level RNA abundance and detect differentially expressed genes (DEGs) from bulk RNAseq samples. The pipeline uses salmon to quantify transcript level abundance and DESeq2 to normalize counts and detect DEGs.

Resource Requirements

Salmon usually works with 8GB to 12GB of RAM. 16GB is recommended for DESeq2 with small datasets. Larger amounts of memory will be required for larger datasets.

Installation

Install Singularity and snakemake
Download the salmon references or build your own
Clone this repository or a create new repository from this template
Describe your samples in samples.csv
Modify the settings in config.yaml
(Optional) If you plan on using a SLURM cluster, fill out the #SBATCH directives in run_pipeline.sh and the account field in cluster.json.

Salmon vs Kallisto

As of April 2022, bulk-rnaseq only supports Salmon quantification. Older versions supported Kallisto, but this was dropped due to better Salmon integration with tximeta

Sample File Configuration

bulk-rnaseq requires you specify a CSV file with a metadata about samples to analyze. There are several required columns:

patient
condition
fq1
fq2

The patient and condition columns are used to create a sample specific ID for downstream processing and identification. The values specified in the these columns should create unique values when combined as {patient}-{condition}. condition should specify the experimental condition of the sample (normal or tumor, primary or metastatic etc).

NOTE salmon will automatically infer the library stranding and choose the best option.

Running the pipeline

Run jobs locally

snakemake -j [cores] --use-singularity

Run jobs on a SLURM cluster

# Must run within the bulk-rnaseq directory
sbatch run_pipeline.sh

Output

Results are stored in results/[gene|transcript]-level and include:

normalized_counts.rds - DESeq2 object with VST normalized log2 counts. Useful for downstream visualization, clustering, and machine learning
pca_plot.svg - Plot of the samples described by the first two principal components. Useful to check for batch effects and identify bad samples

Each contrast has its own folder in results/ with the following files:

mle_foldchanges.csv - List of log2 fold changes estimated by DESeq2 using maximum likelihood estimation (MLE)
map_foldchanges.csv - List of log2 fold changes estimated by DESeq2 via shrinkage using the apeglm package
mle_ma.svg - MA plot showing the MLE fold changes
map_ma.svg - MA plot showing the MAP fold changes

The full DESeq2 object used for differential expression analysis can be found in deseq2/.

A MultiQC html report with quantification statistics and FASTQC reports for each sample is located in qc/. Check this to verify a reasonable proportion of reads were pseudoaligned. validateFastq_summary.csv has the results of running qc/validateFastq on each sample's input FASTQ files. This checks for proper FASTQ formatting and properly sorted read headers.

Name		Name	Last commit message	Last commit date
Latest commit History 109 Commits
envs		envs
scripts		scripts
.gitignore		.gitignore
README.md		README.md
Snakefile		Snakefile
cluster.json		cluster.json
config.yaml		config.yaml
run_pipeline.sh		run_pipeline.sh
samples.csv		samples.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

bulk-rnaseq

Resource Requirements

Installation

Salmon vs Kallisto

Sample File Configuration

Running the pipeline

Output

About

Releases

Packages

Languages

tjbencomo/bulk-rnaseq

Folders and files

Latest commit

History

Repository files navigation

bulk-rnaseq

Resource Requirements

Installation

Salmon vs Kallisto

Sample File Configuration

Running the pipeline

Output

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages