This is a simple snakemake
workflow template for preparing single-end ChIP-Seq data.
The steps implemented are:
- Download raw fastq files from SRA
- Trim and Filter raw fastq files using
AdapterRemoval
- Align to the supplied genome using
bowtie2
- Deduplicate Alignments using
Picard MarkDuplicates
- Call Macs2 Peaks using
macs2
A pdf of the rulegraph is available here
Full details for each step are given below.
Any additional parameters for tools can be specified using config/config.yml
, along with many of the requisite paths
To run the workflow with default settings, simply run as follows (after editing config/samples.tsv
)
snakemake --use-conda --cores 16
If running on an HPC cluster, a snakemake profile will required for submission to the queueing system and appropriate resource allocation. Please discuss this will your HPC support team. Nodes may also have restricted internet access and rules which download files may not work on many HPCs. Please see below or discuss this with your support team
Whilst no snakemake wrappers are explicitly used in this workflow, the underlying scripts are utilised where possible to minimise any issues with HPC clusters with restrictions on internet access.
These scripts are based on v1.31.1
of the snakemake wrappers
It should be noted that this workflow is currently incompatible with OSX-based systems. There are two unsolved issues
fasterq-dump
has a bug which is specific to conda environments. This has been updated in v3.0.3 but this patch has not yet been made available to conda environments for OSX. Please check here to see if this has been updated.- The following error appears in some OSX-based R sessions, in a system-dependent manner:
Error in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
polygon edge not found
The fix for this bug is currently unknown
The file samples.tsv
is used to specify all steps for this workflow.
This file must contain the columns: accession
, target
, treatment
and input
accession
must be an SRA accession. Only single-end data is currently supported by this workflowtarget
defines the ChIP target. All files common to a target and treatment will be used to generate summarised coverage in bigWig Filestreatment
defines the treatment group each file belongs to. If only one treatment exists, simply use the value 'control' or similar for every fileinput
should contain the accession for the relevant input sample. These will only be downloaded once. Valid input samples are required for this workflow
As some HPCs restrict internet access for submitted jobs, it may be prudent to run the initial rules in an interactive session if at all possible. This can be performed using the following (with 2 cores provided as an example)
snakemake --use-conda --until get_fastq --cores 2
- Downloaded files will be gzipped and written to
data/fastq/raw
. FastQC
andMultiQC
will also be run, with output indocs/qc/raw
Both of these directories are able to be specified as relative paths in config.yml
Read trimming is performed using AdapterRemoval. Default settings are customisable using config.yml, with the defaults set to discard reads shorter than 50nt, and to trim using quality scores with a threshold of Q30.
- Trimmed fastq.gz files will be written to
data/fastq/trimmed
FastQC
andMultiQC
will also be run, with output indocs/qc/trimmed
- AdapterRemoval 'settings' files will be written to
output/adapterremoval
Alignment is performed using bowtie2
and it is assumed that this index is available before running this workflow.
The path and prefix must be provided using config.yml
This index will also be used to produce the file chrom.sizes
which is essential for conversion of bedGraph files to the more efficient bigWig files.
- Alignments will be written to
data/aligned
bowtie2
log files will be written tooutput/bowtie2
(not the conenvtional log directory)- The file
chrom.sizes
will be written tooutput/annotations
Both sorted and the original unsorted alignments will be returned.
However, the unsorted alignments are marked with temp()
and can be deleted using
snakemake --delete-temp-output --cores 1
Deduplication is performed using MarkDuplicates from the Picard set of tools. By default, deduplication will remove the duplicates from the set of alignments. All resultant bam files will be sorted and indexed.
- Deduplicated alignments are written to
data/deduplicated
and are indexed - DuplicationMetrics files are written to
output/markDuplicates
This is performed using macs2 callpeak
.
- Peak calling will be performed on: a. each sample individually, and b. merged samples for those sharing a common ChIP target and treatment group.
- Coverage bigWig files for each individual sample are produced using CPM values (i.e. Signal Per Million Reads, SPMR)
- For all combinations of target and treatment coverage bigWig files are also produced, along with fold-enrichment bigWig files
- Individual outputs are written to
output/macs2/{accession}
- Peaks are written in
narrowPeak
format along withsummits.bed
- bedGraph files are automatically converted to bigWig files, and the originals are marked with
temp()
for subsequent deletion - callpeak log files are also added to this directory
- Peaks are written in
- Merged outputs are written to
output/macs2/{target}/
- bedGraph Files are also converted to bigWig and marked with
temp()
- Fold-Enrichment bigWig files are also created with the original bedGraph files marked with
temp()
- bedGraph Files are also converted to bigWig and marked with