Skip to content

ChIP seq

bowhan edited this page Apr 12, 2016 · 8 revisions

piPipes ChIP-seq pipeline

This document explains how to run piPipes ChIP-seq pipeline and how to interpret the output.

This pipeline provides analysis on the abundance of a specific chromatin signal on genes and transposons using single-end or paired-end ChIP-seq reads generated by Next Generation Sequencing.

The ChIP-seq pipeline contains two modes: single-sample mode and dual-sample mode. Please see the examples below.

Example 1. Run single sample ChIP-seq library

Install the fly dm3 genome, if you haven't done so

# require internet access
piPipes install -g dm3

Download ChIP-seq sample data from NCBI SRA

ChIP-seq usually contains two libraries for each experiment: input control and immunoprecipitates.

# use fastq-dump from SRATools (http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software)
# to download data and convert to fastq; require internet access
# --split-3 split the left and right read from paired-end ChIP-seq data
fastq-dump -F --split-3 --gzip -A Hannon.ChIP.input.nos_piwi   SRR646594
fastq-dump -F --split-3 --gzip -A Hannon.ChIP.H3K9me3.nos_piwi SRR646595

Check usage message

piPipes chip
# or
piPipes chip -h

Using default parameters

# -l: left read of  ChIP-seq IP
# -r: right read of ChIP-seq IP
# -L: left read of  ChIP-seq input
# -R: right read of ChIP-seq input
# -g: genome to use
# -o: output directory
piPipes chip \
  -l Hannon.ChIP.H3K9me3.nos_piwi_1.fastq.gz \
  -r Hannon.ChIP.H3K9me3.nos_piwi_2.fastq.gz \
  -L Hannon.ChIP.input.nos_piwi_1.fastq.gz \
  -R Hannon.ChIP.input.nos_piwi_2.fastq.gz \
  -g dm3 \
  -B \
  -o Hannon.ChIP.H3k9me3.nos_piwi.piPipes_out \
  1> Hannon.ChIP.H3k9me3.nos_piwi.piPipes.stdout \
  2> Hannon.ChIP.H3k9me3.nos_piwi.piPipes.stderr

Run the pipeline with optional parameter

# -B: broad peak; used for H3K9me3; for transcriptional factor, don't use this option
# -x: extend this many nucleotide on both sides of genomic features for TSS/TES analysis
#     Note that for meta-analysis of gene body, the extended length is proportional to the size
#     of the gene body (1:1:1) so this parameter does not influence meta-gene
# -M: Path to BED files for meta-plot analysis on user-defined regions.
#     The files should be delimited by comma.

# the following three options toggle the usage of multi-mappers (unambiguous mappers)
# -u: Only use unique mappers. default: on
# -m: Use both unique and multi-mappers. For multi-mappers, let Bowtie2 randomly report one locus
#     from the best alignment pool. default: off
# -e: Use both unique and multi-mappers. For multi-mappers, use Expectation–Maximization algorithm
#     implemented by CSEM to allocate them. Only alignments passing 0.5 CSEM posterior are kept.

piPipes chip \
  -l Hannon.ChIP.H3K9me3.nos_piwi_1.fastq.gz \
  -r Hannon.ChIP.H3K9me3.nos_piwi_2.fastq.gz \
  -L Hannon.ChIP.input.nos_piwi_1.fastq.gz \
  -R Hannon.ChIP.input.nos_piwi_2.fastq.gz \
  -g dm3 \
  -B \
  -x 100 \
  -m \
  -M region_of_interest1.bed,region_of_interest2.bed \
  -o Hannon.ChIP.H3k9me3.nos_piwi.piPipes_out \
  1> Hannon.ChIP.H3k9me3.nos_piwi.piPipes.stdout \
  2> Hannon.ChIP.H3k9me3.nos_piwi.piPipes.stderr

Interpretation of the output files

The output folder should contain the following folders:

# folder raw data for mega analysis
aggregate_output/
# bigWig files for UCSC genome browser
bigWig/
# BAM files for genome mapping
genome_mapping/
# outputs from MACS2, the peak calling program used
macs2_peaks_calling/
# log
nos_piwi.callpeak.log
# pdfs for meta-analysis
pdfs/

genome_mapping/ contains bam files for genomic alignments. Note that input and IP are aligned separately.

** Based on the option of -u -m -e, the bam file could contain different conteins**

  • if -u is used (by default), reads are mapped to the genome by bowtie2, the bam file contains only the unique mapper
  • if -m is used, for each of the multi-mapper reads, bowtie2 will randomly report one alignment from the best alignments.
  • if -e is used, reads are aligned to the genome using bowtie with -a -m 100 --best --strata (to report best alignments for multi-mappers with <= 100 loci). Then the bam file will be used by csem, which uses Expectation–maximization algorithm to allocate multi-mappers. piPipes then filter the alignments and only keep the ones with crem posterior higher than 0.5. The reason to use bowtie instead of bowtie2 is that crem currently is incompatible with bowtie2 output.

Note that bowtie is fundamentally a string matching algorithm, it can allow up to 3 mismatches and no indels. bowtie2 uses score matrix based alignment algorithm and are tolerable with more mismatches and indels. Thus for long reads (>70), we recommend using -u. For targets enriched in repetitive region, including H3K9me2/3 or Rhino, we recommend to use -m so that we don't lose important information.

# alignments for the input
Hannon.ChIP.H3K9me3.dm3.Input.b2.log
Hannon.ChIP.H3K9me3.dm3.Input.b2.sorted.bam
Hannon.ChIP.H3K9me3.dm3.Input.b2.sorted.bam.bai

# alignments for the IP
Hannon.ChIP.H3K9me3.dm3.IP.b2.log
Hannon.ChIP.H3K9me3.dm3.IP.b2.sorted.bam
Hannon.ChIP.H3K9me3.dm3.IP.b2.sorted.bam.bai

macs2_peaks_calling/ contains output files from MACS2.

# bedGraph files for IP and input alignments
Hannon.ChIP.H3K9me3_treat_pileup.bdg
Hannon.ChIP.H3K9me3_control_lambda.bdg

# loci of peaks called by MACS2
Hannon.ChIP.H3K9me3_peaks.xls

# information on the "broad" regions called by MACS2 in BED format
Hannon.ChIP.H3K9me3_peaks.broadPeak

# information on both the "broad" regions and narrow peaks
Hannon.ChIP.H3K9me3_peaks.gappedPeak

# log for peak calling, --SPMR option is used so the signal was normalized
Hannon.ChIP.H3K9me3.callpeak.SPMR.log

# bedGraph files for signal enrichment (IP over input) with three different methods
Hannon.ChIP.H3K9me3.ppois.bdg
Hannon.ChIP.H3K9me3.FE.bdg
Hannon.ChIP.H3K9me3.logLR.bdg

bigWig/ contains bigWig files for UCSC genome browser

# bigWig files for IP/input signal enrichment from three different methods.
Hannon.ChIP.H3K9me3.ppois.bigWig
Hannon.ChIP.H3K9me3.FE.bigWig
Hannon.ChIP.H3K9me3.logLR.bigWig

aggregate_output/ contains accumulated signal around start site (TSS), end site (TES), and the whole region (meta). They are generated by bwtool aggregate.

By default, piPipes extends 100 bp around the start/end and calculate the accumulated signals. The size can be toggled using -x option. For meta-gene analysis, the extension length is the same as the gene body, so it is not affected by -x.

# piPipes used three different methods to calculate the signal enrichment of IP/input,
# Poisson P value, Folder Enrichment and logLR.
# for each position from -100 to 100
# the data is used to draw pdf
$ head piRNA_Cluster_42AB.starts.txt
piRNA_Cluster_42ABstart	Poisson P value	-100	0.121330
piRNA_Cluster_42ABstart	Fold Enriched	-100	0.452620
piRNA_Cluster_42ABstart	logLR	-100	-0.115620
piRNA_Cluster_42ABstart	Poisson P value	-99	0.121330
piRNA_Cluster_42ABstart	Fold Enriched	-99	0.452620
piRNA_Cluster_42ABstart	logLR	-99	-0.115620
piRNA_Cluster_42ABstart	Poisson P value	-98	0.121330
piRNA_Cluster_42ABstart	Fold Enriched	-98	0.452620
piRNA_Cluster_42ABstart	logLR	-98	-0.115620

Example 2. Run dual-sample mode to compare ChIP-seq libraries from two samples

Run single-sample mode for each library

# download all the data
fastq-dump -F --split-3 --gzip -A Hannon.ChIP.input.nos_white    SRR646592
fastq-dump -F --split-3 --gzip -A Hannon.ChIP.H3K9me3.nos_white  SRR646593
fastq-dump -F --split-3 --gzip -A Hannon.ChIP.input.nos_piwi     SRR646594
fastq-dump -F --split-3 --gzip -A Hannon.ChIP.H3K9me3.nos_piwi   SRR646595

# run single-sample pipeline for control sample
piPipes chip \
  -l Hannon.ChIP.H3K9me3.nos_white_1.fastq.gz \
  -r Hannon.ChIP.H3K9me3.nos_white_2.fastq.gz \
  -L Hannon.ChIP.input.nos_white_1.fastq.gz \
  -R Hannon.ChIP.input.nos_white_2.fastq.gz \
  -g dm3 \
  -c 24 \
  -B \
  -x 100 \
  -o Hannon.ChIP.H3K9me3.nos_white.piPipes_out \
  1> Hannon.ChIP.H3K9me3.nos_white.piPipes.stdout \
  2> Hannon.ChIP.H3K9me3.nos_white.piPipes.stderr

# run single sample pipeline for experimental sample
piPipes chip \
  -l Hannon.ChIP.H3K9me3.nos_piwi_1.fastq.gz \
  -r Hannon.ChIP.H3K9me3.nos_piwi_2.fastq.gz \
  -L Hannon.ChIP.input.nos_piwi_1.fastq.gz \
  -R Hannon.ChIP.input.nos_piwi_2.fastq.gz \
  -g dm3 \
  -c 24 \
  -B \
  -x 100 \
  -o Hannon.ChIP.H3K9me3.nos_piwi.piPipes_out \
  1> Hannon.ChIP.H3K9me3.nos_piwi.piPipes.stdout \
  2> Hannon.ChIP.H3K9me3.nos_piwi.piPipes.stderr

Run dual-sample pipeline using the outputs of single-sample mode

piPipes chip2 \
  -a Hannon.ChIP.H3K9me3.nos_white.piPipes_out \
  -b Hannon.ChIP.H3K9me3.nos_piwi.piPipes_out \
  -g dm3 \
  -c 24 \
  -o Hannon.ChIP.H3K9me3.nos_white.vs.nos_piwi \
  -A nos_white \
  -B nos_piwi \
  1> Hannon.ChIP.H3K9me3.nos_white.vs.nos_piwi.stdout \
  2> Hannon.ChIP.H3K9me3.nos_white.vs.nos_piwi.stderr

Interpretation of the output files

# pdfs output
pdfs/
# peak calling output from sample 1, without in-library normalization
macs2_peaks_calling_no_normalization_nos_white/
# peak calling output from sample 2, without in-library normalization
macs2_peaks_calling_no_normalization_nos_piwi/
# differential peaks calling between sample 1 and sample 2
differential_peaks_calling/
# aggregate analysis on differentially expressed peaks identified
aggregate_output/

pdfs/ folder contains the TSS/TES/meta analysis on the newly identified regions that are differentially enriched/depleted in mutant.

macs2_peaks_calling_no_normalization_nos_white/ contains peak calling of MACS2 on control sample

macs2_peaks_calling_no_normalization_nos_piwi/ contains peak calling of MACS2 on experimental sample

differential_peaks_calling/ contains outputs of MACS2 on differential peak calling

# log file
nos_.nos_white_vs_nos_piwi.bdgdiff.log

# common peaks enriched in IP over input for both samples, in BED format
nos_.nos_white_vs_nos_piwi_c3.0_common.bed

# peaks enriched in sample A, in BED format
nos_.nos_white_vs_nos_piwi_c3.0_cond1.bed

# peaks enriched in sample B, in BED format
nos_.nos_white_vs_nos_piwi_c3.0_cond2.bed

aggregate_output/ contains text outputs for TSS/TES/meta analysis on the newly identified regions that are differentially enriched/depleted in mutant. The data has been used to draw the figures in pdfs/ folder.

Please see example figures from the Github Wiki page or our manuscript.

##Flowchart and example figures from our manuscript


img/S4.png