Skip to content
Brian Haas edited this page Nov 12, 2017 · 13 revisions

CSHL SEQTEC 2017 Transcriptome Assembly

In this workshop, we will be performing both genome-guided and genome-free transcript reconstruction.

For genome-guided reconstruction, we will leverage the Tuxedo2 protocol, leveraging HISAT2 tool for aligning RNA-Seq reads to a target genome, and StringTie for reconstructing transcript isoforms.

We will also leverage Trinity for genome-free transcript reconstruction, and compare our de novo assembled transcripts to the genome-guided reconstructions by aligning our Trinity transcripts to the target genome using GMAP and visualizing all the data using IGV.

Target genome and RNA-Seq data

Our genome and RNA-Seq data can be found at:

%   ls ~/workspace/data/trinity/rnaseq/*

.

/home/ubuntu/workspace/data/trinity/rnaseq/minigenome.fa   
/home/ubuntu/workspace/data/trinity/rnaseq/samples.txt
/home/ubuntu/workspace/data/trinity/rnaseq/minigenome.gtf

/home/ubuntu/workspace/data/trinity/rnaseq/data:
sA_rep1_1.fastq.gz  sA_rep2_2.fastq.gz  sB_rep1_1.fastq.gz  sB_rep2_2.fastq.gz
sA_rep1_2.fastq.gz  sA_rep3_1.fastq.gz  sB_rep1_2.fastq.gz  sB_rep3_1.fastq.gz
sA_rep2_1.fastq.gz  sA_rep3_2.fastq.gz  sB_rep2_1.fastq.gz  sB_rep3_2.fastq.gz

The genome data correspond to a small subset of the Drosophila genome (~200 genes) and the RNA-Seq data correspond to two conditions (sA & sB), with each containing three biological replicates (rep1,2,3).

The minigenome.gtf contains the reference transcript structure annotations, and the 'samples.txt' file describes the relationships between samples, replicates, and their corresponding paired-end RNA-Seq fastq files.

Setting up your local workspace

Create a 'transreco' directory in your workspace where you will perform your analyses.

%  cd ~/workspace
%  mkdir transreco
%  cd transreco

Copy over the genome and rna-seq data to your current working directory 'transreco'.

%  cp -r ~/workspace/data/trinity/rnaseq/* .

View the genome and annotations using IGV

Load the 'minigenome.fa' into IGV as a new genome, and then load in the annotations 'minigenome.gtf'. Setting the annotations to 'squished' will enable you to see all the different reference isoforms for each gene.

Align reads using HISAT2

Index the target genome for use with hisat2

Before we can align the reads to the genome using hisat2, we must first prepare (index) the target genome. This is a multi-step process involving the following steps, all involving tools included in the hisat2 installation:

Extract the splice sites from the genome annotation file:

%  hisat2_extract_splice_sites.py minigenome.gtf  > minigenome.gtf.ss

Extract the exon records from the genome annotation file:

%  hisat2_extract_exons.py minigenome.gtf > minigenome.gtf.exons

Now build the genome index using these new files along with the genome and annotation file:

 % hisat2-build --exon minigenome.gtf.exons \
       --ss minigenome.gtf.ss \
       -p 2 minigenome.fa minigenome.fa

Align the reads using hisat2, generating bam files

Align the reads for each replicate separate. Each replicate corresponds to a pair of fastq files. Align each like so:

Sample A, replicate 1

%  hisat2 --dta -x minigenome.fa -p 2 \
    -1 data/sA_rep1_1.fastq.gz -2 data/sA_rep1_2.fastq.gz \
    | samtools view -Sb -F 4 \
    | samtools sort  -o sA_rep1.cSorted.hisat2.minigenome.fa.bam

Sample A, replicate 2

%  hisat2 --dta -x minigenome.fa -p 2 \
      -1 data/sA_rep2_1.fastq.gz -2 data/sA_rep2_2.fastq.gz \
      | samtools view -Sb -F 4 \
      | samtools sort  -o sA_rep2.cSorted.hisat2.minigenome.fa.bam

Sample A, replicate 3

% hisat2 --dta -x minigenome.fa -p 2 \
       -1 data/sA_rep3_1.fastq.gz -2 data/sA_rep3_2.fastq.gz \
       | samtools view -Sb -F 4 \
       | samtools sort  -o sA_rep3.cSorted.hisat2.minigenome.fa.bam 

Sample B, replicate 1

% hisat2 --dta -x minigenome.fa -p 2 \
       -1 data/sB_rep1_1.fastq.gz -2 data/sB_rep1_2.fastq.gz \
       | samtools view -Sb -F 4 \
       | samtools sort  -o sB_rep1.cSorted.hisat2.minigenome.fa.bam

Sample B, replicate 2

% hisat2 --dta -x minigenome.fa -p 2 \
       -1 data/sB_rep2_1.fastq.gz -2 data/sB_rep2_2.fastq.gz \
       | samtools view -Sb -F 4 \
       | samtools sort  -o sB_rep2.cSorted.hisat2.minigenome.fa.bam

Sample B, replicate 3

% hisat2 --dta -x minigenome.fa -p 2 \
       -1 data/sB_rep3_1.fastq.gz -2 data/sB_rep3_2.fastq.gz \
       | samtools view -Sb -F 4 \
       | samtools sort  -o sB_rep3.cSorted.hisat2.minigenome.fa.bam
Clone this wiki locally