Skip to content
Brian Haas edited this page Nov 12, 2017 · 13 revisions

CSHL SEQTEC 2017 Transcriptome Assembly

In this workshop, we will be performing both genome-guided and genome-free transcript reconstruction.

For genome-guided reconstruction, we will leverage the Tuxedo2 protocol, leveraging HISAT2 tool for aligning RNA-Seq reads to a target genome, and StringTie for reconstructing transcript isoforms.

We will also leverage Trinity for genome-free transcript reconstruction, and compare our de novo assembled transcripts to the genome-guided reconstructions by aligning our Trinity transcripts to the target genome using GMAP and visualizing all the data using IGV.

Target genome and RNA-Seq data

Our genome and RNA-Seq data can be found at:

%   ls ~/workspace/data/trinity/rnaseq/*

.

/home/ubuntu/workspace/data/trinity/rnaseq/minigenome.fa   
/home/ubuntu/workspace/data/trinity/rnaseq/samples.txt
/home/ubuntu/workspace/data/trinity/rnaseq/minigenome.gtf

/home/ubuntu/workspace/data/trinity/rnaseq/data:
sA_rep1_1.fastq.gz  sA_rep2_2.fastq.gz  sB_rep1_1.fastq.gz  sB_rep2_2.fastq.gz
sA_rep1_2.fastq.gz  sA_rep3_1.fastq.gz  sB_rep1_2.fastq.gz  sB_rep3_1.fastq.gz
sA_rep2_1.fastq.gz  sA_rep3_2.fastq.gz  sB_rep2_1.fastq.gz  sB_rep3_2.fastq.gz

The genome data correspond to a small subset of the Drosophila genome (~200 genes) and the RNA-Seq data correspond to two conditions (sA & sB), with each containing three biological replicates (rep1,2,3).

The minigenome.gtf contains the reference transcript structure annotations, and the 'samples.txt' file describes the relationships between samples, replicates, and their corresponding paired-end RNA-Seq fastq files.

Setting up your local workspace

Create a 'transreco' directory in your workspace where you will perform your analyses.

%  cd ~/workspace
%  mkdir transreco
%  cd transreco

Copy over the genome and rna-seq data to your current working directory 'transreco'.

%  cp -r ~/workspace/data/trinity/rnaseq/* .

View the genome and annotations using IGV

Load the 'minigenome.fa' into IGV as a new genome, and then load in the annotations 'minigenome.gtf'. Setting the annotations to 'squished' will enable you to see all the different reference isoforms for each gene.

Genome-guided Transcript Reconstruction using Tuxedo2

Align reads using HISAT2

Index the target genome for use with hisat2

Before we can align the reads to the genome using hisat2, we must first prepare (index) the target genome. This is a multi-step process involving the following steps, all involving tools included in the hisat2 installation:

Extract the splice sites from the genome annotation file:

%  hisat2_extract_splice_sites.py minigenome.gtf  > minigenome.gtf.ss

Extract the exon records from the genome annotation file:

%  hisat2_extract_exons.py minigenome.gtf > minigenome.gtf.exons

Now build the genome index using these new files along with the genome and annotation file:

 % hisat2-build --exon minigenome.gtf.exons \
       --ss minigenome.gtf.ss \
       -p 2 minigenome.fa minigenome.fa

Note, it's useful to run 'ls -ltr' to see which new files exist in your working directory.

Align the reads using hisat2, generating bam files

Align the reads for each replicate separate. Each replicate corresponds to a pair of fastq files. Align each like so:

Sample A, replicate 1

%  hisat2 --dta -x minigenome.fa -p 2 \
    -1 data/sA_rep1_1.fastq.gz -2 data/sA_rep1_2.fastq.gz \
    | samtools view -Sb -F 4 \
    | samtools sort  -o sA_rep1.cSorted.hisat2.minigenome.fa.bam

Sample A, replicate 2

%  hisat2 --dta -x minigenome.fa -p 2 \
      -1 data/sA_rep2_1.fastq.gz -2 data/sA_rep2_2.fastq.gz \
      | samtools view -Sb -F 4 \
      | samtools sort  -o sA_rep2.cSorted.hisat2.minigenome.fa.bam

Sample A, replicate 3

% hisat2 --dta -x minigenome.fa -p 2 \
       -1 data/sA_rep3_1.fastq.gz -2 data/sA_rep3_2.fastq.gz \
       | samtools view -Sb -F 4 \
       | samtools sort  -o sA_rep3.cSorted.hisat2.minigenome.fa.bam 

Sample B, replicate 1

% hisat2 --dta -x minigenome.fa -p 2 \
       -1 data/sB_rep1_1.fastq.gz -2 data/sB_rep1_2.fastq.gz \
       | samtools view -Sb -F 4 \
       | samtools sort  -o sB_rep1.cSorted.hisat2.minigenome.fa.bam

Sample B, replicate 2

% hisat2 --dta -x minigenome.fa -p 2 \
       -1 data/sB_rep2_1.fastq.gz -2 data/sB_rep2_2.fastq.gz \
       | samtools view -Sb -F 4 \
       | samtools sort  -o sB_rep2.cSorted.hisat2.minigenome.fa.bam

Sample B, replicate 3

% hisat2 --dta -x minigenome.fa -p 2 \
       -1 data/sB_rep3_1.fastq.gz -2 data/sB_rep3_2.fastq.gz \
       | samtools view -Sb -F 4 \
       | samtools sort  -o sB_rep3.cSorted.hisat2.minigenome.fa.bam

Note, for many samples and replicates, we would often script out the process instead of running each manually. Cutting/pasting commands is easy enough here, though.

Index each of the bam files for viewing with IGV:

The following command will run 'samtools index' on each of the .bam files:

%  for file in *.bam  
   do 
   samtools index $file
   echo indexed $file
   done

Load a bam file from each of the sample types into IGV (ie. just rep1 from sA and sB) for viewing.

Reconstruct Transcripts using StringTie

For each of the bam files, we'll reconstruct transcript isoforms. Later, we'll merge those individual replicate-based transcript sets into a single non-redundant set of final transcripts.

Run StringTie on each of the bam files by running the following command, which will loop through the bam files run StringTie and generate a separate .stringtie.GTF output file containing transcript structures based on each set of aligned reads:

%  for file in *.bam
   do 
   stringtie $file -o $file.stringtie.GTF
   echo done running stringtie on $file
   done

use 'ls -ltr' to see the files just generated.

Next, merge these gtf files into our final set of genome-based reconstructed transcripts:

%  stringtie --merge -o stringtie.final.GTF  *.stringtie.GTF

Load the 'stringtie.final.GTF' file into IGV. You might relocate the tier within the viewer, set to 'expanded' and change the color to your preference.

How well do the reconstructed transcripts compare to the input reads? to the reference annotations?

Genome-free de novo transcript reconstruction using Trinity

De novo transcript reconstruction involves assembling transcripts directly from the reads themselves and not requiring any genome sequence. We will do this using the Trinity software, and then align our reconstructed transcripts to our target genome as a way of investigating just how well it worked.

Perform Trinity de novo assembly

Run Trinity using the 'samples.txt' file like so:

%  $TRINITY_HOME/Trinity --samples_file samples.txt \
      --seqType fq --CPU 2 --max_memory 1G

This may take a few minutes to run and you can follow its progress in the terminal as it's executing.

Once Trinity is complete, you will find the output file as 'trinity_out_dir/Trinity.fasta'.

Have a look at this output file. How many transcripts does it contain?

Transcript-to-Genome Alignment

Align the Trinity-reconstructed transcripts to the genome using GMAP.

Indexing the genome using gmap_build:

Run the following to build a gmap index for the genome, required prior to searching:

 % gmap_build -D . -T . -d minigenome.fa.gmap -k 13 minigenome.fa 

Align Trinity-reconstructed transcripts using GMAP

Now, align the transcripts to the genome:

% gmap -D . -d minigenome.fa.gmap trinity_out_dir/Trinity.fasta \
      -f samse -n 0 -x 50 -t 2 -B 5 \
      | samtools view -Sb | samtools sort -o Trinity.gmap.bam

Index the bam file for viewing in IGV:

%  samtools index Trinity.gmap.bam

Load the Trinity.gmap.bam file into IGV.

Clone this wiki locally