-
Notifications
You must be signed in to change notification settings - Fork 2
Home
In this workshop, we will be performing both genome-guided and genome-free transcript reconstruction.
For genome-guided reconstruction, we will leverage the Tuxedo2 protocol, leveraging HISAT2 tool for aligning RNA-Seq reads to a target genome, and StringTie for reconstructing transcript isoforms.
We will also leverage Trinity for genome-free transcript reconstruction, and compare our de novo assembled transcripts to the genome-guided reconstructions by aligning our Trinity transcripts to the target genome using GMAP and visualizing all the data using IGV.
Our genome and RNA-Seq data can be found at:
% ls ~/workspace/data/trinity/rnaseq/*
.
/home/ubuntu/workspace/data/trinity/rnaseq/minigenome.fa
/home/ubuntu/workspace/data/trinity/rnaseq/samples.txt
/home/ubuntu/workspace/data/trinity/rnaseq/minigenome.gtf
/home/ubuntu/workspace/data/trinity/rnaseq/data:
sA_rep1_1.fastq.gz sA_rep2_2.fastq.gz sB_rep1_1.fastq.gz sB_rep2_2.fastq.gz
sA_rep1_2.fastq.gz sA_rep3_1.fastq.gz sB_rep1_2.fastq.gz sB_rep3_1.fastq.gz
sA_rep2_1.fastq.gz sA_rep3_2.fastq.gz sB_rep2_1.fastq.gz sB_rep3_2.fastq.gz
The genome data correspond to a small subset of the Drosophila genome (~200 genes) and the RNA-Seq data correspond to two conditions (sA & sB), with each containing three biological replicates (rep1,2,3).
The minigenome.gtf contains the reference transcript structure annotations, and the 'samples.txt' file describes the relationships between samples, replicates, and their corresponding paired-end RNA-Seq fastq files.
Create a 'transreco' directory in your workspace where you will perform your analyses.
% cd ~/workspace
% mkdir transreco
% cd transreco
Copy over the genome and rna-seq data to your current working directory 'transreco'.
% cp -r ~/workspace/data/trinity/rnaseq/* .
Load the 'minigenome.fa' into IGV as a new genome, and then load in the annotations 'minigenome.gtf'. Setting the annotations to 'squished' will enable you to see all the different reference isoforms for each gene.
Before we can align the reads to the genome using hisat2, we must first prepare (index) the target genome. This is a multi-step process involving the following steps, all involving tools included in the hisat2 installation:
Extract the splice sites from the genome annotation file:
% hisat2_extract_splice_sites.py minigenome.gtf > minigenome.gtf.ss
Extract the exon records from the genome annotation file:
% hisat2_extract_exons.py minigenome.gtf > minigenome.gtf.exons
Now build the genome index using these new files along with the genome and annotation file:
% hisat2-build --exon minigenome.gtf.exons \
--ss minigenome.gtf.ss \
-p 2 minigenome.fa minigenome.fa
Align the reads for each replicate separate. Each replicate corresponds to a pair of fastq files. Align each like so:
% hisat2 --dta -x minigenome.fa -p 2 \
-1 data/sA_rep1_1.fastq.gz -2 data/sA_rep1_2.fastq.gz \
| samtools view -Sb -F 4 \
| samtools sort -o sA_rep1.cSorted.hisat2.minigenome.fa.bam
% hisat2 --dta -x minigenome.fa -p 2 \
-1 data/sA_rep2_1.fastq.gz -2 data/sA_rep2_2.fastq.gz \
| samtools view -Sb -F 4 \
| samtools sort -o sA_rep2.cSorted.hisat2.minigenome.fa.bam
% hisat2 --dta -x minigenome.fa -p 2 \
-1 data/sA_rep3_1.fastq.gz -2 data/sA_rep3_2.fastq.gz \
| samtools view -Sb -F 4 \
| samtools sort -o sA_rep3.cSorted.hisat2.minigenome.fa.bam
% hisat2 --dta -x minigenome.fa -p 2 \
-1 data/sB_rep1_1.fastq.gz -2 data/sB_rep1_2.fastq.gz \
| samtools view -Sb -F 4 \
| samtools sort -o sB_rep1.cSorted.hisat2.minigenome.fa.bam
% hisat2 --dta -x minigenome.fa -p 2 \
-1 data/sB_rep2_1.fastq.gz -2 data/sB_rep2_2.fastq.gz \
| samtools view -Sb -F 4 \
| samtools sort -o sB_rep2.cSorted.hisat2.minigenome.fa.bam
% hisat2 --dta -x minigenome.fa -p 2 \
-1 data/sB_rep3_1.fastq.gz -2 data/sB_rep3_2.fastq.gz \
| samtools view -Sb -F 4 \
| samtools sort -o sB_rep3.cSorted.hisat2.minigenome.fa.bam