-
Notifications
You must be signed in to change notification settings - Fork 14
/
lab_assembly.Rmd
242 lines (163 loc) · 12.5 KB
/
lab_assembly.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
---
title: "Transcriptome assembly"
subtitle: "Workshop on RNA-Seq"
output:
bookdown::html_document2:
highlight: textmate
toc: true
toc_float:
collapsed: true
smooth_scroll: true
print: false
toc_depth: 4
number_sections: true
df_print: default
code_folding: none
self_contained: false
keep_md: false
encoding: 'UTF-8'
css: "assets/lab.css"
include:
after_body: assets/footer-lab.html
---
```{r,child="assets/header-lab.Rmd"}
```
# Preparation
This exercise is meant to get you acquainted with the type of data you would normally encounter in an assembly/annotation project. We will for all exercises use data for the fruit fly (*Drosophila melanogaster*), as that is one of the currently best annotated organisms and there is plenty of high quality data available.
You can create the folders where you want but I would suggest a folder organisation, if you do not follow this organisation remember to put the correct path to your data.
```{sh,chunk.title=TRUE,eval=FALSE}
mkdir bonus/assembly
cd bonus/assembly
```
Let's link the raw fastq files.
```{sh,chunk.title=TRUE,eval=FALSE}
mkdir raw
cd raw
ln -s /proj/uppstore2017171/courses/RNAseqWorkshop/downloads/assembly_annotation/raw_computes/* .
```
We will clone this Github repo from NBIS annotation team which will be used during this practical.
```{sh,chunk.title=TRUE,eval=FALSE}
git clone https://github.com/NBISweden/GAAS.git
```
## Assembling transcripts based on RNA-seq data
RNA-Seq data is in general very useful in annotation projects as the data usually comes from the actual organism you are studying and thus avoids the danger of introducing errors caused by differences in gene structure between your study organism and other species.
Important points to remember before starting working with RNA-Seq:
* Check if sequence data is paired-end or single-end. Last generation of sequenced short reads (since 2013) are almost all paired. Anyway, it is important to check that information, which will be useful for the tools used in the next steps.
* Check if the sequence data are stranded. This information will be useful for the tools used in the next steps. (It is recommended to use stranded data to avoid transcript fusion during the transcript assembly process. That gives more reliable results.)
* Left / L / forward / 1 have identical meaning. It is the same for Right / R /Reverse / 2
### QC
We run a quick QC to check encoding version and fastq quality score format. We have to use FastQC tool.
```{sh,chunk.title=TRUE,eval=FALSE}
module load bioinfo-tools
module load FastQC/0.11.5
mkdir fastqc
fastqc ./raw/ERR305399_1.fastq.gz -o fastqc/
```
Copy the FastQC report HTML file to your system and inspect the results. You can use `SCP` to copy the file. Run the command below from a **LOCAL** terminal.
```{sh,chunk.title=TRUE,eval=FALSE}
scp [email protected]:[path-to-report.html] .
```
Next, we check the fastq quality score format. For this, we will be using the scripts libraries available in the git gaas you need first to export the libraries and run the script `fastq_guessMyFormat.pl`.
```{sh,chunk.title=TRUE,eval=FALSE}
export PERL5LIB=$PERL5LIB:./GAAS/annotation/
./GAAS/annotation/Tools/bin/fastq_guessMyFormat.pl -i ./raw/ERR305399_1.fastq.gz
```
In the normal mode, it differentiates between Sanger/Illumina1.8+ and Solexa/Illumina1.3+/Illumina1.5+. In the advanced mode, it will try to pinpoint exactly which scoring system is used.
More test can be made and should be made on RNA-seq data before doing the assembly, we have not time to do all of them during this course. have a look [here](https://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools)
# Guided assembly
In this section, we run a genome-guided transcriptome assembly. We use Trimmomatic to trim the reads, HISAT2 to map reads to reference genome and StringTie to assemble reads into transcripts.
## Trimmomatic
[Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) performs a variety of useful trimming tasks for illumina paired-end and single ended data.The selection of trimming steps and their associated parameters are supplied on the command line.
```{sh,chunk.title=TRUE,eval=FALSE}
mkdir trimmomatic
module load bioinfo-tools
module load trimmomatic/0.36
```
The following command line will perform the following:
* Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
* Remove leading low quality or N bases (below quality 3) (LEADING:3)
* Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
* Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
* Drop reads below the 36 bases long (MINLEN:36)
```{sh,chunk.title=TRUE,eval=FALSE}
java -jar /sw/apps/bioinfo/trimmomatic/0.36/milou/trimmomatic-0.36.jar PE -threads 5 -phred33 ./raw/ERR305399_1.fastq.gz ./raw/ERR305399_2.fastq.gz trimmomatic/ERR305399.left_paired.fastq.gz trimmomatic/ERR305399.left_unpaired.fastq.gz trimmomatic/ERR305399.right_paired.fastq.gz trimmomatic/ERR305399.right_unpaired.fastq.gz ILLUMINACLIP:/sw/apps/bioinfo/trimmomatic/0.36/milou/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
```
## HISAT2
Once the reads have been trimmed, we use [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) to align the RNA-seq reads to a genome in order to identify exon-exon splice junctions.
HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) against a reference genome.
First you need to build an index of your chromosome 4
```{sh,chunk.title=TRUE,eval=FALSE}
mkdir index
module load HISAT2
module load samtools/1.8
hisat2-build /proj/uppstore2017171/courses/RNAseqWorkshop/downloads/assembly_annotation/chromosome/chr4.fa ./index/chr4_index
```
Then you can run HISAT2 :
* `--phred33` Input qualities are ASCII chars equal to the Phred quality plus 33. This is also called the "Phred+33" encoding, which is used by the very latest Illumina pipelines (you checked it with the script *fastq_guessMyFormat.pl*)
* `--rna-strandness <string>` For single-end reads, use **F** or **R**. **F** means a read corresponds to a transcript. **R** means a read corresponds to the reverse complemented counterpart of a transcript. For paired-end reads, use either **FR** or **RF**. (**RF** means fr-firststrand see [here](https://github.com/NBISweden/GAAS/blob/master/annotation/CheatSheet/rnaseq_library_types.md) for more explanation).
* `--novel-splicesite-outfile <path>` In this mode, HISAT2 reports a list of splice sites in the file :
chromosome name <tab> genomic position of the flanking base on the left side of an intron <tab> genomic position of the flanking base on the right <tab> strand (+, -, and .) '.' indicates an unknown strand for non-canonical splice sites.
```{sh,chunk.title=TRUE,eval=FALSE}
hisat2 --phred33 --rna-strandness RF --novel-splicesite-outfile hisat2/splicesite.txt -S hisat2/accepted_hits.sam -p 5 -x index/chr4_index -1 trimmomatic/ERR305399.left_paired.fastq.gz -2 trimmomatic/ERR305399.right_paired.fastq.gz
```
Finally you need to change the SAM into BAM file and to sort it in order for StringTie to use the bam file to assemble the read into transcripts.
```{sh,chunk.title=TRUE,eval=FALSE}
samtools view -bS -o hisat2/accepted_hits.bam hisat2/accepted_hits.sam
samtools sort -o hisat2/accepted_hits.sorted.bam hisat2/accepted_hits.bam
```
## StringTie
[StringTie](https://ccb.jhu.edu/software/stringtie/) is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus.
You can add as input an annotation from GTF/GFF3 file to calculate TPM and FPKM values.
```{sh,chunk.title=TRUE,eval=FALSE}
module load bioinfo-tools
module load StringTie
stringtie hisat2/accepted_hits.sorted.bam -o stringtie/transcripts.gtf
```
When done you can find your results in the directory **outdir**. The file `transcripts.gtf` includes your assembled transcripts.
You could now also visualise all this information using a genome browser, such as IGV. IGV requires a genome fasta file and any number of annotation files in GTF or GFF3 format (note that GFF3 formatted file tend to look a bit weird in IGV sometimes).
Transfer the GTF files to your computer using SCP:
```{sh,chunk.title=TRUE,eval=FALSE}
scp [email protected]:[path-to-transcripts.gtf] .
```
<i class="fas fa-comments"></i> Looking at your results, are you happy with the default values of StringTie (which we used in this exercise) or is there something you would like to change?
# De-novo assembly
[Trinity](https://github.com/trinityrnaseq/trinityrnaseq/wiki) assemblies can be used as complementary evidence, particularly when trying to polish a gene build with Pasa. Before you start, check how big the raw read data is that you wish to assemble to avoid unreasonably long run times.
```{sh,chunk.title=TRUE,eval=FALSE}
module load bioinfo-tools
module load trinity/2.4.0
module load samtools
Trinity --seqType fq --max_memory 32G --left ./raw/ERR305399_1.fastq.gz --right ./raw/ERR305399_2.fastq.gz --CPU 5 --output trinity --SS_lib_type RF
```
Trinity takes a long time to run (like hours), you can stop the program when you start it and have a look at the results, look in `/proj/uppstore2017171/courses/RNAseqWorkshop/downloads/assembly_annotation/RNAseq/trinity` and the output is `Trinity.fasta`.
In order to compare the output of StringTie and the output of Trinity we need to map the Trinity transcript to the chr4 of Drosophila.
We'll use the GMAP software to align the Trinity transcripts to our reference genome. Trinity contains a utility that facilitates running GMAP, which first builds an index for the target genome followed by running the gmap aligner:
```{sh,chunk.title=TRUE,eval=FALSE}
module load gmap-gsnap
mkdir gmap
/sw/apps/bioinfo/trinity/2.4.0/rackham/util/misc/process_GMAP_alignments_gff3_chimeras_ok.pl --genome /proj/uppstore2017171/courses/RNAseqWorkshop/downloads/assembly_annotation/chromosome/chr4.fa --transcripts /proj/uppstore2017171/courses/RNAseqWorkshop/downloads/assembly_annotation/RNAseq/trinity/Trinity.fasta > gmap/transcript_trinity.gff
```
## Assembly quality
There are different ways of assessing the quality of your assembly, you will find some of them [here](https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Assembly-Quality-Assessment).
We will run BUSCO to check the the quality of the assembly. [BUSCO](https://busco.ezlab.org/) provides measures for quantitative assessment of genome assembly, gene set, and transcriptome completeness (what we are going to do here). Genes that make up the BUSCO sets for each major lineage are selected from orthologous groups with genes present as single-copy orthologs in at least 90% of the species in the chosen branch of tree of life.
For the Trinity results :
```{sh,chunk.title=TRUE,eval=FALSE}
module load BUSCO/3.0.2b
/sw/apps/bioinfo/BUSCO/3.0.2b/rackham/bin/run_BUSCO.py -i /proj/uppstore2017171/courses/RNAseqWorkshop/downloads/assembly_annotation/RNAseq/trinity/Trinity.fasta -o busco_trinity -l $BUSCO_LINEAGE_SETS/arthropoda_odb9 -m tran -c 5
```
Busco will take 30 to run so you can check the results in `/proj/uppstore2017171/courses/RNAseqWorkshop/downloads/assembly_annotation/RNAseq/busco_trinity`
For the guided assembly results, you need first to extract the transcript sequences from the gtf transcript file :
```{sh,chunk.title=TRUE,eval=FALSE}
module load BioPerl
export PERL5LIB=$PERL5LIB:./GAAS/annotation/
./GAAS/annotation/Tools/bin/gff3_sp_extract_sequences.pl --cdna -g transcripts.gtf -f /proj/uppstore2017171/courses/RNAseqWorkshop/downloads/assembly_annotation/chromosome/chr4.fa -o transcripts_stringtie.fa
```
Then you can run BUSCO again :
```{sh,chunk.title=TRUE,eval=FALSE}
/sw/apps/bioinfo/BUSCO/3.0.2b/rackham/bin/run_BUSCO.py -i stringtie/transcripts_stringtie.fa -o busco_stringtie -l $BUSCO_LINEAGE_SETS/arthropoda_odb9 -m tran -c 5
```
<i class="fas fa-clipboard-list"></i> Compare the two BUSCO results, what do you think happened for StringTie?
## What's next?
Now you are ready either to annotate your RNA-Seq or you can use then to do the genome annotation.
For the de-novo assembly you can use the `Trinity.fasta` file obtained.
For the genome-guided assembly you can either use the StringTie results `transcripts.gtf`, but you will often need to reformat it into a gff file.
***