mRNAs starting with splice leader sequences
(ACTCATCCCATTTTTGAGTCCGATTTCGATTGTCTAACAG
)
provide information on the position of trans-splicing sites
in the genome. mRNAs that do not start with splice leader sequences
provide information on transcription start sites (and therefore
promoters). These two informations need to be processed separately.
Reads pairs where Read 1 is starting with splice leader sequences are moved to different files. The splice leader sequences are then trimmed. At the moment this is done with the TagDust program, but it may be better to replace it with a program with a larger user base.
The HiSat aligner used in the nf-core RNA-seq pipeline is able to process paired-end CAGE data and output BAM files that can be handled correctly by downstream tools. In addition, most of its quality control analyses are relevant to CAGE. Therefore, we use this RNA-seq pipeline to align the CAGE libraries.
CAGE libraries should not contain ribosomal RNA sequences, since the rRNAs are not capped. However, the libraries are never perfect, and a fraction of rRNA remains. This is problematic because the genome contains a lot of repeats that have a strong sequence similarity with the rRNAs, which would lead to spurious CAGE signal if the reads were not filtered.
The RNA-seq pipeline removes sequences matching the ribosomal DNA using sortmerna tool. The reference rDNA regions for the main populations of O. dioica are now available in GenBank.
The transcription start sites found in the CAGE alignments are at single-nucleotide resolution. This data is easier to analyse after clustering the TSS into peaks that represent individual promoters.
The trans-splicing sites are by definition a single-nucleotide resolution feature, but processing the data through the same clustering algorithm removes noise caused by sequencing errors
Alignments in BAM format are clustered with the Bioconductor package CAGEr. The Bioinfo user group's RStudio container provides all the C libraries needed to compile the dependencies of CAGEr.
The computed peaks are exported to BED and BED12 formats for upload in the ZENBU genome browser using the zenbu_upload command-line tool.
At the moment the raw data is uploaded in BAM format too. In principle the BED12 format should be preferred because it saves a lot of space, but I could not convert the BAM files to BED12 format because the pairedBAMtoBED12 tool does not handle the case when R1 and R2 overlap and both align to the same splice junction (long paired-end CAGE reads did not exist when the tool was developed).
More technical details are given on the README page of each library.