GUESSmyLT

Software to guess the RNA-Seq library type of paired-end and single-end read files using mapping and gene annotation.

Background

The choice of RNA-Seq library type defines the read orientation of the sequencing and the order in which the strands of cDNA are sequenced, which means that RNA-Seq reads from different library types can differ significantly. Knowing the library type is very useful when reads are assembled into a transcriptome or mapped to a reference assembly, because a read's relative orientation and the strand it was sequenced from help to discern where in the transcriptome shorter, ambiguous reads belong. Unfortunately, the library type is not recorded in sequencing output files and may be lost before the data are assembled. Even for RNA-Seq data from public repositories there is no guarantee that the library type information is correct, or that it exists at all. GUESSmyLT aims to fix this by looking at how reads map to a reference and, together with gene annotation, guessing which library type was used to generate the data.

Dependencies:

Developed for Unix systems. Depending on the installation approach, more or fewer dependencies will be installed automatically. See the Installation section.

Python and libraries:

Python >= 3
biopython (1.67)
bcbio-gff (0.6.4) - handling gff annotation
pysam (0.15.1) - handling mapped reads

Other programs:

Snakemake (5.4.0) - Workflow management
BUSCO (3.0.2) - Gene annotation
Bowtie2 (2.3.4.3) - Mapping
Trinity (2.8.4) - Reference assembly

Others:

Prokaryote and eukaryote BUSCO datasets (from https://busco.ezlab.org) are included in the package.

Installation

Using Docker

First, you must have Docker installed and running.
Second, have a look at the available GUESSmyLT biocontainers at quay.io.
Then:

# get the chosen GUESSmyLT container version
docker pull quay.io/biocontainers/guessmylt:0.2.5--py_0
# run GUESSmyLT
docker run quay.io/biocontainers/guessmylt:0.2.5--py_0 GUESSmyLT

Using Singularity

First, you must have Singularity installed and running. Second, have a look at the available GUESSmyLT biocontainers at quay.io.
Then:

# get the chosen GUESSmyLT container version
singularity pull docker://quay.io/biocontainers/guessmylt:0.2.5--py_0
# run the container (the .sif file name follows the tag that was pulled)
singularity run guessmylt_0.2.5--py_0.sif

Installation with conda:

With the Bioconda channel set up (see "2. Set up channels" in the Bioconda documentation), install with:

conda install guessmylt

Installation with pip:

Installation using pip will not install BUSCO, Bowtie2 and Trinity. These external programs can be installed using conda.

pip install GUESSmyLT

Installation with git:

Installation using git will not install BUSCO, Bowtie2 and Trinity. These external programs can be installed using conda.

Clone the repository and move to the folder:

git clone https://github.com/NBISweden/GUESSmyLT.git
cd GUESSmyLT/

Launch the installation with either:

python setup.py install

Or if you do not have administrative rights on your machine:

python setup.py install --user

Check installation

To display the help message, execute either:

GUESSmyLT

or

GUESSmyLT -h

There is also an example run that takes roughly 5 minutes. It creates a folder called GUESSmyLT_example_out in the working directory:

GUESSmyLT-example
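Because the pip and git installation routes do not install BUSCO, Bowtie2 and Trinity, it can also help to confirm that the external programs are reachable on the PATH. The following is a minimal sketch, not part of GUESSmyLT itself. The executable names (in particular run_BUSCO.py for BUSCO) are assumptions and may differ depending on how the programs were installed.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
DEFAULT_EXECUTABLES = ("GUESSmyLT", "bowtie2", "Trinity", "run_BUSCO.py")


def find_missing(executables=DEFAULT_EXECUTABLES, which=shutil.which):
    """Return the executables that cannot be found on the PATH.

    `which` is injectable so the lookup can be replaced, e.g. in tests.
    """
    return [name for name in executables if which(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All expected executables were found.")
```

Any program reported as missing can then be installed, for example with conda.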

Result

The results are printed to stdout and written to a result file. An example result:

Results of paired library inferred from reads:

   Library type    Relative orientation         Reads     Percent    Vizualization according to firststrand

 ff_firststrand                matching             5        0.1%    3' <==2==----<==1== 5'
                                                                     5' ---------------- 3'


ff_secondstrand                matching             3        0.0%    3' ---------------- 5'
                                                                     5' ==1==>----==2==> 3'


 fr_firststrand                  inward          4167       47.7%    3' ----------<==1== 5'
                                                                     5' ==2==>---------- 3'


fr_secondstrand                  inward          4521       51.7%    3' ----------<==2== 5'
                                                                     5' ==1==>---------- 3'


 rf_firststrand                 outward            19        0.2%    3' <==2==---------- 5'
                                                                     5' ----------==1==> 3'


rf_secondstrand                 outward            23        0.3%    3' <==1==---------- 5'
                                                                     5' ----------==2==> 3'


      undecided                      NA             1        0.0%    3' -------??------- 5'
                                                                     5' -------??------- 3'
								     
A roughly 50/50 split between the two strands of the same library orientation should be interpreted as unstranded.

Based on the read orientations above, we would assume that the library type is fr-unstranded, as there is a roughly 50/50 split between fr_firststrand and fr_secondstrand.
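The interpretation above can be sketched in a few lines of Python. This is an illustration only and not the logic GUESSmyLT uses internally: the function name, the 80% dominance threshold and the 10-percentage-point tolerance around a 50/50 split are all assumptions chosen for the example.

```python
def summarize_library_type(counts, min_fraction=0.8, balance_tolerance=0.1):
    """Turn per-library-type read counts into a single call.

    counts: dict such as {"fr_firststrand": 4167, "fr_secondstrand": 4521}.
    min_fraction and balance_tolerance are illustrative thresholds,
    not values taken from GUESSmyLT.
    """
    total = sum(counts.values())
    if total == 0:
        return "undecided"

    # Add up reads per orientation prefix (ff, fr, rf); skip "undecided".
    by_orientation = {}
    for name, n in counts.items():
        if "_" not in name:
            continue
        orientation = name.split("_", 1)[0]
        by_orientation[orientation] = by_orientation.get(orientation, 0) + n
    if not by_orientation:
        return "undecided"

    best = max(by_orientation, key=by_orientation.get)
    best_total = by_orientation[best]
    if best_total == 0 or best_total / total < min_fraction:
        return "undecided"

    first = counts.get(best + "_firststrand", 0)
    second = counts.get(best + "_secondstrand", 0)
    # A first-strand share close to 0.5 is read as unstranded.
    if abs(first / best_total - 0.5) <= balance_tolerance:
        return best + "-unstranded"
    return best + ("-firststrand" if first > second else "-secondstrand")


example = {
    "ff_firststrand": 5, "ff_secondstrand": 3,
    "fr_firststrand": 4167, "fr_secondstrand": 4521,
    "rf_firststrand": 19, "rf_secondstrand": 23,
    "undecided": 1,
}
print(summarize_library_type(example))  # fr-unstranded
```

With the counts from the example result, the fr orientation holds over 99% of the reads and its first-strand share is about 48%, so the sketch reports fr-unstranded.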

Usage

File formats

Read files: .fastq
Mapping: .bam
Reference: .fa

Supported header formats

Tested with Old/New Illumina headers and with downloads from SRA. The other fastq header formats listed at https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/ should work, but not all of them have been tested.

Supported interleaved formats

Interleaved read files are supported if the headers are in Old/New Illumina format or if the mates alternate.

Old Illumina: @HWUSI-EAS100R:6:73:941:1973#0/1
New Illumina: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
Alternating:
@read1 (first mate)
..
@read1 (second mate)
..
@read2 (first mate)
..
@read2 (second mate)
..
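The header conventions above can be recognized with two regular expressions. The sketch below is a hypothetical helper, not GUESSmyLT's own detection code: the function names and patterns are assumptions based only on the example headers shown here, and real data may need looser patterns.

```python
import re

# Patterns derived from the two example headers above (assumptions).
OLD_ILLUMINA = re.compile(r"^@(\S+)/([12])$")
NEW_ILLUMINA = re.compile(r"^@(\S+) ([12]):[YN]:\d+:\S*$")


def parse_header(header):
    """Return (read_id, mate_number or None, format_name) for a fastq header."""
    header = header.strip()
    match = OLD_ILLUMINA.match(header)
    if match:
        return match.group(1), int(match.group(2)), "old_illumina"
    match = NEW_ILLUMINA.match(header)
    if match:
        return match.group(1), int(match.group(2)), "new_illumina"
    # Fallback: use the first token after '@' as the read ID.
    tokens = header[1:].split()
    return (tokens[0] if tokens else ""), None, "unknown"


def looks_interleaved(headers):
    """Check that headers come in consecutive pairs that share a read ID.

    When mate numbers are present, they must be 1 followed by 2.
    """
    parsed = [parse_header(h) for h in headers]
    if len(parsed) < 2 or len(parsed) % 2:
        return False
    for (id1, mate1, _), (id2, mate2, _) in zip(parsed[0::2], parsed[1::2]):
        if id1 != id2:
            return False
        if mate1 is not None and mate2 is not None and (mate1, mate2) != (1, 2):
            return False
    return True


print(parse_header("@HWUSI-EAS100R:6:73:941:1973#0/1"))
print(looks_interleaved(["@read1", "@read1", "@read2", "@read2"]))
```

Only the header lines of a fastq file (every fourth line, starting with the first) should be passed to looks_interleaved.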