Software to guess the RNA-Seq library type of paired and single end read files using mapping and gene annotation.
- Background
- Dependencies
- Installation
- Result
- Usage
- File formats
- Supported header formats
- Supported interleaved formats
- Example commands
- If you have no other information just the reads
- If you only have a reference genome
- If you have reference genome and annotation
- If you have reference genome and mapped reads
- If you have reference genome and annotation and mapped reads
- If you only have transcript sequences
- If you have transcript sequences and annotation
- Other examples
- Output
- Parameters
- Overview of the pipeline
- Overview of the different library types
- Library prep methods
- External resources
- Known issues
- TO DO
- Citation
The choice of RNA-Seq library type defines the read orientation of the sequencing and the order in which the strands of cDNA are sequenced, which means that RNA-Seq reads from different library types can differ significantly. Information about the library type is very useful when reads are to be assembled into a transcriptome or mapped to a reference assembly, because the read’s relative orientation and the strand it was sequenced from help to determine where in the transcriptome shorter, ambiguous reads belong. Unfortunately, the library type is not recorded in sequencing output files and may be lost before the data are assembled. Even for RNA-Seq data from public repositories there is no guarantee that the library type information is correct or that it exists at all. GUESSmyLT aims to fix this by looking at how reads map to a reference and, together with gene annotation, guessing which library type was used to generate the data.
Developed for Unix systems. Depending on the installation approach, more or fewer dependencies will be installed automatically. Check the installation section.
Python and libraries:
- Python >3
- biopython (1.67)
- bcbio-gff (0.6.4) - handling gff annotation
- pysam (0.15.1) - handling mapped reads
Other programs:
- Snakemake (5.4.0) - Workflow management
- BUSCO (3.0.2) - Gene annotation
- Bowtie2 (2.3.4.3) - Mapping
- Trinity (2.8.4) - Reference assembly
Others:
- Prokaryote and eukaryote BUSCO datasets (from https://busco.ezlab.org) included in the package.
First, you must have Docker installed and running.
Second, have a look at the available GUESSmyLT biocontainers at quay.io.
Then:
# get the chosen GUESSmyLT container version
docker pull quay.io/biocontainers/guessmylt:0.2.5--py_0
# run GUESSmyLT
docker run quay.io/biocontainers/guessmylt:0.2.5--py_0 GUESSmyLT
First, you must have Singularity installed and running.
Second, have a look at the available GUESSmyLT biocontainers at quay.io.
Then:
# get the chosen GUESSmyLT container version
singularity pull docker://quay.io/biocontainers/guessmylt:0.2.5--py_0
# run the container
singularity run guessmylt_0.2.5--py_0.sif
With an activated Bioconda channel (see 2. Set up channels), install with:
conda install guessmylt
Installation using pip will not install BUSCO, Bowtie2 and Trinity. These external programs can be installed using conda.
pip install GUESSmyLT
Installation using git will not install BUSCO, Bowtie2 and Trinity. These external programs can be installed using conda.
Clone the repository and move to the folder:
git clone https://github.com/NBISweden/GUESSmyLT.git
cd GUESSmyLT/
Launch the installation with either:
python setup.py install
Or if you do not have administrative rights on your machine:
python setup.py install --user
Executing:
GUESSmyLT
or
GUESSmyLT -h
to display help.
There is also an example run that takes roughly 5 mins. A folder called GUESSmyLT_example_out will be created in the working directory:
GUESSmyLT-example
The results are printed to stdout and written to a result file. One example of a result would be:
Results of paired library inferred from reads:
Library type Relative orientation Reads Percent Vizualization according to firststrand
ff_firststrand matching 5 0.1% 3' <==2==----<==1== 5'
5' ---------------- 3'
ff_secondstrand matching 3 0.0% 3' ---------------- 5'
5' ==1==>----==2==> 3'
fr_firststrand inward 4167 47.7% 3' ----------<==1== 5'
5' ==2==>---------- 3'
fr_secondstrand inward 4521 51.7% 3' ----------<==2== 5'
5' ==1==>---------- 3'
rf_firststrand outward 19 0.2% 3' <==2==---------- 5'
5' ----------==1==> 3'
rf_secondstrand outward 23 0.3% 3' <==1==---------- 5'
5' ----------==2==> 3'
undecided NA 1 0.0% 3' -------??------- 5'
5' -------??------- 3'
Roughly 50/50 split between the strands of the same library orientation should be interpreted as unstranded.
Based on the orientations of the reads we would assume that the library type is fr-unstranded as there is roughly a 50-50 split between fr-first and fr-second.
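The interpretation rule above can be sketched in Python. This is a minimal illustration and not GUESSmyLT's own inference code: the function name `call_library_type` and both thresholds (`dominance` and `balance`) are assumptions picked for the example.

```python
def call_library_type(counts, dominance=0.8, balance=0.35):
    """Summarise per-library read counts into a single library type call.

    counts: dict such as {"fr_firststrand": 4167, "fr_secondstrand": 4521, ...}.
    dominance: hypothetical minimum share of reads one orientation (ff/fr/rf) must hold.
    balance: hypothetical minimum share the weaker strand needs, within that
             orientation, for the library to be called unstranded.
    """
    # Reads without a decided orientation carry no information for the call.
    decided = {name: n for name, n in counts.items() if name != "undecided"}
    total = sum(decided.values())
    if total == 0:
        return "undecided"

    # Add up both strands of each orientation (ff, fr, rf).
    per_orientation = {}
    for name, n in decided.items():
        orientation = name.split("_")[0]
        per_orientation[orientation] = per_orientation.get(orientation, 0) + n

    best = max(per_orientation, key=per_orientation.get)
    if per_orientation[best] / total < dominance:
        return "undecided"

    first = decided.get(f"{best}_firststrand", 0)
    second = decided.get(f"{best}_secondstrand", 0)
    if first + second == 0:
        return "undecided"

    # A roughly even split between the two strands means unstranded.
    if min(first, second) / (first + second) >= balance:
        return f"{best}-unstranded"
    return f"{best}-firststrand" if first > second else f"{best}-secondstrand"


example = {
    "ff_firststrand": 5, "ff_secondstrand": 3,
    "fr_firststrand": 4167, "fr_secondstrand": 4521,
    "rf_firststrand": 19, "rf_secondstrand": 23,
    "undecided": 1,
}
print(call_library_type(example))  # fr-unstranded
```

With the counts from the example result, the fr orientation holds almost all reads and the two strands are split about 48/52, so the sketch reports fr-unstranded.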
- Read files: .fastq
- Mapping: .bam
- Reference: .fa
Tested with Old/New Illumina headers and with downloads from SRA. Other fastq header formats should work, but not all of the formats listed here have been tested: https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/
Interleaved read files are supported if the headers are in Old/New Illumina format or if the reads are alternating.
Old Illumina: @HWUSI-EAS100R:6:73:941:1973#0/1
New Illumina: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
Alternating:
@read1 (first mate)
..
@read1 (second mate)
..
@read2 (first mate)
..
@read2 (second mate)
..
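To make the three cases concrete, here is a small Python sketch that reads the mate number from Old/New Illumina headers and splits an alternating file by position. It is an illustration only, not the code GUESSmyLT uses: the regular expressions and the helper names `mate_number` and `deinterleave` are assumptions, and every record is assumed to span exactly four lines.

```python
import re

# Old Illumina: the mate number follows a trailing slash, e.g. ".../1".
OLD_ILLUMINA = re.compile(r"^@\S+/([12])$")
# New Illumina: the mate number opens the second field, e.g. "... 1:Y:18:ATCACG".
NEW_ILLUMINA = re.compile(r"^@\S+ ([12]):[YN]:\d+:\S*$")


def mate_number(header):
    """Return 1 or 2 if the header states its mate, otherwise None."""
    for pattern in (OLD_ILLUMINA, NEW_ILLUMINA):
        match = pattern.match(header.strip())
        if match:
            return int(match.group(1))
    return None


def deinterleave(lines):
    """Split alternating 4-line FASTQ records into (mate1, mate2) record lists."""
    stripped = [line.rstrip("\n") for line in lines]
    while stripped and stripped[-1] == "":
        stripped.pop()  # tolerate trailing blank lines
    if len(stripped) % 8 != 0:
        raise ValueError("expected a whole number of read pairs (8 lines per pair)")
    mate1, mate2 = [], []
    for i in range(0, len(stripped), 8):
        mate1.append(tuple(stripped[i:i + 4]))
        mate2.append(tuple(stripped[i + 4:i + 8]))
    return mate1, mate2


print(mate_number("@HWUSI-EAS100R:6:73:941:1973#0/1"))  # 1
print(mate_number("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))  # 1
```

When the headers carry no mate information, as in the alternating layout, position in the file is the only thing that distinguishes first mates from second mates, which is why `deinterleave` works purely by record order.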
In addition to your fastq RNA-Seq read file(s) (compressed or uncompressed):
Example with paired reads in eukaryote.
GUESSmyLT --reads read_1.fastq read_2.fastq
Example with paired reads in eukaryote.
GUESSmyLT --reads read_1.fastq read_2.fastq --reference ref.fa --mode genome --organism euk
Example with paired reads in eukaryote.
GUESSmyLT --reads read_1.fastq read_2.fastq --reference ref.fa --mode genome --annotation annotation.gff --organism euk
Example with paired reads in eukaryote.
GUESSmyLT --reads read_1.fastq read_2.fastq --reference ref.fa --mode genome --mapped mapped.bam --organism euk
Example with paired reads in eukaryote.
GUESSmyLT --reads read_1.fastq read_2.fastq --reference ref.fa --mode genome --mapped mapped.bam --annotation annotation.gff --organism euk
/!\ Not yet implemented (use genome mode instead; it should work anyway). Example with paired reads in eukaryote.
GUESSmyLT --reads read_1.fastq read_2.fastq --reference ref.fa --mode transcriptome --organism euk
/!\ Not yet implemented (use genome mode instead; it should work anyway). Example with paired reads in eukaryote. (The annotation has to be the annotation of the transcriptome, not of the genome.)
GUESSmyLT --reads read_1.fastq read_2.fastq --reference ref.fa --mode transcriptome --annotation annotation.gff --organism euk
Paired end reads and a reference, with a specified number of subsampled reads. Output is directed to an existing directory.
GUESSmyLT --reads read_1.fastq read_2.fastq --organism pro --reference ref.fa --subsample 100000 --output my_output/
GUESSmyLT --reads reads.fastq --reference ref.fa --organism pro
cd GUESSmyLT/
python3 GUESSmyLT.py --reads reads.fastq --reference ref.fa --organism euk
GUESSmyLT will print the result in the command line as well as write it to a file:
[output_dir]/result_[read_name]on_[refname].txt
Results from intermediate steps, such as the mapping from Bowtie2 or the annotation from BUSCO, are saved in
[output_dir]/intermediate_data/
Parameter | Input | Description |
---|---|---|
--reads | .fastq file(s) | Full path(s) to RNA-Seq read file(s). Can be compressed or uncompressed. Order is not important. Accepts two paired end read files, one interleaved read file, or one single end read file. |
--organism | euk or pro | Eukaryote or prokaryote (euk/pro) is an option needed for the BUSCO annotation. |
Parameter | Input | Description |
---|---|---|
--subsample | Even integer | Number of reads that will be used for subsampling. |
--reference | .fa file | Full path to the reference genome/transcriptome that reads are mapped to (nucleotide fasta file). |
--mode | genome or transcriptome | When no annotation is provided, tells the program whether the reference fasta file should be treated as a genome or a transcriptome, so that BUSCO is run properly. |
--threads | Integer | Number of threads to use. (default 2) |
--memory | Number of GB ex: 10G | Maximum memory that can be used in GB. (default 8G) |
--annotation | .gff file | Full path to annotation file for skipping BUSCO step. |
--mapped | Sorted .bam file | Full path to mapped read file for skipping Bowtie2 step. |
--output | File path | Full path to result file. If left out, files will be written to the working directory. |
GUESSmyLT uses Snakemake to build the pipeline it needs in order to predict the library type. Required arguments are organism (euk/pro) and reads (read file(s) in fastq format). A reference (genome or transcriptome in .fasta format) is optional; if it is not provided, Trinity is first executed to create a de novo assembly of the reads. Next, BUSCO is used for annotation. This is also a QC step, because BUSCO looks for core genes, so-called BUSCOs, in the reference. If none can be found, the reference is probably of poor quality and the pipeline terminates. If BUSCOs are found, the process continues with mapping the reads to the reference using Bowtie2. The mapping is done with the unstranded option so that reads can map to both strands and in both directions. Finally, the mapping and annotation are used for inference, which is done with a Python script, and the library type is returned.
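The way optional inputs shorten the pipeline can be sketched as a small Python function. This is only an illustration of the logic described above; in GUESSmyLT the decision is made by Snakemake, and the function name `plan_steps` and the step labels are hypothetical.

```python
def plan_steps(reference=None, annotation=None, mapped=None):
    """Return the pipeline steps needed for the inputs that were provided.

    Each argument is a file path, or None when the user did not supply it.
    Step names are illustrative labels, not Snakemake rule names.
    """
    steps = []
    if reference is None:
        steps.append("trinity_assembly")   # build a de novo reference from the reads
    if annotation is None:
        steps.append("busco_annotation")   # annotate core genes, doubles as QC
    if mapped is None:
        steps.append("bowtie2_mapping")    # map reads to the reference, unstranded
    steps.append("inference")              # always needed to produce the result
    return steps


print(plan_steps())
# ['trinity_assembly', 'busco_annotation', 'bowtie2_mapping', 'inference']
print(plan_steps(reference="ref.fa", annotation="annotation.gff", mapped="mapped.bam"))
# ['inference']
```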
On top of Snakemake, we have a python script, GUESSmyLT.py. Its purpose is to handle user arguments by:
- Checking that arguments are correct and that files exist and are in the correct format.
- Telling Snakemake what files exist by updating the config file.
- Executing snakemake.
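A wrapper of this kind might look roughly like the sketch below. It is not the contents of GUESSmyLT.py: the option names and the defaults for threads and memory come from the parameter tables, while the function names, the config keys and the `file_exists` hook are assumptions made for the example. The hand-over to Snakemake is left out.

```python
import argparse
import os


def parse_args(argv):
    """Parse a subset of the documented command-line options."""
    parser = argparse.ArgumentParser(prog="GUESSmyLT-sketch")
    parser.add_argument("--reads", nargs="+", required=True)
    parser.add_argument("--organism", choices=["euk", "pro"], required=True)
    parser.add_argument("--reference")
    parser.add_argument("--mode", choices=["genome", "transcriptome"])
    parser.add_argument("--subsample", type=int)
    parser.add_argument("--threads", type=int, default=2)
    parser.add_argument("--memory", default="8G")
    return parser.parse_args(argv)


def build_config(args, file_exists=os.path.isfile):
    """Validate the arguments and return a config dict (keys are hypothetical)."""
    if len(args.reads) > 2:
        raise ValueError("expected one or two read files")
    inputs = list(args.reads)
    if args.reference:
        inputs.append(args.reference)
    for path in inputs:
        if not file_exists(path):
            raise FileNotFoundError(path)
    # The parameter table asks for an even number of subsampled reads.
    if args.subsample is not None and (args.subsample <= 0 or args.subsample % 2):
        raise ValueError("--subsample must be a positive even integer")
    return {
        "reads": list(args.reads),
        "organism": args.organism,
        "reference": args.reference,   # None means no reference was given
        "mode": args.mode,
        "subsample": args.subsample,
        "threads": args.threads,
        "memory": args.memory,
    }
```

The `file_exists` parameter only exists so that the check can be replaced when trying the sketch without real read files, for example `build_config(args, file_exists=lambda path: True)`.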
The Snakefile subsample handles preparation of the readfiles:
- Subsamples reads into new read files that are used in the analysis. This makes GUESSmyLT faster and protects the original files from being modified.
- Modifies files:
  a. Converts read headers that are in the wrong format. Trinity and Pysam can only handle the old Illumina format, @read_ID/pair#, where pair# is 1 or 2. They do not work with whitespace, punctuation or underscores, so the script converts the headers into the correct format.
  b. Deinterleaves paired end read files if they are interleaved.
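The header conversion can be illustrated with a short Python sketch. It is one possible reading of the rule above, not the code in the Snakefile: the helper name `to_old_illumina` is hypothetical, and dropping every non-alphanumeric character is an assumption that could in principle make two different read IDs collide.

```python
import re


def to_old_illumina(header, mate):
    """Rewrite a FASTQ header as @read_ID/pair#, with pair# set to 1 or 2."""
    if mate not in (1, 2):
        raise ValueError("mate must be 1 or 2")
    # Keep only the first whitespace-delimited token, without the leading "@".
    read_id = header.strip().lstrip("@").split()[0]
    # Drop an existing old-style mate suffix so it is not duplicated.
    read_id = re.sub(r"/[12]$", "", read_id)
    # Remove punctuation and underscores (assumed cleaning rule).
    read_id = re.sub(r"[^A-Za-z0-9]", "", read_id)
    return f"@{read_id}/{mate}"


print(to_old_illumina("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG", 1))
# @EAS139136FC706VJ2210415343197393/1
```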
Kit | Description | Paired | Stranded | Strand according to mRNA | Strand according to first strand |
---|---|---|---|---|---|
TruSeq RNA Sample Prep kit | | Yes | No | fr-unstranded | |
SMARTer ultralow RNA protocol | | Yes | No | fr-unstranded | |
All dUTP methods, NSR, NNSR | | Yes | Yes | RF | fr-firststrand |
TruSeq Stranded Total RNA Sample Prep Kit | | Yes | Yes | RF | fr-firststrand |
TruSeq Stranded mRNA Sample Prep Kit | | Yes | Yes | RF | fr-firststrand |
NEB Ultra Directional RNA Library Prep Kit | | Yes | Yes | RF | fr-firststrand |
Agilent SureSelect Strand-Specific | | Yes | Yes | RF | fr-firststrand |
Directional Illumina (Ligation) | | Yes | Yes | FR | fr-secondstrand |
Standard SOLiD | | Yes | Yes | FR | fr-secondstrand |
ScriptSeq v2 RNA-Seq Library Preparation Kit | | Yes | Yes | FR | fr-secondstrand |
SMARTer Stranded Total RNA | | Yes | Yes | FR | fr-secondstrand |
Encore Complete RNA-Seq Library Systems | | Yes | Yes | FR | fr-secondstrand |
NuGEN SoLo | | Yes | Yes | FR | fr-secondstrand |
Illumina ScriptSeq | | Yes | Yes | FR | fr-secondstrand |
SOLiD mate-pair protocol | ff |
--rf orientation are produced using the Illumina mate-pair protocol?
https://chipster.csc.fi/manual/library-type-summary.html
https://galaxyproject.org/tutorials/rb_rnaseq/
http://onetipperday.sterding.com/2012/07/how-to-tell-which-library-type-to-use.html
https://sailfish.readthedocs.io/en/master/library_type.html
https://rnaseq.uoregon.edu
https://www.researchgate.net/post/What_is_the_difference_between_strand-specific_and_not_strand-specific_RNA-seq_data
- Complains about gzip broken pipe when subsampling with compressed files (but works anyway).
- BUSCO sometimes loses the config path. Fix manually in terminal:
export AUGUSTUS_CONFIG_PATH=~/miniconda3/pkgs/augustus-3.2.3-boost1.60_0/config
- BUSCO might not find any core genes. Fix by using more reads or by providing reference.
- Mapping, annotation, assembly or the entire pipeline is skipped. This is most likely because Snakemake checks which output files need to be generated and only performs the steps of the pipeline needed to produce them. As a result, if you already have a .bam file, a BUSCO/Trinity output folder or a result .txt file for the reads, Snakemake will skip the corresponding steps.
- Installing Trinity for macOS via Conda will give you a version from 2011 that doesn't work. Install it using Homebrew instead.
* Add Travis using example data provided as reference.
* Look more into why some reads get undecided orientation. This happens when a read's mate cannot be found, probably because the read is at the end of a gene and its mate is outside of the selected region.
If you use GUESSmyLT in your work, please cite us:
Berner Wik E.*,1, Olin H.*,1, Vigetun Haughey C.*,1, Lisa Klasson1, Jacques Dainat2,3
*These authors contributed equally to the work.
1Molecular Evolution, Department of Cell and Molecular Biology, Uppsala University, 75124 Sweden.
2National Bioinformatics Infrastructure Sweden (NBIS), SciLifeLab, Uppsala Biomedicinska Centrum (BMC), Husargatan 3, S-751 23 Uppsala, SWEDEN.
3IMBIM - Department of Medical Biochemistry and Microbiology, Box 582, S-751 23 Uppsala, SWEDEN.