Sequences were downloaded from ENA (project accession PRJNA493745). See Zotero for the paper, in the folder RNAseq trainning.
Download commands were obtained from ENA and cluster options were added to the .sh file. Commands are stored in sequences/ena-file-download-20230718-1222.sh; the other *.sh files are possible alternatives for downloading.
Sequences were downloaded successfully. Checked the number of reads for each sequence by doing:
for i in *.fastq.gz; do echo $i; echo $(zcat $i | wc -l)/4 | bc; done
Results are stored in sequences/number_reads. Since the files have a lot of reads, it might be a good idea to take a subset of one million reads in order to test our pipeline. We can take two samples:
mkdir test
cd test
# Select subset. In this case, the first 4 files (2 paired-end samples):
ls ../sequences/*.fastq.gz | head -n 4 > subset
# Take the first one million reads (1,000,000 reads = 4,000,000 fastq lines) of each file and generate new .fastq files:
while read -r line; do zcat "$line" | sed -n '1,4000000p' > "$(basename "$line" .fastq.gz).fastq" & done < subset > log &
# Also compress files for extension consistency:
for i in *.fastq; do gzip $i; done
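To make sure the subsetting worked, the read-count loop from above can be reused on the new files (run inside test/); each file should report 1000000:
for i in *.fastq.gz; do echo $i; echo $(zcat $i | wc -l)/4 | bc; done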
Instead of modules, we are going to use singularity containers:
mkdir containers
cd containers
singularity pull docker://biocorecrg/rnaseq2020:1.0
Add this to your .bashrc:
export RUN="singularity exec -e $HOME/RNAseq-course/containers/rnaseq2020_1.0.sif"
Now commands can be run using $RUN programname --help.
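As a quick sanity check that the container and the $RUN variable work (assuming FastQC is included in the rnaseq2020 image, which is what the next step relies on):
$RUN fastqc --version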
FastQC can be run using:
qsub scripts/singularity/fastqc.sh test
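fastqc.sh itself is not reproduced here; stripped of cluster directives, it presumably wraps something like the following (the output folder name is an assumption):
# hypothetical core of scripts/singularity/fastqc.sh; $1 is the folder containing the .fastq.gz files
mkdir -p outfolder/fastqc
$RUN fastqc -o outfolder/fastqc $1/*.fastq.gz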
For a large number of files, it might be a good idea to run MultiQC. Since it is not inside the container, we can pull a new container:
singularity pull docker://ewels/multiqc:latest
Add it to .bashrc as above; in this case, the environment variable name is $RUN_MQC. You can modify the script fastqc.sh to include MultiQC.
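For reference, the default image name produced by the pull above is multiqc_latest.sif, so the .bashrc line would look like this (the containers path is assumed to match the one used for $RUN):
export RUN_MQC="singularity exec -e $HOME/RNAseq-course/containers/multiqc_latest.sif"
MultiQC can then aggregate the FastQC reports, for example (output folders are assumptions):
$RUN_MQC multiqc outfolder/fastqc -o outfolder/multiqc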
Useful tip: when pulling a container with Singularity, you can give the image a custom name. Example:
singularity pull skewer.sif https://depot.galaxyproject.org/singularity/skewer:0.2.2--hc9558a2_3
This image is now in the containers folder, although it might not be useful.
For trimming the reads, skewer was used. Check the parameters in scripts/singularity/skewer.sh. The following line was run:
for i in $PWD/test/*_1.fastq.gz; do qsub scripts/singularity/skewer.sh $i; done
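skewer.sh is not shown here, but given the output names used later (e.g. SRR7939021-trimmed-pair1.fastq.gz), its core command presumably resembles this sketch (thread count and output folder are assumptions; skewer is assumed to be available via $RUN):
# hypothetical core of scripts/singularity/skewer.sh; $1 is the pair1 file, pair2 is derived from it
READ1=$1
READ2=${READ1/_1.fastq.gz/_2.fastq.gz}
$RUN skewer -m pe -t 8 -z -o outfolder/trimmed_reads/$(basename "$READ1" _1.fastq.gz) "$READ1" "$READ2"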
To understand the skewer output, check the log files, and run FastQC and MultiQC again:
qsub scripts/singularity/fastqc.sh outfolder/trimmed_reads/
In the case of skewer, the log will contain something like this:
12372 ( 1.24%) trimmed read pairs available after processing
987623 (98.76%) untrimmed read pairs available after processing
This means that 1.24% of read pairs had their adapters removed, whereas in 98.76% of the read pairs no adapter was found.
Let's try both aligners: STAR and Salmon (a pseudoaligner). Download the data for the aligners using:
# genome
wget ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa.gz
# transcriptome
wget ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
# annotation
wget ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.chr.gtf.gz
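STAR needs uncompressed FASTA/GTF files, and the indexing command below points at the uncompressed names, so decompress the genome and annotation first (assuming the files were downloaded into reference/):
gunzip reference/Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa.gz reference/Homo_sapiens.GRCh38.99.chr.gtf.gz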
In the case of splice junctions, when building an index it is necessary to tell STAR the maximum number of bp for the overhang (the maximum length of the donor/acceptor sequence on each side of the junctions), which should be the read length - 1. For more information, look at how STAR deals with splice junctions. Note that depending on the read length, a value of 100 bp might suffice. You can use the next line to check the read length distribution:
zcat SRR7939021-trimmed-pair1.fastq.gz | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c | less
Check again how skewer works: there is a difference in the size distributions between the trimmed and untrimmed sequences. All reads before trimming have a length of 101 bp; this changes after trimming, where reads of different sizes (some of 31 bp) can be seen.
Since most of the reads have a length of 101 bp, the default value for --sjdbOverhang (100) will suffice. To index the reference genome using STAR, use this line:
qsub scripts/singularity/star_index.sh reference/Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa reference/Homo_sapiens.GRCh38.99.chr.gtf
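star_index.sh is not reproduced here; the STAR call inside it presumably looks something like this (thread count is an assumption; the index folder name GRCh38_index matches the one used in the alignment step below):
# hypothetical core of scripts/singularity/star_index.sh; $1 = genome FASTA, $2 = GTF
$RUN STAR --runMode genomeGenerate --genomeDir GRCh38_index --genomeFastaFiles $1 --sjdbGTFfile $2 --sjdbOverhang 100 --runThreadN 8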
Run STAR using the next line:
for i in outfolder/trimmed_reads/*-pair1.fastq.gz; do qsub scripts/singularity/star.sh GRCh38_index $i; done
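star.sh itself is not shown, but given the ReadsPerGene.out.tab files used below, it presumably wraps a call along these lines (output prefix, BAM options and thread count are assumptions; $1 = index folder, $2 = pair1 file):
# hypothetical core of scripts/singularity/star.sh
READ1=$2
READ2=${READ1/-pair1/-pair2}
$RUN STAR --genomeDir $1 --readFilesIn "$READ1" "$READ2" --readFilesCommand zcat --quantMode GeneCounts --outSAMtype BAM SortedByCoordinate --outFileNamePrefix outfolder/STAR/$(basename "$READ1" -trimmed-pair1.fastq.gz) --runThreadN 8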
This will also output the gene counts for the three types of protocols: unstranded (2nd column), stranded-forward (3rd column), and stranded-reverse (4th column).
To check the strandedness of the samples, you can look at the N_noFeature field and also use the following:
grep -v "N_" outfolder/STAR/SRR7939021ReadsPerGene.out.tab | awk '{unst+=$2;forw+=$3;rev+=$4}END{print unst,forw,rev}'
If there is a strong imbalance between the stranded-forward and stranded-reverse totals, the protocol is stranded. The stranded column with the lowest N_noFeature count corresponds to the correct strand option.
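The N_noFeature lines mentioned above can be pulled out directly, for example with:
grep "N_noFeature" outfolder/STAR/*ReadsPerGene.out.tab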
Run Salmon
Apparently there is a problem with using Ensembl cDNA to build the index. Use GENCODE instead (see https://support.bioconductor.org/p/p134094/#p134255).
Downloaded the necessary files from GENCODE using:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.transcripts.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz
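The index_salmon folder used in the next command has to be built first; assuming Salmon is available through $RUN (or a dedicated container) and that the GENCODE files were placed in reference/, the index can be built with something like:
$RUN salmon index -t reference/gencode.v44.transcripts.fa.gz -i index_salmon -k 31 --gencode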
Salmon can be run using:
for i in outfolder/trimmed_reads/*-pair1.fastq.gz; do qsub scripts/singularity/salmon.sh index_salmon $i reference/gencode.v44.annotation.gtf.gz A; done
Note that option A is set for -l (library type), which means that the library type will be automatically inferred. In this case, the library type is ISR, which is consistent with STAR's output for the counts.
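For reference, the core of salmon.sh presumably resembles the following sketch (argument order matches the qsub call above: $1 = index, $2 = pair1 file, $3 = GTF used for gene-level summaries, $4 = library type; output folder and thread count are assumptions):
# hypothetical core of scripts/singularity/salmon.sh; salmon assumed available via $RUN
INDEX=$1; READ1=$2; GTF=$3; LIBTYPE=$4
READ2=${READ1/-pair1/-pair2}
$RUN salmon quant -i $INDEX -l $LIBTYPE -1 "$READ1" -2 "$READ2" -g $GTF -o outfolder/salmon/$(basename "$READ1" -trimmed-pair1.fastq.gz) -p 8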