In many of our lessons we have been working with RNASeq data from Saccharomyces cerevisiae. In the transcriptome assembly lesson, which we didn't run through this year (yet, at least), we would have assembled these RNA reads into contigs using Trinity, a popular transcriptome assembler. Here we will compare the results of that assembly to the reference RNA sequences from NCBI (specifically the GCF000146045.2_R64_rna.fna.gz file from here) while getting some more practice at the command line!
- Explore some more bash commands in practical usage examples (e.g.
grep
,sed
,cut
,comm
,cat
) - Practice stringing together multiple commands with pipes (
|
) step-by-step - Use blast to compare an assembled transcriptome to its reference transcriptome
- Run a "remote" blast from the command line
You should still have your jetstream instance running, you can follow the instructions here to log in to JetStream and find your instance. Then ssh
into it following the instructions here.
We will be using blast here, and need one more package for being able to BLAST remotely (send sequences to their servers from the command line). So let's make a new environment, install those, and enter our new environment:
conda create -y -n blast_env -c conda-forge -c bioconda blast gnutls
conda activate blast_env
Let's set up a new working directory and download the files we'll be working with:
cd ~
mkdir more-unix-fun/
cd more-unix-fun/
curl -L https://ndownloader.figshare.com/files/16200176 -o our-transcriptome-assembly.fa
curl -L https://ndownloader.figshare.com/files/16218335 -o ref-transcripts.fa
Out of curiousity, let's see how many contigs we assembled vs how many open-reading frames there are in the reference we have:
grep -c ">" our-transcriptome-assembly.fa
grep -c ">" ref-transcripts.fa
QUICK QUESTION!What might be some of the reasons for the large difference between the number of transcripts we assembled and the number of reference transcripts?
Possible Causes(This is not an exhaustive list of possibilities.)
- not all transcripts may have been expressed at the time of sampling
- even expressed transcripts at low relative abundance may not have been amplified and sequenced
- not all transcripts that were expressed and sequenced may have assembled successfully
Even though we assembled fewer transcripts than the reference holds, let's use BLAST and the command line to try to find out if all of what we did assemble can actually be found in our reference!
Here we are going to make our blast database using our original reference fasta. Then we are going to blast the assembled transcripts against it.
makeblastdb -dbtype nucl -in ref-transcripts.fa -out ref-transcripts.blastdb
CODE BREAKDOWN
makeblastdb
- this is our command to make a blast database out of our reference fasta file
-dbtype nucl
- here we are specifying there are nucleotides-in
- specifying the input file-out
- naming the prefix of the output database files
Now we're going to try to align our assembled contigs against the reference. There are a lot of options for blastn
that can be seen by running blastn -help
, and there's an explanation of what we've done here in the following code breakdown block.
blastn -query our-transcriptome-assembly.fa -db ref-transcripts.blastdb \
-max_target_seqs 1 -max_hsps 1 \
-out assembly-to-ref-blastout.tsv \
-outfmt "6 qseqid qlen sseqid slen length pident evalue bitscore"
CODE BREAKDOWN
blastn
- this is our base command, to blast nucleotide queries against a nucleotide database (that's what a blastn is)
-query
- specifies our input fasta-db
- specifies our database we just made-max_target_seqs 1
- sets the maximum number of reference sequences returned to 1-max_hsps 1
- sets maximum highest-scoring pairs returned (since BLAST is a local alignment, you can get more than one alignment returned between one query and one reference)-out
- specifies name of output file-outfmt
- specifies the output format
Now let's take a look (remember we can exit less
by pressing the q
key):
column -t assembly-to-ref-blastout.tsv | less -NS
Code breakdown:
column
– thecolumn
command with the-t
flag will attempt to keep columns together based on the tabs. This can be convenient sometimes.less
– since there are a lot of lines here, we are piping (|
) the output intoless
- the
-NS
flags we are providing add line numbers and prevent lines from softwrapping at the edge of the terminal.
Let's see how many of our assembled contigs successfully aligned to the orf reference fasta.
wc -l assembly-to-ref-blastout.tsv
Remember that wc -l
tells us how many lines there are in the file. 3,262 of our contigs successfully aligned to the orf reference fasta. But we had a total of 3,323 contigs from our assembly as we saw with grep -c ">" our-transcriptome-assembly.fa
, so some didn't successfully align to the reference.
Here is one way we can do a quick calculation at the command line:
echo "3323-3262" | bc
This tells us we have 61 contigs from our assembly that did not successfully align to the reference. Let's find out what they are!
Here are the steps we'll take to get the sequences that didn't align, and then try to find out what they are:
- get all the names of the contigs from our assembly
- get all the names of the contigs from our assembly that were reported as "hits" in the blast output
- compare these to figure out which contigs from our assembly are not in the list of contigs reported as successfully aligning to the reference
- use this new list of contig names to pull out their sequences in fasta format
- blast the sequences against NCBI's nr database "remotely" (from the command line, sending our sequences to the NCBI servers)
grep ">" our-transcriptome-assembly.fa | tr -d ">" | cut -f1 -d " " | sort > all-assembly-contig-IDs.txt
Code breakdown:
grep
- Just like we usedgrep
above to count how many sequences we had by providing the-c
flag, here we are leaving off the-c
flag so that it will pull out the lines that match. Since fasta format dictates the only place the ">" character appears is in front of sequence headers, we can pull out all head lines with that.tr
- We then "pipe" (|
) that output into a command calledtr
, which is useful for manipulating indvidual characters. Here we are using to delete all of the ">" in the file (so our names are just names without the ">" in front of them, like they are in the blast output file).cut
- We then pipe that into thecut
command, which is good for manipulating columns. Here, we're going to use it to cut the first column (-f1
) setting the delimiter to a space (-d " "
)
- this is because blast automatically cut the trailing space off of our sequence headers, and we want to match their format
- you can see this by running
head assembly_to_orf_coding_blastnout_with_header.tsv
andhead yeast-transcriptome-assembly.fa
sort
- We use sort here because the command we're going to use later to compare our two lists of headers needs them to be sorted in the same fashion, and running this on both will ensure that>
– We then redirect the output to a new file called "all-assembly-contigs.txt"
2. Get all the names of the contigs from our assembly that were reported as "hits" in the blast output
cut -f1 assembly-to-ref-blastout.tsv | sort > all-assembly-contig-IDs-with-hits.txt
Code breakdown:
cut
– Here we are usingcut
to cut the first column (the sequence name) from the blast output file.
- Note that here we didn't need to provide the
-d
flag to set the delimiter. This is because by defaultcut
sets the delimiter to tabs.sort
– We then pipe the output fromcut
intosort
- as mentioned above, to compare the two lists of names the way we are going to below requires that the lists be sorted in the same fashion, and running
sort
on both of them ensures that>
– We then redirect the output to a new file called "all-assembly-contig-hits.txt"
3. Compare these to figure out which contigs from our assembly are not in the list of contigs reported as successfully aligning to the reference
We can use the comm
command (compare with an extra "m" for some reason...) to quickly find out which sequences we assembled that didn't successfully align to the reference transcriptome.
comm -23 all-assembly-contig-IDs.txt all-assembly-contig-IDs-with-hits.txt > all-assembly-contig-IDs-that-did-not-hit-ref.txt
Code breakdown:
comm
– this command compares the lines in two files and by default returns 3 columns:
- The lines unique to file 1 (first file given)
- The lines unique to file 2 (second file given)
- The lines common to both files
- Looking at the manual for it with
man comm
we can see you can suppress columns by provide them as arguments- So by providing the flags
-23
we are saying to keep those, all we want are the lines (contig names) in file 1 that are not in file 2. This gives us all the contig names that we assembled but that did not successfully align to the reference.>
– We then redirect the output to a new file called "all-assembly-contigs-that-did-not-hit-ref.txt"
And this should hold the 61 contig names that we're looking for (sanity check):
wc -l all-assembly-contig-IDs-that-did-not-hit-ref.txt
Stellar 🙂
Now let's use grep
to pull out the sequences!
NOTE: This requires that the fasta file is in "true" fasta form – meaning that each entry (sequence) takes up two lines. Sometimes, fasta files are formatted with what are known as softwraps. There is a one-liner at the end of this page to convert them if that is the case with your data.
If we wanted to pull one sequence, we could do it like this:
grep -A1 "TRINITY_DN501_c0_g1_i1" our-transcriptome-assembly.fa
for header in $(cat all-assembly-contig-IDs-that-did-not-hit-ref.txt)
do
grep -A1 "$header" our-transcriptome-assembly.fa
done
Code breakdown: The special words of the loop are
for
,in
,do
, anddone
.
for
– here we are specifying the variable name we want to use ("header"), this could be anythingin
– here we are specifying what we are going to loop through, in this case it is every line of the "all-assembly-contigs-that-did-not-hit-ref.txt" file
$(cat ...)
– this is a special type of notation in shell. The operation within the parentheses here will be performed and the output of that replaces the whole thing. It would be the same as if we typed out each line in the file with a space in between themdo
– indicates we are about to state the command(s) we want to be performed on each item we are looping through
grep
– just like with the individual example above, for each sequence name in our file we are pulling out the header line and the following line containing the sequencedone
– tells the loop it is finished
This just printed to the screen for right now, but let's grab one and take it to NCBI and do a web blast to see what we get.
>TRINITY_DN939_c0_g1_i1 len=900 path=[0:0-899]
TTCGACTATGATACTAATTCGCTGTGCACCCAGTGGGAGGATGTTATTTCGCTTACATGCCATATGCTGTATATTCTTGCGTTAACGGTTTCCGCGTGCGATCTCTCTTGTGTTCAGACGAGGCCCAATTGAGCACCATCCCCCTCGGGTAGTTTCCCGATCAAACTGGAAGATAGGCGTCTTTACTTACGCGCTCCTCTTGACCGAGACCCCCAATGCGCGATGTATCGAACCTTCACTAACCCTAGAAATTAGTGGTGGGAATCAGCGAAGTTACAATGTGGGGTTGGACCCAGGATGTTAGCCTGCAAGCTATACAATTCTCTTAGATTAGACGAGAACGGAGAATTTAACCCCTGCAGCATTGGAGGTATGGTCTTGGGCATACCCGATACATGCAACGCAGCTCGGGATGTTCATGGTAGCACCTAACTGTATGGCATAGTTATGCAGAAGTGCGCTGCTTAAGAGCGATACCCCATAAAGAACGATTTTGGTGGTATTGCCCAAAGATAATGTCCCACGTTATCATCTGGTCAACGATGAGGTGGGTTGTTTTGTGATTGTTTGAGATGCTGAGTGCTGTTTAATGCGGGACATAAGGAAGGATATTAGTAGGGAGAAACGCTTGATGCCGGAAATATCCTTGCCTGGTTAACTGCTCGAAGTTAATCTGCGACGCTCGCCCTCATTCGGATGCATCGAAGGGCTCCCCTGCAGTTGCAAAGTCTTTGTTCTGCGAACTCGTAAAGTCGTAATGCCGTTGGTGGACCGTGCTTGTTAGGGATATTAAATGTTTCCTGGCCTTTAAAGCTATTGGCACGGCGGTTTAGATGGGACACCCTATCTCGTTTTCTACTTGCGCTTCAAGCGTCCCAACGAAACGAAATTGCGGACCGG
Now let's write these all to a file instead of printing them to a screen. We're just adding the >
redirector and a file name at the end of the loop to save the output.
for header in $(cat all-assembly-contig-IDs-that-did-not-hit-ref.txt)
do
grep -A1 "$header" our-transcriptome-assembly.fa
done > contigs-not-in-ref.fa
5. Blast the sequences against NCBI's nr database "remotely" (from the command line, sending our sequences to the NCBI servers)
We can send our sequences to NCBI through the command line to blast them instead of doing it at their website. But in order to include taxonomic information with the output, we need to have another small blast database that holds that information.
We can download and unpack that database with these commands:
curl -L https://ndownloader.figshare.com/files/16219079 -o taxdb.tar.gz
tar -xzvf taxdb.tar.gz
Now we have the taxonomy database too so we can add taxonomy to our blast output. Let's blast them!
nohup blastn -query contigs-not-in-ref.fa -remote -db nr \
-max_target_seqs 1 -max_hsps 1 \
-out contigs-not-in-ref-blast-to-nr.tsv \
-outfmt "6 qseqid qlen sseqid slen length pident evalue bitscore ssciname" &
CODE BREAKDOWN
The new things added here different from the previous blast are:
nohup
– this tells the computer not to stop the process if we disconnect from the server (very handy)-db
– here we are specifying "nr" which is for NCBI's "non-redundant" database (nucleotide in this case, which it knows because we usedblastn
-remote
– this tells the computer to send the BLAST job to the NCBI servers&
- this at the end tells the computer to run the process in the background so we can still work in our terminal. We can check on the job with the commandjobs
. This will finish when it does and it won't be interrupted disconnect or sign off before then.
We can press enter
to get get our regular prompt back.
This may take a few minutes (the amount of traffic the blast servers deal with fluctuates). So we can download a results file as follows and move on while that continues in the background:
curl -L https://ndownloader.figshare.com/files/16219238 -o contigs-not-in-ref-blast-to-nr-DL.tsv
The 9th column contains the taxonomy information, so we can use the cut
command to look at just that.
cut -f 9 contigs-not-in-ref-blast-to-nr-DL.tsv | sort | uniq -c
CODE BREAKDOWN
cut
– this crops out the specified columns, here we're cutting column 9 (taxonomy column, labeled as "ssciname" by blast)sort
– sorting is required in order for the followinguniq
command to function properlyuniq
– this removes all duplicates, leaving only single copies of the originals
- the
-c
flag provided touniq
also tells us how many of each were found
1 Bacillus subtilis subsp. subtilis str. NCIB 3610
2 Cloning vector pBAD-FLP
1 Methanocaldococcus jannaschii
33 Saccharomyces cerevisiae
1 Yeast expression vector pSCUDT2MFE6H
23 synthetic construct
We see there are quite a few synthetic constructs in here, we might want to remove them and look further into these other sequences if this were a real project. And there are 33 S. cerevisiae sequences that didn't successfully align to our reference cDNA file.
QUICK QUESTION!Why might there by S. cerevisiae sequences assembled from our RNA that are not aligning to the reference cDNA file?
Possible CausesIt's possible some of these might be non-coding RNAs, and therefore aren't in the cDNA reference. There may (probably are) other reasons too.
Isn't Unix grand 🙂
One-liner that removes softwraps from fasta files:
Download and unzip example file with softwraps:
curl -LO ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_rna.fna.gz
gunzip GCF_000146045.2_R64_rna.fna.gz
Look at the file before removing softwraps:
head GCF_000146045.2_R64_rna.fna
Complicated command to fix this (but easy to save somewhere for when needed, so no need to remember):
awk '!/^>/ { printf "%s", $0; n="\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' GCF_000146045.2_R64_rna.fna > GCF_000146045.2_R64_rna_no_softwraps.fna
Looking at new file:
head GCF_000146045.2_R64_rna_no_softwraps.fna
Now we'd be able to grep
out sequences if we wanted 🙂