Silva database? #5

davidvilanova · 2022-11-25T16:57:39Z

Hi,
Thanks for the gtdb-taxdump. I´m working on Silva , is there any available tax dump ?

shenwei356 · 2022-11-26T02:50:19Z

No existing one, but you can create it by yourself following the example, using the binaries here.

davidvilanova · 2022-11-28T08:55:44Z

Great,

I have tried so far, see shot below.

The taxonomy is two columns , fist column is accession and seconds columns are taxonomy as follows:

129138 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali

What is wrong ? there are two columns in the taxonomy file.

shenwei356 · 2022-11-28T09:24:27Z

Taxa should be tab-separated, as shown in Example 4.

Prepare lineages

 $ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
     | sed 's/;/\t/g' \
     | cut -f 2-8
 Bacteria        Proteobacteria  Gammaproteobacteria     Pseudomonadales Pseudomonadaceae        Pseudomonas     Pseudomonas amygdali

Create taxdump

 $ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
     | sed 's/;/\t/g' \
     | cut -f 2-8 \
     | taxonkit create-taxdump -R "superkingdom,phylum,class,order,famil,genus,species" -O taxdump

Test

 $ taxonkit list -n -r --ids 1 --data-dir taxdump/
 1 [no rank] root
   609216830 [superkingdom] Bacteria
     1641076285 [phylum] Proteobacteria
       329474883 [class] Gammaproteobacteria
         86398254 [order] Pseudomonadales
           1478401337 [famil] Pseudomonadaceae
             1616653803 [genus] Pseudomonas
               691281667 [species] Pseudomonas amygdali

shenwei356 · 2022-11-28T09:28:40Z

If you need to map the Silver TaxIDs to the created ones.

$ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
    | sed 's/;/\t/g' \
    | taxonkit create-taxdump -R "superkingdom,phylum,class,order,famil,genus,species" -O taxdump2 -A 1
    
$ tree taxdump2/
taxdump2/
├── delnodes.dmp
├── merged.dmp
├── names.dmp
├── nodes.dmp
└── taxid.map

$ cat taxdump2/taxid.map
129138  691281667

davidvilanova · 2022-11-28T15:39:56Z

Great !!!
I have generated the database, can i use now with taxonkit -lca ??

shenwei356 · 2022-11-28T15:45:49Z

Sure, please follow the usage and examples.

davidvilanova · 2022-12-01T09:51:23Z

I have the following problem.
I have added a new database using the workflow above and generated the Silva to TaxIDs.

For a particular id (1824050977) i don´t have a corresponding mapping (see shot). However it is in names.dmp.

The taxid map was generated the above post with option "-A 1"

Why it is missing ?

shenwei356 · 2022-12-01T10:08:29Z

It just maps custom IDs to TaxIDs of the taxa of the lowest rank, e.g. species.

If you need to map to taxa of other ranks, like the genus Staphylococcus. Here's the way, csvtk is needed.

cd taxdmp;

cat taxid.map \
    | taxonkit lineage --data-dir . -i 2 -t -d , \
    | cut -f 1,4 \
    | csvtk unfold -Ht -f 2 -s ,
    
129138  609216830
129138  1641076285
129138  329474883
129138  86398254
129138  1478401337
129138  1616653803
129138  691281667

davidvilanova · 2022-12-01T12:51:56Z

Great,
One more comment. When using all ranks i will have a problem to get a proper taxonkit lca assignment because by default the upper levels will be picked up. That mean for a particular list of ids all with get a Bacteria assigment. Don´t know if that can be fixed ?

In this case for example the 609216830 that is a Streptoccus is matched to Bacteria. I would expect an assignation to the genus family/genus level at least.

shenwei356 · 2022-12-01T14:30:19Z

So, mapping only IDs to taxa of species rank is reasonable.

Back to the previous concern, why did you want map IDs to the genus Staphylococcus?

For a particular id (1824050977) i don´t have a corresponding mapping (see shot)

Mapping to species is enough, cause the complete lineage of the species can be retrieved with the taxid of the species.

Could you please show how you query the LCA? What were the TaxIDs used? What's the direct purpose?

davidvilanova · 2022-12-01T14:39:43Z

Yes sure,
What i´m trying to achieve is to annotate with Silva a set of nanopore reads.
For that purpose i´m mapping one metagenomic sample reads to the Silva 16S database with minimap allowing for error tolerance as nanopore reads tend to give an error rate. For each read mapped to silva i get the first 10 hits.

So for each read i use LCA to assign the upper common hit from the 10 hits, so for one read i get one hit.

Once i get that hit i would like to resolve the taxonomy (phylum,class...order.....species)

For some reads the resolution (because of the error rate) cannot be achieved at the species level, that´s why in some cases(like the Staphyloccocus case above) the best hit (output from LCA) is a genus. But could be family or order or class or phyla.

Does this makes sense?

shenwei356 · 2022-12-01T15:20:19Z

So for each read i use LCA to assign the upper common hit from the 10 hits, so for one read i get one hit.

Good, here you've assigned LCA to each read.

Once i get that hit i would like to resolve the taxonomy (phylum,class...order.....species)

No, you can just use taxonkit lineage or taxonkit reformat -I to retrieve lineage via the LCAs, no matter what the rank they are. There's no need to query with the taxid.map file.

davidvilanova · 2022-12-03T18:59:38Z

Thanks for the update, there is one strange behaviour for lca. I was expecting LCA to go up to the staphylococcus genus level but it went up to bacilli. I assume something is wrong with my database.

Here is the pipeline to build the database:

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz

gunzip  SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz 
infile="SILVA_138.1_SSURef_NR99_tax_silva.fasta"

#Remove unwanted taxa
zcat  ${infile} | grep ">" |grep  -v 'Archaea\Eukaryota\|unidentified\|metagenomes\|environmental samples' >tab_silva_taxonomy

sed -i 's/>//g' tab_silva_taxonomy #replace ">"
sed -i 's/ /@@/' tab_silva_taxonomy # replace first space with @@
sed -i 's/ /_/g' tab_silva_taxonomy # replace all spaces with "_" mainly on species names
sed -i 's/@@/'$'\t''/' tab_silva_taxonomy # restore tab after accession
sed -i 's/;/'$'\t''/g' tab_silva_taxonomy # replace commas with semicolons
awk 'NF>7' tab_silva_taxonomy | cut -f 1-8 > tmp
mv tmp tab_silva_taxonomy
taxonkit create-taxdump tab_silva_taxonomy -O taxdump --force -RR "superkingdom,phylum,class,order,family,genus,species" -A 1

Don´t what is wrong...

shenwei356 · 2022-12-04T03:18:27Z

$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
    | sed -E 's/\s+/\n/g' \
    | taxonkit lineage  --data-dir taxdump/ -t \
    | cut -f 3
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;813944714;95949142;368069282;1202746109
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;1956460793
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;1956460793
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;485431882

$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
    | sed -E 's/\s+/\n/g' \
    | taxonkit lineage  --data-dir taxdump/ -t \
    | cut -f 2
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Sporosarcina;Staphylococcus_saprophyticus
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sciuri
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sciuri
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sp._mixed_culture_J3-

$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
    | sed -E 's/\s+/\n/g' \
    | sort | uniq  \
    | axonkit reformat --data-dir taxdump/ -I 1 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' \
    | csvtk add-header -t -n "taxid,kingdom,phylum,class,order,family,genus,species"  \
    | csvtk pretty -t
taxid        kingdom    phylum       class     order              family              genus            species
----------   --------   ----------   -------   ----------------   -----------------   --------------   -------------------------------------
1202746109   Bacteria   Firmicutes   Bacilli   Bacillales         Planococcaceae      Sporosarcina     Staphylococcus_saprophyticus
1956460793   Bacteria   Firmicutes   Bacilli   Staphylococcales   Staphylococcaceae   Staphylococcus   Staphylococcus_sciuri
485431882    Bacteria   Firmicutes   Bacilli   Staphylococcales   Staphylococcaceae   Staphylococcus   Staphylococcus_sp._mixed_culture_J3-3
883645126    Bacteria   Firmicutes   Bacilli   Bacillales         Bacillaceae         Bacillus         Streptococcus_pneumoniae

The LCA seems to be 1845768359 (Bacilli), without a doubt.

BTW, the process could be simplified.

    seqkit seq -n SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz \
        | grep -v 'Archaea\|Eukaryota\|unidentified\|metagenomes\|environmental samples' \
        | sed 's/ /\t/' \
        | sed 's/;/\t/g' \
        | awk -F '\t' 'NF > 7' \
        > silva_taxonomy.tsv

    taxonkit create-taxdump silva_taxonomy.tsv -O taxdump \
        --force -R "superkingdom,phylum,class,order,family,genus,species" -A 1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Silva database? #5

Silva database? #5

davidvilanova commented Nov 25, 2022

shenwei356 commented Nov 26, 2022

davidvilanova commented Nov 28, 2022

shenwei356 commented Nov 28, 2022

shenwei356 commented Nov 28, 2022

davidvilanova commented Nov 28, 2022

shenwei356 commented Nov 28, 2022

davidvilanova commented Dec 1, 2022

shenwei356 commented Dec 1, 2022

davidvilanova commented Dec 1, 2022

shenwei356 commented Dec 1, 2022

davidvilanova commented Dec 1, 2022

shenwei356 commented Dec 1, 2022

davidvilanova commented Dec 3, 2022

shenwei356 commented Dec 4, 2022 •

edited

Loading

Silva database? #5

Silva database? #5

Comments

davidvilanova commented Nov 25, 2022

shenwei356 commented Nov 26, 2022

davidvilanova commented Nov 28, 2022

shenwei356 commented Nov 28, 2022

shenwei356 commented Nov 28, 2022

davidvilanova commented Nov 28, 2022

shenwei356 commented Nov 28, 2022

davidvilanova commented Dec 1, 2022

shenwei356 commented Dec 1, 2022

davidvilanova commented Dec 1, 2022

shenwei356 commented Dec 1, 2022

davidvilanova commented Dec 1, 2022

shenwei356 commented Dec 1, 2022

davidvilanova commented Dec 3, 2022

shenwei356 commented Dec 4, 2022 • edited Loading

shenwei356 commented Dec 4, 2022 •

edited

Loading