This folder contains gtdb*dump.tar.gz files built from Genome Taxonomy Database (GTDB) releases, ready for use with Ete (see Ete's documentation for more details).

How to create the tar.gz files

To create the gtdb*dump.tar.gz files, we first download the archaea and bacteria taxonomy files from the GTDB releases (for the latest release, for example, ar53_taxonomy and bac120_taxonomy).
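For orientation, here is a minimal sketch of how one of these taxonomy files could be inspected before conversion. It assumes the usual GTDB layout of one genome per line, with the accession and a semicolon-separated lineage separated by a tab, and rank prefixes such as d__ and s__. The helper names (parse_taxonomy_line, read_taxonomy) are hypothetical and the example line is only illustrative, not taken from a specific release.

```python
import gzip

# Rank prefixes used in GTDB lineage strings (assumed layout, see above).
RANK_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_taxonomy_line(line):
    """Split one 'accession<TAB>lineage' line into (accession, {rank: name})."""
    accession, lineage = line.rstrip("\n").split("\t")
    ranks = {}
    for taxon in lineage.split(";"):
        prefix, name = taxon.split("__", 1)
        ranks[RANK_PREFIXES[prefix]] = name
    return accession, ranks


def read_taxonomy(path):
    """Yield (accession, ranks) for every non-empty line of a .tsv.gz file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_taxonomy_line(line)


# Illustrative line only; real files contain one such line per genome.
example = ("RS_GCF_000005845.2\t"
           "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
           "o__Enterobacterales;f__Enterobacteriaceae;"
           "g__Escherichia;s__Escherichia coli")
accession, ranks = parse_taxonomy_line(example)
print(accession, ranks["genus"], ranks["species"])
```

Calling read_taxonomy("bac120_taxonomy.tsv.gz") would iterate over the whole bacterial file in the same way.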

Then, we use Nick Youngblut's gtdb_to_taxdump (which can also be found in tools -> third party) to convert the GTDB taxonomy to the NCBI taxdump format. To do so, we run:

gtdb_to_taxdump.py ar53_taxonomy.tsv.gz bac120_taxonomy.tsv.gz

and then we pack the four resulting .dmp files into a tar.gz archive:

tar -czf gtdb_latest_dump.tar.gz *.dmp
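Where the tar command is not available, the same packing step can be sketched with Python's standard tarfile module. This is an assumed equivalent rather than the procedure actually used for the files in this folder, and bundle_dmp_files is a hypothetical helper name.

```python
import glob
import os
import tarfile


def bundle_dmp_files(output="gtdb_latest_dump.tar.gz", pattern="*.dmp"):
    """Pack every file matching `pattern` into a gzip-compressed tar archive.

    Files are stored under their base name, so they sit at the root of the
    archive, as with `tar -czf gtdb_latest_dump.tar.gz *.dmp`.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with tarfile.open(output, "w:gz") as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))
    return paths


if __name__ == "__main__":
    # Run from the directory that holds the .dmp files.
    print(bundle_dmp_files())
```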

How to use GTDB databases in Ete4

Let's download release 207 as an example:

wget https://github.com/etetoolkit/ete-data/raw/main/gtdb_taxonomy/gtdb207/gtdb207dump.tar.gz

(Note that we download the raw dump file, .../ete-data/raw/main/..., and not .../ete-data/blob/main/....)
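A quick way to confirm that the raw archive was fetched, and not the HTML page served under .../blob/..., is to check that the file opens as a tar archive. The sketch below uses only the standard library; dump_members is a hypothetical helper name.

```python
import tarfile


def dump_members(path):
    """Return the archive's member names, or None if `path` is not a tar file.

    An HTML page saved by mistake from a .../blob/... URL is not a valid
    archive, so it yields None here.
    """
    if not tarfile.is_tarfile(path):
        return None
    # Mode "r" detects gzip compression transparently.
    with tarfile.open(path) as tar:
        return tar.getnames()


if __name__ == "__main__":
    members = dump_members("gtdb207dump.tar.gz")
    if members is None:
        print("Not a tar archive; check that the URL contains /raw/.")
    else:
        print("Archive contains:", members)
```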

We can then run the following Python code to use it in Ete:

from ete4 import GTDBTaxa
gtdb = GTDBTaxa()
# Load the downloaded dump into Ete's GTDB taxonomy database
gtdb.update_taxonomy_database("./gtdb207dump.tar.gz")