This folder contains gtdb*dump.tar.gz files built from the Genome Taxonomy Database (GTDB), ready for use with Ete (see also Ete's documentation for more details).
To create the gtdb*dump.tar.gz files, we first get the archaea and bacteria taxonomies from the GTDB releases (for example, for the latest release, ar53_taxonomy and bac120_taxonomy).
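For orientation, here is a minimal sketch of how one line of such a taxonomy file could be read. It assumes the usual GTDB layout: a genome accession, a tab, and a semicolon-separated lineage whose entries carry rank prefixes (d__, p__, ..., s__). The function name `parse_taxonomy_line` and the sample line are made up for illustration and are not part of gtdb_to_taxdump or Ete.

```python
# Rank prefixes as they are assumed to appear in GTDB lineage strings.
RANK_BY_PREFIX = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}

def parse_taxonomy_line(line):
    """Split one 'accession<TAB>lineage' line into (accession, [(rank, name), ...])."""
    accession, lineage = line.rstrip("\n").split("\t", 1)
    ranks = []
    for entry in lineage.split(";"):
        # Each entry looks like '<prefix>__<name>', e.g. 'g__GenusF'.
        prefix, name = entry.split("__", 1)
        ranks.append((RANK_BY_PREFIX[prefix], name))
    return accession, ranks

# Made-up example line (placeholder accession and taxon names, not real data).
example = "ACC_0001\td__DomainA;p__PhylumB;c__ClassC;o__OrderD;f__FamilyE;g__GenusF;s__GenusF speciesg"
accession, ranks = parse_taxonomy_line(example)
print(accession, ranks[0], ranks[-1])
```

The conversion tool takes care of this parsing itself; the sketch is only meant to show what the input is expected to look like.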
Then, we use Nick Youngblut's gtdb_to_taxdump (which can also be found in tools -> third party) to convert the GTDB taxonomy to the NCBI taxdump format. To do so, we run:
gtdb_to_taxdump.py ar53_taxonomy.tsv.gz bac120_taxonomy.tsv.gz
and then we pack the four resulting .dmp files into a tar.gz archive:
tar -czf gtdb_latest_dump.tar.gz *.dmp
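Where the tar command is not available, the same packing step can be approximated with Python's standard tarfile module. This is a sketch, not part of the original workflow; the helper name `pack_dmp_files` is hypothetical.

```python
import glob
import os
import tarfile

def pack_dmp_files(src_dir, archive_path):
    """Write every .dmp file in src_dir into a gzip-compressed tar archive."""
    dmp_files = sorted(glob.glob(os.path.join(src_dir, "*.dmp")))
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in dmp_files:
            # Store only the file name so the archive has no directory prefix,
            # as when `tar -czf ... *.dmp` is run inside src_dir.
            tar.add(path, arcname=os.path.basename(path))
    # Return the packed file names so the caller can check what went in.
    return [os.path.basename(p) for p in dmp_files]
```

Calling `pack_dmp_files(".", "gtdb_latest_dump.tar.gz")` in the directory that holds the .dmp files should give an archive equivalent to the one produced by the tar command above.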
Let's download release 207 as an example:
wget https://github.com/etetoolkit/ete-data/raw/main/gtdb_taxonomy/gtdb207/gtdb207dump.tar.gz
(Note that we download the raw dump file, .../ete-data/raw/main/..., and not the web page at .../ete-data/blob/main/..., which is not a tar.gz archive.)
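The raw/blob distinction can also be handled in code. The sketch below rewrites a github.com blob URL into its raw counterpart using only the standard library; `to_raw_url` is a hypothetical helper, and the blob URL shown is simply the page counterpart of the raw URL used above.

```python
from urllib.parse import urlsplit, urlunsplit

def to_raw_url(url):
    """Turn a github.com '/blob/' page URL into the matching '/raw/' URL."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Expected path shape: /<owner>/<repo>/blob/<branch>/<path...>
    if parts.netloc == "github.com" and len(segments) > 3 and segments[3] == "blob":
        segments[3] = "raw"
    return urlunsplit(parts._replace(path="/".join(segments)))

page_url = "https://github.com/etetoolkit/ete-data/blob/main/gtdb_taxonomy/gtdb207/gtdb207dump.tar.gz"
raw_url = to_raw_url(page_url)
print(raw_url)

# To fetch the file without wget (needs network access), one option is:
# import urllib.request
# urllib.request.urlretrieve(raw_url, "gtdb207dump.tar.gz")
```

URLs that already point at the raw file pass through unchanged.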
We can then run the following Python code to load it into Ete:
from ete4 import GTDBTaxa
gtdb = GTDBTaxa()
gtdb.update_taxonomy_database("./gtdb207dump.tar.gz")