Generating data files for GTDB website
A number of data files must be created for each GTDB release and placed on the GTDB website at: https://data.ace.uq.edu.au/public/gtdb/data/releases
- We need to combine the 2 final taxonomy files:
cat final_taxonomy_ar122.tsv final_taxonomy_bac120.tsv > final_taxonomy_combined.tsv
- We propagate the taxonomy from the representatives to all genomes in their clusters.
gtdb_migration_tk propagate_curated_taxonomy -t final_taxonomy_combined.tsv -m gtdb_r202_metadata_20210413.tsv -o propagated_taxonomy.tsv
- We push the new taxonomy to the database.
gtdb_migration_tk add_taxonomy_to_database --hostname $hostname -u $user -d $db -p password -t propagated_taxonomy.tsv -m gtdb_r202_metadata_20210413.tsv --truncate_taxonomy
- Make sure the number of genomes in propagated_taxonomy.tsv matches the number in the database. The line count from
wc -l propagated_taxonomy.tsv
should equal the number of rows returned by
SELECT * from metadata_view where gtdb_species not like 's__'
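The check above can be scripted. The following is a minimal sketch, not part of any GTDB toolkit: the function name `check_genome_count` is hypothetical, and the `psql` invocation in the usage comment (including rewriting the query as `COUNT(*)`) is an assumption about how the database is reached.

```shell
# Compare the number of lines in a taxonomy file against an expected count.
# Usage: check_genome_count <taxonomy.tsv> <expected_count>
check_genome_count() {
    local file=$1 expected=$2 actual
    # Reading from stdin makes wc print only the number; tr strips padding spaces.
    actual=$(wc -l < "$file" | tr -d ' ')
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $actual genomes"
    else
        echo "MISMATCH: file has $actual lines, database has $expected rows" >&2
        return 1
    fi
}

# Hypothetical usage (connection details and COUNT(*) form are assumptions):
# db_count=$(psql -h "$hostname" -U "$user" -d "$db" -At \
#     -c "SELECT COUNT(*) FROM metadata_view WHERE gtdb_species NOT LIKE 's__'")
# check_genome_count propagated_taxonomy.tsv "$db_count"
```

A non-zero return status on mismatch lets the check stop a release script before later steps run.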
- GTDB database must be updated to contain latest species clustering information
- GTDB database must be updated to contain the latest GTDB taxonomy
- Data in the updated database should be dumped to a TSV file, e.g.:
gtdb metadata export --format tab --output gtdb_r202_metadata_20210414.tsv
- The path to the data directory for each genome in the GTDB should be dumped to a TSV file, e.g.:
gtdb power genome_paths --output gtdb_r202_genome_paths_20210414.tsv
The config.py file in the GTDB Release Tk must be updated to reflect any changes in the path to data files.
Functionality for generating website data files is being moved into the GTDB Release Tk, but currently exists in a number of Python and SQL scripts. The examples below come from several GTDB releases (r89, r202, and r207); release numbers and dates should be updated to reflect the current release.
- We get the gtdb_clusters_de_novo.tsv file from /srv/db/gtdb/metadata/release202/representatives/sp_cluster_update/u_cluster_de_novo/
- We generate canonicals_to_ncbi.tsv:
awk -v OFS='\t' '{ print $2,$1 }' gtdb_r207_metadata_20220322.tsv > canonicals_to_ncbi.tsv
- We create the species cluster file:
gtdb-release-tk sp_cluster_file data_from_db/gtdb_r202_metadata_20210414.tsv data_from_db/gtdb_clusters_de_novo.tsv data_from_db/canonicals_to_ncbi.tsv 202 sp_cluster_file
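One detail of the awk step used for canonicals_to_ncbi.tsv: without `-F`, awk splits input on any run of blanks, so a value containing a space in one of the first two columns would shift the fields. The sketch below sets the input separator to a tab as well. It assumes the first two columns hold the two IDs to swap; the function name and the sample row are made up for illustration.

```shell
# Swap the first two tab-separated columns of a TSV read from stdin.
# Both the input (-F) and output (OFS) separators are tabs, so values
# containing spaces stay in one field.
swap_first_two_columns() {
    awk -F'\t' -v OFS='\t' '{ print $2, $1 }'
}

# Hypothetical usage:
# swap_first_two_columns < gtdb_r207_metadata_20220322.tsv > canonicals_to_ncbi.tsv
```

If the metadata file starts with a header row, that row is swapped like any other line, exactly as in the original command.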
Bacterial and archaeal taxonomy files spanning the species representative genomes can be obtained with:
gtdb-release-tk taxonomy_files gtdb_r202_metadata_20210414.tsv sp_clusters_r202.tsv 202 taxonomy_files
This creates the bac120_taxonomy_r89.tsv and ar122_taxonomy_r89.tsv taxonomy files. GTDB user genome IDs are replaced with an NCBI genome accession where available and a UBA ID otherwise.
- The input trees must already be stripped of dummy curation nodes. This can be done with the remove_dummy method of the GTDB Validation Tk.
gtdb_validation_tk remove_dummy gtdb_r202_bac120_unscaled_decorated.tree gtdb_r202_bac120_unscaled_decorated_no_dummy.tree
gtdb_validation_tk remove_dummy gtdb_r202_ar122_unscaled_decorated.tree gtdb_r202_ar122_unscaled_decorated_no_dummy.tree
The archaeal and bacterial trees used during curation must be modified to replace all GTDB user genome IDs. Reference trees for the GTDB website can be created with:
gtdb_release_tk tree_files gtdb_r202_metadata_20210414.tsv gtdb_r202_bac120_unscaled_decorated_no_dummy.tree gtdb_r202_ar122_unscaled_decorated_no_dummy.tree canonicals_to_ncbi.tsv 202 r202_temp_website/tree_files
Three files spanning different sets of 16S rRNA sequences are placed on the GTDB website:
- bac120_ssu_reps_r<release#>.fna: a single 16S rRNA sequence for each bacterial representative genome. The longest identified 16S rRNA sequence is selected for each representative genome.
- ar53_ssu_reps_r<release#>.fna: a single 16S rRNA sequence for each archaeal representative genome. The longest identified 16S rRNA sequence is selected for each representative genome.
- ssu_all_r<release#>.fna: contains all 16S rRNA sequences identified across the set of GTDB genomes passing QC.
These files can be created with:
gtdb-release-tk ssu_files gtdb_r202_metadata_20210414.tsv sp_clusters_r202.tsv gtdb_r202_genome_paths_20210414.tsv 202 ssu_files
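The "longest sequence per representative genome" rule used for the reps files can be illustrated on a small table. This is only a sketch of the selection logic, not how the GTDB Release Tk implements it: the three-column input (genome ID, sequence ID, length) and the function name are hypothetical.

```shell
# Keep the longest sequence per genome from a TSV read on stdin with
# columns: genome_id, seq_id, length.
longest_per_genome() {
    # Sort by genome ID, then by length in descending numeric order;
    # awk then keeps only the first row seen for each genome ID.
    sort -t "$(printf '\t')" -k1,1 -k3,3nr | awk -F'\t' '!seen[$1]++'
}

# Hypothetical usage:
# longest_per_genome < ssu_lengths.tsv > ssu_longest.tsv
```

When two sequences of the same genome have equal length, which one is kept depends on the sort order of the remaining fields.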
Information about individual marker genes along with individual MSAs are provided on the GTDB website. Initial versions of these files can be obtained from the GTDB:
gtdb -t 30 tree create --no_tree --no_trim --individual --prefix bac120_r207_all --taxa_filter d__Bacteria --genome_batchfile ../data_from_db/bac120_all.lst --marker_set_ids 1 --guaranteed_batchfile ../data_from_db/bac120_all.lst --output bac120_msa_marker_genes_all_r207 --classic_header
gtdb -t 30 tree create --no_tree --no_trim --individual --prefix ar53_r207_all --taxa_filter d__Archaea --genome_batchfile ../data_from_db/ar53_all.lst --marker_set_ids 19 --guaranteed_batchfile ../data_from_db/ar53_all.lst --output ar53_msa_marker_genes_all_r207 --classic_header
Once the alignment has finished:
`cd bac120_msa_marker_genes_all_r207`
`mkdir individual; mv bac120_r207_all_PF* individual; mv bac120_r207_all_TI* individual; cd individual; tar czvf bac120_msa_marker_genes_all_r207.tar.gz *`
Alternatively:
`gtdb-release-tk marker_files bac120_msa_marker_genes_all_r207 ar53_msa_marker_genes_all_r207 207 individual_gene_files`
Repeat the same operation for the representative genomes.
This produces the files:
- bac120/ar122_msa_marker_info_r89.tsv
- bac120/ar122_msa_individual_genes_r89.tar.gz
To generate the protein and nucleotide files, you need to use the combined taxonomy file:
gtdb-release-tk nucleotide_files --taxonomy_file taxonomy_files/taxonomy_r207.tsv --metadata_file data_from_db/gtdb_r207_metadata_20220322.tsv --release_number 207 --genome_dirs data_from_db/gtdb_r207_genome_paths_20220322.tsv --output_dir protein_fna_reps
gtdb-release-tk protein_files --taxonomy_file taxonomy_files/taxonomy_r207.tsv --metadata_file data_from_db/gtdb_r207_metadata_20220322.tsv --release_number 207 --genome_dirs data_from_db/gtdb_r207_genome_paths_20220322.tsv --output_dir protein_faa_reps
Archive the results:
tar -cv protein_fna_reps | pigz -9 > gtdb_proteins_nt_reps_r202.tar.gz
tar -cv protein_faa_reps | pigz -9 > gtdb_proteins_aa_reps_r202.tar.gz
Run the command
gtdb-release-tk hq_genome_file data_from_db/gtdb_r207_metadata.tsv 207 hq_genome_file
Run the command
gtdb-release-tk metadata_files data_from_db/gtdb_r207_metadata_20220322.tsv data_from_db/metadata_field_desc.tsv sp_cluster_file/sp_clusters_r207.tsv 207 metadata_files
Run the command
gtdb-release-tk lpsn_urls /srv/db/gtdb/metadata/release207/lpsn/20210823/species_list.lst 207 lpsn_urls
Get the qc_failed.tsv file from /srv/db/gtdb/metadata/release207/representatives/sp_cluster_update/2_u_qc_genomes, then run:
gtdb-release-tk qc_file data_from_db/qc_failed.tsv data_from_db/canonicals_to_ncbi.tsv 207 qc_failed
Run the command
gtdb-release-tk dict_file taxonomy_files/taxonomy_r207.tsv 207 dict_file
- For reps
gtdb-release-tk gene_files --taxonomy_file taxonomy_files/taxonomy_r207.tsv --genome_dirs data_from_db/gtdb_r207_genome_paths_20220322.tsv --release_number 207 --output_dir gene_files_reps --cpus 30 --metadata_file data_from_db/gtdb_r207_metadata_20220322.tsv --only_reps
rename and archive output:
mv ar122_202_individual_genes ar122_marker_genes_reps_r202
mv bac120_202_individual_genes bac120_marker_genes_reps_r202
tar cvzf ar122_marker_genes_reps_r202.tar.gz ar122_marker_genes_reps_r202
tar cvzf bac120_marker_genes_reps_r202.tar.gz bac120_marker_genes_reps_r202
- For all genomes
gtdb_release_tk gene_files --taxonomy_file taxonomy_files/taxonomy_r202.tsv --genome_dirs data_from_db/gtdb_r202_genome_paths_20210414.tsv --release_number 202 --output_dir gene_files_all --cpus 30 --metadata_file data_from_db/gtdb_r202_metadata_20210414.tsv
rename and archive output:
mv ar122_202_individual_genes ar122_marker_genes_all_r202
mv bac120_202_individual_genes bac120_marker_genes_all_r202
tar cvzf ar122_marker_genes_all_r202.tar.gz ar122_marker_genes_all_r202
tar cvzf bac120_marker_genes_all_r202.tar.gz bac120_marker_genes_all_r202
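The rename-and-archive steps are identical for both marker sets and for both the reps and all-genomes runs, so they can be wrapped in a small helper. This is a sketch only: the function name `archive_marker_genes` is hypothetical, and it assumes the output directories follow the `<prefix>_<release>_individual_genes` pattern shown above. It uses `tar czf` without the verbose flag.

```shell
# Rename the gene_files output directories and archive them.
# Usage: archive_marker_genes <release> <label>   (label: reps or all)
# Run it from the directory that contains the output directories.
archive_marker_genes() {
    local release=$1 label=$2 prefix src dest
    for prefix in ar122 bac120; do
        src="${prefix}_${release}_individual_genes"
        dest="${prefix}_marker_genes_${label}_r${release}"
        # Stop at the first failure so a missing directory is not silently skipped.
        mv "$src" "$dest" || return 1
        tar czf "${dest}.tar.gz" "$dest" || return 1
    done
}

# Hypothetical usage:
# archive_marker_genes 202 reps
# archive_marker_genes 202 all
```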
MSA files used to produce the GTDB reference trees are created by the gtdb tree create command. These files need to be processed. The trimmed MSAs can be found in /srv/project/gtdb/release/archaea(bacteria)/pre_curation/bac120(ar53)/msa:
gtdb-release-tk msa_files gtdb_r207_bac120_concatenated.faa gtdb_r207_ar53_concatenated.faa canonicals_to_ncbi.tsv gtdb_r207_metadata_20220322.tsv 207 ../trimmed_msa_files
The JSON tree is used as a reference file to load the tree browser on the website.
Join both taxonomy and metadata files:
cat bac120_taxonomy_r89.tsv ar122_taxonomy_r89.tsv > taxonomy_r89.tsv
cat bac120_metadata_r89.tsv ar122_metadata_r89.tsv > metadata_r89.tsv
gtdb_release_tk json_tree_file --taxonomy_file taxonomy_r89.tsv --metadata_file metadata_r89.tsv --output_dir . --release_number 89
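A note on the two cat commands above: plain `cat` copies every line, so if the input files each begin with a header row, the second header ends up in the middle of the combined file. Whether the metadata files carry a header is an assumption here. If they do, a sketch like the following (hypothetical function name) keeps only the first one.

```shell
# Concatenate TSV files, keeping the header line of the first file only.
concat_with_single_header() {
    # FNR is the line number within the current file, NR across all files:
    # skip line 1 of every file except the very first line read.
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Hypothetical usage:
# concat_with_single_header bac120_metadata_r89.tsv ar122_metadata_r89.tsv > metadata_r89.tsv
```

For files without a header row, such as plain two-column taxonomy files, this would drop a data line from the second file, so `cat` remains the right tool there.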
Export the data from the database:
SELECT release_ver,rank_domain,rank_phylum,rank_class,rank_order,rank_family,rank_genus,rank_species
FROM taxon_hist WHERE release_ver not like '%NCBI%'
ORDER BY replace(release_ver,'R','')::float
Export the result as allranks_allreleases.csv.
Download the latest taxonomy dump from NCBI to get the latest names:
rsync ftp.ncbi.nih.gov::pub/taxonomy/taxdump.tar.gz .
gtdb_release_tk nomenclatural_check --ncbi_node_file 20210420/nodes.dmp --ncbi_name_file 20210420/names.dmp --lpsn_species_file /srv/db/gtdb/metadata/release202/lpsn/20201124/species_list.lst --output_directory test --gtdb_taxonomy ../taxonomy_files/taxonomy_r202.tsv --rank_release_file ../data_from_db/allranks_allreleases.csv
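The nomenclatural_check command above reads nodes.dmp and names.dmp from a dated directory, while the rsync step downloads a single taxdump.tar.gz. A sketch for unpacking just those two files is shown below. The function name is hypothetical, and it assumes both files sit at the top level of the archive.

```shell
# Unpack only nodes.dmp and names.dmp from a taxdump archive into a directory.
# Usage: extract_taxdump <taxdump.tar.gz> <output_dir>
extract_taxdump() {
    local archive=$1 outdir=$2
    mkdir -p "$outdir" || return 1
    # Naming the members limits extraction to the two files that are needed.
    tar -xzf "$archive" -C "$outdir" nodes.dmp names.dmp
}

# Hypothetical usage, naming the directory after today's date:
# extract_taxdump taxdump.tar.gz "$(date +%Y%m%d)"
```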
This tool tracks changes between 2 different taxonomy files. The genome IDs in those files are automatically changed to GenBank or UBA IDs to run the comparison. This functionality will return 10 files (5 bacterial and 5 archaeal, one per rank from phylum to genus) that can be copied to an Excel spreadsheet.
gtdb_release_tk tax_comp_files --reference_taxonomy_file data_from_db/gtdb_taxonomy_ncbi_20210419.tsv --new_taxonomy_file taxonomy_files/taxonomy_r202.tsv --output_dir compare_taxonomy/ncbi_vs_gtdb --changes_only
Plots for the GTDB stats page using both a default and a color-blind-safe palette can be generated with:
gtdb_release_tk all_release_plots bac120_metadata_r<#>.tsv ar53_metadata_r<#>.tsv <release_number> <output_dir>