Skip to content

Update database metadata

Pierre Chaumeil edited this page Oct 8, 2024 · 250 revisions

Metadata for NCBI genomes needs to be generated after syncing with RefSeq and GenBank. Most of this data is calculated externally to the GTDB to improve efficiency.
In general, scripts need to be run once for RefSeq and then again for GenBank.

Ideally, metadata would only be generated when required, but it some cases it is calculated de novo for all genomes for simplicity. Some data should be calculated de novo for all genomes on a regular bases. For example, 16S classification when there is a new SILVA/LTP database.

All scripts have been moved in gtdb_migration_tk.
TODO: develop a clean conda image for gtdb_migration_tk dev

Call genes on RefSeq and GenBank genomes

NCBI gene calling is very conservative. This can negatively impact the CheckM estimates and results in valid marker genes being missed. As such, genes are called de novo using Prodigal:

Prodigal version: v2.6.3

module load biolib prodigal (optional)
gtdb_migration_tk prodigal -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genome_dirs.tsv -c 30 -l logs/prodigal.log

Search and Align HMMs:

pfam_search version: v0.1
hmmer version: v3.1b2
**It is important to get this version of hmmer to get the same annotations of previous versions.

for full database

gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/report_gcf.log --db pfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv -c 30 --folder_suffix 33.1 --hmm_db_path /srv/db/pfam/33.1/ -l logs/refseq_pfam.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release214/genbank/extra_gbk_report_gcf.log --db pfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv -c 30 --folder_suffix 33.1 --hmm_db_path /srv/db/pfam/33.1/ -l logs/genbank_pfam.log

gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/report_gcf.log --db tigrfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv -c 30 --folder_suffix 15.0 --hmm_db_path /srv/whitlam/bio/db/tigrfam/15.0/TIGRFAMs_15.0_HMM/tigrfam.hmm -l logs/refseq_tigrfam.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release214/genbank/extra_gbk_report_gcf.log --db tigrfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv -c 30 --folder_suffix 15.0 --hmm_db_path /srv/whitlam/bio/db/tigrfam/15.0/TIGRFAMs_15.0_HMM/tigrfam.hmm -l logs/genbank_tigrfam.log

for db_lite

gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/report_gcf.log --db pfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv -c 30 --folder_suffix 33.1_lite --hmm_db_path /srv/whitlam/bio/db/gtdb/marker_genes/hmms_extended_pfam33.1_tigr15/ -l logs/refseq_pfam_lite.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release214/genbank/extra_gbk_report_gcf.log --db pfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv -c 30 --folder_suffix 33.1_lite --hmm_db_path /srv/whitlam/bio/db/gtdb/marker_genes/hmms_extended_pfam33.1_tigr15/ -l logs/genbank_pfam_lite.log

gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/report_gcf.log --db tigrfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv -c 30 --folder_suffix 15.0_lite --hmm_db_path /srv/whitlam/bio/db/gtdb/marker_genes/hmms_extended_pfam33.1_tigr15/tigrfam.hmm -l logs/refseq_tigrfam_lite.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release214/genbank/extra_gbk_report_gcf.log --db tigrfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv -c 30 --folder_suffix 15.0_lite --hmm_db_path /srv/whitlam/bio/db/gtdb/marker_genes/hmms_extended_pfam33.1_tigr15/tigrfam.hmm -l logs/genbank_tigrfam_lite.log

OPTIONAL: It is possible to regenerate tophit files without recalling the genes.

gtdb_migration_tk top_hit

Generate Metadata:

generate metadata derived from nucleotide (e.g., GC) and protein (e.g., gene count) files:

gtdb_migration_tk metadata -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genome_dirs.tsv -c 30 -l logs/gtdb_metadata.log

Generate Ribosomal Metadata:

Generate-Ribosomal-Metadata

Predict transfer RNA

we use trnascan: identify tRNAs in genomes

module load trnascan-se (optional)
gtdb_migration_tk trnascan -g /srv/db/gtdb/genomes/ncbi/release<release#>/genome_dirs.tsv -c 30 
--gbk_bac_assembly_file /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_genbank.txt 
--gbk_arc_assembly_file /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_genbank.txt 
--rfq_arc_assembly_file /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_refseq.txt 
--rfq_bac_assembly_file /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_refseq.txt
-l logs/trnascan.log

Estimate genome quality

The quality of new genomes can be estimated with CheckM. Estimates should be put in /srv/whitlam/bio/db/gtdb/metadata/<release#>/checkm/.

  1. estimate genome quality for new genomes in RefSeq and GenBank:
module load biolib checkm (optional)
gtdb_migration_tk checkm -g /srv/db/gtdb/genomes/ncbi/release<release#>/genbank/genome_dirs.tsv --report /srv/db/gtdb/genomes/ncbi/release<release#>/genbank/extra_gbk_report_gcf.log -o /srv/db/gtdb/metadata/release<release#>/checkm/genbank/ -c 40 -l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/check_genbank.log

gtdb_migration_tk checkm -g /srv/db/gtdb/genomes/ncbi/release<release#>/refseq/genome_dirs.tsv --report /srv/db/gtdb/genomes/ncbi/release<release#>/refseq/report_gcf.log -o /srv/db/gtdb/metadata/release<release#>/checkm/refseq/ -c 40 -l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/check_refseq.log

OPTIONAL In between official releases, checkM may have been calculated for other releases ( i.e. in between R95 and 202, R98,200,201 have been processed) So it's important to join the CheckM output files ( checkm.profiles,checkm.qa_sh100)

mkdir /srv/db/gtdb/metadata/release<release#>/checkm/join_releases
gtdb_migration_tk join_checkm -f /srv/db/gtdb/metadata/release98/checkm/refseq/checkm.profiles.tsv /srv/db/gtdb/metadata/release200/checkm/refseq/checkm.profiles.tsv /srv/db/gtdb/metadata/release201/checkm/refseq/checkm.profiles.tsv /srv/db/gtdb/metadata/release202/checkm/refseq/checkm.profiles.tsv -o /srv/db/gtdb/metadata/release202/checkm/join_releases/checkm_profile_refseq.tsv -l /srv/db/gtdb/metadata/release202/checkm/join_releases/join_checkm_refseq.log

Add new genomes to the DB

With the CheckM profile generated, new and updated genomes can be pushed to the database.

  1. Create a duplicate of the current database in PostgresQL: in watson
CREATE DATABASE <_database_name_> (ie database_name = gtdb_release76)
  1. Populate the new created database with the data from the previous version (ie gtdb_release75):

port : 5432

pg_dump -h <_hostname_> -p <_port_> -U <_user_> gtdb_release<old> -f ~/pg_dumps/gtdb_release<old>_duplicate.sql
psql -h <_hostname_> -p <_port_> -U <_user_> -d gtdb_release<new> -f ~/pg_dumps/gtdb_release<new>_duplicate.sql
  1. Run the commands
gtdb_migration_tk update_db 
--hostname <hostname> -u <user>-p <password> -d <database>
--checkm_profile /srv/db/gtdb/metadata/release<release#>/checkm/join_releases/checkm_profile_refseq.tsv
-g gtdb/genomes/ncbi/release<release#>/refseq/genome_dirs.tsv
--report_dir /srv/db/gtdb/genomes/ncbi/release<release#>/logs/update_db_log_20210826/
-l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/update_db_refseq_20210826.log
-f 2020-01-29 --repository refseq --cpus 20

gtdb_migration_tk update_db 
--hostname <hostname> -u <user>-p <password> -d <database> 
--checkm_profile /srv/db/gtdb/metadata/release<release#>/checkm/join_releases/checkm_profile_genbank.tsv
-g /srv/db/gtdb/genomes/ncbi/release<release#>/genbank/genome_dirs.tsv
--report_dir /srv/db/gtdb/genomes/ncbi/release<release#>/logs/update_db_log_20210829
-l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/update_db_genbank_20210829.log
-f 2020-01-29 -r genbank --cpus 20

TODO: The function _checkPathRecord needs to be reviewed. sometimes the path is not properly set during a path change

Realign updated NCBI genomes

When updating the NCBI genomes, genomes that have been updated (i.e. GCF_000000.1 -> GCF_000000.2) needs to be realigned because all aligned markers have been removed.

gtdb -r power realign_updated_genomes

Metadata

  1. import CheckM estimates; you will need to export the metadata for the new release first (gtdb metadata export):
gtdb_migration_tk update_checkm_db --hostname <hostname> -u <user> -p <password> -d <database> --checkm_profile checkm/join_releases/checkm_profile_genbank.tsv -q checkm/join_releases/checkm_qa_sh100_genbank.tsv -m metadata_r202.tsv -l logs/update_checkm_db_genbank.log

gtdb_migration_tk update_checkm_db --hostname <hostname> -u <user> -p <password> -d <database> --checkm_profile checkm/join_releases/checkm_profile_refseq.tsv -q checkm/join_releases/checkm_qa_sh100_refseq.tsv -m metadata_r202.tsv -l logs/update_checkm_db_refseq.log
  1. create tables with metadata for all genomes:
mkdir /srv/db/gtdb/metadata/release<release_number>/metadata_tables
cd /srv/db/gtdb/metadata/release<release_number>/metadata_tables/
gtdb_migration_tk create_tables -g /srv/db/gtdb/genomes/ncbi/release214/genome_dirs.tsv -v 138.2 -o .

gtdb_migration_tk create_tables: creates the metadata_nt.tsv, metadata_gene.tsv, metadata_ssu_silva.tsv, metadata_lsu_silva.tsv,'metadata_lsu_5S.tsv, metadata_ssu_silva_count.tsv, metadata_lsu_silva_count.tsv, metadata_lsu_5S_count.tsv, and metadata_trna_count.tsv tables

  1. create NCBI specific metadata tables
gtdb_migration_tk parse_assemblies --rb /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_refseq.txt --ra /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_refseq.txt --gb /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_genbank.txt --ga /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_genbank.txt -m metadata_r#.tsv 
-o ncbi_assembly_summary.tsv

gtdb_migration_tk parse_ncbi_dir -g /srv/db/gtdb/genomes/ncbi/release<release#>/genome_dirs.tsv -o ncbi_assembly_metadata.tsv

gtdb_migration_tk ncbi_strains --gb /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_genbank.txt --ga /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_genbank.txt --rb /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_refseq.txt --ra /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_refseq.txt -o . -g /srv/db/gtdb/genomes/ncbi/release<release#>/genome_dirs.tsv 
-l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/ncbi_strains.log
  1. import all metadata tables into the database:
UPDATE metadata_taxonomy SET ncbi_type_material_designation = NULL
gtdb_migration_tk update_metadata_db --hostname <hostname> -u <user> -p <password> -d gtdb_release202_temp -i /srv/db/gtdb/metadata/release<releasenumber>/metadata_tables/ --genome_list metadata_r202.tsv

Add Checkm2 Information

Checkm2 metadata

Add 7-rank NCBI taxonomy

The Standardised and non standardised taxonomy files were created during the sync of data ( early stage) Now we just need to add this information to the database.

  1. Add Information to the database:
gtdb_migration_tk update_ncbitax_db --hostname <hostname> -u <user> -p <password> -d <db> -o ncbi_r202_organism_names.tsv --genome_list /srv/home/uqpchaum/tmp_dir/migration_tk_pycharm_tmp/metadata_r202.tsv --filtered ncbi_r202_standardised.tsv --unfiltered ncbi_r202_unfiltered_taxonomy.tsv -l /srv/db/gtdb/genomes/ncbi/release<#>/logs/update_ncbitax_db.log

Propagating GTDB taxonomy

The final GTDB taxonomy from the previous release should be propagated to new releases. This can be done as follows:

  1. Get final GTDB metadata from previous release:
    • Load previous version of database
    • gtdb metadata export --output gtdb_metadata_prev.csv
  2. Get initial metadata for current release:
    • Load current version of database
    • gtdb metadata export --output gtdb_metadata_cur.csv
  3. Propagate over previous taxonomy were possible:
gtdb_migration_tk propagate_gtdb_taxonomy --gtdb_metadata_prev metadata_r95.tsv --gtdb_metadata_cur metadata_r202_20201020.tsv --taxonomy_file gtdb_taxonomy_propagate.tsv --rep_file gtdb_rep_ids.txt --log propagate_taxonomy/propagate.log
gtdb_migration_tk update_propagated_tax --hostname <host> -u gtdb -p <pass> -d gtdb_pierre_r220 -t gtdb_taxonomy_propagate.tsv --rep_file gtdb_rep_ids.txt -l propagate_taxonomy/update_taxonomy.log --truncate_taxonomy --genome_list metadata_export_b4_checkm.tsv -m metadata_export_b4_checkm.tsv

The last command marks GTDB representatives with modified NCBI Accession numbers in the database. This should be done before assigning new NCBI genomes to representatives.

Realign main marker set for new genomes

gtdb -t 50 tree create --ncbi_genomes --marker_set_ids 1,19 --no_tree --output /srv/home/uqpchaum/tmp_dir/gtdb_tree_temp_todel We need to compare compare_markers to make sure all markers are properly aligned

  • gtdb power compare_markers

Setting GTDB domain for NCBI genomes

New genomes from NCBI may not yet have a designated domain under the GTDB taxonomy. Setting the GTDB domain is critical since this field is often used for filtering (e.g., to create archaeal and bacterial specific trees). The following script propagates the domain designated within the NCBI taxonomy to the GTDB domain field:

  • gtdb_migration_tk set_gtdb_domain --hostname <hostname> -u <user> -p <password> -d <db> Check the log to see the conflict of domain
  • gtdb power taxonomy_check

Set genome categoryfield

THIS IS A LONG STEP, run it early in the pipeline. Update the "ncbi_genome_category" field to indicate if a genome is marked as a MAG or SAG at NCBI:

 * gtdb_migration_tk ncbi_genome_category --genbank_assembly_summary ncbi/taxonomy/assembly_summary_genbank.txt --refseq_assembly_summary ncbi/taxonomy/assembly_summary_refseq.txt -g /srv/db/gtdb/genomes/ncbi/release207/genome_dirs.tsv -o category_field/ncbi_genome_category.tsv -l category_field/ncbi_genome_category.log
 * gtdb_migration_tk python -m gtdb_migration_tk update_metadata_db --hostname <hostname> -u <user> -p <password> -d <db> --metadata_table category_field/ncbi_genome_category.tsv --metadata_table_desc /srv/db/gtdb/metadata/metadata_ncbi_genome_category.desc.tsv -l /srv/db/gtdb/genomes/ncbi/release207/logs/update_ncbi_genome_category.log

Get Seqcode information

Seqcode parsing

Parse LPSN, and NCBI

Select Strains

Add type material information (gtdb_type_designation, gtdb_type_designation_sources, lpsn_priority_year, gtdb_type_species_of_genus) to database:

gtdb_migration_tk update_metadata_db --hostname <server> -u gtdb -p <password> -d gtdb_pierre_r214 --metadata_table gtdb_type_strain_summary.tsv --metadata_table_desc /srv/db/gtdb/metadata/metadata_type_material.desc.tsv -l /srv/db/gtdb/genomes/ncbi/release214/logs/update_metadata_db_type_strain.log

Realign remaining marker sets for new genomes

gtdb -t 50 tree create --ncbi_genomes --marker_set_ids 11,12,13,19 --no_tree --output /srv/home/uqpchaum/tmp_dir/gtdb_tree_temp_todel We need to compare compare_markers to make sure all markers are properly aligned

  • gtdb power compare_markers

Marker sets 11, 12, 13 (these are the ribosomal sets) and marker set 19 (this is the new archaeal marker set with 53 genes). These marker sets will be used each release.

Add surveillance genomes

Add Surveillance Genomes

Clone this wiki locally