-
Notifications
You must be signed in to change notification settings - Fork 2
Update database metadata
Metadata for NCBI genomes needs to be generated after syncing with RefSeq and GenBank. Most of this data is calculated externally to the GTDB to improve efficiency.
In general, scripts need to be run once for RefSeq and then again for GenBank.
Ideally, metadata would only be generated when required, but it some cases it is calculated de novo for all genomes for simplicity. Some data should be calculated de novo for all genomes on a regular bases. For example, 16S classification when there is a new SILVA/LTP database.
All scripts have been moved in gtdb_migration_tk.
TODO: develop a clean conda image for gtdb_migration_tk dev
NCBI gene calling is very conservative. This can negatively impact the CheckM estimates and results in valid marker genes being missed. As such, genes are called de novo using Prodigal:
Prodigal version: v2.6.3
module load biolib prodigal (optional)
gtdb_migration_tk prodigal -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genome_dirs.tsv -c 30 -l logs/prodigal.log
pfam_search version: v0.1
hmmer version: v3.1b2
**It is important to get this version of hmmer to get the same annotations of previous versions.
for full database
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/report_gcf.log --db pfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv -c 30 --folder_suffix 33.1 --hmm_db_path /srv/db/pfam/33.1/ -l logs/refseq_pfam.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release214/genbank/extra_gbk_report_gcf.log --db pfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv -c 30 --folder_suffix 33.1 --hmm_db_path /srv/db/pfam/33.1/ -l logs/genbank_pfam.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/report_gcf.log --db tigrfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv -c 30 --folder_suffix 15.0 --hmm_db_path /srv/whitlam/bio/db/tigrfam/15.0/TIGRFAMs_15.0_HMM/tigrfam.hmm -l logs/refseq_tigrfam.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release214/genbank/extra_gbk_report_gcf.log --db tigrfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv -c 30 --folder_suffix 15.0 --hmm_db_path /srv/whitlam/bio/db/tigrfam/15.0/TIGRFAMs_15.0_HMM/tigrfam.hmm -l logs/genbank_tigrfam.log
for db_lite
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/report_gcf.log --db pfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv -c 30 --folder_suffix 33.1_lite --hmm_db_path /srv/whitlam/bio/db/gtdb/marker_genes/hmms_extended_pfam33.1_tigr15/ -l logs/refseq_pfam_lite.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release214/genbank/extra_gbk_report_gcf.log --db pfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv -c 30 --folder_suffix 33.1_lite --hmm_db_path /srv/whitlam/bio/db/gtdb/marker_genes/hmms_extended_pfam33.1_tigr15/ -l logs/genbank_pfam_lite.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/report_gcf.log --db tigrfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv -c 30 --folder_suffix 15.0_lite --hmm_db_path /srv/whitlam/bio/db/gtdb/marker_genes/hmms_extended_pfam33.1_tigr15/tigrfam.hmm -l logs/refseq_tigrfam_lite.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release214/genbank/extra_gbk_report_gcf.log --db tigrfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv -c 30 --folder_suffix 15.0_lite --hmm_db_path /srv/whitlam/bio/db/gtdb/marker_genes/hmms_extended_pfam33.1_tigr15/tigrfam.hmm -l logs/genbank_tigrfam_lite.log
OPTIONAL: It is possible to regenerate tophit files without recalling the genes.
gtdb_migration_tk top_hit
generate metadata derived from nucleotide (e.g., GC) and protein (e.g., gene count) files:
gtdb_migration_tk metadata -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genome_dirs.tsv -c 30 -l logs/gtdb_metadata.log
we use trnascan: identify tRNAs in genomes
module load trnascan-se (optional)
gtdb_migration_tk trnascan -g /srv/db/gtdb/genomes/ncbi/release<release#>/genome_dirs.tsv -c 30
--gbk_bac_assembly_file /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_genbank.txt
--gbk_arc_assembly_file /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_genbank.txt
--rfq_arc_assembly_file /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_refseq.txt
--rfq_bac_assembly_file /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_refseq.txt
-l logs/trnascan.log
The quality of new genomes can be estimated with CheckM. Estimates should be put in /srv/whitlam/bio/db/gtdb/metadata/<release#>/checkm/
.
- estimate genome quality for new genomes in RefSeq and GenBank:
module load biolib checkm (optional)
gtdb_migration_tk checkm -g /srv/db/gtdb/genomes/ncbi/release<release#>/genbank/genome_dirs.tsv --report /srv/db/gtdb/genomes/ncbi/release<release#>/genbank/extra_gbk_report_gcf.log -o /srv/db/gtdb/metadata/release<release#>/checkm/genbank/ -c 40 -l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/check_genbank.log
gtdb_migration_tk checkm -g /srv/db/gtdb/genomes/ncbi/release<release#>/refseq/genome_dirs.tsv --report /srv/db/gtdb/genomes/ncbi/release<release#>/refseq/report_gcf.log -o /srv/db/gtdb/metadata/release<release#>/checkm/refseq/ -c 40 -l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/check_refseq.log
OPTIONAL In between official releases, checkM may have been calculated for other releases ( i.e. in between R95 and 202, R98,200,201 have been processed) So it's important to join the CheckM output files ( checkm.profiles,checkm.qa_sh100)
mkdir /srv/db/gtdb/metadata/release<release#>/checkm/join_releases
gtdb_migration_tk join_checkm -f /srv/db/gtdb/metadata/release98/checkm/refseq/checkm.profiles.tsv /srv/db/gtdb/metadata/release200/checkm/refseq/checkm.profiles.tsv /srv/db/gtdb/metadata/release201/checkm/refseq/checkm.profiles.tsv /srv/db/gtdb/metadata/release202/checkm/refseq/checkm.profiles.tsv -o /srv/db/gtdb/metadata/release202/checkm/join_releases/checkm_profile_refseq.tsv -l /srv/db/gtdb/metadata/release202/checkm/join_releases/join_checkm_refseq.log
With the CheckM profile generated, new and updated genomes can be pushed to the database.
- Create a duplicate of the current database in PostgresQL: in watson
CREATE DATABASE <_database_name_> (ie database_name = gtdb_release76)
- Populate the new created database with the data from the previous version (ie gtdb_release75):
port : 5432
pg_dump -h <_hostname_> -p <_port_> -U <_user_> gtdb_release<old> -f ~/pg_dumps/gtdb_release<old>_duplicate.sql
psql -h <_hostname_> -p <_port_> -U <_user_> -d gtdb_release<new> -f ~/pg_dumps/gtdb_release<new>_duplicate.sql
- Run the commands
gtdb_migration_tk update_db
--hostname <hostname> -u <user>-p <password> -d <database>
--checkm_profile /srv/db/gtdb/metadata/release<release#>/checkm/join_releases/checkm_profile_refseq.tsv
-g gtdb/genomes/ncbi/release<release#>/refseq/genome_dirs.tsv
--report_dir /srv/db/gtdb/genomes/ncbi/release<release#>/logs/update_db_log_20210826/
-l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/update_db_refseq_20210826.log
-f 2020-01-29 --repository refseq --cpus 20
gtdb_migration_tk update_db
--hostname <hostname> -u <user>-p <password> -d <database>
--checkm_profile /srv/db/gtdb/metadata/release<release#>/checkm/join_releases/checkm_profile_genbank.tsv
-g /srv/db/gtdb/genomes/ncbi/release<release#>/genbank/genome_dirs.tsv
--report_dir /srv/db/gtdb/genomes/ncbi/release<release#>/logs/update_db_log_20210829
-l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/update_db_genbank_20210829.log
-f 2020-01-29 -r genbank --cpus 20
TODO: The function _checkPathRecord needs to be reviewed. sometimes the path is not properly set during a path change
When updating the NCBI genomes, genomes that have been updated (i.e. GCF_000000.1 -> GCF_000000.2) needs to be realigned because all aligned markers have been removed.
gtdb -r power realign_updated_genomes
- import CheckM estimates; you will need to export the metadata for the new release first (gtdb metadata export):
gtdb_migration_tk update_checkm_db --hostname <hostname> -u <user> -p <password> -d <database> --checkm_profile checkm/join_releases/checkm_profile_genbank.tsv -q checkm/join_releases/checkm_qa_sh100_genbank.tsv -m metadata_r202.tsv -l logs/update_checkm_db_genbank.log
gtdb_migration_tk update_checkm_db --hostname <hostname> -u <user> -p <password> -d <database> --checkm_profile checkm/join_releases/checkm_profile_refseq.tsv -q checkm/join_releases/checkm_qa_sh100_refseq.tsv -m metadata_r202.tsv -l logs/update_checkm_db_refseq.log
- create tables with metadata for all genomes:
mkdir /srv/db/gtdb/metadata/release<release_number>/metadata_tables
cd /srv/db/gtdb/metadata/release<release_number>/metadata_tables/
gtdb_migration_tk create_tables -g /srv/db/gtdb/genomes/ncbi/release214/genome_dirs.tsv -v 138.2 -o .
gtdb_migration_tk create_tables: creates the metadata_nt.tsv, metadata_gene.tsv, metadata_ssu_silva.tsv, metadata_lsu_silva.tsv,'metadata_lsu_5S.tsv, metadata_ssu_silva_count.tsv, metadata_lsu_silva_count.tsv, metadata_lsu_5S_count.tsv, and metadata_trna_count.tsv tables
- create NCBI specific metadata tables
gtdb_migration_tk parse_assemblies --rb /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_refseq.txt --ra /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_refseq.txt --gb /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_genbank.txt --ga /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_genbank.txt -m metadata_r#.tsv
-o ncbi_assembly_summary.tsv
gtdb_migration_tk parse_ncbi_dir -g /srv/db/gtdb/genomes/ncbi/release<release#>/genome_dirs.tsv -o ncbi_assembly_metadata.tsv
gtdb_migration_tk ncbi_strains --gb /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_genbank.txt --ga /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_genbank.txt --rb /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_refseq.txt --ra /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_refseq.txt -o . -g /srv/db/gtdb/genomes/ncbi/release<release#>/genome_dirs.tsv
-l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/ncbi_strains.log
- import all metadata tables into the database:
UPDATE metadata_taxonomy SET ncbi_type_material_designation = NULL
gtdb_migration_tk update_metadata_db --hostname <hostname> -u <user> -p <password> -d gtdb_release202_temp -i /srv/db/gtdb/metadata/release<releasenumber>/metadata_tables/ --genome_list metadata_r202.tsv
The Standardised and non standardised taxonomy files were created during the sync of data ( early stage) Now we just need to add this information to the database.
- Add Information to the database:
gtdb_migration_tk update_ncbitax_db --hostname <hostname> -u <user> -p <password> -d <db> -o ncbi_r202_organism_names.tsv --genome_list /srv/home/uqpchaum/tmp_dir/migration_tk_pycharm_tmp/metadata_r202.tsv --filtered ncbi_r202_standardised.tsv --unfiltered ncbi_r202_unfiltered_taxonomy.tsv -l /srv/db/gtdb/genomes/ncbi/release<#>/logs/update_ncbitax_db.log
The final GTDB taxonomy from the previous release should be propagated to new releases. This can be done as follows:
- Get final GTDB metadata from previous release:
- Load previous version of database
- gtdb metadata export --output gtdb_metadata_prev.csv
- Get initial metadata for current release:
- Load current version of database
- gtdb metadata export --output gtdb_metadata_cur.csv
- Propagate over previous taxonomy were possible:
gtdb_migration_tk propagate_gtdb_taxonomy --gtdb_metadata_prev metadata_r95.tsv --gtdb_metadata_cur metadata_r202_20201020.tsv --taxonomy_file gtdb_taxonomy_propagate.tsv --rep_file gtdb_rep_ids.txt --log propagate_taxonomy/propagate.log
gtdb_migration_tk update_propagated_tax --hostname <host> -u gtdb -p <pass> -d gtdb_pierre_r220 -t gtdb_taxonomy_propagate.tsv --rep_file gtdb_rep_ids.txt -l propagate_taxonomy/update_taxonomy.log --truncate_taxonomy --genome_list metadata_export_b4_checkm.tsv -m metadata_export_b4_checkm.tsv
The last command marks GTDB representatives with modified NCBI Accession numbers in the database. This should be done before assigning new NCBI genomes to representatives.
gtdb -t 50 tree create --ncbi_genomes --marker_set_ids 1,19 --no_tree --output /srv/home/uqpchaum/tmp_dir/gtdb_tree_temp_todel
We need to compare compare_markers to make sure all markers are properly aligned
gtdb power compare_markers
New genomes from NCBI may not yet have a designated domain under the GTDB taxonomy. Setting the GTDB domain is critical since this field is often used for filtering (e.g., to create archaeal and bacterial specific trees). The following script propagates the domain designated within the NCBI taxonomy to the GTDB domain field:
-
gtdb_migration_tk set_gtdb_domain --hostname <hostname> -u <user> -p <password> -d <db>
Check the log to see the conflict of domain gtdb power taxonomy_check
THIS IS A LONG STEP, run it early in the pipeline. Update the "ncbi_genome_category" field to indicate if a genome is marked as a MAG or SAG at NCBI:
* gtdb_migration_tk ncbi_genome_category --genbank_assembly_summary ncbi/taxonomy/assembly_summary_genbank.txt --refseq_assembly_summary ncbi/taxonomy/assembly_summary_refseq.txt -g /srv/db/gtdb/genomes/ncbi/release207/genome_dirs.tsv -o category_field/ncbi_genome_category.tsv -l category_field/ncbi_genome_category.log
* gtdb_migration_tk python -m gtdb_migration_tk update_metadata_db --hostname <hostname> -u <user> -p <password> -d <db> --metadata_table category_field/ncbi_genome_category.tsv --metadata_table_desc /srv/db/gtdb/metadata/metadata_ncbi_genome_category.desc.tsv -l /srv/db/gtdb/genomes/ncbi/release207/logs/update_ncbi_genome_category.log
Add type material information (gtdb_type_designation, gtdb_type_designation_sources, lpsn_priority_year, gtdb_type_species_of_genus) to database:
gtdb_migration_tk update_metadata_db --hostname <server> -u gtdb -p <password> -d gtdb_pierre_r214 --metadata_table gtdb_type_strain_summary.tsv --metadata_table_desc /srv/db/gtdb/metadata/metadata_type_material.desc.tsv -l /srv/db/gtdb/genomes/ncbi/release214/logs/update_metadata_db_type_strain.log
gtdb -t 50 tree create --ncbi_genomes --marker_set_ids 11,12,13,19 --no_tree --output /srv/home/uqpchaum/tmp_dir/gtdb_tree_temp_todel
We need to compare compare_markers to make sure all markers are properly aligned
gtdb power compare_markers
Marker sets 11, 12, 13 (these are the ribosomal sets) and marker set 19 (this is the new archaeal marker set with 53 genes). These marker sets will be used each release.