Website update

Updating GTDB website databases

1. Preparing the website instance databases

1.1 Restoring the main database

The full database should be restored to the website instance to improve reliability, excluding the aligned_markers table. Restore the GTDB rXXX database to the website instance with the following name: gtdb_r***

pg_dump --host=??? --username=??? --password --exclude-table=aligned_markers --no-privileges --no-owner -F t gtdb_pierre_r202 > /tmp/gtdb_r202.tar
pg_restore --host=??? --username=??? --password --dbname gtdb_r202 --no-privileges --no-owner --verbose /tmp/gtdb_r202.tar

1.2 Restoring the web database

Create a new database called gtdb_r***_web. This is a database containing pre-processed information used in tools (e.g. taxon history).

The old version of the database should be restored, note that some data will be modified for the new release.

pg_dump --host=??? --username=??? --password --no-privileges --no-owner -F t gtdb_r95_web > /tmp/gtdb_r95_web.tar
pg_restore --host=??? --username=??? --password --dbname gtdb_r202_web --no-privileges --no-owner --verbose /tmp/gtdb_r95_web.tar

1.2.1 taxon_alias

The taxon_alias table contains a list of aliases for a specific canonical genome_id. For example:

GCA_000006945.2, GCF_000006945.1, GCF_000006945.2 = G000006945
Note that this includes UBA IDs (UBA1234, 61231).

Steps:

As all the UBA mapping is in the table, there is no need to re-process it.
Add the new genomes to the table.
- Simply, this can be done by just exporting the data and doing an intersection.

1.2.2 taxon_hist

The taxon_hist table contains the taxon history for each canonical genome id per release.

This is used by the taxon history tool.

Steps:

Query the gtdb_r*** database and export the result to disk:

select 'R202' AS release_ver,
       g.formatted_source_id AS genome_id,
       gtdb_domain,
       gtdb_phylum,
       gtdb_class,
       gtdb_order,
       gtdb_family,
       gtdb_genus,
       gtdb_species
from genomes g
         inner join metadata_taxonomy mt on g.id = mt.id
where gtdb_phylum != 'p__';

1.2.3 genome_path

The genome_path table is used by the FastANI calculator.

Simply query the R202 table, export the query:

select formatted_source_id,
       CASE
           WHEN substring(fasta_file_location, 0, 4) = 'GCA'
               THEN CONCAT('/srv/db/gtdb/genomes/ncbi/release202/genbank/', fasta_file_location)
           WHEN substring(fasta_file_location, 0, 4) = 'GCF'
               THEN CONCAT('/srv/db/gtdb/genomes/ncbi/release202/refseq/', fasta_file_location)
END AS path
from genomes

Then import the result from the above query into the table.

1.2.4

The genome_taxid table is used by the genome page to infer the NCBI taxonomy ID, so that it can be linked out.

This is generated through the GTDB-Migration-Tk package. Truncate the existing table, and import the new records. <<<< TODOS

1.2.5

The gtdb_release table is probably not used but it should be updated as well as per what is already in there.

2. Update the website database connection string

Find the SQLALCHEMY_BINDS variable in config.py and add the new databases, e.g.:

    SQLALCHEMY_BINDS = {
        'r95': f'postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/gtdb_r95',
        'gtdb_web': f'postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/gtdb_r95_web',
        'dev': f'postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/gtdb_release202_temp',
    }
    SQLALCHEMY_BINDS['current'] = SQLALCHEMY_BINDS['r95']

    SQLALCHEMY_BINDS = {
        'r95': f'postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/gtdb_r95',
        'r202': f'postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/gtdb_r202',
        'gtdb_web': f'postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/gtdb_r202_web',
        'dev': f'postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/gtdb_release202_temp',
    }
    SQLALCHEMY_BINDS['current'] = SQLALCHEMY_BINDS['r202']

3. Modifying code that depends on releases

This ideally would be automatically configured, but the time required to do that has not yet reached a point that is worth it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly