-
Notifications
You must be signed in to change notification settings - Fork 2
Updating website
This outlines the steps required to update the website for each release.
On Watson, run the command below. Then copy it across to the database server.
pg_dump -U gtdb --data-only --format=tar --no-owner --no-acl --no-privileges --dbname=gtdb_pierre_r220 --file "/tmp/gtdb_r220.tar"
On the database server, restore the release into a new database:
docker exec -it gtdb-postgres psql -U postgres gtdb_r220 -f /backup/gtdb_r220.sql
Check what extensions were enabled on the previous release database, and enable them on the new database:
SELECT *
FROM pg_extension;
For example, this enables the pg_trgm extension:
create extension pg_trgm;
There are a number of materialized views that need to be created, and potentially views that may need to be created. Check the previous database and bring across any that do not exist.
Views:
select 'Materialized Views' AS type, matviewname AS name
from pg_matviews
UNION
select 'Views', table_name
from INFORMATION_SCHEMA.views
where table_schema = 'public'
order by 1, 2;
Update permissions on the database so that the API is allowed to read.
GRANT SELECT
ON ALL TABLES IN SCHEMA public
TO gtdb_api;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT
ON TABLES TO gtdb_api;
Make sure that all tables are clustered on their primary keys, this can be done by querying for all primary keys, then copying the SQL from the output and running it:
SELECT conrelid::regclass AS tbl,
conname AS pkey,
pg_get_constraintdef(oid) AS description,
CONCAT('CLUSTER ', conrelid::regclass, ' USING ', conname, ';') AS query
FROM pg_constraint
WHERE contype = 'p'
AND connamespace = 'public'::regnamespace
ORDER BY conrelid::regclass::text, contype DESC;
You will also need to manually re-cluster the materialized views, they should have a "clst" index that dictates what index to use.
Reclaim any unused space by running the following:
VACUUM FULL ANALYZE;
Now, reindex the database and update the statistics:
REINDEX DATABASE gtdb_r220;
ANALYZE;
CHECKPOINT;
To do this, dump the previous release website database, then restore it as follows:
docker exec -it gtdb-postgres pg_dump -U postgres --dbname=gtdb_r214_web --no-acl --no-privileges --no-owner --file="/backup/gtdb_r214_web.sql"
docker exec -it gtdb-postgres psql -U postgres gtdb_r220_web -f /backup/gtdb_r214_web.sql
Update permissions on the database so that the API is allowed to read.
GRANT SELECT
ON ALL TABLES IN SCHEMA public
TO gtdb_api;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT
ON TABLES TO gtdb_api;
First, truncate the genome_taxid table:
TRUNCATE TABLE genome_taxid;
Run the script from the GTDB API repository: scripts/release/update_genome_taxid.py
Ensure that the connection string is setup to use the new web release database (see api/db/__init__.py
).
This is done manually, discuss with the ledger holder to ensure that it is up-to-date.
Run the script from the GTDB API repository: scripts/release/update_tree.py
Run the following command to purge the data from the database:
TRUNCATE TABLE gtdb_tree, gtdb_tree_children;
Import the new tsv files that were generated.
- Connect to the
gtdb_r220_web.taxon_hist
table and run the following command:
DELETE FROM taxon_hist WHERE release_ver = 'NCBI';
Run the following command in the new GTDB release database (not the web database):
select 'NCBI' AS release_ver,
g.formatted_source_id AS genome_id,
split_part(ncbi_taxonomy, ';', 1) AS rank_domain,
split_part(ncbi_taxonomy, ';', 2) AS rank_phylum,
split_part(ncbi_taxonomy, ';', 3) AS rank_class,
split_part(ncbi_taxonomy, ';', 4) AS rank_order,
split_part(ncbi_taxonomy, ';', 5) AS rank_family,
split_part(ncbi_taxonomy, ';', 6) AS rank_genus,
split_part(ncbi_taxonomy, ';', 7) AS rank_species
from genomes g
inner join metadata_taxonomy mt ON mt.id = g.id
where mt.gtdb_domain != 'd__'
AND mt.gtdb_phylum != 'p__'
AND mt.gtdb_class != 'c__'
AND mt.gtdb_order != 'o__'
AND mt.gtdb_family != 'f__'
AND mt.gtdb_genus != 'g__'
AND mt.gtdb_species != 's__'
UNION
select 'R220' AS release_ver,
g.formatted_source_id AS genome_id,
mt.gtdb_domain AS rank_domain,
mt.gtdb_phylum AS rank_phylum,
mt.gtdb_class AS rank_class,
mt.gtdb_order AS rank_order,
mt.gtdb_family AS rank_family,
mt.gtdb_genus AS rank_genus,
mt.gtdb_species AS rank_species
from genomes g
inner join metadata_taxonomy mt ON mt.id = g.id
where mt.gtdb_domain != 'd__'
AND mt.gtdb_phylum != 'p__'
AND mt.gtdb_class != 'c__'
AND mt.gtdb_order != 'o__'
AND mt.gtdb_family != 'f__'
AND mt.gtdb_genus != 'g__'
AND mt.gtdb_species != 's__'
Export the results and import them to the gtdb_r220_web.taxon_hist
table.
This should be done periodically and is not required for the release.
Make sure that all tables are clustered on their primary keys, this can be done by querying for all primary keys, then copying the SQL from the output and running it:
SELECT conrelid::regclass AS tbl,
conname AS pkey,
pg_get_constraintdef(oid) AS description,
CONCAT('CLUSTER ', conrelid::regclass, ' USING ', conname, ';') AS query
FROM pg_constraint
WHERE contype = 'p'
AND connamespace = 'public'::regnamespace
ORDER BY conrelid::regclass::text, contype DESC;
Reclaim any unused space by running the following (~3 minutes):
VACUUM FULL ANALYZE;
Now, reindex the database and update the statistics:
REINDEX DATABASE gtdb_r220_web;
ANALYZE;
CHECKPOINT;
Run the following query to get the numbers for the index page:
SELECT count(*), gtdb_domain
FROM metadata_taxonomy mt
WHERE gtdb_phylum != 'p__'
GROUP BY gtdb_domain;
SELECT COUNT(*), 'Phylum' AS taxon
FROM (SELECT DISTINCT regexp_replace(gtdb_phylum, '_[A-Z]$', '') AS rank
FROM metadata_taxonomy mt
WHERE gtdb_phylum != 'p__'
AND gtdb_domain = 'd__Bacteria') AS sub
UNION
SELECT COUNT(*), 'Class' AS taxon
FROM (SELECT DISTINCT regexp_replace(gtdb_class, '_[A-Z]$', '') AS rank
FROM metadata_taxonomy mt
WHERE gtdb_class != 'c__'
AND gtdb_domain = 'd__Bacteria') AS sub
UNION
SELECT COUNT(*), 'Order' AS taxon
FROM (SELECT DISTINCT regexp_replace(gtdb_order, '_[A-Z]$', '') AS rank
FROM metadata_taxonomy mt
WHERE gtdb_order != 'o__'
AND gtdb_domain = 'd__Bacteria') AS sub
UNION
SELECT COUNT(*), 'Family' AS taxon
FROM (SELECT DISTINCT regexp_replace(gtdb_family, '_[A-Z]$', '') AS rank
FROM metadata_taxonomy mt
WHERE gtdb_family != 'f__'
AND gtdb_domain = 'd__Bacteria') AS sub
UNION
SELECT COUNT(*), 'Genus' AS taxon
FROM (SELECT DISTINCT gtdb_genus AS rank
FROM metadata_taxonomy mt
WHERE gtdb_genus != 'g__'
AND gtdb_domain = 'd__Bacteria') AS sub
UNION
SELECT COUNT(*), 'Species' AS taxon
FROM (SELECT DISTINCT gtdb_species AS rank
FROM metadata_taxonomy mt
WHERE gtdb_species != 's__'
AND gtdb_domain = 'd__Bacteria') AS sub
ORDER BY 1;
TODO: Put the commit with the release changes in.