Using project: `open-targets-genetics-dev`.
Currently the genetics team provides input files in a GCP bucket, gs://genetics-portal-dev-staging (staging). Some of these files are static; others are annotated with a date (variously YYMMDD and DDMMYY). A subset of these files is then manually copied by the BE team to gs://genetics-portal-dev-data (dev), into a directory corresponding to the release. The files in dev are used to run the pipeline, typically using Dataproc.
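The copy from staging into the release's inputs area is mostly handled by `scripts/prepare_inputs.sh` (described below), but for a one-off file the manual equivalent is a simple `gsutil` copy. A minimal sketch; the date and release number are illustrative, not values to reuse:

```bash
# copy one dated input from staging into the release's inputs directory
# (date 220908 and release 22.09 are hypothetical examples)
gsutil -m cp -r \
  gs://genetics-portal-dev-staging/v2d/220908/studies.parquet \
  gs://genetics-portal-dev-data/22.09/inputs/v2d/
```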
| Configuration field | Likely staging location | Notes |
|---|---|---|
| variant-index.raw | *provided by data team* | Rarely regenerated; we're still using the 2019 files. |
| ensembl.lut | *generated by BE* | See Ensembl section |
| vep.homo-sapiens-cons-scores | *should be in staging bucket* | Typically just use the file from the last release: /lut/vep_consequences.tsv |
| interval.path | v2g/interval/* | |
| qtl.path | v2g/qtl/<date>/ | |
| variant-disease.studies | v2d/<date>/studies.parquet | |
| variant-disease.toploci | v2d/<date>/toploci.parquet | |
| variant-disease.finemapping | v2d/<date>/finemapping.parquet/ | |
| variant-disease.ld | v2d/<date>/ld.parquet/ | |
| variant-disease.overlapping | v2d/<date>/locus_overlap.parquet | |
| variant-disease.coloc | coloc/<date>/coloc_processed_w_betas.parquet/ | |
| variant-disease.trait_efo | v2d/<date>/trait_efo-2021-11-16.parquet | |
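Most of the dated inputs follow the `<prefix>/<date>/` pattern shown above, so the quickest way to find the most recent drop is to list the staging bucket; for example (output and dates will vary):

```bash
# list the dated directories for the v2d and coloc inputs in staging
gsutil ls gs://genetics-portal-dev-staging/v2d/
gsutil ls gs://genetics-portal-dev-staging/coloc/
```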
The variant index comes in parquet format from the data team after filtering the latest gnomAD release. If there is no new update, keep using the previous one. Currently, the variant annotation is version 190129.
The genetics team does not provide the Ensembl file: we have to download the data ourselves and generate the input. It is the lookup table that brings the latest reference gene table from Ensembl into the pipeline configuration. To generate it, follow the instructions for the `create_genes_dictionary.py` script in `genetics-backend/makeLUTs`; for example:
python create_genes_dictionary.py -o "./" -z -n homo_sapiens_core_104_38
The above example uses Ensembl 104; the most recent version is 107. If the version has not changed since the previous release, feel free to copy the input file from the previous release's input directory.
The TSV file is provided by the genetics team. If it is not present in the staging bucket, ask the genetics team for the most recent version.
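If you do reuse the previous release's file, the copy is a one-liner; the release numbers below are hypothetical:

```bash
# reuse the VEP consequences TSV from the previous release (release numbers hypothetical)
gsutil cp \
  gs://genetics-portal-dev-data/22.02/inputs/lut/vep_consequences.tsv \
  gs://genetics-portal-dev-data/22.09/inputs/lut/
```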
Provided by the genetics team: these are mainly static and haven't been updated for years. They are in a nested file structure which must be preserved because the ETL uses the file path as an input.
Provided by the genetics team: these are updated on a regular basis.
The following script should install the necessary dependencies to generate the Ensembl LUT:
# install dependencies
sudo apt-get update
sudo apt install -y git tmux wget
# install miniconda in batch mode and make conda available in this shell
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda
$HOME/miniconda/bin/conda init bash
source ~/.bashrc
# get repositories
git clone https://github.com/opentargets/genetics-backend.git
# set up conda environment
cd genetics-backend && conda env create -f environment.yaml
conda activate backend-genetics
# create the LUT file (the script lives in makeLUTs)
cd makeLUTs
python create_genes_dictionary.py -o "./" -z -n homo_sapiens_core_107_38
Use the VM called `gp-deploy` in the `open-targets-genetics-dev` project. The VM is preconfigured with the necessary utilities to run a release.
- Start the deployment machine:
  `gcloud compute instances start "jb-release" --project "open-targets-genetics-dev" --zone "europe-west4-a"`
- SSH into the deployment machine:
  `gcloud compute ssh --zone "europe-west4-a" "jb-release" --tunnel-through-iap --project "open-targets-genetics-dev"`
- If not already done, clone the required repository:
  `git clone [email protected]:opentargets/genetics-backend.git`
- Set up the environment:
  `conda activate backend-genetics`
- Update the Ensembl version (latest 106, Apr 22) and run the script from `genetics-backend/makeLUTs`:
  `python create_genes_dictionary.py -o "./" -z -n homo_sapiens_core_106_38`
- Add the Ensembl file to the bucket:
  `gsutil cp -n homo_sapiens* gs://genetics-portal-dev-data/22.03/inputs/lut/`
- Update the variables in the bash script `/scripts/prepare_inputs.sh` (the input script).
- Run the input script in the VM to move files from the staging to the dev buckets.
- Most of the inputs are used for the pipeline, but there are two static datasets which are simply copied: sumstats (`sa`) and `v2d_credset`.
- It is useful to pipe the STDOUT of the script to a file, which can be provided to the genetics/data team to confirm the correct files were used:
  `./scripts/prepare_inputs.sh >> genetics_input_log.txt`
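After the input script finishes, a quick recursive listing of the release's inputs prefix (release number hypothetical) is a cheap sanity check that everything landed where the pipeline configuration expects it:

```bash
# confirm the copied inputs are in place before starting the pipeline (release number hypothetical)
gsutil ls -r gs://genetics-portal-dev-data/22.09/inputs/ > genetics_inputs_listing.txt
```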
- Create a configuration file for the release in `config`:
  `cp src/main/resources/application.conf config/<release>.conf`
  and update as necessary.
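Later steps assume the release configuration is also available in a GCS bucket (see the rerun recipe below). A sketch of pushing it there; the `conf/` prefix is an assumption about the bucket layout:

```bash
# upload the release configuration so the Dataproc jobs can read it (conf/ path layout assumed)
gsutil cp config/<release>.conf gs://genetics-portal-dev-data/<release>/conf/
```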
- Run genetics-pipe. There are two options here: a Dataproc workflow (requires Scala) or bash scripts. The former is easier. A sketch of launching the workflow follows this list.
  - Workflow option: open the worksheet `scripts/dataproc-workflow.sc`, update the top-level variables (should only be the input and output directories) and run it. You can terminate the worksheet on your local machine once it has started, since Dataproc will run in the background. The advantage of the workflow is that Dataproc will create the specified cluster, run the steps in the right order, then destroy the cluster without the need for any manual intervention.
  - Script option:
    - Update the top-level variables in `scripts/run_cluster.sh`: `release` and `config` should be the only changes necessary.
    - Run `scripts/run_cluster.sh` from the root directory. This script builds a jar file, pushes it to GS storage, starts a cluster and runs all steps. Some of the jobs will fail because of missing dependencies; consult `documentation/step_dependencies` for the correct order.
    - In general, run in the following phases (some steps can be run concurrently):
      - variant-index (30 min), variant-gene (180 min)
      - dictionaries, variant-disease (2 min), variant-disease-coloc (2 min)
      - disease-variant-gene (25 min)
      - scored datasets (130 min)
      - manhattan (25 min) (run this after the L2G steps below)
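The other `.sc` scripts in this document are run with Ammonite, so launching the workflow worksheet likely looks like the following; treat it as a sketch, since the exact invocation and the cluster settings are defined inside the worksheet itself:

```bash
# run the Dataproc workflow worksheet after editing its input/output variables
amm scripts/dataproc-workflow.sc
```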
- Inform the genetics team that the outputs are ready; they will run the ML pipeline to generate the `l2g` outputs. The file we need for the final step (`manhattan`) is typically found under `genetics-portal-dev-staging/l2g/<date>/predictions/l2g.full.220128.parquet` in the staging area.
- Copy the L2G file from the staging area to the development area (updating dates as necessary):
  `gsutil -m cp -r gs://genetics-portal-dev-staging/l2g/220908/predictions/l2g.full.220908.parquet/part-* gs://genetics-portal-dev-data/22.09.1/outputs/l2g/`
- Run the `manhattan` step using either the scripts or the workflow `scripts/dataproc-workflow-manhattan.sc`. Note that the workflow assumes all prior steps have been completed and the inputs are available.
- Check that all the expected output directories are present using the Ammonite script: `amm scripts/check_outputs.sc`.
Use case: An issue was identified with the data and the ETL needs to be rerun, but we don't want to recreate the inputs / move static files around. Often we're just updating a single input file, but the change will propagate between steps so we need to rerun the whole ETL.
- Delete ETL outputs (non-static files) using `amm scripts/delete_etl_outputs` (a sketch of the full rerun sequence follows this list)
- Update ETL configuration with new value
- Push configuration to correct bucket
- Update workflow with new configuration file (if necessary)
- Execute workflow
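Under the assumptions above (Ammonite worksheets, configuration kept under a `conf/` prefix in the release bucket), the rerun might look roughly like this:

```bash
# 1. remove the previous non-static ETL outputs
amm scripts/delete_etl_outputs

# 2. push the updated configuration to the release bucket (path layout assumed)
gsutil cp config/<release>.conf gs://genetics-portal-dev-data/<release>/conf/

# 3. re-run the workflow (update its configuration reference first if necessary)
amm scripts/dataproc-workflow.sc
```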
- Recreate infrastructure
- Clone the genetics output support repository (if necessary)
- Update the values in ...
- Execute `make disk`: this process starts a VM and loads all the data created by the ETL along with the static files. It creates two disk images which can be used to start the Elasticsearch and Clickhouse instances needed for the web presence.
- Wait for the process to complete (~3 hours) and then execute
  `cd terraform_create_images && terraform apply -destroy -auto-approve`
  to shut down the image generation infrastructure.
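To confirm the Elasticsearch and Clickhouse images were actually produced before tearing the infrastructure down, you can list the images in the project; the name filter below is a guess based on the `devgen<release>` naming used elsewhere in this document:

```bash
# list disk images in the dev project that look like they belong to this release (filter hypothetical)
gcloud compute images list --project "open-targets-genetics-dev" --filter="name~devgen"
```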
- You should check the size of the images and counts in ES and Clickhouse to get an idea of whether there were any problems in loading the data.
- Clickhouse:
- SSH into the Clickhouse image:
  `gcloud compute ssh --zone "europe-west1-c" "devgen2202-ch-11-clickhouse-gc34" --tunnel-through-iap --project "open-targets-genetics-dev" -- -L 8123:localhost:8123`
- Execute the following command (using either clickhouse-client or another DB manager) to get counts:
SELECT table,
       sum(rows) AS rows,
       formatReadableSize(sum(bytes)) AS size,
       round(log10(rows), 2) AS row_orderMagnitude
FROM system.parts
WHERE active AND (table NOT ILIKE '%_log') -- exclude system log tables
GROUP BY table;
The database in the 22.02 release shows:
┌─table──────────────────┬───────rows─┬─size───────┐
│ genes │ 19569 │ 3.59 MiB │
│ studies │ 50719 │ 2.08 MiB │
│ variants │ 72858944 │ 5.41 GiB │
│ v2d_by_stchr │ 20488888 │ 323.71 MiB │
│ v2d_sa_gwas │ 582828390 │ 29.51 GiB │
│ v2g_structure │ 9 │ 3.40 KiB │
│ v2d_coloc │ 4458533 │ 306.86 MiB │
│ l2g_by_gsl │ 3580861 │ 155.29 MiB │
│ v2d_credset │ 38834105 │ 1.34 GiB │
│ v2d_by_chrpos │ 20488888 │ 414.83 MiB │
│ manhattan │ 279116 │ 44.22 MiB │
│ v2g_scored │ 1030927072 │ 20.09 GiB │
│ d2v2g_scored │ 1658712886 │ 41.05 GiB │
│ studies_overlap │ 14570115 │ 154.52 MiB │
│ l2g_by_slg │ 3580861 │ 168.87 MiB │
│ v2d_sa_molecular_trait │ 442006706 │ 14.63 GiB │
└────────────────────────┴────────────┴────────────┘
As far as I know, we would not expect order-of-magnitude changes between releases.
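There is no equivalent command in this document for the Elasticsearch side. A minimal check, assuming you tunnel port 9200 through IAP to the Elasticsearch VM (the instance name below is hypothetical), is to ask the `_cat` API for per-index document counts and sizes:

```bash
# open a tunnel to the Elasticsearch VM (instance name hypothetical)
gcloud compute ssh --zone "europe-west1-c" "devgen2202-es-elasticsearch" \
  --tunnel-through-iap --project "open-targets-genetics-dev" -- -L 9200:localhost:9200

# then, from another terminal, list indices with their document counts and sizes
curl -s 'http://localhost:9200/_cat/indices?v&s=index'
```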
Using the genetics terraform repository:
For this, use the master branch and remember to pull changes from the remote before making your changes.
Note: commands are given relative to the repository root directory.
- Update the configuration in the `deployment_context.devgen` file. To see which fields are often changed, you can look at the difference between previous releases with the command `diff deployment_context.devgen2202 deployment_context.devgen2111`. Fields that typically always need updating:
  - `config_release_name`: matches the context file name suffix
  - `config_dns_subdomain_prefix`: same as `config_release_name`
  - `config_vm_elastic_search_disk_name`: disk image you created earlier in the create infrastructure recipe
  - `config_vm_clickhouse_disk_name`: disk image you created earlier in the create infrastructure recipe
  - `config_vm_api_image_version`: latest API version. From the API repository run `git checkout master && git pull && git tag --list` to see the options; it's typically the last one.
  - `config_vm_webapp_release`: the latest tagged version of the web app
  - `DEVOPS_CONTEXT_PLATFORM_APP_CONFIG_API_URL`: update the URL to include `config_release_name`.
- Activate the `devgen` profile: `make tfactivate profile=devgen`
- Set the remote backend (so multiple users can share state): `make tfbackendremote`
- Activate the deployment context you configured earlier: `make depactivate profile=devgen`
- Download all dependencies: `make tfinit`
- Check for existing Terraform state (things that are already deployed): `terraform state list`. If this is the first time running these commands, nothing will be displayed. After you have deployed the infrastructure, running this command will show you what is currently available. Don't be surprised if there is something already deployed: the state is shared, so you can see the infrastructure deployed by someone else and vice versa.
- Inspect the plan: `make tfplan`. This will show you what Terraform plans to do. In particular, check that you're deploying into the development environment (check the project name and URLs).
- Execute the plan: `make tfapply`. Terraform will ask for confirmation of the changes.
- Push your deployed changes to GitHub so others can use them if necessary: `git add profiles/deployment_context.devgen && git commit -m "Deployment configuration for <release>" && git push`
- Create a new profile which will define the deployment: `cp profiles/deployment_context.2202 profiles/deployment_context.devgen<release>`.
  - Update the release tag above, and change `2111` to match the most recent release number to minimise the number of changes we need to make.
- Update the configuration in the `devgen` file created above. See the description of these fields above (deploying to development section) if you're unfamiliar with them:
  - `config_release_name`
  - `config_dns_subdomain_prefix`
  - `config_vm_elastic_search_image`
  - `config_vm_clickhouse_image`
  - `config_vm_api_image_version`
  - `config_vm_webapp_release`
  - `DEVOPS_CONTEXT_PLATFORM_APP_CONFIG_API_URL`
- Activate the `production` profile: `make tfactivate profile=production`
- Set the remote backend (so multiple users can share state): `make tfbackendremote`
- Activate the deployment context you configured earlier: `make depactivate profile=<file you created earlier>`
- Download all dependencies: `make tfinit`
- Check for existing Terraform state (things that are already deployed): `terraform state list`. If this is the first time running these commands, nothing will be displayed. After you have deployed the infrastructure, running this command will show you what is currently available.
- Inspect the plan: `make tfplan`. This will show you what Terraform plans to do.
- Execute the plan: `make tfapply`. Terraform will ask for confirmation of the changes.
- Push your deployed changes to GitHub so others can use them if necessary: `git add profiles/deployment_context.devgen<release> && git commit -m "Deployment configuration for <release>" && git push`
This step assumes that you have generated/collected all of the data as specified in the "get all inputs and run the ot-geckopipe" recipe.
- If you don't have it already, clone the genetics output support repository.
- Update the variables under the heading `Variables for sync data` in the `config.tfvars` file.
- Run the shell command `make bigquerydev`.
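To confirm the sync worked, the resulting BigQuery datasets and tables can be inspected with the `bq` CLI; the dataset name below is hypothetical:

```bash
# list datasets in the dev project, then the tables in the (hypothetical) genetics dataset
bq ls --project_id=open-targets-genetics-dev
bq ls open-targets-genetics-dev:genetics
```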