Open Targets internal deployment guide

Overview

Using project: open-targets-genetics-dev.

Currently the genetics team provides input files in the GCP bucket gs://genetics-portal-dev-staging (staging). Some of these files are static, others are annotated with a date (variously YYMMDD and DDMMYY).

A subset of these files is then manually copied by the BE team to gs://genetics-portal-dev-data (dev), into a directory corresponding to the release.

The files in dev are used to run the pipeline, typically using Dataproc.
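
As an illustration, copying one dated input from staging into a release's inputs directory looks roughly like this (the release number and date are placeholders, not real values):

# copy a dated v2d input from staging to the dev release's inputs directory
# (release "22.09" and date "220908" are illustrative)
gsutil -m cp -r gs://genetics-portal-dev-staging/v2d/220908/studies.parquet gs://genetics-portal-dev-data/22.09/inputs/v2d/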

| Configuration field | Likely staging location | Notes |
| --- | --- | --- |
| variant-index.raw | *provided by data team* | Rarely regenerated, we're still using the 2019 files. |
| ensembl.lut | *generated by BE* | See Ensembl section |
| vep.homo-sapiens-cons-scores | *should be in staging bucket* | Typically just use the file from the last release: /lut/vep_consequences.tsv |
| interval.path | v2g/interval/* | |
| qtl.path | v2g/qtl/<date>/ | |
| variant-disease.studies | v2d/<date>/studies.parquet | |
| variant-disease.toploci | v2d/<date>/toploci.parquet | |
| variant-disease.finemapping | v2d/<date>/finemapping.parquet/ | |
| variant-disease.ld | v2d/<date>/ld.parquet/ | |
| variant-disease.overlapping | v2d/<date>/locus_overlap.parquet | |
| variant-disease.coloc | coloc/<date>/coloc_processed_w_betas.parquet/ | |
| variant-disease.trait_efo | v2d/<date>/trait_efo-2021-11-16.parquet | |

Variant index

The variant index is provided in Parquet format by the data team, generated by filtering the latest gnomAD release.

If there is no new update, keep using the previous version. Currently, the variant annotation is version 190129.

Ensembl

The genetics team does not provide the Ensembl file: we have to download the data and generate the input ourselves.

This input is the latest reference gene table (LUT) from Ensembl. To generate it, follow the instructions for the create_genes_dictionary.py script in the genetics-backend repository (makeLUTs directory); an example command:

python create_genes_dictionary.py -o "./" -z -n homo_sapiens_core_104_38

The above example uses Ensembl release 104. The most recent version is 107. If the Ensembl version has not changed since the previous release, feel free to copy the input file from the previous release's inputs directory, as sketched below.
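
A minimal sketch of reusing the previous release's LUT (the release directories are placeholders):

# copy the Ensembl LUT files from the previous release's inputs to the new release
# (release directories "22.06" and "22.09" are illustrative)
gsutil cp gs://genetics-portal-dev-data/22.06/inputs/lut/homo_sapiens* gs://genetics-portal-dev-data/22.09/inputs/lut/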

VEP consequences

The TSV file is provided by the genetics team. If the file is not present in the staging bucket, ask the genetics team for the most recent version.

Interval

Provided by the genetics team: these are mainly static and haven't been updated for years. They are in a nested file structure which must be preserved because the ETL uses the file path as an input.
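
Because the ETL uses the file path as an input, copy the interval data recursively so the nested structure is kept intact (a sketch; the release directory is a placeholder):

# recursive copy preserves the nested v2g/interval/... structure the ETL expects
# (release "22.09" is illustrative)
gsutil -m cp -r gs://genetics-portal-dev-staging/v2g/interval gs://genetics-portal-dev-data/22.09/inputs/v2g/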

QTL

Provided by the genetics team: these are updated on a regular basis.

Recipe: set up machine

The following script should install the necessary dependencies to generate the Ensembl LUT:

# install dependencies
sudo apt-get update
sudo apt install -y git tmux wget

# install Miniconda non-interactively (-b) into $HOME/miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda

# make conda available in the current shell (the batch installer does not modify ~/.bashrc)
source $HOME/miniconda/etc/profile.d/conda.sh

# get repositories
git clone https://github.com/opentargets/genetics-backend.git

# set up conda environments
cd genetics-backend && conda env create -f environment.yaml
conda activate backend-genetics

# create the Ensembl LUT (the script lives in the makeLUTs directory)
cd makeLUTs
python create_genes_dictionary.py -o "./" -z -n homo_sapiens_core_107_38

Recipe: get all inputs and run the ot-geckopipe

Use the VM called gp-deploy in the open-targets-genetics-dev project. The VM is preconfigured with the necessary utilities to run a release.

  • start deployment machine: gcloud compute instances start "jb-release" --project "open-targets-genetics-dev" --zone "europe-west4-a"
  • SSH into deployment machine: gcloud compute ssh --zone "europe-west4-a" "jb-release" --tunnel-through-iap --project "open-targets-genetics-dev"
  • If not already done, clone required repository: git clone git@github.com:opentargets/genetics-backend.git
  • set up environment: conda activate backend-genetics
  • update Ensembl version (latest 106 Apr 22) and run script from genetics-backend/makeLUTs:
    • python create_genes_dictionary.py -o "./" -z -n homo_sapiens_core_106_38
  • add ensembl file to bucket gsutil cp -n homo_sapiens* gs://genetics-portal-dev-data/22.03/inputs/lut/
  • update variables in bash script in /scripts/prepare_inputs.sh (input script)
  • run input script in VM to move files from staging to dev buckets
    • Most of the inputs are used for the pipeline, but there are two static datasets which are also copied: sumstats (sa) and v2d_credset.
    • It's useful to pipe the STDOUT of the script to a file, which can be provided to the genetics/data team to confirm the correct files were used: ./scripts/prepare_inputs.sh >> genetics_input_log.txt
  • create a configuration file for release in config:
    • cp src/main/resources/application.conf config/<release>.conf and update as necessary.
  • Run genetics-pipe. There are two options here: a Dataproc workflow (requires Scala) or bash scripts. The former is easier.
    • Workflow option: Open the worksheet scripts/dataproc-workflow.sc, update top level variables (should only be the input and output directories) and run. You can terminate the worksheet on your local machine once it has started since Dataproc will run in the background. The advantage of using the workflow is that Dataproc will create the specified cluster, run the steps in the right order, then destroy the cluster without the need for any manual intervention.
    • Script option:
      • update top level variables in scripts/run_cluster.sh: release and config should be the only changes necessary.
      • run script scripts/run_cluster.sh from root directory. This script builds a jar file, pushes it to GS storage, starts a cluster and runs all steps. Some of the jobs will fail because of missing dependencies. Consult documentation/step_dependencies for the correct order.
        • In general run in the following phases (some steps can be run concurrently):
          • variant-index (30m), variant-gene (180min)
          • dictionaries, variant-disease (2min), variant-disease-coloc (2min)
          • disease-variant-gene (25min)
          • scored datasets (130min)
          • manhattan (25min) (run this only after the L2G steps below)
  • inform genetics team that the outputs are ready, and they will run the ML pipeline to generate the l2g outputs. The file we need for the final step (manhattan) is typically found under genetics-portal-dev-staging/l2g/<date>/predictions/l2g.full.220128.parquet in the staging area.
  • Copy L2G file from the staging area to the development area (updating dates as necessary): gsutil -m cp -r gs://genetics-portal-dev-staging/l2g/220908/predictions/l2g.full.220908.parquet/part-* gs://genetics-portal-dev-data/22.09.1/outputs/l2g/
  • Run the manhattan step using either the scripts or the workflow scripts/dataproc-workflow-manhattan.sc. Note that the workflow assumes all prior steps have been completed and the inputs are available.
  • Check all the expected output directories are present using the ammonite script amm scripts/check_outputs.sc.
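
If you want a quick manual look at the outputs in addition to the ammonite check, something like the following works (a sketch; the release path and the list of directory names are illustrative, not authoritative):

# list a few entries from each expected output directory
# (release "22.09.1" and the directory names are placeholders)
for d in lut v2d v2g d2v2g l2g manhattan; do
  echo "== ${d} =="
  gsutil ls "gs://genetics-portal-dev-data/22.09.1/outputs/${d}/" | head -n 3
done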

Recipe: Rerun ETL only pipeline

Use case: An issue was identified with the data and the ETL needs to be rerun, but we don't want to recreate the inputs / move static files around. Often we're just updating a single input file, but the change will propagate between steps so we need to rerun the whole ETL.

  • Delete ETL outputs (non-static files) using amm scripts/delete_etl_outputs
  • Update ETL configuration with new value
  • Push configuration to correct bucket
  • Update workflow with new configuration file (if necessary)
  • Execute workflow
  • Recreate infrastructure
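
A sketch of what this sequence might look like on the deployment VM (the delete script name is taken from the list above; the config destination path and running the workflow worksheet with amm are assumptions):

# 1. delete the non-static ETL outputs
amm scripts/delete_etl_outputs

# 2-3. edit the release configuration locally, then push it to the release bucket
#      (destination path is illustrative)
gsutil cp config/22.09.1.conf gs://genetics-portal-dev-data/22.09.1/conf/

# 4-5. update the workflow's config reference if needed, then execute the workflow
amm scripts/dataproc-workflow.sc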

Recipe: create infrastructure

  • Clone the genetics output support repository (if necessary)
  • Update the values in ...
  • Execute make disk: This process starts a VM and loads all the data created by the ETL along with the static files. It creates two disk images which can be used to start the Elasticsearch and Clickhouse instances needed for the web presence.
  • Wait for the process to complete (~3 hours) and then execute cd terraform_create_images && terraform apply -destroy -auto-approve to shut down the image generation infrastructure.

Sanity checks

  • You should check the size of the images and counts in ES and Clickhouse to get an idea of whether there were any problems in loading the data.
  • Clickhouse:
    • SSH into image: gcloud compute ssh --zone "europe-west1-c" "devgen2202-ch-11-clickhouse-gc34" --tunnel-through-iap --project "open-targets-genetics-dev" -- -L 8123:localhost:8123
    • Execute the following command (using either Clickhouse-client or another DB manager) to get counts:
SELECT table,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes)) AS size,
    round(log10(rows), 2) AS row_orderMagnitude
FROM system.parts
WHERE active AND (table NOT ILIKE '%_log') -- exclude system tables
GROUP BY table;

The database in the 22.02 release shows:

┌─table──────────────────┬───────rows─┬─size───────┐
│ genes                  │      19569 │ 3.59 MiB   │
│ studies                │      50719 │ 2.08 MiB   │
│ variants               │   72858944 │ 5.41 GiB   │
│ v2d_by_stchr           │   20488888 │ 323.71 MiB │
│ v2d_sa_gwas            │  582828390 │ 29.51 GiB  │
│ v2g_structure          │          9 │ 3.40 KiB   │
│ v2d_coloc              │    4458533 │ 306.86 MiB │
│ l2g_by_gsl             │    3580861 │ 155.29 MiB │
│ v2d_credset            │   38834105 │ 1.34 GiB   │
│ v2d_by_chrpos          │   20488888 │ 414.83 MiB │
│ manhattan              │     279116 │ 44.22 MiB  │
│ v2g_scored             │ 1030927072 │ 20.09 GiB  │
│ d2v2g_scored           │ 1658712886 │ 41.05 GiB  │
│ studies_overlap        │   14570115 │ 154.52 MiB │
│ l2g_by_slg             │    3580861 │ 168.87 MiB │
│ v2d_sa_molecular_trait │  442006706 │ 14.63 GiB  │
└────────────────────────┴────────────┴────────────┘

As far as I know, we would not expect order-of-magnitude changes between releases.
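
For the Elasticsearch image a similar check can be done over an SSH tunnel (a sketch; the instance name follows the same naming pattern as the Clickhouse example and is an assumption):

# SSH into the Elasticsearch VM with a port-forward (instance name is illustrative)
gcloud compute ssh --zone "europe-west1-c" "devgen2202-es-elasticsearch" --tunnel-through-iap --project "open-targets-genetics-dev" -- -L 9200:localhost:9200

# list indices with document counts and on-disk size
curl -s 'http://localhost:9200/_cat/indices?v&s=index'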

Updating Terraform

Using the genetics terraform repository:

For this, use the master branch, and remember to pull changes from the remote before making your changes.

Updating the development deployment

Note, commands are given relative to the repository root directory.

  • Update the configuration in the deployment_context.devgen file. To see what fields are often changed you can look at the difference between previous releases with the command diff deployment_context.devgen2202 deployment_context.devgen2111. Fields that typically always need updating:
    • config_release_name: matches the context file name suffix
    • config_dns_subdomain_prefix: same as config_release_name
    • config_vm_elastic_search_disk_name: Disk image you created earlier in the create infrastructure recipe
    • config_vm_clickhouse_disk_name: Disk image you created earlier in the create infrastructure recipe
    • config_vm_api_image_version: latest API version. From the API repository run git checkout master && git pull && git tag --list to see options. It's typically the most recent tag.
    • config_vm_webapp_release: this will be the latest tagged version of the web app
    • DEVOPS_CONTEXT_PLATFORM_APP_CONFIG_API_URL: update URL to include config_release_name.
  • Activate devgen profile
    • make tfactivate profile=devgen
  • Set remote backend (so multiple users can share state)
    • make tfbackendremote
  • Activate the deployment context you configured earlier.
    • make depactivate profile=devgen
  • Download all dependencies
    • make tfinit
  • Check for existing Terraform state (things that are already deployed)
    • terraform state list. If this is the first time running these commands nothing will be displayed. After you have deployed the infrastructure running this command will show you what is currently available. Don't be surprised if there is something already deployed, as the state is shared, so that you can see the infrastructure deployed by someone else and vice-versa.
  • Inspect the plan: make tfplan. This will show you what Terraform plans to do. Especially check that you're deploying into the development environment (check project name and URLs)
  • Execute the plan: make tfapply. Terraform will ask for confirmation of the changes.
  • Push your deployed changes to github so others can use them if necessary: git add profiles/deployment_context.devgen && git commit -m "Deployment configuration for <release>" && git push
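
Put together, the development deployment boils down to the following sequence (commands as listed above, run from the repository root):

# activate the devgen profile and the shared remote state
make tfactivate profile=devgen
make tfbackendremote

# activate the deployment context configured above and download dependencies
make depactivate profile=devgen
make tfinit

# inspect existing state, review the plan, then apply
terraform state list
make tfplan
make tfapply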

Updating the production environment

  • Create a new profile which will define the deployment.
    • cp profiles/deployment_context.2202 profiles/deployment_context.devgen<release>.
    • Update the release tag above, and use the most recent previous release as the source of the copy to minimise the number of changes we need to make.
  • Update the configuration in the devgen file created above. See a description of these fields above (deploying to development section) if you're unfamiliar with them:
    • config_release_name
    • config_dns_subdomain_prefix
    • config_vm_elastic_search_image
    • config_vm_clickhouse_image
    • config_vm_api_image_version
    • config_vm_webapp_release
    • DEVOPS_CONTEXT_PLATFORM_APP_CONFIG_API_URL
  • Activate production profile
    • make tfactivate profile=production
  • Set remote backend (so multiple users can share state)
    • make tfbackendremote
  • Activate the deployment context you configured earlier.
    • make depactivate profile=<file you created earlier>
  • Download all dependencies
    • make tfinit
  • Check for existing Terraform state (things that are already deployed)
    • terraform state list. If this is the first time running these commands nothing will be displayed. After you have deployed the infrastructure running this command will show you what is currently available.
  • Inspect the plan: make tfplan. This will show you what Terraform plans to do
  • Execute the plan: make tfapply. Terraform will ask for confirmation of the changes.
  • Push your deployed changes to github so others can use them if necessary: git add profiles/deployment_context.devgen<release> && git commit -m "Deployment configuration for <release>" && git push

Recipe: BigQuery

This step assumes that you have generated/collected all of the data as specified in the "get all inputs and run the ot-geckopipe" recipe.

  • If you don't have it already, clone the genetics output support repository
  • Update the variables under the heading Variables for sync data in the config.tfvars file.
  • Run the shell command make bigquerydev.
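
To verify the sync afterwards you can list the datasets and tables in the project (a sketch; the dataset name is a placeholder, not the real one):

# list datasets in the project, then the tables in one dataset
# (dataset name "genetics" is illustrative)
bq ls --project_id=open-targets-genetics-dev
bq ls --project_id=open-targets-genetics-dev genetics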