Using project: `open-targets-genetics-dev`.
Currently the genetics team provides input files in a GCP bucket `gs://genetics-portal-dev-staging` (staging). Some of
these files are static, others are annotated with a date (variously YYMMDD and DDMMYY).
A subset of these files are then manually copied by the BE team to `gs://genetics-portal-dev-data` (dev) in a
bucket corresponding to the release.
The files in dev are used to run the pipeline, typically using Dataproc.
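Because the staging paths mix YYMMDD and DDMMYY, a six-digit token can often be read both ways. The sketch below is only an illustration of that ambiguity, not part of the pipeline: `interpret_date_token` is a hypothetical helper that returns every calendar-valid reading of a token so that a human can confirm which one is meant.

```python
from datetime import date, datetime

# The two six-digit layouts that the text says occur in staging paths.
DATE_FORMATS = {"YYMMDD": "%y%m%d", "DDMMYY": "%d%m%y"}


def interpret_date_token(token):
    """Return {layout: date} for every layout under which the token is a valid date."""
    if len(token) != 6 or not token.isdigit():
        raise ValueError(f"expected six digits, got {token!r}")
    readings = {}
    for label, fmt in DATE_FORMATS.items():
        try:
            readings[label] = datetime.strptime(token, fmt).date()
        except ValueError:
            # Not a valid calendar date under this layout, so skip it.
            continue
    return readings


if __name__ == "__main__":
    for token in ("220212", "320115"):
        print(token, interpret_date_token(token))
```

A token such as `220212` is valid under both layouts, which is why it is worth checking the date with the genetics team rather than inferring it from the path alone.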
Configuration field | Likely staging location | Standard dev location |
---|---|---|
variant-index.raw | provided by data team | /variant-annotation//variant-annotation.parquet |
ensembl.lut | generated by BE | /lut/homo_sapiens_core_105_38_genes.json.gz |
vep.homo-sapiens-cons-scores | should be in staging bucket | /lut/vep_consequences.tsv |
interval.path | v2g/interval/* | /v2g/interval/*/*/<date>/data.parquet |
qtl.path | v2g/qtl/<date>/ | v2g/qtl/<date> |
variant-disease.studies | v2d/<date>/studies.parquet | v2d/studies.parquet |
variant-disease.toploci | v2d/<date>/toploci.parquet | v2d/toploci.parquet |
variant-disease.finemapping | v2d/<date>/finemapping.parquet/ | v2d/finemapping.parquet |
variant-disease.ld | v2d/<date>/ld.parquet/ | v2d/ld.parquet |
variant-disease.overlapping | v2d/<date>/locus_overlap.parquet | v2d/locus_overlap.parquet |
variant-disease.coloc | coloc/<date>/coloc_processed_w_betas.parquet/ | v2d/coloc_processed_w_betas.parquet |
variant-disease.trait_efo | v2d/<date>/trait_efo-2021-11-16.parquet | v2d/trait_efo.parquet |
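The mapping in the table can be written down as data, which makes the staging-to-dev copy easier to review. This is a rough sketch only: `V2D_INPUTS` and `copy_command` are hypothetical names, only three rows of the table are included, and the `<release>/inputs/` layout of the dev bucket is an assumption inferred from the `gsutil cp` example elsewhere in this document. The real copy is done by `scripts/prepare_inputs.sh`.

```python
STAGING_BUCKET = "gs://genetics-portal-dev-staging"
DEV_BUCKET = "gs://genetics-portal-dev-data"

# Subset of the table: configuration field -> (staging path template, dev path).
V2D_INPUTS = {
    "variant-disease.studies": ("v2d/{date}/studies.parquet", "v2d/studies.parquet"),
    "variant-disease.toploci": ("v2d/{date}/toploci.parquet", "v2d/toploci.parquet"),
    "variant-disease.overlapping": ("v2d/{date}/locus_overlap.parquet", "v2d/locus_overlap.parquet"),
}


def copy_command(field, date, release):
    """Build (but do not run) the gsutil command that copies one input to dev."""
    staging_template, dev_path = V2D_INPUTS[field]
    source = f"{STAGING_BUCKET}/{staging_template.format(date=date)}"
    # Assumption: dev inputs live under <release>/inputs/.
    target = f"{DEV_BUCKET}/{release}/inputs/{dev_path}"
    return ["gsutil", "-m", "cp", "-r", source, target]


if __name__ == "__main__":
    print(" ".join(copy_command("variant-disease.studies", "220212", "22.03")))
```

Returning the argument list instead of executing it keeps the sketch side-effect free; the list can be printed for review or passed to `subprocess.run` once it has been checked.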
The variant index comes in parquet format from the data team after filtering the latest gnomAD release.
If there is no new update, keep using the most recent one. Currently, the variant annotation is version 190129.
The genetics team does not provide the Ensembl file: we have to download the data ourselves and generate the input.
This configuration field points to the latest reference gene table from Ensembl. To generate the file, follow the instructions from this script. For example:
python create_genes_dictionary.py -o "./" -z -n homo_sapiens_core_104_38
The above example uses Ensembl '104'. The most recent version is '105'. If the version has not changed since the previous release, feel free to copy the input file from the previous release's input directory.
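Only the Ensembl version number changes between releases, so the command can be derived from it. The helpers below are hypothetical and simply reassemble the command shown above; the `_genes.json.gz` suffix is an assumption taken from the dev location listed for `ensembl.lut`, not from the script itself.

```python
import shlex


def ensembl_core_name(version, assembly=38):
    """Name of the Ensembl core database, e.g. homo_sapiens_core_105_38."""
    return f"homo_sapiens_core_{version}_{assembly}"


def genes_dictionary_command(version, out_dir="./"):
    """Arguments for create_genes_dictionary.py, mirroring the example command."""
    return ["python", "create_genes_dictionary.py",
            "-o", out_dir, "-z", "-n", ensembl_core_name(version)]


def expected_lut_filename(version):
    # Assumption: the output file is named as in the ensembl.lut table row.
    return f"{ensembl_core_name(version)}_genes.json.gz"


if __name__ == "__main__":
    print(shlex.join(genes_dictionary_command(105)))
    print(expected_lut_filename(105))
```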
The VEP consequences TSV file (`vep.homo-sapiens-cons-scores`) is provided by the genetics team. If the file is not present in the staging bucket, ask the genetics team for the most recent version.
The interval files (`interval.path`) are provided by the genetics team: these are mainly static and haven't been updated for years. They are in a nested file structure which must be preserved because the ETL uses the file path as an input.
Provided by the genetics team: these are updated on a regular basis.
We need a VM to run deployments from. Typically this only needs to be done once and then we can use the machine for future releases.
# install dependencies
sudo apt-get install -y apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754
echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee \
/etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo apt-get install -y clickhouse-client
sudo apt install -y git \
  tmux tree wget htop \
  libgl1-mesa-glx libegl1-mesa libxrandr2 libxss1 libxcursor1 libxcomposite1 libasound2 libxi6 libxtst6
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -p $HOME/miniconda
source ~/.bashrc
# get repositories
git clone https://github.com/opentargets/genetics-backend.git
git clone https://github.com/opentargets/genetics-pipe.git
# set up conda environments
cd genetics-backend && conda env create -f environment.yaml
conda activate backend-genetics
# add elastic-search loader
# https://github.com/moshe/elasticsearch_loader
pip install elasticsearch-loader
cd loaders/clickhouse
Use the VM called `gp-deploy` in the `open-targets-genetics-dev` project. The VM is preconfigured with the necessary
utilities to run a release.
- Start the deployment machine:
  gcloud compute instances start "jb-release" --project "open-targets-genetics-dev" --zone "europe-west4-a"
- SSH into the deployment machine:
  gcloud compute ssh --zone "europe-west4-a" "jb-release" --tunnel-through-iap --project "open-targets-genetics-dev"
- If not already done, clone the required repository:
  git clone git@github.com:opentargets/genetics-backend.git
- Set up the environment:
  conda activate backend-genetics
- Update the Ensembl version (latest 106, Apr 22) and run the script from `genetics-backend/makeLUTs`:
  python create_genes_dictionary.py -o "./" -z -n homo_sapiens_core_106_38
- Add the Ensembl file to the bucket:
  gsutil cp -n homo_sapiens* gs://genetics-portal-dev-data/22.03/inputs/lut/
- Update the variables in the bash script `/scripts/prepare_inputs.sh` (input script).
- Run the input script in the VM to move files from the staging to the dev buckets.
  - Most of the inputs are used for the pipeline, but there are two static datasets which are copied: sumstats (`sa`) and `v2d_credset`.
  - It is useful to pipe the STDOUT of the script to a file which can be provided to the genetics/data team to
    confirm that the correct files were used.
    ./scripts/prepare_inputs.sh >> genetics_input_log.txt
- Create a configuration file for the release in `config`:
  cp src/main/resources/application.conf config/<release>.conf
  and update as necessary.
- Run genetics-pipe. There are two options here: you can use either a Dataproc workflow (requires Scala) or
  bash scripts. The former is easier.
  - Workflow option: open the worksheet `scripts/dataproc-workflow.sc`, update the top-level variables (should only be the input and output directories) and run. You can terminate the worksheet on your local machine once it has started since Dataproc will run in the background. The advantage of using the workflow is that Dataproc will create the specified cluster, run the steps in the right order, then destroy the cluster without the need for any manual intervention.
  - Script option:
    - Update the top-level variables in `scripts/run_cluster.sh`: `release` and `config` should be the only changes necessary.
    - Run the script `scripts/run_cluster.sh` from the root directory. This script builds a jar file, pushes it to Google Cloud Storage, starts a cluster and runs all steps. Some of the jobs will fail because of missing dependencies. Consult `documentation/step_dependencies` for the correct order.
      - In general, run in the following phases (some steps can be run concurrently):
        - variant-index (30min), variant-gene (180min)
        - dictionaries, variant-disease (2min), variant-disease-coloc (2min)
        - disease-variant-gene (25min)
        - scored datasets (130min)
        - manhattan (25min) (run this only after the L2G steps that follow)
- Inform the genetics team that the outputs are ready; they will run the ML pipeline to generate the `l2g` outputs. The file we need for the final step (`manhattan`) is typically found under `genetics-portal-dev-staging/l2g/<date>/predictions/l2g.full.220128.parquet` in the staging area.
- Copy the L2G file from the staging area to the development area (updating dates as necessary):
  gsutil -m cp -r gs://genetics-portal-dev-staging/l2g/220212/predictions/l2g.full.220212.parquet/part-* gs://genetics-portal-dev-data/22.03/outputs/l2g/
- Run the `manhattan` step using either the scripts or the workflow `scripts/dataproc-workflow-manhattan.sc`. Note that the workflow assumes all prior steps have been completed and the inputs are available.
- Check that all the expected output directories are present using the Ammonite script: `amm scripts/check_outputs.sc`.
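For planning purposes, the phase timings listed above can be added up. The sketch below is an estimate only: `PHASES` just restates the list, the step labels are copied as written rather than being confirmed step names, `dictionaries` has no stated duration and is counted as zero, and the wait for the genetics team to produce the L2G file before `manhattan` is not included.

```python
# Phases as listed above: step label -> approximate minutes (None = not stated).
PHASES = [
    {"variant-index": 30, "variant-gene": 180},
    {"dictionaries": None, "variant-disease": 2, "variant-disease-coloc": 2},
    {"disease-variant-gene": 25},
    {"scored datasets": 130},
    {"manhattan": 25},
]


def sequential_minutes(phases):
    """Total time if every step runs one after another."""
    return sum(minutes or 0 for phase in phases for minutes in phase.values())


def concurrent_minutes(phases):
    """Total time if the steps within each phase run concurrently."""
    return sum(max(minutes or 0 for minutes in phase.values()) for phase in phases)


if __name__ == "__main__":
    print("sequential:", sequential_minutes(PHASES), "min")
    print("concurrent within phases:", concurrent_minutes(PHASES), "min")
```

Under these assumptions the compute time is roughly six to six and a half hours, excluding cluster start-up and the L2G hand-off.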
- Using the genetics backend project, start two VMs, one each for ES and Clickhouse, using the helper scripts
  `infrastructure/gcp/genetics/create-clickhouse-node.sh` and `infrastructure/gcp/genetics/create-elasticsearch-node.sh`.
- Export variables for the two created VMs (bind the internal GCP IP address; this assumes you're in a GCP VM yourself):
  export ES_HOST=$(gcloud compute instances list | grep -i run | grep elasticsearch | awk '{ print $4 }' | tail -1)
  export CLICKHOUSE_HOST=$(gcloud compute instances list | grep -i run | grep clickhouse | awk '{ print $4 }' | tail -1)
- Activate the correct Python environment:
  conda activate backend-genetics
- Run the script `loaders/clickhouse/create_and_load_everything_from_scratch.sh` in the `genetics-backend`
  repository, providing a link to the input files.
  - There can be a short delay while the instances start up and complete their installations of ES and CH. You can
    test if they are ready by running `curl $ES_HOST:9200` and `curl $CLICKHOUSE_HOST:8123`, which should both return a non-error response.
  - Note this process is slow: ~17 hours!
  ./create_and_load_everything_from_scratch.sh gs://genetics-portal-dev-data/22.01.2/outputs
- Once loading is complete, 'bake' the instances so that we can deploy the images using Terraform.
  - Find the latest running instance (use `grep elasticsearch` instead of `grep clickhouse` for the Elasticsearch node):
    gcloud compute instances list --project=open-targets-genetics-dev | grep -i run | grep clickhouse | awk '{ print $1 }' | tail -1
  - Bake the image using the scripts in `genetics-backend/gcp/bake_[es|ch]_node.sh` with the instance found above. These create disk images which we can deploy using the Terraform defined in the genetics terraform repo.
    - For example:
      ./bake_ch_node.sh $(gcloud compute instances list --project=open-targets-genetics-dev | grep -i run | grep clickhouse | awk '{ print $1 }' | tail -1)
      ./bake_es_node.sh $(gcloud compute instances list --project=open-targets-genetics-dev | grep -i run | grep elasticsearch | awk '{ print $1 }' | tail -1)
- You should check the size of the images and the counts in ES and Clickhouse to get an idea of whether there were any problems in loading the data.
  - Clickhouse:
    - SSH into the instance:
      gcloud compute ssh --zone "europe-west1-c" "devgen2202-ch-11-clickhouse-gc34" --tunnel-through-iap --project "open-targets-genetics-dev" -- -L 8123:localhost:8123
    - Execute the following command (using either clickhouse-client or another DB manager) to get counts:
SELECT table,
sum(rows) as rows,
formatReadableSize(sum(bytes)) as size
FROM system.parts
WHERE active
GROUP BY table;
The database in the 22.02 release shows:
┌─table──────────────────┬───────rows─┬─size───────┐
│ genes │ 19569 │ 3.59 MiB │
│ studies │ 50719 │ 2.08 MiB │
│ variants │ 72858944 │ 5.41 GiB │
│ v2d_by_stchr │ 20488888 │ 323.71 MiB │
│ v2d_sa_gwas │ 582828390 │ 29.51 GiB │
│ v2g_structure │ 9 │ 3.40 KiB │
│ v2d_coloc │ 4458533 │ 306.86 MiB │
│ l2g_by_gsl │ 3580861 │ 155.29 MiB │
│ v2d_credset │ 38834105 │ 1.34 GiB │
│ v2d_by_chrpos │ 20488888 │ 414.83 MiB │
│ manhattan │ 279116 │ 44.22 MiB │
│ v2g_scored │ 1030927072 │ 20.09 GiB │
│ d2v2g_scored │ 1658712886 │ 41.05 GiB │
│ studies_overlap │ 14570115 │ 154.52 MiB │
│ l2g_by_slg │ 3580861 │ 168.87 MiB │
│ v2d_sa_molecular_trait │ 442006706 │ 14.63 GiB │
└────────────────────────┴────────────┴────────────┘
As far as I know, we would not expect order-of-magnitude changes in these counts from one release to the next.
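That expectation can be turned into a quick check. The sketch below assumes you have collected the new row counts into a dictionary yourself (for example from the query above); `order_of_magnitude_changes` is a hypothetical helper, and the factor of 10 is just one reading of "order of magnitude". The baseline numbers are the 22.02 counts from the table.

```python
# Row counts from the 22.02 release table.
BASELINE_22_02 = {
    "genes": 19569, "studies": 50719, "variants": 72858944,
    "v2d_by_stchr": 20488888, "v2d_sa_gwas": 582828390, "v2g_structure": 9,
    "v2d_coloc": 4458533, "l2g_by_gsl": 3580861, "v2d_credset": 38834105,
    "v2d_by_chrpos": 20488888, "manhattan": 279116, "v2g_scored": 1030927072,
    "d2v2g_scored": 1658712886, "studies_overlap": 14570115,
    "l2g_by_slg": 3580861, "v2d_sa_molecular_trait": 442006706,
}


def order_of_magnitude_changes(baseline, current, factor=10.0):
    """Return {table: ratio} for tables that grew or shrank by at least `factor`.

    Tables missing from `current` are reported with a ratio of None.
    Tables that only exist in `current` are ignored.
    """
    flagged = {}
    for table, old_rows in baseline.items():
        new_rows = current.get(table)
        if new_rows is None:
            flagged[table] = None
            continue
        if old_rows == 0:
            continue  # no meaningful ratio for an empty baseline table
        ratio = new_rows / old_rows
        if ratio >= factor or ratio <= 1 / factor:
            flagged[table] = ratio
    return flagged


if __name__ == "__main__":
    # Hypothetical new release: one table collapsed and one was not loaded.
    new_counts = dict(BASELINE_22_02, variants=7_000_000)
    del new_counts["manhattan"]
    print(order_of_magnitude_changes(BASELINE_22_02, new_counts))
```

Anything the function flags is a prompt to look at the loader logs, not proof that the load failed.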
Using the genetics terraform repository:
For this, use the master branch and remember to pull changes from the remote before making your own changes.
- Create a new profile which will define the deployment.
  cp profiles/deployment_context.devgen2111 profiles/deployment_context.devgen<release>
  - Update the release tag above, and change `2111` to match the most recent release number to minimise the number of changes we need to make.
- Update the configuration in the `devgen` file created above. To see what fields are often changed you can look at the difference between previous releases with the command `diff deployment_context.devgen2111 deployment_context.devgen2106`. Fields that typically need updating:
  - `config_release_name`: matches the context file name suffix
  - `config_dns_subdomain_prefix`: same as `config_release_name`
  - `config_vm_elastic_search_image`: image you baked earlier
  - `config_vm_clickhouse_image`: image you baked earlier
  - `config_vm_api_image_version`: latest API. From the API repository run `git checkout master && git pull && git tag --list` to see options. It's typically the last one.
  - `config_vm_webapp_release`: this will be the latest tagged version of the web app
  - `DEVOPS_CONTEXT_PLATFORM_APP_CONFIG_API_URL`: update the URL to include `config_release_name`.
- Activate the `xyz` profile: `make tfactivate profile=xyz`
- Set the remote backend (so multiple users can share state): `make tfbackendremote`
- Activate the deployment context you configured earlier: `make depactivate profile=devgen<release>`
- Download all dependencies: `make tfinit`
- Check for existing Terraform state (things that are already deployed): `terraform state list`. If this is the first time running these commands nothing will be displayed. After you have deployed the infrastructure, running this command will show you what is currently available.
- Inspect the plan: `make tfplan`. This will show you what Terraform plans to do.
- Execute the plan: `make tfapply`. Terraform will ask for confirmation of the changes.
- Push your deployed changes to GitHub so others can use them if necessary:
  git add profiles/deployment_context.devgen<release> && git commit -m "Deployment configuration for <release>" && git push
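When preparing a new deployment context it can help to see which fields changed between two earlier profiles, which is what the `diff` suggestion above is for. The sketch below does the same comparison per field. It assumes, without having checked the profile files, that each setting sits on its own `key=value` line; `parse_profile` and `changed_fields` are hypothetical helpers and the sample values are invented for illustration.

```python
def parse_profile(text):
    """Parse key=value lines, ignoring blanks, comments and anything else."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def changed_fields(old_text, new_text):
    """Return {field: (old value, new value)} for every field that differs."""
    old, new = parse_profile(old_text), parse_profile(new_text)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(old.keys() | new.keys())
        if old.get(key) != new.get(key)
    }


if __name__ == "__main__":
    # Invented sample content; real profiles may use a different layout.
    profile_2106 = "config_release_name=devgen2106\nconfig_dns_subdomain_prefix=devgen2106\n"
    profile_2111 = "config_release_name=devgen2111\nconfig_dns_subdomain_prefix=devgen2111\n"
    for field, (before, after) in changed_fields(profile_2106, profile_2111).items():
        print(f"{field}: {before} -> {after}")
```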
This step assumes that you have generated/collected all of the data as specified in the "get all inputs and run the ot-geckopipe" recipe.
- If you don't have it already, clone the genetics output support repository.
- Update the variables under the heading `Variables for sync data` in the `config.tfvars` file.
- Run the shell command `make bigquerydev`.
Configuration field | Staging location (raw data from point of view of data joining) | Dev location for running data-joining | Notes |
---|---|---|---|
variant-index.raw | provided by data team | /variant-annotation//variant-annotation.parquet | Copied from release to release, not updated since 2019 |
ensembl.lut | generated by BE | /lut/homo_sapiens_core_105_38_genes.json.gz | This will be deprecated once we can use the Target Index from the ETL |
vep.homo-sapiens-cons-scores | recycled from previous release | /lut/vep_consequences.tsv | Copied from previous release |
interval.path | v2g/interval/* | /v2g/interval/*/*/*/data.parquet | Effectively static as we don't regenerate it. This has one of those annoying name.parquet components, but it's heavily nested and we can read it from a few levels higher with a wildcard. |
qtl.path | v2g/qtl/YYMMDD/ | v2g/qtl/ | |
variant-gene.weights | carried over from previous release | lut/v2g_scoring_source_weights.date.json | Copied from previous release: will be moved into ETL config in future |
variant-disease.studies | v2d/YYMMDD/studies.parquet | v2d/studies.parquet | Single file |
variant-disease.toploci | v2d/YYMMDD/toploci.parquet | v2d/toploci.parquet | Single file |
variant-disease.finemapping | v2d/YYMMDD/finemapping.parquet/ | v2d/finemapping | We want the input renamed to get rid of the '.parquet' component |
variant-disease.ld | v2d/YYMMDD/ld.parquet/ | v2d/ld.parquet | We want the input renamed to get rid of the '.parquet' component |
variant-disease.overlapping | v2d/YYMMDD/locus_overlap.parquet | v2d/locus_overlap.parquet | Single file |
variant-disease.coloc | coloc/YYMMDD/coloc_processed_w_betas.parquet/ | v2d/coloc_processed_w_betas.parquet | |
variant-disease.trait_efo | v2d/YYMMDD/trait_efo-2021-11-16.parquet | v2d/trait_efo.parquet | We want 'trait_efo' to not have the embedded date, as that is in the file path |
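Two of the notes in the table describe renames: dropping the embedded date from `trait_efo` and dropping the `.parquet` component from directory-style inputs. A minimal sketch of those two rules follows; the function names are hypothetical and the date pattern assumes the `-YYYY-MM-DD` form seen in `trait_efo-2021-11-16.parquet`.

```python
import re

# Matches an embedded "-YYYY-MM-DD" that sits right before the extension or the end.
_EMBEDDED_DATE = re.compile(r"-\d{4}-\d{2}-\d{2}(?=\.|$)")


def strip_embedded_date(name):
    """trait_efo-2021-11-16.parquet -> trait_efo.parquet"""
    return _EMBEDDED_DATE.sub("", name)


def strip_parquet_component(name):
    """finemapping.parquet/ -> finemapping (other names are returned unchanged)."""
    name = name.rstrip("/")
    suffix = ".parquet"
    return name[: -len(suffix)] if name.endswith(suffix) else name


if __name__ == "__main__":
    print(strip_embedded_date("trait_efo-2021-11-16.parquet"))
    print(strip_parquet_component("finemapping.parquet/"))
```

Which inputs each rule applies to is decided by the table, not by the code; the helpers only perform the string change.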