Google Cloud Platform services are available in many locations across the globe. You can minimize network latency and network transport costs by running your Dataflow job in the same region where its input bucket, output dataset, and temporary directory are located. More specifically, to run Variant Transforms most efficiently, make sure all of the following resources are located in the same region (a quick way to check the locations is sketched after the list):
- Your source bucket, set by the `--input_pattern` flag.
- Your pipeline's temporary location, set by the `--temp_location` flag.
- Your output BigQuery dataset, set by the `--output_table` flag.
- Your Dataflow pipeline, set by the `--region` flag.
- Your Life Sciences API location, set by the `--location` flag.
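For example, you can confirm that the bucket and dataset locations match before launching a job. The bucket and dataset names below are placeholders; substitute your own values:

```bash
# Check the region of the input/temp bucket (placeholder bucket name).
gsutil ls -L -b gs://my-genomics-bucket | grep "Location constraint"

# Check the location of the output BigQuery dataset (placeholder dataset name).
bq show --format=prettyjson "${GOOGLE_CLOUD_PROJECT}:my_genomics_dataset" | grep '"location"'
```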
The Dataflow API requires setting a GCP region via the `--region` flag to run.
When running from Docker, the Cloud Life Sciences API is used to spin up a worker that launches and monitors the Dataflow job. The Cloud Life Sciences API is a regionalized service that runs in multiple regions; this is set with the `--location` flag. The Life Sciences API location is where metadata about the pipeline's progress will be stored, and it can be different from the region where the data is processed. Note that the Cloud Life Sciences API is not available in all regions, and if this flag is left out, the metadata will be stored in us-central1. See the list of Currently Available Locations.
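If you prefer to check the available locations programmatically, the API exposes a standard locations listing endpoint. The sketch below assumes the v2beta REST endpoint and uses a placeholder project ID:

```bash
# List the locations where the Cloud Life Sciences API is available
# (PROJECT_ID is a placeholder; the v2beta endpoint is assumed).
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://lifesciences.googleapis.com/v2beta/projects/PROJECT_ID/locations"
```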
In addition to this requirement, you might also choose to run Variant Transforms in a specific region to follow your project's security and compliance requirements. For example, to restrict your processing job to europe-west4 (Netherlands), set the region and location as follows:
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...
docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
--region europe-west4 \
--location europe-west4 \
--temp_location "${TEMP_LOCATION}" \
"${COMMAND}"
Note that the values of the `--project`, `--region`, and `--temp_location` flags will be automatically passed as `COMMAND` inputs in `pipelines_runner.sh`.
Instead of setting the `--region` flag for each run, you can set your default region using the following command. In that case, you will not need to set the `--region` flag any more. For more information, please refer to the Cloud SDK page.

```bash
gcloud config set compute/region "europe-west1"
```
Similarly, you can set the default project using the following command:

```bash
gcloud config set project GOOGLE_CLOUD_PROJECT
```
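To confirm which defaults are currently in effect, you can inspect your active configuration:

```bash
# Show the active project and compute/region defaults.
gcloud config list
```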
If you are running Variant Transforms from GitHub, you need to specify all three required Dataflow inputs as below.
```bash
python3 -m gcp_variant_transforms.vcf_to_bq \
  ... \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region europe-west1 \
  --temp_location "${TEMP_LOCATION}"
```
You can choose your GCS bucket's region when you are creating it: a bucket's name, geographic location, and project are defined permanently at creation time. For an existing bucket, you can check its information to find out its geographic location.
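For example, a bucket pinned to europe-west4 could be created as follows (the bucket name is a placeholder):

```bash
# Create a regional bucket in europe-west4 (placeholder bucket name).
gsutil mb -p "${GOOGLE_CLOUD_PROJECT}" -l europe-west4 gs://my-genomics-bucket
```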
You can choose the region for the BigQuery dataset at dataset creation time.
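For instance, an output dataset colocated in europe-west4 might be created like this (the dataset name is a placeholder):

```bash
# Create a BigQuery dataset in europe-west4 (placeholder dataset name).
bq --location=europe-west4 mk --dataset "${GOOGLE_CLOUD_PROJECT}:my_genomics_dataset"
```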
Variant Transforms supports specifying a subnetwork to use with the `--subnetwork` flag. This can be used to start the processing VMs in a specific network of your Google Cloud project as opposed to the default network.
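To see which subnetworks exist in your target region, you can list them; the region below matches the earlier example:

```bash
# List the subnetworks defined in europe-west4.
gcloud compute networks subnets list --regions=europe-west4
```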
Variant Transforms allows disabling the use of external IP addresses with the `--use_public_ips` flag. If not specified, this defaults to true, so to restrict the use of external IP addresses, use `--use_public_ips false`. Note that without external IP addresses, VMs can only send packets to other internal IP addresses. To allow these VMs to connect to the external IP addresses used by Google APIs and services, you can enable Private Google Access on the subnet.
For example, to run Variant Transforms in a VPC you already created called `custom-network-eu-west` with no public IP addresses, you can add these flags to the example above as follows:
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...
docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
--region europe-west4 \
--location europe-west4 \
--temp_location "${TEMP_LOCATION}" \
--subnetwork custom-network-eu-west \
--use_public_ips false \
"${COMMAND}"