Set up the Autoscaler using Terraform configuration files
- Table of Contents
- Overview
- Options for GKE deployment
- Architecture
- Before you begin
- Preparing the Autoscaler Project
- Creating Autoscaler infrastructure
- Importing your Spanner instances
- Building the Autoscaler
- Deploying the Autoscaler
- Metrics in GKE deployment
- Troubleshooting
This directory contains Terraform configuration files to quickly set up the infrastructure for your Autoscaler for a deployment to Google Kubernetes Engine (GKE).
This deployment is ideal for independent teams who want to self-manage the infrastructure and configuration of their own Autoscalers on Kubernetes.
The GKE deployment has the following pros and cons:
Pros:
- Kubernetes-based: For teams that may not be able to use Google Cloud services such as Cloud Run functions, this design enables the use of the Autoscaler.
- Configuration: The control over scheduler parameters belongs to the team that owns the Spanner instance, so the team has the highest degree of freedom to adapt the Autoscaler to its needs.
- Infrastructure: This design establishes a clear boundary of responsibility and security over the Autoscaler infrastructure because the team that owns the Spanner instances also owns the Autoscaler infrastructure.
Cons:
- Infrastructure: In contrast to the Cloud Run functions design, some long-lived infrastructure and services are required.
- Maintenance: With each team being responsible for the Autoscaler configuration and infrastructure, it may become difficult to make sure that all Autoscalers across the company follow the same update guidelines.
- Audit: Because of the high level of control by each team, a centralized audit may become more complex.
For deployment to GKE there are two options to choose from:
-
Deployment of decoupled Poller and Scaler components, running in separate pods.
-
Deployment of a unified Autoscaler, with Poller and Scaler components combined.
The decoupled deployment model has the advantage that Poller and Scaler components can be assigned individual permissions (i.e. run as separate service accounts), and the two components can be managed and scaled as required to suit your needs. However, this deployment model relies on the Scaler component being deployed as a long-running service, which consumes resources.
In contrast, the unified deployment model has the advantage that the Poller and Scaler components can be deployed as a single pod, which runs as a Kubernetes cron job. This means there are no long-running components. In addition, with the Poller and Scaler components combined, only a single service account is required.
For most use cases, the unified deployment model is recommended.
In the decoupled deployment model, the Autoscaler works as follows:
-
Using a Kubernetes ConfigMap you define which Spanner instances you would like to be managed by the Autoscaler.
-
Using a Kubernetes CronJob, the Autoscaler is configured to run on a schedule. By default this is every two minutes, though this is configurable.
-
When scheduled, an instance of the Poller is created as a Kubernetes Job.
-
The Poller queries the Cloud Monitoring API to retrieve the utilization metrics for each Spanner instance.
-
For each Spanner instance, the Poller makes a call to the Scaler via its API. The request payload contains the utilization metrics for the specific Spanner instance, and some of its corresponding configuration parameters.
-
Using the chosen scaling method, the Scaler compares the Spanner instance metrics against the recommended thresholds, plus or minus an allowed margin, and determines whether the instance should be scaled and the number of nodes or processing units that it should be scaled to.
-
The Scaler retrieves the time when the instance was last scaled from the state data stored in Cloud Firestore (or alternatively Spanner) and compares it with the current time.
-
If the configured cooldown period has passed, then the Scaler requests the Spanner Instance to scale out or in.
-
Both Poller and Scaler publish counters to an OpenTelemetry Collector, also running in Kubernetes, which is configured to forward these counters to Google Cloud Monitoring. See the Metrics in GKE deployment section.
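The threshold comparison in the steps above can be illustrated with a minimal shell sketch. This is not the Scaler's actual implementation: the function name `scaling_direction` and the numeric values are hypothetical, and the sketch only classifies a reading as above, below, or within the band of threshold plus or minus margin. Computing the new size depends on the chosen scaling method and is not shown.

```shell
# Hypothetical sketch: classify a utilization reading against
# threshold +/- margin. All names and values are illustrative only.
scaling_direction() {
  value="$1"; threshold="$2"; margin="$3"
  if [ "$value" -gt $(( threshold + margin )) ]; then
    echo "OUT"    # above the upper bound: scale out
  elif [ "$value" -lt $(( threshold - margin )) ]; then
    echo "IN"     # below the lower bound: scale in
  else
    echo "NONE"   # within the allowed band: no change
  fi
}

scaling_direction 80 65 5
scaling_direction 63 65 5
scaling_direction 40 65 5
```

With a threshold of 65 and a margin of 5, readings between 60 and 70 inclusive lead to no scaling decision in this sketch.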
In the unified deployment model, the Autoscaler works as follows:
-
Using a Kubernetes ConfigMap you define which Spanner instances you would like to be managed by the Autoscaler.
-
Using a Kubernetes CronJob, the Autoscaler is configured to run on a schedule. By default this is every two minutes, though this is configurable.
-
When scheduled, an instance of the unified Poller and Scaler components (henceforth "Autoscaler") is created as a Kubernetes Job.
-
The Autoscaler queries the Cloud Monitoring API to retrieve the utilization metrics for each Spanner instance.
-
For each Spanner instance, the Autoscaler makes an internal call with a payload that contains the utilization metrics for the specific Spanner instance, and some of its corresponding configuration parameters.
-
Using the chosen scaling method, the Autoscaler compares the Spanner instance metrics against the recommended thresholds, plus or minus an allowed margin, and determines whether the instance should be scaled and the number of nodes or processing units that it should be scaled to.
-
The Autoscaler retrieves the time when the instance was last scaled from the state data stored in Cloud Firestore (or alternatively Spanner) and compares it with the current time.
-
If the configured cooldown period has passed, then the Autoscaler requests the Spanner Instance to scale out or in.
-
The Autoscaler publishes counters to an OpenTelemetry Collector, also running in Kubernetes, which is configured to forward these counters to Google Cloud Monitoring. See the Metrics in GKE deployment section.
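The cooldown check described in the steps above amounts to comparing the time since the last scaling operation with the configured cooldown period. The following is a minimal sketch under stated assumptions: the function name `cooldown_elapsed` is hypothetical, timestamps are Unix epoch seconds, and the cooldown is given in minutes. The real Autoscaler reads the last scaling time from its state store, which is not modelled here.

```shell
# Hypothetical sketch: succeed (exit status 0) when at least
# cooldown_minutes have passed since the last scaling operation.
cooldown_elapsed() {
  last_scaled_epoch="$1"; now_epoch="$2"; cooldown_minutes="$3"
  [ $(( now_epoch - last_scaled_epoch )) -ge $(( cooldown_minutes * 60 )) ]
}

now="$(date +%s)"
last_scaled=$(( now - 600 ))   # pretend the instance was scaled 10 minutes ago

if cooldown_elapsed "$last_scaled" "$now" 5; then
  echo "cooldown passed: a scaling request may proceed"
else
  echo "still cooling down: skip scaling"
fi
```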
In this section you prepare your environment.
-
Open the Cloud Console
-
Activate Cloud Shell
At the bottom of the Cloud Console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.
-
In Cloud Shell, clone this repository:
git clone https://github.com/cloudspannerecosystem/autoscaler.git
-
Export a variable for the Autoscaler working directory:
cd autoscaler && export AUTOSCALER_ROOT="$(pwd)"
-
Export a variable to indicate your chosen deployment model:
For the decoupled deployment model:
export AUTOSCALER_DEPLOYMENT_MODEL=decoupled
Alternatively, for the unified deployment model:
export AUTOSCALER_DEPLOYMENT_MODEL=unified
-
Export a variable for the root of the deployment:
export AUTOSCALER_DIR="${AUTOSCALER_ROOT}/terraform/gke/${AUTOSCALER_DEPLOYMENT_MODEL}"
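Because `AUTOSCALER_DIR` is derived from `AUTOSCALER_DEPLOYMENT_MODEL`, a typo in the model name produces a path that does not exist. The sketch below shows one way to guard against that. The helper name `autoscaler_dir_for` and the example root path are hypothetical and not part of the repository.

```shell
# Hypothetical helper: print the Terraform directory for a deployment model,
# or fail if the model is not one of the two documented values.
autoscaler_dir_for() {
  root="$1"; model="$2"
  case "$model" in
    decoupled|unified)
      printf '%s/terraform/gke/%s\n' "$root" "$model"
      ;;
    *)
      echo "unknown deployment model: '$model' (expected decoupled or unified)" >&2
      return 1
      ;;
  esac
}

# Example with an illustrative root path.
autoscaler_dir_for "/home/user/autoscaler" "unified"
```

In Cloud Shell the same check could be applied to the exported variables, for example `export AUTOSCALER_DIR="$(autoscaler_dir_for "${AUTOSCALER_ROOT}" "${AUTOSCALER_DEPLOYMENT_MODEL}")"`.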
In this section you prepare your project for deployment.
-
Go to the project selector page in the Cloud Console. Select or create a Cloud project.
-
Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.
-
In Cloud Shell, configure the environment with the ID of your autoscaler project:
export PROJECT_ID=<INSERT_YOUR_PROJECT_ID>
gcloud config set project ${PROJECT_ID}
-
Set the region where the Autoscaler resources will be created:
export REGION=us-central1
-
Enable the required Cloud APIs:
gcloud services enable iam.googleapis.com \
  artifactregistry.googleapis.com \
  cloudbuild.googleapis.com \
  cloudresourcemanager.googleapis.com \
  container.googleapis.com \
  logging.googleapis.com \
  monitoring.googleapis.com \
  spanner.googleapis.com
-
If you want to create a new Spanner instance for testing the Autoscaler, set the following variable. The Spanner instance that Terraform creates is named autoscale-test.
export TF_VAR_terraform_spanner_test=true
On the other hand, if you do not want to create a new Spanner instance because you already have an instance for the Autoscaler to monitor, set the name of your instance in the following variable:
export TF_VAR_spanner_name=<INSERT_YOUR_SPANNER_INSTANCE_NAME>
For more information on how to configure your Spanner instance to be managed by Terraform, see Importing your Spanner instances.
-
There are two options for deploying the state store for the Autoscaler:
For Firestore, follow the steps in Using Firestore for Autoscaler State. For Spanner, follow the steps in Using Spanner for Autoscaler state.
-
To use Firestore for the Autoscaler state, choose the App Engine Location where the Autoscaler infrastructure will be created, for example:
export APP_ENGINE_LOCATION=us-central
-
Enable the additional APIs:
gcloud services enable \
  appengine.googleapis.com \
  firestore.googleapis.com
-
Create a Google App Engine app to enable the API for Firestore:
gcloud app create --region="${APP_ENGINE_LOCATION}"
-
To store the state of the Autoscaler, update the database created with the Google App Engine app to use Firestore native mode.
gcloud firestore databases update --type=firestore-native
You will also need to make a minor modification to the Autoscaler configuration. The required steps to do this are later in these instructions.
-
Next, continue to Creating Autoscaler infrastructure.
-
If you want to store the state in Cloud Spanner and you don't yet have a Spanner instance for that purpose, set the following variable so that Terraform creates an instance for you named autoscale-test-state:
export TF_VAR_terraform_spanner_state=true
It is a best practice not to store the Autoscaler state in the same instance that is being monitored by the Autoscaler.
Optionally, you can change the name of the instance that Terraform will create:
export TF_VAR_spanner_state_name=<INSERT_STATE_SPANNER_INSTANCE_NAME>
If you already have a Spanner instance where state must be stored, only set the name of your instance:
export TF_VAR_spanner_state_name=<INSERT_YOUR_STATE_SPANNER_INSTANCE_NAME>
If you want to manage the state of the Autoscaler in your own Cloud Spanner instance, please create the following table in advance:
CREATE TABLE spannerAutoscaler (
  id STRING(MAX),
  lastScalingTimestamp TIMESTAMP,
  createdOn TIMESTAMP,
  updatedOn TIMESTAMP,
  lastScalingCompleteTimestamp TIMESTAMP,
  scalingOperationId STRING(MAX),
  scalingRequestedSize INT64,
  scalingMethod STRING(MAX),
  scalingPreviousSize INT64,
) PRIMARY KEY (id)
Note: If you are upgrading from v1.x, then you need to add the 5 new columns to the Spanner schema using the following DDL statements:
ALTER TABLE spannerAutoscaler ADD COLUMN IF NOT EXISTS lastScalingCompleteTimestamp TIMESTAMP;
ALTER TABLE spannerAutoscaler ADD COLUMN IF NOT EXISTS scalingOperationId STRING(MAX);
ALTER TABLE spannerAutoscaler ADD COLUMN IF NOT EXISTS scalingRequestedSize INT64;
ALTER TABLE spannerAutoscaler ADD COLUMN IF NOT EXISTS scalingMethod STRING(MAX);
ALTER TABLE spannerAutoscaler ADD COLUMN IF NOT EXISTS scalingPreviousSize INT64;
Note: If you are upgrading from v2.0.x, then you need to add the 3 new columns to the Spanner schema using the following DDL statements:
ALTER TABLE spannerAutoscaler ADD COLUMN IF NOT EXISTS scalingRequestedSize INT64;
ALTER TABLE spannerAutoscaler ADD COLUMN IF NOT EXISTS scalingMethod STRING(MAX);
ALTER TABLE spannerAutoscaler ADD COLUMN IF NOT EXISTS scalingPreviousSize INT64;
-
Next, continue to Creating Autoscaler infrastructure.
In this section you deploy the Autoscaler infrastructure.
-
Set the project ID and region in the corresponding Terraform environment variables:
export TF_VAR_project_id=${PROJECT_ID}
export TF_VAR_region=${REGION}
-
Change directory into the Terraform per-project directory and initialize it:
cd ${AUTOSCALER_DIR}
terraform init
-
Create the Autoscaler infrastructure:
terraform plan -out=terraform.tfplan
terraform apply -auto-approve terraform.tfplan
If you are running this command in Cloud Shell and encounter errors of the form "Error: cannot assign requested address", this is a known issue in the Terraform Google provider. Retry with -parallelism=1.
If you have existing Spanner instances that you want to import to be managed by Terraform, follow the instructions in this section.
-
List your Spanner instances:
gcloud spanner instances list --format="table(name)"
-
Set the following variable to the name of the instance that you want to import, taken from the output of the above command:
SPANNER_INSTANCE_NAME=<YOUR_SPANNER_INSTANCE_NAME>
-
Create a Terraform config file with an empty google_spanner_instance resource:
echo "resource \"google_spanner_instance\" \"${SPANNER_INSTANCE_NAME}\" {}" > "${SPANNER_INSTANCE_NAME}.tf"
-
Import the Spanner instance into the Terraform state.
terraform import "google_spanner_instance.${SPANNER_INSTANCE_NAME}" "${SPANNER_INSTANCE_NAME}"
-
After the import succeeds, update the Terraform config file for your instance with the actual instance attributes:
terraform state show -no-color "google_spanner_instance.${SPANNER_INSTANCE_NAME}" \
  | grep -vE "(id|num_nodes|state|timeouts).*(=|\{)" \
  > "${SPANNER_INSTANCE_NAME}.tf"
If you have additional Spanner instances to import, repeat this process.
Importing Spanner databases is also possible using the google_spanner_database resource and following a similar process.
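To see what the `grep -vE` filter in the import step keeps and removes, the sketch below applies the same pattern to a made-up sample of `terraform state show` output. The attribute values and the helper name `sample_state` are hypothetical; real output contains more attributes.

```shell
# Hypothetical output of `terraform state show`; all values are made up.
sample_state() {
  cat <<'EOF'
# google_spanner_instance.my-instance:
resource "google_spanner_instance" "my-instance" {
    config       = "regional-us-central1"
    display_name = "my-instance"
    id           = "my-project/my-instance"
    name         = "my-instance"
    num_nodes    = 1
    project      = "my-project"
    state        = "READY"
}
EOF
}

# Same filter as in the import step: drop lines where id, num_nodes, state,
# or timeouts is followed by "=" or "{", and keep everything else.
sample_state | grep -vE "(id|num_nodes|state|timeouts).*(=|\{)"
```

Because the pattern matches substrings, any line that contains `id` or `state` anywhere before an `=` or `{` is removed as well, so it is worth reviewing the generated file before running `terraform plan`.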
-
Change to the directory that contains the Autoscaler source code:
cd ${AUTOSCALER_ROOT}
-
Build the Autoscaler components by following the instructions in the appropriate section:
- [Building the Autoscaler for a unified deployment model](#building-the-autoscaler-for-a-unified-deployment-model)
- [Building the Autoscaler for a decoupled deployment model](#building-the-autoscaler-for-a-decoupled-deployment-model)
For the unified deployment model, run the following commands to build the Autoscaler and push the image to Artifact Registry:
-
Build the Autoscaler:
gcloud beta builds submit . --config=cloudbuild-unified.yaml --region=${REGION} --service-account="projects/${PROJECT_ID}/serviceAccounts/build-sa@${PROJECT_ID}.iam.gserviceaccount.com"
-
Construct the path to the image:
SCALER_PATH="${REGION}-docker.pkg.dev/${PROJECT_ID}/spanner-autoscaler/scaler"
-
Retrieve the SHA256 hash of the image:
SCALER_SHA=$(gcloud artifacts docker images describe ${SCALER_PATH}:latest --format='value(image_summary.digest)')
-
Construct the full path to the image, including the SHA256 hash:
SCALER_IMAGE="${SCALER_PATH}@${SCALER_SHA}"
Next, follow the instructions in the Deploying the Autoscaler section.
For the decoupled deployment model, run the following commands to build the Autoscaler components and push the images to Artifact Registry:
-
Build the Autoscaler components:
gcloud beta builds submit . --config=cloudbuild-poller.yaml --region=${REGION} --service-account="projects/${PROJECT_ID}/serviceAccounts/build-sa@${PROJECT_ID}.iam.gserviceaccount.com" && \
gcloud beta builds submit . --config=cloudbuild-scaler.yaml --region=${REGION} --service-account="projects/${PROJECT_ID}/serviceAccounts/build-sa@${PROJECT_ID}.iam.gserviceaccount.com"
-
Construct the paths to the images:
POLLER_PATH="${REGION}-docker.pkg.dev/${PROJECT_ID}/spanner-autoscaler/poller"
SCALER_PATH="${REGION}-docker.pkg.dev/${PROJECT_ID}/spanner-autoscaler/scaler"
-
Retrieve the SHA256 hashes of the images:
POLLER_SHA=$(gcloud artifacts docker images describe ${POLLER_PATH}:latest --format='value(image_summary.digest)')
SCALER_SHA=$(gcloud artifacts docker images describe ${SCALER_PATH}:latest --format='value(image_summary.digest)')
-
Construct the full paths to the images, including the SHA256 hashes:
POLLER_IMAGE="${POLLER_PATH}@${POLLER_SHA}"
SCALER_IMAGE="${SCALER_PATH}@${SCALER_SHA}"
Next, follow the instructions in the Deploying the Autoscaler section.
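The build steps pin each image by digest rather than by the `latest` tag, so the deployed workload always refers to the exact image that was built. If the digest lookup returns nothing, the resulting reference ends in a bare `@` and is unusable. The following sketch shows one way to catch that early. The helper name `image_ref` is hypothetical, and the project name and digest are placeholders rather than real values.

```shell
# Hypothetical helper: join an image path and digest, rejecting anything that
# does not look like a sha256 digest (for example an empty lookup result).
image_ref() {
  path="$1"; digest="$2"
  case "$digest" in
    sha256:?*) printf '%s@%s\n' "$path" "$digest" ;;
    *) echo "unexpected digest: '$digest'" >&2; return 1 ;;
  esac
}

# Illustrative values only; the digest below is a placeholder.
image_ref "us-central1-docker.pkg.dev/my-project/spanner-autoscaler/scaler" \
  "sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
```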
-
Retrieve the credentials for the cluster where the Autoscaler will be deployed:
gcloud container clusters get-credentials spanner-autoscaler --region=${REGION}
-
Prepare the Autoscaler configuration files by running the following command:
cd ${AUTOSCALER_ROOT}/kubernetes/${AUTOSCALER_DEPLOYMENT_MODEL} && \
for template in $(ls autoscaler-config/*.template) ; do
  envsubst < ${template} > ${template%.*}
done
-
Deploy the otel-collector service so that it is ready to collect metrics:
cd ${AUTOSCALER_ROOT}/kubernetes/${AUTOSCALER_DEPLOYMENT_MODEL} && \
kubectl apply -f autoscaler-config/otel-collector.yaml && \
kubectl apply -f autoscaler-pkg/networkpolicy.yaml && \
kubectl apply -f autoscaler-pkg/otel-collector/otel-collector.yaml
-
Next, configure the Kubernetes manifests and deploy the Autoscaler to the cluster using the following commands:
cd ${AUTOSCALER_ROOT}/kubernetes/${AUTOSCALER_DEPLOYMENT_MODEL} && \
kpt fn eval --image gcr.io/kpt-fn/apply-setters:v0.1.1 autoscaler-pkg -- \
  poller_image=${POLLER_IMAGE} scaler_image=${SCALER_IMAGE} && \
kubectl apply -f autoscaler-pkg/ --recursive
-
Next, to see how the Autoscaler is configured, run the following command to output the example configuration:
cat autoscaler-config/autoscaler-config*.yaml
These two files configure each instance of the Autoscaler that you scheduled in the previous step. Notice the environment variable AUTOSCALER_CONFIG. You can use this variable to reference a configuration that will be used by that individual instance of the Autoscaler. This means that you can configure multiple scaling schedules across multiple Spanner instances.
If you do not supply this value, a default of autoscaler-config.yaml will be used.
You can autoscale multiple Spanner instances on a single schedule by including multiple YAML stanzas in any of the scheduled configurations. For the schema of the configuration, see the Poller configuration section.
The sample configuration creates two schedules to demonstrate autoscaling: a frequently running schedule that dynamically scales the Spanner instance according to utilization, and an hourly schedule that directly scales the Spanner instance. When you configure the Autoscaler for production, you can adjust these schedules to fit your needs.
-
If you have chosen to use Firestore to hold the Autoscaler state as described above, edit the above files, and remove the following lines:
stateDatabase:
  name: spanner
  instanceId: autoscale-test
  databaseId: spanner-autoscaler-state
Note: If you do not remove these lines, the Autoscaler will attempt to use the above non-existent Spanner database for its state store, which will result in the Poller component failing to start. Please see the Troubleshooting section for more details.
If you have chosen to use your own Spanner instance, please edit the above configuration files accordingly.
-
To configure the Autoscaler and begin scaling operations, run the following command:
kubectl apply -f autoscaler-config/
-
Any changes made to the configuration files and applied with kubectl apply will update the Autoscaler configuration.
-
You can view logs for the Autoscaler components via kubectl or the Cloud Logging interface in the Google Cloud console.
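The template step above relies on the shell parameter expansion `${template%.*}`, which removes the shortest suffix that starts with a dot. A file ending in `.yaml.template` is therefore rendered to a file ending in `.yaml`. The sketch below only prints the mapping and does not call `envsubst`. The helper name `rendered_name` is hypothetical and the file names are illustrative.

```shell
# Hypothetical helper: print the output file name for a template path by
# stripping the final extension, as ${template%.*} does in the deploy step.
rendered_name() {
  printf '%s\n' "${1%.*}"
}

# Illustrative template names; check the repository for the actual files.
for template in autoscaler-config/autoscaler-config.yaml.template \
                autoscaler-config/otel-collector.yaml.template; do
  printf '%s -> %s\n' "$template" "$(rendered_name "$template")"
done
```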
Unlike in a Cloud Run functions deployment, in a GKE deployment, the counters generated by the poller and scaler components are forwarded to the OpenTelemetry Collector (otel-collector) service. This service is specified by the environment variable OTEL_COLLECTOR_URL passed to the poller and scaler workloads.
This collector is run as a service to receive metrics as gRPC messages on port 4317, then export them to Google Cloud Monitoring. This configuration is defined in a ConfigMap.
Metrics can be sent to other exporters by modifying the Collector ConfigMap.
A NetworkPolicy rule is also configured to allow traffic from the poller and scaler workloads (labelled with otel-submitter:true) to the otel-collector service.
If the environment variable OTEL_COLLECTOR_URL is not specified, the metrics will be sent directly to Google Cloud Monitoring.
To allow Google Cloud Monitoring to distinguish metrics from different instances of the poller and scaler, the Kubernetes Pod name is passed to the poller and scaler components via the environment variable K8S_POD_NAME. If this variable is not specified, and if the Pod name attribute is not appended to the metrics by configuring the Kubernetes Attributes Processor in the OpenTelemetry Collector, then Send TimeSeries errors will be reported when the Collector exports the metrics to Google Cloud Monitoring.
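The effect of the two environment variables described above can be summarized in a small pre-flight check. This is only a sketch of the documented behaviour, not code from the Autoscaler: the function name `describe_metrics_setup`, the collector URL, and the Pod name are hypothetical.

```shell
# Hypothetical pre-flight check mirroring the behaviour described above.
describe_metrics_setup() {
  if [ -n "${OTEL_COLLECTOR_URL:-}" ]; then
    echo "metrics -> OpenTelemetry Collector at ${OTEL_COLLECTOR_URL}"
  else
    echo "metrics -> Google Cloud Monitoring (direct)"
  fi
  if [ -z "${K8S_POD_NAME:-}" ]; then
    echo "warning: K8S_POD_NAME is not set" >&2
  fi
}

# Run in a subshell so the illustrative values do not leak into the session.
(
  OTEL_COLLECTOR_URL="http://otel-collector:4317"   # assumed URL
  K8S_POD_NAME="poller-abc123"                      # assumed Pod name
  describe_metrics_setup
)
```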
This section contains guidance on what to do if you encounter issues when following the instructions above.
- Check there are no Organizational Policy rules that may conflict with cluster creation.
-
The first step if you are encountering scaling issues is to check the logs for the Autoscaler in Cloud Logging. To retrieve the logs for the Poller and Scaler components, use the following query:
resource.type="k8s_container"
resource.labels.namespace_name="spanner-autoscaler"
resource.labels.container_name="poller" OR resource.labels.container_name="scaler"
If you do not see any log entries, check that you have selected the correct time period to display in the Cloud Logging console, and that the GKE cluster nodes have the correct permissions to write logs to the Cloud Logging API (roles/logging.logWriter).
-
If you have chosen to use Firestore for Autoscaler state and you see the following error in the logs:
Error: 5 NOT_FOUND: Database not found: projects/<YOUR_PROJECT>/instances/autoscale-test/databases/spanner-autoscaler-state
Edit the file ${AUTOSCALER_ROOT}/autoscaler-config/autoscaler-config.yaml and remove the following stanza:
stateDatabase:
  name: spanner
  instanceId: autoscale-test
  databaseId: spanner-autoscaler-state
-
Check the formatting of the YAML configuration file:
cat ${AUTOSCALER_ROOT}/autoscaler-config/autoscaler-config.yaml
-
Validate the contents of the YAML configuration file:
npm install
npm run validate-config-file -- ${AUTOSCALER_ROOT}/autoscaler-config/autoscaler-config.yaml
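Removing the stateDatabase stanza described above can also be scripted instead of edited by hand. The sketch below is one possible approach using awk and makes several assumptions: the function name `strip_state_database` is hypothetical, the stanza is indented with spaces, and the sample configuration is a made-up fragment rather than the repository's actual file. It writes to standard output and does not modify any file in place.

```shell
# Hypothetical filter: drop the "stateDatabase:" line and every following
# line that is indented more deeply than it. Reads stdin, writes stdout.
strip_state_database() {
  awk '
    skipping {
      match($0, /[^ ]/)                       # position of first non-space
      if (NF > 0 && RSTART - 1 > indent) next # still inside the stanza
      skipping = 0
    }
    /^ *stateDatabase: *$/ {
      match($0, /[^ ]/)
      indent = RSTART - 1                     # indentation of the stanza key
      skipping = 1
      next
    }
    { print }
  '
}

# Made-up configuration fragment for illustration.
strip_state_database <<'EOF'
- projectId: my-project
  instanceId: autoscale-test
  stateDatabase:
    name: spanner
    instanceId: autoscale-test
    databaseId: spanner-autoscaler-state
  scalingMethod: LINEAR
EOF
```

Reviewing the output before replacing the original file, and validating it with the validate-config-file command above, helps to catch any unintended change.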