On Call Runbooks
Gabriel Zurita edited this page Jan 15, 2025 · 16 revisions
This guide outlines the discrete steps required for on-call duties, including deploying new versions of our software.
Common Causes:
- Vulnerabilities detected by Aqua or Snyk
- Dependency issues
Resolution Steps:
- Identify the failing gate check.
- Consult the SecRel Getting Started on VRO guide.
- Update dependencies or apply suppressions as necessary.
Symptoms:
- Unhealthy applications in ArgoCD
Actions:
- Diagnose and fix issues if possible.
- Delay deployment if necessary.
- On the first Tuesday of each new sprint, check #benefits-vro-on-call for a message from the Partner Team Production Deployment Slack Workflow (currently triggered manually, to be automated).
- Verify there is a recent eligible build:
  - In the `abd-vro-internal` repository on GitHub, under the `Actions` tab, check the latest successful `(Internal) SecRel workflow` run.
  - If SecRel has not passed for several days, delay deployment and investigate the cause. Click "Not Ready" to halt the workflow if issues can’t be quickly fixed; otherwise, proceed.
  - In cases of an emergency deploy, ignoring vulnerabilities in favor of getting a fix out is appropriate (with the intent of getting back to the vulnerabilities after the fix is out).
  - Keep this tab open and refer to the image tag in the GHCR Summary section in future steps.
    - The image tag will have the same sha as the `abd-vro` commit being deployed. See screenshots:
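The build check above can be sketched in Python. This is only an illustration, not the team's tooling: it assumes the runs were already fetched from the GitHub REST API "list workflow runs" endpoint and parsed into dicts with `name`, `conclusion`, `created_at`, and `head_sha` fields. The three-day staleness limit stands in for "several days", and the rule that an image tag is a sha prefix of at least seven characters is a guess at the tag format.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts):
    """Parse a GitHub-style UTC timestamp such as 2025-01-15T18:04:05Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def latest_successful_run(runs, workflow_name):
    """Return the newest successful run of the named workflow, or None."""
    candidates = [
        run for run in runs
        if run.get("name") == workflow_name and run.get("conclusion") == "success"
    ]
    return max(candidates, key=lambda run: parse_ts(run["created_at"]), default=None)


def is_stale(run, now, max_age=timedelta(days=3)):
    """True when there is no successful run, or it is older than max_age.

    The 3-day default is an assumption standing in for "several days".
    """
    return run is None or now - parse_ts(run["created_at"]) > max_age


def tag_matches_commit(image_tag, commit_sha):
    """Assumption: the image tag is a sha prefix (7+ characters) of the commit."""
    return len(image_tag) >= 7 and commit_sha.startswith(image_tag)
```

If `is_stale(...)` returns true, that corresponds to the "Not Ready" path; `tag_matches_commit` is a quick sanity check on the tag copied from the GHCR Summary section.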
- If a `Sign Images` build is ready (example here), click "Ready" to proceed. The SecRel scan must be green (no failures) to sign the image. It is impossible to deploy to Sandbox or Prod environments without a signed image.
- An automated message will ask partner teams in #benefits-vro-support to opt in or opt out (the workflow will send a message to the #benefits-vro-on-call channel).
- If a team opts out, exclude their applications from the deployment.
- Begin deployment to lower environments by EOD Tuesday or Wednesday morning:
  - In the #benefits-vro-on-call channel workflow, click `Deploy: lower env` in each partner team’s Slack thread.
  - A GitHub ticket for the deployment will be created; add the `VRO-team` and `deployments` labels, link it to the current sprint and the `Partner team request` epic, and assign both on-call engineers.
- Build the release:
  - Create a branch in `va-abd-rrd-argocd-applications-vault` with the format `releases/sprint-*`, for example `releases/sprint-5`.
  - For each app (excluding partner teams that opted out) under the `deploy` directory set for deployment, update the `imageTag` field in `dev.yaml`, `qa.yaml`, and `sandbox.yaml` configuration files to the latest successful SecRel run image tag. See example PR here.
  - Push changes and get secondary approval.
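The `imageTag` update in the release-build step is mechanical, so a small script can reduce copy-paste mistakes. The sketch below is hypothetical and not part of the repository: it assumes a layout of `deploy/<app>/<env>.yaml`, that each file sets the tag on a line of the form `imageTag: <value>`, and that opted-out teams can be identified by app directory name. It writes the new tag in double quotes so an all-digit sha is not read as a number, and it drops any trailing comment on that line.

```python
import re
from pathlib import Path

# Lower-environment files named in the runbook step above.
ENV_FILES = ("dev.yaml", "qa.yaml", "sandbox.yaml")

# Matches a whole `imageTag: ...` line, capturing the key and its indentation.
IMAGE_TAG_LINE = re.compile(r"^([ \t]*imageTag:[ \t]*).*$", re.MULTILINE)


def set_image_tag(yaml_text, new_tag):
    """Replace the value on every imageTag line; return (new_text, count)."""
    return IMAGE_TAG_LINE.subn(lambda m: f'{m.group(1)}"{new_tag}"', yaml_text)


def update_release(deploy_dir, new_tag, opted_out=()):
    """Rewrite imageTag in each app's lower-environment files.

    Assumes (hypothetically) one sub-directory per app under deploy_dir.
    Directories named in opted_out are skipped. Returns the changed paths.
    """
    changed = []
    for app_dir in sorted(Path(deploy_dir).iterdir()):
        if not app_dir.is_dir() or app_dir.name in opted_out:
            continue
        for env_file in ENV_FILES:
            path = app_dir / env_file
            if not path.is_file():
                continue
            new_text, count = set_image_tag(path.read_text(), new_tag)
            if count:
                path.write_text(new_text)
                changed.append(path)
    return changed
```

Run it from a `releases/sprint-*` branch and review the resulting diff before pushing; the secondary approval on the PR is still required.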
- Deploy changes to lower environments:
  - Merge the PR, initiate sync, and monitor each environment:
    - Monitor On-Call Alerts: Check the on-call alerts channels for any issues.
    - Verify Sync in ArgoCD for Dev Environment:
      - In ArgoCD (namespace: `va-abd-rrd-dev`), confirm that the `last sync` timestamp for each pod matches the deployment time of all deployed applications.
      - If an application does not sync automatically, manually initiate the "Sync" action.
    - Sync QA and Sandbox Environments Manually:
      - QA and Sandbox require manual sync. Repeat the following steps for QA (`va-abd-rrd-qa`) and Sandbox (`va-abd-rrd-sandbox`):
        - Select the appropriate namespace (e.g., `va-abd-rrd-qa`).
        - Click `Sync Apps`.
        - Choose `ALL`.
        - Click `Sync`.
    - Troubleshoot as Needed: Diagnose and resolve any issues that arise during deployment.
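The sync verification above can also be done against exported application state rather than by eye. The following is a minimal sketch, assuming each application is available as a parsed Argo CD Application object (for example from the CLI's JSON output) with the standard `status.sync.status`, `status.health.status`, and `status.operationState.finishedAt` fields. Reading "matches the deployment time" as "the last sync finished at or after the PR merge time" is an interpretation, not something the runbook states.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse a UTC timestamp such as 2025-01-15T18:04:05Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def problems(app, deployed_at):
    """List the reasons an application needs a manual look; empty if none."""
    status = app.get("status", {})
    reasons = []
    if status.get("sync", {}).get("status") != "Synced":
        reasons.append("not synced")
    if status.get("health", {}).get("status") != "Healthy":
        reasons.append("unhealthy")
    finished = status.get("operationState", {}).get("finishedAt")
    # Assumption: a sync that finished before the merge did not pick up the new tag.
    if finished is None or parse_ts(finished) < deployed_at:
        reasons.append("last sync predates deployment")
    return reasons


def apps_needing_attention(apps, deployed_at):
    """Map application name to its list of problems, omitting clean apps."""
    report = {}
    for app in apps:
        reasons = problems(app, deployed_at)
        if reasons:
            report[app["metadata"]["name"]] = reasons
    return report
```

An empty result for a namespace suggests every app picked up the deployment; any entry is a candidate for a manual "Sync" or for troubleshooting.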
- Validate with Partner Teams:
  - Notify partner teams via Slack to validate up to `sandbox`.
  - Partner teams must validate their applications’ health.
  - If any application is unhealthy, coordinate with the partner team to determine whether to patch or defer.
  - If a partner team opts out, have them click `Opt-Out` in #benefits-vro-support and revert the image tag change in the repository.
- Start production deployment:
  - Click `Deploy: production` in Slack for each partner team's opt-in Slack thread.
  - In `va-abd-rrd-argocd-applications-vault`, make a PR updating the `imageTag` fields in `prod-test.yaml` and `prod.yaml` for production and get secondary approval.
- Deploy to production:
  - Merge the PR.
  - In ArgoCD, manually sync and monitor each app instance (NOTE: `va-abd-rrd-prod*` apps are down due to shutdown; check individual apps that need to be deployed).
    - If a platform app is unhealthy, attempt to diagnose any issues before deciding to roll back; if a rollback is needed, follow the rollback steps.
- Complete/Validate production deployment:
  - Click `Validate` in the #benefits-vro-on-call Slack thread to confirm app health with partner teams.
  - If rollback is needed for any app, follow the rollback steps.
  - Once partner teams have validated their apps are working, they will sign off on their deployment using the workflow and an automatic confirmation message will be sent to the thread in #benefits-vro-on-call.
- After all validations, close the GitHub deployment tickets.
- #TODO
- #TODO