
On Call Runbooks

Gabriel Zurita edited this page Jan 15, 2025 · 16 revisions

This guide outlines the discrete steps required for on-call duties, including deploying new versions of our software.


Common Issues and Troubleshooting

SecRel Failures

Common Causes:

  • Vulnerabilities detected by Aqua or Snyk
  • Dependency issues

Resolution Steps:

  1. Identify the failing gate check.
  2. Consult the SecRel Getting Started on VRO guide.
  3. Update dependencies or apply suppressions as necessary.

Deployment Failures

Symptoms:

  • Unhealthy applications in ArgoCD

Actions:

  1. Diagnose and fix issues if possible.
  2. Delay deployment if necessary.
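
As a rough aid for spotting the symptom above, the following sketch lists applications that are not both Synced and Healthy. It assumes you have a JSON array of Argo CD Application objects (for example, the kind of export `argocd app list -o json` produces); the app names and sample data are hypothetical, and the ArgoCD UI remains the source of truth.

```python
import json


def find_unhealthy(apps_json: str) -> list:
    """Return (name, sync status, health status) for apps needing attention."""
    problems = []
    for app in json.loads(apps_json):
        name = app["metadata"]["name"]
        status = app.get("status", {})
        # Missing fields are reported as "Unknown" rather than treated as healthy.
        sync = status.get("sync", {}).get("status", "Unknown")
        health = status.get("health", {}).get("status", "Unknown")
        if sync != "Synced" or health != "Healthy":
            problems.append((name, sync, health))
    return problems


# Hypothetical sample data; real app names will differ.
SAMPLE_EXPORT = json.dumps([
    {"metadata": {"name": "example-app-a"},
     "status": {"sync": {"status": "Synced"}, "health": {"status": "Healthy"}}},
    {"metadata": {"name": "example-app-b"},
     "status": {"sync": {"status": "OutOfSync"}, "health": {"status": "Degraded"}}},
])

for name, sync, health in find_unhealthy(SAMPLE_EXPORT):
    print(f"{name}: sync={sync}, health={health}")
```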

Deployment Process

Step 1: Prepare for Deployment

  1. On the first Tuesday of each new sprint, check #benefits-vro-on-call for a message from the Partner Team Production Deployment Slack Workflow (currently triggered manually, with automation planned).
  2. Verify there is a recent eligible build:
    • In the abd-vro-internal repository on GitHub, under the Actions tab, check the latest successful (Internal) SecRel workflow run.
    • If SecRel has not passed for several days, delay deployment and investigate the cause. Click "Not Ready" to halt the workflow if issues can’t be quickly fixed; otherwise, proceed.
    • In an emergency deploy, it is appropriate to ship the fix despite open vulnerabilities, with the intent of addressing the vulnerabilities once the fix is out.
    • Keep this tab open and refer to the image tag in the GHCR Summary section in future steps.
      • The image tag will have the same sha as the abd-vro commit being deployed.
  3. If a Sign Images build is ready (example here), click "Ready" to proceed. The SecRel scan must be green (no failures) to sign the image. Deployments to the Sandbox and Prod environments require a signed image.
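
The "recent eligible build" check in item 2 can be sketched as follows. This is a minimal illustration, not the team's tooling: it assumes a list of workflow-run records with `conclusion` and `created_at` fields in the shape returned by the GitHub Actions REST API, and it assumes "several days" means three days, which is a placeholder threshold. The sample runs are made up. The image tag itself should still be read from the GHCR Summary section of the run.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_github_time(value: str) -> datetime:
    """Parse a GitHub-style UTC timestamp such as 2025-01-14T12:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def latest_eligible_run(runs: list, now: datetime, max_age_days: int = 3) -> Optional[dict]:
    """Return the newest successful run, or None if it is missing or too old."""
    successful = [run for run in runs if run.get("conclusion") == "success"]
    if not successful:
        return None
    latest = max(successful, key=lambda run: parse_github_time(run["created_at"]))
    age = now - parse_github_time(latest["created_at"])
    # max_age_days is an assumed stand-in for "several days".
    return latest if age <= timedelta(days=max_age_days) else None


# Hypothetical sample data.
SAMPLE_RUNS = [
    {"id": 1, "conclusion": "success", "created_at": "2025-01-10T12:00:00Z"},
    {"id": 2, "conclusion": "failure", "created_at": "2025-01-13T12:00:00Z"},
    {"id": 3, "conclusion": "success", "created_at": "2025-01-13T09:00:00Z"},
]

now = datetime(2025, 1, 14, 12, 0, tzinfo=timezone.utc)
run = latest_eligible_run(SAMPLE_RUNS, now)
print(run["id"] if run else "No eligible build: investigate before proceeding")
```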

Step 2: Coordinate with Partner Teams

  1. An automated message will ask partner teams in #benefits-vro-support to opt in or opt out (the workflow will send a message to the #benefits-vro-on-call channel).
  2. If a team opts out, exclude their applications from the deployment.
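
To make the exclusion rule concrete, here is a small sketch that filters the deployment list by opt-out status. The team names, app names, and the team-to-app mapping are all hypothetical; the real mapping lives in the deployment repository and the Slack workflow threads.

```python
def apps_to_deploy(apps_by_team: dict, opted_out: set) -> list:
    """Return a sorted list of apps whose owning team has not opted out."""
    return sorted(
        app
        for team, apps in apps_by_team.items()
        if team not in opted_out
        for app in apps
    )


# Hypothetical mapping of partner teams to their applications.
APPS_BY_TEAM = {
    "team-a": ["app-one", "app-two"],
    "team-b": ["app-three"],
}

print(apps_to_deploy(APPS_BY_TEAM, opted_out={"team-b"}))
```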

Step 3: Deploy to Lower Environments

  1. Begin deployment to lower environments by EOD Tuesday or Wednesday morning:

    • In the #benefits-vro-on-call channel workflow, click Deploy: lower env in each partner team’s Slack thread.
    • A GitHub ticket for the deployment will be created; add the VRO-team and deployments labels, link it to the current sprint and Partner team request epic, and assign both on-call engineers.
  2. Build the release:

    • Create a branch in va-abd-rrd-argocd-applications-vault with the format releases/sprint-*, for example releases/sprint-5.
    • For each app under the deploy directory that is set for deployment (excluding apps of partner teams that opted out), update the imageTag field in the dev.yaml, qa.yaml, and sandbox.yaml configuration files to the image tag of the latest successful SecRel run. See example PR here.
    • Push changes and get secondary approval.
  3. Deploy changes to lower environments:

    • Merge the PR, initiate sync, and monitor each environment:

      • Monitor On-Call Alerts: Check the on-call alerts channels for any issues.
      • Verify Sync in ArgoCD for Dev Environment:
        • In ArgoCD (namespace: va-abd-rrd-dev), confirm that the last sync timestamp of each deployed application matches the deployment time.
        • If an application does not sync automatically, manually initiate the "Sync" action.
      • Sync QA and Sandbox Environments Manually:
        • QA and Sandbox require manual sync. Repeat the following steps for QA (va-abd-rrd-qa) and Sandbox (va-abd-rrd-sandbox):
          1. Select the appropriate namespace (e.g., va-abd-rrd-qa).
          2. Click Sync Apps.
          3. Choose ALL.
          4. Click Sync.
    • Troubleshoot as Needed: Diagnose and resolve any issues that arise during deployment.

  4. Validate with Partner Teams:

    • Notify partner teams via Slack to validate up to sandbox.
    • Partner teams must validate their applications’ health.
    • If any application is unhealthy, coordinate with the partner team to determine whether to patch or defer.
    • If a partner team opts out, have them click Opt-Out in #benefits-vro-support and revert the image tag change in the repository.
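
The imageTag update from "Build the release" can be sketched as a plain text substitution. This assumes the field appears in each environment file as a single `imageTag: <value>` line, which is a guess about the file layout rather than something this page confirms; the sample YAML and tag values are hypothetical. It replaces everything after the colon on that line, so a trailing comment on the same line would be dropped.

```python
import re

# Matches an "imageTag:" line and captures the key with its indentation.
IMAGE_TAG_LINE = re.compile(r"^([ \t]*imageTag:[ \t]*).*$", re.MULTILINE)


def set_image_tag(yaml_text: str, new_tag: str) -> str:
    """Return yaml_text with every imageTag value replaced by new_tag."""
    updated, count = IMAGE_TAG_LINE.subn(lambda m: m.group(1) + new_tag, yaml_text)
    if count == 0:
        # Fail loudly instead of silently leaving a file unchanged.
        raise ValueError("no imageTag field found")
    return updated


# Hypothetical file contents; real values files will differ.
SAMPLE_YAML = "image:\n  repository: example\n  imageTag: abc1234\nreplicas: 1\n"

print(set_image_tag(SAMPLE_YAML, "def5678"))
```

In practice the same function would be applied to the contents of dev.yaml, qa.yaml, and sandbox.yaml for each app being deployed, and the resulting diff reviewed in the PR before merging.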

Step 4: Production Deployment (Thursday Morning)

  1. Start production deployment:

    • Click Deploy: production in each partner team's opt-in Slack thread.
    • In va-abd-rrd-argocd-applications-vault, open a PR updating the imageTag fields in prod-test.yaml and prod.yaml, and get secondary approval.
  2. Deploy to production:

    • Merge the PR.
    • In ArgoCD, manually sync and monitor each app instance (NOTE: va-abd-rrd-prod* apps are down due to shutdown; check the individual apps that need to be deployed).
      • If a platform app is unhealthy, attempt to diagnose the issue before deciding to roll back. If a rollback is needed, follow the rollback steps.
  3. Complete/Validate production deployment:

    • Click Validate in the #benefits-vro-on-call Slack thread to confirm app health with partner teams.
    • If rollback is needed for any app, follow the rollback steps.
    • Once partner teams have validated that their apps are working, they sign off on their deployment using the workflow, and an automatic confirmation message is sent to the thread in #benefits-vro-on-call.

Step 5: Close Out Deployment

  1. After all validations, close the GitHub deployment tickets.

Dependabot

  1. #TODO
  2. #TODO