
Setting scalable and primary celery alarms #1067

Merged: 2 commits into main on Dec 6, 2023

Conversation

@ben851 (Contributor) commented Dec 6, 2023

Summary | Résumé

Updating celery alarms to reflect primary and scalable replicas
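
For reference, the new per-deployment "replicas unavailable" alarms in the plan below all follow the same shape. A minimal sketch of one of them, reconstructed from the staging plan output (the var.* names are assumptions, not the actual module variables):

    resource "aws_cloudwatch_metric_alarm" "celery-primary-replicas-unavailable" {
      count               = var.cloudwatch_enabled ? 1 : 0   # assumed toggle, implied by the [0] index in the plan
      alarm_name          = "celery-primary-replicas-unavailable"
      alarm_description   = "Celery Primary Replicas Unavailable"
      comparison_operator = "GreaterThanOrEqualToThreshold"
      evaluation_periods  = 2
      threshold           = 1
      treat_missing_data  = "notBreaching"
      alarm_actions       = [var.sns_alert_warning_arn]       # assumed name for the alert-warning SNS topic ARN

      metric_query {
        id          = "m1"
        return_data = true

        metric {
          metric_name = "kube_deployment_status_replicas_unavailable"
          namespace   = "ContainerInsights/Prometheus"
          period      = 300
          stat        = "Minimum"

          dimensions = {
            ClusterName = var.eks_cluster_name                # assumed variable
            namespace   = var.notify_k8s_namespace            # assumed variable
            deployment  = "celery-primary"
          }
        }
      }
    }

Per the plan, the scalable and email/SMS-send variants differ only in the deployment dimension, the alarm name/description, and the evaluation_periods value (2 for primary, 3 for scalable).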

Test instructions | Instructions pour tester la modification

Applied in dev for testing (a sketch of one way to break a deployment follows this list):

  • Break Celery primary deployment (set invalid node selector), verify alarm goes off
  • Break scalable celery deployment (set invalid node selector), verify alarm goes off
  • Break Celery Email Send primary deployment (set invalid node selector), verify alarm goes off
  • Break Celery Email Send scalable deployment (set invalid node selector), verify alarm goes off
  • Break Celery SMS Send primary deployment (set invalid node selector), verify alarm goes off
  • Break Celery SMS Send scalable deployment (set invalid node selector), verify alarm goes off
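
One way to apply an "invalid node selector" for these tests is a kubectl patch against the deployment under test. This is a sketch only, assuming the staging namespace and deployment names shown in the plan below; the label key is deliberately one that no node carries:

    # New pods cannot schedule anywhere, so the deployment reports unavailable
    # replicas and the corresponding alarm should fire once the evaluation
    # periods elapse.
    kubectl -n notification-canada-ca patch deployment celery-primary \
      --type merge \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"intentionally-invalid":"true"}}}}}'

Repeating this for each deployment above (celery-scalable, celery-email-send-primary, celery-email-send-scalable, celery-sms-send-primary, celery-sms-send-scalable) exercises each alarm; reverting the patch or re-applying the normal manifest restores scheduling afterwards.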

@ben851 requested review from jimleroyer and sastels December 6, 2023 13:59

github-actions bot commented Dec 6, 2023

Updating alarms ⏰? Great! Please update the Google Sheet and add a 👍 to this message after 🙏

1 similar comment

@sastels (Contributor) commented Dec 6, 2023

We only have high CPU / memory warning alarms for celery-primary/scalable, not for the email or SMS pods - is that intentional?

@ben851 (Contributor, Author) commented Dec 6, 2023

> We only have high CPU / memory warning alarms for celery-primary/scalable, not for the email or SMS pods - is that intentional?
Honestly, I was going to make this a separate card/topic, but we don't even use these and they don't work. For example, under load we see Celery hitting 100%+ CPU and we never get alerted. I modified them here to keep them for now, but was going to raise this afterwards.


github-actions bot commented Dec 6, 2023

Staging: eks

✅   Terraform Init: success
✅   Terraform Validate: success
✅   Terraform Format: success
✅   Terraform Plan: success
✅   Conftest: success

⚠️   Warning: resources will be destroyed by this change!

Plan: 9 to add, 0 to change, 3 to destroy
Summary of changes:

CHANGE  NAME
add     aws_cloudwatch_metric_alarm.celery-email-send-primary-replicas-unavailable[0]
        aws_cloudwatch_metric_alarm.celery-email-send-scalable-replicas-unavailable[0]
        aws_cloudwatch_metric_alarm.celery-primary-pods-high-cpu-warning[0]
        aws_cloudwatch_metric_alarm.celery-primary-pods-high-memory-warning[0]
        aws_cloudwatch_metric_alarm.celery-primary-replicas-unavailable[0]
        aws_cloudwatch_metric_alarm.celery-scalable-pods-high-cpu-warning[0]
        aws_cloudwatch_metric_alarm.celery-scalable-replicas-unavailable[0]
        aws_cloudwatch_metric_alarm.celery-sms-send-primary-replicas-unavailable[0]
        aws_cloudwatch_metric_alarm.celery-sms-send-scalable-replicas-unavailable[0]
delete  aws_cloudwatch_metric_alarm.celery-pods-high-cpu-warning[0]
        aws_cloudwatch_metric_alarm.celery-pods-high-memory-warning[0]
        aws_cloudwatch_metric_alarm.celery-replicas-unavailable[0]

Plan output:
Resource actions are indicated with the following symbols:
  + create
  - destroy

Terraform will perform the following actions:

  # aws_cloudwatch_metric_alarm.celery-email-send-primary-replicas-unavailable[0] will be created
  + resource "aws_cloudwatch_metric_alarm" "celery-email-send-primary-replicas-unavailable" {
      + actions_enabled                       = true
      + alarm_actions                         = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ]
      + alarm_description                     = "Celery Email Send Primary Replicas Unavailable"
      + alarm_name                            = "celery-email-send-primary-replicas-unavailable"
      + arn                                   = (known after apply)
      + comparison_operator                   = "GreaterThanOrEqualToThreshold"
      + evaluate_low_sample_count_percentiles = (known after apply)
      + evaluation_periods                    = 2
      + id                                    = (known after apply)
      + tags_all                              = (known after apply)
      + threshold                             = 1
      + treat_missing_data                    = "notBreaching"

      + metric_query {
          + id          = "m1"
          + return_data = true

          + metric {
              + dimensions  = {
                  + "ClusterName" = "notification-canada-ca-staging-eks-cluster"
                  + "deployment"  = "celery-email-send-primary"
                  + "namespace"   = "notification-canada-ca"
                }
              + metric_name = "kube_deployment_status_replicas_unavailable"
              + namespace   = "ContainerInsights/Prometheus"
              + period      = 300
              + stat        = "Minimum"
            }
        }
    }

  # aws_cloudwatch_metric_alarm.celery-email-send-scalable-replicas-unavailable[0] will be created
  + resource "aws_cloudwatch_metric_alarm" "celery-email-send-scalable-replicas-unavailable" {
      + actions_enabled                       = true
      + alarm_actions                         = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ]
      + alarm_description                     = "Celery Email Send Scalable Replicas Unavailable"
      + alarm_name                            = "celery-email-send-scalable-replicas-unavailable"
      + arn                                   = (known after apply)
      + comparison_operator                   = "GreaterThanOrEqualToThreshold"
      + evaluate_low_sample_count_percentiles = (known after apply)
      + evaluation_periods                    = 3
      + id                                    = (known after apply)
      + tags_all                              = (known after apply)
      + threshold                             = 1
      + treat_missing_data                    = "notBreaching"

      + metric_query {
          + id          = "m1"
          + return_data = true

          + metric {
              + dimensions  = {
                  + "ClusterName" = "notification-canada-ca-staging-eks-cluster"
                  + "deployment"  = "celery-email-send-scalable"
                  + "namespace"   = "notification-canada-ca"
                }
              + metric_name = "kube_deployment_status_replicas_unavailable"
              + namespace   = "ContainerInsights/Prometheus"
              + period      = 300
              + stat        = "Minimum"
            }
        }
    }

  # aws_cloudwatch_metric_alarm.celery-pods-high-cpu-warning[0] will be destroyed
  # (because aws_cloudwatch_metric_alarm.celery-pods-high-cpu-warning is not in configuration)
  - resource "aws_cloudwatch_metric_alarm" "celery-pods-high-cpu-warning" {
      - actions_enabled           = true -> null
      - alarm_actions             = [
          - "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ] -> null
      - alarm_description         = "Average CPU of Celery pods >=50% during 10 minutes" -> null
      - alarm_name                = "celery-pods-high-cpu-warning" -> null
      - arn                       = "arn:aws:cloudwatch:ca-central-1:239043911459:alarm:celery-pods-high-cpu-warning" -> null
      - comparison_operator       = "GreaterThanOrEqualToThreshold" -> null
      - datapoints_to_alarm       = 0 -> null
      - dimensions                = {
          - "ClusterName" = "notification-canada-ca-staging-eks-cluster"
          - "Namespace"   = "notification-canada-ca"
          - "Service"     = "celery"
        } -> null
      - evaluation_periods        = 2 -> null
      - id                        = "celery-pods-high-cpu-warning" -> null
      - insufficient_data_actions = [
          - "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ] -> null
      - metric_name               = "pod_cpu_utilization" -> null
      - namespace                 = "ContainerInsights" -> null
      - ok_actions                = [] -> null
      - period                    = 300 -> null
      - statistic                 = "Average" -> null
      - tags                      = {} -> null
      - tags_all                  = {} -> null
      - threshold                 = 50 -> null
      - treat_missing_data        = "missing" -> null
    }

  # aws_cloudwatch_metric_alarm.celery-pods-high-memory-warning[0] will be destroyed
  # (because aws_cloudwatch_metric_alarm.celery-pods-high-memory-warning is not in configuration)
  - resource "aws_cloudwatch_metric_alarm" "celery-pods-high-memory-warning" {
      - actions_enabled           = true -> null
      - alarm_actions             = [
          - "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ] -> null
      - alarm_description         = "Average memory of Celery pods >=50% during 10 minutes" -> null
      - alarm_name                = "celery-pods-high-memory-warning" -> null
      - arn                       = "arn:aws:cloudwatch:ca-central-1:239043911459:alarm:celery-pods-high-memory-warning" -> null
      - comparison_operator       = "GreaterThanOrEqualToThreshold" -> null
      - datapoints_to_alarm       = 0 -> null
      - dimensions                = {
          - "ClusterName" = "notification-canada-ca-staging-eks-cluster"
          - "Namespace"   = "notification-canada-ca"
          - "Service"     = "celery"
        } -> null
      - evaluation_periods        = 2 -> null
      - id                        = "celery-pods-high-memory-warning" -> null
      - insufficient_data_actions = [
          - "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ] -> null
      - metric_name               = "pod_memory_utilization" -> null
      - namespace                 = "ContainerInsights" -> null
      - ok_actions                = [] -> null
      - period                    = 300 -> null
      - statistic                 = "Average" -> null
      - tags                      = {} -> null
      - tags_all                  = {} -> null
      - threshold                 = 50 -> null
      - treat_missing_data        = "missing" -> null
    }

  # aws_cloudwatch_metric_alarm.celery-primary-pods-high-cpu-warning[0] will be created
  + resource "aws_cloudwatch_metric_alarm" "celery-primary-pods-high-cpu-warning" {
      + actions_enabled                       = true
      + alarm_actions                         = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ]
      + alarm_description                     = "Average CPU of Primary Celery pods >=50% during 10 minutes"
      + alarm_name                            = "celery-primary-pods-high-cpu-warning"
      + arn                                   = (known after apply)
      + comparison_operator                   = "GreaterThanOrEqualToThreshold"
      + dimensions                            = {
          + "ClusterName" = "notification-canada-ca-staging-eks-cluster"
          + "Namespace"   = "notification-canada-ca"
          + "Service"     = "celery-primary"
        }
      + evaluate_low_sample_count_percentiles = (known after apply)
      + evaluation_periods                    = 2
      + id                                    = (known after apply)
      + insufficient_data_actions             = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ]
      + metric_name                           = "pod_cpu_utilization"
      + namespace                             = "ContainerInsights"
      + period                                = 300
      + statistic                             = "Average"
      + tags_all                              = (known after apply)
      + threshold                             = 50
      + treat_missing_data                    = "missing"
    }

  # aws_cloudwatch_metric_alarm.celery-primary-pods-high-memory-warning[0] will be created
  + resource "aws_cloudwatch_metric_alarm" "celery-primary-pods-high-memory-warning" {
      + actions_enabled                       = true
      + alarm_actions                         = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ]
      + alarm_description                     = "Average memory of Primary Celery pods >=50% during 10 minutes"
      + alarm_name                            = "celery-primary-pods-high-memory-warning"
      + arn                                   = (known after apply)
      + comparison_operator                   = "GreaterThanOrEqualToThreshold"
      + dimensions                            = {
          + "ClusterName" = "notification-canada-ca-staging-eks-cluster"
          + "Namespace"   = "notification-canada-ca"
          + "Service"     = "celery-primary"
        }
      + evaluate_low_sample_count_percentiles = (known after apply)
      + evaluation_periods                    = 2
      + id                                    = (known after apply)
      + insufficient_data_actions             = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ]
      + metric_name                           = "pod_memory_utilization"
      + namespace                             = "ContainerInsights"
      + period                                = 300
      + statistic                             = "Average"
      + tags_all                              = (known after apply)
      + threshold                             = 50
      + treat_missing_data                    = "missing"
    }

  # aws_cloudwatch_metric_alarm.celery-primary-replicas-unavailable[0] will be created
  + resource "aws_cloudwatch_metric_alarm" "celery-primary-replicas-unavailable" {
      + actions_enabled                       = true
      + alarm_actions                         = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ]
      + alarm_description                     = "Celery Primary Replicas Unavailable"
      + alarm_name                            = "celery-primary-replicas-unavailable"
      + arn                                   = (known after apply)
      + comparison_operator                   = "GreaterThanOrEqualToThreshold"
      + evaluate_low_sample_count_percentiles = (known after apply)
      + evaluation_periods                    = 2
      + id                                    = (known after apply)
      + tags_all                              = (known after apply)
      + threshold                             = 1
      + treat_missing_data                    = "notBreaching"

      + metric_query {
          + id          = "m1"
          + return_data = true

          + metric {
              + dimensions  = {
                  + "ClusterName" = "notification-canada-ca-staging-eks-cluster"
                  + "deployment"  = "celery-primary"
                  + "namespace"   = "notification-canada-ca"
                }
              + metric_name = "kube_deployment_status_replicas_unavailable"
              + namespace   = "ContainerInsights/Prometheus"
              + period      = 300
              + stat        = "Minimum"
            }
        }
    }

  # aws_cloudwatch_metric_alarm.celery-replicas-unavailable[0] will be destroyed
  # (because aws_cloudwatch_metric_alarm.celery-replicas-unavailable is not in configuration)
  - resource "aws_cloudwatch_metric_alarm" "celery-replicas-unavailable" {
      - actions_enabled           = true -> null
      - alarm_actions             = [
          - "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ] -> null
      - alarm_description         = "Celery Replicas Unavailable" -> null
      - alarm_name                = "celery-replicas-unavailable" -> null
      - arn                       = "arn:aws:cloudwatch:ca-central-1:239043911459:alarm:celery-replicas-unavailable" -> null
      - comparison_operator       = "GreaterThanOrEqualToThreshold" -> null
      - datapoints_to_alarm       = 0 -> null
      - dimensions                = {} -> null
      - evaluation_periods        = 2 -> null
      - id                        = "celery-replicas-unavailable" -> null
      - insufficient_data_actions = [] -> null
      - ok_actions                = [] -> null
      - period                    = 0 -> null
      - tags                      = {} -> null
      - tags_all                  = {} -> null
      - threshold                 = 1 -> null
      - treat_missing_data        = "notBreaching" -> null

      - metric_query {
          - id          = "m1" -> null
          - period      = 0 -> null
          - return_data = true -> null

          - metric {
              - dimensions  = {
                  - "ClusterName" = "notification-canada-ca-staging-eks-cluster"
                  - "deployment"  = "celery"
                  - "namespace"   = "notification-canada-ca"
                } -> null
              - metric_name = "kube_deployment_status_replicas_unavailable" -> null
              - namespace   = "ContainerInsights/Prometheus" -> null
              - period      = 300 -> null
              - stat        = "Minimum" -> null
            }
        }
    }

  # aws_cloudwatch_metric_alarm.celery-scalable-pods-high-cpu-warning[0] will be created
  + resource "aws_cloudwatch_metric_alarm" "celery-scalable-pods-high-cpu-warning" {
      + actions_enabled                       = true
      + alarm_actions                         = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ]
      + alarm_description                     = "Average CPU of Scalable Celery pods >=50% during 10 minutes"
      + alarm_name                            = "celery-scalable-pods-high-cpu-warning"
      + arn                                   = (known after apply)
      + comparison_operator                   = "GreaterThanOrEqualToThreshold"
      + dimensions                            = {
          + "ClusterName" = "notification-canada-ca-staging-eks-cluster"
          + "Namespace"   = "notification-canada-ca"
          + "Service"     = "celery-scalable"
        }
      + evaluate_low_sample_count_percentiles = (known after apply)
      + evaluation_periods                    = 2
      + id                                    = (known after apply)
      + insufficient_data_actions             = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ]
      + metric_name                           = "pod_cpu_utilization"
      + namespace                             = "ContainerInsights"
      + period                                = 300
      + statistic                             = "Average"
      + tags_all                              = (known after apply)
      + threshold                             = 50
      + treat_missing_data                    = "missing"
    }

  # aws_cloudwatch_metric_alarm.celery-scalable-replicas-unavailable[0] will be created
  + resource "aws_cloudwatch_metric_alarm" "celery-scalable-replicas-unavailable" {
      + actions_enabled                       = true
      + alarm_actions                         = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ]
      + alarm_description                     = "Celery Scalable Replicas Unavailable"
      + alarm_name                            = "celery-scalable-replicas-unavailable"
      + arn                                   = (known after apply)
      + comparison_operator                   = "GreaterThanOrEqualToThreshold"
      + evaluate_low_sample_count_percentiles = (known after apply)
      + evaluation_periods                    = 3
      + id                                    = (known after apply)
      + tags_all                              = (known after apply)
      + threshold                             = 1
      + treat_missing_data                    = "notBreaching"

      + metric_query {
          + id          = "m1"
          + return_data = true

          + metric {
              + dimensions  = {
                  + "ClusterName" = "notification-canada-ca-staging-eks-cluster"
                  + "deployment"  = "celery-scalable"
                  + "namespace"   = "notification-canada-ca"
                }
              + metric_name = "kube_deployment_status_replicas_unavailable"
              + namespace   = "ContainerInsights/Prometheus"
              + period      = 300
              + stat        = "Minimum"
            }
        }
    }

  # aws_cloudwatch_metric_alarm.celery-sms-send-primary-replicas-unavailable[0] will be created
  + resource "aws_cloudwatch_metric_alarm" "celery-sms-send-primary-replicas-unavailable" {
      + actions_enabled                       = true
      + alarm_actions                         = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ]
      + alarm_description                     = "Celery SMS Send Primary Replicas Unavailable"
      + alarm_name                            = "celery-sms-send-primary-replicas-unavailable"
      + arn                                   = (known after apply)
      + comparison_operator                   = "GreaterThanOrEqualToThreshold"
      + evaluate_low_sample_count_percentiles = (known after apply)
      + evaluation_periods                    = 2
      + id                                    = (known after apply)
      + tags_all                              = (known after apply)
      + threshold                             = 1
      + treat_missing_data                    = "notBreaching"

      + metric_query {
          + id          = "m1"
          + return_data = true

          + metric {
              + dimensions  = {
                  + "ClusterName" = "notification-canada-ca-staging-eks-cluster"
                  + "deployment"  = "celery-sms-send-primary"
                  + "namespace"   = "notification-canada-ca"
                }
              + metric_name = "kube_deployment_status_replicas_unavailable"
              + namespace   = "ContainerInsights/Prometheus"
              + period      = 300
              + stat        = "Minimum"
            }
        }
    }

  # aws_cloudwatch_metric_alarm.celery-sms-send-scalable-replicas-unavailable[0] will be created
  + resource "aws_cloudwatch_metric_alarm" "celery-sms-send-scalable-replicas-unavailable" {
      + actions_enabled                       = true
      + alarm_actions                         = [
          + "arn:aws:sns:ca-central-1:239043911459:alert-warning",
        ]
      + alarm_description                     = "Celery SMS Send Scalable Replicas Unavailable"
      + alarm_name                            = "celery-sms-send-scalable-replicas-unavailable"
      + arn                                   = (known after apply)
      + comparison_operator                   = "GreaterThanOrEqualToThreshold"
      + evaluate_low_sample_count_percentiles = (known after apply)
      + evaluation_periods                    = 3
      + id                                    = (known after apply)
      + tags_all                              = (known after apply)
      + threshold                             = 1
      + treat_missing_data                    = "notBreaching"

      + metric_query {
          + id          = "m1"
          + return_data = true

          + metric {
              + dimensions  = {
                  + "ClusterName" = "notification-canada-ca-staging-eks-cluster"
                  + "deployment"  = "celery-sms-send-scalable"
                  + "namespace"   = "notification-canada-ca"
                }
              + metric_name = "kube_deployment_status_replicas_unavailable"
              + namespace   = "ContainerInsights/Prometheus"
              + period      = 300
              + stat        = "Minimum"
            }
        }
    }

Plan: 9 to add, 0 to change, 3 to destroy.

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: plan.tfplan

To perform exactly these actions, run the following command to apply:
    terraform apply "plan.tfplan"
Conftest results:
WARN - plan.json - main - Cloudwatch log metric pattern is invalid: ["aws_cloudwatch_log_metric_filter.celery-error[0]"]
WARN - plan.json - main - Cloudwatch log metric pattern is invalid: ["aws_cloudwatch_log_metric_filter.scanfiles-timeout[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_acm_certificate.notification-canada-ca"]
WARN - plan.json - main - Missing Common Tags: ["aws_acm_certificate.notification-canada-ca-alt[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb.notification-canada-ca"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_listener.notification-canada-ca"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_target_group.notification-canada-ca-admin"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_target_group.notification-canada-ca-api"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_target_group.notification-canada-ca-document"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_target_group.notification-canada-ca-document-api"]
WARN - plan.json - main - Missing Common Tags: ["aws_alb_target_group.notification-canada-ca-documentation"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_log_group.notification-canada-ca-eks-application-logs[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_log_group.notification-canada-ca-eks-cluster-logs[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_log_group.notification-canada-ca-eks-prometheus-logs[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.admin-evicted-pods[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.admin-pods-high-cpu-warning[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.admin-pods-high-memory-warning[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.admin-replicas-unavailable[0]"]
WARN - plan.json - main - Missing Common Tags:...

@sastels (Contributor) left a comment

LGTM!

@ben851 merged commit b1b305a into main Dec 6, 2023
3 checks passed
@ben851 deleted the celery-alarm-updates branch December 6, 2023 18:10
@jzbahrai mentioned this pull request Dec 7, 2023