feat(KONFLUX-5297): Adding rules to detect Tekton Resolver leases not distributed correctly #409

Open
wants to merge 1 commit into
base: main
from

Conversation

jhutar
Member

@jhutar jhutar commented Nov 11, 2024

No description provided.

@jhutar
Member Author

jhutar commented Nov 12, 2024

I used this script to generate the alerts and tests, since they are very repetitive:

#!/bin/bash

set -eux

ALERTS_FILE=rhobs/alerting/data_plane/prometheus.pipeline_alerts.yaml
TESTS_FILE=test/promql/tests/data_plane/pipeline_leases_distribution_test.yaml

###cat << EOF >$ALERTS_FILE
###apiVersion: monitoring.coreos.com/v1
###kind: PrometheusRule
###metadata:
###  name: rhtap-pipeline-alerting
###  labels:
###    tenant: rhtap
###spec:
###  groups:
###    - name: pipeline_alerts
###      interval: 1m
###      rules:
###EOF
###
###cat << EOF >$TESTS_FILE
###evaluation_interval: 1m
###
###rule_files:
###  - prometheus.pipeline_alerts.yaml
###
###tests:
###EOF


for resolver in Bundle,bundleresolver Cluster,cluster Git,git Http,http Hub,hub; do
    pretty=$( echo "$resolver" | cut -d , -f 1 )
    lower=$( echo "$resolver" | cut -d , -f 2 )

    test_lease="controller.tektonresolverframework.${lower}"
    test_id="TektonResolver${pretty}ResolverLeasesNotDistributed"
    test_id_2="TektonResolver${pretty}ResolverLeasesHolderMissing"
    test_name="Tekton Resolver ${pretty} resolver leases not distributed across more than one pod"
    test_name_2="Tekton Resolver ${pretty} resolver leases holder does not have pod assigned to it"

cat << EOF >>$ALERTS_FILE
        - alert: ${test_id}
          expr: (count by (source_cluster) (count by (lease_holder, source_cluster) (kube_lease_owner{lease=~"${test_lease}..*", lease_holder!=""}))) <= 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: ${test_name}
            description: >-
              ${test_name}
              on cluster {{ \$labels.source_cluster }}.
            alert_team_handle: <!subteam^S04PYECHCCU>
            team: pipelines
            runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/alert-rule-PipelinesControllerLeasesNotDistributed.md
EOF

cat << EOF >>$ALERTS_FILE
        - alert: ${test_id_2}
          expr: kube_lease_owner{lease=~"${test_lease}..*", lease_holder=""}
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: ${test_name_2}
            description: >-
              ${test_name_2}
              on cluster {{ \$labels.source_cluster }}.
            alert_team_handle: <!subteam^S04PYECHCCU>
            team: pipelines
            runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/alert-rule-PipelinesControllerLeasesNotDistributed.md
EOF

cat << EOF >>$TESTS_FILE
  # ----- ${test_name} Tests ----
  - interval: 1m
    input_series:

      # see https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/ for explanations of the expanding notation used for the values
      # Leases on one cluster are distributed only across 1 pod, so let's raise an alert
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.01-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.02-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.03-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.01-of-04", lease_holder="tekton-pipelines-remote-resolvers-ccc"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.02-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.03-of-04", lease_holder="tekton-pipelines-remote-resolvers-ccc"}'
        values: '1x20'

    alert_rule_test:
      - eval_time: 10m
        alertname: ${test_id}
        exp_alerts:
          - exp_labels:
              severity: warning
              source_cluster: cluster01
            exp_annotations:
              summary: ${test_name}
              description: >-
                ${test_name}
                on cluster cluster01.
              alert_team_handle: <!subteam^S04PYECHCCU>
              team: pipelines
              runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/alert-rule-PipelinesControllerLeasesNotDistributed.md
EOF

cat << EOF >>$TESTS_FILE
  - interval: 1m
    input_series:

      # see https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/ for explanations of the expanding notation used for the values
      # Leases on both clusters are distributed only across 1 pod, so let's raise an alert
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.01-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.02-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.03-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.01-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.02-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.03-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x20'

    alert_rule_test:
      - eval_time: 10m
        alertname: ${test_id}
        exp_alerts:
          - exp_labels:
              severity: warning
              source_cluster: cluster01
            exp_annotations:
              summary: ${test_name}
              description: >-
                ${test_name}
                on cluster cluster01.
              alert_team_handle: <!subteam^S04PYECHCCU>
              team: pipelines
              runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/alert-rule-PipelinesControllerLeasesNotDistributed.md
          - exp_labels:
              severity: warning
              source_cluster: cluster02
            exp_annotations:
              summary: ${test_name}
              description: >-
                ${test_name}
                on cluster cluster02.
              alert_team_handle: <!subteam^S04PYECHCCU>
              team: pipelines
              runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/alert-rule-PipelinesControllerLeasesNotDistributed.md
EOF

cat << EOF >>$TESTS_FILE
  # ----- ${test_name} Tests ----
  - interval: 1m
    input_series:

      # see https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/ for explanations of the expanding notation used for the values
      # Leases on both clusters are distributed across more pods, so all is good
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.01-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.02-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.03-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder="tekton-pipelines-remote-resolvers-ccc"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.01-of-04", lease_holder="tekton-pipelines-remote-resolvers-ddd"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.02-of-04", lease_holder="tekton-pipelines-remote-resolvers-eee"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.03-of-04", lease_holder="tekton-pipelines-remote-resolvers-ccc"}'
        values: '1x20'

    alert_rule_test:
      - eval_time: 10m
        alertname: ${test_id}
EOF

cat << EOF >>$TESTS_FILE
  # ----- ${test_name_2} Tests ----
  - interval: 1m
    input_series:

      # see https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/ for explanations of the expanding notation used for the values
      # Leases metric is present, but lease_holder is empty, so let's raise an alert
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder=""}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x20'

    alert_rule_test:
      - eval_time: 10m
        alertname: ${test_id_2}
        exp_alerts:
          - exp_labels:
              severity: warning
              source_cluster: cluster01
              lease: ${test_lease}.00-of-04
              namespace: openshift-pipelines
            exp_annotations:
              summary: ${test_name_2}
              description: >-
                ${test_name_2}
                on cluster cluster01.
              alert_team_handle: <!subteam^S04PYECHCCU>
              team: pipelines
              runbook_url: https://gitlab.cee.redhat.com/konflux/docs/sop/-/blob/main/o11y/alert-rule-PipelinesControllerLeasesNotDistributed.md
EOF

cat << EOF >>$TESTS_FILE
  # ----- ${test_name_2} Tests ----
  - interval: 1m
    input_series:

      # see https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/ for explanations of the expanding notation used for the values
      # If metric was not there, but then appeared, it should not trigger the alert
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '_x6 1x14'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.01-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '_x6 1x14'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.02-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '_x6 1x14'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.03-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '_x6 1x14'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder="tekton-pipelines-remote-resolvers-ccc"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.01-of-04", lease_holder="tekton-pipelines-remote-resolvers-ddd"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.02-of-04", lease_holder="tekton-pipelines-remote-resolvers-ccc"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.03-of-04", lease_holder="tekton-pipelines-remote-resolvers-ddd"}'
        values: '1x20'

    alert_rule_test:
      - eval_time: 10m
        alertname: ${test_id_2}
EOF

cat << EOF >>$TESTS_FILE
  # ----- ${test_name_2} Tests ----
  - interval: 1m
    input_series:

      # see https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/ for explanations of the expanding notation used for the values
      # If metric was there, but then disappeared, it is bad (incorrect state) but it is not captured by this alert, so we do not want alert to be triggered by this test as well
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x6 _x14'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.01-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x6 _x14'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.02-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x6 _x14'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.03-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x6 _x14'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder="tekton-pipelines-remote-resolvers-ccc"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.01-of-04", lease_holder="tekton-pipelines-remote-resolvers-ddd"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.02-of-04", lease_holder="tekton-pipelines-remote-resolvers-ccc"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster02", namespace="openshift-pipelines", lease="${test_lease}.03-of-04", lease_holder="tekton-pipelines-remote-resolvers-ddd"}'
        values: '1x20'

    alert_rule_test:
      - eval_time: 10m
        alertname: ${test_id_2}
EOF

cat << EOF >>$TESTS_FILE
  # ----- ${test_name_2} Tests ----
  - interval: 1m
    input_series:

      # see https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/ for explanations of the expanding notation used for the values
      # Leases metric is there with right labels, so all is good
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.00-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.01-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.02-of-04", lease_holder="tekton-pipelines-remote-resolvers-aaa"}'
        values: '1x20'
      - series: 'kube_lease_owner{source_cluster="cluster01", namespace="openshift-pipelines", lease="${test_lease}.03-of-04", lease_holder="tekton-pipelines-remote-resolvers-bbb"}'
        values: '1x20'

    alert_rule_test:
      - eval_time: 10m
        alertname: ${test_id_2}
EOF
done

# Convert the alerts to a format promtool can work with
yq .spec "$ALERTS_FILE" | tee test/promql/tests/data_plane/prometheus.pipeline_alerts.yaml
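
For readers less familiar with the nested `count by` in the first alert expression, here is a rough shell sketch of the same idea run against hand-written sample data: count the distinct non-empty lease holders per cluster and report the clusters that have at most one. The function name `clusters_with_single_holder` and the whitespace-separated input format are assumptions made for illustration only; neither is part of this PR or of Prometheus.

```shell
#!/bin/bash
# Hypothetical helper, not part of the PR. It mimics the shape of:
#   count by (source_cluster) (count by (lease_holder, source_cluster) (...)) <= 1
# Input (stdin): lines of "<source_cluster> <lease> <lease_holder>".
# Output: clusters whose leases are all held by at most one pod.
clusters_with_single_holder() {
    awk '
        $3 != "" {                      # same idea as lease_holder!=""
            key = $1 SUBSEP $3
            if (!(key in seen)) {       # inner count by (lease_holder, source_cluster)
                seen[key] = 1
                holders[$1]++           # outer count by (source_cluster)
            }
        }
        END {
            for (c in holders)
                if (holders[c] <= 1) print c
        }
    ' | sort
}

# Sample data modelled on the first generated test case (names shortened).
sample='cluster01 lease.00-of-04 resolvers-aaa
cluster01 lease.01-of-04 resolvers-aaa
cluster02 lease.00-of-04 resolvers-bbb
cluster02 lease.01-of-04 resolvers-ccc'

printf '%s\n' "$sample" | clusters_with_single_holder
```

As in the PromQL expression, a cluster whose only series have an empty `lease_holder` produces no row here at all; that situation is what the separate "holder missing" alert is meant to catch.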

@ralphbean
Member

If you rebase on main that policy failure should go away now that #413 is merged.

This is adding rules to detect Tekton Resolver leases not distributed
correctly and also renames target SOP URL as it will change in
https://gitlab.cee.redhat.com/konflux/docs/sop/-/merge_requests/225

Signed-off-by: Jan Hutar <[email protected]>
@jhutar
Member Author

jhutar commented Nov 14, 2024

Thank you! Rebased.
