Add MimirGossipMembersEndpointsOutOfSync alert (#9347)
* Add MimirGossipMembersEndpointsOutOfSync alert

Signed-off-by: Marco Pracucci <[email protected]>

* Update docs/sources/mimir/manage/mimir-runbooks/_index.md

Co-authored-by: Arve Knudsen <[email protected]>

* Update docs/sources/mimir/manage/mimir-runbooks/_index.md

Co-authored-by: Arve Knudsen <[email protected]>

* Apply suggestions from code review

Co-authored-by: Peter Štibraný <[email protected]>

* Update runbooks

Signed-off-by: Marco Pracucci <[email protected]>

* Make doc linter happy

Signed-off-by: Marco Pracucci <[email protected]>

* Improve alerting query

Signed-off-by: Marco Pracucci <[email protected]>

* Use count() instead of sum()

Signed-off-by: Marco Pracucci <[email protected]>

---------

Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Arve Knudsen <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
3 people authored Sep 20, 2024
1 parent 3c4f00e commit 6c4f733
Showing 6 changed files with 241 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -146,6 +146,7 @@
* [ENHANCEMENT] Dashboards: add 'Read path' selector to 'Mimir / Queries' dashboard. #8878
* [ENHANCEMENT] Dashboards: add annotation indicating active series are being reloaded to 'Mimir / Tenants' dashboard. #9257
* [ENHANCEMENT] Dashboards: limit results on the 'Failed evaluations rate' panel of the 'Mimir / Tenants' dashboard to 50 to avoid crashing the page when there are many failing groups. #9262
* [FEATURE] Alerts: add `MimirGossipMembersEndpointsOutOfSync` alert. #9347
* [BUGFIX] Dashboards: fix "current replicas" in autoscaling panels when HPA is not active. #8566
* [BUGFIX] Alerts: do not fire `MimirRingMembersMismatch` during the migration to experimental ingest storage. #8727
* [BUGFIX] Dashboards: avoid over-counting of ingesters metrics when migrating to experimental ingest storage. #9170
50 changes: 50 additions & 0 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -920,6 +920,56 @@ How to **investigate**:
- These errors (and others) can be found by searching for messages prefixed with `TCPTransport:`.
- Logs coming directly from memberlist are also logged by Mimir; they may indicate where to investigate further. These can be identified as such due to being tagged with `caller=memberlist_logger.go:<line>`.

### MimirGossipMembersEndpointsOutOfSync

This alert fires when the list of endpoints returned by the `gossip-ring` Kubernetes service is out of sync with the running Mimir pods.

How it **works**:

- The Kubernetes service `gossip-ring` is used by Mimir to find memberlist seed nodes to join at startup. By default, the service
DNS returns the IPs of all Mimir pods, which means any Mimir pod can be used as a seed node (this is the safest option).
- Due to Kubernetes bugs (for example, [this one](https://github.com/kubernetes/kubernetes/issues/127370)), the pod IPs
returned by the service DNS may go out of sync, up to the point where none of the returned IPs belongs to any
live pod. If that happens, new Mimir pods can't join memberlist at startup. The sketch below shows one way to spot such a mismatch.
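
The following is a minimal sketch that compares the IPs resolved through the service DNS with the IPs of the running Mimir pods. It assumes the default `gossip-ring` service name and a placeholder namespace; the `busybox` pod is only used to run `nslookup` from inside the cluster:

```sh
NAMESPACE="mimir" # Placeholder: adjust for your cluster.

# IPs returned by the gossip-ring service DNS, resolved from inside the cluster.
kubectl --namespace "$NAMESPACE" run gossip-dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup "gossip-ring.$NAMESPACE.svc.cluster.local"

# IPs of the live Mimir pods, for comparison.
kubectl --namespace "$NAMESPACE" get pods -o wide
```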

How to **investigate**:

- Check the number of endpoints matching the `gossip-ring` service:
```sh
kubectl --namespace <namespace> get endpoints gossip-ring
```
- If the number of endpoints is exactly 1000, you have hit the Kubernetes limit: the endpoints list gets truncated and
you may be affected by [this bug](https://github.com/kubernetes/kubernetes/issues/127370). Having more than 1000 pods
matched by the `gossip-ring` service, and thus an endpoints list truncated to 1000, is not an issue per se; it only
becomes an issue if you're running a Kubernetes version affected by that bug. One way to check whether the list was
truncated is shown in the sketch after this list.
- If you've been affected by the Kubernetes bug:

1. Stop the bleeding by re-creating the service endpoints list:

```sh
CONTEXT="TODO"
NAMESPACE="TODO"
SERVICE="gossip-ring"

# Re-apply the list of bad endpoints as is.
kubectl --context "$CONTEXT" --namespace "$NAMESPACE" get endpoints "$SERVICE" -o yaml > /tmp/service-endpoints.yaml
kubectl --context "$CONTEXT" --namespace "$NAMESPACE" apply -f /tmp/service-endpoints.yaml

# Delete one querier pod to trigger Kubernetes service endpoints reconciliation.
POD=$(kubectl --context "$CONTEXT" --namespace "$NAMESPACE" get pods -l name=querier --output="jsonpath={.items[0].metadata.name}")
kubectl --context "$CONTEXT" --namespace "$NAMESPACE" delete pod "$POD"
```

2. Consider excluding some deployments from the `gossip-ring` service selector, to reduce the number of matching pods below 1000.
This is a temporary workaround, and you should revert it once you upgrade Kubernetes to a version where the bug is fixed.

An example of how you can do it with jsonnet:

```jsonnet
querier_deployment+:
$.apps.v1.statefulSet.spec.template.metadata.withLabelsMixin({ [$._config.gossip_member_label]: 'false' }),
```
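
To check whether the endpoints list was truncated at the 1000-address limit mentioned above, you can count the addresses on the `Endpoints` object and inspect the `endpoints.kubernetes.io/over-capacity` annotation, which Kubernetes sets to `truncated` when more than 1000 addresses match the service. This is a sketch, assuming `jq` is installed and using a placeholder namespace:

```sh
NAMESPACE="mimir" # Placeholder: adjust for your cluster.

# Count the addresses listed on the gossip-ring Endpoints object (Kubernetes caps this at 1000).
kubectl --namespace "$NAMESPACE" get endpoints gossip-ring -o json \
  | jq '[.subsets[]?.addresses[]?] | length'

# Kubernetes sets this annotation to "truncated" when more than 1000 addresses matched the service.
kubectl --namespace "$NAMESPACE" get endpoints gossip-ring \
  -o jsonpath='{.metadata.annotations.endpoints\.kubernetes\.io/over-capacity}'
```

If the count is exactly 1000 or the annotation reads `truncated`, the service matches more than 1000 pods and you may be in the situation covered by the Kubernetes bug above.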

### EtcdAllocatingTooMuchMemory

This can be triggered if there are too many HA dedupe keys in etcd. We saw this when one of our clusters hit 20K tenants that were using HA dedupe config. Raise the etcd limits via:
@@ -552,6 +552,50 @@ spec:
for: 20m
labels:
severity: warning
- alert: MimirGossipMembersEndpointsOutOfSync
annotations:
message: Mimir gossip-ring service endpoints list in {{ $labels.cluster }}/{{ $labels.namespace }} is out of sync.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirgossipmembersendpointsoutofsync
expr: |
(
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (cluster, namespace, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 10
)
# Filter by Mimir only.
and (count by(cluster, namespace) (cortex_build_info) > 0)
for: 15m
labels:
severity: warning
- alert: MimirGossipMembersEndpointsOutOfSync
annotations:
message: Mimir gossip-ring service endpoints list in {{ $labels.cluster }}/{{ $labels.namespace }} is out of sync.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirgossipmembersendpointsoutofsync
expr: |
(
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (cluster, namespace, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 50
)
# Filter by Mimir only.
and (count by(cluster, namespace) (cortex_build_info) > 0)
for: 5m
labels:
severity: critical
- name: etcd_alerts
rules:
- alert: EtcdAllocatingTooMuchMemory
44 changes: 44 additions & 0 deletions operations/mimir-mixin-compiled-baremetal/alerts.yaml
@@ -530,6 +530,50 @@ groups:
for: 20m
labels:
severity: warning
- alert: MimirGossipMembersEndpointsOutOfSync
annotations:
message: Mimir gossip-ring service endpoints list in {{ $labels.cluster }}/{{ $labels.namespace }} is out of sync.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirgossipmembersendpointsoutofsync
expr: |
(
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (cluster, namespace, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 10
)
# Filter by Mimir only.
and (count by(cluster, namespace) (cortex_build_info) > 0)
for: 15m
labels:
severity: warning
- alert: MimirGossipMembersEndpointsOutOfSync
annotations:
message: Mimir gossip-ring service endpoints list in {{ $labels.cluster }}/{{ $labels.namespace }} is out of sync.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirgossipmembersendpointsoutofsync
expr: |
(
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (cluster, namespace, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 50
)
# Filter by Mimir only.
and (count by(cluster, namespace) (cortex_build_info) > 0)
for: 5m
labels:
severity: critical
- name: etcd_alerts
rules:
- alert: EtcdAllocatingTooMuchMemory
44 changes: 44 additions & 0 deletions operations/mimir-mixin-compiled/alerts.yaml
@@ -540,6 +540,50 @@ groups:
for: 20m
labels:
severity: warning
- alert: MimirGossipMembersEndpointsOutOfSync
annotations:
message: Mimir gossip-ring service endpoints list in {{ $labels.cluster }}/{{ $labels.namespace }} is out of sync.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirgossipmembersendpointsoutofsync
expr: |
(
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (cluster, namespace, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 10
)
# Filter by Mimir only.
and (count by(cluster, namespace) (cortex_build_info) > 0)
for: 15m
labels:
severity: warning
- alert: MimirGossipMembersEndpointsOutOfSync
annotations:
message: Mimir gossip-ring service endpoints list in {{ $labels.cluster }}/{{ $labels.namespace }} is out of sync.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirgossipmembersendpointsoutofsync
expr: |
(
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (cluster, namespace, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 50
)
# Filter by Mimir only.
and (count by(cluster, namespace) (cortex_build_info) > 0)
for: 5m
labels:
severity: critical
- name: etcd_alerts
rules:
- alert: EtcdAllocatingTooMuchMemory
58 changes: 58 additions & 0 deletions operations/mimir-mixin/alerts/alerts.libsonnet
@@ -806,6 +806,64 @@ local utils = import 'mixin-utils/utils.libsonnet';
message: 'One or more %(product)s instances in %(alert_aggregation_variables)s consistently sees a lower than expected number of gossip members.' % $._config,
},
},
{
// Alert if the list of endpoints returned by the gossip-ring service (used as memberlist seed nodes)
// is out-of-sync. This is a warning alert with 10% out-of-sync threshold.
alert: $.alertName('GossipMembersEndpointsOutOfSync'),
expr:
|||
(
count by(%(alert_aggregation_labels)s) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (%(alert_aggregation_labels)s, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(%(alert_aggregation_labels)s) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 10
)
# Filter by Mimir only.
and (count by(%(alert_aggregation_labels)s) (cortex_build_info) > 0)
||| % $._config,
'for': '15m',
labels: {
severity: 'warning',
},
annotations: {
message: '%(product)s gossip-ring service endpoints list in %(alert_aggregation_variables)s is out of sync.' % $._config,
},
},
{
// Alert if the list of endpoints returned by the gossip-ring service (used as memberlist seed nodes)
// is out-of-sync. This is a critical alert with 50% out-of-sync threshold.
alert: $.alertName('GossipMembersEndpointsOutOfSync'),
expr:
|||
(
count by(%(alert_aggregation_labels)s) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (%(alert_aggregation_labels)s, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(%(alert_aggregation_labels)s) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 50
)
# Filter by Mimir only.
and (count by(%(alert_aggregation_labels)s) (cortex_build_info) > 0)
||| % $._config,
'for': '5m',
labels: {
severity: 'critical',
},
annotations: {
message: '%(product)s gossip-ring service endpoints list in %(alert_aggregation_variables)s is out of sync.' % $._config,
},
},
],
},
{
