Add MimirGossipMembersEndpointsOutOfSync alert (#9347)
* Add MimirGossipMembersEndpointsOutOfSync alert

Signed-off-by: Marco Pracucci <[email protected]>

* Update docs/sources/mimir/manage/mimir-runbooks/_index.md

Co-authored-by: Arve Knudsen <[email protected]>

* Update docs/sources/mimir/manage/mimir-runbooks/_index.md

Co-authored-by: Arve Knudsen <[email protected]>

* Apply suggestions from code review

Co-authored-by: Peter Štibraný <[email protected]>

* Update runbooks

Signed-off-by: Marco Pracucci <[email protected]>

* Make doc linter happy

Signed-off-by: Marco Pracucci <[email protected]>

* Improve alerting query

Signed-off-by: Marco Pracucci <[email protected]>

* Use count() instead of sum()

Signed-off-by: Marco Pracucci <[email protected]>

---------

Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Arve Knudsen <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
3 people authored Sep 20, 2024
1 parent 3c4f00e commit 6c4f733
Showing 6 changed files with 241 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -146,6 +146,7 @@
* [ENHANCEMENT] Dashboards: add 'Read path' selector to 'Mimir / Queries' dashboard. #8878
* [ENHANCEMENT] Dashboards: add annotation indicating active series are being reloaded to 'Mimir / Tenants' dashboard. #9257
* [ENHANCEMENT] Dashboards: limit results on the 'Failed evaluations rate' panel of the 'Mimir / Tenants' dashboard to 50 to avoid crashing the page when there are many failing groups. #9262
* [FEATURE] Alerts: add `MimirGossipMembersEndpointsOutOfSync` alert. #9347
* [BUGFIX] Dashboards: fix "current replicas" in autoscaling panels when HPA is not active. #8566
* [BUGFIX] Alerts: do not fire `MimirRingMembersMismatch` during the migration to experimental ingest storage. #8727
* [BUGFIX] Dashboards: avoid over-counting of ingesters metrics when migrating to experimental ingest storage. #9170
50 changes: 50 additions & 0 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -920,6 +920,56 @@ How to **investigate**:
- These errors (and others) can be found by searching for messages prefixed with `TCPTransport:`.
- Logs coming directly from memberlist are also logged by Mimir; they may indicate where to investigate further. These can be identified as such due to being tagged with `caller=memberlist_logger.go:<line>`.

### MimirGossipMembersEndpointsOutOfSync

This alert fires when the list of endpoints returned by the `gossip-ring` Kubernetes service is out of sync with the running Mimir pods.

How it **works**:

- The Kubernetes service `gossip-ring` is used by Mimir to find memberlist seed nodes to join at startup. By default, the service
DNS returns the IPs of all Mimir pods, which means any Mimir pod can be used as a seed node (this is the safest option).
- Due to Kubernetes bugs (for example, [this one](https://github.com/kubernetes/kubernetes/issues/127370)), the pod IPs
returned by the service DNS may go out of sync, up to the point where none of the returned IPs belongs to any
live pod. If that happens, new Mimir pods can't join memberlist at startup. The sketch below shows one way to spot such a mismatch.
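
The following is a minimal sketch that compares the IPs resolved through the service DNS with the IPs of the running Mimir pods. It assumes the default `gossip-ring` service name and a placeholder namespace; the `busybox` pod is only used to run `nslookup` from inside the cluster:

```sh
NAMESPACE="mimir" # Placeholder: adjust for your cluster.

# IPs returned by the gossip-ring service DNS, resolved from inside the cluster.
kubectl --namespace "$NAMESPACE" run gossip-dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup "gossip-ring.$NAMESPACE.svc.cluster.local"

# IPs of the live Mimir pods, for comparison.
kubectl --namespace "$NAMESPACE" get pods -o wide
```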

How to **investigate**:

- Check the number of endpoints matching the `gossip-ring` service:
```sh
kubectl --namespace <namespace> get endpoints gossip-ring
```
- If the number of endpoints is exactly 1000, you have hit the Kubernetes limit: the endpoints list gets truncated and
you may be affected by [this bug](https://github.com/kubernetes/kubernetes/issues/127370). Having more than 1000 pods
matched by the `gossip-ring` service, and thus an endpoints list truncated to 1000, is not an issue per se; it only
becomes an issue if you're running a Kubernetes version affected by that bug. One way to check whether the list was
truncated is shown in the sketch after this list.
- If you've been affected by the Kubernetes bug:

1. Stop the bleeding by re-creating the service endpoints list:

```sh
CONTEXT="TODO"
NAMESPACE="TODO"
SERVICE="gossip-ring"

# Re-apply the list of bad endpoints as is.
kubectl --context "$CONTEXT" --namespace "$NAMESPACE" get endpoints "$SERVICE" -o yaml > /tmp/service-endpoints.yaml
kubectl --context "$CONTEXT" --namespace "$NAMESPACE" apply -f /tmp/service-endpoints.yaml

# Delete one querier pod to trigger Kubernetes service endpoints reconciliation.
POD=$(kubectl --context "$CONTEXT" --namespace "$NAMESPACE" get pods -l name=querier --output="jsonpath={.items[0].metadata.name}")
kubectl --context "$CONTEXT" --namespace "$NAMESPACE" delete pod "$POD"
```

2. Consider excluding some deployments from the `gossip-ring` service selector, to reduce the number of matching pods below 1000.
This is a temporary workaround, and you should revert it once you upgrade Kubernetes to a version where the bug is fixed.

An example of how you can do it with jsonnet:

```jsonnet
querier_deployment+:
$.apps.v1.statefulSet.spec.template.metadata.withLabelsMixin({ [$._config.gossip_member_label]: 'false' }),
```
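
To check whether the endpoints list was truncated at the 1000-address limit mentioned above, you can count the addresses on the `Endpoints` object and inspect the `endpoints.kubernetes.io/over-capacity` annotation, which Kubernetes sets to `truncated` when more than 1000 addresses match the service. This is a sketch, assuming `jq` is installed and using a placeholder namespace:

```sh
NAMESPACE="mimir" # Placeholder: adjust for your cluster.

# Count the addresses listed on the gossip-ring Endpoints object (Kubernetes caps this at 1000).
kubectl --namespace "$NAMESPACE" get endpoints gossip-ring -o json \
  | jq '[.subsets[]?.addresses[]?] | length'

# Kubernetes sets this annotation to "truncated" when more than 1000 addresses matched the service.
kubectl --namespace "$NAMESPACE" get endpoints gossip-ring \
  -o jsonpath='{.metadata.annotations.endpoints\.kubernetes\.io/over-capacity}'
```

If the count is exactly 1000 or the annotation reads `truncated`, the service matches more than 1000 pods and you may be in the situation covered by the Kubernetes bug above.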

### EtcdAllocatingTooMuchMemory

This can be triggered if there are too many HA dedupe keys in etcd. We saw this when one of our clusters hit 20K tenants that were using HA dedupe config. Raise the etcd limits via:
@@ -552,6 +552,50 @@ spec:
for: 20m
labels:
severity: warning
- alert: MimirGossipMembersEndpointsOutOfSync
annotations:
message: Mimir gossip-ring service endpoints list in {{ $labels.cluster }}/{{ $labels.namespace }} is out of sync.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirgossipmembersendpointsoutofsync
expr: |
(
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (cluster, namespace, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 10
)
# Filter by Mimir only.
and (count by(cluster, namespace) (cortex_build_info) > 0)
for: 15m
labels:
severity: warning
- alert: MimirGossipMembersEndpointsOutOfSync
annotations:
message: Mimir gossip-ring service endpoints list in {{ $labels.cluster }}/{{ $labels.namespace }} is out of sync.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirgossipmembersendpointsoutofsync
expr: |
(
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (cluster, namespace, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 50
)
# Filter by Mimir only.
and (count by(cluster, namespace) (cortex_build_info) > 0)
for: 5m
labels:
severity: critical
- name: etcd_alerts
rules:
- alert: EtcdAllocatingTooMuchMemory
44 changes: 44 additions & 0 deletions operations/mimir-mixin-compiled-baremetal/alerts.yaml
@@ -530,6 +530,50 @@ groups:
for: 20m
labels:
severity: warning
- alert: MimirGossipMembersEndpointsOutOfSync
annotations:
message: Mimir gossip-ring service endpoints list in {{ $labels.cluster }}/{{ $labels.namespace }} is out of sync.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirgossipmembersendpointsoutofsync
expr: |
(
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (cluster, namespace, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 10
)
# Filter by Mimir only.
and (count by(cluster, namespace) (cortex_build_info) > 0)
for: 15m
labels:
severity: warning
- alert: MimirGossipMembersEndpointsOutOfSync
annotations:
message: Mimir gossip-ring service endpoints list in {{ $labels.cluster }}/{{ $labels.namespace }} is out of sync.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirgossipmembersendpointsoutofsync
expr: |
(
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (cluster, namespace, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 50
)
# Filter by Mimir only.
and (count by(cluster, namespace) (cortex_build_info) > 0)
for: 5m
labels:
severity: critical
- name: etcd_alerts
rules:
- alert: EtcdAllocatingTooMuchMemory
44 changes: 44 additions & 0 deletions operations/mimir-mixin-compiled/alerts.yaml
@@ -540,6 +540,50 @@ groups:
for: 20m
labels:
severity: warning
- alert: MimirGossipMembersEndpointsOutOfSync
annotations:
message: Mimir gossip-ring service endpoints list in {{ $labels.cluster }}/{{ $labels.namespace }} is out of sync.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirgossipmembersendpointsoutofsync
expr: |
(
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (cluster, namespace, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 10
)
# Filter by Mimir only.
and (count by(cluster, namespace) (cortex_build_info) > 0)
for: 15m
labels:
severity: warning
- alert: MimirGossipMembersEndpointsOutOfSync
annotations:
message: Mimir gossip-ring service endpoints list in {{ $labels.cluster }}/{{ $labels.namespace }} is out of sync.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirgossipmembersendpointsoutofsync
expr: |
(
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (cluster, namespace, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(cluster, namespace) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 50
)
# Filter by Mimir only.
and (count by(cluster, namespace) (cortex_build_info) > 0)
for: 5m
labels:
severity: critical
- name: etcd_alerts
rules:
- alert: EtcdAllocatingTooMuchMemory
58 changes: 58 additions & 0 deletions operations/mimir-mixin/alerts/alerts.libsonnet
@@ -806,6 +806,64 @@ local utils = import 'mixin-utils/utils.libsonnet';
message: 'One or more %(product)s instances in %(alert_aggregation_variables)s consistently sees a lower than expected number of gossip members.' % $._config,
},
},
{
// Alert if the list of endpoints returned by the gossip-ring service (used as memberlist seed nodes)
// is out-of-sync. This is a warning alert with 10% out-of-sync threshold.
alert: $.alertName('GossipMembersEndpointsOutOfSync'),
expr:
|||
(
count by(%(alert_aggregation_labels)s) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (%(alert_aggregation_labels)s, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(%(alert_aggregation_labels)s) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 10
)
# Filter by Mimir only.
and (count by(%(alert_aggregation_labels)s) (cortex_build_info) > 0)
||| % $._config,
'for': '15m',
labels: {
severity: 'warning',
},
annotations: {
message: '%(product)s gossip-ring service endpoints list in %(alert_aggregation_variables)s is out of sync.' % $._config,
},
},
{
// Alert if the list of endpoints returned by the gossip-ring service (used as memberlist seed nodes)
// is out-of-sync. This is a critical alert with 50% out-of-sync threshold.
alert: $.alertName('GossipMembersEndpointsOutOfSync'),
expr:
|||
(
count by(%(alert_aggregation_labels)s) (
kube_endpoint_address{endpoint="gossip-ring"}
unless on (%(alert_aggregation_labels)s, ip)
label_replace(kube_pod_info, "ip", "$1", "pod_ip", "(.*)"))
/
count by(%(alert_aggregation_labels)s) (
kube_endpoint_address{endpoint="gossip-ring"}
)
* 100 > 50
)
# Filter by Mimir only.
and (count by(%(alert_aggregation_labels)s) (cortex_build_info) > 0)
||| % $._config,
'for': '5m',
labels: {
severity: 'critical',
},
annotations: {
message: '%(product)s gossip-ring service endpoints list in %(alert_aggregation_variables)s is out of sync.' % $._config,
},
},
],
},
{
