expanding series: too many unhealthy instances in the ring #5158

abhinavDhulipala · 2023-06-03T22:06:06Z

abhinavDhulipala
Jun 3, 2023

Describe the bug

Turned mimir on for the first time last night. Woke up today to all my dashboards throwing this error

["expanding series: too many unhealthy instances in the ring"](internal: rpc error: code = Code(500) desc = {"status":"error","errorType":"internal","error":"expanding series: too many unhealthy instances in the ring"})

To Reproduce

Steps to reproduce the behavior:

Install Mimir deploying the small config
Setup prometheus remote write
Wait????

Expected behavior

Expected mimir to continue running normally. Buckets, PV's, and Node health are just fine...

Environment

Infrastructure: On-prem K8s cluster administered through kubeadm

Client Version: v1.24.14
Kustomize Version: v4.5.4
Server Version: v1.24.0

Deployment tool: helm chart

Additional Context

All of our monitoring goes through a single anonymous user on a prometheus instance deployed else-where in our k8s cluster.
Here are our remote-write configs. Our mimir config follows the small config more or less. With a couple tweaks to ingest limits. Our S3 bucket is deployed on local powerscale s3 compliant filestores, but that shouldn't be a factor here considering none of the problems seem to be with bucket read/write

remoteWrite:
      - url: "http://mimir-nginx.observability.svc:80/api/v1/push"
        queueConfig:
          capacity: 5000  
          maxShards: 100
          maxSamplesPerSend: 1000

$ helm show chart grafana/mimir-distributed
apiVersion: v2
appVersion: 2.8.0
dependencies:
- alias: minio
  condition: minio.enabled
  name: minio
  repository: https://charts.min.io/
  version: 5.0.7
- alias: grafana-agent-operator
  condition: metaMonitoring.grafanaAgent.installOperator
  name: grafana-agent-operator
  repository: https://grafana.github.io/helm-charts
  version: 0.2.8
- alias: rollout_operator
  condition: rollout_operator.enabled
  name: rollout-operator
  repository: https://grafana.github.io/helm-charts
  version: 0.4.2
description: Grafana Mimir
home: https://grafana.com/docs/mimir/v2.8.x/
icon: https://grafana.com/static/img/logos/logo-mimir.svg
kubeVersion: ^1.20.0-0
name: mimir-distributed
version: 4.4.1

Relevant logs from ingesters:

ts=2023-06-03T21:37:21.568191115Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-ingester-zone-a-0-2a08c024' from=172.16.86.215:7946"
ts=2023-06-03T21:37:21.568827254Z caller=log.go:194 level=warn msg="Got ping for unexpected node 'mimir-ingester-zone-a-0-2a08c024' from=172.16.119.135:7946"
ts=2023-06-03T21:37:23.108260172Z caller=log.go:194 level=info msg="Suspect mimir-ingester-zone-a-0-2a08c024 has failed, no acks received"
ts=2023-06-03T21:37:24.923539679Z caller=head.go:728 level=info user=anonymous msg="WAL checkpoint loaded"
ts=2023-06-03T21:37:25.603902136Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=166 maxSegment=246
ts=2023-06-03T21:37:26.287417665Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=167 maxSegment=246
ts=2023-06-03T21:37:27.021362334Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=168 maxSegment=246
ts=2023-06-03T21:37:27.716951701Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=169 maxSegment=246
ts=2023-06-03T21:37:28.480874939Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=170 maxSegment=246
ts=2023-06-03T21:37:29.184559413Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=171 maxSegment=246
ts=2023-06-03T21:37:30.104357757Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=172 maxSegment=246
ts=2023-06-03T21:37:31.631196134Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=173 maxSegment=246
ts=2023-06-03T21:37:33.097860242Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=174 maxSegment=246
ts=2023-06-03T21:37:33.602815194Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=175 maxSegment=246
ts=2023-06-03T21:37:34.260580977Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=176 maxSegment=246
ts=2023-06-03T21:37:34.910500285Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=177 maxSegment=246
ts=2023-06-03T21:37:35.546847321Z caller=head.go:763 level=info user=anonymous msg="WAL segment loaded" segment=178 maxSegment=246
ts=2023-06-03T21:37:35.55304165Z caller=log.go:194 level=info msg="Marking mimir-ingester-zone-a-0-2a08c024 as failed, suspect timeout reached (2 peer confirmations)"

Answered by dimitarvdimitrov

Jul 11, 2023

the joining memberlist clusters can happen because of IP reuse. One cluster will try to gossip to the old IP, but the IP will already be in use by the Loki/Tempo cluster. The Loki team has a blog post on how we've encountered it - see "Why memberlist labels matter" https://grafana.com/blog/2022/09/28/inside-the-migration-from-consul-to-memberlist-at-grafana-labs/

It also gives a brief overview of how to do the migration to have Loki/Tempo/Mimir have their own memberlist labels, so they "know" not to join the same memberlist cluster. The section is "Migration steps for using labels." The respective config options for Mimir are memberlist.cluster_label: "mimir" and memberlist.cluster_label_…

View full answer

abhinavDhulipala · 2023-06-03T22:32:49Z

abhinavDhulipala
Jun 3, 2023
Author

Turning off zone replication doesn't seem to help either

0 replies

aknuds1 · 2023-06-04T14:23:03Z

aknuds1
Jun 4, 2023
Maintainer

Converting this into a discussion, as there's no evidence of an actual bug yet.

0 replies

clouedoc · 2023-07-05T10:15:05Z

clouedoc
Jul 5, 2023

I get the same issue.

What I've tried

disabling availability zones
adding replicas for each service
deleting all pods to reset the memberlist ring - this worked

Tracking down the error

The error happens in Cortex:

https://github.com/cortexproject/cortex/blob/e2e5bcf64066890d483a4edf4cd7a3c562050d7a/pkg/ring/ring.go#L120

Here are related tests:

https://github.com/cortexproject/cortex/blob/master/pkg/ring/ring_test.go#L965-L1421

Fixing the issue by resetting the internal ring

The memberlist ring is stored in-memory.

Consequently, deleting all pods will delete the faulty ring.

Deleting all pods

Important: if some part of your infrastructure recreates the pods while you delete them (e.g. ArgoCD auto-sync), this procedure won't work as the faulty ring will be propagated again.

kubectl get deploy -l app.kubernetes.io/instance=mimir -n grafana | awk '{print $1}' | tail +2 | xargs kubectl scale --replicas=0 -n grafana deploye
kubectl get statefulset -l app.kubernetes.io/instance=mimir -n grafana | awk '{print $1}'  | tail +2 | xargs kubectl scale --replicas=0 -n grafana statefulset

Recreating all pods

If using ArgoCD/Flux/GitOps, just sync the app back (you also need to have disabled auto-sync for this procedure to work)
OR manually edit the commands above to have 1 replica of each.

0 replies

sinthetix · 2023-07-05T16:21:26Z

sinthetix
Jul 5, 2023

Having the same problem as well. Only thing able to fix it is deleting all pods to reset the memberlist ring as @clouedoc also states. However, within an hour, the ring is corrupted again.

2 replies

clouedoc Jul 6, 2023

Can you check your query-scheduler; does it gets a "shutdown notification received" log?
Or generally, can you dump your logs here?
I can check if I find common patterns that I noticed with mine

clouedoc Jul 6, 2023

To expand a bit further: the query-scheduler, in my case, printed logs of having received "shutdown notifications" from the mimir-querier.
My theory is also that k8s does not detect correctly when the query-scheduler has received a shutdown notification / the query-scheduler doesn't exist even though it should. (the pod was detected as healthy)

More generally; I wonder if there is a way to see the ring?
If so, you might be able to see the faulty node's IP, and trace back which workload caused the ring corruption.

dimitarvdimitrov · 2023-07-07T11:06:17Z

dimitarvdimitrov
Jul 7, 2023
Maintainer

@sinthetix @clouedoc @abhinavDhulipala have you tried checking the hash ring page for the ingesters? That page is exposed by the distributor pods (but not proxied by the nginx) on /ingester/ring and shows the state of the ring (API reference). It would be helpful to see that and see why the queriers think that there are too many ingesters.

5 replies

sinthetix Jul 10, 2023

For me, Tempo was joining the Mimir ring and Mimir was interpreting it as unhealthy instances in the ring. I inherited the system so I'm not entirely sure what went wrong. I worry it's a uniquely us problem as this is the closest issue I could find to the problem we were having. All the configurations look correct but Tempo keeps joining Mimir & Loki's rings. I'm all for suggestions but I realize it's probably out of this scope :)

clouedoc Jul 10, 2023

@sinthetix wow, what a crazy issue! This made me curious, so I looked, but it looks like I have 2 separate rings, 1 for Mimir, and 1 for Tempo; nothing unusual

sinthetix Jul 10, 2023

@clouedoc oh yes. They each have their own memberlist but Tempo was getting cozy and joining Mimir & Lokis. And the best part was, if I forgot a Tempo ingester in Mimir's ring, Tempo would join right back. But forget a Mimir ingester? it would respect. It has really stumped us. It would be a more interesting problem if it wasn't currently breaking our system. We currently have it solved by turning off Tempo 😅

dimitarvdimitrov Jul 11, 2023
Maintainer

the joining memberlist clusters can happen because of IP reuse. One cluster will try to gossip to the old IP, but the IP will already be in use by the Loki/Tempo cluster. The Loki team has a blog post on how we've encountered it - see "Why memberlist labels matter" https://grafana.com/blog/2022/09/28/inside-the-migration-from-consul-to-memberlist-at-grafana-labs/

It also gives a brief overview of how to do the migration to have Loki/Tempo/Mimir have their own memberlist labels, so they "know" not to join the same memberlist cluster. The section is "Migration steps for using labels." The respective config options for Mimir are memberlist.cluster_label: "mimir" and memberlist.cluster_label_verification_disabled: true.

We've been planning to make this the default in a major helm release, but never got the time to write up the migration doc (#2865)

Answer selected by abhinavDhulipala

sinthetix Jul 18, 2023

thank you!! this is a huge help and really appreciate the time. it just wasn't something that came up after copious google searches. :)

clouedoc · 2023-07-07T18:02:31Z

clouedoc
Jul 7, 2023

I didn't know about this, I should've known that Mimir had a wide API like this 👀 I'll try it next time I get the issue 👍 7 juil. 2023, 13:06 de ***@***.***:

…

@sinthetix <https://github.com/sinthetix>> > @clouedoc <https://github.com/clouedoc>> > @abhinavDhulipala <https://github.com/abhinavDhulipala>> have you tried checking the hash ring page for the ingesters? That page is exposed by the distributor pods (but not proxied by the nginx) on > /ingester/ring> and shows the state of the ring (> API reference <https://grafana.com/docs/mimir/latest/references/http-api/#ingesters-ring-status>> ). It would be helpful to see that and see why the queriers think that there are too many ingesters. — Reply to this email directly, > view it on GitHub <#5158 (comment)>> , or > unsubscribe <https://github.com/notifications/unsubscribe-auth/ADKG2SVPZ4AC3ARWG564KNTXO7UTHANCNFSM6AAAAAAZ6XOZOQ>> . You are receiving this because you were mentioned.> Message ID: > <grafana/mimir/repo-discussions/5158/comments/6383446> @> github> .> com>

0 replies

halvorstein · 2024-10-10T11:16:12Z

halvorstein
Oct 10, 2024

Hi, team.
This issue is still occur at our Mimir cluster.
Our query-frontend pods are giving the following logs:

ts=2024-10-10T11:09:57.371169741Z caller=spanlogger.go:86 user=anonymous level=error user=anonymous msg="error processing request" try=3 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: too many unhealthy instances in the ring\"}"

We've entered a few weeks ago the mentioned changes in the values.yaml of the Helm chart but without any luck.
Would love to get a push in the right direction.

1 reply

davidgiga1993s Oct 11, 2024

We're facing the same issue with a brand new minimal setup mimir cluster. All members in the list area alive

Update: Turns out you need at least 2 ingester instances. Thinking about it it makes sense, but the error was a bit unspecific

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

expanding series: too many unhealthy instances in the ring #5158

{{title}}

Replies: 7 comments 8 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

expanding series: too many unhealthy instances in the ring #5158

Describe the bug

To Reproduce

Expected behavior

Environment

Additional Context

Replies: 7 comments · 8 replies

abhinavDhulipala Jun 3, 2023 Author

aknuds1 Jun 4, 2023 Maintainer

What I've tried

Tracking down the error

Fixing the issue by resetting the internal ring

Deleting all pods

Recreating all pods

dimitarvdimitrov Jul 7, 2023 Maintainer

dimitarvdimitrov Jul 11, 2023 Maintainer

Replies: 7 comments 8 replies

abhinavDhulipala
Jun 3, 2023
Author

aknuds1
Jun 4, 2023
Maintainer

dimitarvdimitrov
Jul 7, 2023
Maintainer

dimitarvdimitrov Jul 11, 2023
Maintainer