Nodes get lost during view change (infinite loop) #1720

Open
lynnbendixsen opened this issue Dec 15, 2021 · 2 comments

Comments

@lynnbendixsen
Contributor

I am unable to determine the cause yet, but I think I have seen the same thing happen on two different networks in the last week. Here is what I think happens (more details may follow once I know more):

Symptom: All nodes go "Out of consensus"

  1. In the logs, a view change is requested near the beginning of the problems.
  2. The logs show view change requests arriving at a rapid pace, with the view number climbing rapidly but no actual view change completing (or sometimes completing, but not as fast as the numbers go up); a toy model of this runaway behaviour follows the list.
  3. The logs fill up rapidly with view change requests.
  4. It looks like an infinite loop (the logs filled rapidly with view change requests for at least an hour and a half).
  5. A restart of the network fixed one of the networks, but not the other.

Logs are available on request (48 MB).
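
To make items 2 and 4 concrete, here is a small, purely illustrative Python model (not code from indy-plenum) of how proposed view numbers can climb without a view change ever completing. It assumes a Plenum-style rule in which a node that times out proposes a higher view, and the view only actually changes once a quorum (n - f) of nodes propose the same view number; the staggered-timeout/leapfrog behaviour is an assumption made for the sake of the illustration.

```python
# Toy model (assumption-laden sketch, not indy-plenum code) of the reported
# symptom: view numbers climb rapidly while no real view change completes.
from collections import Counter

N_NODES = 4
F = (N_NODES - 1) // 3            # tolerated faulty nodes
QUORUM = N_NODES - F              # matching proposals needed to change view

proposed = [0] * N_NODES          # each node's latest proposed view number
current_view = 0

for tick in range(12):            # each tick = one staggered view-change timeout
    node = tick % N_NODES
    # Assumption: a timed-out node proposes one past the highest view it has
    # seen, so proposals leapfrog each other and drift apart.
    proposed[node] = max(proposed) + 1

    view, votes = Counter(proposed).most_common(1)[0]
    if votes >= QUORUM and view > current_view:
        current_view = view       # only here would a real view change complete
    print(f"tick {tick:2d}: proposals={proposed} current_view={current_view}")
```

In this toy run no quorum of matching proposals ever forms, so current_view never advances while the proposed numbers keep rising, which matches the log pattern described above.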
@WadeBarnes
Member

@lynnbendixsen, is this still an issue? Any further insight?

@lynnbendixsen
Contributor Author

"Rapid view change requests" still happens regularly on my networks when any other issue pops up, it seems, and I am pretty sure there's a bug related to view changes that needs addressed. Nodes seem to continue to make new view change requests of increasing view change numbers in an "out of sync" manner when it gets into this "view change loop" (2 nodes asking for a view change, 1 asking for 110, another for 111, for example).
This might be reproducible in the case where the primary and the "next primary" are both out of consensus or go down at the same time, or in the case I saw where a node that had not yet "caught up" was assigned to be the primary. (In other words, those situations may have triggered the problem, but that does not mean there isn't still a bug in how they are handled when they occur.)
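
For reference, here is a hedged sketch (again, not the actual indy-plenum code) of the kind of quorum check that typically has to pass before a view change completes. With n = 4 and f = 1, off-by-one proposals like the 110/111 split above never reach the required n - f = 3 matching votes, so the view change stalls while the numbers keep climbing.

```python
# Hedged sketch of a Plenum-style view-change quorum check; the names are
# illustrative, not taken from the indy-plenum codebase.
from collections import Counter
from typing import Dict, Optional

def view_change_quorum(proposals: Dict[str, int], quorum: int) -> Optional[int]:
    """Return the proposed view number with quorum support, or None."""
    view, votes = Counter(proposals.values()).most_common(1)[0]
    return view if votes >= quorum else None

# 4-node network, f = 1, so 3 matching proposals are required.
print(view_change_quorum({"A": 110, "B": 110, "C": 111, "D": 109}, quorum=3))  # None
print(view_change_quorum({"A": 112, "B": 112, "C": 112, "D": 109}, quorum=3))  # 112
```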
Possibly a separate issue:
I don't think I have written up a separate issue for this yet, but it seems a newly added node is regularly assigned to be "the next primary node". So if a view change happens before the new node has caught up, and the new node is then restarted before it finishes catching up, the network might never recover. Maybe a new node should be ordered at the end of the "current" list of primaries? Or maybe, before switching to a new primary, there should be a quick check that the candidate has caught up? Right now, I think the switch is followed by another quick switch, but in a case where the network already has other issues, a view change to a node that is not yet caught up can cause a "split brain" situation (about half the nodes think one node is the primary, and the rest think another is).
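
Below is a minimal sketch of the mitigation suggested above, assuming a round-robin primary selection of the form primary = nodes[view_no % n] (as Plenum-style protocols typically use) and a hypothetical is_caught_up predicate: skip any candidate primary that has not finished catch-up. In a real implementation the predicate would have to be derived from state all nodes agree on, otherwise nodes could pick different primaries and recreate the split-brain problem described above.

```python
# Sketch of "check the candidate has caught up before switching primaries".
# select_primary and is_caught_up are hypothetical, not indy-plenum APIs.
from typing import Callable, Optional, Sequence

def select_primary(nodes: Sequence[str],
                   view_no: int,
                   is_caught_up: Callable[[str], bool]) -> Optional[str]:
    """Round-robin primary selection that skips nodes still catching up."""
    for offset in range(len(nodes)):
        candidate = nodes[(view_no + offset) % len(nodes)]
        if is_caught_up(candidate):
            return candidate
    return None  # no eligible primary; keep the current view instead

# Example: the most recently added node ("Node5") has not caught up yet,
# so a view change that would have landed on it moves on to the next node.
nodes = ["Node1", "Node2", "Node3", "Node4", "Node5"]
caught_up = {"Node1", "Node2", "Node3", "Node4"}
print(select_primary(nodes, view_no=4, is_caught_up=lambda n: n in caught_up))  # Node1
```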
