Nodes get lost during view change (infinite loop) #1720

Open
lynnbendixsen opened this issue Dec 15, 2021 · 2 comments

Comments

@lynnbendixsen
Contributor

I am unable to determine the cause yet, but I think I have seen the same thing happen on two different networks in the last week. Here is what I think happens (more details may follow once I know more):

Symptom: All nodes go "Out of consensus"

  1. In the logs, a view change is requested near the beginning of the problems.
  2. The logs show view change requests arriving at a rapid pace, with the view number climbing rapidly but no actual view change completing (or sometimes completing, but not as fast as the numbers go up); a toy model of this runaway behaviour follows the list.
  3. The logs fill up rapidly with view change requests.
  4. It looks like an infinite loop (the logs filled rapidly with view change requests for at least an hour and a half).
  5. A restart of the network fixed one of the networks, but not the other.

Logs are available on request (48 MB).
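
To make items 2 and 4 concrete, here is a small, purely illustrative Python model (not code from indy-plenum) of how proposed view numbers can climb without a view change ever completing. It assumes a Plenum-style rule in which a node that times out proposes a higher view, and the view only actually changes once a quorum (n - f) of nodes propose the same view number; the staggered-timeout/leapfrog behaviour is an assumption made for the sake of the illustration.

```python
# Toy model (assumption-laden sketch, not indy-plenum code) of the reported
# symptom: view numbers climb rapidly while no real view change completes.
from collections import Counter

N_NODES = 4
F = (N_NODES - 1) // 3            # tolerated faulty nodes
QUORUM = N_NODES - F              # matching proposals needed to change view

proposed = [0] * N_NODES          # each node's latest proposed view number
current_view = 0

for tick in range(12):            # each tick = one staggered view-change timeout
    node = tick % N_NODES
    # Assumption: a timed-out node proposes one past the highest view it has
    # seen, so proposals leapfrog each other and drift apart.
    proposed[node] = max(proposed) + 1

    view, votes = Counter(proposed).most_common(1)[0]
    if votes >= QUORUM and view > current_view:
        current_view = view       # only here would a real view change complete
    print(f"tick {tick:2d}: proposals={proposed} current_view={current_view}")
```

In this toy run no quorum of matching proposals ever forms, so current_view never advances while the proposed numbers keep rising, which matches the log pattern described above.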
@WadeBarnes
Member

@lynnbendixsen, is this still an issue? Any further insight?

@lynnbendixsen
Contributor Author

"Rapid view change requests" still happens regularly on my networks when any other issue pops up, it seems, and I am pretty sure there's a bug related to view changes that needs addressed. Nodes seem to continue to make new view change requests of increasing view change numbers in an "out of sync" manner when it gets into this "view change loop" (2 nodes asking for a view change, 1 asking for 110, another for 111, for example).
This might be reproducible in the case where the primary and the "next primary" are both out of consensus or go down at the same time, or in the case I saw where a node that had not yet "caught up" was assigned to be the primary. (In other words, those situations may have triggered the problem, but that does not mean there isn't still a bug in how they are handled when they occur.)
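
For reference, here is a hedged sketch (again, not the actual indy-plenum code) of the kind of quorum check that typically has to pass before a view change completes. With n = 4 and f = 1, off-by-one proposals like the 110/111 split above never reach the required n - f = 3 matching votes, so the view change stalls while the numbers keep climbing.

```python
# Hedged sketch of a Plenum-style view-change quorum check; the names are
# illustrative, not taken from the indy-plenum codebase.
from collections import Counter
from typing import Dict, Optional

def view_change_quorum(proposals: Dict[str, int], quorum: int) -> Optional[int]:
    """Return the proposed view number with quorum support, or None."""
    view, votes = Counter(proposals.values()).most_common(1)[0]
    return view if votes >= quorum else None

# 4-node network, f = 1, so 3 matching proposals are required.
print(view_change_quorum({"A": 110, "B": 110, "C": 111, "D": 109}, quorum=3))  # None
print(view_change_quorum({"A": 112, "B": 112, "C": 112, "D": 109}, quorum=3))  # 112
```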
Possibly a separate issue:
I don't think I have written up a separate issue for this yet, but it seems a newly added node is regularly assigned to be "the next primary node". So if a view change happens before the new node has caught up, and the new node is then restarted before it finishes catching up, the network might never recover. Maybe a new node should be ordered at the end of the "current" list of primaries? Or maybe, before switching to a new primary, there should be a quick check that the candidate has caught up? Right now, I think the switch is followed by another quick switch, but in a case where the network already has other issues, a view change to a node that is not yet caught up can cause a "split brain" situation (about half the nodes think one node is the primary, and the rest think another is).
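
Below is a minimal sketch of the mitigation suggested above, assuming a round-robin primary selection of the form primary = nodes[view_no % n] (as Plenum-style protocols typically use) and a hypothetical is_caught_up predicate: skip any candidate primary that has not finished catch-up. In a real implementation the predicate would have to be derived from state all nodes agree on, otherwise nodes could pick different primaries and recreate the split-brain problem described above.

```python
# Sketch of "check the candidate has caught up before switching primaries".
# select_primary and is_caught_up are hypothetical, not indy-plenum APIs.
from typing import Callable, Optional, Sequence

def select_primary(nodes: Sequence[str],
                   view_no: int,
                   is_caught_up: Callable[[str], bool]) -> Optional[str]:
    """Round-robin primary selection that skips nodes still catching up."""
    for offset in range(len(nodes)):
        candidate = nodes[(view_no + offset) % len(nodes)]
        if is_caught_up(candidate):
            return candidate
    return None  # no eligible primary; keep the current view instead

# Example: the most recently added node ("Node5") has not caught up yet,
# so a view change that would have landed on it moves on to the next node.
nodes = ["Node1", "Node2", "Node3", "Node4", "Node5"]
caught_up = {"Node1", "Node2", "Node3", "Node4"}
print(select_primary(nodes, view_no=4, is_caught_up=lambda n: n in caught_up))  # Node1
```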
