Post mortem of a broken head #1374

ch1bo · 2024-03-27T09:15:37Z

Situation

Franco changed his machine instance type on Monday morning, 2024-03-25, which changed his IP address
The others updated their node configs with the new IP address throughout 2024-03-25 and 2024-03-26
Today, 2024-03-26, the hydraw did not work

Latest confirmed snapshot is 228
The latest SnapshotRequested stored in the state of Franco's node is 228, while the other nodes have a SnapshotRequested for 229
Judging from the sent network-messages, Arnaud was the snapshot leader for 229
Franco's node receives the ReqSn from Arnaud, but it results in WaitOnTxs because transaction 523c82b607eb5e449de5cec68f0a40501cf6be19db5bc9f246bcd93502d41cd8 is missing
Sasha is sending most ReqTx in this setup as he is hosting the hydraw instance
Ack counters:
- Franco: [276,608,275,276,278]
- Sasha: [276,615,275,276,278]
- Dan: [276,615,275,276,278]
- Sebastian: [276,615,275,276,278]
Sasha's network-messages only has 607 lines
Sasha's logs contain ReliabilityFailedToFindMsg
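
To make the counter comparison above easier to follow, here is a minimal sketch in Python that recomputes the gap from the numbers reported in this issue. It assumes that position i of each ack vector counts the messages acknowledged from party i, and that counters and lines in network-messages correspond one to one. Neither is confirmed here, and the helper name `lagging_entries` is hypothetical.

```python
# Ack vectors as reported above, one per node.
# Assumption: position i counts messages acknowledged from party i.
acks = {
    "Franco":    [276, 608, 275, 276, 278],
    "Sasha":     [276, 615, 275, 276, 278],
    "Dan":       [276, 615, 275, 276, 278],
    "Sebastian": [276, 615, 275, 276, 278],
}

def lagging_entries(acks):
    """Return (node, party_index, seen, highest) for every counter below the maximum."""
    width = len(next(iter(acks.values())))
    # Highest counter observed for each party across all nodes.
    highest = [max(v[i] for v in acks.values()) for i in range(width)]
    return [
        (node, i, v[i], highest[i])
        for node, v in acks.items()
        for i in range(width)
        if v[i] < highest[i]
    ]

gaps = lagging_entries(acks)

# Number of lines in Sasha's persisted network-messages, per the report.
persisted_lines = 607

# Assumption: one persisted line per counted message.
missing_from_disk = max(h for _, _, _, h in gaps) - persisted_lines

print(gaps)
print(missing_from_disk)
```

Under these assumptions only Franco's counter for the second party lags (608 against 615), and the persisted file is 8 lines short of the highest counter. That would be consistent with both the ReliabilityFailedToFindMsg entries and the 8 messages mentioned in the workaround below, although the issue does not state how that number was derived.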

Identify root causes and address them in this or follow-up items:

Sasha's ReqTx for tx id 523c82b607eb5e449de5cec68f0a40501cf6be19db5bc9f246bcd93502d41cd8 never reached Franco, and his "Reliability" persistence failed?
- We could manually "fix" this part of the problem by duplicating the last 8 messages in Sasha's network-messages, which would have them resent to Franco's node.
The ReqSn sent from Arnaud did reach Franco's node, which started to WaitOnTxs, but restarting Franco's node made that ReqSn disappear, and it was never resent because it had already been acknowledged through the network acks. Consequently, the protocol is stuck because it assumes a node keeps the messages it has received once they are acknowledged. In fact, the head logic re-enqueues inputs which it can't act on, and this queue is ephemeral!
- If the ReqSn included the transactions it snapshots (which it once did), this would be less of a problem.
- Is this a combined problem of optimizing snapshot requests vs. having a coordinated protocol?
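
The second root cause can be illustrated with a toy model. The sketch below is not the hydra-node implementation. It assumes a node that acknowledges every message on receipt, keeps the transactions it has seen across restarts, and holds inputs it cannot act on in an in-memory queue only. The class `ToyNode`, its fields, and the transaction id `tx-missing` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ToyNode:
    """Toy model: acks and seen transactions persist, the wait queue does not."""
    seen_txs: set = field(default_factory=set)      # assumed to survive a restart
    acked: list = field(default_factory=list)       # assumed to survive a restart
    wait_queue: list = field(default_factory=list)  # in memory only
    snapshots: list = field(default_factory=list)   # snapshot requests acted on

    def receive(self, msg):
        # The network layer acknowledges on receipt, so the sender
        # will never resend this message.
        self.acked.append(msg)
        self.wait_queue.append(msg)
        self._drain()

    def _drain(self):
        # Retry queued inputs until no further progress is possible.
        progress = True
        while progress:
            progress = False
            for msg in list(self.wait_queue):
                if self._try_apply(msg):
                    self.wait_queue.remove(msg)
                    progress = True

    def _try_apply(self, msg):
        if msg[0] == "ReqTx":
            self.seen_txs.add(msg[1])
            return True
        if msg[0] == "ReqSn":
            _, number, tx_ids = msg
            if set(tx_ids) <= self.seen_txs:
                self.snapshots.append(number)
                return True
            return False  # corresponds to WaitOnTxs: stay in the queue
        raise ValueError(f"unknown message {msg!r}")

    def restart(self):
        # The queue of re-enqueued inputs is ephemeral and is lost here.
        self.wait_queue = []

def run(restart_in_between):
    node = ToyNode()
    node.receive(("ReqSn", 229, ["tx-missing"]))  # arrives before the tx
    if restart_in_between:
        node.restart()
    node.receive(("ReqTx", "tx-missing"))         # tx arrives late
    return node.snapshots

print(run(restart_in_between=False))
print(run(restart_in_between=True))
```

Without a restart the late ReqTx unblocks the queued ReqSn and snapshot 229 is processed. With a restart in between, the ReqSn is gone from the queue yet still counts as acknowledged, so in this model nothing ever triggers snapshot 229 again, even if the missing ReqTx is delivered afterwards.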

locallycompact · 2024-05-16T11:37:13Z

Closing in favour of #1436

ch1bo changed the title ~~Post mortem of the broken head~~ Post mortem of a broken head Mar 27, 2024

ch1bo added the bug 🐛 label Mar 27, 2024

ch1bo mentioned this issue May 7, 2024

Diagnose currently stuck head / spike to fix our head #1415

Closed

locallycompact mentioned this issue May 16, 2024

Stress test the network reliability #1436

Closed

locallycompact closed this as completed May 16, 2024