Post mortem of a broken head #1374

Closed
2 tasks done
ch1bo opened this issue Mar 27, 2024 · 1 comment
Labels
bug 🐛 Something isn't working

Comments

@ch1bo
Collaborator

ch1bo commented Mar 27, 2024

Situation

  • Franco changed his machine instance type on Monday morning, 2024-03-25, which changed his IP address
  • The others updated their node configs with the new IP address throughout 2024-03-25 and 2024-03-26
  • Today, 2024-03-26, the hydraw did not work

Observations

  • Latest confirmed snapshot is 228
  • The latest SnapshotRequested stored in the state of Franco's node is 228, while the others have a SnapshotRequested for 229
  • Judging from the sent network-messages, Arnaud was the snapshot leader for 229
  • Franco receives the ReqSn from Arnaud, but it results in WaitOnTxs because 523c82b607eb5e449de5cec68f0a40501cf6be19db5bc9f246bcd93502d41cd8 is missing
  • Sasha sends most of the ReqTx messages in this setup because he hosts the hydraw instance
  • Ack counters:
    • Franco: [276,608,275,276,278]
    • Sasha: [276,615,275,276,278]
    • Dan: [276,615,275,276,278]
    • Sebastian: [276,615,275,276,278]
  • Sasha's network-messages has only 607 lines
  • Sasha's logs contain ReliabilityFailedToFindMsg
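
The ack counters and the line count above can be cross-checked mechanically. The following is a minimal sketch, not Hydra tooling: the helper names `lagging` and `unrecoverable` are hypothetical, and it assumes that index 1 of each vector is Sasha's slot and that one line in network-messages corresponds to one sent message.

```python
# Ack vectors copied from the observations above.
# Assumption: every node lists the peers in the same order.
acks = {
    "Franco": [276, 608, 275, 276, 278],
    "Sasha": [276, 615, 275, 276, 278],
    "Dan": [276, 615, 275, 276, 278],
    "Sebastian": [276, 615, 275, 276, 278],
}


def lagging(vectors):
    """Return {(node, index): gap} for every entry below the per-index maximum."""
    width = len(next(iter(vectors.values())))
    highest = [max(v[i] for v in vectors.values()) for i in range(width)]
    return {
        (node, i): highest[i] - v[i]
        for node, v in vectors.items()
        for i in range(width)
        if v[i] < highest[i]
    }


def unrecoverable(highest_sent, persisted_lines):
    """Number of sent messages that are absent from the persisted log."""
    return max(0, highest_sent - persisted_lines)


gaps = lagging(acks)               # which node is behind, and at which index
missing = unrecoverable(615, 607)  # counter value vs. lines in network-messages
```

Under these assumptions, only Franco's entry at index 1 lags, and the persisted log is shorter than the counter the other nodes report, which is consistent with the ReliabilityFailedToFindMsg entries.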

What happened?

Identify root causes and address in this or follow-up items:

  • Sasha's ReqTx for tx id 523c82b607eb5e449de5cec68f0a40501cf6be19db5bc9f246bcd93502d41cd8 never reached Franco, and Sasha's "Reliability" persistence failed?
    • We could manually "fix" this part of the problem by duplicating the last 8 messages in Sasha's network-messages, which would have them resent to Franco's node.
  • The ReqSn sent from Arnaud did reach Franco, whose node started to WaitOnTxs, but restarting Franco's node made that ReqSn disappear, and it was never resent (it had already been acknowledged through the network acks). Consequently, the protocol is stuck: it assumes that a node keeps the messages it has received once they are acknowledged. In fact, the head logic re-enqueues inputs it cannot act on, and this queue is ephemeral!
    • If the ReqSn had included the transactions it snapshots (which it once did), this would be less of a problem.
    • Is this a combined problem of optimizing snapshot requests vs. having a coordinated protocol?
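
The second root cause can be reproduced in a toy model. This is a sketch under stated assumptions, not Hydra's head logic: `ToyNode`, its fields, and `run` are hypothetical names, acknowledgement is modelled as a persisted counter, and the tx id is shortened from the one in the report.

```python
TX = "523c82b6"  # shortened form of the tx id from the report


class ToyNode:
    """Acknowledges every message on receipt, but parks unprocessable ones in memory only."""

    def __init__(self):
        self.acked = 0         # assumed persisted: the sender will not resend acked messages
        self.seen_txs = set()  # assumed persisted
        self.confirmed = []    # assumed persisted
        self.waiting = []      # ephemeral re-enqueue buffer

    def receive(self, msg):
        self.acked += 1
        self._handle(msg)
        # Retry everything that was parked earlier.
        parked, self.waiting = self.waiting, []
        for m in parked:
            self._handle(m)

    def _handle(self, msg):
        kind, payload = msg
        if kind == "ReqTx":
            self.seen_txs.add(payload)
        elif kind == "ReqSn":
            number, tx_ids = payload
            if set(tx_ids) <= self.seen_txs:
                self.confirmed.append(number)
            else:
                self.waiting.append(msg)  # the WaitOnTxs case: wait for the missing tx

    def restart(self):
        self.waiting = []  # in-memory queue is lost; persisted fields survive


def run(restart):
    node = ToyNode()
    node.receive(("ReqSn", (229, [TX])))  # arrives before the transaction it references
    if restart:
        node.restart()
    node.receive(("ReqTx", TX))           # the missing transaction finally arrives
    return node.confirmed
```

Without a restart the parked ReqSn is retried once the ReqTx arrives and snapshot 229 goes through; with a restart in between, the ReqSn is gone although it was acknowledged, so nothing ever triggers the snapshot again.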
@ch1bo ch1bo changed the title Post mortem of the broken head Post mortem of a broken head Mar 27, 2024
@ch1bo ch1bo added the bug 🐛 Something isn't working label Mar 27, 2024
@locallycompact
Contributor

Closing in favour of #1436
