Hey everyone, I'm asking for help investigating an apphash mismatch we have seen in one of our internal testnets.

Our testnet was composed of two nodes sharing the same Horcrux remote signer (i.e. there was only one validator).
One of the two nodes halted due to an apphash mismatch; the chain itself didn't halt, because the other node continued acting as the validator. By the time we noticed, the pruning settings had already wiped the relevant block from the node that kept running, so I wasn't able to investigate it properly.
The interesting thing I noticed in the `application.db` of the node that halted is that an entire module's storage was completely wiped. At the previous block height the keys were there; at the next, there were none:
```
❯ go run . data ~/Downloads/chiado-1-dump.db "s/k:act/" 17698
Got version: 19520
Printing all keys with hashed values (to detect diff)
000000000000000001
1F878D20753082B2905FCFE18F98D3B7D4E8C866EF7A1781D7BC9D455767A11E
... [omitted for brevity]
action/count
7A42E3892368F826928202014A6CA95A3D8D846DF25088DA80018663EDF96B1C
Hash: 389087F05766BD6552DF3FC40AACD8A1B5A9FC16927CBCFFAA5EC73D79128A35
Size: 34

❯ go run . data ~/Downloads/chiado-1-dump.db "s/k:act/" 17699
Got version: 19520
Printing all keys with hashed values (to detect diff)
Hash: E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
Size: 0
```
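Incidentally, the hash printed for height 17699 is easy to recognize: it is the SHA-256 digest of empty input, which confirms the dump tool hashed literally zero key/value bytes under that prefix (assuming, as the output suggests, that it renders plain uppercase-hex SHA-256 digests). A quick stdlib check:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// emptyStoreHash returns the SHA-256 of zero bytes, uppercase-hex encoded,
// the same rendering the dump tool appears to use for its digests.
func emptyStoreHash() string {
	sum := sha256.Sum256(nil)
	return fmt.Sprintf("%X", sum[:])
}

func main() {
	fmt.Println(emptyStoreHash())
	// → E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
}
```

This matches the `Hash` line for 17699 exactly, so the empty result is not a rendering glitch: the store really had no keys at that height.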
Both nodes had an interesting line in the logs just before this error occurred:
```
12:09PM ERR iavl set error error="Value missing for key [0 0 0 0 0 0 0 1 0 0 0 1] corresponding to nodeKey 73000000000000000100000001" module=server
```
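For what it's worth, that nodeKey seems to decode to version 1, nonce 1 — one of the very first nodes ever written for the store, which would be exactly the kind of old node pruning is supposed to be deleting. A small sketch of the decoding (the 1-byte prefix + 8-byte big-endian version + 4-byte nonce layout is my assumption based on iavl v1's node key format; worth double-checking against the exact iavl version):

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// decodeNodeKey splits the hex nodeKey from the error log into its parts,
// assuming a 1-byte prefix + 8-byte big-endian version + 4-byte nonce layout.
func decodeNodeKey(s string) (prefix byte, version uint64, nonce uint32) {
	raw, err := hex.DecodeString(s)
	if err != nil || len(raw) != 13 {
		panic("unexpected nodeKey length")
	}
	return raw[0], binary.BigEndian.Uint64(raw[1:9]), binary.BigEndian.Uint32(raw[9:13])
}

func main() {
	p, v, n := decodeNodeKey("73000000000000000100000001")
	fmt.Printf("prefix=%c version=%d nonce=%d\n", p, v, n)
	// → prefix=s version=1 nonce=1
}
```

The `[0 0 0 0 0 0 0 1 0 0 0 1]` byte slice in the error message is the same version=1/nonce=1 pair without the prefix byte.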
At this point, one thing I noticed is that the node wasn't shutting down properly on SIGTERM/SIGINT: it was panicking right before closing the databases. So my initial thought was that the db had been corrupted by that.
We fixed the bug causing the panic and restarted a new chain from scratch using the fixed version, with the same setup.
The issue happened again, and the same module's data was gone, just like the first time (weird coincidence?). An identical log entry appeared on both nodes before the apphash mismatch:
```
2024-10-07T01:09:22.79390127Z stdout F 1:09AM ERR iavl set error error="Value missing for key [0 0 0 0 0 0 0 1 0 0 0 1] corresponding to nodeKey 73000000000000000100000001" module=server
```
We disabled pruning and ran yet another chain from scratch; the issue hasn't happened again, and it's been a week now. Could the issue be related to pruning?
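For reference, these are the `app.toml` pruning knobs involved (the values below are illustrative, not our exact settings); the third chain ran with pruning fully disabled:

```toml
# app.toml — illustrative values, not our exact config
pruning = "custom"
pruning-keep-recent = "100"
pruning-interval = "10"

# third chain: pruning disabled entirely
# pruning = "nothing"
```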
FWIW we've been using the same IAVL version for months in our public testnet, with the exact same pruning parameters, without any issues.
If anyone has any idea on how to further investigate the root cause, I'd be more than happy to hear it. Thanks!