combined signature is not valid for cosmos-sdk v0.50 #255

Open
chillyvee opened this issue Mar 5, 2024 · 15 comments

Comments

@chillyvee
Contributor

chillyvee commented Mar 5, 2024

On the seda testnet, a 2/3 horcrux cluster sometimes hits this error for hundreds of blocks in a row:
combined signature is not valid

A single-signer horcrux does not seem to have the issue.

A 2/3 horcrux cluster sometimes does. Restarting 2 of the 3 cosigners appears to resolve it.

Sometimes the issue occurs after restarting the chain binary and reconnecting to a working 2/3 horcrux cluster.

Is there a known resolution waiting to be implemented? We can try to assist.

commit 54beead80f63c7c8407674d112363cb752a13c61 (HEAD)

cosigner1 - message set repeats

Nonce cache is empty, triggering reload
Loading additional nonces to meet demand     target=2 remaining=0 additional=2 nonces_per_min=90.27712309797253 avg_nonces_per_min=27.380684601388218
Loaded nonces                                desired=2 added=2
Cosigner nonce cache ahead of demand         target=2 remaining=2 nonces_per_min=0 avg_nonces_per_min=27.379249357780424
Cosigner nonce cache ahead of demand         target=1 remaining=2 nonces_per_min=0 avg_nonces_per_min=12.85992754860993
I am the leader. Managing the sign process for this block chain_id=seda-1-testnet height=129343 round=0 type=prevote
I am the leader. Managing the sign process for this block chain_id=seda-1-testnet height=129343 round=0 type=precommit

cosigner2 - message set repeats

Failed to sign                               type=prevote chain_id=seda-1-testnet height=129338 round=0 error="rpc error: code = Unknown desc = combined signature is not valid"
I am not the leader. Proxying request to the leader chain_id=seda-1-testnet height=129338 round=0 step=3

cosigner 3 - similar messages repeat

Signed with shard                            chain_id=seda-1-testnet height=129329 round=0 step=2
Signed with shard                            chain_id=seda-1-testnet height=129329 round=0 step=3
Signed with shard                            chain_id=seda-1-testnet height=129330 round=0 step=2
Signed with shard                            chain_id=seda-1-testnet height=129330 round=0 step=3
Signed with shard                            chain_id=seda-1-testnet height=129331 round=0 step=2
Signed with shard                            chain_id=seda-1-testnet height=129331 round=0 step=3
Signed with shard                            chain_id=seda-1-testnet height=129332 round=0 step=2

messages are similar when signing works and when signing fails (no major difference)
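Because the logs look alike in both states, one way to compare working and failing periods is to tally the failures per height rather than read the lines by eye. The following is a minimal Go sketch, assuming the key=value log format shown above. `parseFields` and `countInvalidCombined` are hypothetical helper names, not part of horcrux, and escaped quotes inside quoted values are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFields extracts key=value pairs from one log line.
// Quoted values (error="...") are kept whole, with the quotes stripped.
func parseFields(line string) map[string]string {
	fields := make(map[string]string)
	for len(line) > 0 {
		eq := strings.IndexByte(line, '=')
		if eq < 0 {
			break
		}
		// The key is the last whitespace-delimited token before '='.
		keyStart := strings.LastIndexAny(line[:eq], " \t") + 1
		key := line[keyStart:eq]
		rest := line[eq+1:]
		var val string
		if strings.HasPrefix(rest, `"`) {
			end := strings.IndexByte(rest[1:], '"')
			if end < 0 {
				val, rest = rest[1:], ""
			} else {
				val, rest = rest[1:1+end], rest[end+2:]
			}
		} else {
			end := strings.IndexAny(rest, " \t")
			if end < 0 {
				val, rest = rest, ""
			} else {
				val, rest = rest[:end], rest[end:]
			}
		}
		if key != "" {
			fields[key] = val
		}
		line = rest
	}
	return fields
}

// countInvalidCombined tallies, per height, the lines whose error field
// reports an invalid combined signature.
func countInvalidCombined(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, l := range lines {
		f := parseFields(l)
		if strings.Contains(f["error"], "combined signature is not valid") {
			counts[f["height"]]++
		}
	}
	return counts
}

func main() {
	// Sample lines copied from the cosigner logs in this issue.
	lines := []string{
		`Failed to sign  type=prevote chain_id=seda-1-testnet height=129338 round=0 error="rpc error: code = Unknown desc = combined signature is not valid"`,
		`Signed with shard  chain_id=seda-1-testnet height=129329 round=0 step=2`,
	}
	for height, n := range countInvalidCombined(lines) {
		fmt.Printf("height=%s failures=%d\n", height, n)
	}
}
```

Running the same tally over each cosigner's log makes it easier to see whether the failing heights line up across the cluster.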

@jasperdg

jasperdg commented Mar 6, 2024

+1

@nitronit
Contributor

nitronit commented Mar 7, 2024

@chillyvee
Ah, you are running the latest commit on main. Have you tried the pre-release below?
https://github.com/strangelove-ventures/horcrux/releases/tag/v3.3.0-rc1
There shouldn't be any difference, though.

Most likely it's something else, like it's not syncing or communication isn't working properly.

What's your setup in terms of sentries/nodes, and what is the connection between the nodes and the signer nodes (one-to-many or one-to-one)?

@chillyvee
Contributor Author

v3.3.0-rc1 is just 2 commits behind main head. You are right, it shouldn't make any difference.

Setup is:

A single sedad node plus all 3 horcrux cosigners on the same hardware, on ports 1000, 1001, 1002. No external network dependency (one chain binary to many horcrux cosigners).

@chillyvee
Contributor Author

Resetting to v3.3.0-rc1 and will report back. There are slight differences, but none that look significant:

-       github.com/cometbft/cometbft v0.38.2
+       github.com/cometbft/cometbft v0.38.0
-       github.com/opencontainers/runc v1.1.12 // indirect
+       github.com/opencontainers/runc v1.1.10 // indirect

@chillyvee
Contributor Author

Can you give any suggestions regarding any expected signing values to check?

@chillyvee
Contributor Author

chillyvee commented Mar 7, 2024

Missed many blocks with v3.3.0-rc1, then suddenly recovered without any restarts.
One cosigner's log looks different once signing recovered.

Missed many blocks

Nonce cache is empty, triggering reload      
Loading additional nonces to meet demand     target=2 remaining=0 additional=2 nonces_per_min=90.12627123205712 avg_nonces_per_min=27.138295536221573
Loaded nonces                                desired=2 added=2
Cosigner nonce cache ahead of demand         target=2 remaining=2 nonces_per_min=0 avg_nonces_per_min=27.137969282502716
Cosigner nonce cache ahead of demand         target=1 remaining=2 nonces_per_min=0 avg_nonces_per_min=12.857157584098498
I am the leader. Managing the sign process for this block chain_id=seda-1-testnet height=151873 round=0 type=prevote
I am the leader. Managing the sign process for this block chain_id=seda-1-testnet height=151873 round=0 type=precommit
Nonce cache is empty, triggering reload      
Loading additional nonces to meet demand     target=2 remaining=0 additional=2 nonces_per_min=88.78479168013699 avg_nonces_per_min=27.63585550722192
Loaded nonces                                desired=2 added=2
Cosigner nonce cache ahead of demand         target=2 remaining=2 nonces_per_min=0 avg_nonces_per_min=27.62958040256905

Stopped missing blocks

Signed with shard                            chain_id=seda-1-testnet height=151874 round=0 step=2
Signed with shard                            chain_id=seda-1-testnet height=151874 round=0 step=3
Signed with shard                            chain_id=seda-1-testnet height=151875 round=0 step=2

@chillyvee
Contributor Author

Different but possibly related issue

cosigner 1: blocks are missed on sentry switch

Signed with shard                            chain_id=seda-1-testnet height=152002 round=0 step=3
Connected to Sentry                          address=tcp://127.0.0.1:1234

<-- miss start
I am not the leader. Proxying request to the leader chain_id=seda-1-testnet height=152003 round=0 step=2

cosigner 2 detached

Failed to write message to connection        address=tcp://127.0.0.1:1234 err="write tcp 127.0.0.1:52498->127.0.0.1:1234: write: broken pipe"
Cosigner failed to set nonces and sign       chain_id=seda-1-testnet height=152003 round=0 type=prevote cosigner=2 err="unexpected state, metadata does not exist for U: 0d3860ef-9493-493e-82c6-932efe052aac"
Cosigner failed to set nonces and sign       chain_id=seda-1-testnet height=152003 round=0 type=prevote cosigner=3 err="rpc error: code = Unknown desc = unexpected state, metadata does not exist for U: 0d3860ef-9493-493e-82c6-932efe052aac"
Connected to Sentry                          address=tcp://127.0.0.1:1234
Cosigner failed to set nonces and sign       chain_id=seda-1-testnet height=152003 round=0 type=prevote cosigner=1 err="rpc error: code = Unknown desc = unexpected state, metadata does not exist for U: 0d3860ef-9493-493e-82c6-932efe052aac"

cosigner 3

Failed to sign with shard                    chain_id=seda-1-testnet height=152003 round=0 step=2 error="unexpected state, metadata does not exist for U: 0d3860ef-9493-493e-82c6-932efe052aac"
Failed to sign                               type=prevote chain_id=seda-1-testnet height=152003 round=0 error="rpc error: code = Unknown desc = error from cosigner(s): unexpected state, metadata does not exist for U: 0d3860ef-9493-493e-82c6-932efe052aac"
Failed to write message to connection        address=tcp://127.0.0.1:1234 err="write tcp 127.0.0.1:40268->127.0.0.1:1234: write: broken pipe"

Recovered after 15 blocks

However, we would typically expect horcrux to keep signing even through a temporary disconnect.

@nitronit
Contributor

nitronit commented Mar 8, 2024

@chillyvee
Just curious, could you test one-to-one, i.e. one chain binary to one horcrux cosigner, and see if you have any better luck? Note it's still a cluster of Horcrux signers. Not sure it will change anything per se, but it would be interesting. Also I am not sure which

I'm not 100% sure from the logs, but I believe it's something with the nonce pre-sharing. What it's warning about is that the unique ID for the pre-share doesn't exist, i.e. either it was never generated or it was cleared for some other reason.

As you say, the connection (theoretically) shouldn't have an impact. That's one of the ideas behind Horcrux.
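To illustrate the failure mode described above, the sketch below models pre-shared nonces stored under a unique ID, where an entry can be pruned before the sign request that references it arrives. This is a toy model written for this thread, not horcrux's actual implementation. The names `nonceStore`, `Put`, `Prune`, and `Take` are hypothetical, and times are plain integer seconds to keep the example deterministic.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoMetadata loosely mirrors the "metadata does not exist" log message;
// the real error type in horcrux may differ (assumption).
var errNoMetadata = errors.New("metadata does not exist")

// preshare is one pre-shared nonce with an expiry time in seconds.
type preshare struct {
	nonce     []byte
	expiresAt int64
}

// nonceStore keeps pre-shared nonces keyed by a unique ID.
type nonceStore struct {
	entries map[string]preshare
}

func newNonceStore() *nonceStore {
	return &nonceStore{entries: make(map[string]preshare)}
}

// Put records a nonce under id, valid for ttl seconds from now.
func (s *nonceStore) Put(id string, nonce []byte, now, ttl int64) {
	s.entries[id] = preshare{nonce: nonce, expiresAt: now + ttl}
}

// Prune drops every entry that has expired by now and returns how many.
func (s *nonceStore) Prune(now int64) int {
	removed := 0
	for id, e := range s.entries {
		if e.expiresAt <= now {
			delete(s.entries, id)
			removed++
		}
	}
	return removed
}

// Take returns the nonce for id and removes it, so each nonce is used once.
// A missing id (never generated, already used, or pruned) is an error.
func (s *nonceStore) Take(id string) ([]byte, error) {
	e, ok := s.entries[id]
	if !ok {
		return nil, fmt.Errorf("%w for id %s", errNoMetadata, id)
	}
	delete(s.entries, id)
	return e.nonce, nil
}

func main() {
	s := newNonceStore()
	s.Put("id-a", []byte{1}, 0, 10) // expires at t=10
	s.Put("id-b", []byte{2}, 0, 60) // expires at t=60
	s.Prune(20)                     // id-a is cleared before it is used

	if _, err := s.Take("id-a"); err != nil {
		fmt.Println("sign request failed:", err)
	}
	if n, err := s.Take("id-b"); err == nil {
		fmt.Println("signed with nonce", n)
	}
	if _, err := s.Take("id-b"); err != nil {
		fmt.Println("reuse rejected:", err)
	}
}
```

In this model, a leader that references an ID the cosigner has already cleared gets the same kind of lookup failure seen in the logs, regardless of whether the network connection itself is healthy.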

@chillyvee
Contributor Author

chillyvee commented Mar 8, 2024

Reconfigured so that 2 cosigners do not have a chain node to connect to. All cosigners were restarted.

All on a single machine

cosigner1 -> invalid node
cosigner2 -> invalid node
cosigner3 -> sedad node

cosigner1 log

Cosigner nonce cache ahead of demand         target=1 remaining=2 nonces_per_min=0 avg_nonces_per_min=12.752869678631988
I am the leader. Managing the sign process for this block chain_id=seda-1-testnet height=160388 round=0 type=prevote
I am the leader. Managing the sign process for this block chain_id=seda-1-testnet height=160388 round=0 type=precommit
Nonce cache is empty, triggering reload   
Loading additional nonces to meet demand     target=2 remaining=0 additional=2 nonces_per_min=63.456343614070065 avg_nonces_per_min=25.805804869486476
Loaded nonces                                desired=2 added=2
Cosigner nonce cache ahead of demand         target=2 remaining=2 nonces_per_min=0 avg_nonces_per_min=24.911426135209958

cosigner2 log

Signed with shard                            chain_id=seda-1-testnet height=160388 round=0 step=2
Signed with shard                            chain_id=seda-1-testnet height=160388 round=0 step=3

cosigner3 log

Failed to sign                               type=prevote chain_id=seda-1-testnet height=160388 round=0 error="rpc error: code = Unknown desc = combined signature is not valid"

message sets repeat

The "combined signature is not valid" error continues and blocks are missed continuously.

Restarted all 3 cosigners; blocks are getting signed again.

@chillyvee
Contributor Author

later

cosigner1

Loading additional nonces to meet demand     target=1 remaining=0 additional=1 nonces_per_min=0 avg_nonces_per_min=0
Loaded nonces                                desired=1 added=1
I am the leader. Managing the sign process for this block chain_id=seda-1-testnet height=160577 round=0 type=prevote
Nonce cache is empty, triggering reload
Loading additional nonces to meet demand     target=1 remaining=0 additional=1 nonces_per_min=162.32679267913417 avg_nonces_per_min=0.0476845357405875
Loaded nonces                                desired=1 added=1
I am the leader. Managing the sign process for this block chain_id=seda-1-testnet height=160577 round=0 type=precommit
Nonce cache is empty, triggering reload
Loading additional nonces to meet demand     target=1 remaining=0 additional=1 nonces_per_min=526.3371453725081 avg_nonces_per_min=0.14302769147887343
Loaded nonces                                desired=1 added=1
Cosigner nonce cache ahead of demand         target=1 remaining=1 nonces_per_min=0 avg_nonces_per_min=0.1426875400133508
Cosigner nonce cache ahead of demand         target=1 remaining=1 nonces_per_min=0 avg_nonces_per_min=0.1423490044580546
I am the leader. Managing the sign process for this block chain_id=seda-1-testnet height=160578 round=0 type=prevote

cosigner2

I[2024-03-08|12:55:04.707] Signed with shard                            chain_id=seda-1-testnet height=160577 round=0 step=2
I[2024-03-08|12:55:04.942] Signed with shard                            chain_id=seda-1-testnet height=160577 round=0 step=3

cosigner3

Failed to sign                               type=prevote chain_id=seda-1-testnet height=160577 round=0 error="rpc error: code = Unknown desc = combined signature is not valid"

@chillyvee
Contributor Author

chillyvee commented Mar 12, 2024

Adjusted the CosignerNonceCache target to always return 1. It worked for a while, then started to error.

Raised defaultNonceExpiration to 20 and continued to monitor. It failed after some time.

@chillyvee
Contributor Author

chillyvee commented Mar 14, 2024

With a 2/3 cosigner setup and only cosigner 1 connected to the node, it does not matter whether cosigner 1 is the leader or not.

Without a restart, signing will work, then suddenly break, then suddenly work again.

The logs look the same.

Forcing a restart of a cosigner sometimes fixes the problem. Restarting many times will fix the problem, unless it is too many times, in which case the problem returns until it is restarted even more times.

@chillyvee
Contributor Author

ignore this issue until further comment

@joelsmith-2019

Hello @chillyvee, were you able to resolve the issue?

I attempted to reproduce the errors you were running into above, but was unsuccessful. I used interchaintest and the latest version of seda-chain (which is using SDK v0.50) to spin up several different horcrux clusters under different scenarios.

Scenario 1:

  • Setup: 3 Cosigners all connected to 1 node
  • Looped through each cosigner, shut it down for 100 blocks, then turned it back on
  • The cluster continued publishing blocks with no issues or delays

Results are normal, blocks are being signed:

[three screenshots, not captured]

Scenario 2:

  • Setup: 3 Cosigners all connected to 3 nodes
  • Looped through each cosigner, shut it down for 100 blocks, then turned it back on
  • The cluster continued publishing blocks with no issues or delays

Results are normal, blocks are being signed:

[three screenshots, not captured]

Scenario 3:

  • Setup: 1/3 Cosigners connected to a node, 2/3 cosigners configured with non-existent node
  • Looped through cosigners not configured with a valid node, shut it down for 100 blocks, then turned it back on
  • The cluster continued publishing blocks with no issues or delays

Cosigners not connected to a node: (timeout is expected)

[two screenshots, not captured]

Leader & connected to a node:

[one screenshot, not captured]

Conclusion

I wasn't able to reproduce your error. Is it possible one of your cluster's cosigners is misconfigured with a signing shard? If there is a different scenario I missed, please let me know!
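As a rough illustration of why a mismatched shard would matter, the sketch below uses a toy 2-of-3 Shamir split over a small prime field. It assumes the cosigner shards behave like Shamir shares of one key; it is not horcrux's threshold Ed25519 code, and `split`, `combine`, and the field size are hypothetical choices made for readability. Any two shares from the same split recover the secret, while mixing in a share from a different split yields a different value, which is the analogue of a combined signature that fails verification.

```go
package main

import "fmt"

// p is a small prime modulus. A toy field only, far too small for real use.
const p int64 = 7919

// mod reduces a into the range [0, p).
func mod(a int64) int64 {
	a %= p
	if a < 0 {
		a += p
	}
	return a
}

// powMod computes b^e mod p by square-and-multiply.
func powMod(b, e int64) int64 {
	result := int64(1)
	b = mod(b)
	for e > 0 {
		if e&1 == 1 {
			result = result * b % p
		}
		b = b * b % p
		e >>= 1
	}
	return result
}

// inv returns the modular inverse of a via Fermat's little theorem.
func inv(a int64) int64 { return powMod(a, p-2) }

// share is one point (x, f(x)) on the sharing polynomial.
type share struct{ x, y int64 }

// split evaluates f(x) = secret + coeff*x at x = 1, 2, 3 (threshold 2).
func split(secret, coeff int64) [3]share {
	var out [3]share
	for i := range out {
		x := int64(i + 1)
		out[i] = share{x: x, y: mod(secret + coeff*x)}
	}
	return out
}

// combine recovers f(0) from two shares by Lagrange interpolation.
func combine(a, b share) int64 {
	la := mod(b.x * inv(mod(b.x-a.x))) // Lagrange weight of a at x=0
	lb := mod(a.x * inv(mod(a.x-b.x))) // Lagrange weight of b at x=0
	return mod(a.y*la + b.y*lb)
}

func main() {
	good := split(1234, 567)
	other := split(4321, 99) // shards generated for a different key

	fmt.Println("matching shards:  ", combine(good[0], good[1]))
	fmt.Println("mismatched shards:", combine(good[0], other[2]))
}
```

A check like this only speaks to the shard-mismatch hypothesis. It would not explain failures that come and go without any configuration change.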

@chillyvee
Contributor Author

Scenario resolved. PR will be provided.


4 participants