-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Non persistent TSS signer and x25519 keypair #1216
base: master
Are you sure you want to change the base?
Conversation
…- generate it internally
* master: Add TDX test network chainspec (#1204) Test CLI command to retrieve quote and change endpoint / TSS account in one command (#1198) Bump the patch-dependencies group with 2 updates (#1212) Bump thiserror from 2.0.4 to 2.0.6 in the patch-dependencies group (#1206) Downgrade parity-scale-codec as version we currently use has been yanked (#1205) Bump clap from 4.5.22 to 4.5.23 in the patch-dependencies group (#1202)
@@ -101,26 +93,6 @@ use crate::{ | |||
validation::EncryptedSignedMessage, | |||
}; | |||
|
|||
#[tokio::test] | |||
#[serial] | |||
async fn test_get_signer_does_not_throw_err() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test is no longer needed as the process of getting a signer from app state is now infallible
} | ||
|
||
/// Convenience function to get chain api and rpc | ||
pub async fn get_api_rpc( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this PR but i though once being here why not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you open a Good First Issue
to go through the codebase and use this?
Ok(()) | ||
}; | ||
|
||
if let Err(error) = backoff::future::retry(backoff.clone(), balance_query).await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This balance check and the one below could maybe be combined into a state machine which makes both checks, but i think for now its ok to make one after the other
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look like multiple balance queries like this are handled natively by subxt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look like multiple balance queries like this are handled natively by subxt
Can you explain what you mean a bit more i dont get it. It looks like the person who made that issue wants to check balances for a bunch of different accounts, but here we are making a balance query for one account several times one after the other.
let backoff = backoff::ExponentialBackoff::default(); | ||
match backoff::future::retry(backoff, connect_to_substrate_node).await { | ||
// Never give up trying to connect | ||
let backoff = backoff::ExponentialBackoff { max_elapsed_time: None, ..Default::default() }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not looked at the backoff
crate in too much detail but i think adding max_elapsed_time: None
means it will keep checking indefinitely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I also think it means it won't stop based off time (but may stop under other conditions).
Why do you want to do it this way as opposed to the 15 min limit we had before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see what is to gain by limiting it. It will be not immediately obvious to the operator that the entropy-tss
process has stopped if the VM is still up and running, so i am imagining entropy-tss
is going to be automatically re-started if it bails, in which case it will continue to attempt to connect anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By having it spin down after 15 (or some amount) of minutes it becomes a very clear error state for any operators, assuming they have some way to get notified that the process is down.
While it may "just get restarted" by the operator it will also signal to them that some manual action (in this case funding the accounts) needs to be taken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok i will add a limit but make an issue regarding the dev-ops-related workflow for how the host operator can be notified that the process has stopped.
kv: &KvManager, | ||
) -> Result<(PairSigner<EntropyConfig, sr25519::Pair>, StaticSecret), UserErr> { | ||
let hkdf = get_hkdf(kv).await?; | ||
pub fn get_signer_and_x25519_secret( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now only used when a ValidatorName
(eg: --alice
) is given, or in tests.
@@ -179,6 +179,8 @@ pub enum UserErr { | |||
TooFewSigners, | |||
#[error("Non signer sent from relayer")] | |||
IncorrectSigner, | |||
#[error("Node has started fresh and not yet successfully set up")] | |||
NotReady, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to react programmatically to this error (eg: for slashing) we could change the impl IntoResponse
below to give a some special status code if this variant is present.
/// - Communication has been established with the chain node | ||
/// - The TSS account is funded | ||
/// - The TSS account is registered with the staking extension pallet | ||
ready: Arc<RwLock<bool>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bool
could maybe be replaced with an enum with the various states of readiness to make it easier to determine why the node is not ready: no connection to chain, no funds, or not registered with staking pallet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless we have a defined state transition flow here I don't think this makes sense. But I could be convinced either way
are we fully set on not keeping this in the kvdb, I thought we were still on the fence about that, if so then I can push removing the whole kvdb as its purpose is now kinda obsolute and everything can be held in state, also probably needs to get devops eyes on this too as it will break their flow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good, but I need another review to finish up
let backoff = backoff::ExponentialBackoff::default(); | ||
match backoff::future::retry(backoff, connect_to_substrate_node).await { | ||
// Never give up trying to connect | ||
let backoff = backoff::ExponentialBackoff { max_elapsed_time: None, ..Default::default() }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I also think it means it won't stop based off time (but may stop under other conditions).
Why do you want to do it this way as opposed to the 15 min limit we had before?
/// - Communication has been established with the chain node | ||
/// - The TSS account is funded | ||
/// - The TSS account is registered with the staking extension pallet | ||
ready: Arc<RwLock<bool>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless we have a defined state transition flow here I don't think this makes sense. But I could be convinced either way
} | ||
|
||
/// Get a [PairSigner] for submitting extrinsics with subxt | ||
pub fn signer(&self) -> PairSigner<EntropyConfig, sr25519::Pair> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I like all the getters 👍
Personally, i think it would be great to remove the kvdb. Im still not totally sure what our best option is for keyshare storage - i am looking into using the SGX seal API which Hang from Phala suggested to us. But i would suggest for now we keep it in memory only.
I have a call with @vitropy tomorrow to look at the setup we have on the TDX machine, which hopefully will make clear why these changes are needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work with this 💪
cc @entropyxyz/system-reliability-engineers since this will impact your flow
Ok(()) | ||
}; | ||
|
||
if let Err(error) = backoff::future::retry(backoff.clone(), balance_query).await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look like multiple balance queries like this are handled natively by subxt
} | ||
|
||
/// Convenience function to get chain api and rpc | ||
pub async fn get_api_rpc( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you open a Good First Issue
to go through the codebase and use this?
@@ -220,12 +220,14 @@ pub async fn sign_tx( | |||
State(app_state): State<AppState>, | |||
Json(encrypted_msg): Json<EncryptedSignedMessage>, | |||
) -> Result<(StatusCode, Body), UserErr> { | |||
let (signer, x25519_secret) = get_signer_and_x25519_secret(&app_state.kv_store).await?; | |||
if !app_state.is_ready() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do worry a bit that we'll forget this check in some new endpoint and it could cause problems. But maybe it'll be quite obvious when somebody tries to do something when the app isn't ready
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, this is a problem. The other way to approach this would be to start with a mini axum app which only exposes /version
and /info
, then when the checks have passed kill it and start our full axum app. But the disadvantage is that when an entropy-tss goes down and restarts, we would get a 404 when hitting eg: /sign_tx
, which is hard to distinguish from some other problem.
@@ -98,8 +92,7 @@ pub async fn create_clients( | |||
values: Vec<Vec<u8>>, | |||
keys: Vec<String>, | |||
validator_name: &Option<ValidatorName>, | |||
) -> (IntoMakeService<Router>, KvManager) { | |||
let listener_state = ListenerState::default(); | |||
) -> (IntoMakeService<Router>, KvManager, SubxtAccountId32) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to return the account ID here? Or maybe said another way, is there no way to access the app state in the test context after this setup helper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i hear you, but i am not sure how to access app state after we have called .into_make_service()
@@ -32,6 +32,9 @@ runtime | |||
- In [#1147](https://github.com/entropyxyz/entropy-core/pull/1147) a field is added to the | |||
chainspec: `jump_started_signers` which allows the chain to be started in a pre-jumpstarted state | |||
for testing. If this is not desired it should be set to `None`. | |||
- In [#1216](https://github.com/entropyxyz/entropy-core/pull/1216) the `--setup-only` option for `entropy-tss` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second part of this breaking change kinda understates the impact this will have on the operator flow...we can maybe elaborate on this more when cutting the next release
As this is a significant breaking change and i feel like there is still a bit of unsureness i am not gonna merge until we have talked about it at the core sync. Having looked at the option of SGX-sealing the keyshare - i have concerns about how we can use a recovered keyshare if we have a new TSS account id - since the keyshare's party id is the TSS account id, and we need to sign protocol messages with it in order to do anything useful with the keyshare. But that said i don't really see another option than this, because of needing the keypairs to prove attestation. |
…account, and registering on chain
entropy-tss
has two keypairs which define its identity on the network: the TSS account used to sign extrinsics, and the x25519 encryption keypair. Since both are generated in a confidential virtual machine and their public keys included in the attestation, authenticating with these keypairs proves that the attestation still holds.However, because they are currently stored on disk, it is possible that the validator operator restarts the virtual machine, makes some modifications, and continues to use these keypairs without needing to make another attestation.
This PR makes the signing / x25519 keypairs no longer persistent, but generated new every time entropy-tss launches (except in development / testing where test mnemonics are used). If the
entropy-tss
process is killed for whatever reason, it will not be able to continue to participate in the protocols until the validator operator updates the TSS details using thechange_threshold_accounts
extrinsic. I am hoping to find a way to automate this - see #1214This PR has implications for our devops flow - the
--setup-only
command line option is no longer available.entropy-tss
should be run only once, and the public keys retrieved using the/info
http route.Closes #1203 by adding a boolean
ready
flag to the application state, which is set to true once the 'prerequisite checks' are complete. These checks now also include a check the the TSS account id has been registered with the staking pallet, and the balance check is now mandatory (before, only a warning was logged if the TSS account had no funds). In a non-ready state, all the the HTTP routes relating the the protocols will return an error.This also has implications for slashing, as a TSS node should not be in a non-ready state for too long. How long is acceptable depends a bit on whether we are able to automate the process of the node getting funded and calling
change_threshold_accounts
.