
Non persistent TSS signer and x25519 keypair #1216

Open

wants to merge 40 commits into master
Conversation


@ameba23 ameba23 commented Dec 13, 2024

entropy-tss has two keypairs which define its identity on the network: the TSS account used to sign extrinsics, and the x25519 encryption keypair. Since both are generated in a confidential virtual machine and their public keys included in the attestation, authenticating with these keypairs proves that the attestation still holds.

However, because they are currently stored on disk, it is possible that the validator operator restarts the virtual machine, makes some modifications, and continues to use these keypairs without needing to make another attestation.

This PR makes the signing / x25519 keypairs no longer persistent; instead they are generated anew every time entropy-tss launches (except in development / testing, where test mnemonics are used). If the entropy-tss process is killed for whatever reason, it will not be able to continue to participate in the protocols until the validator operator updates the TSS details using the change_threshold_accounts extrinsic. I am hoping to find a way to automate this - see #1214

This PR has implications for our devops flow - the --setup-only command line option is no longer available. entropy-tss should be run only once, and the public keys retrieved using the /info http route.

Closes #1203 by adding a boolean ready flag to the application state, which is set to true once the 'prerequisite checks' are complete. These checks now also include a check that the TSS account ID has been registered with the staking pallet, and the balance check is now mandatory (previously, only a warning was logged if the TSS account had no funds). In a non-ready state, all the HTTP routes relating to the protocols will return an error.

This also has implications for slashing, as a TSS node should not be in a non-ready state for too long. How long is acceptable depends a bit on whether we are able to automate the process of the node getting funded and calling change_threshold_accounts.

* master:
  Add TDX test network chainspec (#1204)
  Test CLI command to retrieve quote and change endpoint / TSS account in one command (#1198)
  Bump the patch-dependencies group with 2 updates (#1212)
  Bump thiserror from 2.0.4 to 2.0.6 in the patch-dependencies group (#1206)
  Downgrade parity-scale-codec as version we currently use has been yanked (#1205)
  Bump clap from 4.5.22 to 4.5.23 in the patch-dependencies group (#1202)
@ameba23 ameba23 marked this pull request as draft December 13, 2024 10:58
@ameba23 ameba23 changed the title Non persistant TSS signer and x25519 keypair Non persistent TSS signer and x25519 keypair Dec 16, 2024
@@ -101,26 +93,6 @@ use crate::{
validation::EncryptedSignedMessage,
};

#[tokio::test]
#[serial]
async fn test_get_signer_does_not_throw_err() {
@ameba23 (Contributor Author) commented Dec 17, 2024:
I think this test is no longer needed as the process of getting a signer from app state is now infallible

}

/// Convenience function to get chain api and rpc
pub async fn get_api_rpc(
ameba23 (Contributor Author):

Not related to this PR, but since I was here anyway I thought why not.

Collaborator:

Can you open a Good First Issue to go through the codebase and use this?

Base automatically changed from peg/generate-mnemonic to master December 18, 2024 07:20
Ok(())
};

if let Err(error) = backoff::future::retry(backoff.clone(), balance_query).await {
ameba23 (Contributor Author):

This balance check and the one below could maybe be combined into a state machine which makes both checks, but I think for now it's ok to make one after the other.

Collaborator:

Doesn't look like multiple balance queries like this are handled natively by subxt

ameba23 (Contributor Author):

Doesn't look like multiple balance queries like this are handled natively by subxt

Can you explain what you mean a bit more? I don't get it. It looks like the person who made that issue wants to check balances for a bunch of different accounts, but here we are making a balance query for one account several times, one after the other.

let backoff = backoff::ExponentialBackoff::default();
match backoff::future::retry(backoff, connect_to_substrate_node).await {
// Never give up trying to connect
let backoff = backoff::ExponentialBackoff { max_elapsed_time: None, ..Default::default() };
@ameba23 (Contributor Author) commented Dec 18, 2024:

I've not looked at the backoff crate in too much detail, but I think adding max_elapsed_time: None means it will keep checking indefinitely.

Collaborator:

Yeah, I also think it means it won't stop based off time (but may stop under other conditions).

Why do you want to do it this way as opposed to the 15 min limit we had before?

@ameba23 (Contributor Author) commented Jan 7, 2025:

I don't see what there is to gain by limiting it. It will not be immediately obvious to the operator that the entropy-tss process has stopped if the VM is still up and running, so I am imagining entropy-tss is going to be automatically restarted if it bails, in which case it will continue to attempt to connect anyway.

Collaborator:

By having it spin down after 15 minutes (or some other amount) it becomes a very clear error state for any operators, assuming they have some way to get notified that the process is down.

While it may "just get restarted" by the operator it will also signal to them that some manual action (in this case funding the accounts) needs to be taken.

ameba23 (Contributor Author):

Ok, I will add a limit, but I'll open an issue regarding the devops-related workflow for how the host operator can be notified that the process has stopped.

kv: &KvManager,
) -> Result<(PairSigner<EntropyConfig, sr25519::Pair>, StaticSecret), UserErr> {
let hkdf = get_hkdf(kv).await?;
pub fn get_signer_and_x25519_secret(
ameba23 (Contributor Author):

This is now only used when a ValidatorName (eg: --alice) is given, or in tests.

@@ -179,6 +179,8 @@ pub enum UserErr {
TooFewSigners,
#[error("Non signer sent from relayer")]
IncorrectSigner,
#[error("Node has started fresh and not yet successfully set up")]
NotReady,
ameba23 (Contributor Author):

If we want to react programmatically to this error (e.g. for slashing) we could change the impl IntoResponse below to give some special status code if this variant is present.

@ameba23 ameba23 marked this pull request as ready for review December 18, 2024 09:55
/// - Communication has been established with the chain node
/// - The TSS account is funded
/// - The TSS account is registered with the staking extension pallet
ready: Arc<RwLock<bool>>,
ameba23 (Contributor Author):

This bool could maybe be replaced with an enum with the various states of readiness to make it easier to determine why the node is not ready: no connection to chain, no funds, or not registered with staking pallet.

Collaborator:

Unless we have a defined state transition flow here I don't think this makes sense. But I could be convinced either way

@JesseAbram (Member) commented:

Are we fully set on not keeping this in the kvdb? I thought we were still on the fence about that. If so, I can push for removing the whole kvdb, as its purpose is now kinda obsolete and everything can be held in state. This probably also needs devops eyes on it, as it will break their flow.

@HCastano (Collaborator) left a comment:

Generally looks good, but I need another review to finish up



}

/// Get a [PairSigner] for submitting extrinsics with subxt
pub fn signer(&self) -> PairSigner<EntropyConfig, sr25519::Pair> {
Collaborator:

Nice, I like all the getters 👍

@ameba23 (Contributor Author) commented Jan 7, 2025:

@JesseAbram

are we fully set on not keeping this in the kvdb, I thought we were still on the fence about that, if so then I can push removing the whole kvdb as its purpose is now kinda obsolete and everything can be held in state,

Personally, I think it would be great to remove the kvdb. I'm still not totally sure what our best option is for keyshare storage - I am looking into using the SGX seal API which Hang from Phala suggested to us. But I would suggest that for now we keep it in memory only.

also probably needs to get devops eyes on this too as it will break their flow

I have a call with @vitropy tomorrow to look at the setup we have on the TDX machine, which hopefully will make clear why these changes are needed.

@HCastano (Collaborator) left a comment:

Nice work with this 💪

cc @entropyxyz/system-reliability-engineers since this will impact your flow



@@ -220,12 +220,14 @@ pub async fn sign_tx(
State(app_state): State<AppState>,
Json(encrypted_msg): Json<EncryptedSignedMessage>,
) -> Result<(StatusCode, Body), UserErr> {
let (signer, x25519_secret) = get_signer_and_x25519_secret(&app_state.kv_store).await?;
if !app_state.is_ready() {
Collaborator:

I do worry a bit that we'll forget this check in some new endpoint and it could cause problems. But maybe it'll be quite obvious when somebody tries to do something when the app isn't ready

ameba23 (Contributor Author):

Agree, this is a problem. The other way to approach this would be to start with a mini axum app which only exposes /version and /info, then when the checks have passed, kill it and start our full axum app. But the disadvantage is that when entropy-tss goes down and restarts, we would get a 404 when hitting e.g. /sign_tx, which is hard to distinguish from some other problem.

@@ -98,8 +92,7 @@ pub async fn create_clients(
values: Vec<Vec<u8>>,
keys: Vec<String>,
validator_name: &Option<ValidatorName>,
) -> (IntoMakeService<Router>, KvManager) {
let listener_state = ListenerState::default();
) -> (IntoMakeService<Router>, KvManager, SubxtAccountId32) {
Collaborator:

Do we need to return the account ID here? Or maybe said another way, is there no way to access the app state in the test context after this setup helper?

ameba23 (Contributor Author):

I hear you, but I am not sure how to access app state after we have called .into_make_service()

@@ -32,6 +32,9 @@ runtime
- In [#1147](https://github.com/entropyxyz/entropy-core/pull/1147) a field is added to the
chainspec: `jump_started_signers` which allows the chain to be started in a pre-jumpstarted state
for testing. If this is not desired it should be set to `None`.
- In [#1216](https://github.com/entropyxyz/entropy-core/pull/1216) the `--setup-only` option for `entropy-tss`
Collaborator:

The second part of this breaking change kinda understates the impact this will have on the operator flow...we can maybe elaborate on this more when cutting the next release

@ameba23 (Contributor Author) commented Jan 8, 2025:

As this is a significant breaking change, and I feel like there is still a bit of uncertainty, I am not going to merge until we have talked about it at the core sync.

Having looked at the option of SGX-sealing the keyshare - I have concerns about how we can use a recovered keyshare if we have a new TSS account ID, since the keyshare's party ID is the TSS account ID, and we need to sign protocol messages with it in order to do anything useful with the keyshare.

But that said, I don't really see another option than this, because we need the keypairs to prove attestation.

@JesseAbram @HCastano

Successfully merging this pull request may close these issues.

entropy-tss /info route to get account ID cannot be used until a connection to chain has been made