
REST API: return participation key from generate endpoint #6158

Draft
wants to merge 5 commits into base: master

Conversation

PhearZero

Summary

  • Returns the model.ParticipationKey from /v2/participation/generate

Test Plan

Contributor

@jannotti left a comment


I'm unsure about the idea of blocking until the part keys are created. Generation can take a very long time because of Falcon keys.
Perhaps we can generate and return the ID immediately, and the caller can be expected to ask for the details, perhaps repeatedly, until they have been generated.

@@ -5629,4 +5629,4 @@
"name": "private"
}
]
}
}
Contributor


Let's not remove end-of-file newlines gratuitously, else we'll end up with commit wars that bring them back and remove them depending on who edits them last.

daemon/algod/api/server/v2/handlers.go (outdated)
Comment on lines 312 to 315
// Semaphore was acquired, generate the key.
go func() {
	defer v2.KeygenLimiter.Release(1)
	err := v2.generateKeyHandler(address, params)
	if err != nil {
		v2.Log.Warnf("Error generating participation keys: %v", err)
	}
}()
Contributor


This starts a (nearly empty) go routine which immediately releases the KeygenLimiter semaphore. Therefore, this allows any number of keygens to happen concurrently. I doubt that's what we want, so the defer line should be moved outside of the go routine, which should go away. I would put it before line 307.

Since generating part keys can take a very long time, I'm not sure what will happen to the TCP connection that's being held open the entire time. I expect timeouts will occur in many situations.
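For illustration, here is a standalone sketch of the shape being suggested (not the actual handler; it uses golang.org/x/sync/semaphore as the PR does, but keygenLimiter, generateKey, and handleGenerate are placeholder names): acquire the limiter in the handler, defer the release there instead of inside a goroutine, and run the keygen synchronously so the semaphore really does cap concurrent generations.

package main

import (
	"errors"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

// keygenLimiter caps participation-key generation at one at a time.
var keygenLimiter = semaphore.NewWeighted(1)

// generateKey stands in for the slow Falcon participation-key generation.
func generateKey(id int) error {
	time.Sleep(200 * time.Millisecond)
	fmt.Println("generated key", id)
	return nil
}

func handleGenerate(id int) error {
	// Reject concurrent requests instead of queueing them.
	if !keygenLimiter.TryAcquire(1) {
		return errors.New("key generation already in progress")
	}
	// Release only after the synchronous keygen finishes, so the semaphore
	// genuinely limits concurrency.
	defer keygenLimiter.Release(1)
	return generateKey(id)
}

func main() {
	for i := 0; i < 3; i++ {
		go func(i int) {
			if err := handleGenerate(i); err != nil {
				fmt.Println("request", i, "rejected:", err)
			}
		}(i)
	}
	time.Sleep(time.Second)
}

With this shape, a second request arriving mid-generation is rejected immediately rather than silently running in parallel.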

Author

@PhearZero Nov 12, 2024


This was the only way I could get the lock to release while testing. I will take another look at it.

Contributor


Somewhat off-topic. It's unclear to me why a semaphore.Weighted was used. It seems to be used like a simple sync.WaitGroup. You don't have to fix that, I suppose.
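To make that observation concrete, a hypothetical snippet (not go-algorand code): if the limiter were only ever used to wait for outstanding keygens to finish, rather than to reject or cap concurrent requests, a sync.WaitGroup would express the same thing more directly.

package keygen

import "sync"

var wg sync.WaitGroup

// generateAsync tracks one in-flight keygen; work is a placeholder for the
// actual key generation.
func generateAsync(work func()) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		work()
	}()
}

// drain waits for any in-flight keygen to complete, e.g. on shutdown.
func drain() {
	wg.Wait()
}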

@PhearZero
Author

PhearZero commented Nov 12, 2024

I'm unsure about the idea of blocking until the part keys are created. Generation can take a very long time because of Falcon keys. Perhaps we can generate and return the ID immediately, and the caller can be expected to ask for the details, perhaps repeatedly, until they have been generated.

@jannotti This is a good question; I am currently watching the participation endpoint during key generation. I agree that returning the participation ID would be an improvement over the current implementation. It would still require the caller to handle errors, effectively just shifting the problem to a different machine, and it adds RPC overhead to the node.

Since it's a private API guarded by the admin token, it may make sense to allow a long-lived connection. At least that was my thinking; I'd love to know what you have in mind.

@jannotti
Contributor

jannotti commented Nov 12, 2024

Since it's a private API guarded by the admin token, it may make sense to allow a long-lived connection.

I think that's fine, in principle. But I think that extremely long-lived TCP connections that don't send any data are sometimes subject to timeouts. For example, if there is a NAT box between the client and algod, it's going to time out the connection's state eventually. This is an area where you just can't count on any particular behavior past a minute or two, so I would not want our APIs to depend on the correct behavior of very long-lived, silent connections.

Polling by the client using an ID that is returned quickly may seem inelegant, but I think it's the most robust.

A particularly annoying thing that could occur:

  1. ask for a lot of part keys
  2. server starts chugging along.
  3. connection is lost
  4. client tries again
  5. NOPE. Server is still working, so mutex is held. Client must wait before asking again.
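A minimal sketch of the client-side polling described above, assuming (both points are assumptions for illustration, not the current API contract) that the generate call hands back an ID immediately and that GET /v2/participation/{participation-id} with the admin token returns 404 until the key has been installed. Each poll is a short request, so no long-lived, silent connection is held open.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitForPartKey polls the admin participation endpoint until the key with the
// given ID exists (200) or an unexpected status is returned.
func waitForPartKey(algodURL, adminToken, partID string) error {
	client := &http.Client{Timeout: 10 * time.Second}
	for {
		req, err := http.NewRequest("GET", algodURL+"/v2/participation/"+partID, nil)
		if err != nil {
			return err
		}
		req.Header.Set("X-Algo-API-Token", adminToken)
		resp, err := client.Do(req)
		if err != nil {
			return err
		}
		resp.Body.Close()
		switch resp.StatusCode {
		case http.StatusOK:
			return nil // key is available
		case http.StatusNotFound:
			time.Sleep(5 * time.Second) // still generating; poll again
		default:
			return fmt.Errorf("unexpected status %d", resp.StatusCode)
		}
	}
}

func main() {
	err := waitForPartKey("http://localhost:8080", "<admin token>", "<participation id>")
	if err != nil {
		fmt.Println("error:", err)
	}
}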

@PhearZero
Author

Polling by the client using an ID that is returned quickly may seem inelegant, but I think it's the most robust.

I agree with the approach of returning just an ID; as long as we can handle the errors elegantly, it should be a good solution. I wasn't able to find an elegant way to handle the error when keys already exist for the given first and last rounds. Do you have any reservations about checking the file system, similar to the generate method?

For example, if there is a NAT box between the client and algod, it's going to time out the connection's state eventually. This is an area where you just can't count on any particular behavior past a minute or two, so I would not want our APIs to depend on the correct behavior of very long-lived connections.

Absolutely, whatever you think is best. The server should generally be in charge of state, and in this case polling would require a state machine on clients for a resource owned by the service. We could help consumers here, maybe with a status-after-block-style endpoint for the generator, which would allow for both. It would also add resiliency to failures, since we could restart the wait with a given ID. The generate endpoint could return an ID or fail early, and then we could handle state changes client-side by watching the long-lived endpoint. (The status endpoint could be a separate PR.)

What do you think @jannotti?

@PhearZero
Author

I'm not seeing a way to return the ID, since it is dependent on the generated secrets (I may be missing something obvious). There are quite a few error messages that are not returned; is this something that is expected?

Just refactoring the error handling a bit may get us what we need until we find a solution for the ID.

@jannotti
Contributor

I'm not seeing a way to return the ID since it is dependent on the generated secrets

That seems to be the case, which explains why this wasn't done before, despite the comment saying that maybe it could be done. Without looking into it at all, I don't know why the ID is a hash of the contents. I don't know if that's just "cute" - an interesting way to generate a unique ID - or an actual meaningful aspect of the implementation. Generally speaking, I like my database IDs to just be sequential, or perhaps UUIDs. I'd support such a change if it's straightforward. If we did that, we could generate the ID first and perform the insertion later, with the pregenerated ID.

I don't know if there's any code that needs to find the right partkey based on contents. That would explain the deterministic ID. If there is, it would have to be refactored - presumably it could be done with a more elaborate WHERE clause in the SQL to find the key.
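A hypothetical sketch of the pregenerated-ID idea - this is not how the ParticipationID is derived today (the real ID is a hash of the key material); it only illustrates handing back an identifier before the slow keygen starts and inserting the record with that ID later.

package main

import (
	"crypto/rand"
	"encoding/base32"
	"fmt"
)

// newRandomID returns a random identifier that could be handed back to the
// caller before the slow keygen starts. Purely illustrative; the encoding is
// only meant to look roughly like existing Algorand identifiers.
func newRandomID() (string, error) {
	var buf [32]byte
	if _, err := rand.Read(buf[:]); err != nil {
		return "", err
	}
	return base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(buf[:]), nil
}

func main() {
	id, err := newRandomID()
	if err != nil {
		panic(err)
	}
	fmt.Println("pregenerated participation ID:", id)
	// The handler could return this ID right away, then run the keygen and
	// insert the record with the same ID once the secrets exist.
}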

returns participation id on generate
@PhearZero
Author

PhearZero commented Nov 14, 2024

It appears as if the table has a sequential primary key and stores the Participation ID as an additional field. I was able to remove the generated keys from the Participation Identity, which produces the hash. This should keep the changes small, but you raise a good point.

I don't know if that's just "cute" - an interesting way to generate a unique ID - or an actual meaningful aspect of the implementation

It seems like it could be the former; I will ask around to see if anyone has more information. If this implementation is accepted, it would be good to expose a few errors before shipping (just the most common ones).

@PhearZero
Author

I've come up empty for justifications so far. The ID was introduced with the RPC endpoints in #3164. I assume it was just an ad-hoc decision.
