
Resiliency for Parsec service providers #606

Open
ionut-arm opened this issue May 12, 2022 · 0 comments
Labels
enhancement New feature or request large Effort label

Problem statement

At the moment, providers are unaware of (and unbothered by) the physical state of the hardware they depend on. They identify and probe the relevant hardware at service startup, establish a (long-lasting) context with it, and then operate under the assumption that the context will remain valid for as long as the service is running. That assumption is somewhat divorced from physical reality, where, for example, a physical PKCS11 token could be unplugged and plugged back in. In such circumstances the most likely outcome is that the service becomes unusable, returning errors for every request that actually tries to interact with the hardware. This applies to the TPM, PKCS11, and possibly CAL providers: some level of state is cached within the service for each of them.
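
For illustration, a minimal sketch of that pattern in Rust, using hypothetical names (`HardwareContext`, `Provider`) rather than Parsec's actual types: the context is probed and established once at startup, then held for the provider's whole lifetime with nothing in place to refresh it.

```rust
use std::sync::Mutex;

// Stands in for a PKCS11 context, TPM session, etc. (illustrative only).
struct HardwareContext;

struct Provider {
    // Established at service startup and assumed valid for as long as the service runs.
    context: Mutex<HardwareContext>,
}

impl Provider {
    fn new() -> Result<Self, String> {
        // Probe the hardware once, at startup (e.g. open a PKCS11 session or a TPM context).
        let context = HardwareContext;
        Ok(Provider {
            context: Mutex::new(context),
        })
    }

    fn sign(&self, data: &[u8]) -> Result<Vec<u8>, String> {
        let _ctx = self.context.lock().unwrap();
        // The real providers call through the cached context here; if the token was
        // unplugged since startup, every such call fails and nothing ever refreshes it.
        Ok(data.to_vec())
    }
}

fn main() {
    let provider = Provider::new().expect("startup probe failed");
    println!("{:?}", provider.sign(b"hello"));
}
```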

Solution space

There are a couple of possible solutions for this problem.

  1. Don't cache any state and instead re-establish the connection for every request. I'm unsure whether this is even possible for all of the providers: for PKCS11, at least, I think it's not recommended (perhaps not even possible) to have multiple independent instances of the context object within the same app. For TPMs we'd have to make all the calls single-threaded or keep track of whether an AB/RM is in use, otherwise the service might lock up. It's also very heavyweight in some cases: for the TPM we'd need to start up the context, create a new session, create a primary key, and then tear everything back down, for every request.
  2. Try to identify whether the hardware is OK and, if something seems to have gone awry, recycle the connection (see the sketch after this list). This is more feasible, but it comes with its own issues: do we run these checks regularly, or do we just try to identify error codes that might indicate an issue? Can we even identify such a failure? We'll need provider-specific answers to these questions.
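
A rough sketch of what option 2 could look like, assuming each provider can supply a predicate that classifies an error as "connection lost" and a way to recycle its context; all names here (`HardwareContext`, `HwError`, `is_connection_error`, `reconnect`) are illustrative, not existing Parsec APIs:

```rust
use std::sync::Mutex;

struct HardwareContext;

#[derive(Debug)]
enum HwError {
    ConnectionLost,
    Other,
}

fn is_connection_error(err: &HwError) -> bool {
    // Provider-specific mapping, e.g. CKR_DEVICE_REMOVED for PKCS11, or a TPM
    // response code suggesting the device went away.
    matches!(err, HwError::ConnectionLost)
}

struct Provider {
    context: Mutex<HardwareContext>,
}

impl Provider {
    fn reconnect(&self) -> Result<(), HwError> {
        // Tear the context down and set it up again: re-open the PKCS11 session,
        // restart the TPM context, recreate the primary key, and so on.
        *self.context.lock().unwrap() = HardwareContext;
        Ok(())
    }

    fn call_hw(&self, data: &[u8]) -> Result<Vec<u8>, HwError> {
        let _ctx = self.context.lock().unwrap();
        // Placeholder for the actual hardware call through the cached context.
        Ok(data.to_vec())
    }

    fn sign(&self, data: &[u8]) -> Result<Vec<u8>, HwError> {
        match self.call_hw(data) {
            Err(ref e) if is_connection_error(e) => {
                // Recycle the connection once, then retry the request.
                self.reconnect()?;
                self.call_hw(data)
            }
            other => other,
        }
    }
}

fn main() {
    let provider = Provider {
        context: Mutex::new(HardwareContext),
    };
    println!("{:?}", provider.sign(b"data"));
}
```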

I think only option 2 is viable, but it comes with some important design questions attached, and most likely with quite a bit of engineering to make it work. Some of the relevant questions: what do we do if we've set up the connection again but get the same error? Do we have some sort of back-off mechanism? Do we put the provider in some sort of "zombie" state while this is happening, responding to all requests with the same response code? Which bits of all this should be configurable?
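
One possible answer to the back-off and "zombie" questions, sketched under the assumption of a per-provider resiliency state with a fixed back-off interval; the names and the default interval are illustrative only, and the interval is exactly the kind of thing that could be exposed in the provider configuration:

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

enum Status {
    Healthy,
    // The provider is "zombied" until this deadline; all requests get the same error.
    Unavailable { retry_after: Instant },
}

struct ResiliencyState {
    status: Mutex<Status>,
    backoff: Duration, // could be surfaced as a provider configuration option
}

impl ResiliencyState {
    fn allow_request(&self) -> bool {
        let mut status = self.status.lock().unwrap();
        let expired = match &*status {
            Status::Healthy => return true,
            Status::Unavailable { retry_after } => Instant::now() >= *retry_after,
        };
        if expired {
            // Back-off expired: let the next request attempt a reconnect.
            *status = Status::Healthy;
        }
        expired
    }

    fn mark_failed(&self) {
        // A reconnect attempt failed: answer everything with the same error
        // until the back-off expires.
        *self.status.lock().unwrap() = Status::Unavailable {
            retry_after: Instant::now() + self.backoff,
        };
    }
}

fn main() {
    let state = ResiliencyState {
        status: Mutex::new(Status::Healthy),
        backoff: Duration::from_secs(30),
    };
    state.mark_failed();
    println!("request allowed now: {}", state.allow_request());
}
```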

Testing

We'll obviously need to test this, and the tests themselves might require some investigation around, e.g., SoftHSM2 and tpm_server, in order to properly simulate a "disconnect" while the service is running.
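
As a starting point, a resiliency test against the TPM simulator might look roughly like the following, assuming the harness can start and stop `tpm_server` itself and that `send_parsec_request` wraps an ordinary client operation (both are hypothetical); simulating a SoftHSM2 "disconnect" will need separate investigation.

```rust
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::Duration;

fn start_tpm_server() -> std::io::Result<Child> {
    // Launch the TPM simulator the same way the existing CI scripts do.
    Command::new("tpm_server").spawn()
}

fn send_parsec_request() -> Result<(), String> {
    // Placeholder for a real client operation (e.g. a sign request) against the
    // running Parsec service.
    Ok(())
}

fn main() -> std::io::Result<()> {
    let mut tpm = start_tpm_server()?;
    sleep(Duration::from_secs(1));
    assert!(send_parsec_request().is_ok(), "baseline request should succeed");

    // Simulate the hardware "going away" while the service is running...
    tpm.kill()?;
    tpm.wait()?;

    // ...and coming back.
    let mut tpm = start_tpm_server()?;
    sleep(Duration::from_secs(1));

    // With resiliency in place, the provider should recover without a service restart.
    assert!(
        send_parsec_request().is_ok(),
        "request after reconnect should succeed"
    );
    tpm.kill()?;
    tpm.wait()?;
    Ok(())
}
```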
