
Resiliency for Parsec service providers #606

Open
ionut-arm opened this issue May 12, 2022 · 0 comments
Labels
enhancement New feature or request large Effort label

Problem statement

At the moment, providers are unaware of (and unbothered by) the physical state of the hardware they depend on. They identify and probe the relevant hardware at service startup, establish a (long-lasting) context with it, and then operate under the assumption that the context will remain valid for as long as the service is running. That assumption is somewhat divorced from physical reality, where, for example, a physical PKCS11 token could be unplugged and plugged back in. In such circumstances the most likely outcome is that the service becomes unusable, returning errors for every request that actually tries to interact with the hardware. This applies to the TPM, PKCS11, and possibly CAL providers: some level of state is cached within the service for each of them.
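
For illustration, a minimal sketch of that pattern in Rust, using hypothetical names (`HardwareContext`, `Provider`) rather than Parsec's actual types: the context is probed and established once at startup, then held for the provider's whole lifetime with nothing in place to refresh it.

```rust
use std::sync::Mutex;

// Stands in for a PKCS11 context, TPM session, etc. (illustrative only).
struct HardwareContext;

struct Provider {
    // Established at service startup and assumed valid for as long as the service runs.
    context: Mutex<HardwareContext>,
}

impl Provider {
    fn new() -> Result<Self, String> {
        // Probe the hardware once, at startup (e.g. open a PKCS11 session or a TPM context).
        let context = HardwareContext;
        Ok(Provider {
            context: Mutex::new(context),
        })
    }

    fn sign(&self, data: &[u8]) -> Result<Vec<u8>, String> {
        let _ctx = self.context.lock().unwrap();
        // The real providers call through the cached context here; if the token was
        // unplugged since startup, every such call fails and nothing ever refreshes it.
        Ok(data.to_vec())
    }
}

fn main() {
    let provider = Provider::new().expect("startup probe failed");
    println!("{:?}", provider.sign(b"hello"));
}
```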

Solution space

There are a couple of possible solutions for this problem.

  1. Don't cache any state and instead re-establish the connection for every request. I'm unsure whether this is even possible for all of the providers: for PKCS11, at least, I think it's not recommended (perhaps not even possible) to have multiple independent instances of the context object within the same app. For TPMs we'd have to make all the calls single-threaded or keep track of whether an AB/RM is in use, otherwise the service might lock up. It's also very heavyweight in some cases: for the TPM we'd need to start up the context, create a new session, create a primary key, and then tear everything back down, for every request.
  2. Try to identify whether the hardware is OK and, if something seems to have gone awry, recycle the connection (see the sketch after this list). This is more feasible, but it comes with its own issues: do we run these checks regularly, or do we just try to identify error codes that might indicate an issue? Can we even identify such a failure? We'll need provider-specific answers to these questions.
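
A rough sketch of what option 2 could look like, assuming each provider can supply a predicate that classifies an error as "connection lost" and a way to recycle its context; all names here (`HardwareContext`, `HwError`, `is_connection_error`, `reconnect`) are illustrative, not existing Parsec APIs:

```rust
use std::sync::Mutex;

struct HardwareContext;

#[derive(Debug)]
enum HwError {
    ConnectionLost,
    Other,
}

fn is_connection_error(err: &HwError) -> bool {
    // Provider-specific mapping, e.g. CKR_DEVICE_REMOVED for PKCS11, or a TPM
    // response code suggesting the device went away.
    matches!(err, HwError::ConnectionLost)
}

struct Provider {
    context: Mutex<HardwareContext>,
}

impl Provider {
    fn reconnect(&self) -> Result<(), HwError> {
        // Tear the context down and set it up again: re-open the PKCS11 session,
        // restart the TPM context, recreate the primary key, and so on.
        *self.context.lock().unwrap() = HardwareContext;
        Ok(())
    }

    fn call_hw(&self, data: &[u8]) -> Result<Vec<u8>, HwError> {
        let _ctx = self.context.lock().unwrap();
        // Placeholder for the actual hardware call through the cached context.
        Ok(data.to_vec())
    }

    fn sign(&self, data: &[u8]) -> Result<Vec<u8>, HwError> {
        match self.call_hw(data) {
            Err(ref e) if is_connection_error(e) => {
                // Recycle the connection once, then retry the request.
                self.reconnect()?;
                self.call_hw(data)
            }
            other => other,
        }
    }
}

fn main() {
    let provider = Provider {
        context: Mutex::new(HardwareContext),
    };
    println!("{:?}", provider.sign(b"data"));
}
```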

I think only option 2 is viable, but it comes with some important design questions attached, and most likely with quite a bit of engineering to make it work. Some of the relevant questions: what do we do if we've set up the connection again but get the same error? Do we have some sort of back-off mechanism? Do we put the provider in some sort of "zombie" state while this is happening, responding to all requests with the same response code? Which bits of all this should be configurable?
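
One possible answer to the back-off and "zombie" questions, sketched under the assumption of a per-provider resiliency state with a fixed back-off interval; the names and the default interval are illustrative only, and the interval is exactly the kind of thing that could be exposed in the provider configuration:

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

enum Status {
    Healthy,
    // The provider is "zombied" until this deadline; all requests get the same error.
    Unavailable { retry_after: Instant },
}

struct ResiliencyState {
    status: Mutex<Status>,
    backoff: Duration, // could be surfaced as a provider configuration option
}

impl ResiliencyState {
    fn allow_request(&self) -> bool {
        let mut status = self.status.lock().unwrap();
        let expired = match &*status {
            Status::Healthy => return true,
            Status::Unavailable { retry_after } => Instant::now() >= *retry_after,
        };
        if expired {
            // Back-off expired: let the next request attempt a reconnect.
            *status = Status::Healthy;
        }
        expired
    }

    fn mark_failed(&self) {
        // A reconnect attempt failed: answer everything with the same error
        // until the back-off expires.
        *self.status.lock().unwrap() = Status::Unavailable {
            retry_after: Instant::now() + self.backoff,
        };
    }
}

fn main() {
    let state = ResiliencyState {
        status: Mutex::new(Status::Healthy),
        backoff: Duration::from_secs(30),
    };
    state.mark_failed();
    println!("request allowed now: {}", state.allow_request());
}
```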

Testing

We'll obviously need to test this, and the tests themselves might require some investigation around, e.g., SoftHSM2 and tpm_server, in order to properly simulate a "disconnect" while the service is running.
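
As a starting point, a resiliency test against the TPM simulator might look roughly like the following, assuming the harness can start and stop `tpm_server` itself and that `send_parsec_request` wraps an ordinary client operation (both are hypothetical); simulating a SoftHSM2 "disconnect" will need separate investigation.

```rust
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::Duration;

fn start_tpm_server() -> std::io::Result<Child> {
    // Launch the TPM simulator the same way the existing CI scripts do.
    Command::new("tpm_server").spawn()
}

fn send_parsec_request() -> Result<(), String> {
    // Placeholder for a real client operation (e.g. a sign request) against the
    // running Parsec service.
    Ok(())
}

fn main() -> std::io::Result<()> {
    let mut tpm = start_tpm_server()?;
    sleep(Duration::from_secs(1));
    assert!(send_parsec_request().is_ok(), "baseline request should succeed");

    // Simulate the hardware "going away" while the service is running...
    tpm.kill()?;
    tpm.wait()?;

    // ...and coming back.
    let mut tpm = start_tpm_server()?;
    sleep(Duration::from_secs(1));

    // With resiliency in place, the provider should recover without a service restart.
    assert!(
        send_parsec_request().is_ok(),
        "request after reconnect should succeed"
    );
    tpm.kill()?;
    tpm.wait()?;
    Ok(())
}
```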
