Healthcheck plugins on dispense, re-instantiating broken ones #931
Disclaimer: I don't really think this is the correct solution. It's just a first pass at it, with the most important bit being the test that reliably reproduces the problem. Very happy to work on the solution in a different way.
At the moment, if an external plugin that nomad-autoscaler relies on dies for any reason (OOM, for instance, or someone coming along and running `kill` on it), the nomad-autoscaler process cannot recover. Instead, we see a steady stream of errors of the form `rpc error: code = Canceled desc = context canceled` for the remaining lifetime of the process.

Attempt to solve this by:

- Making a `PluginInfo` RPC against plugins when they are being dispensed
- Calling `dispensePlugins()` if a request is made for a plugin not in the instance map

This seems like the minimal approach in terms of moving code around, but I suspect it has a significant performance impact, and probably spams the logs heavily in a range of scenarios. Still, it's a starting point.
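To make the idea concrete, here is a rough, self-contained sketch of "healthcheck on dispense, relaunch if broken". It is not the code in this PR: `Manager`, `Launcher`, `Plugin`, and `fakePlugin` are hypothetical stand-ins, and the real plugin manager differs in its types and locking. It only illustrates the control flow, assuming a `PluginInfo` call can serve as a cheap liveness probe.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Plugin is a hypothetical stand-in for a dispensed plugin client.
type Plugin interface {
	// PluginInfo is assumed to double as a cheap liveness probe.
	PluginInfo() (string, error)
}

// Launcher starts a fresh plugin instance (hypothetical).
type Launcher func(name string) (Plugin, error)

// Manager keeps one instance per plugin name (hypothetical).
type Manager struct {
	mu        sync.Mutex
	launch    Launcher
	instances map[string]Plugin
}

func NewManager(l Launcher) *Manager {
	return &Manager{launch: l, instances: make(map[string]Plugin)}
}

// Dispense returns a healthy instance, relaunching the plugin when it
// is missing from the instance map or fails the PluginInfo probe.
func (m *Manager) Dispense(name string) (Plugin, error) {
	m.mu.Lock()
	defer m.mu.Unlock()

	if p, ok := m.instances[name]; ok {
		if _, err := p.PluginInfo(); err == nil {
			return p, nil
		}
		// Probe failed: drop the broken instance and fall through.
		delete(m.instances, name)
	}

	p, err := m.launch(name)
	if err != nil {
		return nil, fmt.Errorf("relaunching plugin %q: %w", name, err)
	}
	m.instances[name] = p
	return p, nil
}

// fakePlugin simulates a plugin process that can be killed.
type fakePlugin struct{ dead bool }

func (f *fakePlugin) PluginInfo() (string, error) {
	if f.dead {
		return "", errors.New("rpc error: code = Canceled desc = context canceled")
	}
	return "fake", nil
}

func main() {
	launches := 0
	m := NewManager(func(name string) (Plugin, error) {
		launches++
		return &fakePlugin{}, nil
	})

	p1, _ := m.Dispense("target")
	p1.(*fakePlugin).dead = true // simulate someone running kill on the plugin
	p2, _ := m.Dispense("target")

	fmt.Println("launches:", launches, "replaced:", p1 != p2)
}
```

The probe runs on every `Dispense` call, which is where the performance concern above comes from: each dispense costs an extra RPC round trip. Caching the last successful probe for a short interval would be one way to reduce that, at the cost of slower detection.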
Closes #711