Healthcheck plugins on dispense, re-instantiating broken ones #931

Open
wants to merge 1 commit into
base: main

nick-kentik commented Jul 2, 2024

Disclaimer: I don't really think this is the correct solution. It's just a first pass at it, with the most important bit being the test that reliably reproduces the problem. Very happy to work on the solution in a different way.

At the moment, if an external plugin that nomad-autoscaler relies on dies for any reason (an OOM kill, for instance, or someone running `kill` on it), the nomad-autoscaler process cannot recover. Instead, we see a steady stream of errors of the form `rpc error: code = Canceled desc = context canceled` for the remaining lifetime of the process.

This PR attempts to solve this by:

  • Running the `PluginInfo` RPC against each plugin when it is dispensed
  • Killing the plugin if that RPC fails, and removing it from the plugin instances map
  • Re-running `dispensePlugins()` if a request is made for a plugin that is not in the instances map

This seems like the minimal approach in terms of moving code around, but I suspect it has a significant performance impact, and probably spams the logs heavily in a range of scenarios. Still, it's a starting point.
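
For illustration, the three steps above can be sketched roughly as follows. This is a minimal standalone sketch, not the code in this PR: the `Instance`, `Launcher`, `Manager`, and `fakePlugin` types and the plugin name `"apm"` are all hypothetical stand-ins, and it relaunches only the one missing plugin rather than re-running `dispensePlugins()` for everything.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Instance is a hypothetical stand-in for a running external plugin.
type Instance interface {
	PluginInfo() error // health probe; a non-nil error means the plugin is unreachable
	Kill()             // tear down the plugin process
}

// Launcher starts a fresh instance of the named plugin (hypothetical).
type Launcher func(name string) (Instance, error)

// Manager caches running plugin instances by name.
type Manager struct {
	mu        sync.Mutex
	instances map[string]Instance
	launch    Launcher
}

func NewManager(launch Launcher) *Manager {
	return &Manager{instances: make(map[string]Instance), launch: launch}
}

// Dispense returns a healthy instance, replacing a dead one if needed.
func (m *Manager) Dispense(name string) (Instance, error) {
	m.mu.Lock()
	defer m.mu.Unlock()

	if inst, ok := m.instances[name]; ok {
		if err := inst.PluginInfo(); err == nil {
			return inst, nil
		}
		// Health check failed: kill the instance and forget it.
		inst.Kill()
		delete(m.instances, name)
	}

	// Not in the map (never launched, or just evicted): start it again.
	inst, err := m.launch(name)
	if err != nil {
		return nil, fmt.Errorf("relaunching plugin %q: %w", name, err)
	}
	m.instances[name] = inst
	return inst, nil
}

// fakePlugin simulates a plugin whose process can die.
type fakePlugin struct {
	dead   bool
	killed bool
}

func (p *fakePlugin) PluginInfo() error {
	if p.dead {
		return errors.New("rpc error: code = Canceled desc = context canceled")
	}
	return nil
}

func (p *fakePlugin) Kill() { p.killed = true }

func main() {
	launches := 0
	m := NewManager(func(name string) (Instance, error) {
		launches++
		return &fakePlugin{}, nil
	})

	first, _ := m.Dispense("apm")
	first.(*fakePlugin).dead = true // simulate the plugin process dying

	second, err := m.Dispense("apm")
	fmt.Println(err == nil, second != first, launches) // prints "true true 2"
}
```

The sketch also makes the cost visible: every `Dispense` call performs a health-check round trip while holding the lock, which is the performance concern raised above.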

Closes #711

hashicorp-cla-app bot commented Jul 2, 2024

CLA assistant check
All committers have signed the CLA.

@nick-kentik (Author)

Ping?

Development

Successfully merging this pull request may close these issues.

Nomad Autoscaler Idle forever with context.canceled error