azure machine-controller webhook timeout #1857

Open
dharapvj opened this issue Sep 5, 2024 · 3 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale.

Comments

@dharapvj

dharapvj commented Sep 5, 2024

Lately, we have been seeing continuous failures when rolling out new MachineDeployments (MDs) in Azure environments.

The error is always that machine-controller-webhook times out. We see it in KubeOne as well as in KKP user clusters.

Some Azure API (most likely the one that lists VM sizes) seems to have become very slow, or we need better filters in our API call.

Here are the logs from an MD in a KKP user cluster:

failed to create machine deployment: Internal error occurred: failed calling webhook "machine-controller.kubermatic.io-machinedeployments": failed to call webhook: Post "https://machine-controller-webhook.cluster-XXXXX.svc.cluster.local./machinedeployments?timeout=10s": context deadline exceeded
{
  "error": {
    "code": 500,
    "message": "failed to create machine deployment: admission webhook \"machine-controller.kubermatic.io-machinedeployments\" denied the request: validation failed: failed to get VM SKU: failed to list available SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'context canceled'"
  }
}

I have seen that increasing the webhook timeout to 30s improves the situation a bit.

But in general, since a webhook timeout can be at most 30s, we should consider caching the list of available VM SKUs to speed things up.

@dharapvj
Author

dharapvj commented Sep 5, 2024

Even with a 30-second webhook timeout, it takes many attempts before the MD is finally applied.

Here are log entries from the KKP API server after setting the timeout to 30 seconds:

{"level":"error","time":"2024-09-05T06:09:34.605Z","caller":"handler/routing.go:152","msg":"failed to create machine deployment: admission webhook \"machine-controller.kubermatic.io-machinedeployments\" denied the request: validation failed: failed to get VM SKU: failed to list available SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'context canceled'","request":"/api/v2/projects/XXX/clusters/YYY/machinedeployments"}
{"level":"error","time":"2024-09-05T06:09:42.670Z","caller":"handler/routing.go:152","msg":"failed to create machine deployment: admission webhook \"machine-controller.kubermatic.io-machinedeployments\" denied the request: validation failed: failed to get VM SKU: failed to list available SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'context canceled'","request":"/api/v2/projects/hzpqm5hzd5/clusters/evrv86lkgm/machinedeployments"}
{"level":"error","time":"2024-09-05T06:14:49.226Z","caller":"handler/routing.go:152","msg":"Cluster components are not ready yet","request":"/api/v2/projects/XXX/clusters/YYY/machinedeployments"}

@kubermatic-bot
Contributor

Issues go stale after 90d of inactivity.
After a further 30 days, they will turn rotten.
Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubermatic-bot kubermatic-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 4, 2024
@kubermatic-bot
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@kubermatic-bot kubermatic-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 4, 2025