can't detect failure of own pods ~doesn't scale down reliably~ #13
this is actually a pretty serious problem: zombie pods are keeping the cluster scaled up even 7 days after the job completed
I'll investigate when I can. The most useful information here would be a log from the worker pod.
if i remember correctly, the pods that hold the jobs open stay in "ContainerCreating" forever
OK, so those pods never actually ran any falconeri code at all, and they're stuck at the Kubernetes level. You can find the underlying error using `kubectl describe pod` on the stuck pods. We're not going to have the tools to deal with this until I do the next batch of scalability work, which will mostly be about periodically inspecting Kubernetes jobs, reacting to lower-level problems, and recovering from failed containers.
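For anyone debugging this in the meantime, something along these lines should surface the underlying error (the pod name is a placeholder, not anything falconeri produces):

```sh
# Sketch: find out why a pod is stuck in ContainerCreating. The Events section
# of `kubectl describe` usually names the failing step (image pull, volume
# mount, etc.). <stuck-pod> is a placeholder for the actual pod name.
kubectl describe pod <stuck-pod>
kubectl get events --sort-by=.lastTimestamp | grep <stuck-pod>
```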
the simple solution here is to delete the k8s job when the falc job finishes. what's the problem with that?
@seamusabshere Because something else entirely may be going wrong. If containers are getting stuck in ContainerCreating, we're well off of any well-understood execution flow. Also, deleting jobs too aggressively will utterly destroy any hope of ever debugging anything, because it wipes all logs and state. So rather than slapping wallpaper over a poorly understood leak in the underlying system, I actually want to observe this myself, and probably add a bit of code so that falconerid is actually aware of what pods are running. When we have hundreds of CPUs running a distributed job, it's important to cross our t's and dot our i's, or else the system will rapidly become incomprehensible. Poorly understood states need to be detected, logged, understood and fixed.
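Roughly speaking, the kind of inspection described above amounts to something like the following, done periodically from inside falconerid rather than by hand (the job name is illustrative, not anything falconeri exposes today):

```sh
# Sketch of periodic inspection: list the pods behind a k8s job and their
# phases, so stuck or dead pods become visible instead of silently holding
# nodes. The job-name label is set by the Job controller; <falc-job> is a
# placeholder.
kubectl get pods -l job-name=<falc-job> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
```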
At the moment, what we need is a
here's a falc job that finished more than a day ago, but its k8s job (and thus node) is still around.
here's the node
and here's the job
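(For reference, a leftover job and the node it pins can be inspected with something like the commands below; the job name is the example used later in this thread and stands in for any stuck job.)

```sh
# Sketch: inspect a leftover k8s job and the pods/node behind it.
# vendor-union-az8f1 is an example name from this thread; <node-name> is a placeholder.
kubectl get job vendor-union-az8f1 -o wide
kubectl get pods -l job-name=vendor-union-az8f1 -o wide   # NODE column shows which node the pod is on
kubectl describe node <node-name>                         # check the Non-terminated Pods section
```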
now i would say the main issue here is that falconeri can't see when pods die
Hours after a job has finished successfully, I see things like:
I delete the job (`kubectl delete job/vendor-union-az8f1`) and the node is autoscaled away.
➡️ maybe: when it's done with a falc job, delete the k8s job.
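One possible middle ground, sketched below (not something falconeri does today): grab the logs from the finished job before deleting it, so cleanup frees the node without wiping the state needed for debugging.

```sh
# Hypothetical cleanup step: save logs from a pod of the finished job, then
# delete the job so the autoscaler can reclaim the node.
# vendor-union-az8f1 is the example job name from the comment above.
kubectl logs job/vendor-union-az8f1 --all-containers=true > vendor-union-az8f1.log
kubectl delete job/vendor-union-az8f1
```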