can't detect failure of own pods ~doesn't scale down reliably~ #13

Open · seamusabshere opened this issue Jun 13, 2019 · 9 comments

Labels: faraday (Requested by Faraday)

seamusabshere (Member) commented Jun 13, 2019

Hours after a job has finished successfully, I see things like:

Non-terminated Pods:         (3 in total)
  Namespace                  Name                                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                        ------------  ----------  ---------------  -------------  ---
  default                    vendor-union-az8f1-pvwvl                                    1 (12%)       0 (0%)      4G (14%)         4G (14%)       15h

I delete the job (kubectl delete job/vendor-union-az8f1) and the node is autoscaled away.

➡️ maybe: when falconeri is done with a falc job, it should delete the k8s job.
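A minimal sketch of that cleanup done by hand, assuming the created-by=falconeri label that falconeri puts on its jobs (visible in the job spec later in this thread):

# list the k8s jobs falconeri created and see which ones have completed
$ kubectl get jobs -l created-by=falconeri

# deleting a completed job also deletes its pods, which lets the autoscaler remove the node
$ kubectl delete job vendor-union-az8f1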

seamusabshere (Member, Author) commented:

this is actually a pretty serious problem:

Non-terminated Pods:         (3 in total)
  Namespace                  Name                                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                        ------------  ----------  ---------------  -------------  ---
  default                    property-2019-06-geocode-mift3-q2ttt                        1 (12%)       0 (0%)      2G (7%)          2G (7%)        7d23h

zombie pods are keeping the cluster scaled up even 7 days after the job completed
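For anyone reproducing this, a quick way to see what is pinning a node, assuming $NODE is the node that refuses to scale down:

# the zombie pod still shows up under Non-terminated Pods long after the falc job finished
$ kubectl describe node $NODE | grep -A8 'Non-terminated Pods'

# cross-check against the pods that belong to falconeri jobs and the nodes they sit on
$ kubectl get pods -l created-by=falconeri -o wide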

emk (Contributor) commented Jul 21, 2019

I'll investigate when I can.

The most useful information here would be a log from the worker pod.

seamusabshere (Member, Author) commented:

If I remember correctly, the pods that hold the jobs open stay in "ContainerCreating" forever.

emk (Contributor) commented Jul 22, 2019

OK, so those pods never actually ran any falconeri code at all, and they're stuck at the Kubernetes level. You can find the underlying error using kubectl describe pod/$ID.

We're not going to have the tools to deal with this until I do the next batch of scalability work, which will mostly be about periodically inspecting Kubernetes jobs, reacting to lower-level problems, and recovering from failed containers.
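In the meantime, a rough way to find those stuck pods and their underlying error, assuming they are still sitting in ContainerCreating (i.e. the Pending phase):

# pods stuck in ContainerCreating never leave Pending
$ kubectl get pods -l created-by=falconeri --field-selector=status.phase=Pending

# the Events section at the bottom usually names the real problem
# (failed volume mounts, image pulls, scheduling, etc.)
$ kubectl describe pod/$ID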

seamusabshere (Member, Author) commented:

the simple solution here is to delete the k8s job when the falc job finishes

what's the problem with that?

emk (Contributor) commented Jul 23, 2019

@seamusabshere Because something else entirely may be going wrong. If containers are getting stuck in ContainerCreating, we're well off of any well-understood execution flow.

Also, deleting jobs too aggressively will utterly destroy any hope of ever debugging anything, because it wipes all logs and state. So rather than slapping wallpaper over a poorly understood leak in the underlying system, I actually want to observe this myself, and probably add a bit of code so that falconerid is actually aware of what pods are running.

When we have hundreds of CPUs running a distributed job, it's important to cross our t's and dot our i's, or else the system will rapidly become incomprehensible. Poorly understood states need to be detected, logged, understood and fixed.

emk (Contributor) commented Jul 23, 2019

At the moment, what we need is a kubectl describe for the job, and for all pods associated with the job. For any pods which actually ran, I'd also like to look at the last few lines of the logs.
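Roughly this, assuming $JOB is the k8s job name falconeri created (the job controller stamps every pod with a matching job-name label):

# the job and all of its pods
$ kubectl describe job $JOB
$ kubectl describe pods -l job-name=$JOB

# for pods that actually ran, the tail of the worker log
$ kubectl logs --tail=50 $POD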

seamusabshere (Member, Author) commented Oct 21, 2019

here's a falc job that finished more than a day ago, but its k8s job (and thus node) is still around:

(see Age 33h down there?)

here's the node

$ kubectl describe no $NODE
[...]
Non-terminated Pods:         (3 in total)
  Namespace                  Name                                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                        ------------  ----------  ---------------  -------------  ---
  default                    foobar-2019-09-geocode-wdlyf-84pfp                         1 (12%)       0 (0%)      2G (7%)          2G (7%)        33h
  kube-system                fluentd-gcp-v3.1.1-w5f7x                                    100m (1%)     1 (12%)     200Mi (0%)       500Mi (1%)     33h
  kube-system                kube-proxy-gke-falconeri-falconeri-workers-xxx-dmtn    100m (1%)     0 (0%)      0 (0%)           0 (0%)         33h
[...]

and here's the job

$ kubectl describe job $JOB
Name:           foobar-2019-09-geocode-wdlyf
Namespace:      default
Selector:       controller-uid=xxxx
Labels:         created-by=falconeri
Annotations:    kubectl.kubernetes.io/last-applied-configuration:
                  {"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"labels":{"created-by":"falconeri"},"name":"foobar-2019-09-geocode-wdl...
Parallelism:    256
Completions:    <unset>
Start Time:     Sat, 19 Oct 2019 11:09:46 -0400
Pods Statuses:  4 Running / 34 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=xxxx
           created-by=falconeri
           job-name=foobar-2019-09-geocode-wdlyf
  Containers:
   worker:
    Image:      us.gcr.io/myproject/myimage:latest
    Port:       <none>
    Host Port:  <none>
    Command:
      /usr/local/bin/falconeri-worker
      0805ce0a-1111-455c-96bd-1063bdf6abdf
    Limits:
      memory:  2G
    Requests:
      cpu:     1
      memory:  2G
    Environment:
      RUST_BACKTRACE:       1
      RUST_LOG:             falconeri_common=trace,falconeri_worker=trace
      FALCONERI_NODE_NAME:   (v1:spec.nodeName)
      FALCONERI_POD_NAME:    (v1:metadata.name)
    Mounts:
      /etc/falconeri/secrets from secrets (rw)
      /pfs from pfs (rw)
      /scratch from scratch (rw)
      /secrets/my-service-account from secret-my-service-account (rw)
  Volumes:
   pfs:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
   scratch:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
   secrets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  falconeri
    Optional:    false
   secret-my-service-account:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-service-account
    Optional:    false
Events:          <none>

emk added the faraday (Requested by Faraday) label on Jan 26, 2020
seamusabshere (Member, Author) commented:

I think this ContainerCreating problem has gone away with more recent versions of k8s (1.14 -> 1.17).

Now I would say the main issue here is that falconeri can't see when pods die.
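Until falconerid watches its own pods, a rough way to see pod deaths from the outside, assuming the created-by=falconeri label:

# watch status transitions on falconeri's pods as they happen
$ kubectl get pods -l created-by=falconeri --watch

# or list only the pods Kubernetes has already marked Failed
$ kubectl get pods -l created-by=falconeri --field-selector=status.phase=Failed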

seamusabshere changed the title from "doesn't scale down reliably" to "can't detect failure of own pods ~doesn't scale down reliably~" on Aug 23, 2020