can't detect failure of own pods ~doesn't scale down reliably~ #13

Open · seamusabshere opened this issue Jun 13, 2019 · 9 comments

Labels: faraday (Requested by Faraday)

seamusabshere (Member) commented Jun 13, 2019

Hours after a job has finished successfully, I see things like:

Non-terminated Pods:         (3 in total)
  Namespace                  Name                                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                        ------------  ----------  ---------------  -------------  ---
  default                    vendor-union-az8f1-pvwvl                                    1 (12%)       0 (0%)      4G (14%)         4G (14%)       15h

I delete the job (kubectl delete job/vendor-union-az8f1) and the node is autoscaled away.

➡️ maybe: when falconeri is done with a falc job, it should delete the k8s job.
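A minimal sketch of that cleanup done by hand, assuming the created-by=falconeri label that falconeri puts on its jobs (visible in the job spec later in this thread):

# list the k8s jobs falconeri created and see which ones have completed
$ kubectl get jobs -l created-by=falconeri

# deleting a completed job also deletes its pods, which lets the autoscaler remove the node
$ kubectl delete job vendor-union-az8f1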

seamusabshere (Member, Author) commented:

this is actually a pretty serious problem:

Non-terminated Pods:         (3 in total)
  Namespace                  Name                                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                        ------------  ----------  ---------------  -------------  ---
  default                    property-2019-06-geocode-mift3-q2ttt                        1 (12%)       0 (0%)      2G (7%)          2G (7%)        7d23h

zombie pods are keeping the cluster scaled up even 7 days after the job completed
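For anyone reproducing this, a quick way to see what is pinning a node, assuming $NODE is the node that refuses to scale down:

# the zombie pod still shows up under Non-terminated Pods long after the falc job finished
$ kubectl describe node $NODE | grep -A8 'Non-terminated Pods'

# cross-check against the pods that belong to falconeri jobs and the nodes they sit on
$ kubectl get pods -l created-by=falconeri -o wide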

emk (Contributor) commented Jul 21, 2019

I'll investigate when I can.

The most useful information here would be a log from the worker pod.

seamusabshere (Member, Author) commented:

If I remember correctly, the pods that hold the jobs open stay in "ContainerCreating" forever.

emk (Contributor) commented Jul 22, 2019

OK, so those pods never actually ran any falconeri code at all, and they're stuck at the Kubernetes level. You can find the underlying error using kubectl describe pod/$ID.

We're not going to have the tools to deal with this until I do the next batch of scalability work, which will mostly be about periodically inspecting Kubernetes jobs, reacting to lower-level problems, and recovering from failed containers.
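In the meantime, a rough way to find those stuck pods and their underlying error, assuming they are still sitting in ContainerCreating (i.e. the Pending phase):

# pods stuck in ContainerCreating never leave Pending
$ kubectl get pods -l created-by=falconeri --field-selector=status.phase=Pending

# the Events section at the bottom usually names the real problem
# (failed volume mounts, image pulls, scheduling, etc.)
$ kubectl describe pod/$ID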

seamusabshere (Member, Author) commented:

the simple solution here is to delete the k8s job when the falc job finishes

what's the problem with that?

emk (Contributor) commented Jul 23, 2019

@seamusabshere Because something else entirely may be going wrong. If containers are getting stuck in ContainerCreating, we're well off of any well-understood execution flow.

Also, deleting jobs too aggressively will utterly destroy any hope of ever debugging anything, because it wipes all logs and state. So rather than slapping wallpaper over a poorly understood leak in the underlying system, I actually want to observe this myself, and probably add a bit of code so that falconerid is actually aware of what pods are running.

When we have hundreds of CPUs running a distributed job, it's important to cross our t's and dot our i's, or else the system will rapidly become incomprehensible. Poorly understood states need to be detected, logged, understood and fixed.

emk (Contributor) commented Jul 23, 2019

At the moment, what we need is a kubectl describe for the job, and for all pods associated with the job. For any pods which actually ran, I'd also like to look at the last few lines of the logs.
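Roughly this, assuming $JOB is the k8s job name falconeri created (the job controller stamps every pod with a matching job-name label):

# the job and all of its pods
$ kubectl describe job $JOB
$ kubectl describe pods -l job-name=$JOB

# for pods that actually ran, the tail of the worker log
$ kubectl logs --tail=50 $POD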

seamusabshere (Member, Author) commented Oct 21, 2019

here's a falc job that finished more than a day ago, but its k8s job (and thus node) is still around:

(see Age 33h down there?)

here's the node

$ kubectl describe no $NODE
[...]
Non-terminated Pods:         (3 in total)
  Namespace                  Name                                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                        ------------  ----------  ---------------  -------------  ---
  default                    foobar-2019-09-geocode-wdlyf-84pfp                         1 (12%)       0 (0%)      2G (7%)          2G (7%)        33h
  kube-system                fluentd-gcp-v3.1.1-w5f7x                                    100m (1%)     1 (12%)     200Mi (0%)       500Mi (1%)     33h
  kube-system                kube-proxy-gke-falconeri-falconeri-workers-xxx-dmtn    100m (1%)     0 (0%)      0 (0%)           0 (0%)         33h
[...]

and here's the job

$ kubectl describe job $JOB
Name:           foobar-2019-09-geocode-wdlyf
Namespace:      default
Selector:       controller-uid=xxxx
Labels:         created-by=falconeri
Annotations:    kubectl.kubernetes.io/last-applied-configuration:
                  {"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"labels":{"created-by":"falconeri"},"name":"foobar-2019-09-geocode-wdl...
Parallelism:    256
Completions:    <unset>
Start Time:     Sat, 19 Oct 2019 11:09:46 -0400
Pods Statuses:  4 Running / 34 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=xxxx
           created-by=falconeri
           job-name=foobar-2019-09-geocode-wdlyf
  Containers:
   worker:
    Image:      us.gcr.io/myproject/myimage:latest
    Port:       <none>
    Host Port:  <none>
    Command:
      /usr/local/bin/falconeri-worker
      0805ce0a-1111-455c-96bd-1063bdf6abdf
    Limits:
      memory:  2G
    Requests:
      cpu:     1
      memory:  2G
    Environment:
      RUST_BACKTRACE:       1
      RUST_LOG:             falconeri_common=trace,falconeri_worker=trace
      FALCONERI_NODE_NAME:   (v1:spec.nodeName)
      FALCONERI_POD_NAME:    (v1:metadata.name)
    Mounts:
      /etc/falconeri/secrets from secrets (rw)
      /pfs from pfs (rw)
      /scratch from scratch (rw)
      /secrets/my-service-account from secret-my-service-account (rw)
  Volumes:
   pfs:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
   scratch:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
   secrets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  falconeri
    Optional:    false
   secret-my-service-account:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-service-account
    Optional:    false
Events:          <none>

emk added the faraday (Requested by Faraday) label on Jan 26, 2020
seamusabshere (Member, Author) commented:

I think this ContainerCreating problem has gone away with more recent versions of k8s (1.14 -> 1.17).

Now I would say the main issue here is that falconeri can't see when pods die.
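Until falconerid watches its own pods, a rough way to see pod deaths from the outside, assuming the created-by=falconeri label:

# watch status transitions on falconeri's pods as they happen
$ kubectl get pods -l created-by=falconeri --watch

# or list only the pods Kubernetes has already marked Failed
$ kubectl get pods -l created-by=falconeri --field-selector=status.phase=Failed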

seamusabshere changed the title from "doesn't scale down reliably" to "can't detect failure of own pods ~doesn't scale down reliably~" on Aug 23, 2020