
GPU constraint of 0 added to Kubernetes step spec, making it unsatisfiable #2005

Open
dmcguire81 opened this issue Aug 30, 2024 · 0 comments


dmcguire81 commented Aug 30, 2024

Here's a short example flow to illustrate the issue:

from metaflow import FlowSpec, step, resources


class NonGPUFlow(FlowSpec):
    @step
    def start(self):
        print("Starting flow")
        self.next(self.end)

    # this step hangs indefinitely with no feedback and no
    # indication of an error such as PodUnschedulable, because
    # nothing can satisfy the constraint gpu=0
    @resources(cpu=2, memory=2048)
    @step
    def end(self):
        print("Flow completed")


if __name__ == "__main__":
    NonGPUFlow()

When this is run on Kubernetes, as follows, the end step can never be allocated resources:

$ python NonGPUFlow.py run --with kubernetes
Metaflow 2.12.10+netflix-ext(1.2.1) executing NonGPUFlow for user:davidmcguire
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2024-08-29 22:35:30.205 Workflow starting (run-id 49):
2024-08-29 22:35:34.857 [49/start/240 (pid 6056)] Task is starting.
2024-08-29 22:35:37.488 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Task is starting (Pod is pending)...
2024-08-29 22:37:06.836 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Setting up task environment.
2024-08-29 22:37:32.401 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Downloading code package...
2024-08-29 22:37:33.310 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Code package downloaded.
2024-08-29 22:37:33.411 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Task is starting.
2024-08-29 22:37:35.124 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Starting flow
2024-08-29 22:37:40.147 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Task finished with exit code 0.
2024-08-29 22:37:41.789 [49/start/240 (pid 6056)] Task finished successfully.
2024-08-29 22:37:43.602 [49/end/241 (pid 15724)] Task is starting.
2024-08-29 22:37:46.470 [49/end/241 (pid 15724)] [job t-0ed12c7d-9dwb2] Task is starting (Job status is unknown)...

This never makes any progress against GKE, nor does it ever surface a failure.

The unsatisfiable constraint is specified in the portion of the step's spec that is communicated to Kubernetes, shown below (with the default value of KUBERNETES_GPU_VENDOR):

        resources:
          limits:
            cloud.google.com/pod-slots: "1"
            nvidia.com/gpu: "0"
          requests:
            cloud.google.com/pod-slots: "1"
            cpu: "2"
            ephemeral-storage: 10240M
            memory: 4096M
            nvidia.com/gpu: "0"

The root cause turns out to be a combination of two things: the ResourcesDecorator defaults gpu to 0, and the KubernetesDecorator only filters out a gpu value of None, which is not that default.
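
A minimal sketch of the mismatch (the names here are illustrative, not the actual Metaflow internals):

# Hypothetical sketch; names are illustrative, not Metaflow's actual code.
RESOURCES_DEFAULTS = {"cpu": 1, "memory": 4096, "gpu": 0}  # gpu defaults to 0, not None

def build_gpu_limits(attributes):
    limits = {}
    gpu = attributes.get("gpu")
    # Filtering only on None lets the default 0 through, so the pod spec
    # ends up with an unsatisfiable nvidia.com/gpu: "0" limit. Filtering
    # falsy values instead (e.g. `if gpu:`) would drop both None and 0.
    if gpu is not None:
        limits["nvidia.com/gpu"] = str(gpu)
    return limits

print(build_gpu_limits(RESOURCES_DEFAULTS))  # {'nvidia.com/gpu': '0'}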

There is a workaround, explicitly setting gpu=None in the @resources decorator (shown below), but forgetting to do so makes for an unpleasant footgun that gives no hint as to the underlying problem. Because the behavior of the KubernetesDecorator can be corrected in isolation, without considering the impact on the BatchDecorator (which changing the gpu default in the ResourcesDecorator would require), this seems like an easy win for usability.
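
For reference, the workaround applied to the example flow's end step:

    # Passing gpu=None explicitly keeps the GPU constraint out of the pod spec.
    @resources(cpu=2, memory=2048, gpu=None)
    @step
    def end(self):
        print("Flow completed")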
