
GPU constraint of 0 added to Kubernetes step spec, making it unsatisfiable #2005

Open
dmcguire81 opened this issue Aug 30, 2024 · 0 comments


dmcguire81 commented Aug 30, 2024

Here's a short example flow to illustrate the issue:

from metaflow import FlowSpec, step, resources


class NonGPUFlow(FlowSpec):
    @step
    def start(self):
        print("Starting flow")
        self.next(self.end)

    # this step hangs indefinitely with no feedback and no
    # indication of an error such as PodUnschedulable, because
    # nothing can satisfy the constraint gpu=0
    @resources(cpu=2, memory=2048)
    @step
    def end(self):
        print("Flow completed")


if __name__ == "__main__":
    NonGPUFlow()

When this is run on Kubernetes, as follows, the end step can never be allocated resources:

$ python NonGPUFlow.py run --with kubernetes
Metaflow 2.12.10+netflix-ext(1.2.1) executing NonGPUFlow for user:davidmcguire
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2024-08-29 22:35:30.205 Workflow starting (run-id 49):
2024-08-29 22:35:34.857 [49/start/240 (pid 6056)] Task is starting.
2024-08-29 22:35:37.488 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Task is starting (Pod is pending)...
2024-08-29 22:37:06.836 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Setting up task environment.
2024-08-29 22:37:32.401 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Downloading code package...
2024-08-29 22:37:33.310 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Code package downloaded.
2024-08-29 22:37:33.411 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Task is starting.
2024-08-29 22:37:35.124 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Starting flow
2024-08-29 22:37:40.147 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Task finished with exit code 0.
2024-08-29 22:37:41.789 [49/start/240 (pid 6056)] Task finished successfully.
2024-08-29 22:37:43.602 [49/end/241 (pid 15724)] Task is starting.
2024-08-29 22:37:46.470 [49/end/241 (pid 15724)] [job t-0ed12c7d-9dwb2] Task is starting (Job status is unknown)...

This never makes any progress against GKE, nor does it ever surface a failure.

The unsatisfiable constraint is specified in the portion of the step's spec that is communicated to Kubernetes, shown below (with the default value of KUBERNETES_GPU_VENDOR):

        resources:
          limits:
            cloud.google.com/pod-slots: "1"
            nvidia.com/gpu: "0"
          requests:
            cloud.google.com/pod-slots: "1"
            cpu: "2"
            ephemeral-storage: 10240M
            memory: 4096M
            nvidia.com/gpu: "0"

The root cause turns out to be a combination of two things: the ResourcesDecorator defaults gpu to 0, and the KubernetesDecorator only filters out a gpu value of None, which is not that default.
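
A minimal sketch of the mismatch (the names here are illustrative, not the actual Metaflow internals):

# Hypothetical sketch; names are illustrative, not Metaflow's actual code.
RESOURCES_DEFAULTS = {"cpu": 1, "memory": 4096, "gpu": 0}  # gpu defaults to 0, not None

def build_gpu_limits(attributes):
    limits = {}
    gpu = attributes.get("gpu")
    # Filtering only on None lets the default 0 through, so the pod spec
    # ends up with an unsatisfiable nvidia.com/gpu: "0" limit. Filtering
    # falsy values instead (e.g. `if gpu:`) would drop both None and 0.
    if gpu is not None:
        limits["nvidia.com/gpu"] = str(gpu)
    return limits

print(build_gpu_limits(RESOURCES_DEFAULTS))  # {'nvidia.com/gpu': '0'}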

There is a workaround, explicitly setting gpu=None in the @resources decorator (shown below), but forgetting to do so makes for an unpleasant footgun that gives no hint as to the underlying problem. Because the behavior of the KubernetesDecorator can be corrected in isolation, without considering the impact on the BatchDecorator (which changing the gpu default in the ResourcesDecorator would require), this seems like an easy win for usability.
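
For reference, the workaround applied to the example flow's end step:

    # Passing gpu=None explicitly keeps the GPU constraint out of the pod spec.
    @resources(cpu=2, memory=2048, gpu=None)
    @step
    def end(self):
        print("Flow completed")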
