Here's a short example flow to illustrate the issue:
```python
from metaflow import FlowSpec, step, resources


class NonGPUFlow(FlowSpec):

    @step
    def start(self):
        print("Starting flow")
        self.next(self.end)

    # This step hangs indefinitely without any feedback, and with no
    # indication of error such as PodUnschedulable, because nothing can
    # satisfy the constraint gpu=0.
    @resources(cpu=2, memory=2048)
    @step
    def end(self):
        print("Flow completed")


if __name__ == "__main__":
    NonGPUFlow()
```
When this is run on Kubernetes, as follows, the end step can never be allocated resources:
```
$ python NonGPUFlow.py run --with kubernetes
Metaflow 2.12.10+netflix-ext(1.2.1) executing NonGPUFlow for user:davidmcguire
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2024-08-29 22:35:30.205 Workflow starting (run-id 49):
2024-08-29 22:35:34.857 [49/start/240 (pid 6056)] Task is starting.
2024-08-29 22:35:37.488 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Task is starting (Pod is pending)...
2024-08-29 22:37:06.836 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Setting up task environment.
2024-08-29 22:37:32.401 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Downloading code package...
2024-08-29 22:37:33.310 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Code package downloaded.
2024-08-29 22:37:33.411 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Task is starting.
2024-08-29 22:37:35.124 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Starting flow
2024-08-29 22:37:40.147 [49/start/240 (pid 6056)] [pod t-592d481e-nhznc-2j5cx] Task finished with exit code 0.
2024-08-29 22:37:41.789 [49/start/240 (pid 6056)] Task finished successfully.
2024-08-29 22:37:43.602 [49/end/241 (pid 15724)] Task is starting.
2024-08-29 22:37:46.470 [49/end/241 (pid 15724)] [job t-0ed12c7d-9dwb2] Task is starting (Job status is unknown)...
```
This never makes any progress on GKE, nor does it ever surface a failure.
The unsatisfiable constraint is specified in the portion of the Spec for the step that is communicated to Kubernetes, shown below (showing the default value for KUBERNETES_GPU_VENDOR):
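The original snippet is not reproduced here; as a rough, hypothetical sketch (field names and values assumed for illustration, not copied from the issue), the step's container spec presumably ends up carrying an extended-resource constraint along these lines:

```yaml
# Hypothetical sketch only, not the actual Spec from the issue.
# With KUBERNETES_GPU_VENDOR defaulting to "nvidia", the leaked gpu=0
# would surface as a vendor-specific extended resource:
resources:
  limits:
    cpu: "2"
    memory: 2048M
    nvidia.com/gpu: "0"
```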
There is a work-around of explicitly setting gpu=None in the @resources decorator, but forgetting to do this makes for an unpleasant footgun that gives no hints as to the underlying problem. Because the behavior of the KubernetesDecorator can be corrected without considering what the impact on the BatchDecorator would be (which would be the case if the default for gpu in the ResourcesDecorator were changed), this seems like an easy win for usability.
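As a self-contained sketch of why the work-around helps (the defaults dict and merge logic below are assumptions for illustration, not Metaflow's actual source), passing `gpu=None` overrides the zero default, and a None-only filter then drops the key entirely:

```python
# Illustrative sketch only: the defaults and filter are assumptions,
# not copied from Metaflow's ResourcesDecorator/KubernetesDecorator.
DEFAULTS = {"cpu": 1, "memory": 4096, "gpu": 0}  # assumed gpu default of 0


def effective_resources(**overrides):
    merged = {**DEFAULTS, **overrides}
    # Mimics a filter that drops only None values.
    return {k: v for k, v in merged.items() if v is not None}


# Without the work-around, gpu=0 survives and reaches the pod spec:
print(effective_resources(cpu=2, memory=2048))
# -> {'cpu': 2, 'memory': 2048, 'gpu': 0}

# With gpu=None, the key is filtered out entirely:
print(effective_resources(cpu=2, memory=2048, gpu=None))
# -> {'cpu': 2, 'memory': 2048}
```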
The root cause turns out to be a combination of a default value of `0` for `gpu` in the `ResourcesDecorator`, and the fact that the `KubernetesDecorator` only filters for `None`, which is not the default value.