Add support for ephemeral volume claims to kubernetes/argo #2103

Open
wants to merge 7 commits into master

Conversation

trhodeos
Contributor

@trhodeos trhodeos commented Oct 17, 2024

As defined here: https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes

This would allow a single step to have a dynamically attached "ephemeral volume" dedicated to itself (rather than a PVC, which needs to be created before running the job).
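Roughly, each entry maps onto a generic ephemeral volume in the pod spec. A minimal sketch of that shape as a Python dict (all values below are placeholders, following the Kubernetes docs linked above):

# Minimal sketch of a generic ephemeral volume entry in a pod spec
# (placeholder values; see the Kubernetes docs linked above).
ephemeral_volume = {
    "name": "my-temp-volume",
    "ephemeral": {
        "volumeClaimTemplate": {
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "storageClassName": "standard",  # placeholder storage class
                "resources": {"requests": {"storage": "1Gi"}},
            }
        }
    },
}
# The claim is created together with the pod and deleted when the pod goes away,
# so each step/attempt gets its own scratch volume.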

Tested:
with the hello-cloud.py example:

$ git diff
diff --git a/metaflow/tutorials/05-hello-cloud/hello-cloud.py b/metaflow/tutorials/05-hello-cloud/hello-cloud.py
index 20fcfe6..5240e16 100644
--- a/metaflow/tutorials/05-hello-cloud/hello-cloud.py
+++ b/metaflow/tutorials/05-hello-cloud/hello-cloud.py
@@ -1,4 +1,5 @@
 from metaflow import FlowSpec, step, kubernetes, retry
+import time
 
 
 class HelloCloudFlow(FlowSpec):
@@ -27,7 +28,7 @@ class HelloCloudFlow(FlowSpec):
 
         self.next(self.hello)
 
-    @kubernetes(cpu=1, memory=500)
+    @kubernetes(cpu=1, memory=500, ephemeral_volume_claims={"my-temp-volume": {"path": "/my_temp_volume"}})
     @retry
     @step
     def hello(self):
@@ -41,6 +42,10 @@ class HelloCloudFlow(FlowSpec):
         """
         self.message = "Hi from the cloud!"
         print("Metaflow says: %s" % self.message)
+        with open("/my_temp_volume/my_file.txt", "w") as f:
+            f.write("hello_world!")
+        with open("/my_temp_volume/my_file.txt", "r") as f:
+            print("From file: %s" % f.read())
         self.next(self.end)
 
     @step

I've tested:

  • python metaflow/tutorials/05-hello-cloud/hello-cloud.py run
  • python metaflow/tutorials/05-hello-cloud/hello-cloud.py run --with kubernetes:ephemeral_volume_claims='{"my-temp-volume":{"path":"/my_temp_volume"}}'
  • python metaflow/tutorials/05-hello-cloud/hello-cloud.py argo-workflows create + trigger
    And verified that the flow runs successfully, and that the ephemeral volume is created / destroyed as intended.

Also, I tested:

  • python metaflow/tutorials/05-hello-cloud/hello-cloud.py airflow create my_flow.py (returns an error)

I haven't yet tested @parallel because I don't have access to a cluster with JobSet installed. If I did, I would run:

  • python test/parallel/parallel_test_flow.py run --with kubernetes:ephemeral_volume_claims='{"my-temp-volume":{"path":"/my_temp_volume"}}'

@savingoyal
Collaborator

Also, can you help me with the scenarios you have been able to test (across @kubernetes, @parallel - locally as well as with argo-workflows and airflow) and the outputs? I am particularly curious about the unhappy paths - what happens when a lot of data (TBs) is written to the EBS volume, and how that impacts workload termination.

Additionally, the UX can potentially be simplified significantly, in line with the UX for persistent_volume_claims.

@@ -3445,6 +3461,26 @@ def pvc_volumes(self, pvcs=None):
)
return self

def ephemeral_volume_claims(self, claims=None):
Collaborator

thanks! how does it behave when ephemeral volumes are specified through the CLI (--with kubernetes)?

Collaborator

also, are there any use cases when the mode is not ReadWriteOnce? if there are, it would be good to document those scenarios and call out the behavior.

Contributor Author

I think this is fair. I've set the default to be ReadWriteOnce. I don't think you'd ever need it to change, but I've left the ability to override just in case.
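For example, an override would look roughly like this (illustrative only, assuming the override goes through the raw spec passthrough; a non-default mode is only useful if the underlying storage class / CSI driver actually supports it):

ephemeral_volume_claims = {
    "shared-scratch": {
        "path": "/scratch",
        # Hypothetical override of the ReadWriteOnce default via the raw spec;
        # most scratch-space use cases won't need this.
        "spec": {"accessModes": ["ReadWriteMany"]},
    }
}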

@trhodeos
Contributor Author

trhodeos commented Oct 17, 2024

Also, can you help me with the scenarios you have been able to test (across @kubernetes, @parallel - locally as well as with argo-workflows and airflow) and the outputs? I am particularly curious about the unhappy paths - what happens when a lot of data (TBs) is written to the EBS volume, and how that impacts workload termination.

I'll work on running through the scenarios below:

I don't have access to airflow, so that one will be harder to test.

re: impact on workload termination: is your concern that the step will fail to terminate if a lot of data is being written during the request to terminate?

Additionally, the UX can potentially be simplified significantly, in line with the UX for persistent_volume_claims.

Are you thinking of just having ephemeral_volumes be a dict[str, str] from name to path? I wasn't sure whether users would want to customize the storage class or other parameters in spec, which is why I made them optional.
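To make the two shapes concrete (both are illustrative, not a final API):

# Simplified UX, analogous to persistent_volume_claims: claim name -> mount path.
ephemeral_volume_claims = {"my-temp-volume": "/my_temp_volume"}

# Current PR UX: claim name -> options, with an optional raw claim spec for
# storage class, size, access modes, etc. (spec contents here are placeholders).
ephemeral_volume_claims = {
    "my-temp-volume": {
        "path": "/my_temp_volume",
        "spec": {
            "storageClassName": "standard",
            "resources": {"requests": {"storage": "10Gi"}},
        },
    }
}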

UPDATE: I added the tests I've run through to the PR description

metaflow/plugins/kubernetes/kubernetes.py (outdated; review thread resolved)
"volumeClaimTemplate": {
"metadata": values.get("metadata", None),
"spec": {
**values.get("spec", {"abc": "def"}),
Collaborator

some odd placeholder values here?

metaflow/plugins/argo/argo_workflows.py (review thread resolved)
metaflow/plugins/kubernetes/kubernetes_job.py (review thread resolved)
@saikonen
Collaborator

tested this with @parallel and it seems to be working as expected, so we can cross that off the list.

@saikonen
Collaborator

I'm going back and forth regarding the UX with spec:

On one hand, exposing the controls seems like a valid use case for fine-tuning the ephemeral storage.
On the other hand, it makes using the decorator rely on some Kubernetes implementation details (namely, what is accepted in spec), which makes usage cumbersome.

So far I'm slightly leaning towards passing the raw spec due to easier maintainability.

@trhodeos
Contributor Author

trhodeos commented Nov 4, 2024

On the other hand, it makes using the decorator rely on some Kubernetes implementation details (namely, what is accepted in spec), which makes usage cumbersome.

At a minimum, I think you'd need:

  • mount path
  • storage class
  • request size

FWIW this API has been "stable" since 1.25
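As a usage sketch covering just those three knobs (hypothetical values; mount path at the decorator level, storage class and request size through the raw spec passthrough this PR currently uses):

from metaflow import FlowSpec, step, kubernetes


class ScratchFlow(FlowSpec):

    @kubernetes(
        cpu=1,
        memory=500,
        ephemeral_volume_claims={
            "scratch": {
                "path": "/scratch",
                "spec": {
                    "storageClassName": "gp3",  # placeholder storage class
                    "resources": {"requests": {"storage": "100Gi"}},
                },
            }
        },
    )
    @step
    def start(self):
        # Anything written under /scratch lives only as long as this task's pod.
        with open("/scratch/tmp.bin", "wb") as f:
            f.write(b"\x00" * 1024)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ScratchFlow()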
