Add Ray Autoscaler to the Flyte-Ray plugin. #4187
Hi @samhita-alla, I need some help from you; I had a question.
I believe this should be the workflow to enable autoscaling. If that's the case, I am happy to create a PR for it, but if any other details are required, please let me know. 🙂
I think we also need to be able to set enableInTreeAutoscaling: true in the cluster spec for autoscaling to work (https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html#enabling-autoscaling).
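For reference, a minimal sketch of flipping that flag on the KubeRay side, assuming the ray-operator v1alpha1 Go API (the import path and field name are taken from the public KubeRay types, not from this issue):

```go
package main

import (
	"fmt"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
	"k8s.io/utils/pointer"
)

func main() {
	// Hypothetical illustration only: build a cluster spec with the in-tree
	// autoscaler enabled, which renders as `enableInTreeAutoscaling: true`.
	spec := rayv1alpha1.RayClusterSpec{
		EnableInTreeAutoscaling: pointer.Bool(true),
	}
	fmt.Println(*spec.EnableInTreeAutoscaling) // true
}
```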
@Yicheng-Lu-llll, could you clarify if we need to incorporate the changes suggested by @asingh9530 and @osevin?
I think we should support the evolving spec completely. We should make it JSON.
There is no Pydantic here, as this is in Golang, but what I would love is the ability to keep the spec evolvable as things change without sacrificing simplicity and correctness; we can brainstorm solutions.
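One possible direction for that brainstorm, sketched here purely as an illustration (the applySpecOverrides helper below is hypothetical, not part of the plugin): accept overrides to the Ray cluster spec as a raw JSON blob and unmarshal them into the KubeRay types, so new upstream fields can be used without adding a strongly-typed IDL field each time. The JSON keys follow the field names shown in the sample YAML later in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// applySpecOverrides is a hypothetical helper: it takes the RayClusterSpec the
// plugin already generated and overlays a user-supplied JSON fragment on top,
// so fields added upstream (e.g. enableInTreeAutoscaling, autoscalerOptions)
// pass through without new strongly-typed fields in the Flyte IDL.
func applySpecOverrides(base *rayv1alpha1.RayClusterSpec, overridesJSON []byte) error {
	return json.Unmarshal(overridesJSON, base)
}

func main() {
	spec := rayv1alpha1.RayClusterSpec{}
	overrides := []byte(`{"enableInTreeAutoscaling": true, "autoscalerOptions": {"idleTimeoutSeconds": 60}}`)
	if err := applySpecOverrides(&spec, overrides); err != nil {
		panic(err)
	}
	fmt.Println(*spec.EnableInTreeAutoscaling, *spec.AutoscalerOptions.IdleTimeoutSeconds)
}
```

A protobuf Struct could serve the same role as the raw JSON here; the design point is simply that the plugin would not need a new typed field for every upstream spec addition.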
@pingsutw, can you confirm if the changes @asingh9530 mentioned are the correct ones?
Hi @pingsutw, @samhita-alla, do you have any update on this?
Hey, @pingsutw is currently not available. @eapolinario, can you chime in here, please?
I think the final generated YAML should look like this:

enableInTreeAutoscaling: true
# `autoscalerOptions` is an OPTIONAL field specifying configuration overrides for the Ray Autoscaler.
# The example configuration shown below represents the DEFAULT values.
# (You may delete autoscalerOptions if the defaults are suitable.)
autoscalerOptions:
  # `upscalingMode` is "Default" or "Aggressive."
  # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
  # Default: Upscaling is not rate-limited.
  # Aggressive: An alias for Default; upscaling is not rate-limited.
  upscalingMode: Default
  # `idleTimeoutSeconds` is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
  idleTimeoutSeconds: 60
  # `image` optionally overrides the Autoscaler's container image. The Autoscaler uses the same image as the Ray container by default.
  # image: "my-repo/my-custom-autoscaler-image:tag"
  # `imagePullPolicy` optionally overrides the Autoscaler container's default image pull policy (IfNotPresent).
  imagePullPolicy: IfNotPresent
  # Optionally specify the Autoscaler container's securityContext.
  securityContext: {}
  env: []
  envFrom: []
  # resources specifies optional resource request and limit overrides for the Autoscaler container.
  # The default Autoscaler resource limits and requests should be sufficient for production use cases.
  # However, for large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
  resources:
    limits:
      cpu: "500m"
      memory: "512Mi"
    requests:
      cpu: "500m"
      memory: "512Mi"
# Ray head pod template

To do so, we should:
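As a hedged illustration of where those values would land on the Go side of the plugin, here is a sketch that populates the corresponding KubeRay fields to match the defaults in the YAML above; the import path, type names, and the buildAutoscalingSpec helper are assumptions based on the public ray-operator API rather than code from the Flyte plugin.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/utils/pointer"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// buildAutoscalingSpec is a hypothetical helper that mirrors the YAML above:
// enableInTreeAutoscaling plus the default autoscalerOptions values.
func buildAutoscalingSpec() rayv1alpha1.RayClusterSpec {
	upscalingMode := rayv1alpha1.UpscalingMode("Default")
	pullPolicy := corev1.PullIfNotPresent
	defaults := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("500m"),
		corev1.ResourceMemory: resource.MustParse("512Mi"),
	}
	return rayv1alpha1.RayClusterSpec{
		EnableInTreeAutoscaling: pointer.Bool(true),
		AutoscalerOptions: &rayv1alpha1.AutoscalerOptions{
			UpscalingMode:      &upscalingMode,
			IdleTimeoutSeconds: pointer.Int32(60),
			ImagePullPolicy:    &pullPolicy,
			Resources: &corev1.ResourceRequirements{
				Limits:   defaults,
				Requests: defaults,
			},
		},
	}
}

func main() {
	_ = buildAutoscalingSpec()
}
```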
We will close this issue, as the corresponding change has already been merged.
Motivation: Why do you think this is important?
Currently, the Flyte-Ray plugin utilizes RayJob. However, there are cases where a RayJob may require an autoscaler.
For instance, after completing a workload with RayJob, a user might want to retain all the information, logs, past tasks, and actor execution history for a period. As of now, Ray lacks a mechanism to persist this data, necessitating the continuous operation of the Ray cluster even after workload completion. With an autoscaler, the Ray cluster will maintain only the head pod while scaling down all worker pods.
Goal: What should the final outcome look like, ideally?
Have a config option to enable the Ray Autoscaler.
Describe alternatives you've considered
None
Propose: Link/Inline OR Additional context
None
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?