Add Ray Autoscaler to the Flyte-Ray plugin. #4187
Hi @samhita-alla, I need some help from you; I had a question.
I believe this should be the workflow to enable autoscaling. If that's the case, I am happy to create a PR for it, but if any other details are required, please let me know. 🙂
I think we also need to be able to set enableInTreeAutoscaling: true in the cluster spec for autoscaling to work (https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html#enabling-autoscaling).
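For reference, a minimal sketch of flipping that flag on the KubeRay side, assuming the ray-operator v1alpha1 Go API (the import path and field name are taken from the public KubeRay types, not from this issue):

```go
package main

import (
	"fmt"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
	"k8s.io/utils/pointer"
)

func main() {
	// Hypothetical illustration only: build a cluster spec with the in-tree
	// autoscaler enabled, which renders as `enableInTreeAutoscaling: true`.
	spec := rayv1alpha1.RayClusterSpec{
		EnableInTreeAutoscaling: pointer.Bool(true),
	}
	fmt.Println(*spec.EnableInTreeAutoscaling) // true
}
```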
@Yicheng-Lu-llll, could you clarify if we need to incorporate the changes suggested by @asingh9530 and @osevin?
I think we should support the evolving spec completely. We should make it JSON.
There is no Pydantic here, as this is in Golang, but what I would love is the ability to keep the spec evolvable as things change without sacrificing simplicity and correctness; we can brainstorm solutions.
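One possible direction for that brainstorm, sketched here purely as an illustration (the applySpecOverrides helper below is hypothetical, not part of the plugin): accept overrides to the Ray cluster spec as a raw JSON blob and unmarshal them into the KubeRay types, so new upstream fields can be used without adding a strongly-typed IDL field each time. The JSON keys follow the field names shown in the sample YAML later in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// applySpecOverrides is a hypothetical helper: it takes the RayClusterSpec the
// plugin already generated and overlays a user-supplied JSON fragment on top,
// so fields added upstream (e.g. enableInTreeAutoscaling, autoscalerOptions)
// pass through without new strongly-typed fields in the Flyte IDL.
func applySpecOverrides(base *rayv1alpha1.RayClusterSpec, overridesJSON []byte) error {
	return json.Unmarshal(overridesJSON, base)
}

func main() {
	spec := rayv1alpha1.RayClusterSpec{}
	overrides := []byte(`{"enableInTreeAutoscaling": true, "autoscalerOptions": {"idleTimeoutSeconds": 60}}`)
	if err := applySpecOverrides(&spec, overrides); err != nil {
		panic(err)
	}
	fmt.Println(*spec.EnableInTreeAutoscaling, *spec.AutoscalerOptions.IdleTimeoutSeconds)
}
```

A protobuf Struct could serve the same role as the raw JSON here; the design point is simply that the plugin would not need a new typed field for every upstream spec addition.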
@pingsutw, can you confirm if the changes @asingh9530 mentioned are the correct ones?
Hi @pingsutw, @samhita-alla, do you have any update on this?
Hey, @pingsutw is currently not available. @eapolinario, can you chime in here, please?
I think the final generated YAML should look like this:

enableInTreeAutoscaling: true
# `autoscalerOptions` is an OPTIONAL field specifying configuration overrides for the Ray Autoscaler.
# The example configuration shown below represents the DEFAULT values.
# (You may delete autoscalerOptions if the defaults are suitable.)
autoscalerOptions:
  # `upscalingMode` is "Default" or "Aggressive."
  # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
  # Default: Upscaling is not rate-limited.
  # Aggressive: An alias for Default; upscaling is not rate-limited.
  upscalingMode: Default
  # `idleTimeoutSeconds` is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
  idleTimeoutSeconds: 60
  # `image` optionally overrides the Autoscaler's container image. The Autoscaler uses the same image as the Ray container by default.
  # image: "my-repo/my-custom-autoscaler-image:tag"
  # `imagePullPolicy` optionally overrides the Autoscaler container's default image pull policy (IfNotPresent).
  imagePullPolicy: IfNotPresent
  # Optionally specify the Autoscaler container's securityContext.
  securityContext: {}
  env: []
  envFrom: []
  # resources specifies optional resource request and limit overrides for the Autoscaler container.
  # The default Autoscaler resource limits and requests should be sufficient for production use cases.
  # However, for large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
  resources:
    limits:
      cpu: "500m"
      memory: "512Mi"
    requests:
      cpu: "500m"
      memory: "512Mi"
# Ray head pod template

To do so, we should:
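As a hedged illustration of where those values would land on the Go side of the plugin, here is a sketch that populates the corresponding KubeRay fields to match the defaults in the YAML above; the import path, type names, and the buildAutoscalingSpec helper are assumptions based on the public ray-operator API rather than code from the Flyte plugin.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/utils/pointer"

	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// buildAutoscalingSpec is a hypothetical helper that mirrors the YAML above:
// enableInTreeAutoscaling plus the default autoscalerOptions values.
func buildAutoscalingSpec() rayv1alpha1.RayClusterSpec {
	upscalingMode := rayv1alpha1.UpscalingMode("Default")
	pullPolicy := corev1.PullIfNotPresent
	defaults := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("500m"),
		corev1.ResourceMemory: resource.MustParse("512Mi"),
	}
	return rayv1alpha1.RayClusterSpec{
		EnableInTreeAutoscaling: pointer.Bool(true),
		AutoscalerOptions: &rayv1alpha1.AutoscalerOptions{
			UpscalingMode:      &upscalingMode,
			IdleTimeoutSeconds: pointer.Int32(60),
			ImagePullPolicy:    &pullPolicy,
			Resources: &corev1.ResourceRequirements{
				Limits:   defaults,
				Requests: defaults,
			},
		},
	}
}

func main() {
	_ = buildAutoscalingSpec()
}
```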
We will close this issue, as the corresponding change has already been merged.
Motivation: Why do you think this is important?
Currently, the Flyte-Ray plugin utilizes RayJob. However, there are cases where a RayJob may require an autoscaler.
For instance, after completing a workload with RayJob, a user might want to retain all the information, logs, past tasks, and actor execution history for a period. As of now, Ray lacks a mechanism to persist this data, necessitating the continuous operation of the Ray cluster even after workload completion. With an autoscaler, the Ray cluster will maintain only the head pod while scaling down all worker pods.
Goal: What should the final outcome look like, ideally?
Have a config option to enable the Ray Autoscaler.
Describe alternatives you've considered
None
Propose: Link/Inline OR Additional context
None
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?