[KubeRay] support suspending worker groups in KubeRay autoscaler #49768

rueian · 2025-01-10T18:38:01Z

ray-project/kuberay#2663 adds a new suspend field to the KubeRay worker group spec for suspending worker groups. A suspended worker group should be scaled to 0 and never be scaled up until the group is resumed.

Since the available_node_types definition has no equivalent way to suspend a node type, the best way to let the autoscaler know that a worker group has been suspended is to set both its max_workers and min_workers to 0.

This PR makes the KubeRay autoscaling config producer set both max_workers and min_workers to 0 for a suspended worker group, which tells the autoscaler that the group should have no nodes. The autoscaler periodically reads the config and acts on it.
The autoscaler will then try to scale down the suspended group. However, this PR filters out the resulting k8s patches to the suspended group because we want to keep the original replicas values on the RayCluster CR. The KubeRay operator scales the suspended group down instead.
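To illustrate the patch filtering described above, here is a minimal sketch, not the code in this PR. The function name `build_replica_patches` and the `desired_replicas` mapping are hypothetical; the sketch only assumes that the RayCluster CR lists worker groups under `spec.workerGroupSpecs` with `groupName`, `replicas`, and the new `suspend` field.

```python
from typing import Any, Dict, List


def build_replica_patches(
    raycluster: Dict[str, Any], desired_replicas: Dict[str, int]
) -> List[Dict[str, Any]]:
    """Build JSON-patch operations for worker group replicas.

    Hypothetical helper: groups marked as suspended are skipped so that
    their original replicas value stays on the RayCluster CR.
    """
    patches: List[Dict[str, Any]] = []
    groups = raycluster["spec"].get("workerGroupSpecs", [])
    for index, group in enumerate(groups):
        name = group["groupName"]
        if name not in desired_replicas:
            continue
        if group.get("suspend"):
            # Do not touch replicas of a suspended group; the KubeRay
            # operator is expected to scale it down itself.
            continue
        patches.append(
            {
                "op": "replace",
                "path": f"/spec/workerGroupSpecs/{index}/replicas",
                "value": desired_replicas[name],
            }
        )
    return patches
```

With one suspended group and one active group, only the active group would receive a patch, and the suspended group's replicas value on the CR is left as it was.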

Related issue number

ray-project/kuberay#2666
ray-project/kuberay#2663

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Resolves ray-project/kuberay#2666.

ray-project/kuberay#2663 adds a new `suspend` field to the KubeRay worker group spec for suspending worker groups. A suspended worker group should be scaled to 0 and never be scaled up until the group is resumed.

Since the `available_node_types` definition has no similar functionality, the best way to let the autoscaler know a worker group has been suspended is to set both its max_workers and min_workers to 0.

This PR makes the KubeRay autoscaling config producer produce a config with both max_workers and min_workers set to 0 if the worker group has been suspended. The autoscaler will periodically take the config and do its work.

Signed-off-by: Rueian <[email protected]>

rueian · 2025-01-10T22:53:34Z

Hi @kevin85421, could you help review this?

rueian · 2025-01-11T04:45:05Z

python/ray/autoscaler/_private/kuberay/autoscaling_config.py

-        "min_workers": min_workers,
-        "max_workers": max_workers,
+        "min_workers": min_workers if not suspend else 0,
+        "max_workers": max_workers if not suspend else 0,


We need to let the autoscaler know that the suspended worker group should have no workers; otherwise, the autoscaler will keep trying to scale it up.

Since the available_node_types definition does not provide similar functionality (suspending a node type), the best way to do that is to set its max_workers and min_workers to 0.
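The effect of the change above can be sketched in isolation. This is an illustrative stand-alone function, not the actual code in autoscaling_config.py: the name `worker_limits` is hypothetical, and it assumes the worker group spec carries `minReplicas`, `maxReplicas`, and an optional boolean `suspend`.

```python
from typing import Any, Dict


def worker_limits(group_spec: Dict[str, Any]) -> Dict[str, int]:
    """Derive min_workers/max_workers for one worker group spec.

    Hypothetical helper mirroring the diff: a suspended group gets both
    limits forced to 0 so the autoscaler never scales it up.
    """
    suspend = bool(group_spec.get("suspend", False))
    min_workers = group_spec["minReplicas"]
    max_workers = group_spec["maxReplicas"]
    return {
        "min_workers": min_workers if not suspend else 0,
        "max_workers": max_workers if not suspend else 0,
    }
```

A group with `minReplicas: 1` and `maxReplicas: 5` keeps those limits while active and reports 0 for both once `suspend` is true. Resuming the group restores the original limits, because they are never overwritten on the spec itself.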

rueian · 2025-01-11T04:49:14Z

python/ray/tests/kuberay/test_autoscaling_config.py

@@ -312,6 +330,14 @@ def test_resource_quantity(input: str, output: int):
            None,
            id="autoscaler-options",
        ),
+        pytest.param(
+            _get_ray_cr_with_groups_suspended(),
+            _get_autoscaling_config_with_groups_suspended(),


This tests that both max_workers and min_workers are set to 0 in the generated autoscaling config.
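The fixture helpers in the diff are not shown in full, so the following is only a guess at the general shape of such a test case, under stated assumptions. The helper names `with_groups_suspended` and `expected_config_when_suspended` are hypothetical, and the sketch assumes the autoscaling config uses the `head_node_type` and `available_node_types` keys.

```python
import copy
from typing import Any, Dict


def with_groups_suspended(ray_cr: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the RayCluster CR with every worker group suspended."""
    suspended = copy.deepcopy(ray_cr)
    for group in suspended["spec"]["workerGroupSpecs"]:
        group["suspend"] = True
    return suspended


def expected_config_when_suspended(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the expected config with worker limits set to 0.

    The head node type is left unchanged; only worker node types are
    expected to report min_workers == max_workers == 0.
    """
    expected = copy.deepcopy(config)
    head = expected["head_node_type"]
    for name, node_type in expected["available_node_types"].items():
        if name != head:
            node_type["min_workers"] = 0
            node_type["max_workers"] = 0
    return expected
```

Deriving the suspended inputs and expectations from the existing unsuspended fixtures keeps the two cases in sync, which is one reasonable design for a parametrized test like the one added here.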

python/ray/tests/kuberay/test_kuberay_node_provider.py

rueian marked this pull request as ready for review January 10, 2025 22:50

rueian requested review from hongchaodeng and a team as code owners January 10, 2025 22:50

jcotant1 added the core Issues that should be addressed in Ray Core label Jan 11, 2025


rueian force-pushed the kuberay-autoscaling-suspend branch from afed960 to fa119fd Compare January 14, 2025 03:02

rueian mentioned this pull request Jan 15, 2025

[Feature] Add force to worker group suspend API ray-project/kuberay#2744
