[KubeRay] support suspending worker groups in KubeRay autoscaler #49768

Open
wants to merge 1 commit into
base: master

rueian
Contributor

@rueian rueian commented Jan 10, 2025

Resolves ray-project/kuberay#2666.

ray-project/kuberay#2663 adds a new suspend field to the KubeRay worker group spec for suspending worker groups. A suspended worker group should be scaled to 0 and never be scaled up until the group is resumed.

Since the available_node_types definition does not provide similar functionality (suspending a node type), the best way to let the autoscaler know a worker group has been suspended is to set its max_workers and min_workers to 0.

  1. This PR makes the KubeRay autoscaling config producer emit an autoscaling config in which both max_workers and min_workers of a suspended worker group are set to 0, which tells the autoscaler that the group should have no nodes. The autoscaler periodically reads this config and acts on it.
  2. The autoscaler will then try to scale down the suspended group. However, this PR filters out the resulting Kubernetes patches for the suspended group, because the original replicas values on the RayCluster CR should be preserved. The KubeRay operator scales the suspended group down instead.
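
The filtering in step 2 could look roughly like the sketch below. This is an illustration only, not the code in this PR: the helper names `suspended_group_indices` and `drop_patches_for_suspended_groups` are hypothetical, and the sketch assumes that scale requests are expressed as JSON patches whose `path` starts with `/spec/workerGroupSpecs/<index>/`.

```python
from typing import Any, Dict, List, Set


def suspended_group_indices(raycluster: Dict[str, Any]) -> Set[int]:
    """Return the positions of worker groups whose spec has suspend set to true."""
    groups = raycluster["spec"].get("workerGroupSpecs", [])
    return {i for i, group in enumerate(groups) if group.get("suspend") is True}


def drop_patches_for_suspended_groups(
    raycluster: Dict[str, Any], patches: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Keep only the patches that do not target a suspended worker group."""
    # The trailing slash keeps index 1 from also matching index 10, 11, ...
    prefixes = tuple(
        f"/spec/workerGroupSpecs/{i}/" for i in suspended_group_indices(raycluster)
    )
    if not prefixes:
        return list(patches)
    return [p for p in patches if not p["path"].startswith(prefixes)]


# Hypothetical RayCluster CR with one active and one suspended worker group.
raycluster = {
    "spec": {
        "workerGroupSpecs": [
            {"groupName": "cpu-group", "replicas": 2},
            {"groupName": "gpu-group", "replicas": 3, "suspend": True},
        ]
    }
}
patches = [
    {"op": "replace", "path": "/spec/workerGroupSpecs/0/replicas", "value": 1},
    {"op": "replace", "path": "/spec/workerGroupSpecs/1/replicas", "value": 0},
]
kept = drop_patches_for_suspended_groups(raycluster, patches)
```

In this example only the patch for the first group is kept, so the replicas value of the suspended group on the CR stays at 3 and the operator remains responsible for scaling it down.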

Related issue number

ray-project/kuberay#2666
ray-project/kuberay#2663

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Resolves ray-project/kuberay#2666.

ray-project/kuberay#2663 adds a new `suspend` field to the
KubeRay worker group spec for suspending worker groups. A suspended worker group should
be scaled to 0 and never be scaled up until the group is resumed.

Since there is no similar functionality in the `available_node_types` definition,
the best way to let the autoscaler know a worker group has been suspended is to
set its max_workers to 0 as well as its min_workers.

This PR makes the KubeRay autoscaling config producer produce a config with
both max_workers and min_workers set to 0 if the worker group has been suspended.
The autoscaler will periodically take the config and do its work.

Signed-off-by: Rueian <[email protected]>
@rueian rueian marked this pull request as ready for review January 10, 2025 22:50
@rueian rueian requested review from hongchaodeng and a team as code owners January 10, 2025 22:50
@rueian
Contributor Author

rueian commented Jan 10, 2025

Hi @kevin85421, could you help review this?

@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Jan 11, 2025
- "min_workers": min_workers,
- "max_workers": max_workers,
+ "min_workers": min_workers if not suspend else 0,
+ "max_workers": max_workers if not suspend else 0,
Contributor Author

We need to let the autoscaler know that the suspended worker group should have no workers; otherwise, the autoscaler will keep trying to scale it up.

Contributor Author

Since the available_node_types definition does not provide similar functionality (suspending a node type), the best way to do that is to set its max_workers and min_workers to 0.
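
As a self-contained illustration of that mapping, the sketch below derives the two limits from a single worker group spec. The function name `node_type_limits` is hypothetical and is not the helper changed in this PR; the sketch assumes the group spec carries `minReplicas`, `maxReplicas`, and an optional boolean `suspend` field.

```python
from typing import Any, Dict


def node_type_limits(group_spec: Dict[str, Any]) -> Dict[str, int]:
    """Compute min_workers/max_workers for one worker group.

    A suspended group gets 0 for both limits, so the autoscaler neither
    keeps a minimum number of workers nor launches new ones for it.
    """
    suspend = group_spec.get("suspend") is True
    min_workers = group_spec["minReplicas"]
    max_workers = group_spec["maxReplicas"]
    return {
        "min_workers": min_workers if not suspend else 0,
        "max_workers": max_workers if not suspend else 0,
    }


active = node_type_limits({"minReplicas": 1, "maxReplicas": 5})
suspended = node_type_limits({"minReplicas": 1, "maxReplicas": 5, "suspend": True})
```

Here `active` keeps the limits from the spec, while `suspended` has both limits forced to 0. Setting only max_workers to 0 would leave min_workers above max_workers, which is why both are zeroed together.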

@@ -312,6 +330,14 @@ def test_resource_quantity(input: str, output: int):
None,
id="autoscaler-options",
),
pytest.param(
_get_ray_cr_with_groups_suspended(),
_get_autoscaling_config_with_groups_suspended(),
Contributor Author

This tests that both max_workers and min_workers are set to 0 in the generated autoscaling config.

python/ray/tests/kuberay/test_kuberay_node_provider.py (outdated, resolved)
Labels
core Issues that should be addressed in Ray Core
Development

Successfully merging this pull request may close these issues.

[Feature] Don't submit scale requests if the worker group is suspended
2 participants