[KubeRay] support suspending worker groups in KubeRay autoscaler #49768
Resolves ray-project/kuberay#2666. ray-project/kuberay#2663 adds a new `suspend` field to the KubeRay worker group spec for suspending worker groups. A suspended worker group should be scaled to 0 and never be scaled up until the group is resumed. Since there is no similar functionality in the `available_node_types` definition, the best way to let the autoscaler know a worker group has been suspended is to set its max_workers to 0 as well as its min_workers. This PR makes the KubeRay autoscaling config producer produce a config with both max_workers and min_workers set to 0 if the worker group has been suspended. The autoscaler will periodically take the config and do its work. Signed-off-by: Rueian <[email protected]>
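A minimal sketch of the mapping described above, assuming a worker group spec is available as a plain dict. The helper name `node_type_limits` is hypothetical, and reading the limits from `minReplicas` and `maxReplicas` is an assumption made for illustration; only the `suspend` field and the zeroing of `min_workers` and `max_workers` come from this PR.

```python
from typing import Any, Dict


def node_type_limits(group_spec: Dict[str, Any]) -> Dict[str, int]:
    """Hypothetical helper: derive worker limits for one worker group.

    Assumes the group spec carries `minReplicas`, `maxReplicas`, and an
    optional `suspend` flag. Not the actual config producer code.
    """
    # A missing or null `suspend` field is treated as "not suspended".
    suspend = bool(group_spec.get("suspend", False))
    min_workers = int(group_spec.get("minReplicas", 0))
    max_workers = int(group_spec.get("maxReplicas", 0))
    return {
        # Both limits collapse to 0 while the group is suspended.
        "min_workers": min_workers if not suspend else 0,
        "max_workers": max_workers if not suspend else 0,
    }


active = node_type_limits({"minReplicas": 1, "maxReplicas": 5})
suspended = node_type_limits({"minReplicas": 1, "maxReplicas": 5, "suspend": True})
```

Resuming the group (clearing `suspend`) restores the original limits on the next config refresh, since nothing is persisted by the helper.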
Hi @kevin85421, could you help review this?
- "min_workers": min_workers,
- "max_workers": max_workers,
+ "min_workers": min_workers if not suspend else 0,
+ "max_workers": max_workers if not suspend else 0,
We need to let the autoscaler know that the suspended worker group should have no workers; otherwise, the autoscaler will keep trying to scale it up.
Since the `available_node_types` definition does not provide similar functionality (suspending a node type), the best way to do that is to set its max_workers and min_workers to 0.
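To illustrate why both limits need to be 0, here is a toy model of clamping a demand-driven target between `min_workers` and `max_workers`. The function `clamp_target` is hypothetical and is not the Ray autoscaler's actual logic; it only shows that zeroing `min_workers` alone would still leave room to scale up.

```python
def clamp_target(demand: int, min_workers: int, max_workers: int) -> int:
    """Toy model: clamp the demanded worker count into [min_workers, max_workers]."""
    return max(min_workers, min(demand, max_workers))


# Normal group: the target follows demand within the limits.
normal = clamp_target(demand=3, min_workers=1, max_workers=5)

# Only min_workers zeroed: pending demand can still scale the group up.
min_only = clamp_target(demand=3, min_workers=0, max_workers=5)

# Both limits zeroed (suspended): the target is 0 regardless of demand.
suspended = clamp_target(demand=3, min_workers=0, max_workers=0)
```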
@@ -312,6 +330,14 @@ def test_resource_quantity(input: str, output: int):
             None,
             id="autoscaler-options",
         ),
+        pytest.param(
+            _get_ray_cr_with_groups_suspended(),
+            _get_autoscaling_config_with_groups_suspended(),
This tests that both max_workers and min_workers are set to 0 in the generated autoscaling config.
afed960 to fa119fd
Resolves ray-project/kuberay#2666.
ray-project/kuberay#2663 adds a new `suspend` field to the KubeRay worker group spec for suspending worker groups. A suspended worker group should be scaled to 0 and never be scaled up until the group is resumed.
Since the `available_node_types` definition does not provide similar functionality (suspending a node type), the best way to let the autoscaler know a worker group has been suspended is to set its max_workers and min_workers to 0.
This PR makes the KubeRay autoscaling config producer produce a config with the `max_workers` and `min_workers` of a suspended worker group set to 0 to inform the autoscaler that the suspended group should not have nodes. The autoscaler will periodically take the config and do its work.
[...] `replicas` values on the RayCluster CR. The suspended group will be scaled down by the KubeRay operator instead.
Related issue number
ray-project/kuberay#2666
ray-project/kuberay#2663
Checks
[...] `git commit -s`) in this PR.
[...] `scripts/format.sh` to lint the changes in this PR.
[...] method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.