Clear cluster state on head node fail #194

Open
approxit opened this issue Feb 23, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@approxit
Contributor

approxit commented Feb 23, 2024

As we get closer to supporting multiple clusters, our use of ray up / ray down / ray-on-golem start / ray-on-golem stop is intensifying and we are encountering new problems. When head node creation fails during a fresh ray up, the intuitive response is to call ray up again. The problem is that the webserver still holds the existing "corrupted" state, so retrying ray up makes no progress. The user has to know that a manual call to ray-on-golem stop is required before proceeding. Let's address that.

Ray has no concept of a cluster the way we do, so we can tie our cluster's lifetime to the fate of the head node: in Ray, the head node acts as the central, single point of state.

If head node setup fails, the webserver needs to clean up the whole cluster state so that it is ready for the next ray up call.
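A minimal sketch of the proposed behavior, not the actual ray-on-golem code: every name here (`ClusterState`, `start_head_node`, `HeadNodeSetupError`, the `setup` callable) is hypothetical and only illustrates the idea that a head node setup failure should wipe the whole cluster state before the error is reported.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


class HeadNodeSetupError(Exception):
    """Raised when the (hypothetical) head node setup step fails."""


@dataclass
class ClusterState:
    """Hypothetical stand-in for the state the webserver keeps per cluster."""

    nodes: Dict[str, str] = field(default_factory=dict)  # node_id -> status

    def clear(self) -> None:
        # Drop everything so the next `ray up` starts from a clean slate.
        self.nodes.clear()


def start_head_node(state: ClusterState, setup: Callable[[], None]) -> None:
    """Register the head node, run its setup, and wipe all state on failure."""
    state.nodes["head"] = "pending"
    try:
        setup()
    except Exception as exc:
        # The head node is the single point of state, so its failure
        # invalidates the whole cluster, not just this one node.
        state.clear()
        raise HeadNodeSetupError("head node setup failed") from exc
    state.nodes["head"] = "running"


def failing_setup() -> None:
    # Simulated failure; the real cause would come from the provider.
    raise RuntimeError("simulated head node failure")


state = ClusterState()
try:
    start_head_node(state, failing_setup)
except HeadNodeSetupError:
    pass

# A retry now begins from empty state instead of the "corrupted" one.
retry_starts_clean = state.nodes == {}
start_head_node(state, lambda: None)
```

With the cleanup placed in the failure path, the second call behaves like a fresh ray up and no manual ray-on-golem stop is needed in between.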

mateuszsrebrny added the enhancement and bug labels and removed the enhancement label on Mar 7, 2024
@mateuszsrebrny
Contributor

It requires refactoring of the ray service and the golem service.
It is a valid UX issue.
