Unhandled RayOnGolemClientError exception during ray down #119

Open
lucekdudek opened this issue Nov 14, 2023 · 1 comment

Comments

@lucekdudek
Contributor

ray down -y golem-cluster.dev.yaml 
2023-11-14 15:08:15,139 WARNING util.py:251 -- Dropping the empty legacy field head_node. head_nodeis not supported for ray>=2.0.0. It is recommended to removehead_node from the cluster config.
2023-11-14 15:08:15,139 WARNING util.py:251 -- Dropping the empty legacy field worker_nodes. worker_nodesis not supported for ray>=2.0.0. It is recommended to removeworker_nodes from the cluster config.
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
2023-11-14 15:08:15,145 WARNING util.py:251 -- Dropping the empty legacy field head_node. head_nodeis not supported for ray>=2.0.0. It is recommended to removehead_node from the cluster config.
2023-11-14 15:08:15,145 WARNING util.py:251 -- Dropping the empty legacy field worker_nodes. worker_nodesis not supported for ray>=2.0.0. It is recommended to removeworker_nodes from the cluster config.
Ray On Golem webserver
  Not starting webserver, as it's already running
Fetched IP: 192.168.0.3
Stopped only 5 out of 6 Ray processes within the grace period 16 seconds. Set `-v` to see more details. Remaining processes [psutil.Process(pid=751, name='gcs_server', status='terminated', started='13:56:16')] will be forcefully terminated.
You can also use `--force` to forcefully terminate processes or set higher `--grace-period` to wait longer time for proper termination.
Shared connection to 192.168.0.3 closed.
2023-11-14 15:08:24,367 INFO node_provider.py:173 -- NodeProvider: node0: Terminating node
Traceback (most recent call last):
  File "/home/lucjan/Repos/golem-ray/.venv/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/ray/scripts/scripts.py", line 1337, in down
    teardown_cluster(
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/ray/autoscaler/_private/commands.py", line 548, in teardown_cluster
    provider.terminate_nodes(A)
  File "/home/lucjan/Repos/golem-ray/.venv/lib/python3.10/site-packages/ray/autoscaler/node_provider.py", line 174, in terminate_nodes
    self.terminate_node(node_id)
  File "/home/lucjan/Repos/golem-ray/ray_on_golem/provider/node_provider.py", line 138, in terminate_node
    terminated_nodes = self._ray_on_golem_client.terminate_node(node_id)
  File "/home/lucjan/Repos/golem-ray/ray_on_golem/client/client.py", line 58, in terminate_node
    response = self._make_request(
  File "/home/lucjan/Repos/golem-ray/ray_on_golem/client/client.py", line 198, in _make_request
    raise RayOnGolemClientError(f"{error_message}: {response.text}")
ray_on_golem.client.exceptions.RayOnGolemClientError: Couldn't terminate node: 500 Internal Server Error

Server got itself in trouble
...
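
The traceback shows the error raised in `_make_request` passing through the provider's `terminate_node` and up into Ray's `teardown_cluster` without being caught. One possible way to handle it, as a minimal sketch only: catch the exception where the provider calls the client and log it instead of letting it propagate. The exception class below is a local stand-in for `ray_on_golem.client.exceptions.RayOnGolemClientError`, and `FailingClient`, `WorkingClient`, and `terminate_node_safely` are hypothetical names used for illustration; they are not part of the ray-on-golem code base.

```python
import logging

logger = logging.getLogger(__name__)


class RayOnGolemClientError(Exception):
    """Local stand-in for ray_on_golem.client.exceptions.RayOnGolemClientError."""


class FailingClient:
    """Hypothetical client that always fails, mimicking the 500 response above."""

    def terminate_node(self, node_id):
        raise RayOnGolemClientError(
            "Couldn't terminate node: 500 Internal Server Error"
        )


class WorkingClient:
    """Hypothetical client that succeeds and reports the terminated node."""

    def terminate_node(self, node_id):
        return {node_id: {}}


def terminate_node_safely(client, node_id):
    """Terminate a node; on a client error, log a warning and return None.

    This is an assumed handling strategy, not the project's actual behavior.
    """
    try:
        return client.terminate_node(node_id)
    except RayOnGolemClientError as error:
        logger.warning("Failed to terminate node %s: %s", node_id, error)
        return None


failed = terminate_node_safely(FailingClient(), "node0")
succeeded = terminate_node_safely(WorkingClient(), "node0")
```

Whether swallowing the error is appropriate depends on whether `ray down` should keep going when the webserver fails; re-raising a shorter, user-facing message would be an alternative.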
@shadeofblue
Contributor

@lucekdudek, does this still happen?
