[Bug] Unable to succeed in selecting a random port #22352
Comments
Hi @ckw017, can I assign this to you? |
@architkulkarni Yep, that's fine @birgerbr As a sanity check, in the logs how many Also, what version of Ray were you on before upgrading? |
We were using version 1.9.2. There are 16 |
Got it. If possible can you share the full If you still have the cluster up, or if you run into this again can you try running this on the head node of the cluster:
|
Here is the log file FYI: After the cluster was restarted I found another issue with our cluster. Not sure these issues can be connected, but the other issue was that the latest ray operator image for Kubernetes had some changes that made it incompatible with our current configuration. We solved that by using the 1.10.0 operator image for now. I'm assuming that we will need to update our configuration as done here f51566e when we update to ray 1.11.0. |
Sounds good. Digging through the logs, it looks like it hit "Server startup failed" 1,000 times before it ran into "Unable to succeed in selecting a random port", which likely explains how all the ports were exhausted. If the incompatible image is what's causing the server failures, then reverting might resolve this as well. |
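To illustrate the exhaustion mechanism described in the comment above, here is a toy model. It is not Ray's proxy code: the `PortPool` class, the `start_server` and `failures_until_exhausted` helpers, the starting port 50000, and the pool size are all assumptions made for this sketch. The only details taken from the thread are the roughly 1,000 failed startups and the final error message.

```python
class PortPool:
    """Toy pool of candidate ports (hypothetical, not Ray's implementation)."""

    def __init__(self, first_port: int, size: int) -> None:
        self._free = list(range(first_port, first_port + size))

    def acquire(self) -> int:
        # Fail once every candidate port has been handed out.
        if not self._free:
            raise RuntimeError("Unable to succeed in selecting a random port.")
        return self._free.pop()

    def release(self, port: int) -> None:
        self._free.append(port)


def start_server(pool: PortPool, release_on_failure: bool) -> None:
    """Simulate a server startup that always fails after taking a port."""
    port = pool.acquire()
    try:
        raise OSError("Server startup failed")
    except OSError:
        # If the port is not handed back here, it is lost for good.
        if release_on_failure:
            pool.release(port)


def failures_until_exhausted(size: int, release_on_failure: bool, attempts: int) -> int:
    """Return how many startups failed before the pool ran dry, or -1 if it never did."""
    pool = PortPool(first_port=50000, size=size)  # 50000 is an arbitrary choice
    for attempt in range(attempts):
        try:
            start_server(pool, release_on_failure)
        except RuntimeError:
            return attempt
    return -1
```

In this model, a pool of 1,000 ports that never gets a port back runs dry after exactly 1,000 failed startups, while a pool that releases the port on failure never does. Whether Ray 1.10.0 behaves this way is not confirmed anywhere in this thread.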
Our cluster is again getting "Server startup failed", and the operator is running The "Unable to succeed in selecting a random port." were not in the logs, but they might have come if we had let it continue. Your script above ran without any issues. |
Hello, I think I am hitting a similar issue. My issue has nothing to do with ports, but I do see "Server startup failed". If my issue isn't related, I can file a new one.
My Ray cluster is installed through the Helm chart in this repo. The operator and head node are both using version 1.10.0.
This line fails for me:
However, this does work if I am on a node in the cluster, e.g.
Here is the client traceback:
>>> ray.init("ray://mycluster.internal:10001", runtime_env={"pip": ["torchaudio==0.10.0", "boto3"]})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/venv38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/venv38/lib/python3.8/site-packages/ray/worker.py", line 785, in init
    return builder.connect()
  File "/venv38/lib/python3.8/site-packages/ray/client_builder.py", line 151, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "/venv38/lib/python3.8/site-packages/ray/util/client_connect.py", line 33, in connect
    conn = ray.connect(
  File "/venv38/lib/python3.8/site-packages/ray/util/client/__init__.py", line 228, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/venv38/lib/python3.8/site-packages/ray/util/client/__init__.py", line 88, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/venv38/lib/python3.8/site-packages/ray/util/client/worker.py", line 697, in _server_init
    raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 623, in Datapath
    if not self.proxy_manager.start_specific_server(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 279, in start_specific_server
    serialized_runtime_env_context = self._create_runtime_env(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 233, in _create_runtime_env
    raise RuntimeError(
RuntimeError: Failed to create runtime_env for Ray client server: Failed to install pip requirements:
Collecting torchaudio==0.10.0
  Using cached torchaudio-0.10.0-cp38-cp38-manylinux1_x86_64.whl (2.9 MB)
Collecting boto3
  Using cached boto3-1.21.12-py3-none-any.whl (132 kB)
Collecting torch==1.10.0
And on the head node, I tailed some logs...
I also get "Server startup failed". Also the first line of dashboard_agent.log is Any help is appreciated. |
cc @shrekris-anyscale can you help triage vicyap's issue, looks like an issue with runtime envs |
Sure, let me take a look. |
@vicyap looking at the traceback, it looks like the process or container might be getting OOM-killed. This is a common problem when installing PyTorch within a container. Please take a look at pytorch/pytorch#1022 (comment) for a workaround. |
@vicyap is it possible to try the same thing without the |
@architkulkarni is there a way to do something like |
@vicyap you can try setting PIP_NO_CACHE_DIR=1 on the cluster: https://pip.pypa.io/en/latest/topics/configuration/#environment-variables Actually, you might have to set it to 0 instead of 1, as this issue seems to still be open: pypa/pip#5735 Another user reported that using the |
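For reference, pip also accepts a `--no-cache-dir` command-line flag that disables the cache for a single invocation, which sidesteps the question of which value the environment variable needs. The sketch below only builds such a command. The helper name `pip_install_command` is hypothetical, and this is not how Ray's runtime_env installs packages; it applies only where you invoke pip yourself.

```python
import sys
from typing import List


def pip_install_command(packages: List[str], use_cache: bool = False) -> List[str]:
    """Build a pip install command, skipping pip's cache unless use_cache is True."""
    # Run pip through the current interpreter so the right environment is targeted.
    cmd = [sys.executable, "-m", "pip", "install"]
    if not use_cache:
        cmd.append("--no-cache-dir")
    cmd.extend(packages)
    return cmd
```

Passing the returned list to `subprocess.run(cmd, check=True)` would perform the install, for example with `["torchaudio==0.10.0", "boto3"]` as the package list.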
@architkulkarni I seem to be experiencing similar behavior, with some differences. I documented it in a separate issue, if you are able to take a look: #23865 |
Hi, I'm a bot from the Ray team :) To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity in the 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel. |
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for opening the issue! |
Search before asking
Ray Component
Ray Core
What happened + What you expected to happen
After running for 6 days, the ray server fails to accept new clients.
I found this error repeated in ray_client_server.err:
I cannot recall seeing this error before upgrading to 1.10.0.
Versions / Dependencies
Ray version 1.10.0, Python 3.8, Ubuntu 20.04.
Reproduction script
I do not have any script for reproduction, the server was running for 6 days before the issue started.
Anything else
The server was running in Kubernetes.
Are you willing to submit a PR?