fix: SocketError torch timeout (#135)
torch.distributed.init_process_group includes a default 30-minute timeout. While the worker is listening for instructions, the following error is thrown after thirty idle minutes:

```
[1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:605 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7f3ff9fb295e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x7d (0x7f3ff9f6b7cd in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xf8 (0x7f3fc834c858 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3a (0x7f3fc834d4ca in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x84 (0x7f3fc834d594 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) + 0x1fc (0x7f3f8c6e443c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x530 (0x7f3f8c6e7c10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x4c2 (0x7f3f8c6f5922 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x4c1a310 (0x7f3fc82eb310 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4c23bfb (0x7f3fc82f4bfb in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x4c43ab3 (0x7f3fc8314ab3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xb8c97a (0x7f3fceaa197a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x39bb46 (0x7f3fce2b0b46 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x15cc9e (0x5572975c2c9e in /usr/bin/python)
frame #17: _PyObject_MakeTpCall + 0x25b (0x5572975b972b in /usr/bin/python)
frame #18: <unknown function> + 0x16b1eb (0x5572975d11eb in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x640a (0x5572975b175a in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #21: PyObject_Call + 0x122 (0x5572975d1bc2 in /usr/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x2a37 (0x5572975add87 in /usr/bin/python)
frame #23: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1a1b (0x5572975acd6b in /usr/bin/python)
frame #25: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #26: PyObject_Call + 0x122 (0x5572975d1bc2 in /usr/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x2a37 (0x5572975add87 in /usr/bin/python)
frame #28: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x1a1b (0x5572975acd6b in /usr/bin/python)
frame #30: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x6cd (0x5572975aba1d in /usr/bin/python)
frame #32: <unknown function> + 0x142176 (0x5572975a8176 in /usr/bin/python)
frame #33: PyEval_EvalCode + 0x86 (0x55729769dc56 in /usr/bin/python)
frame #34: <unknown function> + 0x264b18 (0x5572976cab18 in /usr/bin/python)
frame #35: <unknown function> + 0x25d96b (0x5572976c396b in /usr/bin/python)
frame #36: <unknown function> + 0x264865 (0x5572976ca865 in /usr/bin/python)
frame #37: _PyRun_SimpleFileObject + 0x1a8 (0x5572976c9d48 in /usr/bin/python)
frame #38: _PyRun_AnyFileObject + 0x43 (0x5572976c9a43 in /usr/bin/python)
frame #39: Py_RunMain + 0x2be (0x5572976bac3e in /usr/bin/python)
frame #40: Py_BytesMain + 0x2d (0x557297690bcd in /usr/bin/python)
frame #41: <unknown function> + 0x29d90 (0x7f4018b72d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #42: __libc_start_main + 0x80 (0x7f4018b72e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #43: _start + 0x25 (0x557297690ac5 in /usr/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.
```

After this error, any attempt by the worker to reconnect and establish a new session with the master fails, including after pod restarts. Any new connection attempt from the worker is met with this error:

```
root@llama-2-13b-chat-pod-1:/workspace/llama/llama-2# torchrun --nnodes 2 --nproc_per_node 1 --rdzv_endpoint 10.224.0.181:29500 --master_port 29500 --rdzv_backend c10d inference-api.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+4136153', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 871, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1028, in next_rendezvous
    self._op_executor.run(join_op, deadline)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 635, in run
    raise RendezvousClosedError()
torch.distributed.elastic.rendezvous.api.RendezvousClosedError
```

Cleaning up the process group and reinitializing on the worker side does not resolve the issue either: the state between the worker and master is inconsistent, and only a master restart recovers it. The best solution here is to increase the timeout so the connection is not closed. I have included a Dockerfile fix here.

---------

Co-authored-by: Fei Guo <[email protected]>
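For reference, the same timeout can be raised in the Python entrypoint by passing an explicit `timeout` to `torch.distributed.init_process_group`. The snippet below is only a minimal sketch: the ten-hour value and the environment-variable plumbing are illustrative assumptions, not the exact change shipped in this commit's Dockerfile.

```python
import os
from datetime import timedelta

import torch.distributed as dist

# The c10d store / NCCL rendezvous uses a 30-minute default timeout; a worker
# that sits idle longer than that hits the "Socket Timeout" error above.
# Passing a larger timeout keeps the store connection alive while idle.
dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
    timeout=timedelta(hours=10),  # illustrative value; pick anything longer than the expected idle period
)
```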