fix: SocketError torch timeout (#135)
torch.distributed.init_process_group includes a default 30-minute timeout. While the worker is listening for instructions, the following error is thrown after thirty idle minutes:

```
[1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:605 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7f3ff9fb295e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x7d (0x7f3ff9f6b7cd in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xf8 (0x7f3fc834c858 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3a (0x7f3fc834d4ca in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x84 (0x7f3fc834d594 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f3fc82fe063 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) + 0x1fc (0x7f3f8c6e443c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x530 (0x7f3f8c6e7c10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x4c2 (0x7f3f8c6f5922 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x4c1a310 (0x7f3fc82eb310 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4c23bfb (0x7f3fc82f4bfb in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x4c43ab3 (0x7f3fc8314ab3 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xb8c97a (0x7f3fceaa197a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #15: <unknown function> + 0x39bb46 (0x7f3fce2b0b46 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x15cc9e (0x5572975c2c9e in /usr/bin/python)
frame #17: _PyObject_MakeTpCall + 0x25b (0x5572975b972b in /usr/bin/python)
frame #18: <unknown function> + 0x16b1eb (0x5572975d11eb in /usr/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x640a (0x5572975b175a in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #21: PyObject_Call + 0x122 (0x5572975d1bc2 in /usr/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x2a37 (0x5572975add87 in /usr/bin/python)
frame #23: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1a1b (0x5572975acd6b in /usr/bin/python)
frame #25: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #26: PyObject_Call + 0x122 (0x5572975d1bc2 in /usr/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x2a37 (0x5572975add87 in /usr/bin/python)
frame #28: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x1a1b (0x5572975acd6b in /usr/bin/python)
frame #30: _PyFunction_Vectorcall + 0x7c (0x5572975c34ec in /usr/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x6cd (0x5572975aba1d in /usr/bin/python)
frame #32: <unknown function> + 0x142176 (0x5572975a8176 in /usr/bin/python)
frame #33: PyEval_EvalCode + 0x86 (0x55729769dc56 in /usr/bin/python)
frame #34: <unknown function> + 0x264b18 (0x5572976cab18 in /usr/bin/python)
frame #35: <unknown function> + 0x25d96b (0x5572976c396b in /usr/bin/python)
frame #36: <unknown function> + 0x264865 (0x5572976ca865 in /usr/bin/python)
frame #37: _PyRun_SimpleFileObject + 0x1a8 (0x5572976c9d48 in /usr/bin/python)
frame #38: _PyRun_AnyFileObject + 0x43 (0x5572976c9a43 in /usr/bin/python)
frame #39: Py_RunMain + 0x2be (0x5572976bac3e in /usr/bin/python)
frame #40: Py_BytesMain + 0x2d (0x557297690bcd in /usr/bin/python)
frame #41: <unknown function> + 0x29d90 (0x7f4018b72d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #42: __libc_start_main + 0x80 (0x7f4018b72e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #43: _start + 0x25 (0x557297690ac5 in /usr/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.
```

After this error, any attempt by the worker to reconnect and establish a new session with the master fails, including after pod restarts. Any new connection attempt from the worker is met with this error:

```
root@llama-2-13b-chat-pod-1:/workspace/llama/llama-2# torchrun --nnodes 2 --nproc_per_node 1 --rdzv_endpoint 10.224.0.181:29500 --master_port 29500 --rdzv_backend c10d inference-api.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+4136153', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 871, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1028, in next_rendezvous
    self._op_executor.run(join_op, deadline)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 635, in run
    raise RendezvousClosedError()
torch.distributed.elastic.rendezvous.api.RendezvousClosedError
```

Cleaning up the process group and reinitializing on the worker side does not resolve the issue either: the state between the worker and master is inconsistent, and only a master restart recovers it. The best solution here is to increase the timeout so the connection is not closed. I have included a Dockerfile fix here.

---------

Co-authored-by: Fei Guo <[email protected]>
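For reference, the same timeout can be raised in the Python entrypoint by passing an explicit `timeout` to `torch.distributed.init_process_group`. The snippet below is only a minimal sketch: the ten-hour value and the environment-variable plumbing are illustrative assumptions, not the exact change shipped in this commit's Dockerfile.

```python
import os
from datetime import timedelta

import torch.distributed as dist

# The c10d store / NCCL rendezvous uses a 30-minute default timeout; a worker
# that sits idle longer than that hits the "Socket Timeout" error above.
# Passing a larger timeout keeps the store connection alive while idle.
dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
    timeout=timedelta(hours=10),  # illustrative value; pick anything longer than the expected idle period
)
```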