You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I'm aware torchdata isn't being maintained anymore, but thought I'd post this here for posterity:
When using iter twice for the same instance of DataLoader2, trying to iterate over the 2nd one results in a hang. One of the worker processes terminates due to an exception "Can not reset while we are still waiting response for previous request", although this isn't obvious unless you run a debugger. This exception occurs when one of the workers calls nonblocking_next()here. Once this worker dies, the data loader is deadlocked.
I noticed this when using Lightning with torchdata: Lightning's fit will run a few iterations of the validation loop as a sanity check before training, then do a training loop, followed by the validation loop again. This 2nd validation loop never finishes because of the hang.
and nothing else. The data loader processes continue to run, except for the one terminating worker that I mentioned above.
Versions
Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 13.4.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.26.4
Libc version: N/A
Python version: 3.8.17 (default, Jul 19 2023, 14:02:02) [Clang 14.0.3 (clang-1403.0.22.14.1)] (64-bit runtime)
Python platform: macOS-13.4.1-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==2.0.1
[pip3] torchdata==0.6.1
[conda] Could not collect
The text was updated successfully, but these errors were encountered:
A possible workaround is to wrap DataLoader2.__iter__ so that it gets recreated from scratch each time, rather than just resetting the existing DataLoader2 instance, for example something like this:
🐛 Describe the bug
I'm aware torchdata isn't being maintained anymore, but thought I'd post this here for posterity:
When using
iter
twice for the same instance of DataLoader2, trying to iterate over the 2nd one results in a hang. One of the worker processes terminates due to an exception "Can not reset while we are still waiting response for previous request", although this isn't obvious unless you run a debugger. This exception occurs when one of the workers callsnonblocking_next()
here. Once this worker dies, the data loader is deadlocked.I noticed this when using Lightning with torchdata: Lightning's
fit
will run a few iterations of the validation loop as a sanity check before training, then do a training loop, followed by the validation loop again. This 2nd validation loop never finishes because of the hang.Code to reproduce:
This results in the output:
and nothing else. The data loader processes continue to run, except for the one terminating worker that I mentioned above.
Versions
Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 13.4.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.26.4
Libc version: N/A
Python version: 3.8.17 (default, Jul 19 2023, 14:02:02) [Clang 14.0.3 (clang-1403.0.22.14.1)] (64-bit runtime)
Python platform: macOS-13.4.1-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==2.0.1
[pip3] torchdata==0.6.1
[conda] Could not collect
The text was updated successfully, but these errors were encountered: