Calling __iter__ twice on DataLoader2 causes hang with MPRS #1198

Open
JohnHBrock opened this issue Jul 21, 2023 · 2 comments

🐛 Describe the bug

I'm aware torchdata isn't being maintained anymore, but thought I'd post this here for posterity:

When calling iter twice on the same instance of DataLoader2, trying to iterate over the second iterator results in a hang. One of the worker processes terminates due to the exception "Can not reset while we are still waiting response for previous request", although this isn't obvious unless you run a debugger. The exception is raised when one of the workers calls nonblocking_next() here. Once this worker dies, the data loader is deadlocked.

I noticed this when using Lightning with torchdata: Lightning's fit runs a few iterations of the validation loop as a sanity check before training, then runs a training loop, followed by the validation loop again. This second validation loop never finishes because of the hang.

Code to reproduce:

from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.datapipes.iter import IterableWrapper

def main():
    dp = IterableWrapper([1, 2, 3, 4, 5, 6, 7] * 100).sharding_round_robin_dispatch(SHARDING_PRIORITIES.MULTIPROCESSING)
    reading_service = MultiProcessingReadingService(num_workers=2, main_prefetch_cnt=0, worker_prefetch_cnt=0)

    dataloader = DataLoader2(dp, reading_service=reading_service)
    print(next(iter(dataloader)))  # first __iter__: prints 1
    print(next(iter(dataloader)))  # second __iter__: hangs here
    print("done")

if __name__ == "__main__":
    main()

This results in the output:

1

and nothing else. The data loader processes continue to run, except for the one terminating worker that I mentioned above.
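
For comparison, here is a minimal sketch of the single-iterator case (my assumption based on the behavior above, not separately verified): if the iterator is created once and reused, __iter__ is only called a single time, the reading service never has to reset mid-request, and iteration proceeds normally.

from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.datapipes.iter import IterableWrapper

def main():
    dp = IterableWrapper([1, 2, 3, 4, 5, 6, 7] * 100).sharding_round_robin_dispatch(SHARDING_PRIORITIES.MULTIPROCESSING)
    reading_service = MultiProcessingReadingService(num_workers=2, main_prefetch_cnt=0, worker_prefetch_cnt=0)
    dataloader = DataLoader2(dp, reading_service=reading_service)

    # Create the iterator once and keep using it; no second __iter__ call, no reset.
    it = iter(dataloader)
    print(next(it))
    print(next(it))
    print("done")

if __name__ == "__main__":
    main()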

Versions

Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.4.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.26.4
Libc version: N/A

Python version: 3.8.17 (default, Jul 19 2023, 14:02:02) [Clang 14.0.3 (clang-1403.0.22.14.1)] (64-bit runtime)
Python platform: macOS-13.4.1-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==2.0.1
[pip3] torchdata==0.6.1
[conda] Could not collect

JohnHBrock commented Jul 21, 2023

A possible workaround is to wrap DataLoader2 so that __iter__ recreates the underlying DataLoader2 from scratch each time, rather than resetting the existing instance, for example something like this:

from torchdata.dataloader2 import DataLoader2

class DataLoader2Workaround:
    """Wraps DataLoader2 so that each __iter__ call builds a fresh instance."""

    def __init__(self, datapipe, reading_service):
        self.datapipe = datapipe
        self.reading_service = reading_service
        self.dataloader2 = None

    def _create_dataloader2(self):
        # Build a brand-new DataLoader2 instead of resetting the existing one.
        self.dataloader2 = DataLoader2(self.datapipe, reading_service=self.reading_service)

    def __getattr__(self, attr):
        # Delegate everything else to the underlying DataLoader2, creating it lazily.
        if self.dataloader2 is None:
            self._create_dataloader2()
        return getattr(self.dataloader2, attr)

    def __iter__(self):
        # Recreate the DataLoader2 on every __iter__ call to avoid the reset hang.
        self._create_dataloader2()
        return iter(self.dataloader2)
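
For illustration, the wrapper can be dropped into the reproduction above in place of DataLoader2 (same datapipe and reading service as before; whether this also fixes the Lightning case end-to-end is untested here):

from torchdata.dataloader2 import MultiProcessingReadingService
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.datapipes.iter import IterableWrapper

def main():
    dp = IterableWrapper([1, 2, 3, 4, 5, 6, 7] * 100).sharding_round_robin_dispatch(SHARDING_PRIORITIES.MULTIPROCESSING)
    reading_service = MultiProcessingReadingService(num_workers=2, main_prefetch_cnt=0, worker_prefetch_cnt=0)

    dataloader = DataLoader2Workaround(dp, reading_service)
    print(next(iter(dataloader)))  # first __iter__: builds a DataLoader2, prints 1
    print(next(iter(dataloader)))  # second __iter__: builds a fresh DataLoader2 instead of resetting
    print("done")

if __name__ == "__main__":
    main()

One caveat: the wrapper never explicitly shuts down the previous DataLoader2 when a new one is created, so old instances are only cleaned up when they are garbage collected.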

@JohnHBrock

Possibly related to #1148.
