
Pickling the accelerator after preparing data loader no longer possible #3070

Closed · 1 of 4 tasks
BenjaminBossan opened this issue Sep 3, 2024 · 6 comments · Fixed by #3074
@BenjaminBossan (Member)

System Info

accelerate v0.34.0

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

import pickle
import torch
from accelerate import Accelerator

data = torch.arange(10)
ds = torch.utils.data.TensorDataset(data)
dl = torch.utils.data.DataLoader(ds)

accelerator = Accelerator()
pickle.dumps(accelerator)  # works
accelerator.prepare(dl)
pickle.dumps(accelerator)  # fails

The error is:

    pickle.dumps(accelerator)  # fails
    ^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.PicklingError: Can't pickle <class 'accelerate.data_loader.DataLoaderShard'>: it's not the same object as accelerate.data_loader.DataLoaderShard

Expected behavior

Pickling and unpickling the Accelerator instance used to work, but now it fails if a data loader has been prepared. I traced the issue down to this PR: #2895. Although I haven't investigated further, I'm fairly certain that messing with the __class__ attribute is the reason:

self.__class__ = type(base_cls_name, (base_cls, parent_cls_name), {})

If so, it might be necessary to adjust __getstate__ or __reduce__ on DataLoaderAdapter to account for the fact that the class is changed dynamically.
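
For reference, here is a minimal, standalone sketch of the failure mode, with a plain class standing in for DataLoaderShard (an illustration only, not the actual accelerate code):

import pickle

class Shard:
    pass

obj = Shard()
# Mimic what DataLoaderAdapter does: swap the instance's class for a
# dynamically created type that reuses the original class's name.
obj.__class__ = type("Shard", (Shard,), {})

# pickle saves the class by module and name, finds the original Shard there,
# and notices it is not the same object as the dynamically created class.
pickle.dumps(obj)
# _pickle.PicklingError: Can't pickle <class '__main__.Shard'>: it's not the
# same object as __main__.Shard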

@BenjaminBossan (Member Author)

Ping @byi8220. I already discussed this issue internally with Zach, who might take a look later today.

@byi8220 (Contributor) commented Sep 3, 2024

Although I haven't investigated further, I'm fairly certain that messing with the __class__ attribute is the reason

Fully agree on that.

it might be necessary to adjust __getstate__ or __reduce__ on DataLoaderAdapter to account for the fact that the class is changed dynamically.

Makes sense. IIUC, it would be preferable to adjust __getstate__ (and possibly __setstate__) to be aware of the calling object's dynamic __class__ type?
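
Something along these lines, maybe (just a sketch with a hypothetical module-level _rebuild helper, not the actual DataLoaderAdapter code), where pickling is redirected at the statically importable base class via __reduce_ex__:

import pickle

def _rebuild(cls, state):
    # Hypothetical helper: recreate the instance from the importable base
    # class and the captured instance dict.
    obj = cls.__new__(cls)
    obj.__dict__.update(state)
    return obj

class Shard:
    def __init__(self, payload):
        self.payload = payload

    def __reduce_ex__(self, protocol):
        # Point pickle at the module-level class instead of the dynamically
        # created one, so the class lookup no longer fails.
        return (_rebuild, (Shard, self.__dict__))

obj = Shard("some state")
obj.__class__ = type("Shard", (Shard,), {})  # mimic the dynamic mixin

restored = pickle.loads(pickle.dumps(obj))
print(type(restored).__name__, restored.payload)  # Shard some state

Note the unpickled copy here ends up as the plain base class again, so an actual fix would probably also need to restore the dynamic mixin when unpickling.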

Alternatively, we could just abandon the dynamic class solution entirely and incur a lot of code duplication (which we discussed in the original PR, but decided against).

who might take a look later today.

Ack, I'll hold off on writing a PR then?

@BenjaminBossan (Member Author)

Ack, I'll hold off on writing a PR then?

I think you can give it a try; I'm sure Zach will write here when he starts working on this.

@muellerzr (Collaborator)

@byi8220 let's go with the getstate/setstate version rather than duplicating code. I haven't gotten there yet, so feel free to have at it!

@byi8220 (Contributor) commented Sep 3, 2024

Working on a possible fix for this in #3074 (currently marked as draft, but it might be worth looking at in its current state, since it might already be enough?).

Honestly, even before this __class__ change broke it, pickling DataLoaders seems rather fragile. There are more details in the PR, but I'm not sure StatefulDataLoader could (or should) even be pickled.

Also, my current multigpu tests are failing with an error message of TypeError: cannot pickle 'torch._C.Generator' object when I try to pickle the accelerator. I am not sure if that's related to the issue at hand or if that is a known existing bug.

At the very least, I ran test_hf.py within https://github.com/skorch-dev/skorch with the above PR and managed to get those tests to pass, so it should unblock their CI. It's sort of strange, though, considering their failing test literally says pickling doesn't work.

However, I can't run their integration tests, so there's still a chance this is not sufficient. I still think there are some rough edges with pickling, especially in a distributed context.

@byi8220 (Contributor) commented Sep 3, 2024

Also, my current multigpu tests are failing with an error message of TypeError: cannot pickle 'torch._C.Generator'

Yet again, this was just me not updating to the latest pytorch version. Pickling torch generators is only available in torch 2.4.0 and above, which unfortunately is incompatible with torchvision-0.18.1.
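
For anyone hitting the same error, a quick standalone check of what the installed torch does (stock torch only, nothing accelerate-specific):

import pickle
import torch

gen = torch.Generator()
try:
    restored = pickle.loads(pickle.dumps(gen))  # supported on torch >= 2.4.0
    print("Generator round-trips:", isinstance(restored, torch.Generator))
except TypeError as err:
    # Older torch raises: cannot pickle 'torch._C.Generator' object
    print("Generator pickling unsupported:", err)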
