
Pickling the accelerator after preparing data loader no longer possible #3070

Closed · 1 of 4 tasks
BenjaminBossan opened this issue Sep 3, 2024 · 6 comments · Fixed by #3074
@BenjaminBossan (Member)

System Info

accelerate v0.34.0

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

import pickle
import torch
from accelerate import Accelerator

data = torch.arange(10)
ds = torch.utils.data.TensorDataset(data)
dl = torch.utils.data.DataLoader(ds)

accelerator = Accelerator()
pickle.dumps(accelerator)  # works
accelerator.prepare(dl)
pickle.dumps(accelerator)  # fails

The error is:

    pickle.dumps(accelerator)  # fails
    ^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.PicklingError: Can't pickle <class 'accelerate.data_loader.DataLoaderShard'>: it's not the same object as accelerate.data_loader.DataLoaderShard

Expected behavior

Pickling and unpickling the Accelerator instance used to work, but now it fails if a data loader has been prepared. I traced the issue down to this PR: #2895. Although I haven't investigated further, I'm fairly certain that messing with the __class__ attribute is the reason:

self.__class__ = type(base_cls_name, (base_cls, parent_cls_name), {})

If so, it might be necessary to adjust __getstate__ or __reduce__ on DataLoaderAdapter to account for the fact that the class is changed dynamically.
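
For reference, here is a minimal, standalone sketch of the failure mode, with a plain class standing in for DataLoaderShard (an illustration only, not the actual accelerate code):

import pickle

class Shard:
    pass

obj = Shard()
# Mimic what DataLoaderAdapter does: swap the instance's class for a
# dynamically created type that reuses the original class's name.
obj.__class__ = type("Shard", (Shard,), {})

# pickle saves the class by module and name, finds the original Shard there,
# and notices it is not the same object as the dynamically created class.
pickle.dumps(obj)
# _pickle.PicklingError: Can't pickle <class '__main__.Shard'>: it's not the
# same object as __main__.Shard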

@BenjaminBossan (Member Author)

Ping @byi8220. I already discussed this issue internally with Zach, who might take a look later today.

@byi8220 (Contributor) commented Sep 3, 2024

Although I haven't investigated further, I'm fairly certain that messing with the __class__ attribute is the reason

Fully agree on that.

it might be necessary to adjust __getstate__ or __reduce__ on DataLoaderAdapter to account for the fact that the class is changed dynamically.

Makes sense. IIUC, it would be preferable to adjust __getstate__ (and possibly __setstate__) to be aware of the calling object's dynamic __class__ type?
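
Something along these lines, maybe (just a sketch with a hypothetical module-level _rebuild helper, not the actual DataLoaderAdapter code), where pickling is redirected at the statically importable base class via __reduce_ex__:

import pickle

def _rebuild(cls, state):
    # Hypothetical helper: recreate the instance from the importable base
    # class and the captured instance dict.
    obj = cls.__new__(cls)
    obj.__dict__.update(state)
    return obj

class Shard:
    def __init__(self, payload):
        self.payload = payload

    def __reduce_ex__(self, protocol):
        # Point pickle at the module-level class instead of the dynamically
        # created one, so the class lookup no longer fails.
        return (_rebuild, (Shard, self.__dict__))

obj = Shard("some state")
obj.__class__ = type("Shard", (Shard,), {})  # mimic the dynamic mixin

restored = pickle.loads(pickle.dumps(obj))
print(type(restored).__name__, restored.payload)  # Shard some state

Note the unpickled copy here ends up as the plain base class again, so an actual fix would probably also need to restore the dynamic mixin when unpickling.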

Alternatively, we could just abandon the dynamic class solution entirely and incur a lot of code duplication (which we discussed in the original PR, but decided against).

who might take a look later today.

Ack, I'll hold off on writing a PR then?

@BenjaminBossan (Member Author)

Ack, I'll hold off on writing a PR then?

I think you can give it a try; I'm sure Zach will write here when he starts working on this.

@muellerzr (Collaborator)

@byi8220 let's go with the getstate/setstate version rather than duplicating code. I haven't gotten there yet, so feel free to have at it!

@byi8220 (Contributor) commented Sep 3, 2024

Working on a possible fix for this in #3074 (currently marked as draft, but it might be worth looking at in its current state, since it might already be enough?).

Honestly, even before this __class__ change broke it, pickling DataLoaders seems rather fragile. There are more details in the PR, but I'm not sure StatefulDataLoader could (or should) even be pickled.

Also, my current multigpu tests are failing with an error message of TypeError: cannot pickle 'torch._C.Generator' object when I try to pickle the accelerator. I am not sure if that's related to the issue at hand or if that is a known existing bug.

At the very least, I ran test_hf.py within https://github.com/skorch-dev/skorch with the above PR and managed to get those tests to pass, so it should unblock their CI. It's sort of strange, though, considering their failing test literally says pickling doesn't work.

However, I can't run their integration tests, so there's still a chance this is not sufficient. I still think there are some rough edges with pickling, especially in a distributed context.

@byi8220 (Contributor) commented Sep 3, 2024

Also, my current multigpu tests are failing with an error message of TypeError: cannot pickle 'torch._C.Generator'

Yet again, this was just me not updating to the latest pytorch version. Pickling torch generators is only available in torch 2.4.0 and above, which unfortunately is incompatible with torchvision-0.18.1.
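
For anyone hitting the same error, a quick standalone check of what the installed torch does (stock torch only, nothing accelerate-specific):

import pickle
import torch

gen = torch.Generator()
try:
    restored = pickle.loads(pickle.dumps(gen))  # supported on torch >= 2.4.0
    print("Generator round-trips:", isinstance(restored, torch.Generator))
except TypeError as err:
    # Older torch raises: cannot pickle 'torch._C.Generator' object
    print("Generator pickling unsupported:", err)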
