One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Traceback (most recent call last):
  File "/home/ubuntu/doc/code/training/main.py", line 16, in <module>
    main(**vars(args))
  File "/home/ubuntu/doc/code/training/main.py", line 12, in main
    pipe.train()
  File "/home/ubuntu/doc/code/training/training_pipeline.py", line 500, in train
    num_update_steps_per_epoch, num_train_epochs = self.accelerate()
  File "/home/ubuntu/doc/code/training/training_pipeline.py", line 209, in accelerate
    self.model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1350, in prepare
    result = tuple(
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1351, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1226, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1460, in prepare_model
    model = model.to(self.device)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1340, in to
    return self._apply(convert)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  [Previous line repeated 7 more times]
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in _apply
    param_applied = fn(param)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1326, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 70.68 GiB is free. Including non-PyTorch memory, this process has 8.40 GiB memory in use. Of the allocated memory 7.70 GiB is allocated by PyTorch, and 144.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
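The error message itself suggests one mitigation: setting PYTORCH_CUDA_ALLOC_CONF to reduce allocator fragmentation. A minimal sketch of applying it, assuming the variable is set before torch initializes CUDA in the launched process (the allocator reads it once at startup):

```python
import os

# Must be set before torch initializes CUDA in this process;
# setting it after CUDA init has no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Only after this would you import torch and start training.
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Equivalently, prefixing the launch command in the shell (e.g. `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True accelerate launch training/main.py`) sets it for all spawned processes.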
Expected behavior
It works when doing python training/main.py.
It does not work when doing accelerate launch training/main.py.
It works when doing accelerate launch training/main.py on an A10G instance with the same requirements.
I think accelerator.prepare reserves extra memory for distributed features such as gradient communication (you only need accelerator.prepare if you want to train the model), so it is expected that it uses more memory than usual. Is your model almost maxing out GPU memory? That could explain it.
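To make that concrete: preparing a model for training allocates gradients and optimizer state on top of the weights, so a model that fits for inference can OOM once training state is added. A back-of-the-envelope sketch with my own illustrative numbers (assuming fp32 weights and a plain Adam optimizer; not taken from the issue):

```python
def inference_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    # Weights only (fp32 = 4 bytes per parameter).
    return num_params * bytes_per_param

def training_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    weights = num_params * bytes_per_param           # fp32 weights
    grads = num_params * bytes_per_param             # one gradient per weight
    adam_states = 2 * num_params * bytes_per_param   # Adam's exp_avg + exp_avg_sq
    return weights + grads + adam_states

# Hypothetical 2B-parameter model: training state is 4x the weights alone.
n = 2_000_000_000
print(f"inference: {inference_bytes(n) / 2**30:.1f} GiB")  # ~7.5 GiB
print(f"training:  {training_bytes(n) / 2**30:.1f} GiB")   # ~29.8 GiB
```

This ignores activations and any buffers the distributed backend allocates, both of which add more; the point is only that memory use at prepare time is expected to be several times the weights-only footprint.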
System Info
Information
Tasks
Reproduction
I have a CUDA OOM error when doing accelerate launch training/main.py. The traceback is shown above.
Can you help me please?