Cuda OOM when accelerator.prepare #3200

Open

antoinedelplace opened this issue Oct 25, 2024 · 3 comments

@antoinedelplace
System Info

- `Accelerate` version: 1.0.1
- Platform: Linux-5.15.0-124-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/ubuntu/doc/code/venv/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.5.0+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1574.85 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: fp16
        - use_cpu: False
        - debug: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I get a CUDA OOM error when doing:

self.model, self.optimizer, self.train_dataloader, self.val_dataloader, self.lr_scheduler = self.accelerator.prepare(
    self.model, self.optimizer, self.train_dataloader, self.val_dataloader, self.lr_scheduler
)
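
For reference, here is a minimal self-contained sketch of the same call pattern (dummy model, optimizer and dataloaders, not my actual training code) that can be used to check whether accelerator.prepare alone triggers the OOM:

# Hypothetical minimal repro, only meant to isolate the prepare step.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 1024), torch.randn(64, 1024))
train_dataloader = DataLoader(dataset, batch_size=8)
val_dataloader = DataLoader(dataset, batch_size=8)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

model, optimizer, train_dataloader, val_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, val_dataloader, lr_scheduler
)
print("prepared on", accelerator.device,
      "- allocated:", round(torch.cuda.memory_allocated() / 2**20, 1), "MiB")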

Here is the traceback:

Traceback (most recent call last):
  File "/home/ubuntu/doc/code/training/main.py", line 16, in <module>
    main(**vars(args))
  File "/home/ubuntu/doc/code/training/main.py", line 12, in main
    pipe.train()
  File "/home/ubuntu/doc/code/training/training_pipeline.py", line 500, in train
    num_update_steps_per_epoch, num_train_epochs = self.accelerate()
  File "/home/ubuntu/doc/code/training/training_pipeline.py", line 209, in accelerate
    self.model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1350, in prepare
    result = tuple(
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1351, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1226, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1460, in prepare_model
    model = model.to(self.device)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1340, in to
    return self._apply(convert)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  [Previous line repeated 7 more times]
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in _apply
    param_applied = fn(param)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1326, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 70.68 GiB is free. Including non-PyTorch memory, this process has 8.40 GiB memory in use. Of the allocated memory 7.70 GiB is allocated by PyTorch, and 144.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
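
(Side note: the allocator hint at the end of the message is the generic fragmentation workaround, not necessarily the fix here, but it can be tried by setting the environment variable before CUDA is initialized, either exported in the shell before accelerate launch or at the very top of main.py:)

import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the allocator config is in place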

Expected behavior

It works when running python training/main.py.
It does not work when running accelerate launch training/main.py.

It works when running accelerate launch training/main.py on an A10G instance with the same requirements:

- `Accelerate` version: 1.0.1
- Platform: Linux-6.2.0-1011-aws-x86_64-with-glibc2.35
- `accelerate` bash location: /home/ubuntu/doc/code/venv/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 186.70 GB
- GPU type: NVIDIA A10G
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: fp16
        - use_cpu: False
        - debug: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Can you help me please?

@jubueche
Contributor

I think accelerator.prepare reserves extra memory for distributed things such as gradient communication (you only need to call accelerator.prepare if you want to train the model), so it is expected that it uses more memory than usual. Is your model almost maxing out GPU memory? That could explain it.
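
A rough way to check whether the extra allocation really comes from prepare (a sketch with a hypothetical helper, assuming a single-GPU run) is to compare the allocated memory right before and after the call:

import torch

def prepare_with_memory_report(accelerator, *objects):
    # Wrap accelerator.prepare and report how much GPU memory the call added.
    before = torch.cuda.memory_allocated()
    prepared = accelerator.prepare(*objects)
    after = torch.cuda.memory_allocated()
    print(f"prepare added {(after - before) / 2**30:.2f} GiB "
          f"(peak: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB)")
    return prepared

Calling this in place of accelerator.prepare(...) would show whether the jump happens exactly at the model.to(device) step seen in the traceback.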

@antoinedelplace
Author

I don't think it is due to extra memory for distributed stuff. When launching python training/main.py, the GPU only uses 21 GB out of 80 GB.

I just found out that accelerate launch training/main.py works when using all 8 GPUs, so the problem may be in the simple launcher.
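
A quick sanity check for the single-GPU / simple-launcher path could be a small debug snippet (hypothetical, not part of my script) printed at startup under both python main.py and accelerate launch main.py:

import os
import torch
from accelerate import Accelerator

accelerator = Accelerator()
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("device count:", torch.cuda.device_count())
print("accelerator device:", accelerator.device)
print("allocated before prepare (MiB):", torch.cuda.memory_allocated() / 2**20)

If the two launch modes report different visible devices or a different accelerator device, that would point at the launcher rather than at prepare itself.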

@yguooo
commented Nov 5, 2024

> I don't think it is due to extra memory for distributed stuff. When launching python training/main.py, the GPU only uses 21 GB out of 80 GB.
>
> I just found out that accelerate launch training/main.py works when using all 8 GPUs, so the problem may be in the simple launcher.

I encountered a similar problem when replicating https://github.com/vwxyzjn/summarize_from_feedback_details/blob/main/summarize_from_feedback_details/sft.py.

The code works in the distributed setup with 1 machine and 8 GPUs (5-6 GB of memory usage for a 1B model), but fails in the single-GPU case (30 GB of memory for the 1B model).
