Cuda OOM when accelerator.prepare #3200

Open

antoinedelplace opened this issue Oct 25, 2024 · 3 comments

@antoinedelplace
System Info

- `Accelerate` version: 1.0.1
- Platform: Linux-5.15.0-124-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/ubuntu/doc/code/venv/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.5.0+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1574.85 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: fp16
        - use_cpu: False
        - debug: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I get a CUDA OOM error when doing:

self.model, self.optimizer, self.train_dataloader, self.val_dataloader, self.lr_scheduler = self.accelerator.prepare(
    self.model, self.optimizer, self.train_dataloader, self.val_dataloader, self.lr_scheduler
)
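
For reference, here is a minimal self-contained sketch of the same call pattern (dummy model, optimizer and dataloaders, not my actual training code) that can be used to check whether accelerator.prepare alone triggers the OOM:

# Hypothetical minimal repro, only meant to isolate the prepare step.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 1024), torch.randn(64, 1024))
train_dataloader = DataLoader(dataset, batch_size=8)
val_dataloader = DataLoader(dataset, batch_size=8)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

model, optimizer, train_dataloader, val_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, val_dataloader, lr_scheduler
)
print("prepared on", accelerator.device,
      "- allocated:", round(torch.cuda.memory_allocated() / 2**20, 1), "MiB")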

Here is the traceback:

Traceback (most recent call last):
  File "/home/ubuntu/doc/code/training/main.py", line 16, in <module>
    main(**vars(args))
  File "/home/ubuntu/doc/code/training/main.py", line 12, in main
    pipe.train()
  File "/home/ubuntu/doc/code/training/training_pipeline.py", line 500, in train
    num_update_steps_per_epoch, num_train_epochs = self.accelerate()
  File "/home/ubuntu/doc/code/training/training_pipeline.py", line 209, in accelerate
    self.model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1350, in prepare
    result = tuple(
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1351, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1226, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1460, in prepare_model
    model = model.to(self.device)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1340, in to
    return self._apply(convert)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  [Previous line repeated 7 more times]
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in _apply
    param_applied = fn(param)
  File "/home/ubuntu/doc/code/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1326, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 70.68 GiB is free. Including non-PyTorch memory, this process has 8.40 GiB memory in use. Of the allocated memory 7.70 GiB is allocated by PyTorch, and 144.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
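
(Side note: the allocator hint at the end of the message is the generic fragmentation workaround, not necessarily the fix here, but it can be tried by setting the environment variable before CUDA is initialized, either exported in the shell before accelerate launch or at the very top of main.py:)

import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the allocator config is in place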

Expected behavior

It works when running python training/main.py.
It does not work when running accelerate launch training/main.py.

It works when running accelerate launch training/main.py on an A10G instance with the same requirements:

- `Accelerate` version: 1.0.1
- Platform: Linux-6.2.0-1011-aws-x86_64-with-glibc2.35
- `accelerate` bash location: /home/ubuntu/doc/code/venv/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 186.70 GB
- GPU type: NVIDIA A10G
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: fp16
        - use_cpu: False
        - debug: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Can you help me please?

@jubueche
Contributor

I think accelerator.prepare reserves extra memory for distributed things such as gradient communication (you only need to call accelerator.prepare if you want to train the model), so it is expected that it uses more memory than usual. Is your model almost maxing out GPU memory? That could explain it.
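
A rough way to check whether the extra allocation really comes from prepare (a sketch with a hypothetical helper, assuming a single-GPU run) is to compare the allocated memory right before and after the call:

import torch

def prepare_with_memory_report(accelerator, *objects):
    # Wrap accelerator.prepare and report how much GPU memory the call added.
    before = torch.cuda.memory_allocated()
    prepared = accelerator.prepare(*objects)
    after = torch.cuda.memory_allocated()
    print(f"prepare added {(after - before) / 2**30:.2f} GiB "
          f"(peak: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB)")
    return prepared

Calling this in place of accelerator.prepare(...) would show whether the jump happens exactly at the model.to(device) step seen in the traceback.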

@antoinedelplace
Author

I don't think it is due to extra memory for distributed stuff. When launching python training/main.py, the GPU only uses 21 GB out of 80 GB.

I just found out that accelerate launch training/main.py works when using all 8 GPUs, so the problem may be in the simple launcher.
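
A quick sanity check for the single-GPU / simple-launcher path could be a small debug snippet (hypothetical, not part of my script) printed at startup under both python main.py and accelerate launch main.py:

import os
import torch
from accelerate import Accelerator

accelerator = Accelerator()
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("device count:", torch.cuda.device_count())
print("accelerator device:", accelerator.device)
print("allocated before prepare (MiB):", torch.cuda.memory_allocated() / 2**20)

If the two launch modes report different visible devices or a different accelerator device, that would point at the launcher rather than at prepare itself.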

@yguooo
commented Nov 5, 2024

> I don't think it is due to extra memory for distributed stuff. When launching python training/main.py, the GPU only uses 21 GB out of 80 GB.
>
> I just found out that accelerate launch training/main.py works when using all 8 GPUs, so the problem may be in the simple launcher.

I encountered a similar problem when replicating https://github.com/vwxyzjn/summarize_from_feedback_details/blob/main/summarize_from_feedback_details/sft.py.

The code works in the distributed setup with 1 machine and 8 GPUs (5-6 GB of memory usage for a 1B model), but fails in the single-GPU case (30 GB of memory for the 1B model).
