I was trying to fine-tune Meta-Llama-3-8B-Instruct on 4 GPUs with the following command:
torchrun --nproc_per_node 4 -m training.run --output_dir llama3test --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --train_data training/toy_data --learning_rate 1e-5 --num_train_epochs 5 --per_device_train_batch_size 1 --dataloader_drop_last True --normalized True --temperature 0.02 --query_max_len 32 --passage_max_len 128 --train_group_size 2 --mode unified --attn cccc --attn_implementation sdpa --no_gen_gas --no_emb_gas --split_emb --bf16
and all 4 GPUs ran out of memory:
W1120 13:47:19.838000 2742257 site-packages/torch/distributed/run.py:793]
W1120 13:47:19.838000 2742257 site-packages/torch/distributed/run.py:793] *****************************************
W1120 13:47:19.838000 2742257 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1120 13:47:19.838000 2742257 site-packages/torch/distributed/run.py:793] *****************************************
11/20/2024 13:47:23 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
11/20/2024 13:47:23 - INFO - __main__ - Training/evaluation parameters CustomTrainingArguments(_n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=True, dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=False, emb_p_only=False, emb_q_only=False, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=1e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=llama3test/runs/Nov20_13-47-23_u, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=500, logging_strategy=steps, lora=False, loss_gen_factor=1.0, loss_gen_type=mixed, lr_scheduler_kwargs={}, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mode=unified, mp_parameters=, neftune_noise_alpha=None, negatives_cross_device=False, no_cuda=False, no_emb_gas=True, no_gen_gas=True, num_train_epochs=5.0, optim=adamw_torch, optim_args=None, output_dir=llama3test, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=8, per_device_generative_bs=None, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, qlora=False, ray_scope=last, remove_unused_columns=True, report_to=['wandb'], resume_from_checkpoint=None, run_name=llama3test, save_on_each_node=False, save_only_model=False, save_safetensors=False, save_steps=500, save_strategy=steps, save_total_limit=None, seed=42, skip_memory_metrics=True, split_batches=False, split_emb=True, split_emb_full=False, temperature=0.02, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, )
11/20/2024 13:47:23 - INFO - __main__ - Model parameters ModelArguments(model_name_or_path='meta-llama/Meta-Llama-3-8B-Instruct', config_name=None, tokenizer_name=None, pooling_method='weightedmean', normalized=True, attn_implementation='sdpa', attn='cccc', projection=None)
11/20/2024 13:47:23 - INFO - __main__ - Data parameters DataArguments(train_data='training/toy_data', train_group_size=2, query_max_len=32, passage_max_len=128, generative_max_len=None, max_example_num_per_dataset=100000000, num_samples=None, use_unique_indices=False, prefixlm=False)
11/20/2024 13:47:23 - INFO - __main__ - Using GradCache with chunk size 1
/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
11/20/2024 13:47:24 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False
/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
11/20/2024 13:47:24 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False
11/20/2024 13:47:24 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
11/20/2024 13:47:24 - INFO - __main__ - Config: LlamaConfig {
"_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128009,
"hidden_act": "silu",
"hidden_size": 4096,
"id2label": {
"0": "LABEL_0"
},
"initializer_range": 0.02,
"intermediate_size": 14336,
"label2id": {
"LABEL_0": 0
},
"max_position_embeddings": 8192,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.37.2",
"use_cache": true,
"vocab_size": 128256
}
11/20/2024 13:47:24 - INFO - __main__ - Set pad token to bos token: <|begin_of_text|>
11/20/2024 13:47:24 - INFO - __main__ - Loading dataset training/toy_data/toy_data_generative.jsonl
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
11/20/2024 13:47:25 - INFO - __main__ - Loading dataset training/toy_data/toy_data_embedding.jsonl
11/20/2024 13:47:26 - INFO - __main__ - Filtering out embedding samples with too long instructions for training/toy_data/toy_data_embedding.jsonl
11/20/2024 13:47:26 - INFO - __main__ - Unified mode: 10 embedding samples, 10 generative samples
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5.85it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.50it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.31it/s]
Created GritLM: torch.bfloat16 dtype, weightedmean pool, unified mode, cccc attn
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.44it/s]
Created GritLM: torch.bfloat16 dtype, weightedmean pool, unified mode, cccc attn
Created GritLM: torch.bfloat16 dtype, weightedmean pool, unified mode, cccc attn
Created GritLM: torch.bfloat16 dtype, weightedmean pool, unified mode, cccc attn
11/20/2024 13:47:32 - INFO - __main__ - Starting training
11/20/2024 13:47:34 - INFO - training.gradcache_trainer - ***** Running training *****
11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Num examples = 10
11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Num Epochs = 5
11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Instantaneous batch size per device = 1
11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Total train batch size (w. parallel, distributed & accumulation) = 4
11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Gradient Accumulation steps = 1
11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Total optimization steps = 10
11/20/2024 13:47:34 - INFO - training.gradcache_trainer - Number of trainable parameters = 8,030,261,248
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: hongjizhang183 (hongjizhang183-shanghai-jiao-tong-university). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.7
wandb: Run data is saved locally in /home/hongjizhang/gritlm/gritlm/wandb/run-20241120_134735-vujcrvoq
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run leafy-moon-48
wandb: ⭐️ View project at https://wandb.ai/hongjizhang183-shanghai-jiao-tong-university/huggingface
wandb: 🚀 View run at https://wandb.ai/hongjizhang183-shanghai-jiao-tong-university/huggingface/runs/vujcrvoq
0%| | 0/10 [00:00<?, ?it/s]
[rank1]:[W1120 13:47:36.928805439 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank2]:[W1120 13:47:36.929845239 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W1120 13:47:36.936399679 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank3]:[W1120 13:47:36.938326946 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank2]: return _run_code(code, main_globals, None,
[rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 86, in _run_code
[rank2]: exec(code, run_globals)
[rank2]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 438, in
[rank2]: main()
[rank2]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 419, in main
[rank2]: trainer.train()
[rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
[rank2]: return inner_training_loop(
[rank2]: File "/home/hongjizhang/gritlm/gritlm/training/gradcache_trainer.py", line 766, in _inner_training_loop
[rank2]: self.optimizer.step()
[rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
[rank2]: self.optimizer.step(closure)
[rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
[rank2]: return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper
[rank2]: out = func(*args, **kwargs)
[rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad
[rank2]: ret = func(self, *args, **kwargs)
[rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 220, in step
[rank2]: adamw(
[rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 154, in maybe_fallback
[rank2]: return func(*args, **kwargs)
[rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 782, in adamw
[rank2]: func(
[rank2]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 606, in _multi_tensor_adamw
[rank2]: exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 2 has a total capacity of 79.33 GiB of which 22.00 MiB is free. Including non-PyTorch memory, this process has 79.30 GiB memory in use. Of the allocated memory 77.58 GiB is allocated by PyTorch, and 501.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 438, in
main()
File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 419, in main
trainer.train()
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/hongjizhang/gritlm/gritlm/training/gradcache_trainer.py", line 766, in _inner_training_loop
self.optimizer.step()
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
self.optimizer.step(closure)
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
return func.__get__(opt, opt.__class__)(*args, **kwargs)
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper
out = func(*args, **kwargs)
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad
ret = func(self, *args, **kwargs)
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 220, in step
adamw(
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 154, in maybe_fallback
return func(*args, **kwargs)
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 782, in adamw
func(
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 606, in _multi_tensor_adamw
exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 79.33 GiB of which 12.00 MiB is free. Including non-PyTorch memory, this process has 79.31 GiB memory in use. Of the allocated memory 77.72 GiB is allocated by PyTorch, and 463.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 438, in
[rank0]: main()
[rank0]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 419, in main
[rank0]: trainer.train()
[rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/hongjizhang/gritlm/gritlm/training/gradcache_trainer.py", line 766, in _inner_training_loop
[rank0]: self.optimizer.step()
[rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
[rank0]: self.optimizer.step(closure)
[rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
[rank0]: return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper
[rank0]: out = func(*args, **kwargs)
[rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad
[rank0]: ret = func(self, *args, **kwargs)
[rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 220, in step
[rank0]: adamw(
[rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 154, in maybe_fallback
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 782, in adamw
[rank0]: func(
[rank0]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 606, in _multi_tensor_adamw
[rank0]: exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 79.33 GiB of which 12.00 MiB is free. Including non-PyTorch memory, this process has 79.31 GiB memory in use. Of the allocated memory 77.72 GiB is allocated by PyTorch, and 463.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank1]: return _run_code(code, main_globals, None,
[rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 86, in _run_code
[rank1]: exec(code, run_globals)
[rank1]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 438, in
[rank1]: main()
[rank1]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 419, in main
[rank1]: trainer.train()
[rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
[rank1]: return inner_training_loop(
[rank1]: File "/home/hongjizhang/gritlm/gritlm/training/gradcache_trainer.py", line 766, in _inner_training_loop
[rank1]: self.optimizer.step()
[rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
[rank1]: self.optimizer.step(closure)
[rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
[rank1]: return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper
[rank1]: out = func(*args, **kwargs)
[rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad
[rank1]: ret = func(self, *args, **kwargs)
[rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 220, in step
[rank1]: adamw(
[rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 154, in maybe_fallback
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 782, in adamw
[rank1]: func(
[rank1]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 606, in _multi_tensor_adamw
[rank1]: exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 1 has a total capacity of 79.33 GiB of which 38.00 MiB is free. Including non-PyTorch memory, this process has 79.28 GiB memory in use. Of the allocated memory 77.61 GiB is allocated by PyTorch, and 453.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank3]: return _run_code(code, main_globals, None,
[rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/runpy.py", line 86, in _run_code
[rank3]: exec(code, run_globals)
[rank3]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 438, in
[rank3]: main()
[rank3]: File "/home/hongjizhang/gritlm/gritlm/training/run.py", line 419, in main
[rank3]: trainer.train()
[rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
[rank3]: return inner_training_loop(
[rank3]: File "/home/hongjizhang/gritlm/gritlm/training/gradcache_trainer.py", line 766, in _inner_training_loop
[rank3]: self.optimizer.step()
[rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
[rank3]: self.optimizer.step(closure)
[rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
[rank3]: return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper
[rank3]: out = func(*args, **kwargs)
[rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad
[rank3]: ret = func(self, *args, **kwargs)
[rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 220, in step
[rank3]: adamw(
[rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 154, in maybe_fallback
[rank3]: return func(*args, **kwargs)
[rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 782, in adamw
[rank3]: func(
[rank3]: File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/optim/adamw.py", line 606, in _multi_tensor_adamw
[rank3]: exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 3 has a total capacity of 79.33 GiB of which 34.00 MiB is free. Including non-PyTorch memory, this process has 79.29 GiB memory in use. Of the allocated memory 77.72 GiB is allocated by PyTorch, and 442.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W1120 13:47:39.700000 2742257 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2742332 closing signal SIGTERM
W1120 13:47:39.701000 2742257 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2742333 closing signal SIGTERM
W1120 13:47:39.702000 2742257 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2742334 closing signal SIGTERM
wandb: 🚀 View run leafy-moon-48 at: https://wandb.ai/hongjizhang183-shanghai-jiao-tong-university/huggingface/runs/vujcrvoq
wandb: Find logs at: wandb/run-20241120_134735-vujcrvoq/logs
E1120 13:47:40.167000 2742257 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 2742335) of binary: /home/hongjizhang/.conda/envs/gritlm/bin/python
Traceback (most recent call last):
File "/home/hongjizhang/.conda/envs/gritlm/bin/torchrun", line 8, in
sys.exit(main())
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/hongjizhang/.conda/envs/gritlm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
training.run FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-11-20_13:47:39
host : u
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 2742335)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
The GPUs are NVIDIA A800-SXM4-80GB, which in principle should not run out of memory just from loading an 8B model in bf16 precision. I don't understand why a Llama-3-8B model takes up so much memory.
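For what it's worth, here is a rough back-of-envelope sketch of where the memory could be going, assuming full-parameter training with plain DDP and torch AdamW (no gradient checkpointing, LoRA/QLoRA, or FSDP/ZeRO sharding) and optimizer states created in the same bf16 dtype as the parameters; the accounting below is an illustrative assumption, not a measurement:

```python
# Back-of-envelope per-GPU memory for full fine-tuning of an ~8B model in bf16
# with plain DDP + torch AdamW (no sharding, no gradient checkpointing).
# Illustrative sketch only; the byte accounting is an assumption.

n_params = 8_030_261_248   # "Number of trainable parameters" from the log above
bf16 = 2                   # bytes per element
gib = 2**30

weights    = n_params * bf16   # model parameters
grads      = n_params * bf16   # gradients (same dtype as the parameters)
exp_avg    = n_params * bf16   # AdamW first-moment state
exp_avg_sq = n_params * bf16   # AdamW second-moment state
# torch's multi-tensor AdamW also materializes temporaries during step(),
# e.g. the torch._foreach_sqrt(exp_avg_sq) call shown in the traceback.
step_tmp   = n_params * bf16

total = weights + grads + exp_avg + exp_avg_sq + step_tmp
print(f"~{total / gib:.1f} GiB before activations, DDP buffers and CUDA context")
# -> ~74.8 GiB, in the same ballpark as the ~77.6 GiB the errors report as allocated
```

If that accounting is roughly right, the OOM at optimizer.step() would come mostly from the gradients, AdamW states, and step temporaries rather than from the bf16 weights themselves.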