
out of memory #58

Open
xxrrnn opened this issue Apr 1, 2024 · 4 comments

Comments

@xxrrnn commented Apr 1, 2024

When running stage1.sh, the following error appears:

    module._apply(fn)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
    param_applied = fn(param)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 1; 23.65 GiB total capacity; 23.08 GiB already allocated; 58.06 MiB free; 23.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 8396 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 8397) of binary: /root/miniconda3/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()

I saw the answer in an earlier issue, but it is not clear to me where the --gpu argument should be added. I tried adding it in stage1.sh, and got an error saying this argument is not one the code expects. Could you explain in detail how to add the flag that tells the code to load the model across multiple GPUs? Thanks.
The stage1.sh command is as follows:

# to fill in the following path to run the first stage of our GraphGPT!
model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/train_instruct_graphmatch.json
graph_data_path=./graph_data/graph_data_all.pt
pretra_gnn=clip_gt_arxiv
output_model=./stage_1
wandb offline
python3 -m  torch.distributed.run  --nnodes=1 --nproc_per_node=2 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 256 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
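
For reference, the OOM message above itself suggests max_split_size_mb as a fragmentation mitigation. A minimal sketch, assuming it is set before the first CUDA allocation (for example near the top of graphgpt/train/train_mem.py); the 128 MiB value is only an illustration, and this will not help if the model simply does not fit in 24 GB:

import os

# Suggested by the PyTorch OOM message: configure the caching allocator to
# reduce fragmentation. This must run before the first CUDA tensor is created.
# The 128 MiB split size below is only an illustrative value.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")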
@yuh-yang (Collaborator) commented

Hi, you can try the scripts in tune_script_light to train under Lightning, which manages GPU memory and distributed training better; with bfloat16 precision it should be possible to train on a single 24 GB GPU.
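
For illustration only (the actual tune_script_light configuration may differ), single-GPU bfloat16 training under Lightning essentially comes down to the Trainer flags:

import pytorch_lightning as pl

# Illustrative sketch, not the repository's actual script: single-GPU training
# in bfloat16 mixed precision. `model` and `datamodule` stand in for the
# GraphGPT LightningModule / DataModule set up by tune_script_light.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision="bf16",  # "bf16-mixed" on Lightning >= 2.0
    max_epochs=3,
)
# trainer.fit(model, datamodule=datamodule)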

@msy0513 commented Jun 29, 2024

Hi, you can try the scripts in tune_script_light to train under Lightning, which manages GPU memory and distributed training better; with bfloat16 precision it should be possible to train on a single 24 GB GPU.

When I run the script from light, it fails with: AttributeError: 'GraphLlamaConfig' object has no attribute 'pretrain_graph_model_path'

if model_args.graph_tower is not None:
    self.model = GraphLlamaForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
        **bnb_model_from_pretrained_args
    )  ## TODO: add real Graph Llama model
else:
    self.model = transformers.LlamaForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
        **bnb_model_from_pretrained_args
    )
self.model.config.pretrain_graph_model_path = self.model.config.pretrain_graph_model_path + model_args.graph_tower

The error points at the last line. Is it because the TODO there needs to be filled in, or is it something else? How can it be fixed?
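
As an aside, the failing line assumes the loaded config already defines pretrain_graph_model_path. A hedged workaround sketch (the default path below is a placeholder, and the real cause may simply be a missing argument in the light script, as the next reply suggests):

# Workaround sketch, not the repository's official fix: guard the attribute
# before appending the graph tower name. "./" is a placeholder default path.
if not hasattr(self.model.config, "pretrain_graph_model_path"):
    self.model.config.pretrain_graph_model_path = "./"
self.model.config.pretrain_graph_model_path = (
    self.model.config.pretrain_graph_model_path + model_args.graph_tower
)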

@applekeyword commented

You may have forgotten to add the model arguments in the light script.

@sustech-lz commented

Hi, you can try the scripts in tune_script_light to train under Lightning, which manages GPU memory and distributed training better; with bfloat16 precision it should be possible to train on a single 24 GB GPU.

I would like to ask: could the problem above be solved by running distributed training on six 4090 GPUs?
