
out of memory #58

Open
xxrrnn opened this issue Apr 1, 2024 · 4 comments

Comments

@xxrrnn commented Apr 1, 2024

When running stage1.sh, the following error appears:

    module._apply(fn)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
    param_applied = fn(param)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 1; 23.65 GiB total capacity; 23.08 GiB already allocated; 58.06 MiB free; 23.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 8396 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 8397) of binary: /root/miniconda3/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()

I saw the answer in an earlier issue, but it is not clear to me where the --gpu argument should be added. I tried adding it in stage1.sh, and got an error saying this argument is not one the code expects. Could you explain in detail how to add the flag that tells the code to load the model across multiple GPUs? Thanks.
The stage1.sh command is as follows:

# to fill in the following path to run the first stage of our GraphGPT!
model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/train_instruct_graphmatch.json
graph_data_path=./graph_data/graph_data_all.pt
pretra_gnn=clip_gt_arxiv
output_model=./stage_1
wandb offline
python3 -m  torch.distributed.run  --nnodes=1 --nproc_per_node=2 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 256 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
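
For reference, the OOM message above itself suggests max_split_size_mb as a fragmentation mitigation. A minimal sketch, assuming it is set before the first CUDA allocation (for example near the top of graphgpt/train/train_mem.py); the 128 MiB value is only an illustration, and this will not help if the model simply does not fit in 24 GB:

import os

# Suggested by the PyTorch OOM message: configure the caching allocator to
# reduce fragmentation. This must run before the first CUDA tensor is created.
# The 128 MiB split size below is only an illustrative value.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")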
@yuh-yang (Collaborator) commented

Hi, you can try the scripts in tune_script_light to train under Lightning, which manages GPU memory and distributed training better; with bfloat16 precision it should be possible to train on a single 24 GB GPU.
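
For illustration only (the actual tune_script_light configuration may differ), single-GPU bfloat16 training under Lightning essentially comes down to the Trainer flags:

import pytorch_lightning as pl

# Illustrative sketch, not the repository's actual script: single-GPU training
# in bfloat16 mixed precision. `model` and `datamodule` stand in for the
# GraphGPT LightningModule / DataModule set up by tune_script_light.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision="bf16",  # "bf16-mixed" on Lightning >= 2.0
    max_epochs=3,
)
# trainer.fit(model, datamodule=datamodule)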

@msy0513 commented Jun 29, 2024

Hi, you can try the scripts in tune_script_light to train under Lightning, which manages GPU memory and distributed training better; with bfloat16 precision it should be possible to train on a single 24 GB GPU.

When I run the script from light, it fails with: AttributeError: 'GraphLlamaConfig' object has no attribute 'pretrain_graph_model_path'

if model_args.graph_tower is not None:
    self.model = GraphLlamaForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
        **bnb_model_from_pretrained_args
    )  ## TODO: add real Graph Llama model
else:
    self.model = transformers.LlamaForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
        **bnb_model_from_pretrained_args
    )
self.model.config.pretrain_graph_model_path = self.model.config.pretrain_graph_model_path + model_args.graph_tower

The error points at the last line. Is it because the TODO there needs to be filled in, or is it something else? How can it be fixed?
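
As an aside, the failing line assumes the loaded config already defines pretrain_graph_model_path. A hedged workaround sketch (the default path below is a placeholder, and the real cause may simply be a missing argument in the light script, as the next reply suggests):

# Workaround sketch, not the repository's official fix: guard the attribute
# before appending the graph tower name. "./" is a placeholder default path.
if not hasattr(self.model.config, "pretrain_graph_model_path"):
    self.model.config.pretrain_graph_model_path = "./"
self.model.config.pretrain_graph_model_path = (
    self.model.config.pretrain_graph_model_path + model_args.graph_tower
)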

@applekeyword commented

You may have forgotten to add the model arguments in the light script.

@sustech-lz commented

Hi, you can try the scripts in tune_script_light to train under Lightning, which manages GPU memory and distributed training better; with bfloat16 precision it should be possible to train on a single 24 GB GPU.

I would like to ask: could the problem above be solved by running distributed training on six 4090 GPUs?
