finetune_hf.py hangs at the start of training #1345

Bayson-create · 2025-01-07T00:28:50Z

System Info

Cuda 11.5
Transformers 4.40.2
Python 3.12.2
GPU: 4090 (single card)
RAM: 32GB
OS: Windows (WSL2)

Who can help?

Information

The official example scripts
My own modified scripts and tasks

Reproduction

In lora_finetune.ipynb, run !CUDA_VISIBLE_DEVICES=0 NCCL_P2P_DISABLE="1" NCCL_IB_DISABLE="1" python finetune_hf.py data/AdvertiseGen_fix /home/xiebeichen/chatglm3-6b configs/lora.yaml

Output

Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|██████████████████| 7/7 [00:03<00:00, 2.02it/s]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model

--> model has 1.949696M params

train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
--> Sanity check
'[gMASK]': 64790 -> -100
'sop': 64792 -> -100
'<|user|>': 64795 -> -100
...
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 3,000
Number of trainable parameters = 1,949,696
0%| | 0/3000 [00:00<?, ?it/s]
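
For reference, the figures in this log can be cross-checked with a few lines of arithmetic. This is only a sanity check on numbers copied from the output (the variable names are my own); it suggests the LoRA setup and step count are consistent, so the log itself does not point to a configuration problem.

```python
# Numbers copied verbatim from the training log in this issue.
trainable_params = 1_949_696
all_params = 6_245_533_696
train_rows = 114_599
total_batch_size = 4
optimization_steps = 3_000

# Share of parameters that LoRA actually trains.
trainable_percent = 100 * trainable_params / all_params

# How much of the training set 3,000 steps would cover.
samples_seen = total_batch_size * optimization_steps
epochs_covered = samples_seen / train_rows

print(f"trainable%: {trainable_percent:.6f}")
print(f"samples seen: {samples_seen}")
print(f"epochs covered: {epochs_covered:.3f}")
```

The trainable percentage matches the value reported in the log, and 3,000 steps at batch size 4 would cover roughly a tenth of one pass over train_dataset.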

Training hangs at the very start, at step 0/3000. No error is reported, and nvidia-smi shows that the GPU is not being used.
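
A possible way to narrow this down, sketched under assumptions: check whether PyTorch inside WSL2 can see the GPU at all, and have Python dump its stack traces periodically so the hanging call becomes visible. The helper names `gpu_report` and `enable_hang_dump` are hypothetical and not part of finetune_hf.py; the calls they use (`torch.cuda.is_available`, `torch.version.cuda`, `torch.cuda.get_device_name`, `faulthandler.dump_traceback_later`) are standard PyTorch and standard-library APIs.

```python
import faulthandler
import shutil
import sys


def gpu_report():
    """Collect basic facts about GPU visibility from this Python process."""
    report = {"nvidia_smi_on_path": shutil.which("nvidia-smi") is not None}
    try:
        import torch
    except ImportError:
        report["torch_installed"] = False
        return report

    report["torch_installed"] = True
    # torch.version.cuda is None for CPU-only PyTorch builds.
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


def enable_hang_dump(seconds=60):
    """Print every thread's stack to stderr every `seconds` seconds.

    Hypothetical helper: call it once near the top of the training script
    so that a silent hang shows which call it is stuck in.
    """
    faulthandler.dump_traceback_later(seconds, repeat=True, file=sys.stderr)


if __name__ == "__main__":
    print(gpu_report())
```

If `cuda_available` comes back False, or `torch_cuda_build` is None, the installed PyTorch build cannot use the GPU and the hang is probably an extremely slow CPU run rather than a deadlock. If CUDA is available, calling `enable_hang_dump()` at the start of the training script should show where the first step is blocked.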

Expected behavior

The GPU is used and training proceeds normally.
