
Problem with reward model training in the second paper #58

Open
Syaoran1 opened this issue Sep 27, 2024 · 0 comments

@Syaoran1
I want to reproduce the experiments from the second paper, but I've run into a problem that I don't know how to solve:
```
[2024-09-27 20:21:18,768] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-27 20:21:22,006] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
1
1
[2024-09-27 20:21:23,268] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-27 20:21:23,268] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py:457: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(split_batches=True)
  warnings.warn(
2024-09-27 20:21:23 - INFO - Load init model from /mnt/sda/pr/new_proj/MOSS-RLHF/models/llama-2-7b-hf
2024-09-27 20:21:23 - INFO - Loading tokenizer from huggingface: /mnt/sda/pr/new_proj/MOSS-RLHF/models/llama-2-7b-hf...
2024-09-27 20:21:23 - INFO - Llama tokenizer size: 32000
2024-09-27 20:21:23 - INFO - Llama tokenizer pad token: <unk>, pad_token_id: 0
2024-09-27 20:21:23 - INFO - Llama tokenizer. special token: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>'}

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.12it/s]
Some weights of LlamaRewardModel were not initialized from the model checkpoint at /mnt/sda/pr/new_proj/MOSS-RLHF/models/llama-2-7b-hf and are newly initialized: ['reward_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-09-27 20:21:31 - INFO - Got 151214 samples from /mnt/sda/pr/new_proj/MOSS-RLHF/data/data_clean/hh-rlhf-strength-cleaned/train.json
2024-09-27 20:21:31 - INFO - Got 151214 samples totally from ['train.json']
[2024-09-27 20:21:31,825] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.15.1, git-hash=unknown, git-branch=unknown
[2024-09-27 20:21:31,825] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
E0927 20:21:54.924243 132676778354496 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -11) local_rank: 0 (pid: 1339036) of binary: /home/pr/conda/ENTER/envs/rlhf/bin/python3.8
Traceback (most recent call last):
  File "/home/pr/conda/ENTER/envs/rlhf/bin/accelerate", line 10, in <module>
    sys.exit(main())
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    deepspeed_launcher(args)
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_rm.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-27_20:21:54
  host      : ps.ps
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 1339036)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 1339036
============================================================
```
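
The worker died with SIGSEGV (exitcode -11) before Python could print a traceback, so the log above never says where the crash happened. As a generic first debugging step (not specific to MOSS-RLHF, and where exactly it goes in `train_rm.py` is an assumption), enabling `faulthandler` makes the interpreter dump each thread's Python stack when the signal arrives:

```python
# Generic SIGSEGV localization sketch: place near the very top of
# train_rm.py, before heavy imports like torch/deepspeed if possible.
import faulthandler

# On SIGSEGV/SIGFPE/SIGABRT/SIGBUS, print the Python stack of every
# thread to stderr before the process dies, so you can see which call
# (e.g. a native CUDA/NCCL/DeepSpeed op) was running at the crash.
faulthandler.enable(all_threads=True)
```

The same effect is available without editing the script by exporting `PYTHONFAULTHANDLER=1` in the environment before running `accelerate launch`.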

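Separately, the two FutureWarnings in the log are unrelated to the crash but easy to silence. A minimal sketch of the migrations they ask for, assuming the `Accelerator` is constructed somewhere in the training code with `split_batches=True` (the exact call sites in MOSS-RLHF are assumptions, and `HfDeepSpeedConfig` is only one example of an affected import):

```python
# Sketch of the two API migrations requested by the FutureWarnings above.
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# Old (deprecated, removed in Accelerate 1.0):
#   accelerator = Accelerator(split_batches=True)
# New: dataloader-related options go through DataLoaderConfiguration.
dataloader_config = DataLoaderConfiguration(split_batches=True)
accelerator = Accelerator(dataloader_config=dataloader_config)

# Old (deprecated):
#   from transformers.deepspeed import HfDeepSpeedConfig
# New: import DeepSpeed helpers from transformers.integrations instead.
from transformers.integrations import HfDeepSpeedConfig
```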