
Problem with reward model training in the second paper #58

Open
Syaoran1 opened this issue Sep 27, 2024 · 0 comments

@Syaoran1
I want to reproduce the experiments from the second paper, but I've run into a problem that I don't know how to solve:
```
[2024-09-27 20:21:18,768] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-27 20:21:22,006] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
1
1
[2024-09-27 20:21:23,268] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-27 20:21:23,268] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py:457: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(split_batches=True)
  warnings.warn(
2024-09-27 20:21:23 - INFO - Load init model from /mnt/sda/pr/new_proj/MOSS-RLHF/models/llama-2-7b-hf
2024-09-27 20:21:23 - INFO - Loading tokenizer from huggingface: /mnt/sda/pr/new_proj/MOSS-RLHF/models/llama-2-7b-hf...
2024-09-27 20:21:23 - INFO - Llama tokenizer size: 32000
2024-09-27 20:21:23 - INFO - Llama tokenizer pad token: <unk>, pad_token_id: 0
2024-09-27 20:21:23 - INFO - Llama tokenizer. special token: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>'}

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.12it/s]
Some weights of LlamaRewardModel were not initialized from the model checkpoint at /mnt/sda/pr/new_proj/MOSS-RLHF/models/llama-2-7b-hf and are newly initialized: ['reward_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-09-27 20:21:31 - INFO - Got 151214 samples from /mnt/sda/pr/new_proj/MOSS-RLHF/data/data_clean/hh-rlhf-strength-cleaned/train.json
2024-09-27 20:21:31 - INFO - Got 151214 samples totally from ['train.json']
[2024-09-27 20:21:31,825] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.15.1, git-hash=unknown, git-branch=unknown
[2024-09-27 20:21:31,825] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
E0927 20:21:54.924243 132676778354496 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -11) local_rank: 0 (pid: 1339036) of binary: /home/pr/conda/ENTER/envs/rlhf/bin/python3.8
Traceback (most recent call last):
  File "/home/pr/conda/ENTER/envs/rlhf/bin/accelerate", line 10, in <module>
    sys.exit(main())
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    deepspeed_launcher(args)
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_rm.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-27_20:21:54
  host      : ps.ps
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 1339036)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 1339036
============================================================
```
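
The worker died with SIGSEGV (exitcode -11) before Python could print a traceback, so the log above never says where the crash happened. As a generic first debugging step (not specific to MOSS-RLHF, and where exactly it goes in `train_rm.py` is an assumption), enabling `faulthandler` makes the interpreter dump each thread's Python stack when the signal arrives:

```python
# Generic SIGSEGV localization sketch: place near the very top of
# train_rm.py, before heavy imports like torch/deepspeed if possible.
import faulthandler

# On SIGSEGV/SIGFPE/SIGABRT/SIGBUS, print the Python stack of every
# thread to stderr before the process dies, so you can see which call
# (e.g. a native CUDA/NCCL/DeepSpeed op) was running at the crash.
faulthandler.enable(all_threads=True)
```

The same effect is available without editing the script by exporting `PYTHONFAULTHANDLER=1` in the environment before running `accelerate launch`.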

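Separately, the two FutureWarnings in the log are unrelated to the crash but easy to silence. A minimal sketch of the migrations they ask for, assuming the `Accelerator` is constructed somewhere in the training code with `split_batches=True` (the exact call sites in MOSS-RLHF are assumptions, and `HfDeepSpeedConfig` is only one example of an affected import):

```python
# Sketch of the two API migrations requested by the FutureWarnings above.
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# Old (deprecated, removed in Accelerate 1.0):
#   accelerator = Accelerator(split_batches=True)
# New: dataloader-related options go through DataLoaderConfiguration.
dataloader_config = DataLoaderConfiguration(split_batches=True)
accelerator = Accelerator(dataloader_config=dataloader_config)

# Old (deprecated):
#   from transformers.deepspeed import HfDeepSpeedConfig
# New: import DeepSpeed helpers from transformers.integrations instead.
from transformers.integrations import HfDeepSpeedConfig
```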