I'd like to reproduce the experiments from the second paper, but I've run into a problem that I don't know how to solve. The full log from the failed run is below:
[2024-09-27 20:21:18,768] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-27 20:21:22,006] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
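Note: the FutureWarning above only concerns where the DeepSpeed integration is imported from, not the crash itself. A minimal sketch of the change it asks for (HfDeepSpeedConfig is just an illustrative symbol; the actual import used by the training script may differ):

```python
# Deprecated location flagged by the warning:
#   from transformers.deepspeed import HfDeepSpeedConfig
# Location the warning recommends instead:
from transformers.integrations import HfDeepSpeedConfig
```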
1
1
[2024-09-27 20:21:23,268] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-09-27 20:21:23,268] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py:457: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['split_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(split_batches=True)
warnings.warn(
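This second FutureWarning spells out its own fix; a minimal sketch of the non-deprecated form, assuming the other Accelerator arguments stay unchanged:

```python
# Move `split_batches` into a DataLoaderConfiguration instead of passing it
# to Accelerator directly, as the warning suggests.
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

dataloader_config = DataLoaderConfiguration(split_batches=True)
accelerator = Accelerator(dataloader_config=dataloader_config)  # other arguments unchanged
```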
2024-09-27 20:21:23 - INFO - Load init model from /mnt/sda/pr/new_proj/MOSS-RLHF/models/llama-2-7b-hf
2024-09-27 20:21:23 - INFO - Loading tokenizer from huggingface: /mnt/sda/pr/new_proj/MOSS-RLHF/models/llama-2-7b-hf...
2024-09-27 20:21:23 - INFO - Llama tokenizer size: 32000
2024-09-27 20:21:23 - INFO - Llama tokenizer pad token: <unk>, pad_token_id: 0
2024-09-27 20:21:23 - INFO - Llama tokenizer. special token: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>'}
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:01<00:01, 1.27s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.21it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.12it/s]
Some weights of LlamaRewardModel were not initialized from the model checkpoint at /mnt/sda/pr/new_proj/MOSS-RLHF/models/llama-2-7b-hf and are newly initialized: ['reward_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
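This warning is expected when a base llama-2-7b-hf checkpoint is loaded into a reward-model class: the scalar reward head does not exist in the checkpoint, so it starts from random weights. A minimal sketch of what such a head typically looks like (not copied from the MOSS-RLHF source; names and details are assumptions):

```python
# Sketch of why `reward_head.weight` is "newly initialized": the reward model
# wraps the base LM from the checkpoint and adds one scalar projection on top.
import torch.nn as nn
from transformers import LlamaModel, LlamaPreTrainedModel

class LlamaRewardModelSketch(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = LlamaModel(config)                                  # loaded from the checkpoint
        self.reward_head = nn.Linear(config.hidden_size, 1, bias=False)  # not in the checkpoint -> random init

    def forward(self, input_ids, attention_mask=None):
        hidden = self.model(input_ids, attention_mask=attention_mask).last_hidden_state
        # one scalar reward per sequence, read off the final position
        return self.reward_head(hidden[:, -1, :]).squeeze(-1)
```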
2024-09-27 20:21:31 - INFO - Got 151214 samples from /mnt/sda/pr/new_proj/MOSS-RLHF/data/data_clean/hh-rlhf-strength-cleaned/train.json
2024-09-27 20:21:31 - INFO - Got 151214 samples totally from ['train.json']
[2024-09-27 20:21:31,825] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.15.1, git-hash=unknown, git-branch=unknown
[2024-09-27 20:21:31,825] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
E0927 20:21:54.924243 132676778354496 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -11) local_rank: 0 (pid: 1339036) of binary: /home/pr/conda/ENTER/envs/rlhf/bin/python3.8
Traceback (most recent call last):
File "/home/pr/conda/ENTER/envs/rlhf/bin/accelerate", line 10, in
sys.exit(main())
File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
deepspeed_launcher(args)
File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/accelerate/commands/launch.py", line 852, in deepspeed_launcher
distrib_run.run(args)
File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/pr/conda/ENTER/envs/rlhf/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_rm.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-09-27_20:21:54
host : ps.ps
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 1339036)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 1339036
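Exit code -11 means the child training process (train_rm.py) itself received SIGSEGV; the accelerate/torch.distributed frames above are just the launcher reporting that the worker died. As a first diagnostic (my addition, not part of the original log), printing the exact library and CUDA versions can help, since mismatched torch/DeepSpeed/CUDA builds are a common cause of segfaults during engine setup:

```python
# Diagnostic suggestion: report the versions relevant to a SIGSEGV in DeepSpeed setup.
import torch, deepspeed, transformers, accelerate

print("torch       :", torch.__version__, "| built for CUDA", torch.version.cuda)
print("deepspeed   :", deepspeed.__version__)
print("transformers:", transformers.__version__)
print("accelerate  :", accelerate.__version__)
if torch.cuda.is_available():
    print("GPU         :", torch.cuda.get_device_name(0))
```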