Error when running bash scripts/generate.sh --input-source interactive. Please help! #214

Open
Eternal-Yan opened this issue Dec 20, 2023 · 1 comment

Comments

@Eternal-Yan

(glm130b) zdbp@zdbp-ThinkStation-P920:~/GLM-130B-main$ bash scripts/generate.sh --input-source interactive
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING]
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:31:38,081] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,121] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,198] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,205] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,225] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,250] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,261] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,294] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
WARNING: No training data specified
WARNING: No training data specified
WARNING: No training data specified
using world size: 8 and model-parallel size: 8

padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
WARNING: No training data specified
initializing model parallel with size 8
WARNING: No training data specified
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[2023-12-20 12:31:45,421] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 199262 closing signal SIGTERM
[2023-12-20 12:31:45,421] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 199263 closing signal SIGTERM
[2023-12-20 12:31:45,421] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 199265 closing signal SIGTERM
[2023-12-20 12:31:45,600] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 199266) of binary: /home/zdbp/anaconda3/envs/glm130b/bin/python
Traceback (most recent call last):
File "/home/zdbp/anaconda3/envs/glm130b/bin/torchrun", line 8, in
sys.exit(main())
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/zdbp/PengJian/GLM-130B-main/generate.py FAILED

Failures:
[1]:
time : 2023-12-20_12:31:45
host : zdbp-ThinkStation-P920
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 199267)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-12-20_12:31:45
host : zdbp-ThinkStation-P920
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 199268)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-12-20_12:31:45
host : zdbp-ThinkStation-P920
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 199269)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2023-12-20_12:31:45
host : zdbp-ThinkStation-P920
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 199270)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-12-20_12:31:45
host : zdbp-ThinkStation-P920
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 199266)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
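
For context on the failure above: "RuntimeError: CUDA error: invalid device ordinal" means torch.cuda.set_device() was asked for a GPU index that does not exist on this machine, while the script launches with world size 8 and model-parallel size 8, i.e. one GPU per rank. A minimal check of how many devices are actually visible, assuming nothing beyond plain PyTorch:

import torch

# Print the CUDA devices this environment can actually see.
# The default GLM-130B generation setup expects 8 model-parallel ranks,
# one GPU per rank; fewer visible devices than ranks leads to
# "invalid device ordinal" inside torch.cuda.set_device().
visible = torch.cuda.device_count()
print(f"visible CUDA devices: {visible}")
for i in range(visible):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")

If this reports fewer than 8 devices, the model-parallel size used by scripts/generate.sh (and the checkpoint's parallel layout) would need to match the available hardware, or CUDA_VISIBLE_DEVICES would need to expose enough GPUs.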

(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ pip install torchrun
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
ERROR: Could not find a version that satisfies the requirement torchrun (from versions: none)
ERROR: No matching distribution found for torchrun
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ bash scripts/generate.sh --input-source interactive
python: can't open file '/home/zdbp/PengJian/GLM-130B-main/8': [Errno 2] No such file or directory
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ pip install bminf
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting bminf
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1b/9b/56bbb3f30672e11e64ab0da315459f65d5ae8608e379a41ea6ef442dffb6/bminf-2.0.1-py3-none-any.whl (52 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.3/52.3 kB 690.4 kB/s eta 0:00:00
Requirement already satisfied: torch in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from bminf) (2.1.1+cu121)
Requirement already satisfied: cpm-kernels>=1.0.9 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from bminf) (1.0.11)
Requirement already satisfied: typing-extensions in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from bminf) (4.9.0)
Requirement already satisfied: filelock in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (3.9.0)
Requirement already satisfied: sympy in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (1.12)
Requirement already satisfied: networkx in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (3.0)
Requirement already satisfied: jinja2 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (3.1.2)
Requirement already satisfied: fsspec in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (2023.10.0)
Requirement already satisfied: triton==2.1.0 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (2.1.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from jinja2->torch->bminf) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from sympy->torch->bminf) (1.3.0)
Installing collected packages: bminf
Successfully installed bminf-2.0.1
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ bash scripts/generate.sh --input-source interactive
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING]
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:36:34,707] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:34,756] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:34,961] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,021] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,036] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,073] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,147] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,153] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
WARNING: No training data specified
using world size: 8 and model-parallel size: 8

padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
initializing model parallel with size 8
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in
args = initialize(extra_args_provider=add_generation_specific_args)
File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
args = get_args(args_list)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
initialize_distributed(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
torch.cuda.set_device(args.device)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/init.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[2023-12-20 12:36:42,265] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 209418 closing signal SIGTERM
[2023-12-20 12:36:42,265] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 209419 closing signal SIGTERM
[2023-12-20 12:36:42,266] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 209420 closing signal SIGTERM
[2023-12-20 12:36:42,431] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 209422) of binary: /home/zdbp/anaconda3/envs/glm130b/bin/python
Traceback (most recent call last):
File "/home/zdbp/anaconda3/envs/glm130b/bin/torchrun", line 8, in
sys.exit(main())
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/zdbp/PengJian/GLM-130B-main/generate.py FAILED

Failures:
[1]:
time : 2023-12-20_12:36:42
host : zdbp-ThinkStation-P920
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 209423)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-12-20_12:36:42
host : zdbp-ThinkStation-P920
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 209424)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-12-20_12:36:42
host : zdbp-ThinkStation-P920
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 209425)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2023-12-20_12:36:42
host : zdbp-ThinkStation-P920
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 209426)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-12-20_12:36:42
host : zdbp-ThinkStation-P920
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 209422)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
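
Reading the two runs together: torchrun spawns 8 worker processes (world size 8, model-parallel size 8) and each rank binds to the GPU matching its local rank via torch.cuda.set_device(); local ranks 3-7 crash with "invalid device ordinal" while ranks 0-2 are merely sent SIGTERM, which suggests fewer than 8 GPUs are visible to the job. A minimal sketch of that failure mode, with a hypothetical guard (illustration only, not the project's code):

import torch

def bind_rank_to_gpu(local_rank: int) -> None:
    # Roughly what each launched rank does during distributed initialization:
    # select the GPU whose index equals the rank's local_rank.
    n = torch.cuda.device_count()
    if local_rank >= n:
        # On a box with fewer GPUs than ranks, this is where
        # "RuntimeError: CUDA error: invalid device ordinal" originates.
        raise RuntimeError(f"local_rank {local_rank}, but only {n} visible GPU(s)")
    torch.cuda.set_device(local_rank)

# torchrun --nproc_per_node=8 creates local ranks 0..7.
for rank in range(8):
    try:
        bind_rank_to_gpu(rank)
        print(f"rank {rank}: bound to cuda:{rank}")
    except RuntimeError as err:
        print(f"rank {rank}: {err}")

Installing bminf on its own does not change how many ranks the script launches, which is why the second run fails in exactly the same way as the first.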

@dahaobenhao

Has this been resolved? I'm running into exactly the same error. Please help!
