Hi,
Because I don't know how to use Slurm, I tried to directly run train_language_agent.py with the command given in lamorel:
python -m lamorel_launcher.launch --config-path /home/yanxue/Grounding/experiments/configs --config-name local_gpu_config rl_script_args.path=/home/yanxue/Grounding/experiments/train_language_agent.py
and my config is:
But I get the following error:
[2023-09-14 20:45:32,837][torch.distributed.elastic.multiprocessing.api][ERROR] - failed (exitcode: 1) local_rank: 0 (pid: 3946796) of binary: /home/yanxue/anaconda3/envs/dlp/bin/python
Error executing job with overrides: ['rl_script_args.path=/home/yanxue/Grounding/experiments/train_language_agent.py']
Traceback (most recent call last):
File "/home/yanxue/Grounding/lamorel/lamorel/src/lamorel_launcher/launch.py", line 46, in main
launch_command(accelerate_args)
File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 909, in launch_command
multi_gpu_launcher(args)
File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
distrib_run.run(args)
File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/yanxue/Grounding/experiments/train_language_agent.py FAILED
Failures:
[1]:
time : 2023-09-14_20:45:32
host : taizun-R282-Z96-00
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3946797)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-09-14_20:45:32
host : taizun-R282-Z96-00
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3946796)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Could you kindly suggest why this error happens?
Concerning your first issue, the stack trace you provided is missing the real error, so I can't tell what happened. In any case, accelerate has some difficulties launching two processes on a single machine with only one GPU, which is why we provided a custom (now outdated) version of accelerate. Could you please try these two PRs (1, 2)? Or manually launch the two processes as shown in Lamorel's documentation.
Concerning your second issue, this is weird. Let me launch some experiments and find out what happens.
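Since the elastic launcher only reports exit codes here (error_file: <N/A>), one way to surface the real traceback is PyTorch's record decorator, as suggested by the link in the error output above. The sketch below is only an illustration: it assumes train_language_agent.py exposes a Hydra-decorated main() entry point, and the config_path/config_name values are placeholders to be replaced with the script's actual ones.

import hydra
from torch.distributed.elastic.multiprocessing.errors import record

# Placeholder Hydra settings; keep whatever train_language_agent.py already uses.
@hydra.main(config_path="configs", config_name="local_gpu_config")
@record  # writes the child's full traceback to an error file picked up by the launcher
def main(config_args):
    ...  # existing lamorel setup and training code

if __name__ == "__main__":
    main()

With @record in place, the ChildFailedError summary should point at an error file containing the underlying exception instead of error_file: <N/A>, which makes it possible to see what actually failed in each process.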