Hi,
Because I don't know how to use Slurm, I tried to directly run train_language_agent.py with the command given in lamorel:
python -m lamorel_launcher.launch --config-path /home/yanxue/Grounding/experiments/configs --config-name local_gpu_config rl_script_args.path=/home/yanxue/Grounding/experiments/train_language_agent.py
and my config is:
But I get the following error:
[2023-09-14 20:45:32,837][torch.distributed.elastic.multiprocessing.api][ERROR] - failed (exitcode: 1) local_rank: 0 (pid: 3946796) of binary: /home/yanxue/anaconda3/envs/dlp/bin/python
Error executing job with overrides: ['rl_script_args.path=/home/yanxue/Grounding/experiments/train_language_agent.py']
Traceback (most recent call last):
File "/home/yanxue/Grounding/lamorel/lamorel/src/lamorel_launcher/launch.py", line 46, in main
launch_command(accelerate_args)
File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 909, in launch_command
multi_gpu_launcher(args)
File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
distrib_run.run(args)
File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/yanxue/anaconda3/envs/dlp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/yanxue/Grounding/experiments/train_language_agent.py FAILED
Failures:
[1]:
time : 2023-09-14_20:45:32
host : taizun-R282-Z96-00
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3946797)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-09-14_20:45:32
host : taizun-R282-Z96-00
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3946796)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Could you kindly suggest why this error happens?
Concerning your first issue, the stack trace you provided is missing the real error, so I can't tell what happened. In any case, accelerate has some difficulties launching two processes on a single machine with only one GPU, which is why we provided a custom (now outdated) version of accelerate. Could you please try these two PRs (1, 2)? Or manually launch the two processes as shown in Lamorel's documentation.
Concerning your second issue, this is weird. Let me launch some experiments and find out what happens.
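Since the elastic launcher only reports exit codes here (error_file: <N/A>), one way to surface the real traceback is PyTorch's record decorator, as suggested by the link in the error output above. The sketch below is only an illustration: it assumes train_language_agent.py exposes a Hydra-decorated main() entry point, and the config_path/config_name values are placeholders to be replaced with the script's actual ones.

import hydra
from torch.distributed.elastic.multiprocessing.errors import record

# Placeholder Hydra settings; keep whatever train_language_agent.py already uses.
@hydra.main(config_path="configs", config_name="local_gpu_config")
@record  # writes the child's full traceback to an error file picked up by the launcher
def main(config_args):
    ...  # existing lamorel setup and training code

if __name__ == "__main__":
    main()

With @record in place, the ChildFailedError summary should point at an error file containing the underlying exception instead of error_file: <N/A>, which makes it possible to see what actually failed in each process.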