'AdamW' object has no attribute 'optim_bits' #2191

Open

e-p-armstrong opened this issue Dec 15, 2024 · 8 comments
Labels: bug (Something isn't working), waiting on upstream, wip

Comments

@e-p-armstrong

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Full-parameter ChatML finetuning of Llama 3.1 should work with DeepSpeed on the main:latest Docker image on RunPod with 6x A40s.

Current behaviour

Training never gets a chance to start:

Stacktrace:

[2024-12-08 22:54:15,191] [INFO] [axolotl.load_model:1115] [PID:13086] [RANK:2] Converting modules to torch.bfloat16
[2024-12-08 22:54:15,296] [INFO] [axolotl.load_model:1082] [PID:13084] [RANK:0] cuda memory usage after model load: 14.958GB (+0.126GB cache, +1.099GB misc)
[2024-12-08 22:54:15,306] [INFO] [axolotl.load_model:1115] [PID:13084] [RANK:0] Converting modules to torch.bfloat16
[rank3]: Traceback (most recent call last):
[rank3]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank3]:   File "<frozen runpy>", line 88, in _run_code
[rank3]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 58, in <module>
[rank3]:     fire.Fire(do_cli)
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank3]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank3]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank3]:     component, remaining_args = _CallAndUpdateTrace(
[rank3]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank3]:     component = fn(*varargs, **kwargs)
[rank3]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
[rank3]:     return do_train(parsed_cfg, parsed_cli_args)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 47, in do_train
[rank3]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank3]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/axolotl/src/axolotl/train.py", line 192, in train
[rank3]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
[rank3]:     return inner_training_loop(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2275, in _inner_training_loop
[rank3]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank3]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank3]:     result = self._prepare_deepspeed(*args)
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1843, in _prepare_deepspeed
[rank3]:     optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank3]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 53, in map_pytorch_optim_to_deepspeed
[rank3]:     is_adaw = isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
[rank3]:                                                                              ^^^^^^^^^^^^^^^^^^^^
[rank3]: AttributeError: 'AdamW' object has no attribute 'optim_bits'
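
For reference, the failing line (quoted verbatim in the traceback) is an isinstance-plus-attribute check inside accelerate's map_pytorch_optim_to_deepspeed. Below is a minimal sketch of a defensive variant that would not raise; the getattr fallback is my illustration of how the crash could be avoided, not the actual accelerate code or the eventual upstream fix.

# Hedged sketch: a defensive version of the check that raised the
# AttributeError above. The getattr fallback is illustrative only.
import bitsandbytes.optim as bnb_opt

def is_32bit_bnb_adamw(optimizer) -> bool:
    # Treat the optimizer as 32-bit AdamW only when optim_bits actually
    # exists on the instance; variants without the attribute (like the
    # AdamW instance created here for paged_adamw_8bit) return False
    # instead of raising.
    return (
        isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit))
        and getattr(optimizer, "optim_bits", None) == 32
    )

The traceback shows the instance is a bitsandbytes AdamW that simply has no optim_bits attribute, which is exactly what the unguarded optimizer.optim_bits == 32 comparison assumes.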

This issue has been around for about a week now; I first reported it on the Discord.

Steps to reproduce

Attempt a full finetune of Llama 3.1 using the settings provided (you will need to add a generic ChatML dataset, as I had to redact my data files).

Config yaml

base_model: Heralax/private-llama3.1-model-whose-name-is-censored
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false

datasets: # data files had to be redacted, sorry
  
  

dataset_prepared_path: last_run_prepared-ft-lowerbatchsize
output_dir: ./out

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true
shuffle_merged_datasets: true

wandb_project: llama_3.1_8b
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 7 # meant for use on 6 GPUs to achieve same effective batch size as earlier. Swapped # GPUs and Grad accumulation steps.
micro_batch_size: 2
eval_batch_size: 1
num_epochs: 4
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.000012
weight_decay: 0
# Gradient clipping max norm
max_grad_norm: 1.0
noisy_embedding_alpha: 5
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: unsloth
early_stopping_patience:
resume_from_checkpoint: 
logging_steps: 1
xformers_attention:
flash_attention: true

chat_template: chatml

warmup_ratio: 0.5
auto_resume_from_checkpoints: false
#warmup_ratio: 0.5
eval_steps: 10
saves_per_epoch: 1
eval_sample_packing: false
save_total_limit: 2
debug:
deepspeed: deepspeed_configs/zero2.json
special_tokens:
  pad_token: "<|end_of_text|>"


Possible solution

Rolling back to axolotlai/axolotl-cloud:main-20241129-py3.11-cu124-2.4.1 lets me train again. Unfortunately, the pinned nightly image I was relying on (winglian/axolotl-cloud:main-20241124) no longer lets me connect: direct SSH does not appear as an option, and when I try to go through the proxy it hangs and then reports that the container is not running. This has happened for every winglian/axolotl-cloud image I have tried to run since around 12/8/24, but that is a separate issue.

Which Operating Systems are you using?

- [X] Linux
- [ ] macOS
- [ ] Windows

Python Version

Whatever version the main:latest image comes with.

axolotl branch-commit

main / whatever the most recent Docker image update comes with

Acknowledgements

- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
e-p-armstrong added the bug (Something isn't working) label on Dec 15, 2024
@winglian
Collaborator

Could you try with the regular adamw_8bit optimizer please?

@e-p-armstrong
Author

OK, I will try with that and get back to you.

@bursteratom
Collaborator

@e-p-armstrong @winglian Looks like the issue is with accelerate. I found that downgrading accelerate to version 1.0.1 bypasses this error for now. Will follow up upstream with accelerate.

@bursteratom
Collaborator

This issue seems to only affect zero2. Zero3 works fine.

@e-p-armstrong
Author

@winglian Reproduced with a different optimizer and it happened even with DPO tuning.

pytorch_model.bin.index.json: 100%|_______________________________________________| 23.9k/23.9k [00:00<00:00, 80.9MB/s]
pytorch_model-00001-of-00002.bin: 100%|____________________________________________| 16.1G/16.1G [00:34<00:00, 465MB/s]
pytorch_model-00002-of-00002.bin: 100%|_____________________________________________| 542k/542k [00:00<00:00, 77.7MB/s]
Downloading shards: 100%|________________________________________________________________| 2/2 [00:35<00:00, 17.53s/it]
Downloading shards: 100%|________________________________________________________________| 2/2 [00:34<00:00, 17.48s/it]
Downloading shards: 100%|________________________________________________________________| 2/2 [00:35<00:00, 17.54s/it]
Loading checkpoint shards: 100%|_________________________________________________________| 2/2 [00:04<00:00,  2.03s/it]
generation_config.json: 100%|_________________________________________________________| 180/180 [00:00<00:00, 1.10MB/s]
[2024-12-18 21:31:27,225] [INFO] [axolotl.load_model:1077] [PID:1521] [RANK:0] cuda memory usage after model load: 14.958GB (+0.126GB cache, +1.447GB misc)
[2024-12-18 21:31:27,229] [INFO] [axolotl.load_model:1110] [PID:1521] [RANK:0] Converting modules to torch.bfloat16
Loading checkpoint shards: 100%|_________________________________________________________| 2/2 [00:05<00:00,  2.56s/it]
Loading checkpoint shards: 100%|_________________________________________________________| 2/2 [00:05<00:00,  2.56s/it]
[2024-12-18 21:31:28,223] [INFO] [axolotl.load_model:1077] [PID:1523] [RANK:2] cuda memory usage after model load: 14.958GB (+0.126GB cache, +1.447GB misc)
[2024-12-18 21:31:28,227] [INFO] [axolotl.load_model:1110] [PID:1523] [RANK:2] Converting modules to torch.bfloat16
/workspace/axolotl/src/axolotl/core/trainer_builder.py:446: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `AxolotlTrainer.__init__`. Use `processing_class` instead.
  super().__init__(*_args, **kwargs)
[2024-12-18 21:31:28,828] [INFO] [axolotl.train.train:174] [PID:1521] [RANK:0] Starting trainer...
[2024-12-18 21:31:28,892] [INFO] [axolotl.load_model:1077] [PID:1522] [RANK:1] cuda memory usage after model load: 14.958GB (+0.126GB cache, +1.447GB misc)
[2024-12-18 21:31:28,896] [INFO] [axolotl.load_model:1110] [PID:1522] [RANK:1] Converting modules to torch.bfloat16
/workspace/axolotl/src/axolotl/core/trainer_builder.py:446: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `AxolotlTrainer.__init__`. Use `processing_class` instead.
  super().__init__(*_args, **kwargs)
/workspace/axolotl/src/axolotl/core/trainer_builder.py:446: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `AxolotlTrainer.__init__`. Use `processing_class` instead.
  super().__init__(*_args, **kwargs)
[2024-12-18 21:31:30,383] [INFO] [axolotl.utils.samplers.multipack.calc_min_len:197] [PID:1521] [RANK:0] gather_len_batches: [580, 580, 580]
[rank1]: Traceback (most recent call last):
[rank1]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank1]:   File "<frozen runpy>", line 88, in _run_code
[rank1]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 58, in <module>
[rank1]:     fire.Fire(do_cli)
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
[rank1]:     return do_train(parsed_cfg, parsed_cli_args)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 47, in do_train
[rank1]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank1]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/axolotl/src/axolotl/train.py", line 188, in train
[rank1]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2323, in _inner_training_loop
[rank1]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank1]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank1]:     result = self._prepare_deepspeed(*args)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1843, in _prepare_deepspeed
[rank1]:     optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 53, in map_pytorch_optim_to_deepspeed
[rank1]:     is_adaw = isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
[rank1]:                                                                              ^^^^^^^^^^^^^^^^^^^^
[rank1]: AttributeError: 'AdamW' object has no attribute 'optim_bits'
[rank2]: Traceback (most recent call last):
[rank2]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank2]:   File "<frozen runpy>", line 88, in _run_code
[rank2]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 58, in <module>
[rank2]:     fire.Fire(do_cli)
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank2]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank2]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank2]:     component, remaining_args = _CallAndUpdateTrace(
[rank2]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank2]:     component = fn(*varargs, **kwargs)
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
[rank2]:     return do_train(parsed_cfg, parsed_cli_args)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 47, in do_train
[rank2]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank2]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/axolotl/src/axolotl/train.py", line 188, in train
[rank2]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank2]:     return inner_training_loop(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2323, in _inner_training_loop
[rank2]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank2]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank2]:     result = self._prepare_deepspeed(*args)
[rank2]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1843, in _prepare_deepspeed
[rank2]:     optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 53, in map_pytorch_optim_to_deepspeed
[rank2]:     is_adaw = isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
[rank2]:                                                                              ^^^^^^^^^^^^^^^^^^^^
[rank2]: AttributeError: 'AdamW' object has no attribute 'optim_bits'
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 58, in <module>
[rank0]:     fire.Fire(do_cli)
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
[rank0]:     return do_train(parsed_cfg, parsed_cli_args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 47, in do_train
[rank0]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/axolotl/src/axolotl/train.py", line 188, in train
[rank0]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2323, in _inner_training_loop
[rank0]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank0]:     result = self._prepare_deepspeed(*args)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1843, in _prepare_deepspeed
[rank0]:     optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 53, in map_pytorch_optim_to_deepspeed
[rank0]:     is_adaw = isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
[rank0]:                                                                              ^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'AdamW' object has no attribute 'optim_bits'
W1218 21:31:33.244000 140323527751488 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1521 closing signal SIGTERM
W1218 21:31:33.245000 140323527751488 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1523 closing signal SIGTERM
E1218 21:31:33.390000 140323527751488 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 1522) of binary: /root/miniconda3/envs/py3.11/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.11/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1153, in launch_command
    deepspeed_launcher(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 846, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-18_21:31:33
  host      : b6131755c915
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1522)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@bursteratom Thanks for tracking down the problem!

@bursteratom
Collaborator

bursteratom commented Dec 18, 2024

@e-p-armstrong In the meantime, I recommend keeping the current accelerate version (1.2.1) and using zero3 instead of zero2.
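
For anyone else hitting this, a quick way to confirm which accelerate version a container actually ships (and so which workaround applies) is a small probe like the sketch below. The 1.0.1 cutoff is taken from this thread, not a confirmed first-bad release, so treat it as an assumption.

# Hedged sketch: report the installed accelerate version so you know whether
# the 1.0.1 downgrade mentioned earlier is in effect. The cutoff is an
# assumption based on this thread, not a confirmed first-bad release.
from importlib.metadata import version

from packaging.version import Version

installed = Version(version("accelerate"))
print(f"accelerate {installed} is installed")
if installed > Version("1.0.1"):
    print("Newer than 1.0.1: either pin accelerate==1.0.1 or point the "
          "axolotl config's deepspeed key at a ZeRO-3 json instead of zero2.")

Switching to ZeRO-3 just means changing the deepspeed: entry in the config above from deepspeed_configs/zero2.json to the corresponding zero3 config.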

@bursteratom
Collaborator

@e-p-armstrong @winglian I started an upstream PR on accelerate to fix this: huggingface/accelerate#3305

