'AdamW' object has no attribute 'optim_bits' #2191

e-p-armstrong · 2024-12-15T04:32:22Z

Please check that this issue hasn't been reported before.

I searched previous Bug Reports didn't find any similar reports.

Expected Behavior

Full-parameter chatml finetuning of Llama 3.1 should work on the main:latest docker image on runpod on 6x A40s with deepspeed

Current behaviour

Training never gets a chance to start:

Stacktrace:

[2024-12-08 22:54:15,191] [INFO] [axolotl.load_model:1115] [PID:13086] [RANK:2] Converting modules to torch.bfloat16
[2024-12-08 22:54:15,296] [INFO] [axolotl.load_model:1082] [PID:13084] [RANK:0] cuda memory usage after model load: 14.958GB (+0.126GB cache, +1.099GB misc)
[2024-12-08 22:54:15,306] [INFO] [axolotl.load_model:1115] [PID:13084] [RANK:0] Converting modules to torch.bfloat16
[rank3]: Traceback (most recent call last):
[rank3]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank3]:   File "<frozen runpy>", line 88, in _run_code
[rank3]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 58, in <module>
[rank3]:     fire.Fire(do_cli)
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank3]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank3]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank3]:     component, remaining_args = _CallAndUpdateTrace(
[rank3]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank3]:     component = fn(*varargs, **kwargs)
[rank3]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
[rank3]:     return do_train(parsed_cfg, parsed_cli_args)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 47, in do_train
[rank3]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank3]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/workspace/axolotl/src/axolotl/train.py", line 192, in train
[rank3]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
[rank3]:     return inner_training_loop(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2275, in _inner_training_loop
[rank3]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank3]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank3]:     result = self._prepare_deepspeed(*args)
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1843, in _prepare_deepspeed
[rank3]:     optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank3]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 53, in map_pytorch_optim_to_deepspeed
[rank3]:     is_adaw = isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
[rank3]:                                                                              ^^^^^^^^^^^^^^^^^^^^
[rank3]: AttributeError: 'AdamW' object has no attribute 'optim_bits'

This issue has been around for about a week now? I first reported it on the Discord.

Steps to reproduce

Attempt to full finetune llama 3 using the settings provided (need to add some generic chatml dataset as I had to redact my data files)

Config yaml

base_model: Heralax/private-llama3.1-model-whose-name-is-censored
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false

datasets: # data files have to be redacted sorry
  
  

dataset_prepared_path: last_run_prepared-ft-lowerbatchsize
output_dir: ./out

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true
shuffle_merged_datasets: true

wandb_project: llama_3.1_8b
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 7 # meant for use on 6 GPUs to achieve same effective batch size as earlier. Swapped # GPUs and Grad accumulation steps.
micro_batch_size: 2
eval_batch_size: 1
num_epochs: 4
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.000012
weight_decay: 0
# Gradient clipping max norm``
max_grad_norm: 1.0
noisy_embedding_alpha: 5
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: unsloth
early_stopping_patience:
resume_from_checkpoint: 
logging_steps: 1
xformers_attention:
flash_attention: true

chat_template: chatml

warmup_ratio: 0.5
auto_resume_from_checkpoints: false
#warmup_ratio: 0.5
eval_steps: 10
saves_per_epoch: 1
eval_sample_packing: false
save_total_limit: 2
debug:
deepspeed: deepspeed_configs/zero2.json
special_tokens:
  pad_token: "<|end_of_text|>"



### Possible solution

Rolling back to axolotlai/axolotl-cloud:main-20241129-py3.11-cu124-2.4.1 lets me train again. Unfortunately, the pinned nightly version I was relying on (winglian/axolotl-cloud:main-20241124) no longer lets me connect. by that I mean, direct SSH connection does not appear as an option and when I try to go through the proxy it hangs and then tells me the container is not running. This has happened for all winglian/axolotl-cloud images I have tried to run sometime after the date 12/8/24, but that is a separate issue.

### Which Operating Systems are you using?

- [X] Linux
- [ ] macOS
- [ ] Windows

### Python Version

whatever version the main latest comes with.

### axolotl branch-commit

main/whatever the most recent docker image update comes with

### Acknowledgements

- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

The text was updated successfully, but these errors were encountered:

winglian · 2024-12-15T17:56:00Z

Could you try with the regular adamw_8bit optimizer please?

e-p-armstrong · 2024-12-16T00:49:38Z

OK I will try with that and get back to you

bursteratom · 2024-12-18T17:24:36Z

@e-p-armstrong @winglian Looks like the issue is with accelerate. I find that downgrading accelerate to version 1.0.1 bypass this error for now. Will follow up on accelerate upstream

bursteratom · 2024-12-18T17:25:26Z

This issue seems to only affect zero2. Zero3 works fine.

e-p-armstrong · 2024-12-18T21:59:57Z

@winglian Reproduced with a different optimizer and it happened even with DPO tuning.

pytorch_model.bin.index.json: 100%|_______________________________________________| 23.9k/23.9k [00:00<00:00, 80.9MB/s]
pytorch_model-00001-of-00002.bin: 100%|____________________________________________| 16.1G/16.1G [00:34<00:00, 465MB/s]
pytorch_model-00002-of-00002.bin: 100%|_____________________________________________| 542k/542k [00:00<00:00, 77.7MB/s]
Downloading shards: 100%|________________________________________________________________| 2/2 [00:35<00:00, 17.53s/it]
Downloading shards: 100%|________________________________________________________________| 2/2 [00:34<00:00, 17.48s/it]
Downloading shards: 100%|________________________________________________________________| 2/2 [00:35<00:00, 17.54s/it]
Loading checkpoint shards: 100%|_________________________________________________________| 2/2 [00:04<00:00,  2.03s/it]
generation_config.json: 100%|_________________________________________________________| 180/180 [00:00<00:00, 1.10MB/s]
[2024-12-18 21:31:27,225] [INFO] [axolotl.load_model:1077] [PID:1521] [RANK:0] cuda memory usage after model load: 14.958GB (+0.126GB cache, +1.447GB misc)
[2024-12-18 21:31:27,229] [INFO] [axolotl.load_model:1110] [PID:1521] [RANK:0] Converting modules to torch.bfloat16
Loading checkpoint shards: 100%|_________________________________________________________| 2/2 [00:05<00:00,  2.56s/it]
Loading checkpoint shards: 100%|_________________________________________________________| 2/2 [00:05<00:00,  2.56s/it]
[2024-12-18 21:31:28,223] [INFO] [axolotl.load_model:1077] [PID:1523] [RANK:2] cuda memory usage after model load: 14.958GB (+0.126GB cache, +1.447GB misc)
[2024-12-18 21:31:28,227] [INFO] [axolotl.load_model:1110] [PID:1523] [RANK:2] Converting modules to torch.bfloat16
/workspace/axolotl/src/axolotl/core/trainer_builder.py:446: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `AxolotlTrainer.__init__`. Use `processing_class` instead.
  super().__init__(*_args, **kwargs)
[2024-12-18 21:31:28,828] [INFO] [axolotl.train.train:174] [PID:1521] [RANK:0] Starting trainer...
[2024-12-18 21:31:28,892] [INFO] [axolotl.load_model:1077] [PID:1522] [RANK:1] cuda memory usage after model load: 14.958GB (+0.126GB cache, +1.447GB misc)
[2024-12-18 21:31:28,896] [INFO] [axolotl.load_model:1110] [PID:1522] [RANK:1] Converting modules to torch.bfloat16
/workspace/axolotl/src/axolotl/core/trainer_builder.py:446: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `AxolotlTrainer.__init__`. Use `processing_class` instead.
  super().__init__(*_args, **kwargs)
/workspace/axolotl/src/axolotl/core/trainer_builder.py:446: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `AxolotlTrainer.__init__`. Use `processing_class` instead.
  super().__init__(*_args, **kwargs)
[2024-12-18 21:31:30,383] [INFO] [axolotl.utils.samplers.multipack.calc_min_len:197] [PID:1521] [RANK:0] gather_len_batches: [580, 580, 580]
[rank1]: Traceback (most recent call last):
[rank1]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank1]:   File "<frozen runpy>", line 88, in _run_code
[rank1]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 58, in <module>
[rank1]:     fire.Fire(do_cli)
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
[rank1]:     return do_train(parsed_cfg, parsed_cli_args)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 47, in do_train
[rank1]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank1]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/axolotl/src/axolotl/train.py", line 188, in train
[rank1]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2323, in _inner_training_loop
[rank1]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank1]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank1]:     result = self._prepare_deepspeed(*args)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1843, in _prepare_deepspeed
[rank1]:     optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 53, in map_pytorch_optim_to_deepspeed
[rank1]:     is_adaw = isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
[rank1]:                                                                              ^^^^^^^^^^^^^^^^^^^^
[rank1]: AttributeError: 'AdamW' object has no attribute 'optim_bits'
[rank2]: Traceback (most recent call last):
[rank2]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank2]:   File "<frozen runpy>", line 88, in _run_code
[rank2]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 58, in <module>
[rank2]:     fire.Fire(do_cli)
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank2]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank2]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank2]:     component, remaining_args = _CallAndUpdateTrace(
[rank2]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank2]:     component = fn(*varargs, **kwargs)
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
[rank2]:     return do_train(parsed_cfg, parsed_cli_args)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 47, in do_train
[rank2]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank2]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/axolotl/src/axolotl/train.py", line 188, in train
[rank2]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank2]:     return inner_training_loop(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2323, in _inner_training_loop
[rank2]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank2]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank2]:     result = self._prepare_deepspeed(*args)
[rank2]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1843, in _prepare_deepspeed
[rank2]:     optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 53, in map_pytorch_optim_to_deepspeed
[rank2]:     is_adaw = isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
[rank2]:                                                                              ^^^^^^^^^^^^^^^^^^^^
[rank2]: AttributeError: 'AdamW' object has no attribute 'optim_bits'
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 58, in <module>
[rank0]:     fire.Fire(do_cli)
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
[rank0]:     return do_train(parsed_cfg, parsed_cli_args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 47, in do_train
[rank0]:     model, tokenizer = train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/axolotl/src/axolotl/train.py", line 188, in train
[rank0]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2323, in _inner_training_loop
[rank0]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank0]:     result = self._prepare_deepspeed(*args)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/accelerator.py", line 1843, in _prepare_deepspeed
[rank0]:     optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 53, in map_pytorch_optim_to_deepspeed
[rank0]:     is_adaw = isinstance(optimizer, (bnb_opt.AdamW, bnb_opt.AdamW32bit)) and optimizer.optim_bits == 32
[rank0]:                                                                              ^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'AdamW' object has no attribute 'optim_bits'
W1218 21:31:33.244000 140323527751488 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1521 closing signal SIGTERM
W1218 21:31:33.245000 140323527751488 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1523 closing signal SIGTERM
E1218 21:31:33.390000 140323527751488 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 1522) of binary: /root/miniconda3/envs/py3.11/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.11/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1153, in launch_command
    deepspeed_launcher(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 846, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-18_21:31:33
  host      : b6131755c915
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1522)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@bursteratom Thanks for searching to find the problem!

bursteratom · 2024-12-18T22:13:02Z

@e-p-armstrong In the meantime I recommend keeping the current accelerate version 1.2.1 while using zero3 instead of zero2

bursteratom · 2024-12-19T00:19:33Z

@e-p-armstrong @winglian started an upstream PR on accelerate to fix this: huggingface/accelerate#3305

bursteratom · 2024-12-23T05:43:30Z

huggingface/accelerate#3311

e-p-armstrong added the bug Something isn't working label Dec 15, 2024

bursteratom mentioned this issue Dec 19, 2024

Fix 'AdamW' object has no attribute 'optim_bits' error when deepspeed zero2 is enabled huggingface/accelerate#3305

Open

bursteratom self-assigned this Dec 19, 2024

bursteratom added waiting on upstream wip labels Dec 19, 2024

e-p-armstrong mentioned this issue Dec 28, 2024

Very High Loss (~15) and Instability with Previously-working Config From A While Ago #2224

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

'AdamW' object has no attribute 'optim_bits' #2191

'AdamW' object has no attribute 'optim_bits' #2191

e-p-armstrong commented Dec 15, 2024

winglian commented Dec 15, 2024

e-p-armstrong commented Dec 16, 2024

bursteratom commented Dec 18, 2024

bursteratom commented Dec 18, 2024

e-p-armstrong commented Dec 18, 2024

bursteratom commented Dec 18, 2024 •

edited

Loading

bursteratom commented Dec 19, 2024

bursteratom commented Dec 23, 2024

'AdamW' object has no attribute 'optim_bits' #2191

'AdamW' object has no attribute 'optim_bits' #2191

Comments

e-p-armstrong commented Dec 15, 2024

Please check that this issue hasn't been reported before.

Expected Behavior

Current behaviour

Steps to reproduce

Config yaml

winglian commented Dec 15, 2024

e-p-armstrong commented Dec 16, 2024

bursteratom commented Dec 18, 2024

bursteratom commented Dec 18, 2024

e-p-armstrong commented Dec 18, 2024

bursteratom commented Dec 18, 2024 • edited Loading

bursteratom commented Dec 19, 2024

bursteratom commented Dec 23, 2024

bursteratom commented Dec 18, 2024 •

edited

Loading