
Accelerate v1.2.1 Causes Consistent Errors #2215

Open · williambarberjr opened this issue Dec 23, 2024 · 1 comment
Labels: bug (Something isn't working)

@williambarberjr commented:
Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Fine-tuning Qwen 2.5 14B Instruct runs without errors.

Current behaviour

I was consistently getting errors like the one below across multiple cloud GPU providers (AWS, RunPod, Lambda) when fine-tuning with axolotl v0.6.0, which uses accelerate v1.2.1. Reverting to accelerate v1.1.0 immediately resolved the issue.

Errors typically looked like this:

{'loss': 0.151, 'grad_norm': 0.0081498883664608, 'learning_rate': 4.749498997995992e-06, 'epoch': 0.29}

 10%|▉         | 51/525 [2:12:20<20:11:06, 153.30s/it]W1221 17:13:59.240000 46660 torch/distributed/elastic/agent/server/api.py:704] Received 1 death signal, shutting down workers
W1221 17:13:59.241000 46660 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 46807 closing signal SIGHUP
W1221 17:13:59.242000 46660 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 46808 closing signal SIGHUP
W1221 17:13:59.243000 46660 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 46809 closing signal SIGHUP
W1221 17:13:59.243000 46660 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 46810 closing signal SIGHUP
W1221 17:13:59.244000 46660 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 46811 closing signal SIGHUP
W1221 17:13:59.244000 46660 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 46812 closing signal SIGHUP
W1221 17:13:59.245000 46660 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 46813 closing signal SIGHUP
W1221 17:13:59.246000 46660 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 46814 closing signal SIGHUP
Traceback (most recent call last):
  File "/ephemeral/axolotl/.venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/ephemeral/axolotl/.venv/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/ephemeral/axolotl/.venv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/ephemeral/axolotl/.venv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/ephemeral/axolotl/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/ephemeral/axolotl/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ephemeral/axolotl/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/ephemeral/axolotl/.venv/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/ephemeral/axolotl/.venv/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/ephemeral/axolotl/.venv/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 855, in _invoke_run
    time.sleep(monitor_interval)
  File "/ephemeral/axolotl/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 46660 got signal: 1
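
The workaround was simply pinning the older release (e.g. pip install accelerate==1.1.0). A minimal, purely illustrative pre-launch sanity check, assuming a pip-managed environment, might look like this:

# Illustrative sketch: confirm the installed accelerate version before launching,
# assuming the workaround is to stay on accelerate 1.1.x.
import accelerate

version = accelerate.__version__
if not version.startswith("1.1."):
    raise RuntimeError(
        f"accelerate {version} is installed; downgrade with `pip install accelerate==1.1.0` "
        "to match the version that avoided the SIGHUP shutdowns described above."
    )
print(f"accelerate {version} OK")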

Steps to reproduce

Use my config YAML and any training data on 8xH100 or 8xA100 (a sketch of the expected data format follows the config below).

Config yaml

base_model: arcee-ai/Virtuoso-Small
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: /home/ubuntu/training_data/your_training_data_here.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    roles:
      system:
        - system
      user:
        - human
      assistant:
        - gpt

unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# input_layernorm layers
- model.layers.0.input_layernorm
- model.layers.1.input_layernorm
- model.layers.2.input_layernorm
- model.layers.3.input_layernorm
- model.layers.4.input_layernorm
- model.layers.5.input_layernorm
- model.layers.6.input_layernorm
- model.layers.7.input_layernorm
- model.layers.8.input_layernorm
- model.layers.9.input_layernorm
- model.layers.10.input_layernorm
- model.layers.11.input_layernorm
- model.layers.12.input_layernorm
- model.layers.13.input_layernorm
- model.layers.14.input_layernorm
- model.layers.15.input_layernorm
- model.layers.16.input_layernorm
- model.layers.17.input_layernorm
- model.layers.18.input_layernorm
- model.layers.19.input_layernorm
- model.layers.20.input_layernorm
- model.layers.21.input_layernorm
- model.layers.22.input_layernorm
- model.layers.23.input_layernorm
# lm_head layers
# mlp.down_proj layers
- model.layers.1.mlp.down_proj
- model.layers.38.mlp.down_proj
- model.layers.35.mlp.down_proj
- model.layers.37.mlp.down_proj
- model.layers.36.mlp.down_proj
- model.layers.15.mlp.down_proj
- model.layers.11.mlp.down_proj
- model.layers.12.mlp.down_proj
- model.layers.34.mlp.down_proj
- model.layers.44.mlp.down_proj
- model.layers.45.mlp.down_proj
- model.layers.9.mlp.down_proj
- model.layers.41.mlp.down_proj
- model.layers.33.mlp.down_proj
- model.layers.43.mlp.down_proj
- model.layers.40.mlp.down_proj
- model.layers.13.mlp.down_proj
- model.layers.39.mlp.down_proj
- model.layers.8.mlp.down_proj
- model.layers.10.mlp.down_proj
- model.layers.14.mlp.down_proj
- model.layers.16.mlp.down_proj
- model.layers.31.mlp.down_proj
- model.layers.32.mlp.down_proj
# mlp.gate_proj layers
- model.layers.1.mlp.gate_proj
- model.layers.44.mlp.gate_proj
- model.layers.46.mlp.gate_proj
- model.layers.45.mlp.gate_proj
- model.layers.43.mlp.gate_proj
- model.layers.47.mlp.gate_proj
- model.layers.42.mlp.gate_proj
- model.layers.32.mlp.gate_proj
- model.layers.27.mlp.gate_proj
- model.layers.33.mlp.gate_proj
- model.layers.28.mlp.gate_proj
- model.layers.39.mlp.gate_proj
- model.layers.41.mlp.gate_proj
- model.layers.40.mlp.gate_proj
- model.layers.30.mlp.gate_proj
- model.layers.29.mlp.gate_proj
- model.layers.31.mlp.gate_proj
- model.layers.26.mlp.gate_proj
- model.layers.37.mlp.gate_proj
- model.layers.38.mlp.gate_proj
- model.layers.12.mlp.gate_proj
- model.layers.36.mlp.gate_proj
- model.layers.10.mlp.gate_proj
- model.layers.13.mlp.gate_proj
# mlp.up_proj layers
- model.layers.1.mlp.up_proj
- model.layers.13.mlp.up_proj
- model.layers.11.mlp.up_proj
- model.layers.14.mlp.up_proj
- model.layers.15.mlp.up_proj
- model.layers.12.mlp.up_proj
- model.layers.8.mlp.up_proj
- model.layers.16.mlp.up_proj
- model.layers.9.mlp.up_proj
- model.layers.19.mlp.up_proj
- model.layers.10.mlp.up_proj
- model.layers.7.mlp.up_proj
- model.layers.17.mlp.up_proj
- model.layers.20.mlp.up_proj
- model.layers.21.mlp.up_proj
- model.layers.18.mlp.up_proj
- model.layers.38.mlp.up_proj
- model.layers.37.mlp.up_proj
- model.layers.39.mlp.up_proj
- model.layers.42.mlp.up_proj
- model.layers.41.mlp.up_proj
- model.layers.27.mlp.up_proj
- model.layers.28.mlp.up_proj
- model.layers.34.mlp.up_proj
# model.embed_tokens layers
# model.norm layers
# post_attention_layernorm layers
- model.layers.0.post_attention_layernorm
- model.layers.1.post_attention_layernorm
- model.layers.2.post_attention_layernorm
- model.layers.3.post_attention_layernorm
- model.layers.4.post_attention_layernorm
- model.layers.5.post_attention_layernorm
- model.layers.6.post_attention_layernorm
- model.layers.7.post_attention_layernorm
- model.layers.8.post_attention_layernorm
- model.layers.9.post_attention_layernorm
- model.layers.10.post_attention_layernorm
- model.layers.11.post_attention_layernorm
- model.layers.12.post_attention_layernorm
- model.layers.13.post_attention_layernorm
- model.layers.14.post_attention_layernorm
- model.layers.15.post_attention_layernorm
- model.layers.16.post_attention_layernorm
- model.layers.17.post_attention_layernorm
- model.layers.18.post_attention_layernorm
- model.layers.19.post_attention_layernorm
- model.layers.20.post_attention_layernorm
- model.layers.21.post_attention_layernorm
- model.layers.22.post_attention_layernorm
- model.layers.23.post_attention_layernorm
# self_attn.k_proj layers
- model.layers.47.self_attn.k_proj
- model.layers.39.self_attn.k_proj
- model.layers.41.self_attn.k_proj
- model.layers.37.self_attn.k_proj
- model.layers.35.self_attn.k_proj
- model.layers.44.self_attn.k_proj
- model.layers.38.self_attn.k_proj
- model.layers.14.self_attn.k_proj
- model.layers.7.self_attn.k_proj
- model.layers.12.self_attn.k_proj
- model.layers.11.self_attn.k_proj
- model.layers.32.self_attn.k_proj
- model.layers.10.self_attn.k_proj
- model.layers.8.self_attn.k_proj
- model.layers.9.self_attn.k_proj
- model.layers.6.self_attn.k_proj
- model.layers.45.self_attn.k_proj
- model.layers.42.self_attn.k_proj
- model.layers.5.self_attn.k_proj
- model.layers.40.self_attn.k_proj
- model.layers.33.self_attn.k_proj
- model.layers.0.self_attn.k_proj
- model.layers.34.self_attn.k_proj
- model.layers.13.self_attn.k_proj
# self_attn.o_proj layers
- model.layers.12.self_attn.o_proj
- model.layers.5.self_attn.o_proj
- model.layers.14.self_attn.o_proj
- model.layers.16.self_attn.o_proj
- model.layers.20.self_attn.o_proj
- model.layers.13.self_attn.o_proj
- model.layers.11.self_attn.o_proj
- model.layers.4.self_attn.o_proj
- model.layers.6.self_attn.o_proj
- model.layers.19.self_attn.o_proj
- model.layers.7.self_attn.o_proj
- model.layers.18.self_attn.o_proj
- model.layers.8.self_attn.o_proj
- model.layers.38.self_attn.o_proj
- model.layers.15.self_attn.o_proj
- model.layers.17.self_attn.o_proj
- model.layers.9.self_attn.o_proj
- model.layers.10.self_attn.o_proj
- model.layers.21.self_attn.o_proj
- model.layers.28.self_attn.o_proj
- model.layers.32.self_attn.o_proj
- model.layers.35.self_attn.o_proj
- model.layers.39.self_attn.o_proj
- model.layers.3.self_attn.o_proj
# self_attn.q_proj layers
- model.layers.1.self_attn.q_proj
- model.layers.2.self_attn.q_proj
- model.layers.3.self_attn.q_proj
- model.layers.44.self_attn.q_proj
- model.layers.29.self_attn.q_proj
- model.layers.45.self_attn.q_proj
- model.layers.43.self_attn.q_proj
- model.layers.32.self_attn.q_proj
- model.layers.19.self_attn.q_proj
- model.layers.38.self_attn.q_proj
- model.layers.42.self_attn.q_proj
- model.layers.34.self_attn.q_proj
- model.layers.36.self_attn.q_proj
- model.layers.40.self_attn.q_proj
- model.layers.26.self_attn.q_proj
- model.layers.20.self_attn.q_proj
- model.layers.39.self_attn.q_proj
- model.layers.28.self_attn.q_proj
- model.layers.35.self_attn.q_proj
- model.layers.41.self_attn.q_proj
- model.layers.25.self_attn.q_proj
- model.layers.33.self_attn.q_proj
- model.layers.30.self_attn.q_proj
- model.layers.27.self_attn.q_proj
# self_attn.v_proj layers
- model.layers.0.self_attn.v_proj
- model.layers.7.self_attn.v_proj
- model.layers.39.self_attn.v_proj
- model.layers.31.self_attn.v_proj
- model.layers.15.self_attn.v_proj
- model.layers.10.self_attn.v_proj
- model.layers.32.self_attn.v_proj
- model.layers.41.self_attn.v_proj
- model.layers.6.self_attn.v_proj
- model.layers.33.self_attn.v_proj
- model.layers.42.self_attn.v_proj
- model.layers.29.self_attn.v_proj
- model.layers.14.self_attn.v_proj
- model.layers.9.self_attn.v_proj
- model.layers.35.self_attn.v_proj
- model.layers.38.self_attn.v_proj
- model.layers.13.self_attn.v_proj
- model.layers.30.self_attn.v_proj
- model.layers.5.self_attn.v_proj
- model.layers.34.self_attn.v_proj
- model.layers.28.self_attn.v_proj
- model.layers.37.self_attn.v_proj
- model.layers.27.self_attn.v_proj
- model.layers.11.self_attn.v_proj

dataset_prepared_path:
val_set_size: 0.05
output_dir: ./spectrum/out

sequence_len: 8192
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

gradient_accumulation_steps: 16
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_torch_fused
lr_scheduler: linear
learning_rate: 5e-6

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true

gradient_checkpointing: unsloth
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

seed: 81
warmup_ratio: 0.05
evals_per_epoch: 2
saves_per_epoch: 2
save_total_limit: 10
debug:
deepspeed: /home/ubuntu/axolotl/deepspeed_configs/zero3_bf16.json
weight_decay: 0.05
special_tokens:
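
For reference, the datasets block above expects ShareGPT-style records: a conversations list whose messages carry from/value keys and the roles system, human, and gpt. A minimal, hypothetical example of writing one such JSONL record (the message content here is invented purely for illustration):

# Hypothetical example record matching the chat_template dataset config above.
# Field names (conversations, from, value) and roles (system/human/gpt) come from the config;
# the message content is made up for illustration only.
import json

record = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "Summarize the attached report."},
        {"from": "gpt", "value": "Here is a short summary: ..."},
    ]
}

with open("your_training_data_here.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")

For context, with micro_batch_size: 2, gradient_accumulation_steps: 16, and 8 GPUs, the effective global batch size per optimizer step is 2 × 16 × 8 = 256 (packed) sequences.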

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.12

axolotl branch-commit

main/3742deb

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
williambarberjr added the bug label on Dec 23, 2024

@williambarberjr (Author) commented:

Spoke too soon. I eventually ran into the same error again.

 13%|█▎        | 66/525 [4:46:09<28:32:48, 223.90s/it]W1223 02:30:02.508000 40309 torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGHUP death signal, shutting down workers
W1223 02:30:02.509000 40309 torch/distributed/elastic/multiprocessing/api.py:896] Sending process 40530 closing signal SIGHUP
W1223 02:30:02.510000 40309 torch/distributed/elastic/multiprocessing/api.py:896] Sending process 40531 closing signal SIGHUP
W1223 02:30:02.511000 40309 torch/distributed/elastic/multiprocessing/api.py:896] Sending process 40532 closing signal SIGHUP
W1223 02:30:02.519000 40309 torch/distributed/elastic/multiprocessing/api.py:896] Sending process 40533 closing signal SIGHUP
W1223 02:30:02.520000 40309 torch/distributed/elastic/multiprocessing/api.py:896] Sending process 40534 closing signal SIGHUP
W1223 02:30:02.520000 40309 torch/distributed/elastic/multiprocessing/api.py:896] Sending process 40535 closing signal SIGHUP
W1223 02:30:02.521000 40309 torch/distributed/elastic/multiprocessing/api.py:896] Sending process 40536 closing signal SIGHUP
W1223 02:30:02.521000 40309 torch/distributed/elastic/multiprocessing/api.py:896] Sending process 40537 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/lib/python3/dist-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/usr/lib/python3/dist-packages/torch/distributed/launcher/api.py", line 137, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/lib/python3/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
  File "/usr/lib/python3/dist-packages/torch/distributed/elastic/metrics/api.py", line 136, in wrapper
    result = f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run
    result = self._invoke_run(role)
  File "/usr/lib/python3/dist-packages/torch/distributed/elastic/agent/server/api.py", line 855, in _invoke_run
    time.sleep(monitor_interval)
  File "/usr/lib/python3/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 83, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 40309 got signal: 1
