🐛 Describe the bug
I use the accelerate framework for distributed training of Qwen2.5-7B with DeepSpeed ZeRO-3. Backpropagation raises an error with the 7B model, but no error occurs with the Qwen2.5-1.5B and 3B models.
The preliminary diagnosis points to an issue with LigerFusedLinearCrossEntropyLoss: when I set fused_linear_cross_entropy to False and cross_entropy to True in apply_liger_kernel_to_qwen2, training runs normally (see the sketch below).
I suspect there is an issue with the gradient calculation of the logits within fused_linear_cross_entropy, but I do not know how to fix it.
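For reference, the workaround in code form: a minimal sketch, assuming the patch is applied before the Qwen2 model is loaded and leaving all other kernel flags at their defaults.

from liger_kernel.transformers import apply_liger_kernel_to_qwen2

# Workaround: skip LigerFusedLinearCrossEntropyLoss and fall back to the
# standalone Liger cross-entropy kernel; with this, training runs normally
# under ZeRO-3.
apply_liger_kernel_to_qwen2(
    fused_linear_cross_entropy=False,
    cross_entropy=True,
)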
ds3.yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Reproduce
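The failing script test_multi.py is not included here; the following is a hypothetical minimal sketch of the call path shown in the traceback (accelerator.backward(loss) inside run()). The model name, data, and learning rate are placeholders, not the actual values from the script.

import torch
from accelerate import Accelerator
from liger_kernel.transformers import apply_liger_kernel_to_qwen2
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer


def run():
    # Patch Qwen2 with Liger kernels; fused_linear_cross_entropy defaults to
    # True, which is the configuration that triggers the error below.
    apply_liger_kernel_to_qwen2()

    accelerator = Accelerator()  # launched with the ds3.yaml config above
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # One dummy training example so the loop below is self-contained.
    enc = tokenizer(["hello world"], return_tensors="pt")
    enc["labels"] = enc["input_ids"].clone()
    dataloader = DataLoader([{k: v[0] for k, v in enc.items()}], batch_size=1)

    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for batch in dataloader:
        loss = model(**batch).loss
        accelerator.backward(loss)  # RuntimeError is raised here under ZeRO-3
        optimizer.step()
        optimizer.zero_grad()


if __name__ == "__main__":
    run()

Launching a script of this shape with the ds3.yaml config above produces the following output: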
/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/torch/autograd/graph.py:825: UserWarning: Error detected in torch::autograd::AccumulateGrad. No forward pass information available. Enable detect anomaly during forward pass for more information. (Triggered internally at /opt/conda/conda-bld/pytorch_1729647382455/work/torch/csrc/autograd/python_anomaly_mode.cpp:88.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3489, in
[rank3]: main()
[rank3]: File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3482, in main
[rank3]: globals = debugger.run(setup['file'], None, None, is_module)
[rank3]: File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2510, in run
[rank3]: return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
[rank3]: File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2517, in _exec
[rank3]: globals = pydevd_runpy.run_path(file, globals, 'main')
[rank3]: File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
[rank3]: return _run_module_code(code, init_globals, run_name,
[rank3]: File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
[rank3]: _run_code(code, mod_globals, init_globals,
[rank3]: File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
[rank3]: exec(code, run_globals)
[rank3]: File "/opt/aps/workdir/unsloth_dev/test_multi.py", line 270, in
[rank3]: run()
[rank3]: File "/opt/aps/workdir/unsloth_dev/test_multi.py", line 218, in run
[rank3]: accelerator.backward(loss)
[rank3]: File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/accelerate/accelerator.py", line 2233, in backward
[rank3]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank3]: File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
[rank3]: self.engine.backward(loss, **kwargs)
[rank3]: File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank3]: ret_val = func(*args, **kwargs)
[rank3]: File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2048, in backward
[rank3]: self.optimizer.backward(loss, retain_graph=retain_graph)
[rank3]: File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank3]: ret_val = func(*args, **kwargs)
[rank3]: File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2263, in backward
[rank3]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank3]: File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank3]: scaled_loss.backward(retain_graph=retain_graph)
[rank3]: File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank3]: torch.autograd.backward(
[rank3]: File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/torch/autograd/init.py", line 347, in backward
[rank3]: _engine_run_backward(
[rank3]: File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]: RuntimeError: The size of tensor a (0) must match the size of tensor b (3584) at non-singleton dimension 1
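For what it's worth, 3584 is the hidden size of Qwen2.5-7B (the 1.5B and 3B models use smaller hidden sizes), and a size-0 tensor in AccumulateGrad is what a ZeRO-3-partitioned parameter looks like before it is gathered. This is only a guess, but it would fit the suspicion above that the fused kernel's gradient handling does not play well with ZeRO-3 parameter partitioning.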
Versions
Environment Report:
Operating System: Linux-5.15.0-119-generic-x86_64-with-glibc2.31
Python version: 3.10.15
Liger Kernel version: 0.5.2
PyTorch version: 2.5.1
CUDA version: 12.4
HIP(ROCm) version: Not available
Triton version: 3.0.0
Transformers version: 4.46.3
XPU version: XPU Not Available