
Fine-tuned qwen2.5-7b reported a backward error #480

Closed
chenchen0611 opened this issue Dec 16, 2024 · 1 comment
Comments

@chenchen0611
🐛 Describe the bug

I use the accelerate framework for distributed training of qwen2.5-7b, with DeepSpeed ZeRO-3 as the training mode. Backpropagation raises an error on the 7B model, but no error occurs with the qwen2.5-1.5b and 3b models.
The preliminary diagnosis points to an issue in LigerFusedLinearCrossEntropyLoss: when I set fused_linear_cross_entropy to False and cross_entropy to True in apply_liger_kernel_to_qwen2, training runs normally.
I suspect there is a problem in the gradient computation for the logits inside fused_linear_cross_entropy, but I do not know how to fix it.
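The workaround above can be sketched as follows. This is a minimal sketch, assuming liger-kernel 0.5.x's `apply_liger_kernel_to_qwen2` keyword API; the patch call itself is left commented out since it requires liger-kernel to be installed:

```python
# Workaround sketch: keep Liger's standalone cross-entropy kernel enabled,
# but disable the fused linear + cross-entropy kernel that appears to
# produce the broken backward under ZeRO-3.
liger_kwargs = {
    "cross_entropy": True,                # standalone CE kernel on
    "fused_linear_cross_entropy": False,  # fused kernel off
}

# With liger-kernel installed, patch Qwen2 before constructing the model:
# from liger_kernel.transformers import apply_liger_kernel_to_qwen2
# apply_liger_kernel_to_qwen2(**liger_kwargs)
```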

ds3.yaml

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

Reproduce

```
/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/torch/autograd/graph.py:825: UserWarning: Error detected in torch::autograd::AccumulateGrad. No forward pass information available. Enable detect anomaly during forward pass for more information. (Triggered internally at /opt/conda/conda-bld/pytorch_1729647382455/work/torch/csrc/autograd/python_anomaly_mode.cpp:88.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3489, in <module>
[rank3]:     main()
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3482, in main
[rank3]:     globals = debugger.run(setup['file'], None, None, is_module)
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2510, in run
[rank3]:     return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2517, in _exec
[rank3]:     globals = pydevd_runpy.run_path(file, globals, 'main')
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
[rank3]:     return _run_module_code(code, init_globals, run_name,
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
[rank3]:     _run_code(code, mod_globals, init_globals,
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
[rank3]:     exec(code, run_globals)
[rank3]:   File "/opt/aps/workdir/unsloth_dev/test_multi.py", line 270, in <module>
[rank3]:     run()
[rank3]:   File "/opt/aps/workdir/unsloth_dev/test_multi.py", line 218, in run
[rank3]:     accelerator.backward(loss)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/accelerate/accelerator.py", line 2233, in backward
[rank3]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
[rank3]:     self.engine.backward(loss, **kwargs)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2048, in backward
[rank3]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2263, in backward
[rank3]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank3]:     scaled_loss.backward(retain_graph=retain_graph)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank3]:     torch.autograd.backward(
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank3]:     _engine_run_backward(
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank3]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank3]: RuntimeError: The size of tensor a (0) must match the size of tensor b (3584) at non-singleton dimension 1
```

Versions

Environment Report:

```
Operating System: Linux-5.15.0-119-generic-x86_64-with-glibc2.31
Python version: 3.10.15
Liger Kernel version: 0.5.2
PyTorch version: 2.5.1
CUDA version: 12.4
HIP(ROCm) version: Not available
Triton version: 3.0.0
Transformers version: 4.46.3
XPU version: XPU Not Available
```

@chenchen0611 (Author)

@ByronHsu
