
Fine-tuned qwen2.5-7b reported a backward error #480

Closed
chenchen0611 opened this issue Dec 16, 2024 · 1 comment
Comments

@chenchen0611
🐛 Describe the bug

I use the accelerate framework for distributed training of qwen2.5-7b, with DeepSpeed ZeRO-3 as the training mode. Backpropagation raises an error on the 7B model, but no error occurs with the qwen2.5-1.5b and 3b models.
The preliminary diagnosis points to an issue in LigerFusedLinearCrossEntropyLoss: when I set fused_linear_cross_entropy to False and cross_entropy to True in apply_liger_kernel_to_qwen2, training runs normally.
I suspect there is a problem in the gradient computation for the logits inside fused_linear_cross_entropy, but I do not know how to fix it.
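The workaround above can be sketched as follows. This is a minimal sketch, assuming liger-kernel 0.5.x's `apply_liger_kernel_to_qwen2` keyword API; the patch call itself is left commented out since it requires liger-kernel to be installed:

```python
# Workaround sketch: keep Liger's standalone cross-entropy kernel enabled,
# but disable the fused linear + cross-entropy kernel that appears to
# produce the broken backward under ZeRO-3.
liger_kwargs = {
    "cross_entropy": True,                # standalone CE kernel on
    "fused_linear_cross_entropy": False,  # fused kernel off
}

# With liger-kernel installed, patch Qwen2 before constructing the model:
# from liger_kernel.transformers import apply_liger_kernel_to_qwen2
# apply_liger_kernel_to_qwen2(**liger_kwargs)
```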

ds3.yaml

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

Reproduce

```
/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/torch/autograd/graph.py:825: UserWarning: Error detected in torch::autograd::AccumulateGrad. No forward pass information available. Enable detect anomaly during forward pass for more information. (Triggered internally at /opt/conda/conda-bld/pytorch_1729647382455/work/torch/csrc/autograd/python_anomaly_mode.cpp:88.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3489, in <module>
[rank3]:     main()
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3482, in main
[rank3]:     globals = debugger.run(setup['file'], None, None, is_module)
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2510, in run
[rank3]:     return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2517, in _exec
[rank3]:     globals = pydevd_runpy.run_path(file, globals, 'main')
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
[rank3]:     return _run_module_code(code, init_globals, run_name,
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
[rank3]:     _run_code(code, mod_globals, init_globals,
[rank3]:   File "/home/aps/.local/share/code-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
[rank3]:     exec(code, run_globals)
[rank3]:   File "/opt/aps/workdir/unsloth_dev/test_multi.py", line 270, in <module>
[rank3]:     run()
[rank3]:   File "/opt/aps/workdir/unsloth_dev/test_multi.py", line 218, in run
[rank3]:     accelerator.backward(loss)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/accelerate/accelerator.py", line 2233, in backward
[rank3]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
[rank3]:     self.engine.backward(loss, **kwargs)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2048, in backward
[rank3]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2263, in backward
[rank3]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank3]:     scaled_loss.backward(retain_graph=retain_graph)
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/torch/_tensor.py", line 581, in backward
[rank3]:     torch.autograd.backward(
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank3]:     _engine_run_backward(
[rank3]:   File "/opt/aps/workdir/anaconda3/envs/unsloth_dev/lib/python3.10/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank3]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank3]: RuntimeError: The size of tensor a (0) must match the size of tensor b (3584) at non-singleton dimension 1
```

Versions

Environment Report:

```
Operating System: Linux-5.15.0-119-generic-x86_64-with-glibc2.31
Python version: 3.10.15
Liger Kernel version: 0.5.2
PyTorch version: 2.5.1
CUDA version: 12.4
HIP(ROCm) version: Not available
Triton version: 3.0.0
Transformers version: 4.46.3
XPU version: XPU Not Available
```

@chenchen0611 (Author)

@ByronHsu
