
Gemma 2 + unsloth + fa2 full SFT RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn #5435

Closed
1 task done
hengdos opened this issue Sep 13, 2024 · 0 comments
Labels
wontfix This will not be worked on

Comments

hengdos commented Sep 13, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

```text
[2024-09-13 18:40:14,881] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: BNB_CUDA_VERSION=121 environment variable detected; loading libbitsandbytes_cuda121.so.
This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64
```

  • llamafactory version: 0.8.4.dev0
  • Platform: Linux-5.15.0-67-generic-x86_64-with-glibc2.31
  • Python version: 3.10.14
  • PyTorch version: 2.4.0 (GPU)
  • Transformers version: 4.44.2
  • Datasets version: 2.21.0
  • Accelerate version: 0.33.0
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800 80GB PCIe
  • DeepSpeed version: 0.15.1
  • Bitsandbytes version: 0.43.3

Reproduction

Config file:

```yaml
# gemma2_full_sft.yaml
### model
model_name_or_path: /root/.cache/modelscope/hub/LLM-Research/gemma-2-9b-it

### method
stage: sft
do_train: true
finetuning_type: full
use_unsloth: true
flash_attn: fa2

### dataset
dataset: identity
template: gemma
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/gemma-2-9b-it/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```

Run training:

```bash
llamafactory-cli train examples/train_full/gemma2_full_sft.yaml
```
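A quick sanity check one can run before the full job (a minimal sketch; `count_trainable` and the toy module are illustrative stand-ins, not part of LLaMA-Factory): the number it reports is exactly the figure the Unsloth banner prints in the output below.

```python
from torch import nn

def count_trainable(model: nn.Module) -> int:
    """Count parameters that will actually receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy stand-in for the loaded Gemma 2 model: full SFT should report
# billions of trainable parameters, yet the log below shows 0.
toy = nn.Linear(8, 8)
print(count_trainable(toy))  # 72 (8*8 weights + 8 biases)
```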

Output:

```text
[INFO|trainer.py:648] 2024-09-13 18:34:56,525 >> Using auto half precision backend
[WARNING|<string>:213] 2024-09-13 18:34:56,905 >> ==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 81 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 2
\        /    Total batch size = 2 | Total steps = 120
 "-____-"     Number of trainable parameters = 0
  0%|                                                                                                                  | 0/120 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/root/miniconda3/envs/unsloth_env/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/root/LLaMA-Factory/src/llamafactory/cli.py", line 111, in main
    run_exp()
  File "/root/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/root/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 96, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/root/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "<string>", line 363, in _fast_inner_training_loop
  File "/root/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/transformers/trainer.py", line 3349, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/root/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/accelerate/accelerator.py", line 2159, in backward
    loss.backward(**kwargs)
  File "/root/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
    torch.autograd.backward(
  File "/root/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward
    _engine_run_backward(
  File "/root/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
  0%|                                                                                                                  | 0/120 [00:27<?, ?it/s]
```
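The banner above is the telling line: `Number of trainable parameters = 0`. If every parameter is frozen, the loss is built entirely from tensors with `requires_grad=False`, so it carries no `grad_fn`, and `loss.backward()` raises exactly this error. A minimal standalone reproduction (illustrative only, independent of LLaMA-Factory):

```python
import torch

# Every "parameter" frozen: the loss has no autograd history at all.
w = torch.randn(4)          # requires_grad defaults to False
loss = (w * 2.0).sum()
loss.backward()
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

If that diagnosis is right, the suspect combination is `use_unsloth: true` with `finetuning_type: full`: at the time of this report Unsloth's fast path targeted (Q)LoRA adapters, and under full finetuning it appears to leave nothing trainable. A plausible (unconfirmed here) workaround is `use_unsloth: false` for full SFT, or keeping Unsloth and switching to `finetuning_type: lora`.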

Expected behavior

No response

Others

No response

@github-actions bot added the pending (This problem is yet to be addressed) label Sep 13, 2024
@hiyouga added the wontfix (This will not be worked on) label and removed the pending label Nov 2, 2024
@hiyouga closed this as not planned Nov 2, 2024