max_grad_norm doesn't appear to be clipping gradients #2214

Open
DevonPeroutky opened this issue Dec 22, 2024 · 0 comments
Labels
bug Something isn't working

DevonPeroutky commented Dec 22, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I'm fine-tuning on a single GPU (no DeepSpeed configuration). I set max_grad_norm: 1, so I would expect the total gradient norm to be clipped at 1 and never exceed it.
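
For reference, this is the clipping rule I understand max_grad_norm to map to (mirroring `torch.nn.utils.clip_grad_norm_`); a minimal sketch of the expected semantics, not axolotl's actual implementation:

```python
import torch

def clip_total_norm(grads: list[torch.Tensor], max_norm: float) -> torch.Tensor:
    """Reference L2 gradient-norm clipping: scale every gradient by
    min(1, max_norm / total_norm) so the post-clip total norm is <= max_norm."""
    total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    scale = min(1.0, max_norm / (float(total_norm) + 1e-6))
    for g in grads:
        g.mul_(scale)
    return total_norm  # note: the *pre-clip* norm, which is also what clip_grad_norm_ returns
```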

Current behaviour

I'm seeing large gradient-norm spikes (well above 1) reported in Weights & Biases.

[Screenshot (2024-12-22): Weights & Biases chart showing gradient-norm spikes]

Steps to reproduce

Train a QLoRA with axolotl with max_grad_norm: 1 and watch the reported gradient norms exceed 1.
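
With the config below saved to a file, this is just a standard training run (e.g. `accelerate launch -m axolotl.cli.train <config>.yaml`, path placeholder) while watching the grad_norm series in Weights & Biases.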

Config yaml

```yaml
# -----------------------------------
# ---- Base Model Configuration -----
# -----------------------------------
base_model: meta-llama/Meta-Llama-3.1-8B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false
chat_template: llama3

# -----------------------------------
# ------------ Dataset -------------
# -----------------------------------
datasets:
  # - path: databricks/databricks-dolly-15k
  - path: /home/ubuntu/kindo-base/notebooks/truncated_dolly_15K
    ds_type: json
    type:
      system_prompt: ""
      field_system: system
      field_instruction: instruction
      field_input: context
      field_output: response
      format: "[INST] {instruction} {input} [/INST]"
      no_input_format: "[INST] {instruction} [/INST]"
    train_split: train
dataset_prepared_path: last_run_prepared

# How much to hold out for validation across all datasets
val_set_size: .05
output_dir: ./outputs/qlora-out

# -----------------------------------
# ----------- Lora Config -----------
# -----------------------------------
adapter: qlora
lora_model_dir:

lora_r: 128
lora_alpha: 32 # alpha = r/4 is in the qlora paper
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

lora_modules_to_save:
  - embed_tokens
  - lm_head

# -----------------------------------
# ------- Training parameters -------
# -----------------------------------
sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true
gradient_accumulation_steps: 8
micro_batch_size: 8 
num_epochs: 2

optimizer: paged_adamw_8bit
max_grad_norm: 1.0

# Learning Rate
lr_scheduler: cosine
learning_rate: 0.0004
warmup_ratio: 0.05

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

evals_per_epoch: 3
eval_batch_size: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|end_of_text|>"
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"

# -----------------------------------
# ------- Liger Integration ---------
# -----------------------------------
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true
```

Possible solution

  1. I'm wondering if the gradient-norm metric reported to Weights & Biases is computed before clipping is applied? (See the sketch after this list.)
  2. The gradients are really, really high. Is it possible that the way I'm mapping the dataset into the expected Llama format is wrong?
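
On point 1: `torch.nn.utils.clip_grad_norm_` scales the gradients in place but returns the total norm measured before clipping, so if the logged grad_norm is simply that return value, readings above 1 would not necessarily mean clipping failed. A standalone toy sketch (not axolotl code) showing the difference:

```python
import torch

# Toy model with deliberately huge gradients.
model = torch.nn.Linear(10, 10)
(model(torch.randn(4, 10)).sum() * 1e6).backward()

# Clips in place so the total norm is <= 1.0, but RETURNS the pre-clip norm.
pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

post_clip_norm = torch.norm(
    torch.stack([p.grad.norm(2) for p in model.parameters() if p.grad is not None]), 2
)

print(f"returned (pre-clip) norm: {float(pre_clip_norm):.2f}")  # can be >> 1
print(f"actual post-clip norm:    {float(post_clip_norm):.2f}")  # <= 1.0
```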

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10.13

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
DevonPeroutky added the bug label on Dec 22, 2024
DevonPeroutky changed the title from "max_grad_norm doesn't appear to be respected." to "max_grad_norm doesn't appear to be clipping gradients" on Dec 22, 2024