Very High Loss (~15) and Instability with Previously-working Config From A While Ago #2224

Open
1 task done
e-p-armstrong opened this issue Dec 28, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@e-p-armstrong

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training should proceed roughly as it did many versions ago, without a catastrophic loss graph.

Current behaviour

Wanting to track down whether poor recent SFT performance is my fault or the result of underlying software changes, I recently tried to use a very old config to finetune a continued-pretrain Mistral 7b.

The original loss was normal back in the day, and looked like this:
image

This original loss was recorded on 2024-09-25.

However, running the same config (with slight changes so that it does not error; both versions are provided below) on the latest docker image gives:
image

It starts at a loss of 15.3 and is spiky as all hell.
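
One quick sanity check on a starting loss like this is to compare it with the cross-entropy of a model that guesses uniformly over the vocabulary, which is ln(V). The sketch below is not from the original report. It assumes a vocabulary of 32,000 tokens (the stock Mistral 7B tokenizer size); the redacted custom base model may use a different tokenizer, so treat the number as illustrative only.

```python
import math


def uniform_baseline_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform guess over `vocab_size` tokens."""
    return math.log(vocab_size)


# Assumption: 32_000 is the stock Mistral 7B vocabulary size; the custom
# base model in this report is redacted and may use a different tokenizer.
ASSUMED_VOCAB_SIZE = 32_000

baseline = uniform_baseline_loss(ASSUMED_VOCAB_SIZE)
reported_start_loss = 15.3  # starting loss reported above

print(f"uniform baseline: {baseline:.2f}")
print(f"reported start:   {reported_start_loss:.2f}")
print("above uniform baseline:", reported_start_loss > baseline)
```

If that assumption holds, a starting loss well above ln(V) would mean the model is confidently predicting wrong tokens rather than merely guessing, which could point toward a tokenizer, chat-template, or label-masking mismatch rather than ordinary optimization noise.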

Notes:
8x A40 Runpod instance, both times
command used to run: accelerate launch --use_deepspeed -m axolotl.cli.train ./pathtoyamlfile.yaml
Some changes had to be made to the original config so that it would not error on the newest axolotl version. Specifically, deepspeed had to be changed from zero2 to zero1 because of #2191, and the datasets had to be changed from type: sharegpt to being manually specified.

Steps to reproduce

Rent an 8x A40 instance on Runpod
Train Mistral 7b base with the given deepspeed config and hyperparams on a sharegpt dataset
Observe the spiky loss

Config yaml

I had to redact some details due to NDAs, but here are BOTH the original and the new config.yamls. The original one is first, and is what produced the normal graph. The broken one for new axolotl is second. Note that runs with is_mistral_derived_model set to both true and false were tried; both failed with the same high loss values.

Config 1 (original, but reliant on deprecated settings):

base_model: Heralax/redacted-custom-7b-mistral-base
tokenizer_type: AutoTokenizer
is_mistral_derived_model: true
load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: json
    data_files: redacted_sharegpt_chatml_dataset.jsonl
    ds_type: json
    type: sharegpt
    conversation: chatml
  # ...quite a few more datasets were used here but are redacted in this shared version

dataset_prepared_path: 1b_run_prepared
output_dir: ./1b_out_pretrained_base

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true
shuffle_merged_datasets: true

wandb_project: 
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 6
micro_batch_size: 2
eval_batch_size: 1
num_epochs: 4
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.000020
weight_decay: 0
# Gradient clipping max norm
max_grad_norm: 1.0
noisy_embedding_alpha: 5
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: unsloth
early_stopping_patience:
resume_from_checkpoint: 
logging_steps: 1
xformers_attention:
flash_attention: true

chat_template: chatml

warmup_ratio: 0.5
auto_resume_from_checkpoints: false
#warmup_ratio: 0.5
eval_steps: 10
saves_per_epoch: 1
eval_sample_packing: false
save_total_limit: 2
debug:
deepspeed: deepspeed_configs/zero2.json
special_tokens:
  pad_token: "<|end_of_text|>"

Config 2 (NEW/BROKEN):

base_model: Heralax/redacted-custom-7b-mistral-base
tokenizer_type: AutoTokenizer
is_mistral_derived_model: true
load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: json
    data_files: redacted_sharegpt_chatml_dataset.jsonl
    ds_type: json
    type: sharegpt
    conversation: chatml
  # ...quite a few more datasets were used here but are redacted in this shared version

dataset_prepared_path: 1b_run_prepared
output_dir: ./1b_out_pretrained_base

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true
shuffle_merged_datasets: true

wandb_project: 
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 6
micro_batch_size: 2
eval_batch_size: 1
num_epochs: 4
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.000020
weight_decay: 0
# Gradient clipping max norm
max_grad_norm: 1.0
noisy_embedding_alpha: 5
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: unsloth
early_stopping_patience:
resume_from_checkpoint: 
logging_steps: 1
xformers_attention:
flash_attention: true

chat_template: chatml

warmup_ratio: 0.5
auto_resume_from_checkpoints: false
#warmup_ratio: 0.5
eval_steps: 10
saves_per_epoch: 1
eval_sample_packing: false
save_total_limit: 2
debug:
deepspeed: deepspeed_configs/zero2.json
special_tokens:
  pad_token: "<|end_of_text|>"

Areas that were changed have been indicated with configs.



### Possible solution

Since only two things changed between the old and new configs, if this is user error it is either in the datasets section or something deepspeed-related. If it is axolotl's fault (or, more likely, something upstream), then the regression was introduced between 2024-09-25 and now.
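
To confirm that only the intended keys differ between two configs, the parsed files can be compared key by key. The following is a minimal sketch, not part of axolotl: `diff_configs` is a hypothetical helper, and the two small dicts are made-up stand-ins for the parsed YAML (the `zero1.json` filename is an assumption based on the zero2 to zero1 change described above).

```python
from typing import Any

_MISSING = object()  # sentinel for keys present in only one config


def diff_configs(
    old: dict[str, Any], new: dict[str, Any], prefix: str = ""
) -> dict[str, tuple[Any, Any]]:
    """Return {dotted_key: (old_value, new_value)} for every differing entry."""
    changes: dict[str, tuple[Any, Any]] = {}
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}{key}"
        a, b = old.get(key, _MISSING), new.get(key, _MISSING)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse into nested mappings such as special_tokens.
            changes.update(diff_configs(a, b, prefix=f"{path}."))
        elif a != b:
            # Lists (e.g. datasets) are compared as whole values.
            changes[path] = (
                "<missing>" if a is _MISSING else a,
                "<missing>" if b is _MISSING else b,
            )
    return changes


# Illustrative stand-ins; in practice each dict would come from something
# like yaml.safe_load(open(path)) (requires PyYAML).
old_cfg = {
    "deepspeed": "deepspeed_configs/zero2.json",
    "learning_rate": 2e-5,
    "special_tokens": {"pad_token": "<|end_of_text|>"},
}
new_cfg = {
    "deepspeed": "deepspeed_configs/zero1.json",  # assumed filename
    "learning_rate": 2e-5,
    "special_tokens": {"pad_token": "<|end_of_text|>"},
}

changes = diff_configs(old_cfg, new_cfg)
for key, (before, after) in changes.items():
    print(f"{key}: {before!r} -> {after!r}")
```

Any key that shows up here beyond the two intended changes would be worth ruling out before suspecting an upstream regression.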

### Which Operating Systems are you using?

- [X] Linux
- [ ] macOS
- [ ] Windows

### Python Version

3.11.10

### axolotl branch-commit

latest docker

### Acknowledgements

- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
@e-p-armstrong e-p-armstrong added the bug Something isn't working label Dec 28, 2024
@Hasnonname

I experienced the same problem using the axolotlai/axolotl-cloud:main-latest image on Runpod with the following config:

base_model: mistralai/Mistral-Nemo-Instruct-2407
model_type: AutoModelForCausalLM

load_in_8bit:
load_in_4bit: false
strict: false

datasets:
  - path: datasets/training_creative_writing_conversation-sharegpt.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
  - path: datasets/Claude-Sonnet35-Charcard-Unslop.json
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
  - path: Dampfinchen/Creative_Writing_Multiturn-Balanced-8192
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
  - path: anthracite-org/kalo-opus-instruct-22k-no-refusal
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value

dataset_prepared_path: last_run_prepared
val_set_size: 0.03
output_dir: ./outputs/lora-out

adapter: lora
lora_model_dir:

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

lora_r: 128
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5

weight_decay: 0.02
max_grad_norm: 1

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: unsloth
gradient_checkpointing_kwargs:
  use_reentrant: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_ratio: 0.05
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128

saves_per_epoch: 4

debug:

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true

deepspeed: deepspeed_configs/zero3_bf16.json

special_tokens:
  pad_token: <pad>

I tried on a 4x4090 pod and a 1xH100 SXM pod, using the command accelerate launch -m axolotl.cli.train config.yml on both. I used the above config without changes on the 4x4090 pod, and the only changes I made on the 1xH100 pod were removing deepspeed and setting micro_batch_size to 4. Both had high training losses that started around 11-12 and fluctuated wildly:

image
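
For comparing these runs, it may help to work out how many sequences feed each optimizer step. This is a rough sketch, assuming plain data parallelism across GPUs; `effective_batch_size` is a hypothetical helper, and the inputs are the micro_batch_size, gradient_accumulation_steps, and GPU counts quoted in this thread.

```python
def effective_batch_size(
    micro_batch_size: int, grad_accum_steps: int, num_gpus: int
) -> int:
    """Sequences contributing to one optimizer step under data parallelism."""
    return micro_batch_size * grad_accum_steps * num_gpus


# Values taken from the configs and hardware described in this thread.
runs = {
    "8x A40 (original report)": effective_batch_size(2, 6, 8),
    "4x 4090": effective_batch_size(1, 8, 4),
    "1x H100": effective_batch_size(4, 8, 1),
}

for name, size in runs.items():
    print(f"{name}: {size} sequences per optimizer step")
```

Under that assumption, both pods in this comment work out to 32 sequences per optimizer step (each a packed sequence of up to 8192 tokens, since sample_packing is on), so the micro_batch_size change by itself should not make the two runs behave differently.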

@e-p-armstrong
Author

@winglian Any thoughts on this? I pinged nano on the discord about it a while back, but afaik it's still broken.

@NanoCode012
Collaborator

Thanks to both of you for the report. The time difference between the two commits in the original report is very large, so I tried running a mini repro on my end earlier.

@Hasnonname, do you perhaps know which timeframe / commits caused the loss to behave differently for you? Did the same config have much lower loss earlier?
