Very High Loss (~15) and Instability with Previously-working Config From A While Ago #2224
I experienced the same problem using the config below:

```yaml
base_model: mistralai/Mistral-Nemo-Instruct-2407
model_type: AutoModelForCausalLM
load_in_8bit:
load_in_4bit: false
strict: false
datasets:
- path: datasets/training_creative_writing_conversation-sharegpt.jsonl
  type: chat_template
  field_messages: conversations
  message_field_role: from
  message_field_content: value
- path: datasets/Claude-Sonnet35-Charcard-Unslop.json
  type: chat_template
  field_messages: conversations
  message_field_role: from
  message_field_content: value
- path: Dampfinchen/Creative_Writing_Multiturn-Balanced-8192
  type: chat_template
  field_messages: conversations
  message_field_role: from
  message_field_content: value
- path: anthracite-org/kalo-opus-instruct-22k-no-refusal
  type: chat_template
  field_messages: conversations
  message_field_role: from
  message_field_content: value
dataset_prepared_path: last_run_prepared
val_set_size: 0.03
output_dir: ./outputs/lora-out
adapter: lora
lora_model_dir:
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
lora_r: 128
lora_alpha: 128
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5
weight_decay: 0.02
max_grad_norm: 1
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: unsloth
gradient_checkpointing_kwargs:
  use_reentrant: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_ratio: 0.05
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 4
debug:
plugins:
- axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true
deepspeed: deepspeed_configs/zero3_bf16.json
special_tokens:
  pad_token: <pad>
```

I tried on a 4x4090 pod and a 1xH100 SXM pod, using the command
@winglian Any thoughts on this? I pinged nano on the Discord about it a while back, but as far as I know it's still broken.
Thanks to both of you for the report. The time difference between the two commits in the original report is very large, so I was trying to run a small repro on my end earlier. @Hasnonname, do you perhaps know which timeframe / commits caused the loss to behave differently for you? Did the same config have a much lower loss earlier?
Please check that this issue hasn't been reported before.
Expected Behavior
Training should proceed roughly as it did many versions ago, without a catastrophic loss graph.
Current behaviour
Wanting to track down whether my poor recent SFT performance was my own fault or the result of underlying software changes, I recently tried to reuse a very old config for finetuning a continued-pretrain Mistral 7B.
The original loss was normal back in the day; that run's loss curve was recorded on 2024-09-25.
However, running that config (with slight changes to keep it from erroring; both versions are provided below) on the latest Docker image, the loss starts at 15.3 and is extremely spiky.
Notes:
- 8x A40 RunPod instance, both times.
- Command used to run: `accelerate launch --use_deepspeed -m axolotl.cli.train ./pathtoyamlfile.yaml`
- Some changes had to be made to the original config so it would not error on the newest axolotl version. Specifically, `deepspeed` had to be changed from zero2 to zero1 due to #2191, and the datasets had to be changed from `type: sharegpt` to manually specified `chat_template` field mappings (see the sketch just below these notes).
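For reference, here is a minimal sketch of those two changes, first the old form and then the new one. The dataset path is hypothetical, and the ShareGPT field names (`conversations`/`from`/`value`) are assumed from the config shown earlier in the thread, so the exact original entries may have differed:

```yaml
# Sketch of the two config edits described above (not the exact original file).

# OLD (as it ran around 2024-09): ZeRO-2 plus the sharegpt dataset type
# deepspeed: deepspeed_configs/zero2.json
# datasets:
# - path: datasets/my-sharegpt-dataset.jsonl   # hypothetical path
#   type: sharegpt

# NEW (required on the current version): ZeRO-1 (see #2191) and chat_template,
# with the ShareGPT field names spelled out manually
deepspeed: deepspeed_configs/zero1.json
datasets:
- path: datasets/my-sharegpt-dataset.jsonl   # hypothetical path
  type: chat_template
  field_messages: conversations
  message_field_role: from
  message_field_content: value
```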
Steps to reproduce
1. Rent an 8x A40 instance on RunPod.
2. Train Mistral 7B base with the given DeepSpeed config and hyperparameters on a sharegpt-format dataset.
3. Observe the spiky loss.
Config yaml
Config 2 (NEW/BROKEN):
Areas that were changed have been indicated with comments.