[QA] Nan grad norm occurs midway through training, please check it #375

Open
wang-benqiang opened this issue Nov 24, 2024 · 0 comments
Labels
question Further information is requested

@wang-benqiang

Describe the issue

After six-hundred-odd training steps, the run reports "Nan grad norm occurs, please check it". How should I adjust things? The error log and the training config are below.

Error log:

2024-11-24 02:35:28,531 INFO pipeline.py:770 in record_current_batch_training_metrics -- tflops=229.64161376788178 step=642 loss=2.1725409030914307 real_tgs=719.84 tgs (tokens/gpu/second)=720.46 tgs/last_tgs_1=720.46 tgs/tgs_all=716.63 tgs/tgs_avg=722.47 tgs/tgs_SMA=722.84 tgs/last_tgs_10=723.59 tgs/last_tgs_50=722.37 lr=9.808683347597691e-05 loss_scale=16384.0 grad_norm={'0_default': 2.156615516531637, '1_fp32': 0.0} micro_num=16 num_consumed_tokens=674234368 inf_nan_skip_batches=3 num_samples_in_batch=102 largest_length=7543 largest_batch=11 smallest_batch=2 adam_beta2=0.95 fwd_bwd_time=181.49 bwd_time=120.98 acc=0.5494 perplexity=9.0888 acc/zh=0.5494 tokens/zh=1047661 loss_from_metric=2.207 loss/zh=2.207
2024-11-24 02:38:29,509 INFO pipeline.py:770 in record_current_batch_training_metrics -- tflops=230.8624288772513 step=643 loss=2.046380043029785 real_tgs=723.69 tgs (tokens/gpu/second)=724.29 tgs/last_tgs_1=724.29 tgs/tgs_all=716.64 tgs/tgs_avg=722.47 tgs/tgs_SMA=722.86 tgs/last_tgs_10=723.59 tgs/last_tgs_50=722.37 lr=9.8080255311898e-05 loss_scale=16384.0 grad_norm={'0_default': 1.7590333217221075, '1_fp32': 0.0} micro_num=16 num_consumed_tokens=675282944 inf_nan_skip_batches=3 num_samples_in_batch=108 largest_length=6803 largest_batch=10 smallest_batch=2 adam_beta2=0.95 fwd_bwd_time=180.53 bwd_time=120.24 acc=0.5622 perplexity=8.3492 acc/zh=0.5622 tokens/zh=1047699 loss_from_metric=2.1222 loss/zh=2.1222
2024-11-24 02:41:32,346 INFO pipeline.py:770 in record_current_batch_training_metrics -- tflops=228.51365321663746 step=644 loss=2.3020739555358887 real_tgs=716.29 tgs (tokens/gpu/second)=716.93 tgs/last_tgs_1=716.93 tgs/tgs_all=716.64 tgs/tgs_avg=722.46 tgs/tgs_SMA=722.75 tgs/last_tgs_10=723.59 tgs/last_tgs_50=722.37 lr=9.807366610463195e-05 loss_scale=16384.0 grad_norm={'0_default': 1.1157139483556522, '1_fp32': 0.0} micro_num=16 num_consumed_tokens=676331520 inf_nan_skip_batches=3 num_samples_in_batch=78 largest_length=7693 largest_batch=9 smallest_batch=2 adam_beta2=0.95 fwd_bwd_time=182.39 bwd_time=121.65 acc=0.5478 perplexity=9.0845 acc/zh=0.5478 tokens/zh=1047634 loss_from_metric=2.2066 loss/zh=2.2066
2024-11-24 02:44:33,969 INFO pipeline.py:770 in record_current_batch_training_metrics -- tflops=230.04208104683008 step=645 loss=2.0069305896759033 real_tgs=721.13 tgs (tokens/gpu/second)=721.72 tgs/last_tgs_1=721.72 tgs/tgs_all=716.65 tgs/tgs_avg=722.46 tgs/tgs_SMA=722.74 tgs/last_tgs_10=723.59 tgs/last_tgs_50=722.37 lr=9.806706585586778e-05 loss_scale=16384.0 grad_norm={'0_default': 1.1654005105502763, '1_fp32': 0.0} micro_num=16 num_consumed_tokens=677380096 inf_nan_skip_batches=3 num_samples_in_batch=88 largest_length=7649 largest_batch=10 smallest_batch=2 adam_beta2=0.95 fwd_bwd_time=181.17 bwd_time=120.76 acc=0.5658 perplexity=8.2723 acc/zh=0.5658 tokens/zh=1047719 loss_from_metric=2.1129 loss/zh=2.1129
2024-11-24 02:47:35,188 INFO pipeline.py:770 in record_current_batch_training_metrics -- tflops=230.55453843330315 step=646 loss=1.931685209274292 real_tgs=722.71 tgs (tokens/gpu/second)=723.33 tgs/last_tgs_1=723.33 tgs/tgs_all=716.66 tgs/tgs_avg=722.46 tgs/tgs_SMA=722.73 tgs/last_tgs_10=723.59 tgs/last_tgs_50=722.37 lr=9.806045456729744e-05 loss_scale=16384.0 grad_norm={'0_default': 1.6949513837430747, '1_fp32': 0.0} micro_num=16 num_consumed_tokens=678428672 inf_nan_skip_batches=3 num_samples_in_batch=87 largest_length=5648 largest_batch=10 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=180.77 bwd_time=120.48 acc=0.5596 perplexity=8.4299 acc/zh=0.5596 tokens/zh=1047665 loss_from_metric=2.1318 loss/zh=2.1318
2024-11-24 02:50:36,860 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 02:53:37,825 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 02:56:38,023 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 02:59:38,265 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 03:02:38,901 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 03:05:40,250 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 03:08:41,350 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 03:11:41,621 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 03:14:42,524 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 03:17:42,189 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 03:20:42,677 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 03:23:43,389 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 03:26:44,261 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 03:29:45,623 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
2024-11-24 03:32:45,724 WARNING hybrid_zero_optim.py:791 in _step -- Nan grad norm occurs, please check it.
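
Up to step 646 the loss, grad_norm, and loss_scale in the log all look normal before the NaN warnings start, so I mainly want to find which parameter's gradient goes non-finite first. Below is a minimal sketch of the kind of check I could add with plain PyTorch hooks; it is outside InternEvo's hybrid_zero_optim, and the model/hook wiring shown is only illustrative, not the trainer's actual code.

import torch


def register_nan_grad_hooks(model: torch.nn.Module) -> None:
    """Report which parameter first receives a non-finite (NaN/Inf) gradient."""

    def make_hook(name: str):
        def hook(grad: torch.Tensor):
            if not torch.isfinite(grad).all():
                # Report instead of raising, so a skip-batch mechanism can still run.
                print(f"[nan-check] non-finite grad in {name}: "
                      f"min={grad.min().item()}, max={grad.max().item()}")
            return grad
        return hook

    for name, param in model.named_parameters():
        if param.requires_grad:
            param.register_hook(make_hook(name))


# Toy usage only, to show the call pattern (not the real 32B model):
if __name__ == "__main__":
    toy = torch.nn.Linear(8, 8)
    register_nan_grad_hooks(toy)
    loss = toy(torch.full((2, 8), float("inf"))).sum()  # force a non-finite path
    loss.backward()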

config:

JOB_NAME = "32b_qwen2_train"
model_type = "QWEN2"
DO_ALERT = False
VOCAB_SIZE = 152064
SEQ_LEN = 8192
HIDDEN_SIZE = 5120
NUM_ATTENTION_HEAD = 40
NUM_KV_ATTENTION_HEAD = 8
# MLP_RATIO = 5.25
# MLP_RATIO = 3.4
MLP_RATIO = 5.4
NUM_LAYER = 64
MODEL_ONLY_FOLDER = "/data"
# Ckpt folder format:
# fs: 'local:/mnt/nfs/XXX'
SAVE_CKPT_FOLDER = "local:./save_model/pretrain"
# boto3 Ckpt folder format:
# import os
# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
CHECKPOINT_EVERY = 323
ckpt = dict(
    enable_save_ckpt=True,  # enable ckpt save.
    enable_internevo2hf_ckpt=True, # enable ckpt save for huggingface format.
    save_ckpt_folder=SAVE_CKPT_FOLDER,  # Path to save training ckpt.
    auto_resume=True,
    checkpoint_every=CHECKPOINT_EVERY,
    async_upload=True,  # async ckpt upload. (only works for boto3 ckpt)
    async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/",  # path for temporary files during asynchronous upload.
    oss_snapshot_freq=int(CHECKPOINT_EVERY / 2),  # snapshot ckpt save frequency.
)
TRAIN_FOLDER = "/root/data"
VALID_FOLDER = None # "/path/to/dataset"
data = dict(
    seq_len=SEQ_LEN,
    # micro_num means the number of micro-batches contained in one gradient update
    micro_num=16,
    # packed_length = micro_bsz * SEQ_LEN
    micro_bsz=1,
    # defaults to the value of micro_num
    valid_micro_num=1,
    # defaults to 0, which means evaluation is disabled
    valid_every=0,
    pack_sample_into_one=False,
    total_steps=6267,
    skip_batches="630-700",
    # rampup_batch_size (str): A string with three space-separated integers representing the
    #       starting batch size, the increment, and the number of steps between
    #       each increment. For example, "192 24 8" means that the batch size (micro_num)
    #       starts at 192 and increases by 24 every 8 steps. Defaults to None.
    #       (IMPORTANT): The interval step size is 'micro_bsz'.
    rampup_batch_size="",
    # Datasets with fewer than 50 rows will be discarded
    min_length=50,
    train_folder=TRAIN_FOLDER,
    valid_folder=VALID_FOLDER,
    empty_cache_and_diag_interval=0,
    diag_outlier_ratio=1.1,
)
grad_scaler = dict(
    fp16=dict(
        # the initial loss scale, defaults to 2**16
        initial_scale=2**16,
        # the minimum loss scale, defaults to None
        min_scale=1,
        # the number of steps to increase loss scale when no overflow occurs
        growth_interval=1000,
    ),
    # the multiplication factor for increasing loss scale, defaults to 2
    growth_factor=2,
    # the multiplication factor for decreasing loss scale, defaults to 0.5
    backoff_factor=0.5,
    # the maximum loss scale, defaults to None
    max_scale=2**24,
    # the number of overflows before decreasing loss scale, defaults to 2
    hysteresis=2,
)
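
# Worked example of the dynamic loss scaling configured above, based on my reading
# of the comments (assuming standard dynamic-scaling semantics): the scale starts at
# initial_scale=2**16=65536, is multiplied by growth_factor=2 after every
# growth_interval=1000 overflow-free steps (capped at max_scale=2**24), and is
# multiplied by backoff_factor=0.5 once hysteresis=2 overflows accumulate, never
# dropping below min_scale=1. The log above shows loss_scale=16384=2**14, i.e. the
# scale had already been halved twice from 2**16 before the NaN warnings started.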

hybrid_zero_optimizer = dict(
    # Enable low_level_optimizer overlap_communication
    overlap_sync_grad=True,
    overlap_sync_param=False,
    # bucket size for nccl communication params
    reduce_bucket_size=512 * 1024 * 1024,
    # grad clipping
    clip_grad_norm=1.0,
)

loss = dict(
    label_smoothing=0,
)

adam = dict(
    lr=1e-4,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_beta2_c=0,
    adam_eps=1e-8,
    weight_decay=0.01,
)

lr_scheduler = dict(
    total_steps=data["total_steps"],
    init_steps=0,  # optimizer_warmup_step
    warmup_ratio=0.01,
    eta_min=1e-5,
    last_epoch=-1,
)

beta2_scheduler = dict(
    init_beta2=adam["adam_beta2"],
    c=adam["adam_beta2_c"],
    cur_iter=-1,
)
use_fp32_norm = False
model = dict(
    checkpoint=True,
    num_chunks=1,
    num_attention_heads=NUM_ATTENTION_HEAD,
    num_kv_attention_heads=NUM_KV_ATTENTION_HEAD,
    embed_split_hidden=True,
    vocab_size=VOCAB_SIZE,
    embed_grad_scale=1,
    parallel_output=True,
    hidden_size=HIDDEN_SIZE,
    num_layers=NUM_LAYER,
    qkv_bias=True,
    o_bias=False,
    mlp_ratio=MLP_RATIO,
    apply_post_layer_norm=False,
    dtype="torch.float16",
    norm_type="rmsnorm",
    layer_norm_epsilon=1e-6,
    use_flash_attn=True,
    # Whether the odd and even columns of the query and key are interleaved in the usual order.
    # If True, the odd and even columns keep their normal interleaved ordering; if False,
    # all odd columns have been concatenated ahead of all even columns in advance,
    # to improve the computational efficiency of RoPE.
    # Example:
    # qk_interleaved = True: q[-1] = [q1,q2,q3,q4,q5,q6,...], k[-1] = [k1,k2,k3,k4,k5,k6,...]
    # qk_interleaved = False: q[-1] = [q1,q3,q5,...,q2,q4,q6,...], k[-1] = [k1,k3,k5,...,k2,k4,k6,...]
    qk_interleaved=False,
    multiple_of=1,
    rope_base=1000000,
    use_sliding_window=False,
    sliding_window=131072,
    max_window_layers=70,
)

parallel = dict(
    zero1=dict(size=-1),
    tensor=dict(size=1, mode="isp"),
    pipeline=dict(size=1, interleaved_overlap=True),
    weight=dict(size=8, overlap=True),
)
cudnn_deterministic = False
cudnn_benchmark = False

monitor = dict(
    # feishu alert configs
    alert=dict(
        enable_feishu_alert=DO_ALERT,
        feishu_alert_address=None,  # feishu webhook to send alert message
        light_monitor_address=None,  # light_monitor address to send heartbeat
        alert_file_path=f"llm_alter/{JOB_NAME}_alert.log",
    ),
    tensorboard=dict(
        queue_max_length=10,
    ),
)

generation = dict(
    ckpt_folder="/path/to/saved/ckpt",
    output_folder="/path/to/save/generation",
    batch_size=1,
    eos_id=[2, 0],
    bos_id=1,
    max_length=100,
    do_sample=True,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1,
    length_penalty=1.0,
)
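
Based on the knobs already present in this config, the changes I am considering when resuming from the last good checkpoint are sketched below. These are candidate overrides only, with guessed values, not a verified fix:

# Candidate overrides (hypothetical values), applied on top of the config above:
adam["lr"] = 5e-5                                # lower peak LR than the current 1e-4
grad_scaler["fp16"]["initial_scale"] = 2**14     # resume the loss scale near its last logged value (16384)
hybrid_zero_optimizer["clip_grad_norm"] = 1.0    # keep gradient clipping as configured
# skip_batches (currently "630-700") and use_fp32_norm are other knobs defined above
# that might matter here, but I am not sure how to tune them for this case.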