Problem with stage-2 fine-tuning training #29

Open
wangyin717 opened this issue Jul 2, 2024 · 3 comments

wangyin717 commented Jul 2, 2024

Error message:
Using downloaded and verified file: /data/MiniGPT4Qwen/lavis/../cache/dataset/llava_instruct/llava_instruction_156k.json
2024-07-02 14:04:21,365 [INFO] Building datasets...
Using downloaded and verified file: /data/MiniGPT4Qwen/lavis/../cache/dataset/videochatgpt/videochatgpt_instruction_100k.json
2024-07-02 14:04:22,514 [INFO] Building datasets...
Finishing Initializing Vision-Encoder...
2024-07-02 14:04:35,207 [INFO] freeze vision encoder
Finishing Loading Q-former Initializing Config...
Finishing Initializing Q-former...
2024-07-02 14:04:35,917 [INFO] no text input for q-former
Loading LLM:/data/MiniGPT4Qwen/cache/ckpt/Qwen7B-chat...
2024-07-02 14:04:36,396 [WARNING] The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
2024-07-02 14:04:36,397 [WARNING] Try importing flash-attention for faster inference...
2024-07-02 14:04:36,397 [WARNING] Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
2024-07-02 14:04:36,397 [WARNING] Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
2024-07-02 14:04:36,397 [WARNING] Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:06<00:00, 1.16it/s]
Unfreeze LLM!!!
Start loading pretrained model: /data/MiniGPT4Qwen/cache/ckpt/blip2/blip2_pretrained_flant5xxl.pth
Loading the File Named: /data/MiniGPT4Qwen/cache/ckpt/blip2/blip2_pretrained_flant5xxl.pth...
2024-07-02 14:04:43,919 [INFO] load checkpoint from /data/MiniGPT4Qwen/cache/ckpt/blip2/blip2_pretrained_flant5xxl.pth
Start loading finetuned model: /data/MiniGPT4Qwen/lavis/output/ckpt-and-data/pretrain/global_step2181/model.pth
Checkpoint: /data/MiniGPT4Qwen/lavis/output/ckpt-and-data/pretrain/global_step2181/model.pth

###################################################
Here, when loading the pretrained model model.pth, it reports Missing keys
###################################################

2024-07-02 14:04:43,958 [INFO] Missing keys ['query_tokens', 'visual_encoder.cls_token', 'visual_encoder.pos_embed', 'visual_encoder.patch_embed.proj.weight', 'visual_encoder.patch_embed.proj.bias', 'visual_encoder.blocks.0.norm1.weight', 'visual_encoder.blocks.0.norm1.bias', 'visual_encoder.blocks.0.attn.q_bias', 'visual_encoder.blocks.0.attn.v_bias', 'visual_encoder.blocks.0.attn.qkv.weight', 'visual_encoder.blocks.0.attn.proj.weight', 'visual_encoder.blocks.0.attn.proj.bias', 'visual_encoder.blocks.0.norm2.weight', ...... (many more keys follow)

What could be causing this? Thank you!

wangyin717 closed this as not planned Jul 2, 2024
wangyin717 reopened this Jul 2, 2024
Coobiw (Owner) commented Jul 2, 2024

Has it been resolved? If not, you can post your YAML config file, and also check whether the weights were downloaded completely.

wangyin717 (Author) commented:

This is my sft.yaml config file. I'm trying to train on a single A100, so I changed the parameters under run. I launch training with CUDA_VISIBLE_DEVICES=0 python train.py --cfg-path lavis/projects/pp_qwen7b_video/sft.yaml, then the Missing keys message above appears, and the run finally fails with an out-of-memory error (a single A100 certainly doesn't have enough memory). Besides, I have checked the model.pth weights and they look fine, so in theory the Missing keys message shouldn't appear.

sft.yaml:

model:
  arch: minigpt4qwen
  model_type: qwen7b_chat
  load_finetuned: True
  load_pretrained: True

  # pretrained: "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/blip2_pretrained_flant5xxl.pth"
  pretrained: "/data/MiniGPT4Qwen/cache/ckpt/blip2/blip2_pretrained_flant5xxl.pth"
  # finetuned: ""
  finetuned: "/data/MiniGPT4Qwen/lavis/output/pp_7b_video/pretrain/global_step2181/model.pth"

  # vit encoder
  vit_model: "eva_clip_g"
  image_size: 224
  drop_path_rate: 0
  use_grad_checkpoint: True
  vit_precision: "fp16"  # if you want to unfreeze the ViT for training, change this to fp32; otherwise AMP mixed-precision training will break (an error at the scaler, since no fp16 AdamW is implemented)
  freeze_vit: True
  unfreeze_pos_embed: False

  # Q-Former
  num_query_token: 32
  qformer_text_input: False
  freeze_qformer: True
  freeze_queries: True

  # projection
  freeze_proj: False

  # path to Vicuna checkpoint
  llm_model: "/data/MiniGPT4Qwen/cache/ckpt/Qwen7B-chat"

  # unfreeze LLM for better chat
  freeze_llm: False

  # lora config
  get_lora: False
  lora_alpha: 32
  lora_r: 8
  lora_dropout: 0.05

  # text length when training
  max_txt_len: 1536 # 512

  # enable autocast of vit
  enable_autocast: False

datasets:
  llava_instruct_156k: # name of the dataset builder
    vis_processor:
        train:
          name: "blip2_image_train"
          image_size: 224
    text_processor:
        train:
          name: "base_instruction"
          max_words: 200

  videochatgpt_100k: # name of the dataset builder
    vis_processor:
        train:
          name: "blip2_image_train"
          image_size: 224
    text_processor:
        train:
          name: "base_instruction"
          max_words: 200

# run:
#   output_dir: "lavis/output/pp_7b_video/sft_video/"

#   # task: deepspeed_image_text_pretrain

#   task: deepspeed_image_text_pretrain
#   num_workers: 4

#   seed: 42

#   world_size: 1
#   dist_url: "env://"
#   distributed: True

#   max_epoch: 1
#   log_freq: 10

#   lr_sched: "linear_warmup_cosine_lr_step-wise"
#   warmup_lr: 0
#   init_lr: 2e-5
#   min_lr: 0
#   warmup_ratio: 0.1

#   deepspeed_config:
#     # global batch = 128 = n_ranks * grad_acc_steps * micro_batch_size = (4//2) * 64 * 1
#     # 8 x 3090
#     # pp=8 dp=1 nproc=pp*dp=8 
#     gradient_accumulation_steps: 128 # 128 // dp(=1) // bs_per_gpu(=1) = 128
#     train_micro_batch_size_per_gpu: 1

#     gradient_clipping: 1.
#     steps_per_print: 10
#     wall_clock_breakdown: false
#     dump_state: False

#     fp16:
#         enabled: false
#         loss_scale: 0
#         loss_scale_window: 1000
#         initial_scale_power: 16
#         hysteresis: 2
#         min_loss_scale: 1

#     bf16:
#         enabled: true

#     optimizer:
#         type: "AdamW"
#         params:
#             lr: 2e-5
#             betas: [0.9,0.99]
#             eps: 1e-7
#             weight_decay: 0.

#     zero_optimization:
#         stage: 0
#         # offload_optimizer:
#         #   device: "cpu"
#         #   pin_memory: true
#         allgather_partitions: true
#         allgather_bucket_size: 2e8
#         overlap_comm: true
#         reduce_scatter: true
#         reduce_bucket_size: 2e8
#         contiguous_gradients: true

run:
  task: image_text_pretrain
  # optimizer
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 2e-5
  min_lr: 1e-6
  warmup_lr: 0
  warmup_steps: 500
  weight_decay: 0.05
  grad_norm_clip: 1.
  max_epoch: 10 #5
  batch_size_train: 1 #16
  batch_size_eval: 1
  num_workers: 4
  accum_grad_iters: 16 #1

  seed: 42
  output_dir: "lavis/output/pp_7b_video/pretrain/"

  log_freq: 64

  amp: True
  autocast_dtype: "bf16" # ['bf16','fp16']
  loss_scale: False # defaults to True; if autocast_dtype is float16, set this to True; for bfloat16, False is recommended
  resume_ckpt_path: null

  evaluate: False
  train_splits: ["train"]
  # valid_splits: ["val"]
  # test_splits: ["test"]

  device: "cuda"
  world_size: 1
  dist_url: "env://"
  distributed: True
  # fsdp: False
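
As a quick sanity check on the model.pth weights mentioned above, here is a minimal sketch (assuming a standard PyTorch environment; the path is the finetuned entry from the YAML) for listing which parameter names the checkpoint actually contains:

import torch

# Hypothetical helper script, not part of the repo: print the parameter names
# stored in the stage-1 checkpoint referenced by "finetuned" above.
ckpt = torch.load(
    "/data/MiniGPT4Qwen/lavis/output/pp_7b_video/pretrain/global_step2181/model.pth",
    map_location="cpu",
)
# LAVIS-style checkpoints may wrap the weights under a "model" key.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))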

wangyin717 reopened this Jul 3, 2024
Coobiw (Owner) commented Jul 3, 2024

I took a look; this shouldn't be an error: https://github.com/Coobiw/MiniGPT4Qwen/blob/master/lavis/models/base_model.py#L53

    msg = self.load_state_dict(state_dict, strict=False)
    logging.info("Missing keys {}".format(msg.missing_keys))

This is just a log line. model.pth only contains the parameters of the intermediate projection layer (the first pretrain stage trains only that projection layer), while everything else uses the eva, blip2_qformer and Qwen-7B weights, so an INFO-level reminder is printed. Just download the weights as described in https://github.com/Coobiw/MiniGPT4Qwen/blob/master/WEIGHT.md and you'll be fine.

If you're worried, train a few steps first and check the loss. If GPU memory isn't enough, you can set freeze_llm to True first and see how it goes.
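
To make that concrete, a small toy example (generic PyTorch modules, not the repo's actual classes) shows that load_state_dict(..., strict=False) simply reports the parameters the checkpoint does not cover as missing_keys instead of raising an error:

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # stand-ins for the frozen EVA / Q-Former / Qwen-7B weights and the trained projection
        self.visual_encoder = nn.Linear(16, 16)
        self.llm_proj = nn.Linear(16, 16)

model = ToyModel()
# a checkpoint that only contains the projection layer, like the stage-1 model.pth
ckpt = {"llm_proj.weight": torch.zeros(16, 16), "llm_proj.bias": torch.zeros(16)}
msg = model.load_state_dict(ckpt, strict=False)
print(msg.missing_keys)  # ['visual_encoder.weight', 'visual_encoder.bias'] -- expected, not an error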
