How could I convert ZeRO-0 deepspeed weights into fp32 model checkpoint? #3210

liming-ai opened this issue on Nov 1, 2024 · 0 comments

This issue is also open in the DeepSpeed repo.

I am using DeepSpeed ZeRO-0 to train a diffusion model on multiple GPU nodes with the Hugging Face diffusers training scripts. The accelerate config is set to:

deepspeed_config:
  deepspeed_hostfile: /opt/tiger/hostfile
  deepspeed_multinode_launcher: pdsh
  gradient_clipping: auto
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 0
distributed_type: DEEPSPEED
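
For context, my understanding (an assumption on my side, not verified against the accelerate source) is that this maps to a DeepSpeed config with ZeRO partitioning disabled, roughly:

# Illustrative only: approximately what the accelerate settings above mean
# in DeepSpeed terms. Stage 0 partitions nothing (no optimizer states,
# gradients, or parameters are sharded), so each rank holds full copies.
ds_config = {
    "bf16": {"enabled": True},          # the run used bf16 (see the file names below)
    "gradient_clipping": 1.0,           # "auto" is filled in from the training script
    "zero_optimization": {"stage": 0},  # ZeRO disabled -> DDP-style replication
}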

When I try to convert the DeepSpeed weights to an fp32 checkpoint with zero_to_fp32.py, I get this error:

Traceback (most recent call last):
  File "code/diffusers/tools/zero_to_fp32.py", line 601, in <module>
    convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
  File "code/diffusers/tools/zero_to_fp32.py", line 536, in convert_zero_checkpoint_to_fp32_state_dict
    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
  File "code/diffusers/tools/zero_to_fp32.py", line 521, in get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
  File "code/diffusers/tools/zero_to_fp32.py", line 205, in _get_fp32_state_dict_from_zero_checkpoint
    zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
  File "code/diffusers/tools/zero_to_fp32.py", line 153, in parse_optim_states
    raise ValueError(f"{files[0]} is not a zero checkpoint")
ValueError: work_dirs/checkpoint-2000/pytorch_model/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt is not a zero checkpoint
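
As far as I can tell, parse_optim_states raises this when the optimizer-state file does not carry a supported ZeRO stage marker. A minimal inspection sketch (the optimizer_state_dict / zero_stage key names follow the DeepSpeed checkpoint layout I have seen and may differ across versions):

import torch

path = "work_dirs/checkpoint-2000/pytorch_model/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt"
sd = torch.load(path, map_location="cpu")

# zero_to_fp32.py only accepts stages 1, 2, and 3; a ZeRO-0 run stores
# stage 0 (or no stage at all), which trips the "not a zero checkpoint" error.
print(sd.keys())
osd = sd.get("optimizer_state_dict")
if isinstance(osd, dict):
    print(osd.get("zero_stage"))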

My work_dir tree structure is:

work_dirs/checkpoint-2000
├── latest
├── pytorch_model
│   ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt
│   └── mp_rank_00_model_states.pt
├── random_states_0.pkl
├── random_states_10.pkl
├── random_states_11.pkl
├── random_states_12.pkl
├── random_states_13.pkl
├── random_states_14.pkl
├── random_states_15.pkl
├── random_states_1.pkl
├── random_states_2.pkl
├── random_states_3.pkl
├── random_states_4.pkl
├── random_states_5.pkl
├── random_states_6.pkl
├── random_states_7.pkl
├── random_states_8.pkl
├── random_states_9.pkl
├── scheduler.bin
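
In case it is relevant: since ZeRO-0 shards nothing, the full model weights should already sit in mp_rank_00_model_states.pt, so perhaps they can be loaded directly instead of going through zero_to_fp32.py. A rough sketch under that assumption (the "module" key is a guess based on the DeepSpeed checkpoint layout; inspect ckpt.keys() if it differs, and note that upcasting bf16 weights to fp32 does not recover any precision):

import torch

ckpt = torch.load(
    "work_dirs/checkpoint-2000/pytorch_model/mp_rank_00_model_states.pt",
    map_location="cpu",
)
state_dict = ckpt["module"]  # assumed key for the model weights; check ckpt.keys()

# Upcast floating-point tensors to fp32; leave integer buffers untouched.
fp32_state_dict = {
    k: v.float() if torch.is_tensor(v) and v.is_floating_point() else v
    for k, v in state_dict.items()
}
torch.save(fp32_state_dict, "work_dirs/checkpoint-2000/pytorch_model_fp32.bin")

If the run used DeepSpeed's BF16 optimizer (the bf16_zero_* file names suggest it may have), the true fp32 master weights may live in those optimizer-state files rather than in the model-states file, but I have not confirmed this.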