When I tried to convert the DeepSpeed weights to an fp32 checkpoint with zero_to_fp32.py, I got this error:
```
Traceback (most recent call last):
  File "code/diffusers/tools/zero_to_fp32.py", line 601, in <module>
    convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
  File "code/diffusers/tools/zero_to_fp32.py", line 536, in convert_zero_checkpoint_to_fp32_state_dict
    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
  File "code/diffusers/tools/zero_to_fp32.py", line 521, in get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
  File "code/diffusers/tools/zero_to_fp32.py", line 205, in _get_fp32_state_dict_from_zero_checkpoint
    zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
  File "code/diffusers/tools/zero_to_fp32.py", line 153, in parse_optim_states
    raise ValueError(f"{files[0]} is not a zero checkpoint")
ValueError: work_dirs/checkpoint-2000/pytorch_model/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt is not a zero checkpoint
```
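If the root cause is that the checkpoint was written under ZeRO stage 0 (which does not partition parameters, so `parse_optim_states` finds no flat fp32 groups to merge), a possible workaround is to skip zero_to_fp32.py and read the model-states file directly. This is a sketch under the assumption that DeepSpeed stores the full, unpartitioned state dict in `mp_rank_00_model_states.pt` under the `"module"` key; the paths and helper name below are illustrative:

```python
import os
import tempfile

import torch


def extract_fp32_state_dict(model_states_path):
    """Load a ZeRO-0 model-states file and upcast every tensor to fp32.

    Assumes DeepSpeed's usual layout: the model's state_dict is stored
    under the "module" key of the saved checkpoint dict.
    """
    ckpt = torch.load(model_states_path, map_location="cpu")
    return {k: v.float() for k, v in ckpt["module"].items()}


# Demo on a synthetic checkpoint file standing in for something like
# work_dirs/checkpoint-2000/pytorch_model/mp_rank_00_model_states.pt:
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "mp_rank_00_model_states.pt")
torch.save({"module": {"weight": torch.zeros(4, dtype=torch.bfloat16)}}, path)
fp32_sd = extract_fp32_state_dict(path)
```

If this works for your checkpoint, the resulting dict can be passed straight to `model.load_state_dict(...)` or saved with `torch.save`.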
This issue is also open in the DeepSpeed repo.
I use DeepSpeed ZeRO-0 to train a diffusion model on multi-node GPUs with the Hugging Face diffusers training scripts. The accelerate config is set to:
My work_dir tree structure is: