Checkpointing is failing with SFTTrainer PEFT LoRA on DeepSpeed Zero-3 #2514

Open
7 of 9 tasks
SwayamInSync opened this issue Dec 21, 2024 · 1 comment
Labels
🐛 bug (Something isn't working) · ⚡ PEFT (Related to PEFT) · 🏋 SFT (Related to SFT)

@SwayamInSync (Contributor) commented:

System Info

- Platform: Linux-5.15.0-1074-azure-x86_64-with-glibc2.31
- Python version: 3.10.15
- PyTorch version: 2.5.1
- CUDA device(s): NVIDIA A100 80GB PCIe, NVIDIA A100 80GB PCIe, NVIDIA A100 80GB PCIe, NVIDIA A100 80GB PCIe
- Transformers version: 4.46.3
- Accelerate version: 1.0.1
- Datasets version: 3.0.2
- HF Hub version: 0.26.1
- TRL version: 0.12.2
- bitsandbytes version: not installed
- DeepSpeed version: 0.15.3
- Diffusers version: not installed
- Liger-Kernel version: 0.4.2
- LLM-Blender version: not installed
- OpenAI version: 1.54.1
- PEFT version: 0.14.0

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

I am not sure whether this issue belongs in this repository; I will relocate it if needed.
Using ZeRO-3 causes the following checkpointing error when training with LoRA from peft (a minimal sketch of the training setup is included below):

[rank0]:     raise CheckpointError(
[rank0]: torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.
[rank0]: tensor at position 4:
[rank0]: saved metadata: {'shape': torch.Size([4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 6:
[rank0]: saved metadata: {'shape': torch.Size([4096, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 12:
[rank0]: saved metadata: {'shape': torch.Size([4096, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 18:
[rank0]: saved metadata: {'shape': torch.Size([4096, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 40:
[rank0]: saved metadata: {'shape': torch.Size([4096, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 49:
[rank0]: saved metadata: {'shape': torch.Size([4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 51:
[rank0]: saved metadata: {'shape': torch.Size([11008, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 58:
[rank0]: saved metadata: {'shape': torch.Size([11008, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 66:
[rank0]: saved metadata: {'shape': torch.Size([4096, 11008]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}

Training works fine with:

  • Full Finetuning (no peft) + Zero-3
  • peft + Zero-2
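
Since the exact training script is not included here, the following is a minimal sketch of the kind of setup that triggers the error. The model name, dataset, and hyperparameters are illustrative placeholders (the 4096/11008 hidden sizes in the traceback suggest a 7B Llama-style model), not the values actually used.

# reproduction_sketch.py -- illustrative sketch only; model, dataset, and hyperparameters are placeholders
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder dataset with a plain "text" column
dataset = load_dataset("stanfordnlp/imdb", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,       # the traceback originates in torch.utils.checkpoint recomputation
    bf16=True,
    deepspeed="ds_zero3_config.json",  # the ZeRO-3 config shown below
)

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model id
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()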

Here is the DeepSpeed config file:

{
    "fp16": {
        "enabled": false
    },
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "steps_per_print": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "memory_breakdown": false,
    "communication_data_type": "bf16"
}
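
The exact launch command is not shown above (the "auto" values in this config are filled in from the training arguments by the transformers DeepSpeed integration). Assuming 4 GPUs and that this config is the one referenced from the training arguments, something along these lines should reproduce the run (illustrative, not the command actually used):

accelerate launch --num_processes 4 reproduction_sketch.py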

Expected behavior

The training should run to completion without the checkpointing error.

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
  • Any traceback provided is complete
@SwayamInSync (Contributor, Author) commented:

I encountered this with SFTTrainer; if it turns out to be a general issue with Trainer from transformers, the issue can be relocated there.

@August-murr added the 🐛 bug, 🏋 SFT, and ⚡ PEFT labels on Dec 24, 2024