Checkpointing is failing with SFTTrainer PEFT LoRA on DeepSpeed Zero-3 #2514

Open
7 of 9 tasks
SwayamInSync opened this issue Dec 21, 2024 · 1 comment
Labels
🐛 bug (Something isn't working) · ⚡ PEFT (Related to PEFT) · 🏋 SFT (Related to SFT)

@SwayamInSync (Contributor) commented:

System Info

- Platform: Linux-5.15.0-1074-azure-x86_64-with-glibc2.31
- Python version: 3.10.15
- PyTorch version: 2.5.1
- CUDA device(s): NVIDIA A100 80GB PCIe, NVIDIA A100 80GB PCIe, NVIDIA A100 80GB PCIe, NVIDIA A100 80GB PCIe
- Transformers version: 4.46.3
- Accelerate version: 1.0.1
- Datasets version: 3.0.2
- HF Hub version: 0.26.1
- TRL version: 0.12.2
- bitsandbytes version: not installed
- DeepSpeed version: 0.15.3
- Diffusers version: not installed
- Liger-Kernel version: 0.4.2
- LLM-Blender version: not installed
- OpenAI version: 1.54.1
- PEFT version: 0.14.0

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

I am not sure whether this issue belongs in this repository; I will relocate it if needed.
Using ZeRO-3 causes the following checkpointing error when training with LoRA from peft (a minimal sketch of the training setup is included below):

[rank0]:     raise CheckpointError(
[rank0]: torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Recomputed values for the following tensors have different metadata than during the forward pass.
[rank0]: tensor at position 4:
[rank0]: saved metadata: {'shape': torch.Size([4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 6:
[rank0]: saved metadata: {'shape': torch.Size([4096, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 12:
[rank0]: saved metadata: {'shape': torch.Size([4096, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 18:
[rank0]: saved metadata: {'shape': torch.Size([4096, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 40:
[rank0]: saved metadata: {'shape': torch.Size([4096, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 49:
[rank0]: saved metadata: {'shape': torch.Size([4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 51:
[rank0]: saved metadata: {'shape': torch.Size([11008, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 58:
[rank0]: saved metadata: {'shape': torch.Size([11008, 4096]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: tensor at position 66:
[rank0]: saved metadata: {'shape': torch.Size([4096, 11008]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}
[rank0]: recomputed metadata: {'shape': torch.Size([0]), 'dtype': torch.bfloat16, 'device': device(type='cuda', index=0)}

Training works fine with:

  • Full Finetuning (no peft) + Zero-3
  • peft + Zero-2
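
Since the exact training script is not included here, the following is a minimal sketch of the kind of setup that triggers the error. The model name, dataset, and hyperparameters are illustrative placeholders (the 4096/11008 hidden sizes in the traceback suggest a 7B Llama-style model), not the values actually used.

# reproduction_sketch.py -- illustrative sketch only; model, dataset, and hyperparameters are placeholders
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder dataset with a plain "text" column
dataset = load_dataset("stanfordnlp/imdb", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,       # the traceback originates in torch.utils.checkpoint recomputation
    bf16=True,
    deepspeed="ds_zero3_config.json",  # the ZeRO-3 config shown below
)

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model id
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()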

Here is the DeepSpeed config file:

{
    "fp16": {
        "enabled": false
    },
    "bf16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "steps_per_print": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "memory_breakdown": false,
    "communication_data_type": "bf16"
}
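
The exact launch command is not shown above (the "auto" values in this config are filled in from the training arguments by the transformers DeepSpeed integration). Assuming 4 GPUs and that this config is the one referenced from the training arguments, something along these lines should reproduce the run (illustrative, not the command actually used):

accelerate launch --num_processes 4 reproduction_sketch.py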

Expected behavior

The training should run to completion without the checkpointing error.

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
  • Any traceback provided is complete
@SwayamInSync (Contributor, Author) commented:

I encountered this with SFTTrainer; if it turns out to be a general issue with Trainer from transformers, the issue can be relocated there.

@August-murr added the 🐛 bug, 🏋 SFT, and ⚡ PEFT labels on Dec 24, 2024