# [checkpoint] feat: open source fast checkpoint system #38
## Summary

We improved `vescale.checkpoint` with the following new features for fast checkpointing (the first three are built-in techniques that require no manual activation):

- **Saving Plan Caching**: During training, the program may save model and optimizer checkpoints every n steps. Once a saving plan is created, it remains unchanged as long as the model does. We implemented plan caching to avoid regenerating the plan when checkpointing a model or optimizer multiple times, reducing unnecessary compute and communication costs. As of 05/30/2024, PyTorch DCP does not support plan caching.
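Plan caching kicks in automatically on repeated saves. A minimal sketch of a training loop that benefits from it (the checkpoint path, the `checkpoint_state` layout, and the training-loop helpers are illustrative assumptions):

```python
import vescale

# Illustrative checkpoint state; distributed_model / distributed_optimizer
# stand in for objects built with veScale's parallelism APIs.
checkpoint_state = {"model": distributed_model, "optimizer": distributed_optimizer}

for step, batch in enumerate(dataloader):  # dataloader is a placeholder
    train_one_step(batch)                  # placeholder training step
    if step % 1000 == 0:
        # The first call generates the saving plan; later calls reuse the
        # cached plan as long as the model is unchanged, skipping the
        # plan's compute and communication costs.
        vescale.checkpoint.save("/user/vescale/gpt/", checkpoint_state)
```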
- **Saving Plan Load-Balancing**: In data parallel training, models are replicated across GPUs that have different data parallel ranks but the same pipeline and tensor parallel ranks. Existing PyTorch DCP (as of 05/30/2024) deduplicates replicated tensors using a simple algorithm, causing GPUs with data parallel rank 0 to save the entire model and leading to load imbalance. We implemented a load-balancing algorithm to address this issue when deduplicating model tensors.
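To illustrate the load-balancing idea only (a toy sketch, not the algorithm veScale actually ships), replicated tensors can be assigned to writers round-robin across the data parallel group instead of always electing rank 0:

```python
# Toy sketch of load-balanced deduplication: each replicated tensor is
# assigned to exactly one data parallel rank, spread round-robin instead
# of a naive dedup that always elects rank 0 as the writer.
def assign_writers(tensor_names: list[str], dp_ranks: list[int]) -> dict[str, int]:
    assignment = {}
    for i, name in enumerate(sorted(tensor_names)):
        # Round-robin over the replica group balances tensors written per
        # rank (a real implementation would balance by bytes, not count).
        assignment[name] = dp_ranks[i % len(dp_ranks)]
    return assignment

# Example: 4 tensors replicated across data parallel ranks 0 and 1;
# each rank now saves half the tensors instead of rank 0 saving all four.
writers = assign_writers(["embed", "layer0", "layer1", "head"], [0, 1])
```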
- **D2H Tensor Copying via Pinned Memory**: When copying tensors from GPU to host memory, `vescale.checkpoint` uses pinned host memory, reducing memory allocation costs each time a checkpoint is saved. As of 05/30/2024, PyTorch DCP does not support pinned memory.
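This optimization is built in and transparent. As a rough sketch of the technique itself (the buffer cache and function names here are ours, not `vescale.checkpoint` internals):

```python
import torch

# Pinned (page-locked) staging buffers, allocated once and reused across
# checkpoint saves instead of paying the allocation cost every time.
_pinned_buffers: dict = {}

def d2h_copy(gpu_tensor: torch.Tensor) -> torch.Tensor:
    key = (tuple(gpu_tensor.shape), gpu_tensor.dtype)
    if key not in _pinned_buffers:
        _pinned_buffers[key] = torch.empty(
            gpu_tensor.shape, dtype=gpu_tensor.dtype, pin_memory=True
        )
    host_tensor = _pinned_buffers[key]
    # Pinned memory allows an asynchronous DMA copy that overlaps with
    # compute; synchronize before serializing host_tensor to storage.
    host_tensor.copy_(gpu_tensor, non_blocking=True)
    return host_tensor
```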
- **Checkpoint Broadcasting**: In data parallel training, models are replicated across GPUs that have different data parallel ranks but the same pipeline and tensor parallel ranks. If `broadcast_checkpoint` is enabled, `vescale.checkpoint.load` lets GPUs with data parallel rank 0 load the model and broadcast it to the GPUs with higher data parallel ranks. If the GPUs are connected with NCCL and I/O bandwidth is fully utilized, broadcasting model tensors speeds up checkpoint loading compared to having all GPUs load the model from persistent storage.
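For example (the checkpoint path and `checkpoint_state` layout below are illustrative assumptions):

```python
import vescale

# Illustrative checkpoint state for a distributed model and optimizer.
checkpoint_state = {"model": distributed_model, "optimizer": distributed_optimizer}

# Data parallel rank 0 reads the checkpoint from persistent storage and
# broadcasts the tensors to the higher data parallel ranks over NCCL.
vescale.checkpoint.load("/user/vescale/gpt/", checkpoint_state, broadcast_checkpoint=True)
```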
- **Asynchronous Checkpointing**: When `vescale.checkpoint.save` is called, it first generates a saving plan and then synchronously copies tensors from GPU to host memory. If `async_checkpoint` is enabled, the training program can continue after the D2H copying, while `vescale.checkpoint.save` continues to serialize tensors and dump the checkpoint to persistent storage asynchronously without blocking training. As of 05/30/2024, PyTorch DCP does not support asynchronous checkpointing.
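For example (again with an illustrative path and `checkpoint_state`):

```python
import vescale

# Illustrative checkpoint state for a distributed model and optimizer.
checkpoint_state = {"model": distributed_model, "optimizer": distributed_optimizer}

# Returns once the plan and D2H copy are done; tensor serialization and
# the write to persistent storage proceed in the background, so the
# training loop is not blocked on I/O.
vescale.checkpoint.save("/user/vescale/gpt/", checkpoint_state, async_checkpoint=True)
```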
## Acknowledgement

We sincerely appreciate all contributors, including but not limited to @shanesyy-1992, @raywan-110, @lazychao, @AHEADer, and @MingjiHan99.