[checkpoint] Open Source #27
I have a question about protobuf: do we need to push the codegen file?
Based on our discussion, we will keep the protobuf files for now. Otherwise, users would have to generate the code on their own.
8d962d6 to 42d8c22
Could you help clean up the fast checkpoint code? There seems to be some code that is not used.
Sure. I will remove
In this PR, we open source vescale.checkpoint, a distributed LLM checkpointing system.
vescale.checkpoint offers simple and straightforward APIs, enabling users to load and save the distributed model (DModule) and optimizer (DistributedOptimizer) seamlessly, abstracting away the complexities of underlying details such as process rank and device mesh.
vescale.checkpoint supports load-time checkpoint resharding when varying the degrees of data, tensor, or pipeline (TODO) parallelism for both the veScale distributed model (DModule) and optimizer (DistributedOptimizer).
vescale.checkpoint incorporates fast checkpointing and various I/O optimization techniques, enhancing I/O efficiency during large language model training.
vescale.checkpoint will be a part of the OmniStore project, a new open source project coming soon.
Credit to veScale Checkpoint Team
This endeavor would not have been possible without the contributions of the veScale Checkpoint team, which includes but is not limited to:
@shanesyy-1992 @MingjiHan99 @AHEADer @raywan-110 @michael4RD @lazychao @leochen-ai
Also thanks to the great guidance and leadership of: @pengyanghua @eric-haibin-lin @liwenchangbdbz @Meteorix
Credit to veScale Team
We would like to sincerely acknowledge the assistance of and collaboration with the veScale team, which includes but is not limited to:
@leonardo0lyj @JsBlueCat @MackZackA @Vremold @jc-bytedance @lichen225
Credit to PyTorch Distributed Checkpoint (DCP) Team
We would like to sincerely acknowledge the assistance of and collaboration with the PyTorch Distributed Checkpoint (DCP) team, which includes but is not limited to:
@wz337 @kumpera @fegin @LucasLLC
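For readers unfamiliar with the load-time resharding mentioned in the PR summary, here is a minimal conceptual sketch in plain Python. It does not use the vescale.checkpoint API: the function names (shard, save_checkpoint, load_resharded) are hypothetical, the "checkpoint" is an in-memory dict, and the tensor is a flat list split along one dimension. It also gathers the whole tensor before re-splitting, which a real system would likely avoid.

```python
from typing import Dict, List


def shard(tensor: List[float], world_size: int) -> Dict[int, List[float]]:
    """Split a flat tensor into `world_size` contiguous, near-equal shards."""
    base, extra = divmod(len(tensor), world_size)
    shards: Dict[int, List[float]] = {}
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional element each.
        length = base + (1 if rank < extra else 0)
        shards[rank] = tensor[start:start + length]
        start += length
    return shards


def save_checkpoint(tensor: List[float], world_size: int) -> Dict[int, List[float]]:
    """Hypothetical save: every rank stores only its own shard."""
    return shard(tensor, world_size)


def load_resharded(
    checkpoint: Dict[int, List[float]], new_world_size: int
) -> Dict[int, List[float]]:
    """Hypothetical load: reassemble saved shards, then split for the new degree."""
    full = [x for rank in sorted(checkpoint) for x in checkpoint[rank]]
    return shard(full, new_world_size)


weights = [float(i) for i in range(8)]
saved = save_checkpoint(weights, world_size=2)    # saved with parallel degree 2
loaded = load_resharded(saved, new_world_size=4)  # loaded with parallel degree 4
```

The point of the sketch is only that a checkpoint written under one parallel degree can be read back under another; the mapping from saved shards to new ranks is computed at load time rather than fixed at save time.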