DTensor implementation #3117
YuliangLiu0306 started this conversation in Development | Core
Replies: 1 comment
What's the status of this work, and can I participate in it? 😊
Proposal
We have investigated the current DTensor implementations in PyTorch and TensorFlow. Inspired by them, we propose a new design for DTensor.
Motivation
- Supply a uniform way of checkpointing. In automatic parallelism and other flexible distributed training paradigms, we need to save and load checkpoints in a flexible, fine-grained way.
- DTensor serves as a tensor abstraction that carries distributed information. It is a key component for supporting both SPMD automatic parallelism and Gemini.
- Refactor related components such as `CommSpec`, `ShardingSpec`, `LayoutConverter`, `DeviceMesh`, etc. These components are tightly coupled with the automatic parallelism feature, which makes them hard to reuse in other components.
Design
We design several components for API refactoring.
Possible class definitions (pseudo-code)
- DTensor
- Layout
- ShardingSpec
- CommSpec
- LayoutConverter
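The proposal lists only the class names, so here is a minimal Python sketch of how these components might fit together. All field names, method signatures, and the `CommType` enum are assumptions made for illustration; they are not the actual ColossalAI (or PyTorch/TensorFlow) DTensor API. `DeviceMesh` is assumed to be the existing device-mesh class mentioned above.

```python
# A minimal sketch of the proposed classes. Names, fields, and methods are
# illustrative assumptions, not the final API.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

import torch


@dataclass
class ShardingSpec:
    """Describes how each tensor dimension is partitioned over device-mesh axes.

    dim_partition_dict maps a tensor dim to the mesh axes it is sharded over;
    dims not listed are replicated.
    """
    dims: int
    dim_partition_dict: Dict[int, List[int]] = field(default_factory=dict)


class CommType(Enum):
    ALL_GATHER = "all_gather"
    ALL_TO_ALL = "all_to_all"
    SHARD = "shard"
    ALL_REDUCE = "all_reduce"


@dataclass
class CommSpec:
    """One collective operation needed to move between two sharding specs."""
    comm_type: CommType
    mesh_axis: int
    tensor_dim: int


@dataclass
class Layout:
    """Bundles the device mesh, the sharding spec, and the global shape."""
    device_mesh: "DeviceMesh"  # assumed existing DeviceMesh class
    sharding_spec: ShardingSpec
    global_shape: torch.Size


class LayoutConverter:
    """Plans and applies the collectives that transform one layout into another."""

    def conversion_plan(self, src: Layout, dst: Layout) -> List[CommSpec]:
        # Placeholder: a real implementation would search for the cheapest
        # sequence of collectives between src and dst.
        raise NotImplementedError

    def apply(self, local_tensor: torch.Tensor, src: Layout, dst: Layout) -> torch.Tensor:
        for comm in self.conversion_plan(src, dst):
            local_tensor = self._execute(local_tensor, comm)
        return local_tensor

    def _execute(self, local_tensor: torch.Tensor, comm: CommSpec) -> torch.Tensor:
        raise NotImplementedError


class DTensor:
    """A tensor abstraction that carries its layout (distributed information)."""

    def __init__(self, local_tensor: torch.Tensor, layout: Layout):
        self.local_tensor = local_tensor
        self.layout = layout

    def redistribute(self, target: Layout, converter: LayoutConverter) -> "DTensor":
        new_local = converter.apply(self.local_tensor, self.layout, target)
        return DTensor(new_local, target)

    def to_global(self) -> torch.Tensor:
        # Placeholder: gather shards into the full tensor, e.g. for
        # fine-grained checkpointing as described in the motivation.
        raise NotImplementedError
```

One design point this sketch tries to capture: because `Layout` bundles the mesh, sharding spec, and global shape, `LayoutConverter` and checkpointing code can work purely against `Layout` objects, which addresses the decoupling concern raised in the motivation.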
Future work
After refactoring/implementing the above features, we could use them to implement a new abstraction called `DProxy`, which serves as a proxy for the real tensor in the automatic parallelism context. It will carry the information necessary to estimate the memory and computation overhead of distributed operations.
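Since `DProxy` is future work, the following is purely a speculative sketch of what such a proxy could look like: a metadata-only stand-in that keeps the global shape, dtype, and layout so costs can be estimated without allocating data. The method names and cost formula are assumptions, not part of the proposal.

```python
# Speculative sketch of a possible DProxy; names and formulas are assumptions.
import torch


class DProxy:
    """Stands in for a real tensor during auto-parallel planning.

    Stores only metadata (global shape, dtype, layout), so per-operation
    memory and communication costs can be estimated without real data.
    """

    def __init__(self, global_shape: torch.Size, dtype: torch.dtype, layout: "Layout"):
        self.global_shape = global_shape
        self.dtype = dtype
        self.layout = layout  # the Layout sketched above

    def local_numel(self, num_devices: int) -> int:
        # Assumes shards are split evenly across all devices in the mesh.
        total = 1
        for dim in self.global_shape:
            total *= dim
        return total // num_devices

    def memory_cost_bytes(self, num_devices: int) -> int:
        # Per-device memory footprint; assumes a floating-point dtype.
        return self.local_numel(num_devices) * torch.finfo(self.dtype).bits // 8
```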