DTensor implementation #3117
YuliangLiu0306 started this conversation in Development | Core
Replies: 1 comment
What's the status of this work, and can I participate in it? 😊
Proposal
We have investigated the current DTensor implementations in PyTorch and TensorFlow. Inspired by them, we propose a new design for DTensor.
Motivation
- Supply a uniform way of checkpointing. In automatic parallelism and other flexible distributed training paradigms, we need to save and load checkpoints in a flexible, fine-grained way.
- DTensor serves as a tensor abstraction that carries distributed information. It is a key component for supporting both SPMD automatic parallelism and Gemini.
- Refactor related components such as `CommSpec`, `ShardingSpec`, `LayoutConverter`, `DeviceMesh`, etc. These components are tightly coupled with the automatic parallelism feature, which makes them hard to reuse in other components.
Design
We design several components for API refactoring.
Possible class definitions (pseudo-code)
- DTensor
- Layout
- ShardingSpec
- CommSpec
- LayoutConverter
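The proposal lists only the class names, so here is a minimal Python sketch of how these components might fit together. All field names, method signatures, and the `CommType` enum are assumptions made for illustration; they are not the actual ColossalAI (or PyTorch/TensorFlow) DTensor API. `DeviceMesh` is assumed to be the existing device-mesh class mentioned above.

```python
# A minimal sketch of the proposed classes. Names, fields, and methods are
# illustrative assumptions, not the final API.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

import torch


@dataclass
class ShardingSpec:
    """Describes how each tensor dimension is partitioned over device-mesh axes.

    dim_partition_dict maps a tensor dim to the mesh axes it is sharded over;
    dims not listed are replicated.
    """
    dims: int
    dim_partition_dict: Dict[int, List[int]] = field(default_factory=dict)


class CommType(Enum):
    ALL_GATHER = "all_gather"
    ALL_TO_ALL = "all_to_all"
    SHARD = "shard"
    ALL_REDUCE = "all_reduce"


@dataclass
class CommSpec:
    """One collective operation needed to move between two sharding specs."""
    comm_type: CommType
    mesh_axis: int
    tensor_dim: int


@dataclass
class Layout:
    """Bundles the device mesh, the sharding spec, and the global shape."""
    device_mesh: "DeviceMesh"  # assumed existing DeviceMesh class
    sharding_spec: ShardingSpec
    global_shape: torch.Size


class LayoutConverter:
    """Plans and applies the collectives that transform one layout into another."""

    def conversion_plan(self, src: Layout, dst: Layout) -> List[CommSpec]:
        # Placeholder: a real implementation would search for the cheapest
        # sequence of collectives between src and dst.
        raise NotImplementedError

    def apply(self, local_tensor: torch.Tensor, src: Layout, dst: Layout) -> torch.Tensor:
        for comm in self.conversion_plan(src, dst):
            local_tensor = self._execute(local_tensor, comm)
        return local_tensor

    def _execute(self, local_tensor: torch.Tensor, comm: CommSpec) -> torch.Tensor:
        raise NotImplementedError


class DTensor:
    """A tensor abstraction that carries its layout (distributed information)."""

    def __init__(self, local_tensor: torch.Tensor, layout: Layout):
        self.local_tensor = local_tensor
        self.layout = layout

    def redistribute(self, target: Layout, converter: LayoutConverter) -> "DTensor":
        new_local = converter.apply(self.local_tensor, self.layout, target)
        return DTensor(new_local, target)

    def to_global(self) -> torch.Tensor:
        # Placeholder: gather shards into the full tensor, e.g. for
        # fine-grained checkpointing as described in the motivation.
        raise NotImplementedError
```

One design point this sketch tries to capture: because `Layout` bundles the mesh, sharding spec, and global shape, `LayoutConverter` and checkpointing code can work purely against `Layout` objects, which addresses the decoupling concern raised in the motivation.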
Future work
After refactoring/implementing the above features, we could use them to implement a new abstraction called `DProxy`, which serves as a proxy for the real tensor in the automatic parallelism context. It will carry the information necessary to estimate the memory and computation overhead of distributed operations.
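Since `DProxy` is future work, the following is purely a speculative sketch of what such a proxy could look like: a metadata-only stand-in that keeps the global shape, dtype, and layout so costs can be estimated without allocating data. The method names and cost formula are assumptions, not part of the proposal.

```python
# Speculative sketch of a possible DProxy; names and formulas are assumptions.
import torch


class DProxy:
    """Stands in for a real tensor during auto-parallel planning.

    Stores only metadata (global shape, dtype, layout), so per-operation
    memory and communication costs can be estimated without real data.
    """

    def __init__(self, global_shape: torch.Size, dtype: torch.dtype, layout: "Layout"):
        self.global_shape = global_shape
        self.dtype = dtype
        self.layout = layout  # the Layout sketched above

    def local_numel(self, num_devices: int) -> int:
        # Assumes shards are split evenly across all devices in the mesh.
        total = 1
        for dim in self.global_shape:
            total *= dim
        return total // num_devices

    def memory_cost_bytes(self, num_devices: int) -> int:
        # Per-device memory footprint; assumes a floating-point dtype.
        return self.local_numel(num_devices) * torch.finfo(self.dtype).bits // 8
```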