Refactor pipeline parallelism for training #4050
-
How to integrate with ShardFormer

Issues to resolve
Process group management

As process groups are used in other components (like the pipeline schedule and the optimizer), we'd better initialize them outside of `ShardFormer`:

```python
def __init__(self, shard_config, pg_mesh, tp_axis=None, pp_stage_manager=None):
    pass
```
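For illustration, the mesh and stage manager would then be created once outside and passed in. A minimal usage sketch, where the import paths and constructor signatures of `ProcessGroupMesh`, `PipelineStageManager` and `ShardConfig` are assumptions based on this proposal, not a confirmed API:

```python
# Hypothetical usage sketch -- names, paths and signatures are assumptions.
from colossalai.cluster import ProcessGroupMesh                       # from #4038, path assumed
from colossalai.pipeline.stage_manager import PipelineStageManager    # path assumed
from colossalai.shardformer import ShardConfig, ShardFormer           # path assumed

pp_size, tp_size = 2, 2
pg_mesh = ProcessGroupMesh(pp_size, tp_size)                          # axis 0: pipeline, axis 1: tensor
stage_manager = PipelineStageManager(pg_mesh, pipeline_axis=0)

shardformer = ShardFormer(
    shard_config=ShardConfig(),
    pg_mesh=pg_mesh,
    tp_axis=1,
    pp_stage_manager=stage_manager,
)
```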
Model forward replacement

We can implement this by simply using …

Model initialization logic

```python
class Policy:
    def get_held_layers(self) -> List[Module]:
        """Get layers that should be held in the current stage. This method should be implemented by subclass.

        Returns:
            List[Module]: List of layers that should be held in the current stage.
        """
        raise NotImplementedError

    def get_shared_params(self) -> List[Dict[int, Tensor]]:
        """Get parameters that should be shared across stages. This method should be implemented by subclass.

        Returns:
            List[Dict[int, Tensor]]: List of parameters that should be shared across stages,
                e.g. [{0: module.model.embed_tokens.weight, 3: module.lm_head.weight}]
        """
        raise NotImplementedError

    def set_pipeline_stage_manager(self, pp_stage_manager):
        pass
```

And the class `Sharder`:
```python
def shard(self) -> List[Dict[int, Tensor]]:
    r"""
    Shard the model according to the policy.
    """
    self.policy.set_model(self.model)
    self.policy.set_shard_config(self.shard_config)
    self.policy.set_pipeline_stage_manager(self.pp_stage_manager)
    self._preprocess()
    self._release_unheld_layers()  # new
    self._replace_model_class()
    self._replace_module()
    self._materialize_model()  # new
    self._postprocess()
    return self.policy.get_shared_params()
```

This method should return the shared params list, which would be used in the pipeline schedule. The newly added …
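As a rough illustration of how that returned list might be consumed in the schedule (the mesh and stage-manager attributes used here are assumptions), the gradient of each tied weight could be all-reduced across the stages that hold a copy after backward:

```python
import torch.distributed as dist

def allreduce_shared_param_grads(shared_params, pg_mesh, pp_axis, stage_manager):
    """Hypothetical helper showing how the list returned by Sharder.shard() could be used.

    `shared_params` looks like [{0: embed_tokens.weight, 3: lm_head.weight}],
    i.e. a mapping from stage index to the local copy of a tied parameter.
    """
    for shared in shared_params:
        if stage_manager.stage not in shared:
            continue                      # this stage holds no copy of the tied weight
        param = shared[stage_manager.stage]
        # Assumed mesh API: get a group covering exactly the stages that hold a copy.
        group = pg_mesh.get_group_along_axis(pp_axis, list(shared.keys()))
        if param.grad is not None:
            dist.all_reduce(param.grad, group=group)
```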
-
I have put the tensor parallel process group in the …
-
The implementation of Policy, taking the BERT model as an example

During actual development, I ran into some bad cases and thus revised the mind map and the properties of the Policy subclasses: when the repeated layers can't be divided evenly across the stages, we need a new method for distributing them. So I added two simple features:
Now the subclass of `Policy` will look like this:

```python
class BertModelPolicy(Policy):
    def __init__(self, stage_manager: PipelineStageManager, num_layers: int, num_stages: int):
        self.stage_manager = stage_manager
        self.layers_per_stage = self.distribute_layers(num_layers, num_stages)

    def get_hold_layers(self, module: BertModel) -> List[Module]:
        ...

    def get_shared_params(self, module: BertModel) -> List[Dict[int, Tensor]]:
        ...

    def replace_forward(self, module: Module) -> None:
        ...

    def distribute_layers(self, num, stage_num) -> List[int]:
        ...
```

The policy needs `num_layers` and `num_stages` to initialize, and these come from the model config and the stage_manager(?). The following method may be added into the base class:

```python
def distribute_layers(self, num, stage_num) -> List[int]:
    ...
```
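A minimal sketch of what `distribute_layers` could return when the layers don't divide evenly; the distribution strategy itself is an assumption:

```python
from typing import List

def distribute_layers(self, num: int, stage_num: int) -> List[int]:
    """Illustrative only: give every stage the floor, then spread the remainder."""
    quotient, remainder = divmod(num, stage_num)
    layers_per_stage = [quotient] * stage_num
    # Assumption: the extra layers go to the earliest stages; other strategies
    # (e.g. favoring the middle stages) are equally valid.
    for i in range(remainder):
        layers_per_stage[i] += 1
    return layers_per_stage
```

For example, with 10 layers and 4 stages this would give `[3, 3, 2, 2]`.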
-
Motivation
The old pipeline parallelism is deeply coupled with the old `Engine` and trainer, which are no longer recommended. We should refactor pipeline parallelism to fit the new booster API.
Goal
To keep the implementation simple, we only focus on huggingface/transformers models. Note that this pipeline parallelism is not intended for LM generation.
Design
There are the following main components:
Pipeline stage manager
Pipeline stage manager relies on process group mesh (described in #4038). It manages pipeline stages and process groups.
Pseudo-code of class:
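A minimal sketch of what such a class might expose, assuming it is built on the `ProcessGroupMesh` from #4038; the mesh method names used here are assumptions:

```python
# Illustrative sketch only -- names and the ProcessGroupMesh methods used here are assumptions.
class PipelineStageManager:
    def __init__(self, pg_mesh, pipeline_axis: int):
        self.pg_mesh = pg_mesh
        self.pipeline_axis = pipeline_axis
        # This rank's coordinate along the pipeline axis is its stage index.
        self.stage = pg_mesh.coordinate(pipeline_axis)
        self.num_stages = pg_mesh.size(pipeline_axis)
        # Process group spanning all stages of this pipeline (for p2p / shared params).
        self.pp_group = pg_mesh.get_group_along_axis(pipeline_axis)

    def is_first_stage(self) -> bool:
        return self.stage == 0

    def is_last_stage(self) -> bool:
        return self.stage == self.num_stages - 1
```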
Pipeline parallel policy
The policy handles:
The policy should be applied to the top-level module, and it may have a method:
It receives a pipeline stage manager and determines which layers should be held. From this we get `hold_params` and `hold_buffers`; other params and buffers will be set to `None`. For LMs, there are a few shared parameters. A typical one is the tied embedding weight, which sets `lm_head.weight = input_embedding.weight`. This parameter is shared across the first stage and the last stage, and its gradient should be all-reduced after backward.

For the sake of simplicity, the new forward method should obey the rules below:
For memory efficiency, the model partition should be coupled with ShardFormer.
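To make the layer-holding step above concrete, here is a sketch of releasing everything a stage does not hold, so that only `hold_params` and `hold_buffers` survive; the function name and selection mechanism are assumptions:

```python
from torch import nn

def release_unheld_layers(model: nn.Module, held_layers: list) -> None:
    """Illustrative sketch: set parameters/buffers of unheld layers to None."""
    held = set()
    for layer in held_layers:
        held.update(id(p) for p in layer.parameters())
        held.update(id(b) for b in layer.buffers())

    # Walk every submodule and drop anything this stage does not hold.
    for module in model.modules():
        for name, param in list(module.named_parameters(recurse=False)):
            if id(param) not in held:
                setattr(module, name, None)   # not in hold_params -> set to None
        for name, buf in list(module.named_buffers(recurse=False)):
            if id(buf) not in held:
                setattr(module, name, None)   # not in hold_buffers -> set to None
```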
P2P communication
P2P communication encapsulates the basic communication methods of pipeline parallelism.
Pseudo-code:
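A minimal sketch of the communication wrapper, using blocking `send`/`recv` for brevity; the class name and the `prev_rank`/`next_rank` attributes are assumptions, and (as noted below) broadcast within a two-rank group is another possible backend:

```python
import torch
import torch.distributed as dist

class PipelineP2PCommunication:
    """Hypothetical sketch: point-to-point exchange of activations between adjacent stages."""

    def __init__(self, stage_manager):
        # `prev_rank` / `next_rank` are assumed to be provided by the stage manager.
        self.stage_manager = stage_manager

    def send_forward(self, tensor: torch.Tensor) -> None:
        # Blocking send of activations to the next stage.
        dist.send(tensor, dst=self.stage_manager.next_rank)

    def recv_forward(self, shape, dtype=torch.float32) -> torch.Tensor:
        # The receiver must know shape/dtype in advance, or exchange metadata first.
        buffer = torch.empty(shape, dtype=dtype, device="cuda")
        dist.recv(buffer, src=self.stage_manager.prev_rank)
        return buffer
```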
This can be implemented with various base communication methods, including isend/irecv, broadcast and RPC. We found broadcast to be robust and efficient. Nevertheless, this base class should stay compatible with all of these base communication methods.
Pipeline schedule
Most of the code can be reused from the old pipeline schedule.
Pseudo-code:
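A skeleton of what the schedule's interface might look like; the class name and argument list are assumptions, and only the return contract described below comes from this proposal:

```python
# Skeleton only -- structure and names are assumptions; the returned dict follows this proposal.
class OneForwardOneBackwardSchedule:
    def __init__(self, stage_manager, num_microbatches: int):
        self.stage_manager = stage_manager
        self.num_microbatches = num_microbatches

    def forward_backward_step(self, model, optimizer, data_iter, criterion,
                              return_loss: bool = True, return_outputs: bool = False) -> dict:
        accum_loss, outputs = None, []
        for _ in range(self.num_microbatches):
            # 1F1B scheduling, p2p communication and loss accumulation omitted here.
            ...
        return {"loss": accum_loss, "outputs": outputs if return_outputs else None}
```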
The `forward_backward_step()` method should return a dict with two keys: `"loss"` and `"outputs"`.
Pipeline plugin

It will compose all other core components. Its most important method is `execute_pipeline()`. Pseudo-code:
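A rough sketch of how the plugin could delegate to the schedule; the class name, argument order and composition are assumptions:

```python
# Sketch only -- names and internals are assumptions based on this proposal.
class PipelinePlugin:
    def __init__(self, schedule, stage_manager):
        self.schedule = schedule
        self.stage_manager = stage_manager

    def execute_pipeline(self, data_iter, model, criterion, optimizer,
                         return_loss: bool = True, return_outputs: bool = False) -> dict:
        # Delegate to the pipeline schedule; returns {"loss": ..., "outputs": ...}.
        return self.schedule.forward_backward_step(
            model, optimizer, data_iter, criterion, return_loss, return_outputs
        )
```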
Compatibility with other parallel methods

`no_sync()` and `sync_grads()` methods
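For instance, gradient synchronization with the data-parallel group might be deferred across the pipeline micro-batches roughly as follows; the calling pattern and the `plugin`, `data_iter`, `criterion` objects are assumptions, and only the two method names come from the list above:

```python
# Illustrative pattern only -- how no_sync()/sync_grads() might interact with the pipeline.
with plugin.no_sync(model):
    # Run all micro-batches without triggering the data-parallel all-reduce;
    # gradients simply accumulate locally on each rank.
    outputs = plugin.execute_pipeline(data_iter, model, criterion, optimizer)
plugin.sync_grads()      # one all-reduce of the accumulated gradients per optimizer step
optimizer.step()
optimizer.zero_grad()
```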