
[QUESTION] How to split the Transformer layers when the pipeline is uneven? #1303

Open · renyinCheng001 opened this issue on Nov 27, 2024 · 1 comment


@renyinCheng001

Hi, All~

I am using pipeline model parallelism to train a GPT-3 13B model, which has 40 Transformer layers, on 16 GPUs. Obviously, the number of layers (40) is not divisible by the pipeline-model-parallel size (16), and a plain even split would leave only 2 Transformer layers on each GPU.

I noticed that the number of layers on the first and last pipeline stages can be configured by --decoder-first-pipeline-num-layers and --decoder-last-pipeline-num-layers, as shown in the following code:

def get_num_layers_to_build(config: TransformerConfig) -> int:
    """
    Determine the number of transformer layers to build for the current pipeline stage.
    Args:
        config (TransformerConfig): Configuration object containing transformer model parameters.

    Returns:
        int: The number of layers to be built for the current pipeline stage.
    """
    if config.first_pipeline_num_layers is not None or config.last_pipeline_num_layers is not None:
        assert (
            parallel_state.get_virtual_pipeline_model_parallel_world_size() is None
        ), "Uneven number of layer not compatible with interleaved pipeline schedule"

        # Number of layers to distribute over rest of pipeline stages
        layers_to_distribute = config.num_layers
        # Number of pipeline stages left for distributing transformer layers
        pipeline_stages_left = parallel_state.get_pipeline_model_parallel_world_size()

        if config.first_pipeline_num_layers is not None:
            layers_to_distribute -= config.first_pipeline_num_layers
            pipeline_stages_left -= 1
            if parallel_state.is_pipeline_first_stage():
                return config.first_pipeline_num_layers

        if config.last_pipeline_num_layers is not None:
            layers_to_distribute -= config.last_pipeline_num_layers
            pipeline_stages_left -= 1
            if parallel_state.is_pipeline_last_stage():
                return config.last_pipeline_num_layers

        assert (
            layers_to_distribute % pipeline_stages_left == 0
        ), "With uneven pipelineing the left over layers must be divisible by left over stages"
        num_layers_per_pipeline_rank = layers_to_distribute // pipeline_stages_left
    else:
        pipeline_ranks = config.pipeline_model_parallel_size
        num_layers_per_pipeline_rank = config.num_layers // pipeline_ranks

    ...  # (rest of the function omitted)

    return num_layers_to_build
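
Incidentally, the else branch above is where the "2 Transformer layers on each GPU" I mentioned comes from: with 40 layers and a pipeline-parallel size of 16, the plain floor division gives 2, which clearly cannot account for all 40 layers. A quick check (just the arithmetic, not Megatron-LM code):

num_layers = 40                      # GPT-3 13B decoder layers
pipeline_ranks = 16                  # pipeline-model-parallel size
num_layers_per_pipeline_rank = num_layers // pipeline_ranks
print(num_layers_per_pipeline_rank)                    # 2
print(num_layers_per_pipeline_rank * pipeline_ranks)   # 32 -> an even 2-per-rank split covers only 32 of the 40 layers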

I configured these two parameters as follows:

--decoder-first-pipeline-num-layers 6
--decoder-last-pipeline-num-layers 6

The number of model layers on different GPUs is as follows:

GPU0: 6 layers
GPU1-GPU14: 2 layers each
GPU15: 6 layers
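
For reference, this distribution matches what a minimal standalone sketch of the logic quoted above produces (illustrative only; layers_per_stage is my own helper, not a Megatron-LM function):

def layers_per_stage(num_layers, pp_size, first=None, last=None):
    """Return the number of Transformer layers each pipeline rank would build."""
    remaining_layers = num_layers - (first or 0) - (last or 0)
    remaining_stages = pp_size - (first is not None) - (last is not None)
    assert remaining_layers % remaining_stages == 0, \
        "left-over layers must be divisible by left-over stages"
    per_stage = remaining_layers // remaining_stages
    counts = []
    for rank in range(pp_size):
        if rank == 0 and first is not None:
            counts.append(first)
        elif rank == pp_size - 1 and last is not None:
            counts.append(last)
        else:
            counts.append(per_stage)
    return counts

print(layers_per_stage(40, 16, first=6, last=6))
# -> [6, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 6]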

However, this gives the first and last GPUs far more computation than the rest, which further amplifies the load imbalance across GPUs and increases the pipeline-parallel bubble during training.
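
To put a rough number on the imbalance (assuming, as a simplification, that a stage's compute time scales with the number of layers it holds):

counts = [6] + [2] * 14 + [6]       # per-stage layer counts from above
ideal = sum(counts) / len(counts)   # 2.5 layers/stage in a hypothetical perfectly even split
print(max(counts) / ideal)          # 2.4 -> the slowest stage does ~2.4x the ideal per-stage work
print(max(counts) / min(counts))    # 3.0 -> boundary stages hold 3x the layers of the middle stages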

When the number of model layers is not divisible by the PP size, is there a more even way to split the model?

@Baibaifan

+1
