
[QUESTION] How to split the Transformer layers when the pipeline is uneven? #1303

Open · renyinCheng001 opened this issue on Nov 27, 2024 · 1 comment


@renyinCheng001

Hi, All~

I am using pipeline model parallelism to train a GPT-3 13B model, which has 40 Transformer layers, on 16 GPUs. Obviously, the number of layers (40) is not divisible by the pipeline-model-parallel size (16), and a plain even split would leave only 2 Transformer layers on each GPU.

I noticed that the number of layers on the first and last pipeline stages can be configured by --decoder-first-pipeline-num-layers and --decoder-last-pipeline-num-layers, as shown in the following code:

def get_num_layers_to_build(config: TransformerConfig) -> int:
    """
    Determine the number of transformer layers to build for the current pipeline stage.
    Args:
        config (TransformerConfig): Configuration object containing transformer model parameters.

    Returns:
        int: The number of layers to be built for the current pipeline stage.
    """
    if config.first_pipeline_num_layers is not None or config.last_pipeline_num_layers is not None:
        assert (
            parallel_state.get_virtual_pipeline_model_parallel_world_size() is None
        ), "Uneven number of layer not compatible with interleaved pipeline schedule"

        # Number of layers to distribute over rest of pipeline stages
        layers_to_distribute = config.num_layers
        # Number of pipeline stages left for distributing transformer layers
        pipeline_stages_left = parallel_state.get_pipeline_model_parallel_world_size()

        if config.first_pipeline_num_layers is not None:
            layers_to_distribute -= config.first_pipeline_num_layers
            pipeline_stages_left -= 1
            if parallel_state.is_pipeline_first_stage():
                return config.first_pipeline_num_layers

        if config.last_pipeline_num_layers is not None:
            layers_to_distribute -= config.last_pipeline_num_layers
            pipeline_stages_left -= 1
            if parallel_state.is_pipeline_last_stage():
                return config.last_pipeline_num_layers

        assert (
            layers_to_distribute % pipeline_stages_left == 0
        ), "With uneven pipelineing the left over layers must be divisible by left over stages"
        num_layers_per_pipeline_rank = layers_to_distribute // pipeline_stages_left
    else:
        pipeline_ranks = config.pipeline_model_parallel_size
        num_layers_per_pipeline_rank = config.num_layers // pipeline_ranks

    ...  # (rest of the function omitted)

    return num_layers_to_build
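
Incidentally, the else branch above is where the "2 Transformer layers on each GPU" I mentioned comes from: with 40 layers and a pipeline-parallel size of 16, the plain floor division gives 2, which clearly cannot account for all 40 layers. A quick check (just the arithmetic, not Megatron-LM code):

num_layers = 40                      # GPT-3 13B decoder layers
pipeline_ranks = 16                  # pipeline-model-parallel size
num_layers_per_pipeline_rank = num_layers // pipeline_ranks
print(num_layers_per_pipeline_rank)                    # 2
print(num_layers_per_pipeline_rank * pipeline_ranks)   # 32 -> an even 2-per-rank split covers only 32 of the 40 layers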

I configured these two parameters as follows:

--decoder-first-pipeline-num-layers 6
--decoder-last-pipeline-num-layers 6

The number of model layers on different GPUs is as follows:

GPU0: 6 layers
GPU1-GPU14: 2 layers each
GPU15: 6 layers
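
For reference, this distribution matches what a minimal standalone sketch of the logic quoted above produces (illustrative only; layers_per_stage is my own helper, not a Megatron-LM function):

def layers_per_stage(num_layers, pp_size, first=None, last=None):
    """Return the number of Transformer layers each pipeline rank would build."""
    remaining_layers = num_layers - (first or 0) - (last or 0)
    remaining_stages = pp_size - (first is not None) - (last is not None)
    assert remaining_layers % remaining_stages == 0, \
        "left-over layers must be divisible by left-over stages"
    per_stage = remaining_layers // remaining_stages
    counts = []
    for rank in range(pp_size):
        if rank == 0 and first is not None:
            counts.append(first)
        elif rank == pp_size - 1 and last is not None:
            counts.append(last)
        else:
            counts.append(per_stage)
    return counts

print(layers_per_stage(40, 16, first=6, last=6))
# -> [6, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 6]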

However, this gives the first and last GPUs far more computation than the rest, which further amplifies the load imbalance across GPUs and increases the pipeline-parallel bubble during training.
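
To put a rough number on the imbalance (assuming, as a simplification, that a stage's compute time scales with the number of layers it holds):

counts = [6] + [2] * 14 + [6]       # per-stage layer counts from above
ideal = sum(counts) / len(counts)   # 2.5 layers/stage in a hypothetical perfectly even split
print(max(counts) / ideal)          # 2.4 -> the slowest stage does ~2.4x the ideal per-stage work
print(max(counts) / min(counts))    # 3.0 -> boundary stages hold 3x the layers of the middle stages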

When the number of model layers is not divisible by the PP size, is there a more even way to split the model?

@Baibaifan

+1
