Hi, all~
I am using pipeline model parallelism to train a GPT3-13B model, which has 40 Transformer layers, on 16 GPUs. Obviously, the number of layers is not divisible by the pipeline-model-parallel size (40 / 16 = 2.5), so an even split is impossible and most GPUs can hold only 2 Transformer layers each.
I noticed that the number of layers on the first and last pipeline stages can be configured with --decoder-first-pipeline-num-layers and --decoder-last-pipeline-num-layers, as shown in the following code:
def get_num_layers_to_build(config: TransformerConfig) -> int:
    """
    Determine the number of transformer layers to build for the current pipeline stage.

    Args:
        config (TransformerConfig): Configuration object containing transformer model parameters.

    Returns:
        int: The number of layers to be built for the current pipeline stage.
    """
    if config.first_pipeline_num_layers is not None or config.last_pipeline_num_layers is not None:
        assert (
            parallel_state.get_virtual_pipeline_model_parallel_world_size() is None
        ), "Uneven number of layer not compatible with interleaved pipeline schedule"

        # Number of layers to distribute over rest of pipeline stages
        layers_to_distribute = config.num_layers
        # Number of pipeline stages left for distributing transformer layers
        pipeline_stages_left = parallel_state.get_pipeline_model_parallel_world_size()

        if config.first_pipeline_num_layers is not None:
            layers_to_distribute -= config.first_pipeline_num_layers
            pipeline_stages_left -= 1
            if parallel_state.is_pipeline_first_stage():
                return config.first_pipeline_num_layers

        if config.last_pipeline_num_layers is not None:
            layers_to_distribute -= config.last_pipeline_num_layers
            pipeline_stages_left -= 1
            if parallel_state.is_pipeline_last_stage():
                return config.last_pipeline_num_layers

        assert (
            layers_to_distribute % pipeline_stages_left == 0
        ), "With uneven pipelineing the left over layers must be divisible by left over stages"
        num_layers_per_pipeline_rank = layers_to_distribute // pipeline_stages_left
    else:
        pipeline_ranks = config.pipeline_model_parallel_size
        num_layers_per_pipeline_rank = config.num_layers // pipeline_ranks

    ...
    return num_layers_to_build
I configured --decoder-first-pipeline-num-layers 6 and --decoder-last-pipeline-num-layers 6, and the resulting number of model layers on each GPU is:
GPU0: 6 layers
GPU1-14: 2 layers each
GPU15: 6 layers
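These numbers follow directly from the arithmetic in get_num_layers_to_build above; a hand trace with my settings (not library code):

# Hand trace of the logic above with my settings
num_layers, pp_size = 40, 16
first, last = 6, 6  # --decoder-first/last-pipeline-num-layers

layers_to_distribute = num_layers - first - last   # 40 - 6 - 6 = 28
pipeline_stages_left = pp_size - 2                 # 16 - 2 = 14
assert layers_to_distribute % pipeline_stages_left == 0
per_middle_stage = layers_to_distribute // pipeline_stages_left  # 28 // 14 = 2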
However, this gives the first and last GPUs much more computation than the rest, which amplifies the load imbalance across GPUs and enlarges the pipeline bubble of PP training.
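A back-of-envelope check, assuming per-layer forward/backward time is roughly uniform (an assumption, not a measurement):

layers = [6] + [2] * 14 + [6]      # per-stage layer counts from above
slowest = max(layers)              # the 6-layer boundary stages set the pace
busy_fraction = [l / slowest for l in layers]
# middle stages: 2/6 = 0.33, i.e. they sit idle roughly two thirds
# of steady-state pipeline time waiting on the boundary stages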
When the number of model layers is not divisible by the PP size, is there a more even way to partition the model?
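For example, the kind of split I have in mind would spread the remainder one layer at a time across stages instead of piling it onto the two boundary stages. A minimal sketch (balanced_layer_split is a hypothetical helper for illustration, not an existing Megatron-LM option):

def balanced_layer_split(num_layers: int, pp_size: int) -> list[int]:
    # Give the first `rem` stages one extra layer; every other stage gets `base`.
    base, rem = divmod(num_layers, pp_size)
    return [base + 1 if rank < rem else base for rank in range(pp_size)]

print(balanced_layer_split(40, 16))
# [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2]
# The worst stage now has 3 layers instead of 6.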