Fix SequentialWrapper Generation (pipe_parallel_size = 0) #1031

xu-song · 2023-09-15T14:11:32Z

This PR fixes the following bug when generating with pipe_parallel_size = 0.

Bug Reproduction

Set pipe_parallel_size = 0, then run:

$ python ./deepy.py generate.py -d configs 125M.yml local_setup.yml text_generation.yml

The following error occurs when generating with pipe_parallel_size = 0:

  is_pipe_parallel ................ False.......................default
  pipe_parallel_size .............. 0...........................default

Traceback (most recent call last):
  File "generate.py", line 91, in <module>
    main()
  File "generate.py", line 73, in main
    generate_samples_interactive(
  File "/workspace/gpt-neox/gpt-neox-dev-latest/megatron/text_generation_utils.py", line 782, in generate_samples_interactive
    for (
  File "/workspace/gpt-neox/gpt-neox-dev-latest/megatron/text_generation_utils.py", line 319, in stream_tokens
    logits = forward_model(model, model_inputs, neox_args.is_pipe_parallel)
  File "/workspace/gpt-neox/gpt-neox-dev-latest/megatron/text_generation_utils.py", line 137, in forward_model
    return model.module(model_inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/gpt-neox/gpt-neox-dev-latest/megatron/model/utils.py", line 182, in forward
    x = func(forward_input)
  File "/workspace/gpt-neox/gpt-neox-dev-latest/megatron/model/utils.py", line 175, in exec_func
    inputs = layer(inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/gpt-neox/gpt-neox-dev-latest/megatron/model/transformer.py", line 916, in forward
    return super().forward(hidden_states, attention_mask), attention_mask
  File "/workspace/gpt-neox/gpt-neox-dev-latest/megatron/model/transformer.py", line 860, in forward
    attention_output, attention_bias = self.attention(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/gpt-neox/gpt-neox-dev-latest/megatron/model/transformer.py", line 688, in forward
    context_layer = self.attention(
  File "/workspace/gpt-neox/gpt-neox-dev-latest/megatron/model/transformer.py", line 451, in attention
    attention_probs = self.scale_mask_softmax(attention_scores, attention_mask)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/gpt-neox/gpt-neox-dev-latest/megatron/model/fused_softmax.py", line 146, in forward
    return self.forward_torch_softmax(input, mask)
  File "/workspace/gpt-neox/gpt-neox-dev-latest/megatron/model/fused_softmax.py", line 190, in forward_torch_softmax
    mask_output = self.mask_func(input, mask) if mask is not None else input
  File "/workspace/gpt-neox/gpt-neox-dev-latest/megatron/model/gpt2_model.py", line 52, in gpt2_attention_mask_func
    attention_scores.masked_fill_(ltor_mask, mask_value)
RuntimeError: The expanded size of the tensor (1) must match the existing size (4) at non-singleton dimension 2.  Target sizes: [1, 12, 1, 4].  Tensor sizes: [1, 1, 4, 4]

The above procedure is easy to reproduce.

Analysis

attention_scores.size = [1, 12, 1, 4]
attention_mask.size   = [1, 1, 4, 4] (wrong size)

The correct size of attention_mask is [1, 1, 1, 1] for generation_step > 1, so that it can broadcast against attention_scores, whose query dimension is 1 at that point.
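To make the shape argument concrete, here is a minimal sketch of the expand rule that the RuntimeError above reports. It uses plain Python rather than torch, and `can_expand_to` is a hypothetical helper written for illustration, not a function from gpt-neox or PyTorch.

```python
from typing import Sequence


def can_expand_to(mask_shape: Sequence[int], target_shape: Sequence[int]) -> bool:
    """Return True if a mask of mask_shape can be expanded to target_shape.

    Assumed rule (matches the error message in the traceback): after aligning
    trailing dimensions, each mask dimension must equal the target dimension
    or be 1. The target itself cannot grow because the fill is in-place.
    """
    if len(mask_shape) > len(target_shape):
        return False
    # Left-pad the mask shape with 1s so both shapes have the same rank.
    padded = (1,) * (len(target_shape) - len(mask_shape)) + tuple(mask_shape)
    return all(m == t or m == 1 for m, t in zip(padded, target_shape))


attention_scores_shape = (1, 12, 1, 4)

# Mask built for the full 4-token context: dimension 2 is 4, target is 1.
print(can_expand_to((1, 1, 4, 4), attention_scores_shape))  # False

# Mask built for the single new token: every dimension is 1.
print(can_expand_to((1, 1, 1, 1), attention_scores_shape))  # True
```

The first call fails at dimension 2, which is the same dimension named in the error message.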

Root Cause

The sequential generation path (SequentialWrapper) is missing a batch_fn, which leads to an attention_mask of the wrong size. In the current code, a batch_fn is only set when is_pipe_parallel is true:

gpt-neox/megatron/training.py

Lines 643 to 655 in c883e8c

    
    if neox_args.is_pipe_parallel:
        model.set_has_attention_mask(True)
        if neox_args.curriculum_learning:
            curr_scheduler = CurriculumScheduler(neox_args.curriculum_learning)
            if iteration is not None and iteration > 0:
                curr_scheduler.update_difficulty(iteration)
        else:
            curr_scheduler = None
        model.set_batch_fn(
            partial(
                get_batch_pipe, neox_args=neox_args, curr_scheduler=curr_scheduler
            )
        )

A similar implementation can be found in PipelineEngine (deepspeed/runtime/pipe/engine.py#L-578), which is used when pipe_parallel_size > 0.
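The idea of giving the sequential path its own batch_fn hook can be sketched as follows. This is an illustration under stated assumptions, not the diff in this PR: `SequentialSketch`, `rebuild_mask_shape`, and `identity_layer` are hypothetical names, and batches are plain dicts carrying a mask shape instead of real tensors.

```python
from functools import partial
from typing import Any, Callable, Dict, Iterable, Optional

Batch = Dict[str, Any]


class SequentialSketch:
    """Hypothetical stand-in for a sequential wrapper; not the gpt-neox class."""

    def __init__(self, layers: Iterable[Callable[[Batch], Batch]]) -> None:
        self.layers = list(layers)
        self.batch_fn: Optional[Callable[[Batch], Batch]] = None

    def set_batch_fn(self, fn: Callable[[Batch], Batch]) -> None:
        """Register a hook that preprocesses every batch before the layers run."""
        self.batch_fn = fn

    def forward(self, batch: Batch) -> Batch:
        # Apply the hook first, if one was registered.
        if self.batch_fn is not None:
            batch = self.batch_fn(batch)
        for layer in self.layers:
            batch = layer(batch)
        return batch


def rebuild_mask_shape(batch: Batch, batch_size: int) -> Batch:
    """Hypothetical batch_fn: size the mask from the tokens actually passed in."""
    seq_len = len(batch["tokens"])
    return {**batch, "mask_shape": (batch_size, 1, seq_len, seq_len)}


def identity_layer(batch: Batch) -> Batch:
    """Placeholder layer that passes the batch through unchanged."""
    return batch


# Generation step > 1: one new token, but a stale mask sized for 4 tokens.
step_input = {"tokens": [42], "mask_shape": (1, 1, 4, 4)}

model = SequentialSketch([identity_layer])
without_hook = model.forward(step_input)["mask_shape"]

model.set_batch_fn(partial(rebuild_mask_shape, batch_size=1))
with_hook = model.forward(step_input)["mask_shape"]

print(without_hook)  # (1, 1, 4, 4)
print(with_hook)     # (1, 1, 1, 1)
```

Without the hook the stale [1, 1, 4, 4] mask reaches the layers unchanged; with it, the mask is rebuilt to match the single token being processed.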

xu-song added 2 commits September 15, 2023 22:07

Fix SequentialGeneration

5673a2f

Fix SequentialGeneration

5098970

xu-song requested a review from a team as a code owner September 15, 2023 14:11

xu-song requested review from Quentin-Anthony and ShivanshuPurohit September 15, 2023 14:11

xu-song changed the title ~~Fix SequentialGeneration~~ Fix SequentialWrapper Generation (pipe_parallel_size = 0) Sep 16, 2023

Quentin-Anthony approved these changes Sep 18, 2023

View reviewed changes

Quentin-Anthony merged commit 70af6e8 into EleutherAI:main Sep 18, 2023
0 of 3 checks passed
