Sum instead of average for LayerNorm gradient all reduce
bclyang committed Aug 14, 2024
1 parent 651e24e commit 9a43318
Showing 1 changed file with 0 additions and 2 deletions.
2 changes: 0 additions & 2 deletions megatron/model/utils.py
@@ -370,8 +370,6 @@ def reduce_weight_grads_from_model_parallel_region(input_):

     # All-reduce.
     torch.distributed.all_reduce(input_, group=mpu.get_model_parallel_group())
-    # average grads
-    input_ = input_ / mpu.get_model_parallel_world_size()

     # Bf16 convert
     if dt == torch.bfloat16 and mpu.get_fp32_allreduce():
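The effect of this change can be illustrated without a distributed setup. For a LayerNorm parameter duplicated across model-parallel ranks, `torch.distributed.all_reduce` defaults to a SUM reduction; the deleted division by `mpu.get_model_parallel_world_size()` had been turning that sum into an average. A minimal sketch with hypothetical per-rank gradient values:

```python
# Simulate the gradient all-reduce across model-parallel ranks.
# Hypothetical gradients for a duplicated LayerNorm weight on two ranks.
rank_grads = [1.0, 3.0]
world_size = len(rank_grads)

# After this commit: all-reduce leaves the SUM of per-rank gradients.
summed = sum(rank_grads)

# Before this commit: the sum was additionally divided by the
# model-parallel world size, i.e. averaged.
averaged = summed / world_size

print(summed)    # 4.0
print(averaged)  # 2.0
```

Keeping the raw sum means each rank's gradient contribution is accumulated rather than averaged away, matching the reduction semantics used for the other duplicated-gradient paths.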
