There's a bug in the GPT-NeoX flops calculation here:

`gpt-neox/megatron/logging.py`, line 104 (commit `a2b2020`)
The term proportional to the vocab_size, `flops_calc2`, should share all the same prefactors as `flops_calc_1`. See page 12 of arXiv:2104.04473.

Since this term scales inversely with both hidden_dim and num_layers, it is more significant for small models and less important for large models: for Pythia-70m the error is roughly $50257 / (16 \times 6 \times 512) \approx 102\%$, while for GPT-NeoX-20B it is only $50257 / (16 \times 44 \times 6144) \approx 1.2\%$.
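For reference, here is a minimal sketch of the per-iteration FLOPs formula as written on page 12 of arXiv:2104.04473, in which the vocab term sits inside the same prefactor as the attention term. The function and variable names below are illustrative only and are not the ones used in `megatron/logging.py`:

```python
def transformer_flops_per_iteration(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    hidden_size: int,
    vocab_size: int,
    checkpoint_activations: bool = True,
) -> float:
    """Approximate training FLOPs per iteration for a GPT-style model,
    following the formula on page 12 of arXiv:2104.04473."""
    # 4 matmul passes (fwd + recompute + 2x bwd) with activation checkpointing,
    # 3 passes (fwd + 2x bwd) without; 4 * 24 = 96 recovers the paper's prefactor.
    passes = 4 if checkpoint_activations else 3
    common = passes * 24 * batch_size * seq_len * num_layers * hidden_size**2
    # Both correction terms multiply the same prefactor `common`:
    #   attention term:        s / (6h)
    #   logit/embedding term:  V / (16 l h)
    return common * (
        1.0
        + seq_len / (6.0 * hidden_size)
        + vocab_size / (16.0 * num_layers * hidden_size)
    )


# Relative size of the vocab term for the two models mentioned above:
#   Pythia-70m:   50257 / (16 *  6 *  512) ~ 1.02  (~102%)
#   GPT-NeoX-20B: 50257 / (16 * 44 * 6144) ~ 0.012 (~1.2%)
```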
This bug seems to have been introduced only ~3 months ago in #1044, so it may not have had an impact on, e.g., any tests done while training Pythia.