I'm trying to understand the following function, get_flops, which calculates FLOPs, but I still can't figure out why the attn term needs to be multiplied by the constant 60.
The paper Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM gives a formula for calculating FLOPs similar to get_flops (page 12). When activation checkpointing is not used, I think attn should be multiplied by 12, not 60, according to that formula. Of course, it is also possible that I have misunderstood.
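To make the discrepancy concrete, here is a minimal sketch of how I read the paper's attention term versus the constant used in the code. The function and parameter names below are my own, purely for illustration, and this is not the repo's actual get_flops:

```python
def attn_flops_paper(batch_size, seq_length, hidden_size, num_layers):
    # Per layer, the forward pass does QK^T (2*B*s^2*h FLOPs) and
    # attention-over-V (2*B*s^2*h FLOPs), i.e. 4*B*s^2*h per layer.
    # Backward is ~2x forward, so without activation checkpointing the
    # total factor is 3 * 4 = 12, matching 72*B*s*l*h^2 * (s / (6*h)).
    return 12 * batch_size * seq_length**2 * hidden_size * num_layers

def attn_flops_code(batch_size, seq_length, hidden_size, num_layers):
    # The same quadratic term, but with the constant 60 that get_flops
    # appears to use (as I read the code).
    return 60 * batch_size * seq_length**2 * hidden_size * num_layers

# Example with made-up sizes: the code's value is 5x what I expect.
print(attn_flops_paper(1, 2048, 2048, 24))
print(attn_flops_code(1, 2048, 2048, 24))
```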
Could you help me resolve this? Thanks!