I'm trying to understand the following function, get_flops, which calculates FLOPs, but I still can't figure out why the attn term needs to be multiplied by the constant 60.
The paper Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM gives a formula for calculating FLOPs similar to get_flops (page 12). When activation checkpointing is not used, I think attn should be multiplied by 12, not 60, according to that formula. Of course, it is also possible that I have misunderstood.
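To make the discrepancy concrete, here is a minimal sketch of how I read the paper's attention term versus the constant used in the code. The function and parameter names below are my own, purely for illustration, and this is not the repo's actual get_flops:

```python
def attn_flops_paper(batch_size, seq_length, hidden_size, num_layers):
    # Per layer, the forward pass does QK^T (2*B*s^2*h FLOPs) and
    # attention-over-V (2*B*s^2*h FLOPs), i.e. 4*B*s^2*h per layer.
    # Backward is ~2x forward, so without activation checkpointing the
    # total factor is 3 * 4 = 12, matching 72*B*s*l*h^2 * (s / (6*h)).
    return 12 * batch_size * seq_length**2 * hidden_size * num_layers

def attn_flops_code(batch_size, seq_length, hidden_size, num_layers):
    # The same quadratic term, but with the constant 60 that get_flops
    # appears to use (as I read the code).
    return 60 * batch_size * seq_length**2 * hidden_size * num_layers

# Example with made-up sizes: the code's value is 5x what I expect.
print(attn_flops_paper(1, 2048, 2048, 24))
print(attn_flops_code(1, 2048, 2048, 24))
```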
Could you help me resolve this? Thanks!