🧐 Problem Description
OLMoE applies RMSNorm to the query and key projections (QK-norm) for training stability, at the cost of roughly 10% of training throughput; see Figure 18 in the OLMoE paper for the ablation. We would like to implement the same for an apples-to-apples comparison.
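For concreteness, a minimal sketch of what QK-norm does inside attention. The `RMSNorm` module and all shapes below are illustrative, not taken from our codebase, and whether OLMoE normalizes per head or over the full projection width should be checked against their implementation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescales by 1/RMS(x); no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        var = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(var + self.eps)

# Illustrative shapes only: normalize q and k before the q·k dot product,
# which bounds attention logits and helps stability.
bsz, n_heads, seq, head_dim = 2, 8, 16, 64
q = torch.randn(bsz, n_heads, seq, head_dim)
k = torch.randn(bsz, n_heads, seq, head_dim)
q_norm, k_norm = RMSNorm(head_dim), RMSNorm(head_dim)
q, k = q_norm(q), k_norm(k)
```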
💡 Proposed Solution
Add a QK-norm option to the config in transformers/config.py, and, most likely, modify the forward pass in attention.py::forward.
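A hedged sketch of how the config flag and forward-pass change could fit together. `ModelConfig`, `qk_norm`, and the `Attention` class here are hypothetical stand-ins, not the actual names in transformers/config.py or attention.py; `nn.RMSNorm` requires PyTorch ≥ 2.4 (it is equivalent to the hand-rolled sketch above):

```python
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class ModelConfig:
    hidden_size: int = 512
    num_heads: int = 8
    qk_norm: bool = False  # hypothetical flag, off by default

class Attention(nn.Module):
    def __init__(self, cfg: ModelConfig):
        super().__init__()
        self.num_heads = cfg.num_heads
        self.head_dim = cfg.hidden_size // cfg.num_heads
        self.q_proj = nn.Linear(cfg.hidden_size, cfg.hidden_size, bias=False)
        self.k_proj = nn.Linear(cfg.hidden_size, cfg.hidden_size, bias=False)
        self.v_proj = nn.Linear(cfg.hidden_size, cfg.hidden_size, bias=False)
        self.o_proj = nn.Linear(cfg.hidden_size, cfg.hidden_size, bias=False)
        # Norms only exist when the flag is set, so the default path is unchanged.
        self.q_norm = nn.RMSNorm(cfg.hidden_size) if cfg.qk_norm else nn.Identity()
        self.k_norm = nn.RMSNorm(cfg.hidden_size) if cfg.qk_norm else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq, _ = x.shape
        # QK-norm: normalize the query/key projections before attention.
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        # Split heads: (bsz, seq, hidden) -> (bsz, heads, seq, head_dim).
        q = q.view(bsz, seq, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(bsz, seq, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(bsz, seq, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(bsz, seq, -1)
        return self.o_proj(out)

# Usage:
cfg = ModelConfig(qk_norm=True)
attn = Attention(cfg)
y = attn(torch.randn(2, 16, cfg.hidden_size))
```

Gating the norms behind the config flag keeps the default path byte-for-byte identical, which is exactly what we want for a controlled comparison.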
🔄 Alternatives Considered
Skip QK-norm. Matching OLMoE's QK-norm would require a custom implementation in HF; without it, we could go with the stock MixtralForCausalLM implementation. That might be a bit riskier in terms of training stability.
📈 Potential Benefits
Better training stability as indicated by the OLMoE ablations.