🧐 Problem Description
OLMoE applies RMSNorm to the query and key projections (QK-norm) for training stability, at the cost of roughly 10% of training throughput; see Figure 18 in the OLMoE paper for the ablation. We would like to implement the same for an apples-to-apples comparison.
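For concreteness, a minimal sketch of what QK-norm does inside attention. The `RMSNorm` module and all shapes below are illustrative, not taken from our codebase, and whether OLMoE normalizes per head or over the full projection width should be checked against their implementation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescales by 1/RMS(x); no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        var = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(var + self.eps)

# Illustrative shapes only: normalize q and k before the q·k dot product,
# which bounds attention logits and helps stability.
bsz, n_heads, seq, head_dim = 2, 8, 16, 64
q = torch.randn(bsz, n_heads, seq, head_dim)
k = torch.randn(bsz, n_heads, seq, head_dim)
q_norm, k_norm = RMSNorm(head_dim), RMSNorm(head_dim)
q, k = q_norm(q), k_norm(k)
```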
💡 Proposed Solution
Add a QK-norm option to the config in transformers/config.py, and, most likely, modify the forward pass in attention.py::forward.
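A hedged sketch of how the config flag and forward-pass change could fit together. `ModelConfig`, `qk_norm`, and the `Attention` class here are hypothetical stand-ins, not the actual names in transformers/config.py or attention.py; `nn.RMSNorm` requires PyTorch ≥ 2.4 (it is equivalent to the hand-rolled sketch above):

```python
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class ModelConfig:
    hidden_size: int = 512
    num_heads: int = 8
    qk_norm: bool = False  # hypothetical flag, off by default

class Attention(nn.Module):
    def __init__(self, cfg: ModelConfig):
        super().__init__()
        self.num_heads = cfg.num_heads
        self.head_dim = cfg.hidden_size // cfg.num_heads
        self.q_proj = nn.Linear(cfg.hidden_size, cfg.hidden_size, bias=False)
        self.k_proj = nn.Linear(cfg.hidden_size, cfg.hidden_size, bias=False)
        self.v_proj = nn.Linear(cfg.hidden_size, cfg.hidden_size, bias=False)
        self.o_proj = nn.Linear(cfg.hidden_size, cfg.hidden_size, bias=False)
        # Norms only exist when the flag is set, so the default path is unchanged.
        self.q_norm = nn.RMSNorm(cfg.hidden_size) if cfg.qk_norm else nn.Identity()
        self.k_norm = nn.RMSNorm(cfg.hidden_size) if cfg.qk_norm else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq, _ = x.shape
        # QK-norm: normalize the query/key projections before attention.
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        # Split heads: (bsz, seq, hidden) -> (bsz, heads, seq, head_dim).
        q = q.view(bsz, seq, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(bsz, seq, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(bsz, seq, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(bsz, seq, -1)
        return self.o_proj(out)

# Usage:
cfg = ModelConfig(qk_norm=True)
attn = Attention(cfg)
y = attn(torch.randn(2, 16, cfg.hidden_size))
```

Gating the norms behind the config flag keeps the default path byte-for-byte identical, which is exactly what we want for a controlled comparison.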
🔄 Alternatives Considered
Skip QK-norm. Matching OLMoE's QK-norm would require a custom implementation in HF; without it, we could go with the stock MixtralForCausalLM implementation. That might be a bit riskier in terms of training stability.
📈 Potential Benefits
Better training stability as indicated by the OLMoE ablations.