When converting an HF Qwen2 checkpoint to Megatron, your conversion script reports the differences between the two models' layers, and there are a lot of mismatches.
The root cause of the mismatch seems to be the scaled dot product attention implementations: given the same inputs and settings, the MG and HF versions produce different outputs. Not completely different, but the gap is non-negligible.
Did you run into this as well? If so, are you aware of any degraded quality in such a converted Qwen model later on?
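For reference, here is a minimal standalone sketch of the kind of comparison I mean (my own example, not the conversion script itself). It contrasts PyTorch's fused `F.scaled_dot_product_attention` with an eager `softmax(QK^T / sqrt(d)) V` reference of the sort HF Transformers uses in eager mode; the tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def eager_sdpa(q, k, v, is_causal=True):
    # Naive attention: softmax(QK^T / sqrt(d)) V, computed step by step
    # in the input dtype, with an explicit causal mask.
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (d ** 0.5)
    if is_causal:
        t = q.size(-2)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=q.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.matmul(torch.softmax(scores, dim=-1), v)

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
# Hypothetical Qwen2-like head layout: (batch, heads, seq, head_dim)
q, k, v = (torch.randn(1, 28, 128, 128, dtype=torch.bfloat16, device=device) for _ in range(3))

out_fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # fused kernel path
out_eager = eager_sdpa(q, k, v)                                      # eager reference path

diff = (out_fused - out_eager).abs()
print(f"max abs diff:  {diff.max().item():.3e}")
print(f"mean abs diff: {diff.mean().item():.3e}")
```

As I understand it, a nonzero difference here is expected: fused kernels typically accumulate intermediates in fp32 while the eager path rounds to bf16 at every step, so per-element deviations around the dtype's resolution look like kernel-level numerics rather than a weight-conversion bug. Whether that fully accounts for the mismatches your script reports is exactly my question.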