Difference between attention contrib ops #15325
-
What is the difference between the Attention and MultiHeadAttention contrib ops?
-
CC: @tianleiwu
-
For this case (there is no bias in the QKV projection and QKV can be packed), Attention needs to transpose BS3NH to BSN3H for the fp16 fused attention kernel in CUDA, while MultiHeadAttention's packed QKV input is already in BSN3H format, so MultiHeadAttention might be faster since it saves a transpose.
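To make the layout difference concrete, here is a minimal NumPy sketch of the transpose described above. The tensor sizes are made up for illustration; B, S, N, and H stand for batch size, sequence length, number of heads, and head size.

```python
import numpy as np

# Hypothetical sizes for illustration only.
batch, seq_len, num_heads, head_size = 2, 128, 12, 64

# Attention's packed QKV projection output is BS3NH:
# (batch, seq, 3, num_heads, head_size).
qkv_bs3nh = np.random.randn(batch, seq_len, 3, num_heads, head_size).astype(np.float16)

# The fp16 fused attention kernel expects BSN3H:
# (batch, seq, num_heads, 3, head_size), so Attention inserts this extra transpose.
qkv_bsn3h = qkv_bs3nh.transpose(0, 1, 3, 2, 4)

# MultiHeadAttention's packed QKV input is already BSN3H, so no transpose is needed.
print(qkv_bsn3h.shape)  # (2, 128, 12, 3, 64)
```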
Yes.
Attention only supports self-attention, whereas MultiHeadAttention supports both self-attention and cross-attention.
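To illustrate the distinction, here is a minimal single-head NumPy sketch (projection weights, masking, and multi-head splitting are omitted, and the sequence lengths are arbitrary). Self-attention draws Q, K, and V from the same sequence, which is the case Attention covers, while cross-attention takes K and V from a different sequence, which needs MultiHeadAttention's separate key/value inputs.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q: (seq_q, d), k and v: (seq_kv, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d = 64
decoder_hidden = np.random.randn(10, d)  # query sequence
encoder_hidden = np.random.randn(20, d)  # a different sequence (e.g. encoder output)

# Self-attention: Q, K, and V all come from the same sequence.
self_out = scaled_dot_product_attention(decoder_hidden, decoder_hidden, decoder_hidden)

# Cross-attention: Q comes from one sequence, K and V from another.
cross_out = scaled_dot_product_attention(decoder_hidden, encoder_hidden, encoder_hidden)

print(self_out.shape, cross_out.shape)  # (10, 64) (10, 64)
```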