Difference between attention contrib ops #15325
-
What is the difference between the Attention and MultiHeadAttention contrib ops?
-
CC: @tianleiwu
-
For this case (there is no bias in the QKV projection and QKV can be packed), Attention needs to transpose BS3NH to BSN3H for the fp16 fused attention kernel in CUDA, while MultiHeadAttention's packed QKV input is already in BSN3H format, so MultiHeadAttention might be faster since it saves a transpose.
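To make the layout difference concrete, here is a minimal NumPy sketch of the transpose described above. The tensor sizes are made up for illustration; B, S, N, and H stand for batch size, sequence length, number of heads, and head size.

```python
import numpy as np

# Hypothetical sizes for illustration only.
batch, seq_len, num_heads, head_size = 2, 128, 12, 64

# Attention's packed QKV projection output is BS3NH:
# (batch, seq, 3, num_heads, head_size).
qkv_bs3nh = np.random.randn(batch, seq_len, 3, num_heads, head_size).astype(np.float16)

# The fp16 fused attention kernel expects BSN3H:
# (batch, seq, num_heads, 3, head_size), so Attention inserts this extra transpose.
qkv_bsn3h = qkv_bs3nh.transpose(0, 1, 3, 2, 4)

# MultiHeadAttention's packed QKV input is already BSN3H, so no transpose is needed.
print(qkv_bsn3h.shape)  # (2, 128, 12, 3, 64)
```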
Yes.
Attention only supports self-attention, whereas MultiHeadAttention supports both self-attention and cross-attention.
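To illustrate the distinction, here is a minimal single-head NumPy sketch (projection weights, masking, and multi-head splitting are omitted, and the sequence lengths are arbitrary). Self-attention draws Q, K, and V from the same sequence, which is the case Attention covers, while cross-attention takes K and V from a different sequence, which needs MultiHeadAttention's separate key/value inputs.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q: (seq_q, d), k and v: (seq_kv, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d = 64
decoder_hidden = np.random.randn(10, d)  # query sequence
encoder_hidden = np.random.randn(20, d)  # a different sequence (e.g. encoder output)

# Self-attention: Q, K, and V all come from the same sequence.
self_out = scaled_dot_product_attention(decoder_hidden, decoder_hidden, decoder_hidden)

# Cross-attention: Q comes from one sequence, K and V from another.
cross_out = scaled_dot_product_attention(decoder_hidden, encoder_hidden, encoder_hidden)

print(self_out.shape, cross_out.shape)  # (10, 64) (10, 64)
```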