
self.proj in CausalSelfAttention too large if config.n_query_groups < config.n_head? #1890

Closed
mseeger opened this issue Dec 25, 2024 · 7 comments
Labels
question Further information is requested

Comments

@mseeger
Contributor

mseeger commented Dec 25, 2024

There are only config.n_query_groups V vectors. The shape of self.proj should really be (config.head_size * config.n_query_groups, config.n_embd). It should be smaller in the same sense that self.attn is smaller if there are fewer query groups than heads.

In the code, you expand the V matrix before multiplying with the linear map. This is equivalent to using a smaller weight matrix, but what is done right now is more expensive and needs more memory.

I could send a PR to fix this, but I am wondering about the compatibility with Hugging Face. Are they also doing this? Do we need to change import scripts for pre-trained models then?

@mseeger mseeger added the question Further information is requested label Dec 25, 2024
@Andrei-Aksionov
Collaborator

Hello @mseeger

I'm not entirely sure why the size of self.proj should be smaller when using GQA (Grouped Query Attention). The input shape to the projection layer remains unchanged. GQA primarily shares/reuses the keys and values during computation, but the attention output (the input to the projection) should still have size number of heads multiplied by the size of each head.

@mseeger
Contributor Author

mseeger commented Dec 26, 2024

Here is what I'd do:

self.proj = nn.Linear(config.head_size * config.n_query_groups, config.n_embd, bias=config.bias)

The output of the whole module still has the same shape.

And remove this one: v = v.expand(*q.shape).

scaled_dot_product_attention is fine with v having a shorter final dimension.
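
For reference, PyTorch's F.scaled_dot_product_attention does accept a value tensor whose last dimension differs from the query/key head size. A minimal sketch, with shapes picked purely for illustration:

import torch
import torch.nn.functional as F

B, nh, T, hs, hs_v = 2, 8, 16, 64, 32
q = torch.randn(B, nh, T, hs)
k = torch.randn(B, nh, T, hs)
v = torch.randn(B, nh, T, hs_v)  # last dimension (Ev) differs from q/k

y = F.scaled_dot_product_attention(q, k, v)
print(y.shape)  # torch.Size([2, 8, 16, 32])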

@Andrei-Aksionov
Collaborator

Still don't understand 🙂

If you don't expand V, then you will get a mismatch at dimension 1, i.e.:

  • attention scores of shape (B, nh, T, T)
  • V of shape (B, nh_v, T, hs)

Dropping all the prep steps, this is the core of SDPA (Python pseudo-code):

# attention scores: (B, nh, T, hs) @ (B, nh, hs, T) --> (B, nh, T, T)
attn_weight = query @ key.transpose(-2, -1) * scale_factor
attn_weight += attn_bias
attn_weight = torch.softmax(attn_weight, dim=-1)
attn_weight = torch.dropout(attn_weight, dropout_p, train=True)
# the final matmul needs V's head dimension to match (or broadcast with) nh:
# (B, nh, T, T) @ (B, nh_v, T, hs) fails when nh_v != nh
return attn_weight @ value
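
That last matmul is exactly where an unexpanded V trips up. A minimal reproduction, with shapes assumed purely for illustration (not litgpt's actual configuration):

import torch
import torch.nn.functional as F

B, nh, nh_v, T, hs = 2, 8, 2, 16, 64
q = torch.randn(B, nh, T, hs)
k = torch.randn(B, nh, T, hs)    # K expanded to nh heads
v = torch.randn(B, nh_v, T, hs)  # V left at n_query_groups heads

# The attention scores are (B, nh, T, T); (B, nh, T, T) @ (B, nh_v, T, hs)
# does not broadcast because nh != nh_v, so SDPA (without enable_gqa) errors out.
try:
    F.scaled_dot_product_attention(q, k, v)
except RuntimeError as e:
    print(e)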

From the PR that I still need to merge 🫠:

# ↓ (B, nh, T, hs) @ (B, nh, T, hs).mT --> (B, nh, T, T) @ (B, nh, T, hs) --> (B, nh, T, hs)
y = self.scaled_dot_product_attention(q, k, v, mask)

> scaled_dot_product_attention is fine with v having a shorter final dimension.

The only reason I can see is that the latest version of SDPA has an enable_gqa argument, which does the expansion.
But it defaults to False. Perhaps you enabled it?

Or I'm clearly missing something.


On a side note, I hadn't noticed that SDPA was updated in 2.5.
We need to relax the version constraint for PyTorch so that the latest one (2.5) gets installed, and let SDPA deal with GQA.

@mseeger
Contributor Author

mseeger commented Dec 26, 2024

OK, I see. You are right. It is still a bit weird.

@mseeger mseeger closed this as completed Dec 26, 2024
@mseeger
Contributor Author

mseeger commented Dec 30, 2024

https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch-nn-functional-scaled-dot-product-attention

Adding a comment here: If enable_gqa=True, you can have Hq > H. In your notation, H = n_query_groups, Hq = n_head. The final linear self.proj still has full size, though.

Not sure how important this is.

@mseeger mseeger reopened this Dec 30, 2024
@Andrei-Aksionov
Collaborator

Yes, if you set enable_gqa=True, then SDPA will expand K and V by itself.
It can be seen from the pseudocode in the docs:

...
if enable_gqa:
    key = key.repeat_interleave(query.size(-3)//key.size(-3), -3)
    value = value.repeat_interleave(query.size(-3)//value.size(-3), -3)
...

The benefit, as I understand it, is that the kernel can do the expansion during the computation, instead of expanding K and V first and then moving the larger tensors from HBM ("global" memory) to "shared" memory.
But I haven't noticed any speed-up.
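
For completeness, a minimal usage sketch of that path (enable_gqa requires PyTorch >= 2.5; shapes assumed purely for illustration). Note the output still has n_head heads, which is why self.proj keeps its full input size:

import torch
import torch.nn.functional as F

B, nh, n_groups, T, hs = 2, 8, 2, 16, 64
q = torch.randn(B, nh, T, hs)
k = torch.randn(B, n_groups, T, hs)  # K left at n_query_groups heads
v = torch.randn(B, n_groups, T, hs)  # V left at n_query_groups heads

y = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
print(y.shape)  # torch.Size([2, 8, 16, 64]) -- n_head heads in the output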

@mseeger
Contributor Author

mseeger commented Jan 6, 2025

Resolved

@mseeger mseeger closed this as completed Jan 6, 2025