Self-attention Chapter 3: What is the point of the value matrix? #454
EricThomson started this conversation in General · 1 comment
-
You raise a good question, and I think it is mostly for historical reasons. W_v is not strictly necessary. You might like the Simplifying Transformer Blocks paper (https://arxiv.org/abs/2311.01906), where they demonstrate that you can train a model without W_v. (It's in the references of one of the later chapters.)
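For anyone curious what that looks like mechanically, here is a minimal single-head sketch (my own illustration, not the paper's code) where the value projection is simply dropped, so the context vectors become weighted sums of the raw inputs:

```python
import torch

torch.manual_seed(123)

d_in = 4      # embedding dimension
seq_len = 6   # number of tokens
x = torch.randn(seq_len, d_in)

W_q = torch.nn.Parameter(torch.randn(d_in, d_in))
W_k = torch.nn.Parameter(torch.randn(d_in, d_in))
# No W_v: the inputs themselves act as the values
# (equivalent to fixing W_v to the identity matrix).

queries = x @ W_q
keys = x @ W_k

attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / d_in**0.5, dim=-1)

# Context vectors are now weighted sums of the raw input vectors
context = attn_weights @ x  # shape: (seq_len, d_in)
```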
-
I'm reading Chapter 3, which is great. I'm pretty clear on things up until "we now compute the context vector as a weighted sum over the value vectors." I'm confused about what $W_v$ is adding; it feels like it's throwing around superfluous free parameters.
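For concreteness, here's the step I'm talking about as a minimal single-head sketch (my own toy code, not the book's implementation; variable names are mine):

```python
import torch

torch.manual_seed(123)

d_in, d_out = 4, 4
seq_len = 6
x = torch.randn(seq_len, d_in)  # one embedding vector x_i per token

W_q = torch.nn.Parameter(torch.randn(d_in, d_out))
W_k = torch.nn.Parameter(torch.randn(d_in, d_out))
W_v = torch.nn.Parameter(torch.randn(d_in, d_out))  # the matrix in question

queries = x @ W_q
keys = x @ W_k
values = x @ W_v

attn_scores = queries @ keys.T                                  # (seq_len, seq_len)
attn_weights = torch.softmax(attn_scores / d_out**0.5, dim=-1)

# "the context vector as a weighted sum over the value vectors"
context = attn_weights @ values                                 # (seq_len, d_out)
```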
Given that the point of self-attention seems mainly to be using keys/queries to calculate attention scores, which allow long-range dependencies to resolve context, is the value matrix really necessary? It seems like the additional branch off to the right in the book's diagram is just sitting there, doing nothing obvious.
Note that I know you sort of have to include it: it's in the GPT model, which we are reproducing. My question is more: why?
I have been researching this a bit, and there is an amazing video from Rasa where they discuss how each embedding vector $x_i$ is basically used three times to produce the final context vector: first as input (query), second to calculate attention scores (key), and third when being combined with the attention weights to calculate the final context vector (value). In this video, which I consider a tour de force, he's basically like: "Screw it, while we're in here making this thing tunable, let's not use the raw vectors. Let's tune all the things!"
https://www.youtube.com/watch?v=tIvKXrEDMhk&t=359s
I think this was a really good point: rather than lock in one set of these weights as a constant, let's just let them all be tunable and see if it works. [Narrator: it worked out pretty well.]
So to answer the butter robot, the full final diagram showing the point of $W_v$ is Figure 3.17 from your book.
I think this is the answer.
That said, it would be interesting to run the model without $W_v$ and see how much it affects performance (I'm just starting Ch 3, so I plan to do this): e.g., what is the cost/performance tradeoff? Is it actually cheaper and faster, with negligible performance cost? If this were all being done with out-of-core computations on hard drives it wouldn't matter, but GPU time is expensive, so it sort of does matter, and (I find it) conceptually interesting.
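As a zeroth-order estimate of the savings, one can at least count the parameters and matmul FLOPs that $W_v$ accounts for. Here's a back-of-the-envelope sketch assuming GPT-2-small-like dimensions (illustrative numbers, not a benchmark):

```python
# Back-of-the-envelope cost of W_v, assuming GPT-2-small-like
# dimensions (illustrative, not a benchmark).
d_model = 768    # embedding dimension
n_layers = 12    # transformer blocks
seq_len = 1024   # context length

# One d_model x d_model value projection per attention layer
# (covering all heads, as in GPT-2):
wv_params = n_layers * d_model * d_model
print(f"W_v parameters across the model: {wv_params:,}")  # 7,077,888

# Each forward pass also spends one (seq_len, d_model) @ (d_model, d_model)
# matmul per layer on the value projection:
wv_flops_per_layer = 2 * seq_len * d_model * d_model
print(f"Value-projection FLOPs per layer: {wv_flops_per_layer:,}")  # ~1.2e9
```

Against GPT-2-small's ~124M parameters, that's roughly 6%: real savings, but not dramatic, which may be part of why the projection stuck around.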