Self-attention Chapter 3: What is the point of the value matrix? #454
EricThomson started this conversation in General · 1 comment
-
You raise a good question, and I think it is mostly for historical reasons. W_v is not strictly necessary. You might like the Simplifying Transformer Blocks paper (https://arxiv.org/abs/2311.01906), where they demonstrate that you can train a model without W_v. (It's in the references of one of the later chapters.)
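For anyone curious what that looks like mechanically, here is a minimal single-head sketch (my own illustration, not the paper's code) where the value projection is simply dropped, so the context vectors become weighted sums of the raw inputs:

```python
import torch

torch.manual_seed(123)

d_in = 4      # embedding dimension
seq_len = 6   # number of tokens
x = torch.randn(seq_len, d_in)

W_q = torch.nn.Parameter(torch.randn(d_in, d_in))
W_k = torch.nn.Parameter(torch.randn(d_in, d_in))
# No W_v: the inputs themselves act as the values
# (equivalent to fixing W_v to the identity matrix).

queries = x @ W_q
keys = x @ W_k

attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / d_in**0.5, dim=-1)

# Context vectors are now weighted sums of the raw input vectors
context = attn_weights @ x  # shape: (seq_len, d_in)
```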
-
I'm reading Chapter 3, which is great. I'm pretty clear on things up until "we now compute the context vector as a weighted sum over the value vectors." I'm confused about what $W_v$ is adding; it feels like it's throwing around superfluous free parameters.
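For concreteness, here's the step I'm talking about as a minimal single-head sketch (my own toy code, not the book's implementation; variable names are mine):

```python
import torch

torch.manual_seed(123)

d_in, d_out = 4, 4
seq_len = 6
x = torch.randn(seq_len, d_in)  # one embedding vector x_i per token

W_q = torch.nn.Parameter(torch.randn(d_in, d_out))
W_k = torch.nn.Parameter(torch.randn(d_in, d_out))
W_v = torch.nn.Parameter(torch.randn(d_in, d_out))  # the matrix in question

queries = x @ W_q
keys = x @ W_k
values = x @ W_v

attn_scores = queries @ keys.T                                  # (seq_len, seq_len)
attn_weights = torch.softmax(attn_scores / d_out**0.5, dim=-1)

# "the context vector as a weighted sum over the value vectors"
context = attn_weights @ values                                 # (seq_len, d_out)
```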
Given that the point of self-attention seems mainly to be using keys/queries to calculate attention scores, which allow long-range dependencies to resolve context, is the value matrix really necessary? It seems like the additional branch off to the right in the book's diagram is just sitting there, doing nothing obvious.
Note that I know you sort of have to include it: it's in the GPT model, which we are reproducing. My question is more: why?
I have been researching this a bit, and there is an amazing video from Rasa where they discuss how each embedding vector $x_i$ is basically used three times to produce the final context vector: first as input (query), second to calculate attention scores (key), and third when being combined with the attention weights to calculate the final context vector (value). In this video, which I consider a tour de force, he's basically like: "Screw it, while we're in here making this thing tunable, let's not use the raw vectors. Let's tune all the things!"
https://www.youtube.com/watch?v=tIvKXrEDMhk&t=359s
I think this was a really good point: rather than lock in one set of these weights as a constant, let's just let them all be tunable and see if it works. [Narrator: it worked out pretty well.]
So to answer the butter robot, the full final diagram showing the point of $W_v$ is Figure 3.17 from your book.
I think this is the answer.
That said, it would be interesting to run the model without $W_v$ and see how much it affects performance (I'm just starting Ch 3, so I plan to do this): e.g., what is the cost/performance tradeoff? Is it actually cheaper and faster, with negligible performance cost? If this were all being done with out-of-core computations on hard drives it wouldn't matter, but GPU time is expensive, so it sort of does matter, and (I find it) conceptually interesting.
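As a zeroth-order estimate of the savings, one can at least count the parameters and matmul FLOPs that $W_v$ accounts for. Here's a back-of-the-envelope sketch assuming GPT-2-small-like dimensions (illustrative numbers, not a benchmark):

```python
# Back-of-the-envelope cost of W_v, assuming GPT-2-small-like
# dimensions (illustrative, not a benchmark).
d_model = 768    # embedding dimension
n_layers = 12    # transformer blocks
seq_len = 1024   # context length

# One d_model x d_model value projection per attention layer
# (covering all heads, as in GPT-2):
wv_params = n_layers * d_model * d_model
print(f"W_v parameters across the model: {wv_params:,}")  # 7,077,888

# Each forward pass also spends one (seq_len, d_model) @ (d_model, d_model)
# matmul per layer on the value projection:
wv_flops_per_layer = 2 * seq_len * d_model * d_model
print(f"Value-projection FLOPs per layer: {wv_flops_per_layer:,}")  # ~1.2e9
```

Against GPT-2-small's ~124M parameters, that's roughly 6%: real savings, but not dramatic, which may be part of why the projection stuck around.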