why recompute can differ from window attention? #88

Open
habaohaba opened this issue Oct 11, 2024 · 2 comments

Comments

@habaohaba

I think recomputation just gives the same KV states that would have been saved when using window attention. So what is the difference between the recomputed and the cached version of the sliding window?
Or is it because, no matter what position embedding we use, the LLM just learns to assign a large attention value to the first position?

@darth-c0d3r

Let's say the total input is 1024 tokens and the KV-cache size is 512. When generating the next token, the recomputed representations completely drop the initial 512 tokens and are computed over only the most recent 512 tokens, as if those were the whole input. With a cached sliding window, by contrast, there is still a dependency on the earlier tokens. It works like this: the first 512 representations are trivial. For the 513th token, the 1st token's representations are dropped, but the remaining 511 tokens' cached representations were still calculated with the 1st token in context. The same applies to every subsequent token. Hope that's clear.
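
To make that dependency concrete, here is a minimal sketch (not this repository's code) using a toy single-head attention layer in PyTorch; the dimensions, random weights, and the omission of positional encodings are illustrative assumptions. It compares the hidden states of the surviving window positions when they were computed with the evicted tokens still in context (what a cached sliding window would have stored) versus when they are recomputed over the window alone.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, W, T = 16, 4, 6                      # hidden size, window size, total tokens
Wq, Wk, Wv = (torch.randn(D, D) / D**0.5 for _ in range(3))

def attn_layer(x):
    """One causal single-head self-attention layer over a sequence x of shape (len, D)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / D**0.5
    mask = torch.triu(torch.ones(len(x), len(x), dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(T, D)                   # toy token embeddings

# (a) Cached sliding window: the hidden states of the surviving W positions
#     were produced while the evicted tokens were still attendable.
h_full = attn_layer(x)
cached_window_states = h_full[-W:]

# (b) Sliding window with re-computation: the last W tokens are run through
#     the layer from scratch, as if they were the entire input.
recomputed_window_states = attn_layer(x[-W:])

# The two disagree: (a) still carries information about the evicted tokens
# through the cached states, while (b) does not.
print(torch.allclose(cached_window_states, recomputed_window_states))   # False
print((cached_window_states - recomputed_window_states).abs().max())
```

In a real multi-layer model this is exactly where the cache matters: the layer-l K/V entries of a kept position are projections of layer-(l-1) hidden states which, in the cached case, were computed while the now-evicted tokens were still in context.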

@FranciscoPark

I also have a question about this. Assume we use the sliding window with re-computation: shouldn't the perplexity also be similar to that of the regular (cached) sliding window, since, whether we recompute or not, the initial tokens are eventually evicted?
