why recompute can differ from window attention? #88

Open
habaohaba opened this issue Oct 11, 2024 · 2 comments

Comments

@habaohaba

I think recomputation just gives the same KV states that would have been saved when using window attention. So what is the difference between the recomputed and the cached version of the sliding window?
Or is it because, no matter what position embedding we use, the LLM just learns to assign a large attention value to the first position?

@darth-c0d3r

Let's say the total input is 1024 tokens and the KV-cache size is 512. When generating the next token, the recomputed representations completely drop the initial 512 tokens and are computed over only the most recent 512 tokens, as if those were the whole input. With a cached sliding window, by contrast, there is still a dependency on the earlier tokens. It works like this: the first 512 representations are trivial. For the 513th token, the 1st token's representations are dropped, but the remaining 511 tokens' cached representations were still calculated with the 1st token in context. The same applies to every subsequent token. Hope that's clear.
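
To make that dependency concrete, here is a minimal sketch (not this repository's code) using a toy single-head attention layer in PyTorch; the dimensions, random weights, and the omission of positional encodings are illustrative assumptions. It compares the hidden states of the surviving window positions when they were computed with the evicted tokens still in context (what a cached sliding window would have stored) versus when they are recomputed over the window alone.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, W, T = 16, 4, 6                      # hidden size, window size, total tokens
Wq, Wk, Wv = (torch.randn(D, D) / D**0.5 for _ in range(3))

def attn_layer(x):
    """One causal single-head self-attention layer over a sequence x of shape (len, D)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / D**0.5
    mask = torch.triu(torch.ones(len(x), len(x), dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(T, D)                   # toy token embeddings

# (a) Cached sliding window: the hidden states of the surviving W positions
#     were produced while the evicted tokens were still attendable.
h_full = attn_layer(x)
cached_window_states = h_full[-W:]

# (b) Sliding window with re-computation: the last W tokens are run through
#     the layer from scratch, as if they were the entire input.
recomputed_window_states = attn_layer(x[-W:])

# The two disagree: (a) still carries information about the evicted tokens
# through the cached states, while (b) does not.
print(torch.allclose(cached_window_states, recomputed_window_states))   # False
print((cached_window_states - recomputed_window_states).abs().max())
```

In a real multi-layer model this is exactly where the cache matters: the layer-l K/V entries of a kept position are projections of layer-(l-1) hidden states which, in the cached case, were computed while the now-evicted tokens were still in context.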

@FranciscoPark

I also have a question about this. Assume we use the sliding window with re-computation: shouldn't the perplexity also be similar to that of the regular (cached) sliding window, since, whether we recompute or not, the initial tokens are eventually evicted?
