I'm confused about the PPL of sliding window with re-computation #87

Open
coderwayne3025 opened this issue Oct 11, 2024 · 0 comments

This problem troubled me for a long time while I was reading this paper. Here is what I thought:

Two Key Steps in Sliding Window Re-computation

In sliding window re-computation, the key to deriving the time complexity lies in the following two steps:

1.	Recomputing the Key and Value.
2.	Attention calculation, which includes computing the similarity between Query and Key and taking a weighted sum over the Values.

1. Sliding Window Re-computation Process

Assumptions:

•	The sliding window size is L, meaning it can store up to L tokens.
•	The total number of generated tokens is T.
•	Goal: Generate a sequence of length T, and only consider the most recent L tokens each time a new token is generated.

2. Two Steps in Sliding Window Re-computation

a. Recomputing Key and Value

•	When generating a new token, the sliding window moves forward, so the set of tokens inside the window changes and their positional encodings are updated accordingly.
•	Because the positional encodings change, each token's Key and Value must be recomputed so that they correctly represent the token in the current context.
•	Complexity: there are L tokens in the sliding window whose Keys and Values need to be recomputed. Recomputing one token's Key and Value is a linear transformation, which costs O(1) per token when the dimensionality is treated as a constant.
•	Therefore, the complexity of recomputing the Keys and Values for all L tokens is O(L) (see the sketch below).
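To make this concrete for myself, here is a minimal NumPy sketch of step (a); the projection matrices `W_k`/`W_v` are illustrative placeholders (not the paper's actual weights), and I only count the linear transformations, ignoring the positional-encoding update itself:

```python
import numpy as np

d = 64   # Key/Query/Value dimensionality (treated as a constant below)
L = 8    # sliding window size

rng = np.random.default_rng(0)
W_k = rng.standard_normal((d, d))   # Key projection -- illustrative placeholder
W_v = rng.standard_normal((d, d))   # Value projection -- illustrative placeholder

def recompute_kv(window_states):
    """Recompute the Keys and Values of all L tokens in the window.

    In the real setting this is forced by the shifted positional encodings;
    here I only count the cost of the linear maps: O(1) per token for
    constant d, hence O(L) for the whole window.
    """
    K = window_states @ W_k.T   # (L, d)
    V = window_states @ W_v.T   # (L, d)
    return K, V

window_states = rng.standard_normal((L, d))   # hidden states of the L window tokens
K, V = recompute_kv(window_states)
```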

b. Attention Calculation

•	Calculating the similarity between Query and Key:
•	For each newly generated token, the model produces a Query (Q), which is compared against the Keys (K) of all L tokens in the sliding window.
•	Each similarity calculation costs O(d), where d is the dimensionality of the Key and Query, so comparing the Query against L Keys costs O(L·d). Ignoring the effect of dimensionality, this can be approximated as O(L).
•	Weighted Sum of Values:
•	After the attention scores are calculated, they are used to take a weighted sum over the Value (V) of each token in the sliding window.
•	The weighted summation also involves all L tokens, so its complexity is likewise O(L). Both pieces are sketched below.
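Continuing the same toy sketch for step (b); `softmax` here is just a plain helper I define, not a library call:

```python
def softmax(x):
    x = x - x.max()          # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

def attend(q, K, V):
    """Attention of a single Query over the L tokens in the window.

    scores:  L dot products of dimension d   -> O(L*d), i.e. O(L) for fixed d
    output:  weighted sum of L Value vectors -> O(L*d), i.e. O(L) for fixed d
    """
    scores = K @ q / np.sqrt(d)   # (L,) similarity of q with each Key
    weights = softmax(scores)     # attention weights over the window
    return weights @ V            # (d,) weighted sum of the Values
```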

c. Total Complexity for Generating a New Token

•	Recomputing Key and Value: O(L).
•	Attention Calculation (including similarity calculation and weighted sum): O(L).
•	Total Complexity: These two steps are performed sequentially, so the total complexity is:

O(L) + O(L) = O(2L)

For each newly generated token, the overall complexity is therefore O(L): the work scales linearly with the number of tokens in the sliding window. A step function combining the two pieces is sketched below.
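Putting (a) and (b) together, one decoding step would look roughly like this in the same sketch, with `new_hidden_state` standing in for whatever hidden state the model produces for the new token:

```python
def generate_step(window_states, new_hidden_state):
    """One decoding step under sliding-window re-computation.

    recompute_kv: O(L)  (L linear transformations)
    attend:       O(L)  (L similarities + one weighted sum)
    total:        O(L) + O(L) = O(2L) = O(L) per generated token
    """
    K, V = recompute_kv(window_states)     # step (a): rebuild Keys/Values
    out = attend(new_hidden_state, K, V)   # step (b); query projection omitted for brevity
    # slide the window: drop the oldest token, append the new one
    window_states = np.vstack([window_states[1:], new_hidden_state])
    return out, window_states
```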

3. Overall Time Complexity Calculation

•	We need to generate T tokens, and each time a new token is generated the model must recompute the Keys and Values and perform the attention calculation.
•	Each new token costs O(L) and we generate T tokens, so the overall time complexity is:

O(T) * O(L) = O(TL)
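And the full loop over T tokens, which is where my O(TL) comes from:

```python
T = 100   # total number of tokens to generate

for _ in range(T):
    new_hidden_state = rng.standard_normal(d)   # placeholder for the model's new hidden state
    out, window_states = generate_step(window_states, new_hidden_state)

# T iterations, each O(L) under this accounting, giving O(T * L) overall.
```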

So this is what I think. Maybe I have misunderstood something somewhere; I would really appreciate it if you could help me resolve this. Thanks!
