I'm confused about the PPL of sliding window with re-computation #87

Open
coderwayne3025 opened this issue Oct 11, 2024 · 0 comments

This problem troubled me for a long time while I was reading this paper. Here is what I thought:

Two Key Steps in Sliding Window Re-computation

In sliding window re-computation, the key to deriving the time complexity lies in the following two steps:

1.	Recomputing the Key and Value.
2.	Attention calculation, which includes computing the similarity between Query and Key and taking a weighted sum over the Values.

1. Sliding Window Re-computation Process

Assumptions:

•	The sliding window size is L, meaning it can store up to L tokens.
•	The total number of generated tokens is T.
•	Goal: Generate a sequence of length T, and only consider the most recent L tokens each time a new token is generated.

2. Two Steps in Sliding Window Re-computation

a. Recomputing Key and Value

•	When generating a new token, the sliding window moves forward, so the set of tokens inside the window changes and their positional encodings are updated accordingly.
•	Because the positional encodings change, each token's Key and Value must be recomputed so that they correctly represent the token in the current context.
•	Complexity: there are L tokens in the sliding window whose Keys and Values need to be recomputed. Recomputing one token's Key and Value is a linear transformation, which costs O(1) per token when the dimensionality is treated as a constant.
•	Therefore, the complexity of recomputing the Keys and Values for all L tokens is O(L) (see the sketch below).
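To make this concrete for myself, here is a minimal NumPy sketch of step (a); the projection matrices `W_k`/`W_v` are illustrative placeholders (not the paper's actual weights), and I only count the linear transformations, ignoring the positional-encoding update itself:

```python
import numpy as np

d = 64   # Key/Query/Value dimensionality (treated as a constant below)
L = 8    # sliding window size

rng = np.random.default_rng(0)
W_k = rng.standard_normal((d, d))   # Key projection -- illustrative placeholder
W_v = rng.standard_normal((d, d))   # Value projection -- illustrative placeholder

def recompute_kv(window_states):
    """Recompute the Keys and Values of all L tokens in the window.

    In the real setting this is forced by the shifted positional encodings;
    here I only count the cost of the linear maps: O(1) per token for
    constant d, hence O(L) for the whole window.
    """
    K = window_states @ W_k.T   # (L, d)
    V = window_states @ W_v.T   # (L, d)
    return K, V

window_states = rng.standard_normal((L, d))   # hidden states of the L window tokens
K, V = recompute_kv(window_states)
```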

b. Attention Calculation

•	Calculating the similarity between Query and Key:
•	For each newly generated token, the model produces a Query (Q), which is compared against the Keys (K) of all L tokens in the sliding window.
•	Each similarity calculation costs O(d), where d is the dimensionality of the Key and Query, so comparing the Query against L Keys costs O(L·d). Ignoring the effect of dimensionality, this can be approximated as O(L).
•	Weighted Sum of Values:
•	After the attention scores are calculated, they are used to take a weighted sum over the Value (V) of each token in the sliding window.
•	The weighted summation also involves all L tokens, so its complexity is likewise O(L). Both pieces are sketched below.
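Continuing the same toy sketch for step (b); `softmax` here is just a plain helper I define, not a library call:

```python
def softmax(x):
    x = x - x.max()          # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

def attend(q, K, V):
    """Attention of a single Query over the L tokens in the window.

    scores:  L dot products of dimension d   -> O(L*d), i.e. O(L) for fixed d
    output:  weighted sum of L Value vectors -> O(L*d), i.e. O(L) for fixed d
    """
    scores = K @ q / np.sqrt(d)   # (L,) similarity of q with each Key
    weights = softmax(scores)     # attention weights over the window
    return weights @ V            # (d,) weighted sum of the Values
```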

c. Total Complexity for Generating a New Token

•	Recomputing Key and Value: O(L).
•	Attention Calculation (including similarity calculation and weighted sum): O(L).
•	Total Complexity: These two steps are performed sequentially, so the total complexity is:

O(L) + O(L) = O(2L)

For each newly generated token, the overall complexity is therefore O(L): the work scales linearly with the number of tokens in the sliding window. A step function combining the two pieces is sketched below.
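Putting (a) and (b) together, one decoding step would look roughly like this in the same sketch, with `new_hidden_state` standing in for whatever hidden state the model produces for the new token:

```python
def generate_step(window_states, new_hidden_state):
    """One decoding step under sliding-window re-computation.

    recompute_kv: O(L)  (L linear transformations)
    attend:       O(L)  (L similarities + one weighted sum)
    total:        O(L) + O(L) = O(2L) = O(L) per generated token
    """
    K, V = recompute_kv(window_states)     # step (a): rebuild Keys/Values
    out = attend(new_hidden_state, K, V)   # step (b); query projection omitted for brevity
    # slide the window: drop the oldest token, append the new one
    window_states = np.vstack([window_states[1:], new_hidden_state])
    return out, window_states
```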

3. Overall Time Complexity Calculation

•	We need to generate T tokens, and each time a new token is generated the model must recompute the Keys and Values and perform the attention calculation.
•	Each new token costs O(L) and we generate T tokens, so the overall time complexity is:

O(T) * O(L) = O(TL)
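And the full loop over T tokens, which is where my O(TL) comes from:

```python
T = 100   # total number of tokens to generate

for _ in range(T):
    new_hidden_state = rng.standard_normal(d)   # placeholder for the model's new hidden state
    out, window_states = generate_step(window_states, new_hidden_state)

# T iterations, each O(L) under this accounting, giving O(T * L) overall.
```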

So this is what I think. Maybe I have misunderstood something somewhere; I would really appreciate it if you could help me resolve this. Thanks!
