This problem troubled me for a long time while I was reading this paper; here is what I thought:
Two Key Steps in Sliding Window Re-computation
In sliding window re-computation, the key to deriving the time complexity lies in the following two steps:
1. Recompute Key and Value.
2. Attention calculation: computing the similarity between the Query and the Keys, and taking a weighted sum over the Values.
Sliding Window Re-computation Process
Assumptions:
• The sliding window size is L, meaning it can store up to L tokens.
• The total number of generated tokens is T.
• Goal: Generate a sequence of length T, and only consider the most recent L tokens each time a new token is generated.
Two Steps in Sliding Window Re-computation
a. Recomputing Key and Value
• When generating a new token, the sliding window moves forward, so the set of tokens inside the window changes and their positional encodings must be updated accordingly.
• Due to changes in positional encodings, each token’s Key and Value must be recomputed to ensure they correctly represent information in the current context.
• Complexity: there are L tokens in the sliding window whose Keys and Values need to be recomputed; each token’s Key/Value is a single linear transformation, i.e. O(1) per token if the hidden dimension is treated as a constant.
• Therefore, the complexity of recomputing the Keys and Values for all L tokens is O(L) (a minimal sketch of this step follows this list).
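To make step (a) concrete, here is a minimal NumPy sketch of what I mean by "recomputing Key and Value"; the function and weight names (`recompute_window_kv`, `W_K`, `W_V`) are my own placeholders, not anything from the paper's code. It simply re-projects the L hidden states after the positional encodings have been updated:

```python
import numpy as np

def recompute_window_kv(window_hidden, W_K, W_V):
    """Recompute Keys and Values for the tokens currently in the window.

    window_hidden: (L, d) hidden states of the <= L tokens in the window
                   (assumed to already carry the updated positional encoding)
    W_K, W_V:      (d, d) projection weights -- placeholder names

    Each row is one linear transformation, so the step touches each of the
    L tokens exactly once: O(L) projections (O(L * d^2) if d is counted).
    """
    K = window_hidden @ W_K   # (L, d)
    V = window_hidden @ W_V   # (L, d)
    return K, V
```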
b. Attention Calculation
• Calculating the similarity between Query and Key:
• For each newly generated token, the model forms a Query (Q), and its similarity is computed against the Keys (K) of all L tokens in the sliding window.
• Each Query needs to be compared with each Key, with L tokens in total, and each similarity calculation has a complexity of O(d), where d is the dimensionality of the Key and Query.
• Thus, the complexity of the similarity calculation over L tokens is O(L·d); ignoring the dimensionality factor, it can be approximated as O(L).
• Weighted Sum of Values:
• After calculating attention scores, these scores are used to perform a weighted sum of each token’s Value (V) in the sliding window.
• The weighted summation also runs over the L tokens, so this part is likewise O(L) (see the attention sketch after this list).
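Here is the attention step as I understand it, again only as a sketch with placeholder names: one new Query against the L Keys, then a softmax-weighted sum over the L Values, which is why I count both parts as O(L) (or O(L·d) with the dimension):

```python
import numpy as np

def single_query_attention(q, K, V):
    """One new token's Query attending to the L cached Keys/Values.

    q:    (d,)   Query of the newly generated token
    K, V: (L, d) Keys and Values of the tokens in the sliding window

    The dot products q . K^T touch each of the L Keys once -> O(L * d),
    and the weighted sum over V is another O(L * d); treating d as a
    constant, both parts are O(L).
    """
    scores = K @ q / np.sqrt(q.shape[-1])    # (L,) similarities
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                       # (d,) weighted sum of the Values
```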
c. Total Complexity for Generating a New Token
• Recomputing Key and Value: O(L).
• Attention Calculation (including similarity calculation and weighted sum): O(L).
• Total Complexity: These two steps are performed sequentially, so the total complexity is:
O(L) + O(L) = O(2L) = O(L)
For each new token generated, the overall complexity is therefore O(L). Since this process touches every token inside the sliding window, the per-step cost grows linearly with the window size L.
Overall Time Complexity Calculation
• We need to generate T tokens, and each time a new token is generated, the model needs to complete recomputing Key and Value, as well as performing attention calculations.
• For each new token generation, the complexity is O(L), and since we need to generate T tokens, the overall time complexity is:
O(T) * O(L) = O(TL)
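Putting the whole accounting together, this is the toy loop I have in mind (illustrative only; the weights and hidden states are random placeholders, not a real model): every one of the T decoding steps does the two O(L) pieces above, which is how I arrive at O(T·L):

```python
import numpy as np

def toy_sliding_window_decode(T=16, L=4, d=8, seed=0):
    """Toy loop for the O(T * L) accounting above (illustration only)."""
    rng = np.random.default_rng(seed)
    W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
    hidden = [rng.normal(size=d)]           # toy hidden states seen so far

    for _ in range(T):                      # T decoding steps
        window = np.stack(hidden[-L:])      # most recent <= L tokens
        K = window @ W_K                    # step (a): recompute Keys,   O(L)
        V = window @ W_V                    # step (a): recompute Values, O(L)
        q = hidden[-1] @ W_Q                # Query of the newest token

        scores = K @ q / np.sqrt(d)         # step (b): similarities,     O(L)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        hidden.append(weights @ V)          # step (b): weighted sum,     O(L)

    return hidden                           # T + 1 toy vectors in total
```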
So this is what I think; maybe I misunderstood something somewhere. I would really appreciate it if you could help me resolve this. Thanks!