Hi,
Do you have any experiment results for attention sinks in the non-pre-training case? From what I read, all the results shown in the paper are from pre-training with attention sinks.
Additionally, did you ever test smaller cache sizes, such as 128? If I understood correctly, the model should not break with smaller cache sizes?
We did not pre-train LLMs in most experiments; only Section 4.2 includes pre-training experiments. You can use StreamingLLM with off-the-shelf Llama models, just as in our demo.
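For reference, the rolling KV cache from Section 3.2 amounts to an eviction policy: keep the first few (attention sink) tokens plus a recent window, and drop everything in between. Below is a minimal sketch of that policy with a hypothetical `evict_kv_cache` helper (not the repo's API), assuming the legacy Hugging Face `past_key_values` format of per-layer `(key, value)` tensors shaped `[batch, num_heads, seq_len, head_dim]`; a total cache budget of 128 would correspond to, e.g., `start_size=4` and `recent_size=124`.

```python
import torch


def evict_kv_cache(past_key_values, start_size=4, recent_size=124):
    """Keep the first `start_size` (attention sink) tokens and the most
    recent `recent_size` tokens, dropping the middle of the cache.

    Assumes `past_key_values` is a tuple of per-layer (key, value)
    tensors shaped [batch, num_heads, seq_len, head_dim], as returned
    by Hugging Face Llama models when use_cache=True (legacy format).
    """
    if past_key_values is None:
        return None

    seq_len = past_key_values[0][0].size(2)
    if seq_len <= start_size + recent_size:
        # Cache still fits within the budget; nothing to evict.
        return past_key_values

    return tuple(
        (
            torch.cat([k[:, :, :start_size], k[:, :, seq_len - recent_size:]], dim=2),
            torch.cat([v[:, :, :start_size], v[:, :, seq_len - recent_size:]], dim=2),
        )
        for k, v in past_key_values
    )
```

Note that this sketch only illustrates the eviction policy; StreamingLLM also assigns positions within the rolled cache rather than in the original text, so a faithful implementation additionally adjusts the rotary position embeddings of the cached keys.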