Hi,
Do you have any experiment results for attention sinks in the non-pre-training case? From what I read, all the results shown in the paper are from pre-training with attention sinks.
Additionally, did you ever test smaller cache sizes, such as 128? If I understood correctly, the model should not break with smaller cache sizes?
We did not pre-train LLMs in most experiments; only Section 4.2 includes pre-training experiments. You can use StreamingLLM with off-the-shelf Llama models, just as in our demo.
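For reference, the rolling KV cache from Section 3.2 amounts to an eviction policy: keep the first few (attention sink) tokens plus a recent window, and drop everything in between. Below is a minimal sketch of that policy with a hypothetical `evict_kv_cache` helper (not the repo's API), assuming the legacy Hugging Face `past_key_values` format of per-layer `(key, value)` tensors shaped `[batch, num_heads, seq_len, head_dim]`; a total cache budget of 128 would correspond to, e.g., `start_size=4` and `recent_size=124`.

```python
import torch


def evict_kv_cache(past_key_values, start_size=4, recent_size=124):
    """Keep the first `start_size` (attention sink) tokens and the most
    recent `recent_size` tokens, dropping the middle of the cache.

    Assumes `past_key_values` is a tuple of per-layer (key, value)
    tensors shaped [batch, num_heads, seq_len, head_dim], as returned
    by Hugging Face Llama models when use_cache=True (legacy format).
    """
    if past_key_values is None:
        return None

    seq_len = past_key_values[0][0].size(2)
    if seq_len <= start_size + recent_size:
        # Cache still fits within the budget; nothing to evict.
        return past_key_values

    return tuple(
        (
            torch.cat([k[:, :, :start_size], k[:, :, seq_len - recent_size:]], dim=2),
            torch.cat([v[:, :, :start_size], v[:, :, seq_len - recent_size:]], dim=2),
        )
        for k, v in past_key_values
    )
```

Note that this sketch only illustrates the eviction policy; StreamingLLM also assigns positions within the rolled cache rather than in the original text, so a faithful implementation additionally adjusts the rotary position embeddings of the cached keys.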