Replies: 4 comments 7 replies
-
Hey @mcm007! I'm not sure exactly what you mean. Are you suggesting reducing the maximum KV cache allocated by PagedAttention?
-
Hi. I just came here from llama.cpp. Depending on how mistral.rs currently handles the KV cache (llama.cpp calls the KV cache the "context", I think), this is critical functionality. For example, I'm currently running deepseek-coder-v2 with an 8K context in llama.cpp. My setup couldn't possibly handle running it at full context, even with swap (nor would I want to use swap, since that would be extra wear and tear on expensive NVMe drives).
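For reference, this is roughly what capping the context looks like on the llama.cpp side, shown here through the llama-cpp-python bindings rather than the CLI; a minimal sketch, with a placeholder model filename that is not from this thread:

```python
# Minimal sketch: limit the KV cache ("context") to 8K tokens in llama.cpp
# via the llama-cpp-python bindings. The GGUF filename is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-coder-v2-lite-instruct-q4_k_m.gguf",  # hypothetical file
    n_ctx=8192,        # cap the context so the KV cache stays small
    n_gpu_layers=-1,   # offload as many layers as fit on the GPU
)

out = llm("Write a hello-world program in Rust.", max_tokens=128)
print(out["choices"][0]["text"])
```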
-
Yep, this feature is very important in my workflows. I can't load the Llama 3.1 70B model with full context on my GPUs. As it is right now, whenever a prompt comes in with a large context, mistral.rs crashes. I'm also seeing this happen when the prompt should be within the context length my VRAM can handle, since the same prompts work in vLLM/Aphrodite.
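For comparison, this is roughly how that trade-off is expressed in vLLM; a minimal sketch in which the model name, context cap, memory fraction, and GPU count are illustrative assumptions, not values from this thread:

```python
# Minimal sketch: vLLM lets you cap the context and the KV-cache memory
# budget explicitly, which is why the same prompts fit there.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model id
    max_model_len=8192,            # cap the context so the KV cache fits in VRAM
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM may reserve
    tensor_parallel_size=2,        # split across two GPUs (illustrative)
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```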
-
JIC, sharing my experience fighting out-of-memory errors caused by spikes during inference... Apparently there's no such param. E.g. with Phi 3.5 MoE (loading as Q4K), I found the config.json under '..\hub\models--microsoft--Phi-3.5-MoE-instruct\snapshots\ae6cb90aceffd86d1e3fba55c59ec62dfc88d4a1' and updated "max_position_embeddings" and "sliding_window" to 50000 (from 131...). Loading 20 layers dropped the VRAM consumption from 20GB to 16GB, and I no longer had the VRAM spikes leading to OOM.
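A minimal sketch of that config.json edit, assuming the Hugging Face cache layout implied by the path above; the snapshot directory is a placeholder and must point at your local copy:

```python
# Minimal sketch: patch the cached config.json so the model advertises a
# smaller context, which reduces the KV-cache memory reserved at load time.
import json
from pathlib import Path

# Placeholder path under the HF cache; replace <snapshot> with the real hash.
cfg_path = Path(r"..\hub\models--microsoft--Phi-3.5-MoE-instruct\snapshots\<snapshot>\config.json")

cfg = json.loads(cfg_path.read_text())

# Values taken from the comment above.
cfg["max_position_embeddings"] = 50000
cfg["sliding_window"] = 50000

cfg_path.write_text(json.dumps(cfg, indent=2))
```

Note that this only changes what the model reports about itself; it does not change the weights, so prompts longer than the patched limit may still degrade.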
-
Is it possible to run with a limited context size?
It would be useful for models with a huge context size.