Replies: 4 comments 7 replies
-
Hey @mcm007! I'm not sure exactly what you mean. Are you suggesting reducing the maximum KV cache allocated by PagedAttention?
-
Hi. I just came here from llama.cpp. Depending on how mistral.rs currently handles the KV cache (llama.cpp calls the KV cache the "context", I think), this is critical functionality. For example, I'm currently running deepseek-coder-v2 with an 8K context in llama.cpp. My setup couldn't possibly handle running it at full context, even with swap (nor would I want to use swap, since that would be extra wear and tear on expensive NVMe drives).
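For reference, this is roughly what capping the context looks like on the llama.cpp side, shown here through the llama-cpp-python bindings rather than the CLI; a minimal sketch, with a placeholder model filename that is not from this thread:

```python
# Minimal sketch: limit the KV cache ("context") to 8K tokens in llama.cpp
# via the llama-cpp-python bindings. The GGUF filename is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-coder-v2-lite-instruct-q4_k_m.gguf",  # hypothetical file
    n_ctx=8192,        # cap the context so the KV cache stays small
    n_gpu_layers=-1,   # offload as many layers as fit on the GPU
)

out = llm("Write a hello-world program in Rust.", max_tokens=128)
print(out["choices"][0]["text"])
```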
-
Yep, this feature is very important in my workflows. I can't load the Llama 3.1 70B model with full context on my GPUs. As it is right now, whenever a prompt comes in with a large context, mistral.rs crashes. I'm also seeing this happen when the prompt should be within the context length my VRAM can handle, since the same prompts work in vLLM/Aphrodite.
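For comparison, this is roughly how that trade-off is expressed in vLLM; a minimal sketch in which the model name, context cap, memory fraction, and GPU count are illustrative assumptions, not values from this thread:

```python
# Minimal sketch: vLLM lets you cap the context and the KV-cache memory
# budget explicitly, which is why the same prompts fit there.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model id
    max_model_len=8192,            # cap the context so the KV cache fits in VRAM
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM may reserve
    tensor_parallel_size=2,        # split across two GPUs (illustrative)
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```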
-
JIC, sharing my experience fighting out-of-memory errors caused by spikes during inference... Apparently there's no such param. E.g. with Phi 3.5 MoE (loading as Q4K), I found the config.json under '..\hub\models--microsoft--Phi-3.5-MoE-instruct\snapshots\ae6cb90aceffd86d1e3fba55c59ec62dfc88d4a1' and updated "max_position_embeddings" and "sliding_window" to 50000 (from 131...). Loading 20 layers dropped the VRAM consumption from 20GB to 16GB, and I no longer had the VRAM spikes leading to OOM.
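A minimal sketch of that config.json edit, assuming the Hugging Face cache layout implied by the path above; the snapshot directory is a placeholder and must point at your local copy:

```python
# Minimal sketch: patch the cached config.json so the model advertises a
# smaller context, which reduces the KV-cache memory reserved at load time.
import json
from pathlib import Path

# Placeholder path under the HF cache; replace <snapshot> with the real hash.
cfg_path = Path(r"..\hub\models--microsoft--Phi-3.5-MoE-instruct\snapshots\<snapshot>\config.json")

cfg = json.loads(cfg_path.read_text())

# Values taken from the comment above.
cfg["max_position_embeddings"] = 50000
cfg["sliding_window"] = 50000

cfg_path.write_text(json.dumps(cfg, indent=2))
```

Note that this only changes what the model reports about itself; it does not change the weights, so prompts longer than the patched limit may still degrade.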
-
Is it possible to run with a limited context size?
It would be useful for models with a huge context size.