Replies: 2 comments
-
There isn't an easy way to calculate the memory needed for a given context size, but you may be able to implement your own function to do this based on the function
-
This is tricky, but here is the approximation method I use:

There are four chunks of memory that need to be computed as a function of n_ctx (and of n_batch if you are varying that, but I leave it fixed at 128 for all models): KV = the KV self size, plus the three buffer sizes B1, B2, and B3 reported on the console. Then

M(n_ctx) = KV + B1 + B2 + B3 = GPU memory needed for the KV cache and buffers at a context size of n_ctx.

Next I set up an overdetermined system of first-order equations to predict each memory chunk as a function of n_ctx, using three different n_ctx settings selected to span the expected use range. For example, to compute the gain (k) and offset (o) for the KV self size over an expected operating range of n_ctx from 2k to 8k:

k * 8192 + o = KV(8192)
k * 4096 + o = KV(4096)
k * 2048 + o = KV(2048)

To find KV(8192), KV(4096), and KV(2048), start the server with -ngl 0 and with n_ctx set to 8192, 4096, and 2048, and read the values printed on the console. Now solve the overdetermined system for k and o; the KV self size can then be approximated as a function of n_ctx:

KV = k * n_ctx + o

Repeat this for all four allocated memory chunks. The GPU memory needed as a function of n_ctx is then the sum of the four equations, each of which depends only on n_ctx.

If you are varying the batch size, you will need to repeat this procedure for each batch size you want to use (I have found no reason to use anything other than 128, as it gets most of the possible speedup in my tests). You also need to repeat the procedure, with separate k and o values, depending on whether flash attention is enabled. The procedure has to be done separately for every model, since they all allocate KV/buffer resources differently, and when llama.cpp is updated you may need to recompute the gain and offset parameters from scratch, since any change to the backends can change the buffer sizes.

I use the prediction either to size the KV cache so I can fully offload all the weight layers to the GPU, or to compute the ngl I can achieve for a user-specified KV size.
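To make the arithmetic concrete, here is a minimal C++ sketch of the fit described above, assuming the per-chunk sizes at the three calibration contexts have already been read off the server console. The chunk values are placeholders, and the names (`LinearFit`, `fit_chunk`) are made up for illustration, not part of llama.cpp:

```cpp
#include <array>
#include <cstddef>
#include <cstdio>

// Linear model for one memory chunk: size (MiB) ~= k * n_ctx + o
struct LinearFit {
    double k = 0.0;
    double o = 0.0;
    double predict(double n_ctx) const { return k * n_ctx + o; }
};

// Least-squares solution of the overdetermined system
//   k * n_ctx[i] + o = size[i]   for i = 0..N-1
template <std::size_t N>
LinearFit fit_chunk(const std::array<double, N>& n_ctx,
                    const std::array<double, N>& size) {
    double sx = 0, sy = 0, sxy = 0, sxx = 0;
    for (std::size_t i = 0; i < N; ++i) {
        sx  += n_ctx[i];
        sy  += size[i];
        sxy += n_ctx[i] * size[i];
        sxx += n_ctx[i] * n_ctx[i];
    }
    LinearFit f;
    f.k = (N * sxy - sx * sy) / (N * sxx - sx * sx);
    f.o = (sy - f.k * sx) / N;
    return f;
}

int main() {
    // Calibration contexts spanning the expected operating range (2k to 8k).
    const std::array<double, 3> ctx = {2048, 4096, 8192};

    // Placeholder measurements in MiB: replace with the KV self size and the
    // three buffer sizes printed on the console at each calibration n_ctx.
    const std::array<double, 3> kv_mib = {256, 512, 1024};
    const std::array<double, 3> b1_mib = { 80, 120,  200};
    const std::array<double, 3> b2_mib = { 30,  40,   60};
    const std::array<double, 3> b3_mib = { 16,  20,   28};

    const LinearFit kv = fit_chunk(ctx, kv_mib);
    const LinearFit b1 = fit_chunk(ctx, b1_mib);
    const LinearFit b2 = fit_chunk(ctx, b2_mib);
    const LinearFit b3 = fit_chunk(ctx, b3_mib);

    // M(n_ctx) = KV + B1 + B2 + B3
    const double n_ctx = 6144;
    const double total = kv.predict(n_ctx) + b1.predict(n_ctx) +
                         b2.predict(n_ctx) + b3.predict(n_ctx);
    std::printf("predicted GPU memory at n_ctx = %.0f: %.1f MiB\n", n_ctx, total);
    return 0;
}
```

A separate set of fits is needed for each batch size and for the flash-attention on/off cases, exactly as described above; only the measured inputs change.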
-
Hello, I can't seem to find a way to calculate the memory required for a given context size and batch size.
When searching for it on the internet, all the resources I find discuss the base memory needed just to load the model, e.g. a 7B Q8 model requiring approximately 7 GB. What I need to know is how much memory is needed to initialize a context of, say, 4096 with a batch size of 512.
From some research, it seems there are a lot of parameters that should be taken into account when calculating this, such as hidden_size (embedding_length, I guess), attention head count, KV head count, etc. I also can't determine the memory requirement by trial and error, because the same inputs produce different results for different models.
So, my question is: is there a formula for calculating this memory, excluding the base model size?
My second question: I am integrating llama.cpp into my C++ application and I read the GGUF file's params before loading the model. For that reason, if such a formula exists, what are the standard corresponding GGUF metadata key/value pairs to substitute into the formula?
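For the KV-cache portion specifically, a rough lower bound can be computed in closed form from GGUF metadata; it deliberately ignores the compute/scratch buffers, which also grow with n_ctx and n_batch. The sketch below assumes a llama-architecture key prefix, a uniform head dimension of embedding_length / head_count, and the default f16 KV-cache type:

```cpp
#include <cstdint>
#include <cstdio>

// Hyperparameters read from GGUF metadata (key names for the "llama" arch):
//   "llama.block_count"              -> n_layer
//   "llama.embedding_length"         -> n_embd
//   "llama.attention.head_count"     -> n_head
//   "llama.attention.head_count_kv"  -> n_head_kv
struct GgufHyperParams {
    uint32_t block_count;
    uint32_t embedding_length;
    uint32_t head_count;
    uint32_t head_count_kv;
};

// KV cache bytes ~= 2 (K and V) * n_layer * n_ctx * n_head_kv * head_dim * element size
uint64_t kv_cache_bytes(const GgufHyperParams& hp, uint32_t n_ctx,
                        uint32_t bytes_per_element /* 2 for f16 */) {
    const uint64_t head_dim = hp.embedding_length / hp.head_count;
    return 2ull * hp.block_count * n_ctx * hp.head_count_kv * head_dim *
           bytes_per_element;
}

int main() {
    // Example: a 7B-class model with 32 layers, 4096 embedding width,
    // 32 attention heads and 8 KV heads (grouped-query attention).
    const GgufHyperParams hp = {32, 4096, 32, 8};
    const uint64_t bytes = kv_cache_bytes(hp, /*n_ctx=*/4096, /*f16=*/2);
    std::printf("estimated KV cache: %.1f MiB\n", bytes / (1024.0 * 1024.0));
    return 0;
}
```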