Replies: 2 comments
-
There isn't an easy way to calculate the memory needed for a given context size, but you may be able to implement your own function to do this based on the function
-
This is tricky, but here is the approximation method I use:

There are four chunks of memory that need to be computed as a function of n_ctx (and of n_batch if you are varying that, but I leave it fixed at 128 for all models): KV = the KV self size, plus the three buffer sizes B1, B2, and B3 reported on the console. Then

M(n_ctx) = KV + B1 + B2 + B3 = GPU memory needed for the KV cache and buffers at a context size of n_ctx.

Next I set up an overdetermined system of first-order equations to predict each memory chunk as a function of n_ctx, using three different n_ctx settings selected to span the expected use range. For example, to compute the gain (k) and offset (o) for the KV self size over an expected operating range of n_ctx from 2k to 8k:

k * 8192 + o = KV(8192)
k * 4096 + o = KV(4096)
k * 2048 + o = KV(2048)

To find KV(8192), KV(4096), and KV(2048), start the server with -ngl 0 and with n_ctx set to 8192, 4096, and 2048, and read the values printed on the console. Now solve the overdetermined system for k and o; the KV self size can then be approximated as a function of n_ctx:

KV = k * n_ctx + o

Repeat this for all four allocated memory chunks. The GPU memory needed as a function of n_ctx is then the sum of the four equations, each of which depends only on n_ctx.

If you are varying the batch size, you will need to repeat this procedure for each batch size you want to use (I have found no reason to use anything other than 128, as it gets most of the possible speedup in my tests). You also need to repeat the procedure, with separate k and o values, depending on whether flash attention is enabled. The procedure has to be done separately for every model, since they all allocate KV/buffer resources differently, and when llama.cpp is updated you may need to recompute the gain and offset parameters from scratch, since any change to the backends can change the buffer sizes.

I use the prediction either to size the KV cache so I can fully offload all the weight layers to the GPU, or to compute the ngl I can achieve for a user-specified KV size.
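To make the arithmetic concrete, here is a minimal C++ sketch of the fit described above, assuming the per-chunk sizes at the three calibration contexts have already been read off the server console. The chunk values are placeholders, and the names (`LinearFit`, `fit_chunk`) are made up for illustration, not part of llama.cpp:

```cpp
#include <array>
#include <cstddef>
#include <cstdio>

// Linear model for one memory chunk: size (MiB) ~= k * n_ctx + o
struct LinearFit {
    double k = 0.0;
    double o = 0.0;
    double predict(double n_ctx) const { return k * n_ctx + o; }
};

// Least-squares solution of the overdetermined system
//   k * n_ctx[i] + o = size[i]   for i = 0..N-1
template <std::size_t N>
LinearFit fit_chunk(const std::array<double, N>& n_ctx,
                    const std::array<double, N>& size) {
    double sx = 0, sy = 0, sxy = 0, sxx = 0;
    for (std::size_t i = 0; i < N; ++i) {
        sx  += n_ctx[i];
        sy  += size[i];
        sxy += n_ctx[i] * size[i];
        sxx += n_ctx[i] * n_ctx[i];
    }
    LinearFit f;
    f.k = (N * sxy - sx * sy) / (N * sxx - sx * sx);
    f.o = (sy - f.k * sx) / N;
    return f;
}

int main() {
    // Calibration contexts spanning the expected operating range (2k to 8k).
    const std::array<double, 3> ctx = {2048, 4096, 8192};

    // Placeholder measurements in MiB: replace with the KV self size and the
    // three buffer sizes printed on the console at each calibration n_ctx.
    const std::array<double, 3> kv_mib = {256, 512, 1024};
    const std::array<double, 3> b1_mib = { 80, 120,  200};
    const std::array<double, 3> b2_mib = { 30,  40,   60};
    const std::array<double, 3> b3_mib = { 16,  20,   28};

    const LinearFit kv = fit_chunk(ctx, kv_mib);
    const LinearFit b1 = fit_chunk(ctx, b1_mib);
    const LinearFit b2 = fit_chunk(ctx, b2_mib);
    const LinearFit b3 = fit_chunk(ctx, b3_mib);

    // M(n_ctx) = KV + B1 + B2 + B3
    const double n_ctx = 6144;
    const double total = kv.predict(n_ctx) + b1.predict(n_ctx) +
                         b2.predict(n_ctx) + b3.predict(n_ctx);
    std::printf("predicted GPU memory at n_ctx = %.0f: %.1f MiB\n", n_ctx, total);
    return 0;
}
```

A separate set of fits is needed for each batch size and for the flash-attention on/off cases, exactly as described above; only the measured inputs change.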
-
Hello, I can't seem to find a way to calculate the memory required for a given context size and batch size.
When searching for it on the internet, all the resources I find discuss the base memory needed just to load the model, e.g. a 7B Q8 model requiring approximately 7 GB. What I need to know is how much memory is needed to initialize a context of, say, 4096 with a batch size of 512.
From some research, it seems there are a lot of parameters that should be taken into account when calculating this, such as hidden_size (embedding_length, I guess), attention head count, KV head count, etc. I also can't determine the memory requirement by trial and error, because the same inputs produce different results for different models.
So, my question is: is there a formula for calculating this memory, excluding the base model size?
My second question: I am integrating llama.cpp into my C++ application and I read the GGUF file's params before loading the model. For that reason, if such a formula exists, what are the standard corresponding GGUF metadata key/value pairs to substitute into the formula?
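For the KV-cache portion specifically, a rough lower bound can be computed in closed form from GGUF metadata; it deliberately ignores the compute/scratch buffers, which also grow with n_ctx and n_batch. The sketch below assumes a llama-architecture key prefix, a uniform head dimension of embedding_length / head_count, and the default f16 KV-cache type:

```cpp
#include <cstdint>
#include <cstdio>

// Hyperparameters read from GGUF metadata (key names for the "llama" arch):
//   "llama.block_count"              -> n_layer
//   "llama.embedding_length"         -> n_embd
//   "llama.attention.head_count"     -> n_head
//   "llama.attention.head_count_kv"  -> n_head_kv
struct GgufHyperParams {
    uint32_t block_count;
    uint32_t embedding_length;
    uint32_t head_count;
    uint32_t head_count_kv;
};

// KV cache bytes ~= 2 (K and V) * n_layer * n_ctx * n_head_kv * head_dim * element size
uint64_t kv_cache_bytes(const GgufHyperParams& hp, uint32_t n_ctx,
                        uint32_t bytes_per_element /* 2 for f16 */) {
    const uint64_t head_dim = hp.embedding_length / hp.head_count;
    return 2ull * hp.block_count * n_ctx * hp.head_count_kv * head_dim *
           bytes_per_element;
}

int main() {
    // Example: a 7B-class model with 32 layers, 4096 embedding width,
    // 32 attention heads and 8 KV heads (grouped-query attention).
    const GgufHyperParams hp = {32, 4096, 32, 8};
    const uint64_t bytes = kv_cache_bytes(hp, /*n_ctx=*/4096, /*f16=*/2);
    std::printf("estimated KV cache: %.1f MiB\n", bytes / (1024.0 * 1024.0));
    return 0;
}
```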