Replies: 1 comment
The default head size is …
---
Usually, the attention head size is

`head_dim = hidden_dim // num_attention_heads`

in many model architectures, including Llama. Some models use more flexible `head_dim` sizes. For Llama models, here is one pending PR for HF.
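As a concrete illustration of the default rule (using the stock Llama-2-7B configuration, `hidden_dim = 4096` and `num_attention_heads = 32`, as assumed example values), here is a minimal sketch:

```cpp
#include <cstdio>

int main() {
    // Example values from the stock Llama-2-7B config (assumed for illustration).
    const int hidden_dim          = 4096; // model embedding width
    const int num_attention_heads = 32;   // number of attention heads

    // The common default: split the embedding width evenly across heads.
    const int head_dim = hidden_dim / num_attention_heads;

    printf("head_dim = %d\n", head_dim); // prints 128
    return 0;
}
```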
Looking at `src/llama.cpp`, I feel like the information is handled around here, but I'm not sure:

`llama.cpp/src/llama.cpp`, lines 4698 to 4702 at commit `1b6ff90`
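For context, loaders in this style commonly compute the conventional default first and then let an optional metadata key override it. The sketch below is a self-contained illustration of that default-then-override pattern, not the actual llama.cpp code; the names `hparams_t`, `n_embd_head`, `attention.key_length`, and `get_key_or` are simplified stand-ins:

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Simplified stand-in for the model's hyperparameter block (hypothetical names).
struct hparams_t {
    uint32_t n_embd      = 4096; // embedding width
    uint32_t n_head      = 32;   // number of attention heads
    uint32_t n_embd_head = 0;    // per-head size, filled in below
};

// Stand-in for GGUF-style metadata: key -> value.
using metadata_t = std::map<std::string, uint32_t>;

// Return the value for `key` if present, otherwise keep `fallback`
// (mirrors an optional-override metadata lookup).
static uint32_t get_key_or(const metadata_t & meta, const std::string & key, uint32_t fallback) {
    auto it = meta.find(key);
    return it != meta.end() ? it->second : fallback;
}

int main() {
    hparams_t hparams;

    // A model whose metadata overrides the default head size.
    metadata_t meta = { { "attention.key_length", 256 } };

    // 1) Compute the conventional default.
    const uint32_t def = hparams.n_embd / hparams.n_head; // 4096 / 32 = 128

    // 2) Let an explicit metadata key override it when present.
    hparams.n_embd_head = get_key_or(meta, "attention.key_length", def);

    printf("n_embd_head = %u\n", hparams.n_embd_head); // prints 256
    return 0;
}
```

With this arrangement, later graph-building code can read the per-head size from the hyperparameter block instead of recomputing `n_embd / n_head`, which is what makes non-default head sizes possible.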
Could anybody help me understand how the information is loaded into `hparams` and how it can be used in `build_*()`? Thank you!