Support for DeepSeekV2-Chat with only 16GB VRAM #55
-
So my laptop has 16GB VRAM and 128GB RAM. With llama.cpp, I can load the IQ4_XS model from https://huggingface.co/mradermacher/DeepSeek-Coder-V2-Instruct-i1-GGUF with 7/60 layers offloaded to the GPU. This is the only Q4 quant that fits on the laptop, and IQ4_XS is a lot more coherent than Q3_K_M. With this setup I get about 2.5 t/s. I am intrigued by how much more performance I could get with ktransformers. With the help of ChatGPT, I was able to get ktransformers to support IQ4_XS. Then I created an optimize config YAML file, with the first 40 layers going to the GPU and the last 20 layers going to the CPU. The model loads, but during inference I get this error:
If I understand correctly, this optimize config with
Is there any way to work around this? Thanks
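(For reference, the llama.cpp baseline above is roughly equivalent to the sketch below. This uses the llama-cpp-python bindings with a placeholder model path; the actual run may well have used the llama.cpp CLI with `-ngl 7` instead.)

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA support)

# Placeholder path; the real model is a multi-part IQ4_XS GGUF from the repo linked above.
llm = Llama(
    model_path="DeepSeek-Coder-V2-Instruct.IQ4_XS.gguf",
    n_gpu_layers=7,   # offload 7 of the 60 layers to the GPU; the rest stay in RAM
    n_ctx=4096,
)

out = llm("Write a quicksort in Python.", max_tokens=128)
print(out["choices"][0]["text"])
```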
-
Can you provide your complete YAML file?
-
Sure, this is what I use (modified from the multi-GPU config):

```yaml
- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[123][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([45][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[123][0-9])\\.(?!self_attn).*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized kernel on quantized data types
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([45][0-9])\\.(?!self_attn).*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized kernel on quantized data types
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
      generate_op: "KLinearCPUInfer"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[123][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE  # MLP module with custom forward function
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([45][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE  # MLP module with custom forward function
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[123][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False  # don't recursively inject submodules of this module

- match:
    name: "^model\\.layers\\.([45][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cpu"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cpu"
  recursive: False  # don't recursively inject submodules of this module

- match:
    name: "^model\\.layers\\.(0|[1-9]|[123][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention  # optimized MLA implementation
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([45][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention  # optimized MLA implementation
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0  # 0 disables layer-wise prefill
      transfer_map:
        40: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[123][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "(^model\\.layers\\.([45][0-9])\\.)|(model.norm)|(lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
```
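As a quick standalone sanity check (plain Python, not part of ktransformers), the two layer regexes do split the 60 decoder layers the way I intended: layers 0-39 on cuda:0 and layers 40-59 on CPU, which also lines up with the transfer_map entry at layer 40.

```python
import re

# The two layer-index regexes used throughout the config above.
gpu_pattern = re.compile(r"^model\.layers\.(0|[1-9]|[123][0-9])\.")
cpu_pattern = re.compile(r"^model\.layers\.([45][0-9])\.")

# DeepSeek-V2 has 60 decoder layers; check which pattern claims each one.
gpu_layers = [i for i in range(60) if gpu_pattern.match(f"model.layers.{i}.self_attn")]
cpu_layers = [i for i in range(60) if cpu_pattern.match(f"model.layers.{i}.self_attn")]

assert gpu_layers == list(range(40))       # layers 0-39 -> cuda:0
assert cpu_layers == list(range(40, 60))   # layers 40-59 -> cpu
assert not set(gpu_layers) & set(cpu_layers)  # no layer matched twice
print(len(gpu_layers), "layers on GPU,", len(cpu_layers), "layers on CPU")
```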
-
Sorry for the inconvenience. It seems we cannot put a whole layer on the CPU; I have fixed this bug in #62. And if you have only 16GB VRAM, the good news is that we have compressed DeepSeek-V2's required VRAM from 21GB to 11GB. Please check our latest release.