Support for DeepSeekV2-Chat with only 16GB VRAM #55
-
So my laptop has 16GB VRAM and 128GB RAM. With llama.cpp, I can load the IQ4_XS model from https://huggingface.co/mradermacher/DeepSeek-Coder-V2-Instruct-i1-GGUF with 7/60 layers offloaded to the GPU. This is the only Q4 quant that fits on the laptop, and IQ4_XS is a lot more coherent than Q3_K_M. With this setup I get about 2.5 t/s. I am intrigued by how much more performance I could get with ktransformers. With the help of ChatGPT, I was able to get ktransformers to support IQ4_XS. Then I created an optimize config YAML file, with the first 40 layers going to the GPU and the last 20 layers going to the CPU. The model loads, but during inference I get this error:
If I understand correctly, this optimize config with
Is there any way to work around this? Thanks
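(For reference, the llama.cpp baseline above is roughly equivalent to the sketch below. This uses the llama-cpp-python bindings with a placeholder model path; the actual run may well have used the llama.cpp CLI with `-ngl 7` instead.)

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA support)

# Placeholder path; the real model is a multi-part IQ4_XS GGUF from the repo linked above.
llm = Llama(
    model_path="DeepSeek-Coder-V2-Instruct.IQ4_XS.gguf",
    n_gpu_layers=7,   # offload 7 of the 60 layers to the GPU; the rest stay in RAM
    n_ctx=4096,
)

out = llm("Write a quicksort in Python.", max_tokens=128)
print(out["choices"][0]["text"])
```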
-
Can you provide your complete YAML file?
-
Sure, this is what I use (modified from the multi-GPU config):

```yaml
- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[123][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([45][0-9])\\."
    class: ktransformers.models.modeling_deepseek.DeepseekV2YarnRotaryEmbedding
  replace:
    class: ktransformers.operators.RoPE.YarnRotaryEmbedding
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[123][0-9])\\.(?!self_attn).*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized kernel on quantized data types
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"
      generate_op: "KLinearMarlin"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.([45][0-9])\\.(?!self_attn).*$"  # regular expression
    class: torch.nn.Linear  # only match modules matching name and class simultaneously
  replace:
    class: ktransformers.operators.linear.KTransformersLinear  # optimized kernel on quantized data types
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
      generate_op: "KLinearCPUInfer"
      prefill_op: "KLinearTorch"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[123][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE  # MLP module with custom forward function
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([45][0-9])\\.mlp$"
    class: ktransformers.models.modeling_deepseek.DeepseekV2MoE
  replace:
    class: ktransformers.operators.experts.KDeepseekV2MoE  # MLP module with custom forward function
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[123][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False  # don't recursively inject submodules of this module

- match:
    name: "^model\\.layers\\.([45][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cpu"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op: "KExpertsCPU"
      out_device: "cpu"
  recursive: False  # don't recursively inject submodules of this module

- match:
    name: "^model\\.layers\\.(0|[1-9]|[123][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention  # optimized MLA implementation
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "^model\\.layers\\.([45][0-9])\\.self_attn$"
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention  # optimized MLA implementation
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

- match:
    name: "^model$"
  replace:
    class: "ktransformers.operators.models.KDeepseekV2Model"
    kwargs:
      per_layer_prefill_intput_threshold: 0  # 0 disables layer-wise prefill
      transfer_map:
        40: "cpu"

- match:
    name: "^model\\.layers\\.(0|[1-9]|[123][0-9])\\."
  replace:
    class: "default"
    kwargs:
      generate_device: "cuda:0"
      prefill_device: "cuda:0"

- match:
    name: "(^model\\.layers\\.([45][0-9])\\.)|(model.norm)|(lm_head)"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"
```
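As a quick standalone sanity check (plain Python, not part of ktransformers), the two layer regexes do split the 60 decoder layers the way I intended: layers 0-39 on cuda:0 and layers 40-59 on CPU, which also lines up with the transfer_map entry at layer 40.

```python
import re

# The two layer-index regexes used throughout the config above.
gpu_pattern = re.compile(r"^model\.layers\.(0|[1-9]|[123][0-9])\.")
cpu_pattern = re.compile(r"^model\.layers\.([45][0-9])\.")

# DeepSeek-V2 has 60 decoder layers; check which pattern claims each one.
gpu_layers = [i for i in range(60) if gpu_pattern.match(f"model.layers.{i}.self_attn")]
cpu_layers = [i for i in range(60) if cpu_pattern.match(f"model.layers.{i}.self_attn")]

assert gpu_layers == list(range(40))       # layers 0-39 -> cuda:0
assert cpu_layers == list(range(40, 60))   # layers 40-59 -> cpu
assert not set(gpu_layers) & set(cpu_layers)  # no layer matched twice
print(len(gpu_layers), "layers on GPU,", len(cpu_layers), "layers on CPU")
```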
-
Sorry for the inconvenience. It seems we cannot put a whole layer on the CPU; I have fixed this bug in #62. And if you have only 16GB VRAM, the good news is that we have compressed DeepSeek-V2's required VRAM from 21GB to 11GB. Please check our latest release.