Replies: 2 comments 2 replies
-
According to our evaluation with KTransformers, the distribution of expert activations in Mixtral and Qwen2-57B-A14B is very imbalanced; thus, it would be beneficial to store only the most frequently used experts on the GPU. In contrast, this strategy does not yield significant benefits with DeepSeek-V2, because it has a much larger number of experts and is trained with a more balanced recipe. We plan to implement this strategy in KTransformers to measure the appropriate parameters, which could then inform a future implementation in llama.cpp. We are not very familiar with the llama.cpp codebase, so we are unable to upstream such modifications ourselves; however, we are willing to help construct a PR if there is such an objective in the llama.cpp community.
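To make the idea concrete, here is a minimal sketch (not actual KTransformers or llama.cpp code; the function name and data layout are assumptions for illustration) of how per-layer expert selection counts gathered from a profiling run could be turned into a per-layer list of experts to keep resident on the GPU:

```cpp
// Hypothetical sketch, not actual KTransformers or llama.cpp code.
// expert_counts[layer][expert] holds how often each expert was routed to
// during a profiling run; pick the N most frequently used experts per layer
// as candidates to keep resident on the GPU.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

std::vector<std::vector<int>> pick_gpu_experts(
        const std::vector<std::vector<int64_t>> & expert_counts,
        int n_gpu_experts_per_layer) {
    std::vector<std::vector<int>> gpu_experts(expert_counts.size());
    for (size_t il = 0; il < expert_counts.size(); ++il) {
        const auto & counts = expert_counts[il];
        std::vector<int> order(counts.size());
        std::iota(order.begin(), order.end(), 0);
        // sort expert ids by descending selection count
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return counts[a] > counts[b]; });
        const int n = std::min<int>(n_gpu_experts_per_layer, (int) order.size());
        gpu_experts[il].assign(order.begin(), order.begin() + n);
    }
    return gpu_experts;
}
```

The payoff of such a selection depends on how skewed the routing is: with a near-uniform router (as described above for DeepSeek-V2), the top-N set covers a much smaller share of the traffic than it does for Mixtral or Qwen2-57B-A14B.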
-
Out of the box llama.cpp with
-
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/deepseek-v2-injection.md
Looks particularly interesting - could even go one step further and run something similar to `llama-imatrix` to find the frequency of expert selections, then rank experts from "most selected" to "least selected" and allow a variable offload like `-ngl`.
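A rough sketch of the collection side, loosely in the spirit of how `llama-imatrix` accumulates activation statistics (this is not the actual imatrix code path; the struct and method names are hypothetical):

```cpp
// Hypothetical sketch, not the actual llama-imatrix code path: tally how often
// the router selects each expert, so the experts can later be ranked from
// "most selected" to "least selected".
#include <cstdint>
#include <vector>

struct expert_freq_stats {
    // counts[layer][expert] = number of times the router picked that expert
    std::vector<std::vector<int64_t>> counts;

    expert_freq_stats(int n_layers, int n_experts)
        : counts(n_layers, std::vector<int64_t>(n_experts, 0)) {}

    // called once per token per MoE layer with the router's top-k expert ids
    void add(int layer, const std::vector<int> & selected_experts) {
        for (int e : selected_experts) {
            counts[layer][e] += 1;
        }
    }
};
```

The resulting ranking could then be cut off at a user-chosen budget of experts per layer, analogous to how `-ngl` caps the number of offloaded layers.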