Replies: 2 comments 2 replies
-
According to our evaluation with KTransformers, the distribution of expert activations in Mixtral and Qwen2-57B-A14B is very imbalanced; thus, it would be beneficial to store only the most frequently used experts on the GPU. In contrast, this strategy does not yield significant benefits with DeepSeek-V2, because it has a much larger number of experts and is trained with a more balanced recipe. We plan to implement this strategy in KTransformers to measure the appropriate parameters, which could then inform a future implementation in llama.cpp. We are not very familiar with the llama.cpp codebase, so we are unable to upstream such modifications ourselves; however, we are willing to help construct a PR if there is such an objective in the llama.cpp community.
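To make the idea concrete, here is a minimal sketch (not actual KTransformers or llama.cpp code; the function name and data layout are assumptions for illustration) of how per-layer expert selection counts gathered from a profiling run could be turned into a per-layer list of experts to keep resident on the GPU:

```cpp
// Hypothetical sketch, not actual KTransformers or llama.cpp code.
// expert_counts[layer][expert] holds how often each expert was routed to
// during a profiling run; pick the N most frequently used experts per layer
// as candidates to keep resident on the GPU.
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

std::vector<std::vector<int>> pick_gpu_experts(
        const std::vector<std::vector<int64_t>> & expert_counts,
        int n_gpu_experts_per_layer) {
    std::vector<std::vector<int>> gpu_experts(expert_counts.size());
    for (size_t il = 0; il < expert_counts.size(); ++il) {
        const auto & counts = expert_counts[il];
        std::vector<int> order(counts.size());
        std::iota(order.begin(), order.end(), 0);
        // sort expert ids by descending selection count
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return counts[a] > counts[b]; });
        const int n = std::min<int>(n_gpu_experts_per_layer, (int) order.size());
        gpu_experts[il].assign(order.begin(), order.begin() + n);
    }
    return gpu_experts;
}
```

The payoff of such a selection depends on how skewed the routing is: with a near-uniform router (as described above for DeepSeek-V2), the top-N set covers a much smaller share of the traffic than it does for Mixtral or Qwen2-57B-A14B.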
-
Out of the box llama.cpp with
-
https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/deepseek-v2-injection.md
Looks particularly interesting - could even go one step further and run something similar to `llama-imatrix` to find the frequency of expert selections, then rank experts from "most selected" to "least selected" and allow a variable offload like `-ngl`.
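A rough sketch of the collection side, loosely in the spirit of how `llama-imatrix` accumulates activation statistics (this is not the actual imatrix code path; the struct and method names are hypothetical):

```cpp
// Hypothetical sketch, not the actual llama-imatrix code path: tally how often
// the router selects each expert, so the experts can later be ranked from
// "most selected" to "least selected".
#include <cstdint>
#include <vector>

struct expert_freq_stats {
    // counts[layer][expert] = number of times the router picked that expert
    std::vector<std::vector<int64_t>> counts;

    expert_freq_stats(int n_layers, int n_experts)
        : counts(n_layers, std::vector<int64_t>(n_experts, 0)) {}

    // called once per token per MoE layer with the router's top-k expert ids
    void add(int layer, const std::vector<int> & selected_experts) {
        for (int e : selected_experts) {
            counts[layer][e] += 1;
        }
    }
};
```

The resulting ranking could then be cut off at a user-chosen budget of experts per layer, analogous to how `-ngl` caps the number of offloaded layers.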