Ideas for models & also distributed inference over LAN. #6

ghchris2021 started this conversation in Ideas
-

Congratulations on the great FOSS project, and thank you very much. I look forward to seeing what becomes of it!

Per the request for ideas and aspirations for features / model support, I'll share my own thoughts.

For running larger models in general, my primary wishes for an inference system are:

A: Support distributed inference across multiple Linux PCs on an IP LAN, pooling any mix of GPU+VRAM and CPU+RAM resources, so that models needing more memory than the 16-24 GB VRAM of a typical GPU can still be served effectively (a rough sketch of the idea follows this post).

B: Support heterogeneous GPUs -- NVIDIA, Intel Arc, AMD RDNA -- in any combination: alone, together, or distributed.

For model support, my main desires (in no particular order) are:
Llama-3.1; DeepSeek-Coder-V2; DeepSeek-Chat (most recent); Mistral-Large; Codestral; Mixtral-8x22B; Gemma-2-27B; CodeGemma; Qwen2; CodeQwen.
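To make item A concrete, here is a minimal sketch of one way such LAN pooling could work: the model is split into pipeline stages, each stage runs on whichever host and local device can hold it, and activations are passed over plain TCP. Everything here -- the hostnames, the port, the helper names, and the two-stage split -- is invented for illustration; this is not KTransformers code.

```python
import io
import socket
import struct

import torch
import torch.nn as nn


def _recv_exactly(sock: socket.socket, n: int) -> bytes:
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def send_tensor(sock: socket.socket, t: torch.Tensor) -> None:
    # Serialize the tensor and prefix it with a 4-byte big-endian length.
    buf = io.BytesIO()
    torch.save(t.cpu(), buf)
    data = buf.getvalue()
    sock.sendall(struct.pack("!I", len(data)) + data)


def recv_tensor(sock: socket.socket) -> torch.Tensor:
    (length,) = struct.unpack("!I", _recv_exactly(sock, 4))
    return torch.load(io.BytesIO(_recv_exactly(sock, length)))


def pick_device() -> torch.device:
    # Item B in miniature: use whatever accelerator this host happens to have.
    if torch.cuda.is_available():  # NVIDIA, or AMD RDNA via ROCm builds
        return torch.device("cuda")
    if getattr(torch, "xpu", None) and torch.xpu.is_available():  # Intel Arc
        return torch.device("xpu")
    return torch.device("cpu")


def serve_stage(layers: nn.Module, port: int) -> None:
    # One pipeline stage: receive an activation, run this host's slice
    # of the model on the local device, and send the result back.
    device = pick_device()
    layers = layers.to(device).eval()
    with socket.create_server(("0.0.0.0", port)) as srv:
        while True:
            conn, _ = srv.accept()
            with conn:
                x = recv_tensor(conn).to(device)
                with torch.no_grad():
                    y = layers(x)
                send_tensor(conn, y)


def run_remote(host: str, port: int, x: torch.Tensor) -> torch.Tensor:
    with socket.create_connection((host, port)) as s:
        send_tensor(s, x)
        return recv_tensor(s)


# Client: chain the stages hop by hop (hostnames are placeholders).
# h = run_remote("box-with-arc.lan", 9000, torch.randn(1, 4096))
# y = run_remote("box-with-4090.lan", 9000, h)
```

A real implementation would also need batching, KV-cache handling, and fault tolerance, but the shape of the problem is the same: memory-bound layer placement plus activation transport.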
Replies: 1 comment

-

Thank you for the great suggestions! Merely supporting heterogeneous GPUs would not be a problem for KTransformers because it is based on transformers/torch. It may not be as efficient as the highly optimized Marlin CUDA kernel, but it can still benefit from CPU offloading. We are also interested in implementing an Exo-like multi-machine operator.
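For readers unfamiliar with the CPU offloading mentioned above, this is roughly what it looks like in the plain transformers/torch stack that KTransformers builds on (KTransformers' own optimized kernels replace parts of this path). The model id and memory budgets below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"  # example only; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" (backed by accelerate) fills GPU 0 up to its budget,
# then offloads the remaining layers to system RAM and swaps them in
# on demand during the forward pass.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "16GiB", "cpu": "64GiB"},
)

inputs = tokenizer("Write a quicksort in Python.", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```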