Multi-GPU setup & parallel decoding: sharing compute, not just VRAM #9364
ExtReMLapin started this conversation in General
Hello,
From my understanding, simple non-parallel decoding doesn't make efficient use of multi-GPU compute, because the prompt passes through the layers sequentially, so it moves from one GPU to the next and only one GPU is busy at a time.
However, again from my understanding, with a model hosted on a single GPU we can queue prompts (parallel decoding) to increase that GPU's total throughput.
Shouldn't multi-GPU coupled with parallel decoding allow us to share the compute power, since the model (i.e. the layers) is split across multiple GPUs?
I'm not talking about duplicating the layers on GPU A and GPU B, because we can already do that ourselves by just starting llama-server twice.
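For concreteness, here is roughly what I mean by "queuing prompts": a small client-side sketch that fires several completion requests at a single llama-server instance at the same time. The address, endpoint, and prompt texts are just example assumptions, and the server would need to be started with more than one parallel slot (e.g. `-np 4`) for the sequences to actually be decoded together in one batch rather than one after another.

```python
# Minimal client-side sketch (not llama.cpp internals): submit several prompts
# concurrently to one llama-server instance so their decoding can be batched.
# The address and generation parameters below are assumptions for illustration.
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

SERVER = "http://127.0.0.1:8080"  # assumed llama-server address


def complete(prompt: str) -> str:
    """Send one request to the OpenAI-compatible /v1/completions endpoint."""
    payload = json.dumps({"prompt": prompt, "max_tokens": 64}).encode()
    req = Request(
        f"{SERVER}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]


prompts = [
    f"Question {i}: explain pipeline parallelism in one sentence."
    for i in range(4)
]

# Submit all prompts at once; with enough server-side slots they share the
# decode batch, which is what should keep both GPUs busy in a layer-split setup.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for answer in pool.map(complete, prompts):
        print(answer.strip())
```

My question is whether, with the layers split across two GPUs, this kind of concurrent load actually ends up using both GPUs' compute at the same time, or whether one GPU still sits idle while the other works.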