Parallel Inference #22

Hi there!
Great library! I was wondering if parallel inference is possible or a planned feature, as llama.cpp supports it.

Comments
Hello, and thanks!
I mean batched inference, i.e. submitting two or more requests at once to the LLM. Sorry for the confusion.
Ah, no, it isn't supported at the moment, but it's definitely something that could be added, I think.
I see some Batch code in this repo already. I think llama.cpp supports batched inference, so does this crate already support it?
One way you can achieve it, I believe, is by spawning multiple contexts; multithreading should help as a workaround for now.
By contexts, do you mean sessions? I tried running two simultaneous inferences over two sessions, but it took the same amount of time as running both in serial, so I'm not sure anything is actually happening in parallel (or at least it doesn't provide any benefit).
Yes, sessions. I had a bit of luck with it while testing the phi-2 model and saw inference happening at the same time on two threads, but as you said, it's hard to tell whether execution is actually parallel or simply synchronous, since it takes the same amount of time.
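For reference, a rough sketch of the multi-session workaround discussed above, assuming a hypothetical session API (`Model`, `Session`, `create_session`, and `infer` are placeholders, not this crate's actual names): each prompt gets its own session, driven on its own thread.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-ins for the crate's model/session types;
// the real names and signatures will differ.
struct Model;
struct Session;

impl Model {
    fn create_session(&self) -> Session {
        Session
    }
}

impl Session {
    fn infer(&mut self, prompt: &str) -> String {
        // Placeholder: a real session would run the decode loop here.
        format!("completion for: {prompt}")
    }
}

fn main() {
    let model = Arc::new(Model);
    let prompts = ["first prompt", "second prompt"];

    // One session per thread; each thread drives its own decode loop.
    let handles: Vec<_> = prompts
        .iter()
        .map(|&prompt| {
            let model = Arc::clone(&model);
            thread::spawn(move || {
                let mut session = model.create_session();
                session.infer(prompt)
            })
        })
        .collect();

    for handle in handles {
        println!("{}", handle.join().unwrap());
    }
}
```

Whether this actually runs faster than serial execution depends on the backend; if both sessions contend for the same GPU or thread pool, wall-clock time may not improve, which would match the observation above.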
Concurrent execution with multiple sessions should work as long as they are executed in different threads. But I thought the author of this issue was asking about batching multiple prompts in the same session, am I wrong?
This doesn't sound like batching, but it is a little unclear 😅
No, I really mean batched inference. Upon receiving multiple inference requests, the server queues them, takes batches out of the queue, and submits them to the engine for further processing. The terminology is not mine; it's from here: https://github.com/ggerganov/llama.cpp/blob/master/examples/parallel/parallel.cpp. But maybe I misunderstood something, in which case I'm sorry for the confusion.
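For anyone following along, the parallel example linked above works roughly like this: pending requests are queued, their tokens are gathered into a single batch tagged with per-sequence ids, and one decode pass runs over the whole batch. Below is a minimal Rust sketch of that shape; `Request`, `BatchItem`, and `decode_batch` are illustrative placeholders, not an existing API in this crate, and a real version would fill llama.cpp's `llama_batch` and call `llama_decode` through the bindings.

```rust
use std::collections::VecDeque;

// Hypothetical types for illustration only.
struct Request {
    id: u32,
    tokens: Vec<i32>,
}

struct BatchItem {
    token: i32,
    seq_id: u32, // which queued request this token belongs to
    pos: usize,  // position of the token within that request
}

fn decode_batch(batch: &[BatchItem]) {
    // Placeholder for one decode pass over all sequences at once.
    println!("decoding {} tokens across the batch", batch.len());
}

fn main() {
    // Incoming requests are queued as they arrive at the server.
    let mut queue: VecDeque<Request> = VecDeque::from(vec![
        Request { id: 0, tokens: vec![1, 2, 3] },
        Request { id: 1, tokens: vec![4, 5] },
    ]);

    // Drain the queue into one flat batch, tagging each token with the
    // sequence it belongs to so the engine can keep the prompts separate.
    let mut batch = Vec::new();
    while let Some(req) = queue.pop_front() {
        for (pos, &token) in req.tokens.iter().enumerate() {
            batch.push(BatchItem { token, seq_id: req.id, pos });
        }
    }

    // One decode call serves every queued request.
    decode_batch(&batch);
}
```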
Hi guys, llama.cpp already has this in its server, somehow.