Parallel Inference #22

Open
DaOnlyOwner opened this issue Feb 18, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@DaOnlyOwner

Hi there!

Great library! I was wondering if parallel inferencing is possible or a planned feature, as llama.cpp supports it.

@pedro-devv
Contributor

Hello, and thanks!
What do you mean exactly by parallel inferencing?

@DaOnlyOwner
Author

I mean batched inference, i.e. submitting two or more requests at once to the LLM. Sorry for the confusion.

@pedro-devv
Contributor

Ah, no, it isn't supported at the moment, but it's definitely something that could be added, I think.

@pedro-devv added the enhancement (New feature or request) label on Feb 26, 2024
@benbot

benbot commented Mar 17, 2024

I see some Batch code in this repo already. I think llama.cpp supports batched inference, so does this crate already support it?

@ElhamAryanpur

One way you can achieve it is by spawning multiple contexts, I believe; multithreading should help as a workaround for now.

@benbot

benbot commented Mar 20, 2024

> One way you can achieve it is by spawning multiple contexts, I believe; multithreading should help as a workaround for now.

By contexts, do you mean sessions?

I tried running 2 simultaneous inferences over 2 sessions, but it took the same amount of time as running both in serial, so I'm not sure anything is actually happening in parallel (or at least it doesn't provide any benefit).

@ElhamAryanpur

Yes, yes, sessions. I had a bit of luck with it while testing the phi-2 model and saw inference happening at the same time on two threads. But as you said, it's hard to tell whether execution is actually parallel or simply serialized, since it takes the same amount of time.

@pedro-devv
Contributor

Concurrent execution with multiple sessions should work, as long as they are executed in different threads. But I thought the author of this issue was asking about batching multiple prompts in the same session, am I wrong?
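
Roughly the idea, as a minimal sketch (the Model/Session types and method names here are stand-ins, not this crate's actual API):

```rust
use std::sync::Arc;
use std::thread;

// Stand-in types: the real model/session types and methods will differ;
// this only illustrates the one-session-per-thread idea.
struct Model;
struct Session;

impl Model {
    fn create_session(&self) -> Session {
        Session
    }
}

impl Session {
    fn generate(&mut self, prompt: &str) -> String {
        // Placeholder for real token generation.
        format!("(completion for: {prompt})")
    }
}

fn main() {
    // Load the model once and share it; each thread gets its own session/context.
    let model = Arc::new(Model);

    let prompts = ["Tell me a joke.", "Summarize this repo."];
    let handles: Vec<_> = prompts
        .iter()
        .map(|&prompt| {
            let model = Arc::clone(&model);
            thread::spawn(move || model.create_session().generate(prompt))
        })
        .collect();

    for handle in handles {
        println!("{}", handle.join().unwrap());
    }
}
```

If the sessions are driven from the same thread one after the other, you only get serial execution, which would explain the timing benbot observed.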

@benbot

benbot commented Mar 20, 2024

> submitting two or more requests at once to the LLM

this doesn’t sound like batching, but it is a little unclear 😅

@DaOnlyOwner
Author

No, I really do mean batched inference. Upon receiving multiple inference requests, the server queues them, takes batches out of the queue, and submits them to the engine for further processing. The terminology is not mine - it's from here: https://github.com/ggerganov/llama.cpp/blob/master/examples/parallel/parallel.cpp. But maybe I misunderstood something; if so, I'm sorry for the confusion.
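
To illustrate the flow I have in mind, here is a rough sketch of the queue-and-batch pattern (the engine is just a stub; a real implementation would evaluate all prompts of a batch together in one llama.cpp decode pass):

```rust
use std::sync::mpsc;
use std::thread;

// A request carries the prompt plus a channel to send the completion back on.
struct Request {
    prompt: String,
    reply: mpsc::Sender<String>,
}

const MAX_BATCH: usize = 4;

fn run_engine(queue: mpsc::Receiver<Request>) {
    // Block for the first request, then greedily drain up to MAX_BATCH - 1 more.
    while let Ok(first) = queue.recv() {
        let mut batch = vec![first];
        while batch.len() < MAX_BATCH {
            match queue.try_recv() {
                Ok(req) => batch.push(req),
                Err(_) => break,
            }
        }
        // Stub "batched" decode: a real engine would process the whole batch
        // in one evaluation before sampling each completion.
        for req in batch {
            let _ = req.reply.send(format!("(completion for: {})", req.prompt));
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let engine = thread::spawn(move || run_engine(rx));

    // Simulate several clients submitting requests "at once".
    let mut replies = Vec::new();
    for prompt in ["first prompt", "second prompt", "third prompt"] {
        let (reply_tx, reply_rx) = mpsc::channel();
        tx.send(Request {
            prompt: prompt.to_string(),
            reply: reply_tx,
        })
        .unwrap();
        replies.push(reply_rx);
    }
    drop(tx); // close the queue so the engine thread exits once it is drained

    for reply in replies {
        println!("{}", reply.recv().unwrap());
    }
    engine.join().unwrap();
}
```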

@pabl-o-ce

Hi guys, llama.cpp can already handle parallel requests, but only in the server: -np N, --parallel N sets the number of slots for processing requests (default: 1).
