Parallel Inference #22

Open
DaOnlyOwner opened this issue Feb 18, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@DaOnlyOwner

Hi there!

Great library! I was wondering if parallel inferencing is possible or a planned feature, as llama.cpp supports it.

@pedro-devv
Contributor

Hello, and thanks!
What do you mean exactly by parallel inferencing?

@DaOnlyOwner
Author

I mean batched inference, i.e. submitting two or more requests at once to the LLM. Sorry for the confusion.

@pedro-devv
Contributor

Ah, no, it isn't supported at the moment, but it's definitely something that could be added, I think.

@pedro-devv added the enhancement (New feature or request) label on Feb 26, 2024
@benbot

benbot commented Mar 17, 2024

I see some Batch code in this repo already. I think llama.cpp supports batched inference, so does this crate already support it?

@ElhamAryanpur

One way you can achieve it is by spawning multiple contexts, I believe; multithreading should help as a workaround for now.

@benbot

benbot commented Mar 20, 2024

> One way you can achieve it is by spawning multiple contexts, I believe; multithreading should help as a workaround for now.

By contexts, do you mean sessions?

I tried running 2 simultaneous inferences over 2 sessions, but it took the same amount of time as running both in serial, so I'm not sure anything is actually happening in parallel (or at least it doesn't provide any benefit).

@ElhamAryanpur

Yes, yes, sessions. I had a bit of luck with it while testing the phi-2 model and saw inference happening at the same time on two threads. But as you said, it's hard to tell whether execution is actually parallel or simply serialized, since it takes the same amount of time.

@pedro-devv
Contributor

Concurrent execution with multiple sessions should work, as long as they are executed in different threads. But I thought the author of this issue was asking about batching multiple prompts in the same session, am I wrong?
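
Roughly the idea, as a minimal sketch (the Model/Session types and method names here are stand-ins, not this crate's actual API):

```rust
use std::sync::Arc;
use std::thread;

// Stand-in types: the real model/session types and methods will differ;
// this only illustrates the one-session-per-thread idea.
struct Model;
struct Session;

impl Model {
    fn create_session(&self) -> Session {
        Session
    }
}

impl Session {
    fn generate(&mut self, prompt: &str) -> String {
        // Placeholder for real token generation.
        format!("(completion for: {prompt})")
    }
}

fn main() {
    // Load the model once and share it; each thread gets its own session/context.
    let model = Arc::new(Model);

    let prompts = ["Tell me a joke.", "Summarize this repo."];
    let handles: Vec<_> = prompts
        .iter()
        .map(|&prompt| {
            let model = Arc::clone(&model);
            thread::spawn(move || model.create_session().generate(prompt))
        })
        .collect();

    for handle in handles {
        println!("{}", handle.join().unwrap());
    }
}
```

If the sessions are driven from the same thread one after the other, you only get serial execution, which would explain the timing benbot observed.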

@benbot

benbot commented Mar 20, 2024

> submitting two or more requests at once to the LLM

this doesn’t sound like batching, but it is a little unclear 😅

@DaOnlyOwner
Author

No, I really do mean batched inference. Upon receiving multiple inference requests, the server queues them, takes batches out of the queue, and submits them to the engine for further processing. The terminology is not mine - it's from here: https://github.com/ggerganov/llama.cpp/blob/master/examples/parallel/parallel.cpp. But maybe I misunderstood something; if so, I'm sorry for the confusion.
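
To illustrate the flow I have in mind, here is a rough sketch of the queue-and-batch pattern (the engine is just a stub; a real implementation would evaluate all prompts of a batch together in one llama.cpp decode pass):

```rust
use std::sync::mpsc;
use std::thread;

// A request carries the prompt plus a channel to send the completion back on.
struct Request {
    prompt: String,
    reply: mpsc::Sender<String>,
}

const MAX_BATCH: usize = 4;

fn run_engine(queue: mpsc::Receiver<Request>) {
    // Block for the first request, then greedily drain up to MAX_BATCH - 1 more.
    while let Ok(first) = queue.recv() {
        let mut batch = vec![first];
        while batch.len() < MAX_BATCH {
            match queue.try_recv() {
                Ok(req) => batch.push(req),
                Err(_) => break,
            }
        }
        // Stub "batched" decode: a real engine would process the whole batch
        // in one evaluation before sampling each completion.
        for req in batch {
            let _ = req.reply.send(format!("(completion for: {})", req.prompt));
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let engine = thread::spawn(move || run_engine(rx));

    // Simulate several clients submitting requests "at once".
    let mut replies = Vec::new();
    for prompt in ["first prompt", "second prompt", "third prompt"] {
        let (reply_tx, reply_rx) = mpsc::channel();
        tx.send(Request {
            prompt: prompt.to_string(),
            reply: reply_tx,
        })
        .unwrap();
        replies.push(reply_rx);
    }
    drop(tx); // close the queue so the engine thread exits once it is drained

    for reply in replies {
        println!("{}", reply.recv().unwrap());
    }
    engine.join().unwrap();
}
```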

@pabl-o-ce

Hi guys, llama.cpp can already handle parallel requests, but only in the server: -np N, --parallel N sets the number of slots for processing requests (default: 1).
