
running inference in parallel in multiple threads #565

Closed Answered by slaren
oppiliappan asked this question in Q&A


Most (all?) of the synchronization is done through spin locks, so using more threads than are physically available can severely degrade performance. You are likely to get better performance if you serialize the requests. I also suggest looking into batched decoding in llama.cpp; that should be the best way to process multiple sequences simultaneously.

Answer selected by YavorGIvanov