running inference in parallel in multiple threads #565
-
Some context about this:

I suppose my question here is: is it possible that there is some sort of internal contention going on when attempting to run multiple embeddings across threads (in the same process)? I should mention that this is a checkout of ggml from before GGUF. I've noticed this happen in the following configurations:

And if that is the case, could it be fixed by running separate processes altogether?
-
Most (all?) of the synchronization is done through spin locks, so using more threads than are physically available can have disastrous effects on performance. You are likely to get better performance if you serialize the requests. I also suggest looking into batched decoding in llama.cpp; that should be the best way to process multiple sequences simultaneously.
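As a minimal sketch of the serialization suggestion: guard the inference call with a single mutex so only one ggml graph is in flight at a time. Here `compute_embedding()` is a hypothetical stand-in for whatever embedding call the application makes, not a real ggml/llama.cpp function.

```cpp
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the application's embedding call;
// not a real ggml/llama.cpp function.
std::vector<float> compute_embedding(const std::string & text) {
    return {}; // real inference would happen here
}

static std::mutex g_infer_mutex;

// Worker threads can still prepare inputs and consume results in parallel;
// only the inference call itself is serialized.
std::vector<float> embed_serialized(const std::string & text) {
    std::lock_guard<std::mutex> lock(g_infer_mutex);
    return compute_embedding(text);
}
```

This keeps the spin-lock compute workers from ever being oversubscribed, because at most one graph's worth of threads is running at any moment.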
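For the batched-decoding route, here is a rough sketch against the `llama_batch` API in `llama.h` (available in llama.cpp builds after the batching refactor; field names may differ in older checkouts). It packs two prompts into one batch as two sequences so a single `llama_decode` call processes both; model and context setup are elided.

```cpp
#include "llama.h"

// Feed two prompts as two sequences of a single llama_batch, so one
// llama_decode call processes both. Model/context setup is elided.
static bool decode_two_sequences(llama_context * ctx,
                                 const llama_token * a, int n_a,
                                 const llama_token * b, int n_b) {
    llama_batch batch = llama_batch_init(n_a + n_b, /*embd =*/ 0, /*n_seq_max =*/ 1);
    batch.n_tokens = 0;

    // sequence 0: prompt a
    for (int i = 0; i < n_a; ++i) {
        const int j = batch.n_tokens++;
        batch.token   [j]    = a[i];
        batch.pos     [j]    = i;              // position within sequence 0
        batch.n_seq_id[j]    = 1;
        batch.seq_id  [j][0] = 0;
        batch.logits  [j]    = (i == n_a - 1); // output only for the last token
    }
    // sequence 1: prompt b
    for (int i = 0; i < n_b; ++i) {
        const int j = batch.n_tokens++;
        batch.token   [j]    = b[i];
        batch.pos     [j]    = i;              // position within sequence 1
        batch.n_seq_id[j]    = 1;
        batch.seq_id  [j][0] = 1;
        batch.logits  [j]    = (i == n_b - 1);
    }

    const bool ok = llama_decode(ctx, batch) == 0;
    llama_batch_free(batch);
    return ok;
}
```

The design point is that the sequences share one context and one thread pool: the per-token `seq_id` keeps their KV-cache entries separate, so the prompts are processed together without two graphs contending for the same cores.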