Why is llama_synchronize called?
#6385
Replies: 3 comments
-
This is something new since pipeline parallelism was implemented (at least for CUDA) in #6017.
The logits are actually returned only after the GPU is done; that is exactly what llama_synchronize ensures. At the end of decoding (lines 10030 to 10037 in 0308f5e), the logits are copied into the output buffer asynchronously. When the logits are retrieved (lines 15175 to 15176 in 0308f5e), llama_synchronize is called first, and only then are the specified logits extracted (line 15195 in 0308f5e). Any operation which returns the content of the output buffer calls llama_synchronize.
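To illustrate the synchronize-on-read pattern described above, here is a self-contained C++ sketch. The toy_* names are stand-ins invented for this example, not llama.cpp code; it only mimics the shape of llama_decode / llama_get_logits / llama_synchronize:

```cpp
#include <future>
#include <vector>

// Toy stand-in for a context whose outputs are produced asynchronously.
struct toy_context {
    std::future<void> pending;   // stand-in for the backend's async work queue
    std::vector<float> logits;   // the output buffer
};

// "decode": schedule the work and return immediately, like llama_decode.
void toy_decode(toy_context & ctx) {
    ctx.pending = std::async(std::launch::async, [&ctx] {
        // ... graph evaluation would happen here ...
        ctx.logits.assign(32000, 0.0f);  // async copy into the output buffer
    });
}

// "synchronize": block until all scheduled work is done, like llama_synchronize.
void toy_synchronize(toy_context & ctx) {
    if (ctx.pending.valid()) {
        ctx.pending.get();
    }
}

// Any accessor that returns the contents of the output buffer
// synchronizes first, as described above for llama_get_logits.
const float * toy_get_logits(toy_context & ctx) {
    toy_synchronize(ctx);
    return ctx.logits.data();
}

int main() {
    toy_context ctx;
    toy_decode(ctx);                             // returns before the work is done
    const float * logits = toy_get_logits(ctx);  // blocks here, then reads safely
    (void) logits;
    return 0;
}
```

The point is that toy_decode returns immediately; the cost of waiting is paid by whichever accessor touches the output buffer first.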
-
Thanks for the detailed explanation; that makes sense. I was wondering: how does the computation graph allow async GPU (CUDA) operations? If you were to build a graph for the Llama architecture, wouldn't all parts need to be executed sequentially? I am sure this is wrong, since llama.cpp would not implement it otherwise.
-
Async operations are queued into an asynchronous queue (in CUDA this is just a stream) and executed sequentially within it. The copy doesn't happen until the computation is completed.
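For a concrete picture of what "queued into a stream and executed sequentially" means, here is a minimal CUDA runtime sketch (a generic illustration, not llama.cpp's CUDA backend): both the kernel and the copy are enqueued on one stream and run in submission order, while the host only blocks at the explicit synchronization.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void compute(float * out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i * 2.0f;  // stand-in for graph evaluation
}

int main() {
    const int n = 1024;
    float * d_out;
    float h_out[n];  // (a truly async copy would need pinned memory via cudaMallocHost)
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&d_out, n * sizeof(float));

    // Both operations are queued on the same stream and return immediately
    // on the host. The stream executes them in order, so the copy cannot
    // start until the kernel has finished.
    compute<<<(n + 255) / 256, 256, 0, stream>>>(d_out, n);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    // The host blocks only here; this is the role llama_synchronize plays.
    cudaStreamSynchronize(stream);

    printf("h_out[1] = %f\n", h_out[1]);  // safe to read after the sync
    cudaFree(d_out);
    cudaStreamDestroy(stream);
    return 0;
}
```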
-
Hello all,
I was reading through the codebase and saw that llama_synchronize is called when the logits are retrieved.
During my work on inference, I noticed that after the model runs, any synchronizing operation blocks for some time before it can complete. If I add an explicit synchronization first, the later call obviously no longer blocks. However, this confuses me: why are the logits returned before the GPU is done "working"? What operations cause this? I would appreciate any help!
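For reference, a minimal sketch of the timing behavior being described, using the public llama.cpp API (model, context, and batch setup are elided; the comments mark where the blocking is expected to occur):

```cpp
// Sketch of where the wait shows up, using the public llama.cpp API.
#include "llama.h"

void run(llama_context * ctx, llama_batch batch) {
    llama_decode(ctx, batch);   // returns while the GPU may still be working

    // Without an explicit sync, the wait happens inside the first call
    // that reads the output buffer:
    //   float * logits = llama_get_logits(ctx);   // blocks here

    // With an explicit sync, the wait moves here instead:
    llama_synchronize(ctx);                   // blocks until pending work is done
    float * logits = llama_get_logits(ctx);   // now returns without waiting
    (void) logits;
}
```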
Edit: When I run a flamegraph, I get this: [flamegraph image]
It seems like avoiding the sync would be very beneficial!