Why is llama_synchronize called?
#6385
Replies: 3 comments
-
This is something new since pipeline parallelism was implemented (at least for CUDA) in #6017.
The logits are actually returned only after the GPU is done; that is exactly what llama_synchronize ensures. At the end of decoding (lines 10030 to 10037 in 0308f5e), the logits are copied into the output buffer asynchronously. When the logits are retrieved (lines 15175 to 15176 in 0308f5e), llama_synchronize is called first, and only then are the specified logits extracted (line 15195 in 0308f5e). Any operation which returns the content of the output buffer calls llama_synchronize.
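To illustrate the synchronize-on-read pattern described above, here is a self-contained C++ sketch. The toy_* names are stand-ins invented for this example, not llama.cpp code; it only mimics the shape of llama_decode / llama_get_logits / llama_synchronize:

```cpp
#include <future>
#include <vector>

// Toy stand-in for a context whose outputs are produced asynchronously.
struct toy_context {
    std::future<void> pending;   // stand-in for the backend's async work queue
    std::vector<float> logits;   // the output buffer
};

// "decode": schedule the work and return immediately, like llama_decode.
void toy_decode(toy_context & ctx) {
    ctx.pending = std::async(std::launch::async, [&ctx] {
        // ... graph evaluation would happen here ...
        ctx.logits.assign(32000, 0.0f);  // async copy into the output buffer
    });
}

// "synchronize": block until all scheduled work is done, like llama_synchronize.
void toy_synchronize(toy_context & ctx) {
    if (ctx.pending.valid()) {
        ctx.pending.get();
    }
}

// Any accessor that returns the contents of the output buffer
// synchronizes first, as described above for llama_get_logits.
const float * toy_get_logits(toy_context & ctx) {
    toy_synchronize(ctx);
    return ctx.logits.data();
}

int main() {
    toy_context ctx;
    toy_decode(ctx);                             // returns before the work is done
    const float * logits = toy_get_logits(ctx);  // blocks here, then reads safely
    (void) logits;
    return 0;
}
```

The point is that toy_decode returns immediately; the cost of waiting is paid by whichever accessor touches the output buffer first.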
-
Thanks for the detailed explanation; that makes sense. I was wondering: how does the computation graph allow async GPU (CUDA) operations? If you were to build a graph for the Llama architecture, wouldn't all parts need to be executed sequentially? I am sure this is wrong, since llama.cpp would not implement it otherwise.
-
Async operations are queued into an asynchronous queue (in CUDA this is just a stream) and executed sequentially within it. The copy doesn't happen until the computation is completed.
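For a concrete picture of what "queued into a stream and executed sequentially" means, here is a minimal CUDA runtime sketch (a generic illustration, not llama.cpp's CUDA backend): both the kernel and the copy are enqueued on one stream and run in submission order, while the host only blocks at the explicit synchronization.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void compute(float * out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i * 2.0f;  // stand-in for graph evaluation
}

int main() {
    const int n = 1024;
    float * d_out;
    float h_out[n];  // (a truly async copy would need pinned memory via cudaMallocHost)
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&d_out, n * sizeof(float));

    // Both operations are queued on the same stream and return immediately
    // on the host. The stream executes them in order, so the copy cannot
    // start until the kernel has finished.
    compute<<<(n + 255) / 256, 256, 0, stream>>>(d_out, n);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    // The host blocks only here; this is the role llama_synchronize plays.
    cudaStreamSynchronize(stream);

    printf("h_out[1] = %f\n", h_out[1]);  // safe to read after the sync
    cudaFree(d_out);
    cudaStreamDestroy(stream);
    return 0;
}
```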
-
Hello all,
I was reading through the codebase and saw that llama_synchronize is called when the logits are retrieved.
During my work on inference, I noticed that after the model runs, any synchronizing operation blocks for some time before it can complete. If I add an explicit synchronization first, the later call obviously no longer blocks. However, this confuses me: why are the logits returned before the GPU is done "working"? What operations cause this? I would appreciate any help!
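For reference, a minimal sketch of the timing behavior being described, using the public llama.cpp API (model, context, and batch setup are elided; the comments mark where the blocking is expected to occur):

```cpp
// Sketch of where the wait shows up, using the public llama.cpp API.
#include "llama.h"

void run(llama_context * ctx, llama_batch batch) {
    llama_decode(ctx, batch);   // returns while the GPU may still be working

    // Without an explicit sync, the wait happens inside the first call
    // that reads the output buffer:
    //   float * logits = llama_get_logits(ctx);   // blocks here

    // With an explicit sync, the wait moves here instead:
    llama_synchronize(ctx);                   // blocks until pending work is done
    float * logits = llama_get_logits(ctx);   // now returns without waiting
    (void) logits;
}
```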
Edit: When I run a flamegraph, I get this: [flamegraph image]
It seems like avoiding the sync would be very beneficial!