-
I compiled my RPC servers in debug mode and noticed that the first RPC backend specified in the list of RPC servers crashes because the number of nodes changes. Any idea why this would happen?
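In case it helps anyone reproduce this, a debug build of the RPC backend can be configured roughly like the sketch below (it assumes a recent llama.cpp checkout, where the RPC backend sits behind the `GGML_RPC` CMake option; the host, port, and target names are illustrative):

```sh
# Configure with debug symbols and the RPC backend enabled
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DGGML_RPC=ON
cmake --build build --target rpc-server llama-server

# Run the backend under gdb to get a backtrace on the crash
gdb --args ./build/bin/rpc-server -H 0.0.0.0 -p 50052
```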
-
Hello,
I'm having some issues with `llama-server` benchmarking with RPC backends. When running with local GPUs there are only some issues, but whenever `llama-server` is running with RPC, the RPC backend will crash with a `segmentation fault` after the second iteration. Any advice on how to get the segmentation faults to stop? I'm running the line below for the RPC backends:
I've tried messing around with the defragmentation and smaller batch sizes with `llama-server`, but it doesn't seem to help. The line I use to run my server is:
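For reference, a typical invocation looks something like this sketch (the host addresses, port, model path, and tuning values below are placeholders, not the exact values used in this report):

```sh
# On each RPC machine: start an RPC backend (placeholder host/port)
rpc-server -H 0.0.0.0 -p 50052

# On the head node: point llama-server at the RPC backends.
# --defrag-thold and -b/-ub are the defragmentation and batch-size
# knobs mentioned above.
llama-server -m ./model.gguf \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  -ngl 99 -b 512 -ub 256 --defrag-thold 0.1
```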
Below is a screenshot of my "successful" run using 8 local AMD GPUs. I noticed that the data sent and received is `0 B/s`. This is something I'd like to confirm with anyone, to see if they had similar results.

The only way I could get this to work was modifying `script.js` and adding logic to handle `sse.open` for when `llama-server` returns `[DONE]`.
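A minimal sketch of that kind of change (not the exact patch; it assumes `sse` is an `EventSource`, and relies on llama-server's OpenAI-compatible streaming endpoints terminating the stream with a literal `[DONE]` sentinel):

```js
// Sketch of the script.js change, not the exact patch.
// Assumes `sse` is an EventSource connected to a streaming endpoint.
sse.addEventListener("message", (event) => {
  // The OpenAI-compatible stream ends with a bare "[DONE]" sentinel
  // that is not JSON; parsing it throws, and an EventSource left open
  // will keep auto-reconnecting.
  if (event.data.trim() === "[DONE]") {
    sse.close(); // end of stream: close cleanly instead of reconnecting
    return;
  }
  const chunk = JSON.parse(event.data); // regular streamed token payload
  appendToken(chunk); // hypothetical helper that renders the token
});
```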
Below are a few lines that are generated when using `llama-server` without defragmentation. This is with using the 8 local GPUs only: