Am I being limited by single core CPU performance when fully offloaded? #5803

reversebias · 2024-02-29T23:02:04Z

reversebias
Feb 29, 2024

I'm using a system with the following hardware:

Xeon W-2133
96GB DDR4-2666
2x Nvidia P40 each in a 16x 3.0 slot

Running a Q6 quant of mixtral with:
./main -m ~/models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 32768 -ngl 128 -ts 39,61,0 -sm row -t 8 --prompt "Once upon a time"

And getting very decent 21-22 t/s:
llama_print_timings: sample time = 155.18 ms / 352 runs ( 0.44 ms per token, 2268.29 tokens per second) llama_print_timings: prompt eval time = 168.00 ms / 5 tokens ( 33.60 ms per token, 29.76 tokens per second) llama_print_timings: eval time = 16486.27 ms / 351 runs ( 46.97 ms per token, 21.29 tokens per second) llama_print_timings: total time = 16912.39 ms / 356 tokens

However I noticed that during generation, there's a single CPU thread pegged at 100%:

Single threaded is expected here, from looking at commits like #5238

But with the single thread at 100%, and nvidia-smi showing 50-60% utilization:

Could my single thread performance be my bottleneck?

reversebias · 2024-03-01T01:53:48Z

reversebias
Mar 1, 2024
Author

Found a response here that suggests it's spinning a thread and not doing anything, so not the bottleneck: #3210 (comment)

1 reply

ggerganov Mar 1, 2024
Maintainer

The thread spinning referenced in the comment is no longer occurring with latest ggml-backend updates

ExtReMLapin · 2024-03-06T12:42:03Z

ExtReMLapin
Mar 6, 2024

I feel like i'm having the same issue

0 replies

Artefact2 · 2024-03-09T14:41:47Z

Artefact2
Mar 9, 2024
Collaborator

Similar behaviour here on AMD/rocBLAS.

Most of the CPU time is spent in rocr::core::InterruptSignal::WaitRelaxed(hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t). Needless to say, this makes the CPU generate a lot of heat and consume extra power for nothing.

0 replies

moritzschaefer · 2024-03-10T16:31:02Z

moritzschaefer
Mar 10, 2024

I would have suspected that more CPUs should at least benefit initial tokenization (is that the load part of the timings?).

0 replies

jwhitehorn · 2024-04-30T01:00:52Z

jwhitehorn
Apr 30, 2024

I too am running into this behavior. Single CPU thread at 100%, and GPU under-utilized (about 20% utilization).

Not sure if it matters, but here are some details:

Debian 12 / 6.8.4 host
Dual Xeon E5 2697v2 CPUs
64GB ECC RAM (Quad-channel DDR3-1333)
Intel Arc A770 GPU
Llama.cpp (via llama-cpp-python 0.2.65) dockerized using the intel/oneapi-basekit:2024.1.0-devel-ubuntu22.04 image
I've tried different models (llama 2, llama 3, claude 2, etc), all fully offloaded to VRAM

I've tried using llama.cpp's main and server executables directly and can confirm that the fact that I'm running through Python seems immaterial to the issue.

If the CPU thread isn't the bottleneck, do we know why the GPU isn't more utilized?

0 replies

fidecastro · 2024-07-28T20:29:43Z

fidecastro
Jul 28, 2024

I'm also seeing similar behavior on my system. My situation might make it a bit more interesting a problem since I wasn't expecting to see both a single CPU thread and GPU maxing out, as I'm seeing here.

Currently running the latest llama.cpp release (b3484) on the following machine:

2x Xeon E5-2690v4 (total 28 cores, 56 threads)
128 GB DDR4 RAM 2133
4 x NVIDIA P40 (3 on 16x PCIe 3.0, 1 on 8x PCIe 3.0)

I compiled llama.cpp with CUDA support on a Docker container based on this Docker image (Ubuntu 22.04 with CUDA 12.5.1 installed), and which is running with full access to all host GPUs and CPU cores. I also installed flash-attention and am running q4_K_M llama3.1-70B-Instruct.

I'm calling llama-cli with the following terminal command: ./llama-cli -m /workspace/llama3.1/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf -ctk q4_0 -ctv q4_0 -fa -ngl 999 -e --chat-template llama3 -n 8000 --file prompt.txt

HTOP tells me there's a single CPU with 100% utilization, while my GPUs cycle between idle (3 of the 4 GPUs are always like that) and 100% use (only 1 GPU is in use at a given time).

1 reply

slaren Jul 29, 2024
Collaborator

You may get better results with -sm row with the P40.

stevenbedrick · 2024-07-30T23:26:20Z

stevenbedrick
Jul 30, 2024

Just chiming in that I am also seeing the same behavior! Model fully offloaded across multiple GPUs (A40s in my case), and then during generation I'm seeing a single CPU thread, pegged at 100, with the GPUs being underutilized. I've verified that the behavior persists regardless of split mode. 🤷‍♂️

0 replies

8XXD8 · 2024-07-31T14:54:10Z

8XXD8
Jul 31, 2024

I have the same problem, with 4X AMD GPUs, one thread maxed out, and low GPU utilization.
For me, --no-mmap helps.

1 reply

stevenbedrick Jul 31, 2024

Hmm, thanks; in my case that doesn't seem to make much of a difference. 🤷‍♂️

ClonedPoro · 2024-08-06T09:56:32Z

ClonedPoro
Aug 6, 2024

0 replies

AGenchev · 2024-08-14T14:41:11Z

AGenchev
Aug 14, 2024

My experimets: Performance on AMD APU 5600G (w BIOS GPU overclock)
I'm using same model in all tests (llama-13-v2/ggml-model-q8_0.gguf).
using Vulkan ~ 1.63 tokens per second (but strange - no load on the CPU).
using CPU - 2.86 tokens per second (around 599% load on CPU).
using RoCM HIPBLAS - 3.07 tokens per second (1 CPU core loaded at 100%).

So inference with Vulkan comes 'for free' using only the GPU. But it is slowest. Other frameworks are faster but load the CPU.

Haven't tried what it can do with OpenCL.

0 replies

KutayKoray · 2024-08-24T15:24:49Z

KutayKoray
Aug 24, 2024

Hi have you done any particular thing to use 2 gpu at the same time because i have 2 1660 super but i cant use both of them model is only recognize 1 gpu. When i serve ollama it detects both of my gpu's when i run nvidia-smi it detects both of my gpu's but when i run llama3.1 it only use one of them has any one faced with this problem or is this because of my gpu model?

0 replies

Fuckingnameless · 2024-08-24T21:00:01Z

Fuckingnameless
Aug 24, 2024

has always been the normal behavior for me on llamacpp when fully offloaded

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Am I being limited by single core CPU performance when fully offloaded? #5803

{{title}}

Replies: 12 comments 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Am I being limited by single core CPU performance when fully offloaded? #5803

Replies: 12 comments · 3 replies

reversebias Mar 1, 2024 Author

ggerganov Mar 1, 2024 Maintainer

Artefact2 Mar 9, 2024 Collaborator

slaren Jul 29, 2024 Collaborator

Replies: 12 comments 3 replies

reversebias
Mar 1, 2024
Author

ggerganov Mar 1, 2024
Maintainer

Artefact2
Mar 9, 2024
Collaborator

slaren Jul 29, 2024
Collaborator