Am I being limited by single core CPU performance when fully offloaded? #5803
-
Found a response here that suggests it's just spinning a thread and not doing anything useful, so it's not the bottleneck: #3210 (comment)
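Roughly, a polling wait like the one described there can pin a core at 100% without being the bottleneck. A minimal standalone C++ sketch of the pattern (not llama.cpp's actual code, just an illustration of why htop shows a busy core):

```cpp
// Busy-wait ("spin") polling loop: the core shows 100% in htop even though
// the thread does no useful computation. Illustration only.
#include <atomic>
#include <chrono>
#include <thread>

int main() {
    std::atomic<bool> work_ready{false};
    std::atomic<bool> stop{false};

    std::thread worker([&] {
        while (!stop.load(std::memory_order_relaxed)) {
            // Spin: repeatedly check the flag instead of sleeping.
            if (work_ready.exchange(false)) {
                /* ... the actual work would go here ... */
            }
        }
    });

    std::this_thread::sleep_for(std::chrono::seconds(5)); // one core sits at 100%
    stop = true;
    worker.join();
}
```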
-
I feel like I'm having the same issue.
-
Similar behaviour here on AMD/rocBLAS. Most of the CPU time is spent in …
-
I would have suspected that more CPUs should at least benefit initial tokenization (is that the …
-
I'm also seeing similar behavior on my system. My situation might make the problem a bit more interesting, since I wasn't expecting to see both a single CPU thread and a GPU maxing out, which is what I'm seeing here.

I'm running the latest llama.cpp release (b3484) on the following machine: 2x Xeon E5-2690v4 (28 cores, 56 threads total). I compiled llama.cpp with CUDA support in a Docker container based on this Docker image (Ubuntu 22.04 with CUDA 12.5.1 installed), running with full access to all host GPUs and CPU cores. I also installed flash-attention and am running a q4_K_M quant of llama3.1-70B-Instruct. I'm calling llama-cli with the following terminal command:

htop tells me there's a single CPU core at 100% utilization, while my GPUs cycle between idle (3 of the 4 GPUs are always idle) and 100% use (only 1 GPU is in use at any given time).
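For what it's worth, my understanding is that with the default layer split mode, whole layers are assigned to each GPU and a single sequence's forward pass walks them in order, so only one GPU computes at a time while the others wait for the intermediate activations. A rough C++ sketch of that scheduling (simplified, not llama.cpp's implementation):

```cpp
// Simplified sketch of layer-split ("-sm layer") scheduling across GPUs.
// Not llama.cpp's actual code: it only illustrates why, for a single
// sequence, the GPUs take turns instead of running in parallel.
#include <cstdio>
#include <vector>

struct Layer { int gpu; /* this layer's weights live on this device */ };

void forward(const std::vector<Layer>& layers) {
    int current_gpu = -1;
    for (const Layer& l : layers) {
        if (l.gpu != current_gpu) {
            // Hand the activations to the next GPU; the previous one goes idle.
            current_gpu = l.gpu;
            std::printf("switching to GPU %d\n", current_gpu);
        }
        // Run this layer's kernels on current_gpu; all other GPUs wait.
    }
}

int main() {
    // e.g. 80 layers split over 4 GPUs -> each GPU is busy roughly 25% of the time
    std::vector<Layer> layers;
    for (int i = 0; i < 80; ++i) layers.push_back({i / 20});
    forward(layers);
}
```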
-
Just chiming in that I am also seeing the same behavior! Model fully offloaded across multiple GPUs (A40s in my case), and during generation I'm seeing a single CPU thread pegged at 100%, with the GPUs underutilized. I've verified that the behavior persists regardless of split mode. 🤷♂️
-
I have the same problem with 4x AMD GPUs: one thread maxed out and low GPU utilization.
-
Are there any updates on this? Have you found a solution or workaround? I'm seeing the exact same behaviour with a fully offloaded model (still with dedicated memory to spare on the GPU): barely any GPU utilization, but 100% on CPU core 4 in my case (the rest are mostly idle).
-
My experiments: performance on an AMD APU 5600G (with a BIOS GPU overclock). Inference with Vulkan comes "for free", using only the GPU, but it is the slowest. Other frameworks are faster but load the CPU. I haven't tried what it can do with OpenCL.
-
Hi, have you done anything in particular to use 2 GPUs at the same time? I have 2x 1660 Super but I can't use both of them; only 1 GPU is recognized for the model. When I serve ollama it detects both of my GPUs, and when I run nvidia-smi it detects both of my GPUs, but when I run llama3.1 it only uses one of them. Has anyone faced this problem, or is it because of my GPU model?
-
This has always been the normal behavior for me on llama.cpp when fully offloaded.
-
I'm using a system with the following hardware:
Running a Q6 quant of mixtral with:
./main -m ~/models/mixtral-8x7b-instruct-v0.1.Q6_K.gguf -c 32768 -ngl 128 -ts 39,61,0 -sm row -t 8 --prompt "Once upon a time"
And I'm getting a very decent 21-22 t/s:
llama_print_timings: sample time      =   155.18 ms /   352 runs   (  0.44 ms per token, 2268.29 tokens per second)
llama_print_timings: prompt eval time =   168.00 ms /     5 tokens ( 33.60 ms per token,   29.76 tokens per second)
llama_print_timings: eval time        = 16486.27 ms /   351 runs   ( 46.97 ms per token,   21.29 tokens per second)
llama_print_timings: total time       = 16912.39 ms /   356 tokens
However, I noticed that during generation there's a single CPU thread pegged at 100%:
Single-threaded is expected here, judging from commits like #5238.
But with the single thread at 100% and nvidia-smi showing only 50-60% GPU utilization:
Could my single-thread performance be the bottleneck?
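My rough mental model (a simplified sketch under my own assumptions, not llama.cpp's actual code) is that with everything offloaded, the one CPU thread mostly enqueues the per-layer GPU calls and synchronizes once per token, so if the kernels are short, the per-call CPU overhead on that single core could be what caps GPU utilization:

```cpp
// Host-side sketch: a single CPU thread enqueues work and synchronizes,
// once per token. cudaMemsetAsync stands in for launching a layer's kernels;
// layer/token counts are made up. Illustration only, not llama.cpp's code.
#include <cuda_runtime.h>   // link with -lcudart
#include <cstdio>

int main() {
    const int n_layers = 80;   // assumed: one enqueue per layer
    const int n_tokens = 100;  // tokens to "generate"
    void* buf = nullptr;
    cudaMalloc(&buf, 1 << 20);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int t = 0; t < n_tokens; ++t) {
        for (int l = 0; l < n_layers; ++l) {
            // Each enqueue costs a few microseconds of CPU time on this one thread.
            cudaMemsetAsync(buf, 0, 1 << 20, stream);
        }
        // Wait for this token's work to finish before sampling the next token.
        cudaStreamSynchronize(stream);
    }
    std::printf("done\n");
    cudaStreamDestroy(stream);
    cudaFree(buf);
}
```

If that picture is right, faster single-core performance should translate into higher GPU utilization, since the GPU would spend less time waiting on the CPU between launches.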