Performance of llama.cpp on Snapdragon X Elite/Plus #8273
-
On my 32 GB Surface Pro 11, using LM Studio with 4 threads on Llama 3 Instruct 8B q4_k_m GGUF, I am seeing 12-20+ tok/s pretty consistently. Doh, will try bumping LM Studio to 10 threads. The Snapdragon arm64 release of LM Studio is here: https://lmstudio.ai/snapdragon I don't understand how llama.cpp projects are prioritized and queued, but LM Studio 0.3.0 (beta) supposedly already has some Snapdragon/NPU support (I'm on the waiting list for the beta bits). Excitedly anticipating future NPU support!
-
Update for Surface Laptop 7 / Snapdragon X Elite - it seems that the Elite utilizes the memory bandwidth better than the asymmetric Plus (for token generation):
build: cddae48 (3646)
-
What about the GPU and NPU backend?
-
I want to start a discussion on the performance of the new Qualcomm Snapdragon X, similar to the Apple M-series silicon discussion in #4167.
This post has been completely updated, because the power setting "best performance" IS needed. By default, only 4 of the 10 cores are used fully, which prevents thermal throttling but gives much lower performance.
I am agnostic about Apple/Intel/AMD/... or any discussion of Windows/macOS/Linux merits - please spare us any "religiosity" here on operating systems, etc. For me, it's important to have good tools, and I think running LLMs/SLMs locally via llama.cpp is important. We need good llama.cpp benchmarking to be able to decide. I am currently primarily a Mac user (MacBook Air M2, Mac Studio M2 Max), running macOS, Windows and Linux.
I just got a Surface Pro 11 with the X Plus, and these are my first benchmarks. The Surface Pros have always had thermal constraints, so I got a Plus rather than an Elite - yet even the Plus throttles quickly when its 10 CPU cores are fully used. Also, since llama.cpp has received Snapdragon-specific optimizations, I am NOT testing with build 8e672ef, but with the current build. I am still trying to produce results comparable to the Apple Silicon numbers in #4167.
Here are my results for my Surface Pro 11, Snapdragon(R) X 10-core X1P64100 @ 3.40 GHz, 16 GB RAM, running Windows 11 Enterprise 22H2 26100.1000. With only 16 GB, I could not test fp16, since it swaps.
llama-bench was run with -t 10 for Q8_0, and then, after a bit of cool-down, for Q4_0 (the throttled numbers were 40% (!) lower). F16 swaps with 16 GB RAM, so it's not included.
build: a27152b (3285)
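For reference, the runs above can be reproduced with invocations like the following (the model paths are placeholders, not the exact files used here):

```shell
# Hypothetical model paths - substitute your local GGUF files.
MODEL_Q8=./models/llama-2-7b.Q8_0.gguf
MODEL_Q4=./models/llama-2-7b.Q4_0.gguf

# Q8_0 run using all 10 cores:
./llama-bench -m "$MODEL_Q8" -t 10

# Let the SoC cool down before the Q4_0 run,
# otherwise thermal throttling skews the numbers:
sleep 300
./llama-bench -m "$MODEL_Q4" -t 10
```

The cool-down pause matters on the Surface Pro: back-to-back runs reproduce the up-to-40%-lower throttled numbers mentioned above.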
Update: Results for a Snapdragon X Elite (Surface Laptop 7 15"):
build: cddae48 (3646)
I think the new Qualcomm chips are interesting; the numbers are a bit faster than my M2 MacBook Air in CPU-only mode - feedback welcome!
It's early in the life of this SoC, as well as of Windows on arm64, and a lot of optimizations are still needed. There is no GPU/NPU support (yet), and Windows/gcc arm64 support is still work in progress. DirectML, QNN and ONNX seem to be the main optimization focus for Microsoft/Qualcomm; I will look into this later (maybe the llama.cpp QNN backend of #7541 would also help or be a starting point). So this is work in progress.
I tested 2 llama.cpp build methods for Windows with MSVC, and the method in https://www.qualcomm.com/developer/blog/2024/04/big-performance-boost-for-llama-cpp-and-chatglm-cpp-with-windows got me slightly better results than the build method in #7191. I still need to test building with clang, but I expect little difference, since clang targets the MSVC ABI on Windows.
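As a rough sketch, an MSVC arm64 build looks like this (a generic CMake invocation; the exact flags in the Qualcomm blog post and in #7191 may differ):

```shell
# From a VS 2022 Developer Command Prompt with arm64 tools installed:
cmake -B build -G "Visual Studio 17 2022" -A ARM64
cmake --build build --config Release -j 10
# The benchmark binary then lands under build/bin/Release/
```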
Another update/extension: with WSL2/gcc using 10 CPUs / 8 GB RAM and Ubuntu 24.04, the numbers are very similar (all dependent on cool-downs/throttling):
build: a27152b (3285)
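A minimal sketch of that WSL2 setup, assuming a stock Ubuntu 24.04 instance (the resource limits go into .wslconfig on the Windows side; the model path is a placeholder):

```shell
# %UserProfile%\.wslconfig (run `wsl --shutdown` afterwards to apply):
#   [wsl2]
#   processors=10
#   memory=8GB

# Inside Ubuntu 24.04: install the gcc toolchain and build llama.cpp:
sudo apt-get update && sudo apt-get install -y build-essential cmake git
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j 10
./build/bin/llama-bench -m /path/to/model.gguf -t 10
```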