Performance of llama.cpp on Snapdragon X Elite/Plus #8273
-
On my 32 GB Surface Pro 11, using LM Studio with 4 threads on Llama 3 Instruct 8B q4_k_m GGUF, I am seeing 12-20+ tok/s pretty consistently. Doh, will try bumping LM Studio to 10 threads. The Snapdragon arm64 release of LM Studio is here: https://lmstudio.ai/snapdragon I don't understand how llama.cpp projects are prioritized and queued, but LM Studio 0.3.0 (beta) supposedly already has some Snapdragon/NPU support (I'm on the waiting list for the beta bits). Excitedly anticipating future NPU support!
-
Update for Surface Laptop 7 / Snapdragon X Elite - it seems that the Elite utilizes the memory bandwidth better than the asymmetric Plus (for token generation):
build: cddae48 (3646)
-
What about the GPU and NPU backend?
-
I want to start a discussion on the performance of the new Qualcomm Snapdragon X, similar to the Apple M-series silicon discussion in #4167.
This post has been completely updated, because the power setting "best performance" IS needed. By default, only 4 of the 10 cores are used fully, which prevents thermal throttling but gives much lower performance.
I am agnostic about Apple/Intel/AMD/... or any discussion of Windows/macOS/Linux merits - please spare us any "religiosity" here on operating systems, etc. For me, it's important to have good tools, and I think running LLMs/SLMs locally via llama.cpp is important. We need good llama.cpp benchmarking to be able to decide. I am currently primarily a Mac user (MacBook Air M2, Mac Studio M2 Max), running macOS, Windows and Linux.
I just got a Surface Pro 11 with the X Plus, and these are my first benchmarks. The Surface Pros have always had thermal constraints, so I got a Plus rather than an Elite - yet even the Plus throttles quickly when its 10 CPU cores are fully used. Also, since llama.cpp has received Snapdragon-specific optimizations, I am NOT testing with build 8e672ef, but with the current build. I am still trying to produce results comparable to the Apple Silicon numbers in #4167.
Here are my results for my Surface Pro 11, Snapdragon(R) X 10-core X1P64100 @ 3.40 GHz, 16 GB RAM, running Windows 11 Enterprise 22H2 26100.1000. With only 16 GB, I could not test fp16, since it swaps.
llama-bench was run with -t 10 for Q8_0, and then, after a bit of cool-down, for Q4_0 (the throttled numbers were 40% (!) lower). F16 swaps with 16 GB RAM, so it's not included.
build: a27152b (3285)
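For reference, the runs above can be reproduced with invocations like the following (the model paths are placeholders, not the exact files used here):

```shell
# Hypothetical model paths - substitute your local GGUF files.
MODEL_Q8=./models/llama-2-7b.Q8_0.gguf
MODEL_Q4=./models/llama-2-7b.Q4_0.gguf

# Q8_0 run using all 10 cores:
./llama-bench -m "$MODEL_Q8" -t 10

# Let the SoC cool down before the Q4_0 run,
# otherwise thermal throttling skews the numbers:
sleep 300
./llama-bench -m "$MODEL_Q4" -t 10
```

The cool-down pause matters on the Surface Pro: back-to-back runs reproduce the up-to-40%-lower throttled numbers mentioned above.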
Update: Results for a Snapdragon X Elite (Surface Laptop 7 15"):
build: cddae48 (3646)
I think the new Qualcomm chips are interesting; the numbers are a bit faster than my M2 MacBook Air in CPU-only mode - feedback welcome!
It's early in the life of this SoC, as well as of Windows on arm64, and a lot of optimizations are still needed. There is no GPU/NPU support (yet), and Windows/gcc arm64 support is still work in progress. DirectML, QNN and ONNX seem to be the main optimization focus for Microsoft/Qualcomm; I will look into this later (maybe the llama.cpp QNN backend of #7541 would also help or be a starting point). So this is work in progress.
I tested 2 llama.cpp build methods for Windows with MSVC, and the method in https://www.qualcomm.com/developer/blog/2024/04/big-performance-boost-for-llama-cpp-and-chatglm-cpp-with-windows got me slightly better results than the build method in #7191. I still need to test building with clang, but I expect little difference, since clang targets the MSVC ABI on Windows.
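As a rough sketch, an MSVC arm64 build looks like this (a generic CMake invocation; the exact flags in the Qualcomm blog post and in #7191 may differ):

```shell
# From a VS 2022 Developer Command Prompt with arm64 tools installed:
cmake -B build -G "Visual Studio 17 2022" -A ARM64
cmake --build build --config Release -j 10
# The benchmark binary then lands under build/bin/Release/
```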
Another update/extension: with WSL2/gcc using 10 CPUs / 8 GB RAM and Ubuntu 24.04, the numbers are very similar (all dependent on cool-downs/throttling):
build: a27152b (3285)
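A minimal sketch of that WSL2 setup, assuming a stock Ubuntu 24.04 instance (the resource limits go into .wslconfig on the Windows side; the model path is a placeholder):

```shell
# %UserProfile%\.wslconfig (run `wsl --shutdown` afterwards to apply):
#   [wsl2]
#   processors=10
#   memory=8GB

# Inside Ubuntu 24.04: install the gcc toolchain and build llama.cpp:
sudo apt-get update && sudo apt-get install -y build-essential cmake git
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j 10
./build/bin/llama-bench -m /path/to/model.gguf -t 10
```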