Speeding up 30-65B models #394
Replies: 5 comments 16 replies
-
llama.cpp is based on ggml, which does inference on the CPU. Having hybrid GPU support would be great for accelerating some of the operations, but it would mean adding dependencies on a GPU compute framework and/or vendor libraries. llama.cpp doesn't scale that well with many threads; I'd recommend keeping the number of threads at or below the number of actual cores (not counting hyper-threaded "cores"). On most recent x86-64 CPUs, a value between 4 and 6 seems to work best.
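For example, something along these lines (the model path and prompt are just placeholders; check `./main --help` on your build for the exact flag names):

```sh
# -t sets the number of threads; keep it at or below the physical core count.
# Model path and prompt below are examples, not a recommendation.
./main -m ./models/30B/ggml-model-q4_0.bin -t 6 -n 128 -p "Tell me a joke."
```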
-
@x02Sylvie @Piezoid I did some experimenting of my own with different thread counts. My benchmark command was:

Results below are in milliseconds per token generated:

Six threads seems to be the sweet spot for llama.cpp, as mentioned by @Piezoid, but it likely varies with the CPU and a billion other factors. I might run this again with longer outputs and more samples, but anything above 6 threads on this CPU seems to be a waste (it sits at 100% utilisation regardless, with little to no performance gain). And if you're curious, the jokes were awful.
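In case anyone wants to repeat the sweep, here is a rough sketch of the kind of loop involved (not my exact command; the model path, prompt, and seed are placeholders, and the grep assumes the timing summary printed at the end includes a "per token" line, as it does in the builds I've seen):

```sh
# Hypothetical thread-count sweep; adjust the model path and prompt to your setup.
for t in 2 4 6 8 12 16; do
  echo "=== threads: $t ==="
  ./main -m ./models/30B/ggml-model-q4_0.bin -t "$t" -n 64 -s 42 \
         -p "Tell me a joke." 2>&1 | grep -i "per token"
done
```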
-
Even with the extra dependencies, it would be revolutionary if llama.cpp/ggml supported a hybrid GPU mode. The cost of a machine capable of running big models would be significantly lower. A gaming laptop with an RTX 3070 and 64 GB of RAM costs around $1800, and it could potentially run 16-bit LLaMA 30B with acceptable performance.
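(As a rough sanity check on the memory side: 30B parameters at 2 bytes each in fp16 is about 60 GB of weights, so it would be a tight but plausible fit in 64 GB of RAM once the OS and the context cache are accounted for.)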
-
Has anyone had any luck with compiler flags? I'd imagine you could enable more aggressive optimisation levels (-O3, or -Ofast if you're willing to relax strict floating-point semantics), as well as flags that tell the compiler to use instruction sets your CPU supports but that aren't enabled by default, since they would break support for CPUs that don't have them. I'm going to play around with this a little bit.
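Roughly what I'm planning to try, assuming a GCC/Clang toolchain on Linux. The base flag values below are an educated guess at the project defaults; copy the real ones from the Makefile and just append -march=native / -mtune=native:

```sh
# See which SIMD extensions this CPU actually reports (Linux).
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E 'avx|fma|f16c|sse'

# Rebuild with native tuning; -march=native lets the compiler use every
# instruction set this machine supports (AVX2, FMA, F16C, ...), at the cost
# of producing a binary that may not run on older CPUs.
make clean
make CFLAGS="-I. -O3 -DNDEBUG -std=c11 -fPIC -march=native -mtune=native" \
     CXXFLAGS="-I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -march=native -mtune=native"
```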
-
Running the 30B LLaMA model 4-bit quantized with about 75% RAM utilisation (confirming it's not a swap overhead issue), tokens generate at about 700-800 ms each with the CPU maxed out and the thread count maxed as well, which is by no means terrible but could be better.
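(For a sense of scale, 700-800 ms per token works out to roughly 1.25-1.4 tokens per second.)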
Is there any way, as of now, to improve the performance?
And speaking of the future, is there any hope of GPU utilisation during the autocompletion process? I have a powerhouse of a GPU sitting idly, watching the CPU do all the work 😁.
PS - I don't want to sit around and let others do the work, and I want to try to contribute in the future, so watch out for that!