Speeding up 30-65B models #394
Replies: 5 comments 16 replies
-
llama.cpp is based on ggml, which does inference on the CPU. Having hybrid GPU support would be great for accelerating some of the operations, but it would mean adding dependencies on a GPU compute framework and/or vendor libraries. llama.cpp doesn't scale that well with many threads; I'd recommend keeping the number of threads at or below the number of actual cores (not counting hyper-threaded "cores"). On most recent x86-64 CPUs, a value between 4 and 6 seems to work best.
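For example, something along these lines (the model path and prompt are just placeholders; check `./main --help` on your build for the exact flag names):

```sh
# -t sets the number of threads; keep it at or below the physical core count.
# Model path and prompt below are examples, not a recommendation.
./main -m ./models/30B/ggml-model-q4_0.bin -t 6 -n 128 -p "Tell me a joke."
```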
-
@x02Sylvie @Piezoid I did some experimenting of my own with different thread counts. My benchmark command was:

Results below are in milliseconds per token generated:

Six threads seems to be the sweet spot for llama.cpp, as mentioned by @Piezoid, but it likely varies with the CPU and a billion other factors. I might run this again with longer outputs and more samples, but anything above 6 threads on this CPU seems to be a waste (it sits at 100% utilisation regardless, with little to no performance gain). And if you're curious, the jokes were awful.
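In case anyone wants to repeat the sweep, here is a rough sketch of the kind of loop involved (not my exact command; the model path, prompt, and seed are placeholders, and the grep assumes the timing summary printed at the end includes a "per token" line, as it does in the builds I've seen):

```sh
# Hypothetical thread-count sweep; adjust the model path and prompt to your setup.
for t in 2 4 6 8 12 16; do
  echo "=== threads: $t ==="
  ./main -m ./models/30B/ggml-model-q4_0.bin -t "$t" -n 64 -s 42 \
         -p "Tell me a joke." 2>&1 | grep -i "per token"
done
```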
-
Even with the extra dependencies, it would be revolutionary if llama.cpp/ggml supported a hybrid GPU mode. The cost of a machine capable of running big models would be significantly lower. A gaming laptop with an RTX 3070 and 64 GB of RAM costs around $1800, and it could potentially run 16-bit LLaMA 30B with acceptable performance.
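(As a rough sanity check on the memory side: 30B parameters at 2 bytes each in fp16 is about 60 GB of weights, so it would be a tight but plausible fit in 64 GB of RAM once the OS and the context cache are accounted for.)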
-
Has anyone had any luck with compiler flags? I'd imagine you could enable more aggressive optimisation levels (-O3, or -Ofast if you're willing to relax strict floating-point semantics), as well as flags that tell the compiler to use instruction sets your CPU supports but that aren't enabled by default, since they would break support for CPUs that don't have them. I'm going to play around with this a little bit.
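Roughly what I'm planning to try, assuming a GCC/Clang toolchain on Linux. The base flag values below are an educated guess at the project defaults; copy the real ones from the Makefile and just append -march=native / -mtune=native:

```sh
# See which SIMD extensions this CPU actually reports (Linux).
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E 'avx|fma|f16c|sse'

# Rebuild with native tuning; -march=native lets the compiler use every
# instruction set this machine supports (AVX2, FMA, F16C, ...), at the cost
# of producing a binary that may not run on older CPUs.
make clean
make CFLAGS="-I. -O3 -DNDEBUG -std=c11 -fPIC -march=native -mtune=native" \
     CXXFLAGS="-I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -march=native -mtune=native"
```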
-
Running the 30B LLaMA model 4-bit quantized with about 75% RAM utilisation (confirming it's not a swap overhead issue), tokens generate at about 700-800 ms each with the CPU maxed out and the thread count maxed as well, which is by no means terrible but could be better.
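(For a sense of scale, 700-800 ms per token works out to roughly 1.25-1.4 tokens per second.)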
Is there any way, as of now, to improve the performance?
And speaking of the future, is there any hope of GPU utilisation during the autocompletion process? I have a powerhouse of a GPU sitting idly, watching the CPU do all the work 😁.
PS - I don't want to sit around and let others do the work, and I want to try to contribute in the future, so watch out for that!