This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

CUDA decoding #315

Closed
jafioti opened this issue Jun 18, 2023 · 16 comments
Labels
issue:enhancement New feature or request topic:backend-support Support for alternate non-GGML backends, or for particular GGML backend features

Comments

@jafioti
Contributor

jafioti commented Jun 18, 2023

Hey all, great work on integrating CUDA support for the prompt tokens. How much work would it be to support GPU decoding? Currently with llama.cpp I can reach about 35 tokens per second on LLaMA 7B on a 2080 Super, and I'd love to get somewhere near that in Rust!

Please lmk if there's anything I can do to help this effort.

@LLukas22
Contributor

LLukas22 commented Jun 18, 2023

I'm currently working on adding CUDA acceleration, and it's already in place for the LLaMA architecture. If you're interested in giving it a go, you can check out the branch I'm working on here: https://github.com/LLukas22/llm/tree/cublas-clblast-support

For a test drive, here's a command you can use:

cargo run --release --features cublas -- llama infer -m "C:\Users\lkreu\Downloads\wizardlm-30b.ggmlv3.q4_1.bin" --accelerator-layers 40 --batch-size 512  -p "Write me a short story about a llama riding a crab:"

Now I'm in the process of implementing acceleration for the other architectures. But there's a hiccup: some GGML operations don't have CUDA support, which means I have to run some parts on the CPU. My goal is to work around this without having to tweak the existing model implementations. If you've got ideas on how to tackle this, I'm all ears!
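
For illustration, here's a minimal sketch of the kind of dispatch being described: offload an op only when a CUDA kernel exists for it, and fall back to the CPU otherwise, so the model implementations stay untouched. This is not the actual GGML/llm API; the op names and the "supported" set are placeholder assumptions.

// Hypothetical sketch, not the real llm/GGML API: route each graph op to
// CUDA only if a kernel exists for it, otherwise fall back to the CPU,
// so the per-model graph-building code never has to change.
#[derive(Debug, Clone, Copy)]
enum Op { MatMul, RmsNorm, Rope, Alibi }

#[derive(Debug, PartialEq)]
enum Backend { Cpu, Cuda }

fn pick_backend(op: Op, cuda_available: bool) -> Backend {
    // Assumed set of CUDA-accelerated ops; everything else silently runs on the CPU.
    let has_cuda_kernel = matches!(op, Op::MatMul | Op::RmsNorm | Op::Rope);
    if cuda_available && has_cuda_kernel { Backend::Cuda } else { Backend::Cpu }
}

fn main() {
    for op in [Op::MatMul, Op::Alibi] {
        println!("{:?} -> {:?}", op, pick_backend(op, true));
    }
}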

@philpax philpax added issue:enhancement New feature or request topic:backend-support Support for alternate non-GGML backends, or for particular GGML backend features labels Jun 19, 2023
@jafioti
Contributor Author

jafioti commented Jun 19, 2023

@LLukas22 This is awesome, thanks for the link. I tried it out on my GPU, and it's a lot faster than pure CPU inference. It does seem to be quite a bit slower than llama.cpp, though (maybe a quarter of the speed; I'll run measurements). Is it because it's doing a CPU sync after every token to run the callback function?

@jafioti
Contributor Author

jafioti commented Jun 19, 2023

Some quick stats:
Model: llama-7B
llama.cpp - 23.57 ms per token
llm cublas-clblast-support - 86.39 ms per token

llama.cpp stat output:

llama_print_timings:        load time =   971.34 ms
llama_print_timings:      sample time =   174.34 ms /   424 runs   (    0.41 ms per token)
llama_print_timings: prompt eval time =   187.43 ms /    10 tokens (   18.74 ms per token)
llama_print_timings:        eval time =  9970.32 ms /   423 runs   (   23.57 ms per token)
llama_print_timings:       total time = 10414.72 ms

llm cublas-clblast-support stat output:

feed_prompt_duration: 209ms
prompt_tokens: 10
predict_duration: 9158ms
predict_tokens: 106
per_token_duration: 86.396ms
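
For reference, both per-token figures follow directly from the totals above; a quick arithmetic check in Rust:

// Sanity check of the per-token numbers quoted above (pure arithmetic).
fn main() {
    // llama.cpp: eval time / eval runs
    println!("llama.cpp: {:.2} ms per token", 9970.32_f64 / 423.0); // ~23.57
    // llm (cublas-clblast-support branch): predict_duration / predict_tokens
    println!("llm:       {:.3} ms per token", 9158.0_f64 / 106.0);  // ~86.396
}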

@LLukas22
Contributor

Well, it's still in a pre-draft stage. I'm guessing the creation of a new eval context on each call kills the performance, but that's something to optimize once we get acceleration working for all models. There was also a lot of work done on the Metal branch, which will probably help us close the gap a bit as well.

@jafioti
Contributor Author

jafioti commented Jun 21, 2023

@LLukas22 Understandable, is there anything I can do to help this along?

@LLukas22
Contributor

I'm currently waiting on @philpax to review the Metal PR. If that gets merged, we can start to integrate CUDA acceleration. Until then, we could think about how to support architectures that use functions which aren't yet CUDA-accelerated, or we could start implementing those functions as CUDA kernels in GGML/llama.cpp.

@malv-c

malv-c commented Jun 25, 2023

Can I expect to test it on my Orin AGX 32G as an EFI app?

@LLukas22
Contributor

@malv-c I don't know what you mean by an EFI app. But if you can set up CUDA, you should be able to compile it for an ARM-based system. We may have to adjust the build.rs script to enable building with CUDA acceleration on ARM, though 🤔
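
For illustration, here's a hypothetical build.rs sketch of the kind of adjustment that might be needed; the CUDA library paths and the cublas feature name are assumptions, not the branch's actual build script.

// Hypothetical build.rs sketch: pick a CUDA library directory per target
// architecture and emit the link directives Cargo expects.
fn main() {
    // Only relevant when the (assumed) `cublas` feature is enabled.
    if std::env::var("CARGO_FEATURE_CUBLAS").is_err() {
        return;
    }

    let target_arch = std::env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();

    // Assumed install locations: JetPack-style aarch64 layout vs. the
    // common x86_64 default. Adjust to the actual toolkit paths.
    let cuda_lib_dir = if target_arch == "aarch64" {
        "/usr/local/cuda/targets/aarch64-linux/lib"
    } else {
        "/usr/local/cuda/lib64"
    };

    println!("cargo:rustc-link-search=native={cuda_lib_dir}");
    println!("cargo:rustc-link-lib=cublas");
    println!("cargo:rustc-link-lib=cudart");
}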

@malv-c

malv-c commented Jun 25, 2023

An EFI app, to run the LLM seriously without wasting resources on a full Ubuntu install:
https://github.com/rust-osdev/uefi-rs

@jafioti
Contributor Author

jafioti commented Jun 25, 2023

@malv-c I'm still not sure how this relates to running language models. Why would UEFI calls help speed up LLMs?

@malv-c

malv-c commented Jun 25, 2023

Running the LLM without loading an OS is better than LLM + OS.

@jafioti
Contributor Author

jafioti commented Jun 25, 2023

That's impractical.

  • Latency and compute constraints don't come from the OS; they come from the model size, the CUDA kernels, and CPU speed.
  • CUDA is usually tightly integrated with the OS as well, so you wouldn't be able to use the GPU.
  • The LLM isn't the only thing running: there needs to be some way to interact with it, usually a web server queuing up prompts or some terminal interface. How would that work without an OS?

@LLukas22
Contributor

@malv-c I agree with @jafioti. The scope of this project is to provide a good, fast, and easy-to-use LLM library. If you want, you can tinker around a bit and see if you can get it running as an EFI app, but I think the GGML backend we're using will be very problematic to get running correctly.

@malv-c

malv-c commented Jun 26, 2023 via email

@malv-c

malv-c commented Jun 26, 2023 via email

@LLukas22
Contributor

Implemented with #325
