This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

CUDA decoding #315

Closed
jafioti opened this issue Jun 18, 2023 · 16 comments
Labels
issue:enhancement New feature or request topic:backend-support Support for alternate non-GGML backends, or for particular GGML backend features

Comments

@jafioti
Contributor

jafioti commented Jun 18, 2023

Hey all, great work on integrating CUDA support for the prompt tokens. How much work would it be to support GPU decoding? Currently with llama.cpp I can reach about 35 tokens per second on LLaMA 7B on a 2080 Super, and I'd love to get somewhere near that in Rust!

Please lmk if there's anything I can do to help this effort.

@LLukas22
Contributor

LLukas22 commented Jun 18, 2023

I'm currently working on adding CUDA acceleration, and it's already in place for the LLaMA architecture. If you're interested in giving it a go, you can check out the branch I'm working on here: https://github.com/LLukas22/llm/tree/cublas-clblast-support

For a test drive, here's a command you can use:

cargo run --release --features cublas -- llama infer -m "C:\Users\lkreu\Downloads\wizardlm-30b.ggmlv3.q4_1.bin" --accelerator-layers 40 --batch-size 512  -p "Write me a short story about a llama riding a crab:"

Now I'm in the process of implementing acceleration for the other architectures. But there's a hiccup: some GGML operations don't have CUDA support, which means I have to run some parts on the CPU. My goal is to work around this without having to tweak the existing model implementations. If you've got ideas on how to tackle this, I'm all ears!
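
For illustration, here's a minimal sketch of the kind of dispatch being described: offload an op only when a CUDA kernel exists for it, and fall back to the CPU otherwise, so the model implementations stay untouched. This is not the actual GGML/llm API; the op names and the "supported" set are placeholder assumptions.

// Hypothetical sketch, not the real llm/GGML API: route each graph op to
// CUDA only if a kernel exists for it, otherwise fall back to the CPU,
// so the per-model graph-building code never has to change.
#[derive(Debug, Clone, Copy)]
enum Op { MatMul, RmsNorm, Rope, Alibi }

#[derive(Debug, PartialEq)]
enum Backend { Cpu, Cuda }

fn pick_backend(op: Op, cuda_available: bool) -> Backend {
    // Assumed set of CUDA-accelerated ops; everything else silently runs on the CPU.
    let has_cuda_kernel = matches!(op, Op::MatMul | Op::RmsNorm | Op::Rope);
    if cuda_available && has_cuda_kernel { Backend::Cuda } else { Backend::Cpu }
}

fn main() {
    for op in [Op::MatMul, Op::Alibi] {
        println!("{:?} -> {:?}", op, pick_backend(op, true));
    }
}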

@philpax philpax added issue:enhancement New feature or request topic:backend-support Support for alternate non-GGML backends, or for particular GGML backend features labels Jun 19, 2023
@jafioti
Contributor Author

jafioti commented Jun 19, 2023

@LLukas22 This is awesome, thanks for the link. I tried it out on my GPU, and it's a lot faster than pure CPU inference. It does seem to be quite a bit slower than llama.cpp, though (maybe a quarter of the speed; I'll run measurements). Is it because it's doing a CPU sync after every token to run the callback function?

@jafioti
Contributor Author

jafioti commented Jun 19, 2023

Some quick stats:
Model: llama-7B
llama.cpp - 23.57 ms per token
llm cublas-clblast-support - 86.39 ms per token

llama.cpp stat output:

llama_print_timings:        load time =   971.34 ms
llama_print_timings:      sample time =   174.34 ms /   424 runs   (    0.41 ms per token)
llama_print_timings: prompt eval time =   187.43 ms /    10 tokens (   18.74 ms per token)
llama_print_timings:        eval time =  9970.32 ms /   423 runs   (   23.57 ms per token)
llama_print_timings:       total time = 10414.72 ms

llm cublas-clblast-support stat output:

feed_prompt_duration: 209ms
prompt_tokens: 10
predict_duration: 9158ms
predict_tokens: 106
per_token_duration: 86.396ms
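
For reference, both per-token figures follow directly from the totals above; a quick arithmetic check in Rust:

// Sanity check of the per-token numbers quoted above (pure arithmetic).
fn main() {
    // llama.cpp: eval time / eval runs
    println!("llama.cpp: {:.2} ms per token", 9970.32_f64 / 423.0); // ~23.57
    // llm (cublas-clblast-support branch): predict_duration / predict_tokens
    println!("llm:       {:.3} ms per token", 9158.0_f64 / 106.0);  // ~86.396
}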

@LLukas22
Contributor

Well, it's still in a pre-draft stage. I'm guessing the creation of a new eval context on each call kills the performance, but that's something to optimize once we get acceleration working for all models. There was also a lot of work done on the Metal branch, which will probably help us close the gap a bit as well.

@jafioti
Contributor Author

jafioti commented Jun 21, 2023

@LLukas22 Understandable, is there anything I can do to help this along?

@LLukas22
Contributor

I'm currently waiting on @philpax to review the Metal PR. If that gets merged, we can start to integrate CUDA acceleration. Until then, we could think about how to support architectures that use functions which aren't yet CUDA-accelerated, or we could start implementing those functions as CUDA kernels in GGML/llama.cpp.

@malv-c

malv-c commented Jun 25, 2023

Can I expect to test it on my Orin AGX 32G as an EFI app?

@LLukas22
Contributor

@malv-c I don't know what you mean by an EFI app. But if you can set up CUDA, you should be able to compile it for an ARM-based system. We may have to adjust the build.rs script to enable building with CUDA acceleration on ARM, though 🤔
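
For illustration, here's a hypothetical build.rs sketch of the kind of adjustment that might be needed; the CUDA library paths and the cublas feature name are assumptions, not the branch's actual build script.

// Hypothetical build.rs sketch: pick a CUDA library directory per target
// architecture and emit the link directives Cargo expects.
fn main() {
    // Only relevant when the (assumed) `cublas` feature is enabled.
    if std::env::var("CARGO_FEATURE_CUBLAS").is_err() {
        return;
    }

    let target_arch = std::env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();

    // Assumed install locations: JetPack-style aarch64 layout vs. the
    // common x86_64 default. Adjust to the actual toolkit paths.
    let cuda_lib_dir = if target_arch == "aarch64" {
        "/usr/local/cuda/targets/aarch64-linux/lib"
    } else {
        "/usr/local/cuda/lib64"
    };

    println!("cargo:rustc-link-search=native={cuda_lib_dir}");
    println!("cargo:rustc-link-lib=cublas");
    println!("cargo:rustc-link-lib=cudart");
}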

@malv-c

malv-c commented Jun 25, 2023

An EFI app, to run the LLM seriously without wasting resources on a full Ubuntu install:
https://github.com/rust-osdev/uefi-rs

@jafioti
Contributor Author

jafioti commented Jun 25, 2023

@malv-c I'm still not sure how this relates to running language models. Why would UEFI calls help speed up LLMs?

@malv-c

malv-c commented Jun 25, 2023

Running the LLM without loading an OS is better than LLM + OS.

@jafioti
Contributor Author

jafioti commented Jun 25, 2023

That's impractical.

  • Latency and compute constraints don't come from the OS; they come from the model size, the CUDA kernels, and CPU speed.
  • CUDA is usually tightly integrated with the OS as well, so you wouldn't be able to use the GPU.
  • The LLM isn't the only thing running: there needs to be some way to interact with it, usually a web server queuing up prompts or some terminal interface. How would that work without an OS?

@LLukas22
Contributor

@malv-c I agree with @jafioti. The scope of this project is to provide a good, fast, and easy-to-use LLM library. If you want, you can tinker around a bit and see if you can get it running as an EFI app, but I think the GGML backend we're using will be very problematic to get running correctly.

@malv-c

malv-c commented Jun 26, 2023 via email

@malv-c

malv-c commented Jun 26, 2023 via email

@LLukas22
Contributor

Implemented with #325
