CUDA decoding #315
Comments
I'm currently working on adding CUDA acceleration, and it's already in place for the LLaMA architecture. If you're interested in giving it a go, you can check out the branch I'm working on here: https://github.com/LLukas22/llm/tree/cublas-clblast-support
For a test drive, here's a command you can use:
Now I'm in the process of implementing acceleration for the other architectures. But there's a hiccup: some GGML operations don't have CUDA support, which means I have to run some parts on the CPU. My goal is to work around this without having to tweak the existing model implementations. If you've got some ideas on how to tackle this, I'm all ears! |
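To make the workaround concrete, here is a minimal Rust sketch of per-operation backend dispatch. This is not the ggml or llm API; the `Op` and `Backend` enums and the support table are hypothetical stand-ins for the idea of routing ops that have CUDA kernels to the GPU and silently falling back to the CPU for the rest, so the model implementations never need to change.

```rust
// Hypothetical per-op dispatch sketch; ops without a CUDA kernel fall back to the CPU.

#[derive(Clone, Copy, Debug)]
enum Backend {
    Cuda,
    Cpu,
}

#[derive(Clone, Copy, Debug)]
enum Op {
    MatMul,
    RmsNorm,
    Alibi, // example of an op without a CUDA kernel yet
}

fn backend_for(op: Op, cuda_available: bool) -> Backend {
    // Hypothetical support table; in practice this would mirror which GGML ops
    // actually have CUDA kernels.
    let cuda_supported = matches!(op, Op::MatMul | Op::RmsNorm);
    if cuda_available && cuda_supported {
        Backend::Cuda
    } else {
        Backend::Cpu
    }
}

fn main() {
    for op in [Op::MatMul, Op::RmsNorm, Op::Alibi] {
        println!("{:?} -> {:?}", op, backend_for(op, true));
    }
}
```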
@LLukas22 This is awesome, thanks for the link. I tried it out on my GPU, and it's a lot faster than pure CPU inference. It does seem to be quite a bit slower than llama.cpp (maybe a quarter of the speed; I'll run measurements). Is it because it's doing a CPU sync after every token to run the callback function? |
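If the per-token CPU sync for the callback does turn out to be part of the gap, one common way to take the callback off the hot loop is to hand tokens to a channel and run the callback on another thread. A minimal sketch in plain std Rust (not the llm crate's actual callback API):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // "Inference" thread: pushes tokens into the channel and keeps going,
    // instead of blocking on the user callback after every token.
    let producer = thread::spawn(move || {
        for piece in ["Hello", ",", " world", "!"] {
            tx.send(piece.to_string()).unwrap();
            thread::sleep(Duration::from_millis(10)); // stand-in for GPU work
        }
    });

    // Consumer: the per-token callback runs here, off the hot loop.
    for piece in rx {
        print!("{piece}");
    }
    println!();
    producer.join().unwrap();
}
```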
Some quick stats: llama.cpp stat output:
llm cublas-clblast-support stat output:
|
Well, it's still in a pre-draft stage. I'm guessing the creation of a new eval context on each call kills the performance, but that's something to optimize once we get the acceleration working for all models. There was also a lot of work done on the Metal branch, which will probably also help us close the gap a bit. |
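To illustrate the eval-context point, a toy sketch of reusing one pre-allocated context across tokens instead of creating a fresh one per call. `EvalContext` here is hypothetical and stands in for ggml's context plus its scratch buffers:

```rust
// Hypothetical sketch: allocate the evaluation context once, reuse it for every token.

struct EvalContext {
    scratch: Vec<u8>, // pre-allocated scratch memory reused on every call
}

impl EvalContext {
    fn with_capacity(bytes: usize) -> Self {
        Self { scratch: vec![0u8; bytes] }
    }

    fn eval(&mut self, token: u32) -> u32 {
        // Stand-in for running the graph using self.scratch; no new allocation per call.
        self.scratch[0] = token as u8;
        token + 1 // placeholder "next token"
    }
}

fn main() {
    let mut ctx = EvalContext::with_capacity(64 * 1024 * 1024);
    let mut token = 0u32;
    for _ in 0..8 {
        token = ctx.eval(token); // the same context is reused for every token
    }
    println!("last token id: {token}");
}
```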
@LLukas22 Understandable. Is there anything I can do to help this along? |
I'm currently waiting on @philpax to review the Metal PR. If that gets merged, we can start to integrate CUDA acceleration. Until then, we could think about how to support architectures that use functions which are not yet CUDA-accelerated, or we could start implementing those functions as CUDA kernels in GGML/llama.cpp. |
Can I expect to test it on my Orin AGX 32G as an EFI app? |
@malv-c I don't know what you mean by an EFI app. But if you can set up CUDA, you should be able to compile it for an ARM-based system. But maybe we have to adjust the |
An EFI app, to run the llm seriously without wasting resources on the poor Ubuntu. |
@malv-c I'm still not sure how this relates to running language models. Why would UEFI calls help speed up LLMs? |
An llm without loading an OS is better than llm + OS. |
That's impractical.
- Latency / compute constraints don't come from the OS, but from the model size / CUDA kernels / CPU speed.
- CUDA is usually tightly integrated with the OS as well, so you wouldn't be able to use the GPU.
- LLMs aren't the only things running; there needs to be some way to interact with the LLM, usually a web server queuing up prompts or some terminal interface. How would that work without an OS?
|
@malv-c I agree with @jafioti; the scope of this project is to provide a good, fast, and easy-to-use llm library. If you want to, you can tinker around a bit and see if you get it running as an EFI app, but I think the ggml backend we are using will be very problematic to get running correctly. |
Hi Joe, I strongly disagree, but as I don't know Rust I will not output code for now. If I find a usable llm ...
|
Hi Lukas, not simple, I agree. Thanks.
|
Implemented with #325 |
Hey all, great work on integrating CUDA support for the prompt tokens. How much work would it be to support GPU decoding? Currently with llama.cpp I can reach about 35 tokens per second with LLaMA 7B on a 2080 Super, and I'd love to get somewhere near that in Rust!
Please let me know if there's anything I can do to help this effort.
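For anyone skimming the thread, a toy sketch of why prompt processing and decoding are offloaded separately. The functions below are hypothetical stand-ins, not the llm crate's API: the prompt is a single batched forward pass, while decoding is a strictly sequential per-token loop, so any CPU/GPU round trip per token shows up directly in tokens per second.

```rust
// Hypothetical sketch contrasting batched prompt evaluation with sequential decoding.

fn eval_batch(tokens: &[u32]) -> u32 {
    // Stand-in for a single batched forward pass over all prompt tokens.
    *tokens.last().unwrap()
}

fn eval_one(token: u32) -> u32 {
    // Stand-in for a single-token forward pass; this is the loop that needs
    // GPU-resident weights and KV cache to reach llama.cpp-like speeds.
    token + 1
}

fn main() {
    let prompt = [1u32, 15043, 3186];

    // Prompt phase: one call, easy to keep on the GPU.
    let mut token = eval_batch(&prompt);

    // Decode phase: strictly sequential, one token at a time.
    for _ in 0..16 {
        token = eval_one(token);
    }
    println!("final token id: {token}");
}
```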