CUDA/OpenCL Acceleration #325

Conversation
Awesome to see the progress! I'll run some benchmarks on my machine (2080 Super) tonight.
We will first focus on supporting only LLaMA. Other architectures will be supported via commits to the llama.cpp/ggml repo which implement the missing ggml_ops in CUDA/OpenCL.
Some quick benchmarks:

This Branch Stats:
llama.cpp (commit d7b7484f74d486f77feb4c0b7af7e1718ed91651) Stats:
@jafioti I can't reproduce your results; I did my own benchmarking with the following setup and results.

Device:
Models:
Commands:
- llm:
- llama.cpp:

(For Wizard-Vicuna-Uncensored 30B, only the first 40 layers were offloaded to make the model fit into 24 GB of VRAM, and the token limit was reduced from 128 to 50 tokens.)

Results: the following table shows the per-token duration:
I've done another test on an A10 with OpenLLaMA 7B Q4_0. I did the same: 6 tests, keeping the last 5.
It should be noted that I did see a wider variance in the llm numbers. Here are the raw observed numbers:
OK, that's very interesting. Any idea why the performance of the 2080 was that much worse? The A10 results seem to be OK-ish. I still have to check how we measure token times and how llama.cpp does it.
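For reference, here's a rough sketch of one way per-token times can be measured around a token-generating callback. This is not the code either llm or llama.cpp actually uses, just an illustration of where measurement differences can creep in (for example, whether prompt feeding or the first generated token is included in the average):

```rust
use std::time::{Duration, Instant};

/// Time each call to a token-generating closure until it returns None.
fn per_token_times<F: FnMut() -> Option<String>>(mut next_token: F) -> Vec<Duration> {
    let mut times = Vec::new();
    loop {
        let start = Instant::now();
        match next_token() {
            Some(_) => times.push(start.elapsed()),
            None => break,
        }
    }
    times
}

fn main() {
    // Stand-in "generator" that just emits five dummy tokens.
    let mut remaining = 5u32;
    let times = per_token_times(|| {
        if remaining == 0 {
            return None;
        }
        remaining -= 1;
        Some("tok".to_string())
    });
    let total: Duration = times.iter().sum();
    println!("average per-token time: {:?}", total / times.len() as u32);
}
```

Whether the timer wraps only the generation loop or also the prompt feed can easily shift the reported per-token numbers.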
No idea, will check on an A100 and see if the gap shrinks further.
Could it be possible for the 2080 to run out of memory and start paging into RAM?
Alright another run on an A100 using Wizard 33B:
Interestingly, this discrepancy is mostly due to a few tokens generated on a single run in llm. Here are the runtimes of each llm run:

Also, for the 2080: I think if it ran out of VRAM it would just error, no?
OK, that is indeed interesting; thanks for helping by testing different cards. Theoretically, all layers should be offloaded except for the embedding layer. You could omit the
Depending on your driver version, some drivers decide to offload into RAM if you are a bit over your available VRAM. But I only encountered it once and I don't know if that's just a Windows thing. 🤔
Alright, really strange, but I got very similar generation performance the higher I went in card power. On an A10 or H100 the token generation time is nearly identical, so it might just be my 2080. But another thing I noticed that's pretty major is that the prompt feeding stage takes quite a bit longer than llama.cpp. On your branch, is the initial prompt feeding happening on the GPU, or is only the subsequent token generation offloaded? Here's the prompt I ran through WizardLM-30B:
And the prompt-only times:
@jafioti Thanks for the additional tests. The prompt feeding happens on the GPU, as it's the same forward call as the inference of new tokens. The difference in feeding times is probably caused by the default batch size. When I get home from work I'm going to sync this branch with the newest ggml source from llama.cpp and do some benchmarking to test the impact of the batch size.
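To illustrate why the batch size dominates prompt-feeding time, here is a minimal sketch; `Model` and `evaluate_batch` are hypothetical stand-ins rather than the crate's actual API. The point is that the whole prompt goes through the same forward pass used for generation, `batch_size` tokens at a time:

```rust
/// Hypothetical stand-in for a loaded model; not the crate's real type.
struct Model;

impl Model {
    /// One forward call over every position in `tokens`. With GPU offloading,
    /// a larger batch amortises kernel-launch and transfer overhead across
    /// more tokens, which is why prompt feeding speeds up with batch size.
    fn evaluate_batch(&mut self, tokens: &[u32]) {
        let _ = tokens;
    }
}

/// Feed the prompt in chunks of `batch_size`:
///   batch_size = 8   -> a 512-token prompt needs 64 forward calls
///   batch_size = 512 -> the same prompt needs a single forward call
fn feed_prompt(model: &mut Model, prompt_tokens: &[u32], batch_size: usize) {
    for chunk in prompt_tokens.chunks(batch_size) {
        model.evaluate_batch(chunk);
    }
}

fn main() {
    let prompt: Vec<u32> = (0..512).collect();
    feed_prompt(&mut Model, &prompt, 512);
}
```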
Yup, you were exactly right, the batch size was the key! The same prompt now takes 672 ms.
CLBlast currently fails on Windows, see ggerganov/llama.cpp#2065
Had a very cursory look and this is really impressive. I'll test this out and review it properly soon.
Really impressive work! I'm very thankful you've taken the lead on this.
I still need to test it personally (haven't been around my Windows machine much in the last two days), but this is looking good. I've made a few comments about style/minor tweaks, but the core of this looks excellent.
Looking forward to seeing how quickly my GPU can run LLaMA 🚀
Thanks for the review, I'll try to implement the changes later today. As previously mentioned, some models currently produce gibberish when CUDA acceleration is enabled; that's something I also have to look into.
Almost there...
@philpax I closed all conversations I fixed in my last commit. Would be great if you could take a look at the rest.
Nice work! I'll test it locally, make any final changes of my own, and then I'll merge it 🚀 (Don't worry about solving the merge conflicts, I'll do that myself)
I'm going to sleep on this before I merge it. I think my changes should all check out, but it's quite late now and I'm pretty sure the GPU offloading stuff did a number on my brain. Pending issues (that will be made into issues after this is merged):
I have 3 Tesla T4s running and obviously I cannot use all the GPUs yet... So just commenting for now until this is supported. Possibly could just extend to the current T4s running idle:
Running command:
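A hedged aside on the multi-GPU question above: multi-GPU offloading isn't supported yet, but the CUDA runtime honours the `CUDA_VISIBLE_DEVICES` environment variable, so a run can at least be pinned to one of the otherwise idle T4s. A minimal sketch, assuming the variable is set before any CUDA initialisation happens in the process (the model-loading call is a hypothetical placeholder):

```rust
use std::env;

fn main() {
    // Expose only the first T4 to the CUDA runtime for this process.
    // This must happen before the first CUDA call, i.e. before any
    // GPU-accelerated model is loaded.
    env::set_var("CUDA_VISIBLE_DEVICES", "0");

    // ... load the model with GPU offloading enabled here ...
    // let model = load_model_with_gpu("open-llama-7b.q4_0.bin"); // hypothetical
}
```

Setting the variable when launching the process (rather than from inside it) works just as well.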
Implements CUDA/OpenCL acceleration via cuBLAS/CLBlast.
Recording.2023-06-22.141324.mp4
Stuff that works:
- `--use-gpu`
- `--gpu-layers` (see the CLI sketch after this list)
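For illustration, a minimal sketch of how flags like these are typically declared with clap (assumed here: clap 4 with the `derive` feature). The struct, field names, and the default of 32 layers are hypothetical, not this crate's actual CLI code:

```rust
use clap::Parser;

/// Hypothetical GPU-related CLI options; illustrative only.
#[derive(Parser, Debug)]
struct GpuArgs {
    /// Enable CUDA/OpenCL acceleration.
    #[arg(long)]
    use_gpu: bool,

    /// Number of transformer layers to offload to the GPU;
    /// fewer layers means less VRAM used, at the cost of speed.
    #[arg(long, default_value_t = 32)]
    gpu_layers: usize,
}

fn main() {
    let args = GpuArgs::parse();
    if args.use_gpu {
        println!("offloading up to {} layers to the GPU", args.gpu_layers);
    } else {
        println!("running on CPU only");
    }
}
```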
Stuff that still needs to be done:
- `ctx0` (I don't want to use a RefCell here)
- `build.rs` to enable `f16` optimizations. Maybe @darxkies could help here.

Nice to have: