This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

CUDA/OpenCL Acceleration #325

Merged
merged 36 commits into rustformers:main from LLukas22:feat/cuda-opencl-acceleration on Jul 16, 2023

Conversation

LLukas22
Contributor

@LLukas22 LLukas22 commented Jun 22, 2023

Implements CUDA/OpenCL acceleration via CuBLAS/CLBLAST.

Recording.2023-06-22.141324.mp4

Stuff that works:

  • Offload the weights to the GPU
  • Enable GPU acceleration via --use-gpu
  • Control how many layers are offloaded via --gpu-layers (see the usage sketch after these lists)

Stuff that still needs to be done:

  • Check why CLBlast's matmul throws an error in the current llama.cpp version.
  • Decide on how we want to handle other architectures
  • Find another way to pass a mutable ctx0 (I don't want to use a RefCell here)
  • Find a way to pass additional arguments to ggml-sys's build.rs to enable f16 optimizations. Maybe @darxkies could help here.
  • Find a better place to initialize the GPU context and CUDA scratch buffer

Nice to have:

  • Implement Multi-GPU support
  • Benchmark against llama.cpp (Maybe something for @jafioti ?)
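
For illustration, here is a rough Rust sketch of how the layer-offload count behaves; `GpuSettings` and its fields are placeholders that mirror the `--use-gpu`/`--gpu-layers` flags above, not the crate's actual API, and the capping behaviour is an assumption.

```rust
// Hypothetical sketch: `GpuSettings` mirrors the CLI flags from this PR and
// is not the crate's real API.
struct GpuSettings {
    /// Mirrors `--use-gpu`: enable CUDA/OpenCL acceleration.
    use_gpu: bool,
    /// Mirrors `--gpu-layers`: how many layers to offload; `None` means "all".
    gpu_layers: Option<usize>,
}

/// How many layers end up on the GPU for a model with `n_layer` transformer
/// layers, assuming the requested count is simply capped at the model's size.
fn layers_to_offload(settings: &GpuSettings, n_layer: usize) -> usize {
    if !settings.use_gpu {
        return 0;
    }
    settings.gpu_layers.unwrap_or(n_layer).min(n_layer)
}

fn main() {
    let settings = GpuSettings { use_gpu: true, gpu_layers: Some(100) };
    // A 7B LLaMA model has 32 layers, so `--gpu-layers 100` offloads all of them.
    assert_eq!(layers_to_offload(&settings, 32), 32);
    println!("offloading {} layers", layers_to_offload(&settings, 32));
}
```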

@jafioti
Contributor

jafioti commented Jun 22, 2023

Awesome to see the progress! I'll run some benchmarks on my machine (2080 super) tonight

@philpax philpax added the topic:backend-support label (Support for alternate non-GGML backends, or for particular GGML backend features) on Jun 22, 2023
@LLukas22
Contributor Author

* [x]  Decide on how we want to handle other architectures

We will first focus on supporting only LLaMA. Other architectures will be supported via commits to the llama.cpp/ggml repo that implement the missing ggml ops in CUDA/OpenCL.

@jafioti
Contributor

jafioti commented Jun 23, 2023

Some quick benchmarks:
GPU: 2080 Super
Model: Llama 7B
Prompt: "Rust is a cool programming language because"
Num Tokens Generated: 128
llm Command: cargo run --features cublas --release llama infer -m ../llama.cpp/models/llama-7b.ggmlv3.q4_0.bin -p "Rust is a cool programming language because" --stats --use-gpu --gpu-layers 100 --num-predict 128
llama.cpp Command: ./main -m models/llama-7b.ggmlv3.q4_0.bin -p "Rust is a cool programming language because" -n 128 -ngl 100

This Branch Stats:

feed_prompt_duration: 317ms
prompt_tokens: 9
predict_duration: 9548ms
predict_tokens: 137
per_token_duration: 69.693ms

llama.cpp (commit d7b7484f74d486f77feb4c0b7af7e1718ed91651) Stats:

llama_print_timings:        load time =  1015.55 ms
llama_print_timings:      sample time =    52.68 ms /   128 runs   (    0.41 ms per token)
llama_print_timings: prompt eval time =   182.96 ms /     9 tokens (   20.33 ms per token)
llama_print_timings:        eval time =  2901.68 ms /   127 runs   (   22.85 ms per token)
llama_print_timings:       total time =  3162.59 ms

@LLukas22
Contributor Author

@jafioti I can't reproduce your results, so I did my own benchmarking with the following setup and results.

Device:
Windows 11
RTX 3090

Models:

Commands:

llm: cargo run --features cublas --release llama infer -m [MODELPATH] -p "Rust is a cool programming language because" --stats --use-gpu --gpu-layers 100 --num-predict 128

llama.cpp: .\main.exe -m [MODELPATH] -p "Rust is a cool programming language because" -n 128 -ngl 100

(For Wizard-Vicuna-Uncensored 30B only the first 40 layers were offloaded to make the model fit into 24 GB of VRAM, and the token limit was reduced from 128 to 50 tokens.)

Results:
Six runs were performed for each model; the result of the first run was discarded and the remaining five runs were averaged.

The following table shows the per-token duration:

| Model | llama.cpp | llm |
| --- | --- | --- |
| OpenLLama 3B | N/A | 16.17 ms |
| OpenLLama 7B | 17.54 ms | 19.40 ms |
| Nous Hermes 13B | 41.23 ms | 42.72 ms |
| Wizard-Vicuna 30B | 312.30 ms | 289.97 ms |

llm seems to be about 1-2 ms slower than llama.cpp, which is probably caused by different measuring locations; I think llm includes the token callback in the measurement.
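
As a rough illustration of that difference, here is a minimal, self-contained Rust sketch of timing only the evaluation and keeping the callback outside the measured region; `evaluate_one_token` and `on_token` are placeholders, not llm's real hooks.

```rust
use std::time::{Duration, Instant};

fn main() {
    let tokens = ["Rust", " is", " fast"];
    let mut eval_time = Duration::ZERO;

    for token in tokens {
        // Time only the model evaluation ...
        let start = Instant::now();
        evaluate_one_token(token);
        eval_time += start.elapsed();

        // ... and keep the (potentially slow) user callback outside the timer,
        // so printing/streaming cannot inflate the per-token numbers.
        on_token(token);
    }

    println!(
        "\nper-token duration: {:.3} ms",
        eval_time.as_secs_f64() * 1000.0 / tokens.len() as f64
    );
}

fn evaluate_one_token(_token: &str) { /* stand-in for the actual forward pass */ }

fn on_token(token: &str) {
    print!("{token}");
}
```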

@jafioti
Contributor

jafioti commented Jun 24, 2023

I've done another test on an A10 with OpenLLama 7B Q4_0

I did the same, 6 tests, keep last 5
Results:

| Model | llama.cpp | llm |
| --- | --- | --- |
| OpenLLama 7B | 21.44 ms | (see raw numbers below) |

It should be noted that I did see a wider variance in the llm numbers. Here are the raw observed numbers:
llm: 30.8, 22.3, 32.5, 32, 22.5
llama.cpp: 21.5, 21.5, 21.3, 21.4, 21.5
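
For reference, averaging those raw numbers (a quick Rust sketch, not part of the original comment) reproduces the 21.44 ms reported for llama.cpp, while llm comes out around 28 ms and noticeably noisier:

```rust
// Average the raw per-token times (ms) quoted above.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn main() {
    let llm = [30.8, 22.3, 32.5, 32.0, 22.5];
    let llama_cpp = [21.5, 21.5, 21.3, 21.4, 21.5];
    println!("llm mean:       {:.2} ms", mean(&llm));       // ~28.02 ms
    println!("llama.cpp mean: {:.2} ms", mean(&llama_cpp)); // ~21.44 ms
}
```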

@LLukas22
Contributor Author

OK, that's very interesting. Any idea why the performance of the 2080 was that much worse?

The A10 results seem to be OK-ish. I still have to check how we measure token times and how llama.cpp does it.

@jafioti
Contributor

jafioti commented Jun 25, 2023

No idea, will check on an A100 and see if the gap shrinks further.

@LLukas22
Contributor Author

Could it be possible for the 2080 to run out of memory and start paging into RAM? llm seems to allocate a bit more VRAM than llama.cpp 🤔

@jafioti
Contributor

jafioti commented Jun 25, 2023

Alright, another run on an A100 using WizardLM 33B:

| Model | llama.cpp | llm |
| --- | --- | --- |
| WizardLM 33B | 50.98 ms | 83.79 ms |

Interestingly, this discrepancy is mostly due to a few tokens generated on a single run in llm. Here are the runtimes of each llm run: 52.47, 56.07, 192.3, 61.2, 57.4. As you can see, one of them took a lot longer, and while running I noticed most of the tokens generated fast, but about 5 of them took almost a whole second to generate. Does it decide to use the CPU for some forward passes? How can I force it to always use the GPU? Command I used: cargo run --features cublas --release llama infer -m ./wizardlm-33b-v1.0-uncensored.ggmlv3.q4_0.bin -p "Rust is a cool programming language because" --stats --use-gpu --gpu-layers 100 --num-predict 128

Also for the 2080 I think if it ran out of VRAM it would just error, no?

@LLukas22
Contributor Author

Ok, that is indeed interesting; thanks for helping by testing different cards. Theoretically all layers should be offloaded except for the embedding layer. You could omit the --gpu-layers 100 parameter, as llm will offload all layers if you don't define it and GPU acceleration is enabled. I actually don't know why inference would slow down for some tokens; maybe it's a problem with the sampler/tokenizer? I'll try to run some benchmarks with the Hugging Face tokenizer enabled to see if that changes something. If you want to test it you can define an external tokenizer via the -r [HF/repo] parameter, e.g. -r "ehartford/WizardLM-33B-V1.0-Uncensored".

> Also for the 2080 I think if it ran out of VRAM it would just error, no?

Depending on your driver version, some drivers decide to offload into RAM if you go a bit over your available VRAM. But I only encountered it once and I don't know if that's just a Windows thing. 🤔

@jafioti
Contributor

jafioti commented Jun 30, 2023

Alright, really strange, but I got very similar generation performance the higher I went in card power. On an A10 or H100, the token generation time is nearly identical, so it might just be my 2080, idk.

But another thing I noticed that's pretty major is the prompt feeding stage taking quite a bit longer than llama.cpp. On your branch, is the initial prompt feeding happening on GPU? Or is only the subsequent token generation offloaded?

Here's the prompt I ran through WizardLM-30B:

A chat between a curious user and an artificial intelligence assistant.
The assistant gives accurate answers to the user's questions while being short and to-the-point.
DOCUMENT START

Company Dividends

The table below provides a breakdown of dividends received by shareholders for Acme, Inc. and Newtech Corp. over a period of five years.

Dividends per Share
```csv
Year,Acme, Inc.,Newtech Corp.
2016,$1.10,$0.45
2017,$1.25,$0.60
2018,$1.40,$0.75
2019,$1.55,$0.90
2020,$1.75,$1.05
```

DOCUMENT END

USER: Please answer the following question based on the document provided above. In 2019, how much was Acme, Inc.'s dividend per share?
ASSISTANT:

And the prompt-only times:
llama.cpp: 580.72 ms / 231 tokens (2.51 ms per token, 397.78 tokens per second)
llm: 9619 ms / 231 tokens (41.64 ms per token, 24.01 tokens per second)

@LLukas22
Contributor Author

@jafioti Thanks for the additional tests. The prompt feeding happens on the GPU, as it's the same forward call as the inference of new tokens. The difference in feeding times is probably caused by the default batch size. I think llama.cpp uses 256 or 512 as a default when running GPU inference, while we currently always default to 8. This means your 231-token prompt needs only one forward call in llama.cpp but about 231 / 8 ≈ 29 forward calls on this branch. You could try increasing it via the --batch-size parameter.
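
A back-of-the-envelope sketch of that arithmetic, using ceiling division of the prompt length by the batch size (the defaults mentioned above):

```rust
// Number of forward passes needed to feed a prompt of `prompt_tokens` tokens
// with a given batch size (ceiling division).
fn forward_calls(prompt_tokens: usize, batch_size: usize) -> usize {
    (prompt_tokens + batch_size - 1) / batch_size
}

fn main() {
    // The 231-token prompt from the earlier benchmark:
    assert_eq!(forward_calls(231, 8), 29);  // llm's current default batch size
    assert_eq!(forward_calls(231, 512), 1); // a llama.cpp-style large batch
    println!("batch 8 -> 29 calls, batch 512 -> 1 call");
}
```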

When I get home from work I'm going to sync this branch with the newest ggml source from llama.cpp and do some benchmarking to test the impact of the batch-size parameter.

@jafioti
Contributor

jafioti commented Jun 30, 2023

Yup, you were exactly right, the batch size was the key! The same prompt now takes 672 ms.

@LLukas22
Contributor Author

LLukas22 commented Jul 1, 2023

CLBlast currently fails on Windows; see ggerganov/llama.cpp#2065.

@jafioti
Contributor

jafioti commented Jul 1, 2023

@LLukas22 Can you pull main again? PR #339 was just merged and I'd like to run in an OpenSSL-free environment

@philpax
Collaborator

philpax commented Jul 12, 2023

Had a very cursory look and this is really impressive. I'll test this out and review it properly soon.

@philpax philpax added this to the 0.2 milestone Jul 13, 2023
Collaborator

@philpax philpax left a comment


Really impressive work! I'm very thankful you've taken the lead on this.

I still need to test it personally (haven't been around my Windows machine much in the last two days), but this is looking good. I've made a few comments about style/minor tweaks, but the core of this looks excellent.

Looking forward to seeing how quickly my GPU can run LLaMA 🚀

(Review comments left on crates/ggml/src/context.rs, crates/ggml/src/lib.rs, crates/ggml/src/tensor.rs, crates/llm-base/src/inference_session.rs, and crates/models/llama/src/lib.rs; all resolved.)
@LLukas22
Contributor Author

Thanks for the review, I'll try to implement the changes later today. As previously mentioned, some models currently produce gibberish when CUDA acceleration is enabled; that's something I also have to look into.

@LLukas22 LLukas22 marked this pull request as ready for review July 15, 2023 14:17
Collaborator

@philpax philpax left a comment


Almost there...

(Review comments left on crates/ggml/src/context.rs, crates/ggml/src/tensor.rs, crates/llm-base/src/inference_session.rs, and crates/models/llama/src/lib.rs; all resolved.)
@philpax philpax mentioned this pull request Jul 15, 2023
@LLukas22
Contributor Author

@philpax I closed all the conversations I fixed in my last commit. Would be great if you could take a look at the rest.

@philpax
Collaborator

philpax commented Jul 15, 2023

Nice work! I'll test it locally, make any final changes of my own, and then I'll merge it 🚀

(Don't worry about solving the merge conflicts, I'll do that myself)

@philpax
Collaborator

philpax commented Jul 16, 2023

I'm going to sleep on this before I merge it. I think my changes should all check out, but it's quite late now and I'm pretty sure the GPU offloading stuff did a number on my brain.

Pending issues (that will be made into issues after this is merged):

  • Certain quantization levels do not work.
    • 7B Q5_1 does not work (Philpax)
    • 7B Q3_K_M works (Philpax)
    • 13B Q4_K_M works (Philpax)
    • 13B Q5_K_M works (Lukas)
    • 13B Q5_K_S does not work (Lukas)
  • ExecutionParameters struct for passing thread count and backend around (see the sketch after this list)
  • No multi-GPU support
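
A purely hypothetical sketch of the ExecutionParameters idea from the list above; it was only a planned item at this point, so the names and fields here are assumptions rather than the design that eventually landed.

```rust
// Hypothetical: bundle the knobs that currently get threaded through call
// sites one by one into a single value.
#[derive(Debug, Clone, Copy)]
enum Backend {
    Cpu,
    Gpu,
}

#[derive(Debug)]
struct ExecutionParameters {
    /// CPU threads used for the parts of the graph that are not offloaded.
    n_threads: usize,
    /// Preferred backend for the offloaded layers.
    backend: Backend,
}

impl Default for ExecutionParameters {
    fn default() -> Self {
        Self { n_threads: 8, backend: Backend::Cpu }
    }
}

fn main() {
    let params = ExecutionParameters { backend: Backend::Gpu, ..Default::default() };
    println!("{params:?}");
}
```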

(Review comments left on crates/ggml/src/tensor.rs; resolved.)
@philpax philpax merged commit 3062a08 into rustformers:main Jul 16, 2023
14 checks passed
@LLukas22 LLukas22 mentioned this pull request Jul 23, 2023
@LLukas22 LLukas22 deleted the feat/cuda-opencl-acceleration branch July 26, 2023 08:55
@hhamud hhamud mentioned this pull request Aug 7, 2023
@cwysong85

cwysong85 commented Aug 8, 2023

I have 3 Tesla T4s running and obviously I cannot use all the GPUs yet, so just commenting for now until this is supported. You could possibly just extend the current --use-gpu parameter, e.g. --use-gpu all.
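
A purely illustrative Rust sketch of how such a flag value could be parsed; the grammar ("all", a single index, or a comma-separated list of device indices) is an assumption for discussion, not an implemented option.

```rust
use std::str::FromStr;

// Hypothetical value for an extended `--use-gpu` flag, e.g. `all` or `0,2`.
#[derive(Debug, PartialEq)]
enum GpuSelection {
    All,
    Devices(Vec<usize>),
}

impl FromStr for GpuSelection {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if s.eq_ignore_ascii_case("all") {
            return Ok(GpuSelection::All);
        }
        // Otherwise treat the value as a comma-separated list of device indices.
        s.split(',')
            .map(|part| part.trim().parse::<usize>().map_err(|e| e.to_string()))
            .collect::<Result<Vec<_>, _>>()
            .map(GpuSelection::Devices)
    }
}

fn main() {
    assert_eq!("all".parse::<GpuSelection>(), Ok(GpuSelection::All));
    assert_eq!("0,2".parse::<GpuSelection>(), Ok(GpuSelection::Devices(vec![0, 2])));
    println!("parsed multi-GPU selections OK");
}
```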

T4s running idle:

nvidia-smi

Tue Aug  8 13:41:32 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:3B:00.0 Off |                  Off |
| N/A   47C    P0              27W /  70W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:87:00.0 Off |                  Off |
| N/A   46C    P0              27W /  70W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       Off | 00000000:AF:00.0 Off |                  Off |
| N/A   46C    P0              27W /  70W |      2MiB / 16384MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Running command:

./llm infer --batch-size 512 --use-gpu --num-ctx-tokens 4096 -a llama -m /usr/local/models/nous-hermes-llama-2-7b.ggmlv3.q8_0.bin -p "Who was the first US president?"

