This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

CUDA/OpenCL Acceleration #325

Merged
merged 36 commits into rustformers:main from LLukas22:feat/cuda-opencl-acceleration on Jul 16, 2023

Conversation

LLukas22
Contributor

@LLukas22 LLukas22 commented Jun 22, 2023

Implements CUDA/OpenCL acceleration via CuBLAS/CLBLAST.

Recording.2023-06-22.141324.mp4

Stuff that works:

  • Offload the weights to the GPU
  • Enable GPU acceleration via --use-gpu
  • Control how many layers are offloaded via --gpu-layers (see the usage sketch after these lists)

Stuff that still needs to be done:

  • Check why CLBlast's matmul throws an error in the current llama.cpp version.
  • Decide on how we want to handle other architectures
  • Find another way to pass a mutable ctx0 (I don't want to use a RefCell here)
  • Find a way to pass additional arguments to ggml-sys's build.rs to enable f16 optimizations. Maybe @darxkies could help here.
  • Find a better place to initialize the GPU context and CUDA scratch buffer

Nice to have:

  • Implement Multi-GPU support
  • Benchmark against llama.cpp (Maybe something for @jafioti ?)
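
For illustration, here is a rough Rust sketch of how the layer-offload count behaves; `GpuSettings` and its fields are placeholders that mirror the `--use-gpu`/`--gpu-layers` flags above, not the crate's actual API, and the capping behaviour is an assumption.

```rust
// Hypothetical sketch: `GpuSettings` mirrors the CLI flags from this PR and
// is not the crate's real API.
struct GpuSettings {
    /// Mirrors `--use-gpu`: enable CUDA/OpenCL acceleration.
    use_gpu: bool,
    /// Mirrors `--gpu-layers`: how many layers to offload; `None` means "all".
    gpu_layers: Option<usize>,
}

/// How many layers end up on the GPU for a model with `n_layer` transformer
/// layers, assuming the requested count is simply capped at the model's size.
fn layers_to_offload(settings: &GpuSettings, n_layer: usize) -> usize {
    if !settings.use_gpu {
        return 0;
    }
    settings.gpu_layers.unwrap_or(n_layer).min(n_layer)
}

fn main() {
    let settings = GpuSettings { use_gpu: true, gpu_layers: Some(100) };
    // A 7B LLaMA model has 32 layers, so `--gpu-layers 100` offloads all of them.
    assert_eq!(layers_to_offload(&settings, 32), 32);
    println!("offloading {} layers", layers_to_offload(&settings, 32));
}
```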

@jafioti
Contributor

jafioti commented Jun 22, 2023

Awesome to see the progress! I'll run some benchmarks on my machine (2080 super) tonight

@philpax philpax added the topic:backend-support label (Support for alternate non-GGML backends, or for particular GGML backend features) on Jun 22, 2023
@LLukas22
Contributor Author

* [x]  Decide on how we want to handle other architectures

We will first focus on supporting only LLaMA. Other architectures will be supported via commits to the llama.cpp/ggml repo that implement the missing ggml ops in CUDA/OpenCL.

@jafioti
Contributor

jafioti commented Jun 23, 2023

Some quick benchmarks:
GPU: 2080 Super
Model: Llama 7B
Prompt: "Rust is a cool programming language because"
Num Tokens Generated: 128
llm Command: cargo run --features cublas --release llama infer -m ../llama.cpp/models/llama-7b.ggmlv3.q4_0.bin -p "Rust is a cool programming language because" --stats --use-gpu --gpu-layers 100 --num-predict 128
llama.cpp Command: ./main -m models/llama-7b.ggmlv3.q4_0.bin -p "Rust is a cool programming language because" -n 128 -ngl 100

This Branch Stats:

feed_prompt_duration: 317ms
prompt_tokens: 9
predict_duration: 9548ms
predict_tokens: 137
per_token_duration: 69.693ms

llama.cpp (commit d7b7484f74d486f77feb4c0b7af7e1718ed91651) Stats:

llama_print_timings:        load time =  1015.55 ms
llama_print_timings:      sample time =    52.68 ms /   128 runs   (    0.41 ms per token)
llama_print_timings: prompt eval time =   182.96 ms /     9 tokens (   20.33 ms per token)
llama_print_timings:        eval time =  2901.68 ms /   127 runs   (   22.85 ms per token)
llama_print_timings:       total time =  3162.59 ms

@LLukas22
Contributor Author

@jafioti I can't reproduce your results, so I did my own benchmarking with the following setup and results.

Device:
Windows 11
RTX 3090

Models:

Commands:

llm: cargo run --features cublas --release llama infer -m [MODELPATH] -p "Rust is a cool programming language because" --stats --use-gpu --gpu-layers 100 --num-predict 128

llama.cpp: .\main.exe -m [MODELPATH] -p "Rust is a cool programming language because" -n 128 -ngl 100

(For Wizard-Vicuna-Uncensored 30B only the first 40 layers were offloaded to make the model fit into 24 GB of VRAM, and the token limit was reduced from 128 to 50 tokens.)

Results:
Six runs were performed for each model; the result of the first run was discarded and the remaining five runs were averaged.

The following table shows the per-token duration:

| Model | llama.cpp | llm |
| --- | --- | --- |
| OpenLLama 3B | N/A | 16.17 ms |
| OpenLLama 7B | 17.54 ms | 19.40 ms |
| Nous Hermes 13B | 41.23 ms | 42.72 ms |
| Wizard-Vicuna 30B | 312.30 ms | 289.97 ms |

llm seems to be about 1-2 ms slower than llama.cpp, which is probably caused by different measuring locations; I think llm includes the token callback in the measurement.
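
As a rough illustration of that difference, here is a minimal, self-contained Rust sketch of timing only the evaluation and keeping the callback outside the measured region; `evaluate_one_token` and `on_token` are placeholders, not llm's real hooks.

```rust
use std::time::{Duration, Instant};

fn main() {
    let tokens = ["Rust", " is", " fast"];
    let mut eval_time = Duration::ZERO;

    for token in tokens {
        // Time only the model evaluation ...
        let start = Instant::now();
        evaluate_one_token(token);
        eval_time += start.elapsed();

        // ... and keep the (potentially slow) user callback outside the timer,
        // so printing/streaming cannot inflate the per-token numbers.
        on_token(token);
    }

    println!(
        "\nper-token duration: {:.3} ms",
        eval_time.as_secs_f64() * 1000.0 / tokens.len() as f64
    );
}

fn evaluate_one_token(_token: &str) { /* stand-in for the actual forward pass */ }

fn on_token(token: &str) {
    print!("{token}");
}
```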

@jafioti
Contributor

jafioti commented Jun 24, 2023

I've done another test on an A10 with OpenLLama 7B Q4_0

I did the same, 6 tests, keep last 5
Results:

| Model | llama.cpp | llm |
| --- | --- | --- |
| OpenLLama 7B | 21.44 ms | (see raw numbers below) |

It should be noted that I did see a wider variance in the llm numbers. Here are the raw observed numbers:
llm: 30.8, 22.3, 32.5, 32, 22.5
llama.cpp: 21.5, 21.5, 21.3, 21.4, 21.5
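
For reference, averaging those raw numbers (a quick Rust sketch, not part of the original comment) reproduces the 21.44 ms reported for llama.cpp, while llm comes out around 28 ms and noticeably noisier:

```rust
// Average the raw per-token times (ms) quoted above.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn main() {
    let llm = [30.8, 22.3, 32.5, 32.0, 22.5];
    let llama_cpp = [21.5, 21.5, 21.3, 21.4, 21.5];
    println!("llm mean:       {:.2} ms", mean(&llm));       // ~28.02 ms
    println!("llama.cpp mean: {:.2} ms", mean(&llama_cpp)); // ~21.44 ms
}
```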

@LLukas22
Contributor Author

OK, that's very interesting. Any idea why the performance of the 2080 was that much worse?

The A10 results seem to be OK-ish. I still have to check how we measure token times and how llama.cpp does it.

@jafioti
Contributor

jafioti commented Jun 25, 2023

No idea, will check on an A100 and see if the gap shrinks further.

@LLukas22
Contributor Author

Could it be possible for the 2080 to run out of memory and start paging into RAM? llm seems to allocate a bit more VRAM than llama.cpp 🤔

@jafioti
Contributor

jafioti commented Jun 25, 2023

Alright, another run on an A100 using WizardLM 33B:

| Model | llama.cpp | llm |
| --- | --- | --- |
| WizardLM 33B | 50.98 ms | 83.79 ms |

Interestingly, this discrepancy is mostly due to a few tokens generated on a single run in llm. Here are the runtimes of each llm run: 52.47, 56.07, 192.3, 61.2, 57.4. As you can see, one of them took a lot longer, and while running I noticed most of the tokens generated fast, but about 5 of them took almost a whole second to generate. Does it decide to use the CPU for some forward passes? How can I force it to always use the GPU? Command I used: cargo run --features cublas --release llama infer -m ./wizardlm-33b-v1.0-uncensored.ggmlv3.q4_0.bin -p "Rust is a cool programming language because" --stats --use-gpu --gpu-layers 100 --num-predict 128

Also for the 2080 I think if it ran out of VRAM it would just error, no?

@LLukas22
Contributor Author

Ok, that is indeed interesting; thanks for helping by testing different cards. Theoretically all layers should be offloaded except for the embedding layer. You could omit the --gpu-layers 100 parameter, as llm will offload all layers if you don't define it and GPU acceleration is enabled. I actually don't know why inference would slow down for some tokens; maybe it's a problem with the sampler/tokenizer? I'll try to run some benchmarks with the Hugging Face tokenizer enabled to see if that changes something. If you want to test it you can define an external tokenizer via the -r [HF/repo] parameter, e.g. -r "ehartford/WizardLM-33B-V1.0-Uncensored".

> Also for the 2080 I think if it ran out of VRAM it would just error, no?

Depending on your driver version, some drivers decide to offload into RAM if you go a bit over your available VRAM. But I only encountered it once and I don't know if that's just a Windows thing. 🤔

@jafioti
Contributor

jafioti commented Jun 30, 2023

Alright, really strange, but I got very similar generation performance the higher I went in card power. On an A10 or H100, the token generation time is nearly identical, so it might just be my 2080, idk.

But another thing I noticed that's pretty major is the prompt feeding stage taking quite a bit longer than llama.cpp. On your branch, is the initial prompt feeding happening on GPU? Or is only the subsequent token generation offloaded?

Here's the prompt I ran through WizardLM-30B:

A chat between a curious user and an artificial intelligence assistant.
The assistant gives accurate answers to the user's questions while being short and to-the-point.
DOCUMENT START

Company Dividends

The table below provides a breakdown of dividends received by shareholders for Acme, Inc. and Newtech Corp. over a period of five years.

Dividends per Share
```csv
Year,Acme, Inc.,Newtech Corp.
2016,$1.10,$0.45
2017,$1.25,$0.60
2018,$1.40,$0.75
2019,$1.55,$0.90
2020,$1.75,$1.05
```

DOCUMENT END

USER: Please answer the following question based on the document provided above. In 2019, how much was Acme, Inc.'s dividend per share?
ASSISTANT:

And the prompt-only times:
llama.cpp: 580.72 ms / 231 tokens (2.51 ms per token, 397.78 tokens per second)
llm: 9619 ms / 231 tokens (41.64 ms per token, 24.01 tokens per second)

@LLukas22
Contributor Author

@jafioti Thanks for the additional tests. The prompt feeding happens on the GPU, as it's the same forward call as the inference of new tokens. The difference in feeding times is probably caused by the default batch size. I think llama.cpp uses 256 or 512 as a default when running GPU inference, while we currently always default to 8. This means your 231-token prompt needs only one forward call in llama.cpp but about 231 / 8 ≈ 29 forward calls on this branch. You could try increasing it via the --batch-size parameter.
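
A back-of-the-envelope sketch of that arithmetic, using ceiling division of the prompt length by the batch size (the defaults mentioned above):

```rust
// Number of forward passes needed to feed a prompt of `prompt_tokens` tokens
// with a given batch size (ceiling division).
fn forward_calls(prompt_tokens: usize, batch_size: usize) -> usize {
    (prompt_tokens + batch_size - 1) / batch_size
}

fn main() {
    // The 231-token prompt from the earlier benchmark:
    assert_eq!(forward_calls(231, 8), 29);  // llm's current default batch size
    assert_eq!(forward_calls(231, 512), 1); // a llama.cpp-style large batch
    println!("batch 8 -> 29 calls, batch 512 -> 1 call");
}
```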

When I get home from work I'm going to sync this branch with the newest ggml source from llama.cpp and do some benchmarking to test the impact of the batch-size parameter.

@jafioti
Contributor

jafioti commented Jun 30, 2023

Yup, you were exactly right, the batch size was the key! The same prompt now takes 672 ms.

@LLukas22
Contributor Author

LLukas22 commented Jul 1, 2023

CLBlast currently fails on Windows; see ggerganov/llama.cpp#2065.

@jafioti
Contributor

jafioti commented Jul 1, 2023

@LLukas22 Can you pull main again? PR #339 was just merged and I'd like to run in an OpenSSL-free environment

@philpax
Collaborator

philpax commented Jul 12, 2023

Had a very cursory look and this is really impressive. I'll test this out and review it properly soon.

@philpax philpax added this to the 0.2 milestone Jul 13, 2023
Collaborator

@philpax philpax left a comment


Really impressive work! I'm very thankful you've taken the lead on this.

I still need to test it personally (haven't been around my Windows machine much in the last two days), but this is looking good. I've made a few comments about style/minor tweaks, but the core of this looks excellent.

Looking forward to seeing how quickly my GPU can run LLaMA 🚀

(Review comments left on crates/ggml/src/context.rs, crates/ggml/src/lib.rs, crates/ggml/src/tensor.rs, crates/llm-base/src/inference_session.rs, and crates/models/llama/src/lib.rs; all resolved.)
@LLukas22
Contributor Author

Thanks for the review, I'll try to implement the changes later today. As previously mentioned, some models currently produce gibberish when CUDA acceleration is enabled; that's something I also have to look into.

@LLukas22 LLukas22 marked this pull request as ready for review July 15, 2023 14:17
Collaborator

@philpax philpax left a comment


Almost there...

(Review comments left on crates/ggml/src/context.rs, crates/ggml/src/tensor.rs, crates/llm-base/src/inference_session.rs, and crates/models/llama/src/lib.rs; all resolved.)
@philpax philpax mentioned this pull request Jul 15, 2023
@LLukas22
Contributor Author

@philpax I closed all the conversations I fixed in my last commit. Would be great if you could take a look at the rest.

@philpax
Collaborator

philpax commented Jul 15, 2023

Nice work! I'll test it locally, make any final changes of my own, and then I'll merge it 🚀

(Don't worry about solving the merge conflicts, I'll do that myself)

@philpax
Collaborator

philpax commented Jul 16, 2023

I'm going to sleep on this before I merge it. I think my changes should all check out, but it's quite late now and I'm pretty sure the GPU offloading stuff did a number on my brain.

Pending issues (that will be made into issues after this is merged):

  • Certain quantization levels do not work.
    • 7B Q5_1 does not work (Philpax)
    • 7B Q3_K_M works (Philpax)
    • 13B Q4_K_M works (Philpax)
    • 13B Q5_K_M works (Lukas)
    • 13B Q5_K_S does not work (Lukas)
  • ExecutionParameters struct for passing thread count and backend around (see the sketch after this list)
  • No multi-GPU support
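
A purely hypothetical sketch of the ExecutionParameters idea from the list above; it was only a planned item at this point, so the names and fields here are assumptions rather than the design that eventually landed.

```rust
// Hypothetical: bundle the knobs that currently get threaded through call
// sites one by one into a single value.
#[derive(Debug, Clone, Copy)]
enum Backend {
    Cpu,
    Gpu,
}

#[derive(Debug)]
struct ExecutionParameters {
    /// CPU threads used for the parts of the graph that are not offloaded.
    n_threads: usize,
    /// Preferred backend for the offloaded layers.
    backend: Backend,
}

impl Default for ExecutionParameters {
    fn default() -> Self {
        Self { n_threads: 8, backend: Backend::Cpu }
    }
}

fn main() {
    let params = ExecutionParameters { backend: Backend::Gpu, ..Default::default() };
    println!("{params:?}");
}
```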

(Review comments left on crates/ggml/src/tensor.rs; resolved.)
@philpax philpax merged commit 3062a08 into rustformers:main Jul 16, 2023
14 checks passed
@LLukas22 LLukas22 mentioned this pull request Jul 23, 2023
@LLukas22 LLukas22 deleted the feat/cuda-opencl-acceleration branch July 26, 2023 08:55
@hhamud hhamud mentioned this pull request Aug 7, 2023
@cwysong85

cwysong85 commented Aug 8, 2023

I have 3 Tesla T4s running and obviously I cannot use all the GPUs yet, so just commenting for now until this is supported. You could possibly just extend the current --use-gpu parameter, e.g. --use-gpu all.
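
A purely illustrative Rust sketch of how such a flag value could be parsed; the grammar ("all", a single index, or a comma-separated list of device indices) is an assumption for discussion, not an implemented option.

```rust
use std::str::FromStr;

// Hypothetical value for an extended `--use-gpu` flag, e.g. `all` or `0,2`.
#[derive(Debug, PartialEq)]
enum GpuSelection {
    All,
    Devices(Vec<usize>),
}

impl FromStr for GpuSelection {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if s.eq_ignore_ascii_case("all") {
            return Ok(GpuSelection::All);
        }
        // Otherwise treat the value as a comma-separated list of device indices.
        s.split(',')
            .map(|part| part.trim().parse::<usize>().map_err(|e| e.to_string()))
            .collect::<Result<Vec<_>, _>>()
            .map(GpuSelection::Devices)
    }
}

fn main() {
    assert_eq!("all".parse::<GpuSelection>(), Ok(GpuSelection::All));
    assert_eq!("0,2".parse::<GpuSelection>(), Ok(GpuSelection::Devices(vec![0, 2])));
    println!("parsed multi-GPU selections OK");
}
```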

T4s running idle:

nvidia-smi

Tue Aug  8 13:41:32 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:3B:00.0 Off |                  Off |
| N/A   47C    P0              27W /  70W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:87:00.0 Off |                  Off |
| N/A   46C    P0              27W /  70W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       Off | 00000000:AF:00.0 Off |                  Off |
| N/A   46C    P0              27W /  70W |      2MiB / 16384MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Running command:

./llm infer --batch-size 512 --use-gpu --num-ctx-tokens 4096 -a llama -m /usr/local/models/nous-hermes-llama-2-7b.ggmlv3.q8_0.bin -p "Who was the first US president?"

