Describe the bug
Building mistral.rs with the cuda feature and testing it with mistralrs-bench and a local GGUF, I observed via nvidia-smi that layers were allocated to vRAM, but GPU activity was 0% after warmup.

Despite this, within the same environment (the official llama-cpp Dockerfile, full-cuda variant), the equivalent llama-cpp bench tool used the GPU at 100%. I built both projects myself within the same container environment, so something seems off? More details here: #329 (comment)

I can look at running the Dockerfile from this project, but besides cudnn there shouldn't be much difference AFAIK. I've not tried other commands, or non-GGUF models, but I assume that shouldn't affect this?

Latest commit
v0.1.8: ca9bf7d

Additional context
There is a modification I've applied to be able to load local models without an HF token provided (I don't have an account yet and just wanted to try some projects with models); my workaround was to ignore 401 (unauthorized) the same way 404 is already ignored.

AFAIK this shouldn't negatively affect using the GGUF model? Additional files had to be provided despite llama-cpp not requiring them; from what I understand, all the relevant metadata is already available within the GGUF file itself?
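Roughly, the workaround boils down to this decision (a minimal sketch of the logic only, with my own function name; it is not the actual mistral.rs code):

```rust
// Minimal sketch of the idea behind the workaround, not the actual mistral.rs code:
// treat an unauthenticated 401 from the Hugging Face API the same as a 404,
// i.e. fall back to purely local files instead of bailing out.
fn should_fall_back_to_local(status: u16) -> bool {
    // 404: file not on the Hub; 401: no/invalid HF token provided.
    matches!(status, 401 | 404)
}

fn main() {
    for status in [200u16, 401, 404, 500] {
        println!(
            "HTTP {status}: fall back to local files = {}",
            should_fall_back_to_local(status)
        );
    }
}
```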
This seems very strange. I'll do some digging, but my suspicion is that they do device mapping differently. Please see my comment in #329.
As mentioned, the modification I've applied (#326 (comment)) just ignores 401 (unauthorized) the same way 404 is ignored, so local models load without an HF token. AFAIK that shouldn't negatively affect using the GGUF model itself, since from what I understand all the relevant metadata is already available within the GGUF file?
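For context on why I expected the GGUF alone to be enough: the metadata counts sit right in the file header. A minimal sketch that reads just the fixed-size header (the path is a placeholder):

```rust
use std::fs::File;
use std::io::Read;

// Minimal sketch: read the fixed-size GGUF header to show that the tensor count
// and metadata key/value count ship inside the file itself.
fn main() -> std::io::Result<()> {
    let mut f = File::open("model.gguf")?; // placeholder path
    let mut buf = [0u8; 24];
    f.read_exact(&mut buf)?;

    assert_eq!(&buf[0..4], b"GGUF", "not a GGUF file");
    let version = u32::from_le_bytes(buf[4..8].try_into().unwrap());
    let tensors = u64::from_le_bytes(buf[8..16].try_into().unwrap());
    let metadata_kvs = u64::from_le_bytes(buf[16..24].try_into().unwrap());

    println!("GGUF v{version}: {tensors} tensors, {metadata_kvs} metadata entries");
    Ok(())
}
```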
Although the test finishes rather quickly, it's a bit tricky to monitor the load; if you have a command that would take a little longer I could give that a go 👍

EDIT: With the advice below to increase -r, I can confirm 100% GPU load.
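For reference, this is roughly how I watched the load while the bench ran in another shell (a minimal sketch; it only assumes nvidia-smi is on PATH and uses its standard query flags):

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

// Minimal sketch: poll GPU utilisation and memory once per second for 30 samples
// while the benchmark runs in another terminal.
fn main() {
    for _ in 0..30 {
        let out = Command::new("nvidia-smi")
            .args([
                "--query-gpu=utilization.gpu,memory.used",
                "--format=csv,noheader,nounits",
            ])
            .output()
            .expect("failed to run nvidia-smi");
        print!("{}", String::from_utf8_lossy(&out.stdout));
        sleep(Duration::from_secs(1));
    }
}
```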