
[High Priority Feature] Please add Support for 8-bit and 4-Bit Caching! #56

Closed
Iory1998 opened this issue Jul 31, 2024 · 9 comments

@Iory1998

Hello team,

LM Studio uses recent builds of llama.cpp, which already supports the 4-bit and 8-bit KV cache, so I don't understand why LM Studio does not expose it yet.
The benefits are tremendous: it improves generation speed, and the smaller cache also frees VRAM for a higher-quality quantization.

To give you an example, I run aya-23-35B-Q4_K_M.gguf in LM Studio at 4.5 t/s because the maximum number of layers I can load on my GPU with 24 GB of VRAM is 30, while Aya has 41 layers. In Oobabooga WebUI, with the 4-bit cache enabled, I can load all layers into VRAM and the speed jumps to 20.5 t/s. That's a significant increase in performance (roughly 5-fold).
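
For reference, here is a minimal sketch (not LM Studio's own implementation) of how the same llama.cpp capability can be configured through the llama-cpp-python bindings. The type_k / type_v / flash_attn parameters and the ggml type values are assumptions based on recent versions of those bindings, and the model path and layer count are simply taken from the example above.

```python
# Illustrative only: quantized KV cache via llama-cpp-python (assumed parameter
# names; recent versions expose type_k / type_v / flash_attn on the Llama class).
from llama_cpp import Llama

llm = Llama(
    model_path="aya-23-35B-Q4_K_M.gguf",  # model from the example above
    n_gpu_layers=41,       # all 41 layers can be offloaded once the cache is 4-bit
    n_ctx=8192,            # illustrative context length
    flash_attn=True,       # quantized KV cache requires flash attention
    type_k=2,              # GGML_TYPE_Q4_0: 4-bit K cache
    type_v=2,              # GGML_TYPE_Q4_0: 4-bit V cache
)

print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```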

This should be your main priority, since the current limitation is pushing your customers to move to a different platform. Right now, I don't use LM Studio when I want to run a larger model, which is unfortunate since I am your biggest fan.

Please, solve this issue ASAP.

@yagil
Member

yagil commented Jul 31, 2024

Noted @Iory1998, this will be addressed.

@yagil
Member

yagil commented Jul 31, 2024

This is now available in beta. Check out the #beta-releases-chat channel on Discord.

@Iory1998
Author

Thank you very much. I have been testing the 0.3 Beta 1 for a few days now, and it does not have the caching feature.

@yagil
Member

yagil commented Jul 31, 2024

> Thank you very much. I have been testing the 0.3 Beta 1 for a few days now, and it does not have the caching feature.

It is a parallel beta for the current release train. Available as of an hour ago.

@Iory1998
Author

Thank you for your prompt response. Can I get a link here or an email, since I don't use Discord?
On a different note, I already sent your team an email with some remarks about Beta 1 but haven't heard back yet. The email subject is "Feedback on LMS v0.3b Beta".

@Iory1998
Author

Iory1998 commented Aug 1, 2024

Never mind, I joined Discord just to test the 0.31 Beta 1.

@Iory1998 Iory1998 closed this as completed Aug 3, 2024
@GabeAl

GabeAl commented Aug 30, 2024

K and V quants for the context are still not available. I'm rolling back to pre-0.3 to get them back.

The difference is usable vs. unusable for me on a 16 GB GPU with Llama 3.1 8B and Phi-medium. With the Q4 cache quants, the model fit and could look through the full context.

The new release takes 4 times the memory (and even with a smaller cache it still runs slower).

My request is to bring back the ability for the user to adjust the K and V context quants for Flash attention.
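
For context, here is a rough sketch of the arithmetic behind the "4 times the memory" figure above, assuming a typical Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128); the context length is illustrative.

```python
# Back-of-the-envelope KV-cache sizing: K and V each store
# n_ctx * n_kv_heads * head_dim elements per layer.
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2.0):
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

ctx = 32768
fp16 = kv_cache_bytes(ctx, bytes_per_elem=2.0)     # 16-bit cache
q4   = kv_cache_bytes(ctx, bytes_per_elem=0.5625)  # Q4_0: 18 bytes per 32-element block
print(f"FP16: {fp16 / 2**30:.1f} GiB, Q4_0: {q4 / 2**30:.1f} GiB")
# FP16: 4.0 GiB, Q4_0: 1.1 GiB -> roughly the 4x difference described above
```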

@GabeAl

GabeAl commented Aug 30, 2024

Just saw this was closed. This should not have been closed, as the feature is not available on the latest release (as far as I can see?).

#70

@Iory1998
Author

Iory1998 commented Sep 9, 2024

> Just saw this was closed. This should not have been closed, as the feature is not available on the latest release (as far as I can see?).
>
> #70

No, it was closed because the feature is being added. In version 0.3.2, the KV cache is set at FP8. I tested the beta, and you could set the KV cache to Q4 or Q8, but that option has not been added to the official LM Studio release yet.
