
[High Priority Feature] Please add Support for 8-bit and 4-Bit Caching! #56

Closed
Iory1998 opened this issue Jul 31, 2024 · 9 comments

@Iory1998

Hello team,

LM Studio uses recent builds of llama.cpp, which already supports the 4-bit and 8-bit KV cache, so I don't understand why LM Studio does not expose it yet.
The benefits are tremendous: it improves generation speed, and the smaller cache also frees VRAM for a higher-quality quantization.

To give you an example, I run aya-23-35B-Q4_K_M.gguf in LM Studio at 4.5 t/s because the maximum number of layers I can load on my GPU with 24 GB of VRAM is 30, while Aya has 41 layers. In Oobabooga WebUI, with the 4-bit cache enabled, I can load all layers into VRAM and the speed jumps to 20.5 t/s. That's a significant increase in performance (roughly 5-fold).
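
For reference, here is a minimal sketch (not LM Studio's own implementation) of how the same llama.cpp capability can be configured through the llama-cpp-python bindings. The type_k / type_v / flash_attn parameters and the ggml type values are assumptions based on recent versions of those bindings, and the model path and layer count are simply taken from the example above.

```python
# Illustrative only: quantized KV cache via llama-cpp-python (assumed parameter
# names; recent versions expose type_k / type_v / flash_attn on the Llama class).
from llama_cpp import Llama

llm = Llama(
    model_path="aya-23-35B-Q4_K_M.gguf",  # model from the example above
    n_gpu_layers=41,       # all 41 layers can be offloaded once the cache is 4-bit
    n_ctx=8192,            # illustrative context length
    flash_attn=True,       # quantized KV cache requires flash attention
    type_k=2,              # GGML_TYPE_Q4_0: 4-bit K cache
    type_v=2,              # GGML_TYPE_Q4_0: 4-bit V cache
)

print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```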

This should be your main priority, since the current limitation is pushing your customers to move to a different platform. Right now, I don't use LM Studio when I want to run a larger model, which is unfortunate since I am your biggest fan.

Please, solve this issue ASAP.

@yagil
Member

yagil commented Jul 31, 2024

Noted @Iory1998, this will be addressed.

@yagil
Member

yagil commented Jul 31, 2024

This is now available in beta. Check out the #beta-releases-chat channel on Discord.

@Iory1998
Author

Thank you very much. I have been testing the 0.3 Beta 1 for a few days now, and it does not have the caching feature.

@yagil
Member

yagil commented Jul 31, 2024

> Thank you very much. I have been testing the 0.3 Beta 1 for a few days now, and it does not have the caching feature.

It is a parallel beta for the current release train. Available as of an hour ago.

@Iory1998
Author

Thank you for your prompt response. Can I get a link here or an email, since I don't use Discord?
On a different note, I already sent your team an email with some remarks about Beta 1 but haven't heard back yet. The email subject is "Feedback on LMS v0.3b Beta".

@Iory1998
Author

Iory1998 commented Aug 1, 2024

Never mind, I joined Discord just to test the 0.31 Beta 1.

@Iory1998 Iory1998 closed this as completed Aug 3, 2024
@GabeAl

GabeAl commented Aug 30, 2024

K and V quants for the context are still not available. I'm rolling back to pre-0.3 to get them back.

The difference is usable vs. unusable for me on a 16 GB GPU with Llama 3.1 8B and Phi-medium. With the Q4 cache quants, the model fit and could look through the full context.

The new release takes 4 times the memory (and even with a smaller cache it still runs slower).

My request is to bring back the ability for the user to adjust the K and V context quants for Flash attention.
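
For context, here is a rough sketch of the arithmetic behind the "4 times the memory" figure above, assuming a typical Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128); the context length is illustrative.

```python
# Back-of-the-envelope KV-cache sizing: K and V each store
# n_ctx * n_kv_heads * head_dim elements per layer.
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2.0):
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

ctx = 32768
fp16 = kv_cache_bytes(ctx, bytes_per_elem=2.0)     # 16-bit cache
q4   = kv_cache_bytes(ctx, bytes_per_elem=0.5625)  # Q4_0: 18 bytes per 32-element block
print(f"FP16: {fp16 / 2**30:.1f} GiB, Q4_0: {q4 / 2**30:.1f} GiB")
# FP16: 4.0 GiB, Q4_0: 1.1 GiB -> roughly the 4x difference described above
```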

@GabeAl

GabeAl commented Aug 30, 2024

Just saw this was closed. This should not have been closed, as the feature is not available on the latest release (as far as I can see?).

#70

@Iory1998
Author

Iory1998 commented Sep 9, 2024

> Just saw this was closed. This should not have been closed, as the feature is not available on the latest release (as far as I can see?).
>
> #70

No, it was closed because the feature is being added. In version 0.3.2, the KV cache is set at FP8. I tested the beta, and you could set the KV cache to Q4 or Q8, but that option has not been added to the official LM Studio release yet.
