Is there any way to load GGUF models? #3141
Title says everything :)

Comments
Hi @InAnYan, we don't have official support for GGUF and we don't support running local in-process inference on GGUF models, but you can try one of these ways:
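One common way to do this, and the approach taken in the next comment, is to serve the GGUF file behind an OpenAI-compatible HTTP endpoint, for example with llama.cpp's llama-server, and point a standard OpenAI client at it. A minimal sketch of that pattern, assuming a server is already listening on localhost:8080; the model file name, port, and prompt are placeholders rather than anything from this thread:

```python
# Sketch: query a GGUF model served behind an OpenAI-compatible endpoint,
# e.g. llama.cpp's llama-server started with something like:
#   llama-server -m qwq-32b-preview-q4_k_m.gguf --port 8080
from openai import OpenAI

# llama-server does not check the API key, but the client requires some value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwq-32b-preview-q4_k_m",  # placeholder; the server typically ignores this field
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```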
I've modified the llama-server code in my llama.cpp fork to add broader OpenAI API support, so the JSON response is compatible with the OpenAI client used by helm-run. After this I was able to run benchmarks on Qwen's QwQ (qwq-32b-preview-q4_k_m.gguf, a 32B 4-bit-quantized GGUF model). I've created a pull request here. To use the modified llama.cpp server with HELM, I created a folder with model_deployments.yaml and model_metadata.yaml files and used the following model configurations. In the
It's important to point correctly to the
Then, you need to install the OpenAI HELM client:
I started the benchmark using:
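For anyone trying to reproduce this kind of setup, a rough sketch of what the two configuration files might contain is below. Everything in it is illustrative rather than taken from the comment above: the deployment name, tokenizer, context length, client class path, and base_url argument are assumptions and should be checked against the HELM documentation for your version.

```yaml
# prod_env/model_deployments.yaml (illustrative sketch, not the commenter's actual file)
model_deployments:
  - name: local/qwq-32b-preview          # placeholder deployment name
    model_name: local/qwq-32b-preview    # must match the entry in model_metadata.yaml
    tokenizer_name: huggingface/gpt2     # placeholder; use a tokenizer registered in HELM
    max_sequence_length: 32768
    client_spec:
      class_name: "helm.clients.openai_client.OpenAIClient"  # verify the class path for your HELM version
      args:
        base_url: http://localhost:8080/v1   # the locally running llama-server endpoint

# prod_env/model_metadata.yaml (illustrative sketch)
models:
  - name: local/qwq-32b-preview
    display_name: QwQ 32B Preview (GGUF, 4-bit)
    description: Qwen QwQ 32B preview served locally from a GGUF file.
    creator_organization_name: Qwen
    access: open
    release_date: 2024-11-28              # adjust as appropriate
    tags: [TEXT_MODEL_TAG, FULL_FUNCTIONALITY_TEXT_MODEL_TAG]
```

With files like these in HELM's prod_env folder and the OpenAI client dependencies installed, the run is typically started with helm-run, along the lines of `helm-run --run-entries "mmlu:subject=philosophy,model=local/qwq-32b-preview" --suite my-suite --max-eval-instances 10`; flag names differ slightly between HELM releases, so consult `helm-run --help`.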
Thanks @Nero7991! This looks great. I will follow the pull request to llama.cpp.