
Is there any way to load GGUF models? #3141

Open · InAnYan opened this issue Nov 7, 2024 · 3 comments

InAnYan commented Nov 7, 2024

Title says everything :)

yifanmai (Collaborator) commented Nov 8, 2024

Hi @InAnYan, we don't have official support for GGUF and we don't support running local in-process inference on GGUF models, but you can try one of these ways:

Nero7991 commented Dec 2, 2024

I've modified the llama-server code in my llama.cpp fork to add broader OpenAI API support, so the JSON responses are compatible with the OpenAI client used by helm-run. After this change I was able to run benchmarks on Qwen's QwQ (qwq-32b-preview-q4_k_m.gguf, a 32B 4-bit quantized GGUF model). I've created a pull request here.

To use the modified llama.cpp server with HELM, I created a local configuration folder containing model_deployments.yaml and model_metadata.yaml files with the following contents.
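The layout is just those two files in one directory (here I'm assuming the same path that is passed to --local-path in the helm-run command further down):

helm-config/
├── model_deployments.yaml
└── model_metadata.yaml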

In the model_deployments.yaml file:

model_deployments:
  - name: vllm/qwq-32b-preview-q4_k_m
    model_name: vllm/qwq-32b-preview-q4_k_m
    tokenizer_name: qwen/qwen1.5-7b
    max_sequence_length: 32767
    client_spec:
      class_name: "helm.clients.vllm_client.VLLMClient"
      args:
        base_url: http://127.0.0.1:8080/v1/

It's important that base_url points to the running llama-server instance. I started the server with:

./llama-server -m models/qwq-32b-preview-q4_k_m.gguf -ngl 65 --port 8080
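As a quick sanity check before running HELM, you can query the server's OpenAI-compatible chat endpoint directly (this assumes llama-server exposes the standard /v1/chat/completions route; adjust the model name and prompt as needed):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwq-32b-preview-q4_k_m", "messages": [{"role": "user", "content": "Say hello"}]}'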

In the model_metadata.yaml file:

models: 
  - name: vllm/qwq-32b-preview-q4_k_m
    display_name: QwQ Preview 4 bit GGUF (32B)
    description: QwQ Preview 4-bit GGUF (32B), proposed by Alibaba Cloud. Qwen is a family of transformer models with SwiGLU activation, RoPE, and multi-head attention. The 32B version also includes grouped query attention (GQA). ([blog](https://qwenlm.github.io/blog/qwen1.5-32b/))
    creator_organization_name: Qwen
    access: open
    release_date: 2024-04-02
    tags: [TEXT_MODEL_TAG, LIMITED_FUNCTIONALITY_TEXT_MODEL_TAG, INSTRUCTION_FOLLOWING_MODEL_TAG]

Then, install HELM's OpenAI client dependencies:

pip install crfm-helm[openai]
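Note: on shells that expand square brackets (e.g. zsh), quote the package spec so pip receives it unchanged:

pip install "crfm-helm[openai]"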

I started the benchmark using:

helm-run --run-entries mmlu:subject=philosophy,model=vllm/qwq-32b-preview-q4_k_m --suite my-suite --max-eval-instances 10 --local-path ~/GitHub/helm-config/
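After the run finishes, the usual HELM follow-up commands should work for summarizing and browsing the results (assuming the same suite name):

helm-summarize --suite my-suite
helm-server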

yifanmai (Collaborator) commented Dec 2, 2024

Thanks @Nero7991! This looks great. I will follow the pull request to llama.cpp.
