
Is there any way to load GGUF models? #3141

Open · InAnYan opened this issue Nov 7, 2024 · 3 comments

InAnYan commented Nov 7, 2024

Title says everything :)

yifanmai (Collaborator) commented Nov 8, 2024

Hi @InAnYan, we don't have official support for GGUF and we don't support running local in-process inference on GGUF models, but you can try one of these ways:

Nero7991 commented Dec 2, 2024

I've modified the llama-server code in my llama.cpp fork to add broader OpenAI API support, so the JSON responses are compatible with the OpenAI client used by helm-run. After this change I was able to run benchmarks on Qwen's QwQ (qwq-32b-preview-q4_k_m.gguf, a 32B 4-bit quantized GGUF model). I've created a pull request here.

To use the modified llama.cpp server with HELM, I created a local configuration folder containing model_deployments.yaml and model_metadata.yaml files with the following contents.
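The layout is just those two files in one directory (here I'm assuming the same path that is passed to --local-path in the helm-run command further down):

helm-config/
├── model_deployments.yaml
└── model_metadata.yaml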

In the model_deployments.yaml file:

model_deployments:
  - name: vllm/qwq-32b-preview-q4_k_m
    model_name: vllm/qwq-32b-preview-q4_k_m
    tokenizer_name: qwen/qwen1.5-7b
    max_sequence_length: 32767
    client_spec:
      class_name: "helm.clients.vllm_client.VLLMClient"
      args:
        base_url: http://127.0.0.1:8080/v1/

It's important that base_url points to the running llama-server instance. I started the server with:

./llama-server -m models/qwq-32b-preview-q4_k_m.gguf -ngl 65 --port 8080
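As a quick sanity check before running HELM, you can query the server's OpenAI-compatible chat endpoint directly (this assumes llama-server exposes the standard /v1/chat/completions route; adjust the model name and prompt as needed):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwq-32b-preview-q4_k_m", "messages": [{"role": "user", "content": "Say hello"}]}'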

In the model_metadata.yaml file:

models: 
  - name: vllm/qwq-32b-preview-q4_k_m
    display_name: QwQ Preview 4 bit GGUF (32B)
    description: QwQ Preview 4-bit GGUF (32B), proposed by Alibaba Cloud. Qwen is a family of transformer models with SwiGLU activation, RoPE, and multi-head attention. The 32B version also includes grouped query attention (GQA). ([blog](https://qwenlm.github.io/blog/qwen1.5-32b/))
    creator_organization_name: Qwen
    access: open
    release_date: 2024-04-02
    tags: [TEXT_MODEL_TAG, LIMITED_FUNCTIONALITY_TEXT_MODEL_TAG, INSTRUCTION_FOLLOWING_MODEL_TAG]

Then, install HELM's OpenAI client dependencies:

pip install crfm-helm[openai]
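Note: on shells that expand square brackets (e.g. zsh), quote the package spec so pip receives it unchanged:

pip install "crfm-helm[openai]"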

I started the benchmark using:

helm-run --run-entries mmlu:subject=philosophy,model=vllm/qwq-32b-preview-q4_k_m --suite my-suite --max-eval-instances 10 --local-path ~/GitHub/helm-config/
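After the run finishes, the usual HELM follow-up commands should work for summarizing and browsing the results (assuming the same suite name):

helm-summarize --suite my-suite
helm-server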

yifanmai (Collaborator) commented Dec 2, 2024

Thanks @Nero7991! This looks great. I will follow the pull request to llama.cpp.
