VPTQ Model Quantization Support in llama.cpp #9974
YangWang92 started this conversation in Ideas
Replies: 1 comment · 6 replies
-
It would be helpful to know what data you need to store and how it will be used during the matrix multiplication. For example, if I look at the data in one of the VPTQ models (https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-65536-woft/tree/main?show_file_info=model-00001-of-00002.safetensors), can you briefly sketch how the tensors for …
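For concreteness, here is a rough sketch of how a vector-quantized weight (a codebook of centroids plus per-group indices) could be expanded and used in a plain matrix-vector product. The names and layout below are assumptions for illustration only, not the actual VPTQ on-disk format:

```cpp
#include <cstdint>
#include <vector>

// Expand per-group indices into a dense row-major weight matrix W.
// Each index selects one centroid of length vec_dim, which supplies
// vec_dim consecutive weights. (Hypothetical layout, for illustration.)
static void vptq_dequantize(const uint16_t * indices,   // n_groups entries
                            const float    * centroids, // n_centroids x vec_dim
                            float          * W,
                            int64_t n_groups, int64_t vec_dim) {
    for (int64_t g = 0; g < n_groups; ++g) {
        const float * c = centroids + (int64_t) indices[g] * vec_dim;
        for (int64_t d = 0; d < vec_dim; ++d) {
            W[g * vec_dim + d] = c[d];
        }
    }
}

// y = W * x: naive reference that first reconstructs W from the codebook,
// then performs a plain matrix-vector product (rows x cols).
static void vptq_matvec(const uint16_t * indices, const float * centroids,
                        const float * x, float * y,
                        int64_t rows, int64_t cols, int64_t vec_dim) {
    std::vector<float> W((size_t) (rows * cols));
    vptq_dequantize(indices, centroids, W.data(), rows * cols / vec_dim, vec_dim);
    for (int64_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int64_t c = 0; c < cols; ++c) {
            acc += W[r * cols + c] * x[c];
        }
        y[r] = acc;
    }
}
```

Mapping the actual tensors in the safetensors file onto the `indices`/`centroids` roles above is exactly the information being asked for here.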
-
Hi all,
We recently developed a fully open-source quantization method called VPTQ (Vector Post-Training Quantization), which allows fast quantization of large language models (LLMs) to 1-4 bits. The community has also helped us release several models using this method: https://huggingface.co/VPTQ-community. I am personally very interested in integrating this quantization method into ollama/llama.cpp.

There have been some discussions about this at this link, but I'm not sure they fully cover the possibility of integrating VPTQ with llama.cpp. One important point to note is that VPTQ may not necessarily require a separate quantization dtype: its dequantization is quite simple, using just a lookup table. I would like to ask if you could guide me on how to integrate VPTQ into Ollama, even if it's on my own fork. Specifically, I'm considering two approaches (a rough sketch of the second one follows the list below):
1. Define a series of new models (e.g., vptq-llama3.1) using existing data types (int32, fp16) and hide the dequantization process inside a separate dequant op.
2. Define a new quantization data type (e.g., some lookup-table structure).
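To make the second approach concrete, here is a minimal sketch of what a dedicated lookup-table quant type could look like. None of these names exist in ggml/llama.cpp; the block size, vector dimension, and structs are assumptions chosen only to illustrate the shape of the data:

```cpp
#include <cstdint>

#define QK_VPTQ  32  // assumed: weights per block
#define VPTQ_VEC  8  // assumed: centroid vector dimension

// Per-block data: only indices into a shared codebook.
struct block_vptq {
    uint16_t idx[QK_VPTQ / VPTQ_VEC];  // one index per VPTQ_VEC weights
};

// Per-tensor side data: the codebook itself (n_centroids x VPTQ_VEC floats).
struct vptq_codebook {
    int64_t       n_centroids;
    const float * centroids;
};

// Dequantize one block into QK_VPTQ floats by copying the selected centroids.
static void dequantize_block_vptq(const block_vptq * b,
                                  const vptq_codebook * cb,
                                  float * y) {
    for (int g = 0; g < QK_VPTQ / VPTQ_VEC; ++g) {
        const float * c = cb->centroids + (int64_t) b->idx[g] * VPTQ_VEC;
        for (int d = 0; d < VPTQ_VEC; ++d) {
            y[g * VPTQ_VEC + d] = c[d];
        }
    }
}
```

The open question for this approach is where the shared codebook would live, since (as far as I understand) the existing ggml quant types keep everything they need inside each block; that per-tensor side data is essentially what distinguishes the two approaches above.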
Could you please share your thoughts on which approach would be better or any suggestions for integration?
Thank you for your time and insights!
Yang