VPTQ Model Quantization Support in llama.cpp #9974
YangWang92 started this conversation in Ideas
Replies: 1 comment · 6 replies
-
It would be helpful to know what data you need to store and how it will be used during the matrix multiplication. For example, if I look at the data in one of the VPTQ models (https://huggingface.co/VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-65536-woft/tree/main?show_file_info=model-00001-of-00002.safetensors), can you briefly sketch how the tensors for …
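For concreteness, here is a rough sketch of how a vector-quantized weight (a codebook of centroids plus per-group indices) could be expanded and used in a plain matrix-vector product. The names and layout below are assumptions for illustration only, not the actual VPTQ on-disk format:

```cpp
#include <cstdint>
#include <vector>

// Expand per-group indices into a dense row-major weight matrix W.
// Each index selects one centroid of length vec_dim, which supplies
// vec_dim consecutive weights. (Hypothetical layout, for illustration.)
static void vptq_dequantize(const uint16_t * indices,   // n_groups entries
                            const float    * centroids, // n_centroids x vec_dim
                            float          * W,
                            int64_t n_groups, int64_t vec_dim) {
    for (int64_t g = 0; g < n_groups; ++g) {
        const float * c = centroids + (int64_t) indices[g] * vec_dim;
        for (int64_t d = 0; d < vec_dim; ++d) {
            W[g * vec_dim + d] = c[d];
        }
    }
}

// y = W * x: naive reference that first reconstructs W from the codebook,
// then performs a plain matrix-vector product (rows x cols).
static void vptq_matvec(const uint16_t * indices, const float * centroids,
                        const float * x, float * y,
                        int64_t rows, int64_t cols, int64_t vec_dim) {
    std::vector<float> W((size_t) (rows * cols));
    vptq_dequantize(indices, centroids, W.data(), rows * cols / vec_dim, vec_dim);
    for (int64_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int64_t c = 0; c < cols; ++c) {
            acc += W[r * cols + c] * x[c];
        }
        y[r] = acc;
    }
}
```

Mapping the actual tensors in the safetensors file onto the `indices`/`centroids` roles above is exactly the information being asked for here.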
-
Hi all,
We recently developed a fully open-source quantization method called VPTQ (Vector Post-Training Quantization), which allows fast quantization of large language models (LLMs) to 1-4 bits. The community has also helped us release several models using this method: https://huggingface.co/VPTQ-community. I am personally very interested in integrating this quantization method into ollama/llama.cpp.

There have been some discussions about this at this link, but I'm not sure they fully cover the possibility of integrating VPTQ with llama.cpp. One important point to note is that VPTQ may not necessarily require a separate quantization dtype: its dequantization is quite simple, using just a lookup table. I would like to ask if you could guide me on how to integrate VPTQ into Ollama, even if it's on my own fork. Specifically, I'm considering two approaches (a rough sketch of the second one follows the list below):
1. Define a series of new models (e.g., vptq-llama3.1) using existing data types (int32, fp16) and hide the dequantization process inside a separate dequant op.
2. Define a new quantization data type (e.g., some lookup-table structure).
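To make the second approach concrete, here is a minimal sketch of what a dedicated lookup-table quant type could look like. None of these names exist in ggml/llama.cpp; the block size, vector dimension, and structs are assumptions chosen only to illustrate the shape of the data:

```cpp
#include <cstdint>

#define QK_VPTQ  32  // assumed: weights per block
#define VPTQ_VEC  8  // assumed: centroid vector dimension

// Per-block data: only indices into a shared codebook.
struct block_vptq {
    uint16_t idx[QK_VPTQ / VPTQ_VEC];  // one index per VPTQ_VEC weights
};

// Per-tensor side data: the codebook itself (n_centroids x VPTQ_VEC floats).
struct vptq_codebook {
    int64_t       n_centroids;
    const float * centroids;
};

// Dequantize one block into QK_VPTQ floats by copying the selected centroids.
static void dequantize_block_vptq(const block_vptq * b,
                                  const vptq_codebook * cb,
                                  float * y) {
    for (int g = 0; g < QK_VPTQ / VPTQ_VEC; ++g) {
        const float * c = cb->centroids + (int64_t) b->idx[g] * VPTQ_VEC;
        for (int d = 0; d < VPTQ_VEC; ++d) {
            y[g * VPTQ_VEC + d] = c[d];
        }
    }
}
```

The open question for this approach is where the shared codebook would live, since (as far as I understand) the existing ggml quant types keep everything they need inside each block; that per-tensor side data is essentially what distinguishes the two approaches above.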
Could you please share your thoughts on which approach would be better or any suggestions for integration?
Thank you for your time and insights!
Yang