
About fusion of **kdequantize kernel** and **simple bf16/fp16 matmul** #1319

Open
Ther-nullptr opened this issue Aug 15, 2024 · 1 comment

@Ther-nullptr
Contributor

Feature request

A fused CUDA kernel that combines the main-weight dequantization step and the matrix multiplication, in order to reduce on-chip/off-chip data movement.

Motivation

I used profiling tools to analyze the kernel-time breakdown of QLoRA:

config: activation:(4,512,14336), weight:(4096,14336), precision:NF4/BF16, platform: NVIDIA A800 80GB PCIe

[profiler screenshots: QLoRA kernel-time breakdown]

Notice that the quantize/dequantize processing of the main weight takes roughly 30%~50% of the time of the main matrix multiplication itself. Breaking down the computation:

  1. load the 4-bit weight matrix from DRAM (HBM) to SRAM (shared memory). <kernel 1>
  2. dequantize the weight to fp16/bf16 in SRAM. <kernel 1>
  3. write the dequantized weight back to DRAM. <kernel 1>
  4. load the fp16/bf16 weight and the activation to SRAM. <kernel 2>
  5. compute in SRAM. <kernel 2>
  6. write the output back to DRAM. <kernel 2>
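The six-step two-kernel path can be modeled in NumPy as a host-side sketch of the data flow (not actual CUDA; the 4-bit format here is a hypothetical 16-entry codebook, not the real NF4 layout with absmax blocks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 16-entry codebook standing in for the NF4 quantization levels.
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float16)

# 4-bit weight codes and an fp16 activation (small shapes for illustration).
codes = rng.integers(0, 16, size=(64, 128), dtype=np.uint8)   # "DRAM" 4-bit weight
activation = rng.standard_normal((8, 128)).astype(np.float16)

# Kernel 1: load the codes, dequantize, write the fp16 weight back to "DRAM".
weight_fp16 = codebook[codes]          # a full 16-bit copy materialized off-chip

# Kernel 2: load the fp16 weight and the activation again, matmul, store output.
out_unfused = activation @ weight_fp16.T
```

The key cost is that `weight_fp16` is written out by kernel 1 and read back in by kernel 2.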

So, is it possible to fuse the kernels so that they behave like this:

  1. load the 4-bit weight matrix and the 16-bit activation from DRAM (HBM) to SRAM (shared memory). <kernel 1>
  2. dequantize the weight to fp16/bf16 in SRAM. <kernel 1>
  3. compute in SRAM. <kernel 1>
  4. write the output back to DRAM. <kernel 1>
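This fused loop can also be sketched in NumPy (again a host-side model with a hypothetical 16-entry codebook rather than the real NF4 layout): dequantize one weight tile at a time into a local buffer standing in for SRAM and multiply immediately, so the full fp16 weight is never written back to "DRAM":

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = np.linspace(-1.0, 1.0, 16, dtype=np.float16)  # hypothetical codebook
codes = rng.integers(0, 16, size=(64, 128), dtype=np.uint8)
activation = rng.standard_normal((8, 128)).astype(np.float16)

TILE = 16  # weight rows dequantized per "kernel iteration"
out_fused = np.zeros((8, 64), dtype=np.float32)
for r in range(0, codes.shape[0], TILE):
    tile_fp16 = codebook[codes[r:r + TILE]]              # dequantize tile in "SRAM"
    out_fused[:, r:r + TILE] = activation @ tile_fp16.T  # compute immediately

# Matches the result of dequantizing the whole weight first, then multiplying.
reference = activation @ codebook[codes].T
assert np.allclose(out_fused, reference.astype(np.float32), atol=1e-2)
```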

Thus we only need to launch one kernel, saving one 16-bit weight store and one 16-bit weight load.
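For the shapes given in the motivation (weight (4096, 14336), NF4 → bf16), the saved traffic from skipping the 16-bit weight round trip is easy to estimate (ignoring absmax/scale metadata and activation/output traffic, which both schemes share):

```python
# Back-of-the-envelope DRAM traffic for a (4096, 14336) weight.
rows, cols = 4096, 14336
w4_bytes = rows * cols // 2      # 4-bit codes
w16_bytes = rows * cols * 2      # bf16/fp16 copy

# Unfused: read 4-bit, write 16-bit, read 16-bit back. Fused: read 4-bit only.
unfused_weight_traffic = w4_bytes + 2 * w16_bytes
fused_weight_traffic = w4_bytes
saved = unfused_weight_traffic - fused_weight_traffic

print(f"saved {saved / 2**20:.0f} MiB of DRAM traffic per matmul")  # 2 x 112 MiB = 224 MiB
```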

Your contribution

I have just observed this, and I want to ask whether this idea is feasible.

@matthewdouglas
Member

matthewdouglas commented Aug 16, 2024

We do have a more optimized GEMV path for inference with batch size 1, but otherwise your thought process here is sound. It should be possible, and I would suggest following along with a potential FLUTE integration in #1293.
