We do have a more optimized GEMV path for inference with a batch size of 1, but otherwise your thought process here is sound. It should be possible, and I would suggest following along with a potential FLUTE integration in #1293.
Feature request
A fused CUDA kernel that combines the dequantization of the main weights with the matrix multiplication, in order to reduce on-chip/off-chip data movement.
Motivation
I used profiling tools to analyze the runtime breakdown of QLoRA:
Notice that the quantize/dequantize step for the main weights takes nearly 30%–50% of the time of the main matrix multiplication. Analyzing the computing process:
So, is it possible to fuse the kernels so that they behave like this:
This way we only launch one kernel, and save one 16-bit weight load and one 16-bit weight store.
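The equivalence of the two flows can be sketched in plain Python (a minimal illustration, not the actual kernel; the function names, the per-tensor `scale`, and the toy quantization scheme are all assumptions for demonstration). The point is that the fused path consumes the quantized weights directly and never materializes the intermediate 16-bit weight buffer in memory:

```python
# Illustrative sketch of the two flows (hypothetical per-tensor scale scheme).

def dequantize(q_weights, scale):
    # Kernel 1 today: materialize a full-size 16-bit weight buffer in memory.
    return [[q * scale for q in row] for row in q_weights]

def matmul(x, w):
    # Kernel 2 today: read that buffer back for the matrix multiplication.
    rows, inner, cols = len(x), len(w), len(w[0])
    return [[sum(x[i][k] * w[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

def fused_matmul(x, q_weights, scale):
    # Proposed fused kernel: dequantize each element on the fly (in registers),
    # so the 16-bit weight matrix is never stored or re-loaded.
    rows, inner, cols = len(x), len(q_weights), len(q_weights[0])
    return [[sum(x[i][k] * (q_weights[k][j] * scale) for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

x = [[1.0, 2.0], [3.0, 4.0]]
q = [[2, -1], [0, 3]]   # quantized weights (stand-in for packed 4-bit values)
scale = 0.5             # hypothetical per-tensor dequantization scale

# Both flows produce the same result; the fused one skips the extra
# weight store and load.
assert matmul(x, dequantize(q, scale)) == fused_matmul(x, q, scale)
```

In a real CUDA implementation the dequantization would happen per tile while loading the quantized weights into shared memory or registers inside the GEMM main loop, which is what removes the extra global-memory round trip.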
Your contribution
I have only observed this so far, and I want to ask whether this idea is feasible.