Feature request
In the QLoRA quantization procedure there are two datatypes: the NF4 storage datatype and a compute datatype (bfloat16 in the paper, matching the original weights). Values are dequantized to the compute datatype for inference and for computing the backward pass. When I tried using int8 as the compute datatype, matrix multiplication threw an error because it is not supported for that datatype. I have not tried inference with qint8(). Is it possible to make NF4 a supported compute datatype and have the relevant functions handle it?
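For context, this is roughly how the storage and compute datatypes are selected today through the transformers/bitsandbytes integration (a minimal sketch; the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Weights are stored in NF4; each matmul dequantizes them to the compute dtype.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # torch.int8 here leads to the unsupported-matmul error
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # placeholder model id
    quantization_config=bnb_config,
)
```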
Motivation
Dequantizing values to perform calculations, and then storing the results and updates in full precision (even though QLoRA only updates a small set of adapter weights), is inefficient and often infeasible on edge-device hardware. Research into performing calculations accurately while the weights remain in 4 bits would be a desirable improvement.
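For reference, the dequantization step that currently precedes every matmul is essentially a codebook lookup plus a per-block rescale. Below is a minimal sketch, assuming the 16 NF4 levels rounded from the bitsandbytes reference implementation; the nonuniform spacing of these levels is why NF4 cannot simply be treated like a native integer compute datatype:

```python
import torch

# The 16 NF4 quantiles from the QLoRA paper (rounded; assumption: values
# taken from the bitsandbytes reference implementation).
NF4_CODE = torch.tensor([
    -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
     0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230,  1.0000,
])

def dequantize_nf4(indices: torch.Tensor, absmax: torch.Tensor,
                   compute_dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    """Map 4-bit codebook indices (0..15) back to real values.

    `absmax` is the per-block scale; double quantization of absmax is
    omitted here for brevity.
    """
    return (NF4_CODE[indices] * absmax).to(compute_dtype)
```

A native NF4 matmul would presumably have to fold this 16-entry lookup into the kernel rather than reuse standard integer arithmetic, which is likely why it is not supported today.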
Your contribution
I can try to submit a PR for this. I would just need some pointers in the right direction to get started.