Fixing quantize in int4 mode #159

Artyom17 · 2024-04-19T03:44:23Z

Int4 quantization requires a CUDA device; however, the current implementation unconditionally overrides the --device parameter with 'cpu'.

Artyom17 · 2024-04-19T21:25:38Z

@HDCharles ?

Chillee · 2024-04-21T19:11:02Z

Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.
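One way to reconcile the two requirements in this thread (int4 needs CUDA, int8 is better left on CPU) is to override the device only for modes that do not need the GPU. The sketch below is a hypothetical policy, not the code in this PR; the function name `resolve_quantize_device` and the mode strings are assumptions made for illustration.

```python
def resolve_quantize_device(mode: str, requested: str) -> str:
    """Pick the device to quantize on.

    Hypothetical helper: the name and the mode strings are illustrative
    assumptions, not taken from the repository.
    """
    if mode == "int4":
        # Per the PR description, int4 quantization needs a CUDA device,
        # so the user's --device value is honored here.
        return requested
    # For int8 and similar modes, stay on CPU so the full-precision model
    # is never materialized on the GPU before quantization.
    return "cpu"
```

With this policy, `resolve_quantize_device("int4", "cuda")` keeps the requested device, while `resolve_quantize_device("int8", "cuda")` falls back to CPU.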

Artyom17 · 2024-04-22T18:53:43Z

> Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.

The issue is that if I quantize on CPU, the resulting model doesn't really work on GPU later. I'm not sure why, but that's what I saw on an H100: only the GPU-quantized version works. Either way, it is a bug: if you want to quantize on CPU by default, I think it would be better to set the default value of the --device parameter to CPU.
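The suggestion about the default value could look roughly like the following. This is a minimal sketch using `argparse`; only the `--device` flag comes from this thread, while the `--mode` flag, its choices, and the `build_parser` name are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser whose --device defaults to CPU (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Quantize a model (sketch).")
    # Assumed flag: the real script's mode option may be named differently.
    parser.add_argument("--mode", default="int8", choices=["int8", "int4"])
    # Defaulting to CPU keeps the "quantize on CPU" behavior when the flag is
    # omitted, while an explicit --device cuda is no longer silently ignored.
    parser.add_argument("--device", default="cpu")
    return parser
```

Parsing an empty argument list yields `device == "cpu"`, whereas `["--mode", "int4", "--device", "cuda"]` yields `device == "cuda"`, so the override lives in the default rather than in the code path.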

jerryzh168 · 2024-04-29T21:30:38Z

> Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.
>
> The issue is that if I quantize on CPU, the resulting model doesn't really work on GPU later. I'm not sure why, but that's what I saw on an H100: only the GPU-quantized version works. Either way, it is a bug: if you want to quantize on CPU by default, I think it would be better to set the default value of the --device parameter to CPU.

This is probably related to packing. Right now there is a silent numerical discrepancy between using the packed weight on CPU and on CUDA:

(Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cuda"), weight_int4pack.to("cuda"), scales_and_zeros.to("cuda"), out_features, self.groupsize)[:3,:3]
tensor([[-0.0048, -0.0957, -0.0757],
        [ 0.0243, -0.0211, -0.0081],
        [ 0.0194, -0.0398, -0.0081]], device='cuda:0', dtype=torch.bfloat16)
(Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cpu"), weight_int4pack.to("cpu"), scales_and_zeros.to("cpu"), out_features, self.groupsize)[:3,:3]
tensor([[-4.8218e-03,  1.6235e-02,  1.9043e-02],
        [-1.4526e-02, -2.1118e-02, -8.0566e-03],
        [ 3.0518e-05, -2.4414e-03,  5.4932e-03]], dtype=torch.bfloat16)
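To make the size of the discrepancy concrete, the two 3x3 slices printed above can be compared element by element. The sketch below uses plain Python lists with the values copied from the debugger output; the `compare` helper and the `1e-3` tolerance are illustrative assumptions. In a real PyTorch session, `torch.testing.assert_close` on the two tensors would serve the same purpose.

```python
# Values copied from the (Pdb) output above.
cuda_out = [
    [-0.0048, -0.0957, -0.0757],
    [0.0243, -0.0211, -0.0081],
    [0.0194, -0.0398, -0.0081],
]
cpu_out = [
    [-4.8218e-03, 1.6235e-02, 1.9043e-02],
    [-1.4526e-02, -2.1118e-02, -8.0566e-03],
    [3.0518e-05, -2.4414e-03, 5.4932e-03],
]


def compare(a, b, tol=1e-3):
    """Return (max absolute difference, number of entries differing by more than tol)."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    mismatched = sum(d > tol for d in diffs)
    return max(diffs), mismatched


max_diff, mismatched = compare(cuda_out, cpu_out)
```

For these values, 6 of the 9 entries differ by more than the tolerance, and the largest gap is about 0.112. The three entries that do agree are consistent with the packing hypothesis, since a layout mismatch would leave some positions intact while scrambling others.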

cc @HDCharles

HDCharles

Is this still needed? I thought @malfet addressed this a while back.

Fixing quantize in int4 mode

9f08b3c

facebook-github-bot added the CLA Signed label Apr 19, 2024

Artyom17 mentioned this pull request Apr 19, 2024

llama3 8B support, tiktoken tokenizer #158

Merged

HDCharles approved these changes May 24, 2024
