Add Int4CPULayout and update int4 woq #1278
base: main

Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1278
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
❌ 8 New Failures, 2 Pending: as of commit 98b8f8c with merge base 01dc7da, the following jobs have failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 1b26f26 to 104d1f3 (Compare)
We are doing a refactor of the file structure, btw: #1234. It might be good to rebase after that lands.
@@ -102,7 +102,8 @@ def _groupwise_affine_quantize_tensor_from_qparams(
         .reshape_as(w)
     )
     if TORCH_VERSION_AT_LEAST_2_5:
-        w_int4x8 = (w_int4x8[::, ::2] << 4 | w_int4x8[::, 1::2]).to(torch.uint8)
+        if w.device.type != "cpu":
+            w_int4x8 = (w_int4x8[::, ::2] << 4 | w_int4x8[::, 1::2]).to(torch.uint8)
maybe you can use the helper at line 60 in 39f16f4:

    def is_device(target_device_str: str, device: Union[str, torch.device]):
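For illustration, a sketch of the call site rewritten with this helper (the import path is an assumption):

    from torchao.utils import is_device

    if TORCH_VERSION_AT_LEAST_2_5:
        if not is_device("cpu", w.device):
            # pack pairs of int4 values into one uint8 only on non-CPU devices
            w_int4x8 = (w_int4x8[::, ::2] << 4 | w_int4x8[::, 1::2]).to(torch.uint8)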
Done.
@@ -630,6 +630,11 @@ def extra_repr(self):
         return f"inner_k_tiles={self.inner_k_tiles}"


+@dataclass(frozen=True)
+class Int4CPULayout(Layout):
+    def pre_process(self, input: torch.Tensor) -> torch.Tensor:
You don't need to define this if it's the same as the default behavior; you can just do pass here, I think.
Yes, the default behavior is ok; I'll use pass instead.
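A minimal sketch of the simplified layout class after that change (assuming the Layout base class from the diff above):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Int4CPULayout(Layout):
        pass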
    __torch_function__ = torch._C._disabled_torch_function_impl

    def get_plain(self) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
we have an unpack op for tensor core tiled layout now, so this can actually be replaced with a call to the op:

ao/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.cu, lines 311 to 312 in 39f16f4:

    m.impl("torchao::unpack_tensor_core_tiled_layout", &_unpack_tensor_core_tiled_layout);
    m.impl("torchao::dequantize_tensor_core_tiled_layout", &_dequantize_tensor_core_tiled_layout);
do you plan to write similar ops for cpu?
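For reference, a rough sketch of what the get_plain call site might look like; the exact op signature and where inner_k_tiles lives are assumptions based on the registration quoted above:

    # unpack the tensor-core-tiled packed weight back to a plain int tensor (sketch)
    unpacked = torch.ops.torchao.unpack_tensor_core_tiled_layout(
        self.packed_weight, self._layout.inner_k_tiles
    )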
I have noticed this, but I have no bandwidth to do it these days. If this feature is not urgent for you, I can take this task later.
cc @mingfeima
torchao/quantization/subclass.py
Outdated
@@ -609,5 +617,8 @@ def to_qtensor_components(cls, input_float, groupsize=128, inner_k_tiles=8):
     input_int4x8, scales_and_zeros = groupwise_affine_quantize_tensor(
         input_float, 4, groupsize, dtype=input_float.dtype
     )
-    int_data = aten._convert_weight_to_int4pack(input_int4x8, inner_k_tiles)
+    if input_float.device == torch.device("cpu"):
same here, can probably use the helper at line 60 in 39f16f4:

    def is_device(target_device_str: str, device: Union[str, torch.device]):
Done.
torchao/quantization/utils.py
Outdated
    # if int_data_device_type == "mps":
    #     int_data = int_data.cpu()
    if int_data_device_type != "cpu":
        int_data = (int_data[::, ::2] << 4 | int_data[::, 1::2]).to(torch.uint8)
    # if int_data_device_type == "mps":
    #     int_data = int_data.to(device="mps")
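For context, this packs two int4 values into each uint8: even columns land in the high nibble and odd columns in the low nibble. A standalone round-trip check (minimal sketch):

    import torch

    w = torch.tensor([[1, 2, 15, 7]], dtype=torch.uint8)  # int4 values in 0..15
    packed = (w[::, ::2] << 4 | w[::, 1::2]).to(torch.uint8)
    # packed == tensor([[18, 247]]), i.e. [[0x12, 0xF7]]

    high, low = packed >> 4, packed & 0xF
    unpacked = torch.stack([high, low], dim=-1).reshape(w.shape)
    assert torch.equal(unpacked, w)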
Please remove the code that's commented out. Also, is this equivalent to the previous code?
According to #517 (comment), << can be used on the MPS backend, so we don't need to convert to CPU and run the packing on the CPU backend. Since I don't have an MPS machine, I want to use CI to check whether this works. Otherwise, I can update it to int_data = (torch.bitwise_left_shift(int_data[::, ::2], 4) | int_data[::, 1::2]).to(torch.uint8) instead.
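A quick CPU-side check that the two spellings agree (minimal sketch; << dispatches to torch.bitwise_left_shift):

    import torch

    x = torch.randint(0, 16, (4, 8), dtype=torch.uint8)
    a = (x[::, ::2] << 4 | x[::, 1::2]).to(torch.uint8)
    b = (torch.bitwise_left_shift(x[::, ::2], 4) | x[::, 1::2]).to(torch.uint8)
    assert torch.equal(a, b)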
This can be a separate PR, but can you also help add support for conversion between the int4 tensor core tiled layout and the int4 CPU layout? We may need a separate util for this, like we discussed in the issue: #1117 (comment).
Right now we error out when converting between different devices:

ao/torchao/dtypes/affine_quantized_tensor.py, lines 1486 to 1489 in 39f16f4:

    if not is_device(torch.device(self.device).type, device):
        raise ValueError(
            f"TensorCoreTiledAQTTensorImpl does not support conversion from {self.device} to {device}"
        )

A test can be added in ao/test/dtypes/test_affine_quantized.py, line 44 in 39f16f4:

    class TestAffineQuantized(TestCase):
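A hypothetical test sketch for that file (quantize_ and int4_weight_only are existing torchao APIs; the cross-device conversion behavior is the feature being requested here, so this is illustrative only):

    import torch
    from torchao.quantization import quantize_, int4_weight_only

    def test_int4_layout_device_conversion(self):
        linear = torch.nn.Linear(128, 256, dtype=torch.bfloat16, device="cuda")
        quantize_(linear, int4_weight_only())
        # Once the conversion util exists, moving to CPU should repack from the
        # tensor-core-tiled layout to the CPU layout instead of raising.
        linear_cpu = linear.cpu()
        _ = linear_cpu(torch.randn(2, 128, dtype=torch.bfloat16))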
    )

    def _apply_fn_to_data(self, fn):
        # self.packed_weight = fn(self.packed_weight)
Please remove the commented-out code.
Done.
    def to(self, *args, **kwargs):
        kwargs = self._get_to_kwargs(*args, **kwargs)
        device = kwargs["device"]
we should ban the device change, like:

ao/torchao/dtypes/affine_quantized_tensor.py, lines 1486 to 1489 in 39f16f4:

    if not is_device(torch.device(self.device).type, device):
        raise ValueError(
            f"TensorCoreTiledAQTTensorImpl does not support conversion from {self.device} to {device}"
        )
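A minimal sketch of the re-added guard in the new impl's to method (field names such as packed_weight and scale_and_zero follow the tensor-core-tiled impl and are assumptions here):

    def to(self, *args, **kwargs):
        kwargs = self._get_to_kwargs(*args, **kwargs)
        device = kwargs["device"]
        if not is_device(torch.device(self.device).type, device):
            raise ValueError(
                f"Int4CPUAQTTensorImpl does not support conversion from {self.device} to {device}"
            )
        return self.__class__(
            self.packed_weight.to(device),
            self.scale_and_zero.to(device),
            self.transposed,
            self._layout,
        )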
Added it back.
Force-pushed from 512eb75 to 98b8f8c (Compare)
pytorch/pytorch#139611 has been merged into the PyTorch main branch.