
Subclass API (#966) #995

Merged 1 commit into pytorch:main on Oct 30, 2024
Conversation

metascroy (Contributor)

Summary:

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487

facebook-github-bot added the CLA Signed label on Oct 2, 2024

pytorch-bot bot commented Oct 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/995

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ae21905 with merge base 958a197:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D62464487

@@ -300,7 +300,7 @@ def _quantize_affine_no_dtype_cast(
     elif zero_point_domain is None:
         # This case handles quantization for float8 we expect no zero point and no zero point domain
         assert zero_point is None, "zero_point should be None when zero_point_domain is None"
-        quant = torch.clamp(input * scale.reciprocal(), quant_min, quant_max)
+        quant = torch.clamp(torch.round(input * (1.0 / scale)), quant_min, quant_max)
Contributor Author:
@jerryzh168 to confirm if this is OK. It was needed to match the behavior of the other quantizer.

Contributor:
Hmm, it might be fine as long as all the tests pass, I think.

metascroy (Contributor, Author), Oct 10, 2024:
I think the tests do not pass because the quantization logic is slightly different. It looks more sensible to me to round before truncating, but I can also drop this change.

We can do a perplexity study when moving from the other quantizer to this one in torchchat, but I have narrowed this down as the only numerical difference between the two.

Contributor:
I see. We don't want to break tests, I think, but if this is better for torchchat we can create a new quant primitive op or add a new option here.
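
As a side note, a minimal sketch (not from this PR; values made up) of the numeric difference being discussed, assuming the quantized values are later cast to an integer dtype:

import torch

x = torch.tensor([0.74, -1.26])
scale = torch.tensor(0.5)
quant_min, quant_max = -8, 7

# Original path: clamp only; a later integer cast truncates toward zero.
clamp_only = torch.clamp(x * scale.reciprocal(), quant_min, quant_max).to(torch.int8)
# New path: round to the nearest integer first, then clamp.
round_then_clamp = torch.clamp(torch.round(x * (1.0 / scale)), quant_min, quant_max).to(torch.int8)

print(clamp_only)        # tensor([ 1, -2], dtype=torch.int8): -2.52 truncated toward zero
print(round_then_clamp)  # tensor([ 1, -3], dtype=torch.int8): -2.52 rounded to nearest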

if preserve_zero:
    zero_point = quant_min - torch.round(min_val_neg / scale)
    zero_point = torch.clamp(zero_point, quant_min, quant_max)
    if zero_point_domain is None:
Contributor Author:
@jerryzh168 to confirm if this is OK. It was needed to get scale-only quantization in affine_quantized_tensor.

Contributor:
OK, should zero_point be None here?

Contributor Author:
I could make it None, but that changes the return type of this method from Tuple[Tensor, Tensor] to Tuple[Tensor, Optional[Tensor]]

Contributor:
yeah I think making it None probably makes more sense here
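
A minimal, purely illustrative sketch of the signature consequence discussed above (the real choose_qparams_affine has many more parameters; _choose_qparams_sketch is a hypothetical name):

from typing import Optional, Tuple

import torch

def _choose_qparams_sketch(
    max_val_pos: torch.Tensor,
    quant_max: int,
    scale_only: bool,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    # The scale is computed either way; only the zero point handling changes.
    scale = max_val_pos / quant_max
    if scale_only:
        # Scale-only quantization: no zero point at all, hence Optional[Tensor].
        return scale, None
    return scale, torch.zeros_like(scale)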

exported = torch.export.export(model, (activations,))

print("Compiling quantized model")
compiled = torch.compile(unwrapped_model)
Contributor Author:
@jerryzh168 do you see unification of compile and export coming soon? The fact that one requires an unwrapped tensor subclass and the other requires a wrapped one makes using this API inconvenient in torchchat.

Contributor:
yes, it's blocked by pytorch/pytorch#129682 and I heard @tugsbayasgalan is working on this

metascroy (Contributor, Author):

@kimishpatel @jerryzh168 moving review over to GH. I hope I've addressed most of your concerns.

@jerryzh168, the fact that compile and export cannot handle the same model (export requires an unwrapped tensor subclass, compile requires a wrapped one, and eager can handle both) makes using this API inconvenient in torchchat. Do you know if there is planned unification there?
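
For context, a rough sketch of the workaround this split currently forces; quantized_model and activations are placeholders, and unwrap_tensor_subclass is the helper in torchao.utils at the time of this PR (exact behavior may differ across versions):

import torch
from torchao.utils import unwrap_tensor_subclass

# Placeholders: in torchchat these would be the subclass-quantized model and real inputs.
quantized_model = torch.nn.Linear(16, 16)
activations = torch.randn(1, 16)

# torch.compile accepts weights that are still tensor subclasses...
compiled = torch.compile(quantized_model)
out = compiled(activations)

# ...while torch.export currently needs the subclass parameters flattened first.
unwrapped_model = unwrap_tensor_subclass(quantized_model)
exported = torch.export.export(unwrapped_model, (activations,))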

input_tensor = input_tensor.reshape(-1, m, k)

res = [
    _impl_2d(input_tensor[i, :, :], weight_tensor)
Contributor:
Why are you doing it like this? You can just fuse the first N dims. Like, line 379 should be input_tensor = input_tensor.reshape(-1, k), no?
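
For illustration, a small sketch (shapes are assumptions) contrasting the per-slice loop in the diff with the single fused reshape suggested here:

import torch

B, m, k, n = 2, 4, 8, 16
input_tensor = torch.randn(B, m, k)
weight_tensor = torch.randn(n, k)

# Loop over the leading dim with a 2D matmul per slice, as in the diff above.
res_loop = torch.stack(
    [input_tensor.reshape(-1, m, k)[i] @ weight_tensor.t() for i in range(B)]
)

# Fuse all leading dims into a single 2D matmul, as suggested in the review.
res_fused = (input_tensor.reshape(-1, k) @ weight_tensor.t()).reshape(B, m, n)

print(torch.allclose(res_loop, res_fused))  # True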

)

# Quantize activations
activation_scales, activation_zeros = choose_qparams_affine(
Contributor:
dynamic quantization should be reusing affine quantized tensor, example:

def int8_dynamic_activation_int8_weight(layout_type=PlainLayoutType()):

why is this calling these functions here?

Contributor Author:
That function doesn't look equivalent? It looks like the quantization is symmetric.

Contributor:
you can choose a different mapping type. I mean we can use

weight = to_linear_activation_quantized(weight, input_quant_func)

to do activation quantization. Maybe "Dynamic Activation Quantization + Weight Quantization" in #391 is clearer in explaining this; please let me know if it makes sense.

Contributor Author:
Sorry, I don't follow what you're suggesting. I have weights_dequantized on line 260, and I need activations_dequantized so I can call torch.matmul(activations_dequantized, weights_dequantized.transpose(1, 0)) on line 296.

I'm not sure what I'm supposed to replace the code that generates activations_dequantized with.
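
For reference, a rough sketch of the pattern being pointed at in #391; the names come from torchao around the time of this PR, and exact signatures and import paths may differ across versions:

import torch
from torchao.dtypes import to_affine_quantized_intx
from torchao.quantization.linear_activation_quantized_tensor import (
    to_linear_activation_quantized,
)
from torchao.quantization.quant_primitives import MappingType

def input_quant_func(x: torch.Tensor):
    # Per-token asymmetric int8 quantization of the activation; the resulting
    # AffineQuantizedTensor carries its own scales/zero points and can dequantize itself.
    return to_affine_quantized_intx(
        x,
        MappingType.ASYMMETRIC,
        block_size=(1, x.shape[-1]),  # assumes a 2D activation
        target_dtype=torch.int8,
    )

# Wrap an (already weight-quantized) tensor so activation quantization happens
# automatically when F.linear dispatches on the subclass (illustrative shapes):
weight = to_affine_quantized_intx(
    torch.randn(32, 16),
    MappingType.SYMMETRIC,
    block_size=(1, 16),
    target_dtype=torch.int8,
)
weight = to_linear_activation_quantized(weight, input_quant_func)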



# This format is intended for use with int8 dynamic quantization
class IntxWeightLayoutType(LayoutType):
Contributor:
Sorry, I still find this name not descriptive. What are the kernels this layout is targeting? Are these ExecuTorch-native kernels? If so, maybe IntxExecutorchLayout or similar might be more helpful.

Contributor Author:
This layout targets the linear_8bit_act_xbit_weight kernels in torchao/experimental. They can run on all PyTorch platforms (eager, AOTI, compile, and ExecuTorch), so adding ExecuTorch to the name doesn't make sense. Open to naming suggestions.

Contributor:
OK, I feel it makes sense to just include the kernel name here then, like Linear8BitActXBitWeightLayoutType (also, we are renaming LayoutType to Layout to make things clearer; you'll probably see this after rebasing).
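
Purely as illustration of the rename suggested above (field names are assumptions; the LayoutType base class lived under torchao.dtypes.utils before the Layout rename mentioned here):

from dataclasses import dataclass

from torchao.dtypes.utils import LayoutType

@dataclass(frozen=True)
class Linear8BitActXBitWeightLayoutType(LayoutType):
    # Packing parameters the linear_8bit_act_xbit_weight kernels would need (illustrative).
    nbit: int = 4
    group_size: int = 128
    has_weight_zeros: bool = False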

Contributor (kimishpatel):
@jerryzh168 I honestly don't find the layout type description very useful. There can be N different layouts for N different kernels. Right now the entire machinery in the subclass API is built around what operator or operator implementation to dispatch to using the layout information. I see two issues here:

  1. Each operator may be backed by a different kernel implementation that wants a different layout. It does not seem feasible to enumerate all possible ways in which weights can be packed and make such information visible to the quant API.
  2. The quant API shouldn't really be in the business of understanding packed layout information. This should really be left to subclasses of AQT.

Contributor:
@kimishpatel I don't quite follow why "it's not feasible to enumerate all possible ways in which weights can be packed", but I think we are not asking people to use AffineQuantizedTensor at this stage, as I discussed in the post, so feel free to contribute in a way you feel makes more sense; we can always merge/refactor later if needed. Although a high-level API plus a brief description of the implementation might be helpful for us to understand what you have in mind.

Also, for "Quant API shouldn't really be in the business of understanding packed layout information. This should really be left to subclasses of AQT": I think it might be better/easier to just copy-paste AQT and create a new tensor subclass in that case instead of subclassing AQT; we are not clear yet whether inheriting from AQT should be an extension point.

n, k_ = weight_tensor.shape
assert k_ == k

weights_dequantized = dequantize_per_channel_group(
Contributor:
we tend to use quantize_affine/dequantize_affine I think, also this should probably be:

weights_dequantized = weight_tensor.dequantize()?

assert len(weight_tensor.block_size) == 2
assert weight_tensor.block_size[0] == 1
group_size = weight_tensor.block_size[1]
assert group_size == weight_tensor.layout_tensor.layout_type.group_size
Contributor:
this can probably be weight_tensor.layout_type.group_size (although we are renaming layout_type to layout now)

kimishpatel (Contributor):

I plan to review some time tomorrow

metascroy added a commit to metascroy/ao that referenced this pull request Oct 11, 2024
Summary:


Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D62464487

metascroy added a commit to metascroy/ao that referenced this pull request Oct 11, 2024
Summary:


Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D62464487


def apply(weight):
    assert weight.shape[-1] % group_size == 0
    assert weight.device == torch.device("cpu"), "Only CPU is supported"
Contributor Author:
Added a CPU device assert here.

jerryzh168 (Contributor) left a comment:

Some requested changes; please see comments.

metascroy added a commit to metascroy/ao that referenced this pull request Oct 18, 2024
Summary:

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D62464487

metascroy added a commit to metascroy/ao that referenced this pull request Oct 20, 2024
Summary:

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D62464487

metascroy added a commit to metascroy/ao that referenced this pull request Oct 21, 2024
Summary:

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D62464487

metascroy added a commit to metascroy/ao that referenced this pull request Oct 21, 2024
Summary:

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D62464487

Contributor:
readme.md --> README.md?

Contributor:
nit: I think we should remove "quantizer" from the name; _linear_8bit_act_xbit_weight_layout.py might be more appropriate.

weights_dequantized = weight_tensor.dequantize()

# Quantize activations
activation_scales, activation_zeros = choose_qparams_affine(
Contributor:
I think we can probably use to_affine_quantized_intx and dequantize_affine here to quantize activation?
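
A hedged sketch of that suggestion (exact signature may vary across torchao versions): build an affine quantized tensor for the activations and dequantize it, instead of calling the choose_qparams/quantize/dequantize primitives one by one:

import torch
from torchao.dtypes import to_affine_quantized_intx
from torchao.quantization.quant_primitives import MappingType

activations = torch.randn(4, 256)  # illustrative shape

aq_activations = to_affine_quantized_intx(
    activations,
    MappingType.ASYMMETRIC,
    block_size=(1, activations.shape[-1]),  # per-token scales/zero points
    target_dtype=torch.int8,
)
# The subclass owns its qparams, so the round trip is a single call:
activations_dequantized = aq_activations.dequantize()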

Comment on lines 14 to 15
dequantize_per_channel_group,
quantize_per_channel_group,
Contributor:
these ops are a bit deprecated, so we tend not to use them if possible

MappingType,
ZeroPointDomain,
)
from torchao.utils import TorchAOBaseTensor
Contributor:
nit: looks like this is not used

jerryzh168 (Contributor) left a comment:

LGTM, left a few more nit comments

"""
INT = auto()
FLOAT = auto()
ZERO = auto()
jerryzh168 (Contributor), Oct 25, 2024:

btw, I feel maybe NONE or NO_ZERO_POINT would be better?
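
Purely as illustration of the naming option floated here (the real enum lives in torchao.quantization.quant_primitives):

from enum import Enum, auto

class ZeroPointDomain(Enum):
    INT = auto()
    FLOAT = auto()
    NONE = auto()  # scale-only quantization: no zero point at all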

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D62464487

metascroy added a commit to metascroy/ao that referenced this pull request Oct 29, 2024
Summary:
Pull Request resolved: pytorch#995

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Differential Revision: D62464487
)


def int8_dyn_act_intx_weight(
Contributor:
we are spelling out all the words in our API so far, so this should be int8_dynamic_activation_intx_weight I think

Contributor:
Can you rename the tests as well?

Summary:

Adds new int8_dynamic_activation_intx_weight quantization with subclass API

Reviewed By: jerryzh168

Differential Revision: D62464487
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D62464487

facebook-github-bot merged commit 581d8e0 into pytorch:main on Oct 30, 2024. 18 of 19 checks passed.