
Add PyTorch 2.5 to regression test #1100

Closed
wants to merge 13 commits

Conversation

gau-nernst
Collaborator

Fixes #888


pytorch-bot bot commented Oct 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1100

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit 3503d30 with merge base eb1fb3a:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Oct 17, 2024
@msaroufim
Member

Sorry for making a change to a draft PR, but 2.2 is just way too old now, so let's just get rid of it.

@gau-nernst
Collaborator Author

No worries, I was about to ask you about 2.2 too. Just wanted to see if/which tests will fail with 2.5 (and sadly there are failing tests 😢)

@jerryzh168
Contributor

jerryzh168 commented Oct 17, 2024

I think I've seen this error before; we'd just need to restrict all the tensor parallel tests to 2.6 and later.

Probably changing the two if not TORCH_VERSION_AT_LEAST_2_5: checks to use TORCH_VERSION_AFTER_2_6 (which needs to be added to torchao/utils.py as well) will fix it.
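For context, a rough sketch of the kind of 2.6 version gate being discussed, assuming it would sit next to the existing flags in torchao/utils.py; the helper name and the exact comparison below are assumptions, and the real implementation may differ:

import torch
from packaging.version import parse

def _torch_version_at_least(min_version: str) -> bool:
    # Compare against "<min_version>.dev0" so nightly builds such as
    # "2.6.0.dev20241017" also count as 2.6+ (PEP 440 orders .devN releases
    # below the corresponding final release).
    return parse(torch.__version__) >= parse(min_version + ".dev0")

TORCH_VERSION_AT_LEAST_2_6 = _torch_version_at_least("2.6.0")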

@gau-nernst
Collaborator Author

@jerryzh168 feel free to push the changes directly to this PR (or make a separate PR if you prefer that).

I already added the TORCH_VERSION_AT_LEAST_2_6 flag in my previous BitNet PR.

@gau-nernst
Collaborator Author

@matthewdouglas Do you know why bnb is erroring out in this job? https://github.com/pytorch/ao/actions/runs/11396526050/job/31710520597?pr=1100

PyTorch 2.5 from PyPI (built with CUDA 12.4) + bnb 0.42.0. I'm not sure why the CI installs bnb 0.42 instead of 0.44 (perhaps due to another dependency?).

It seems like bnb 0.42 supports CUDA 12.1 but not CUDA 12.4? I guess we can install torch 2.5 from https://download.pytorch.org/whl/cu121 in that case.

@gau-nernst
Collaborator Author

The torch 2.5 CI job consistently gets a CUDA illegal memory access error. I think the Galore kernel triggers it (once an illegal memory access is encountered, all subsequent CUDA ops will fail). Will try to debug on my local machine.

I have skipped some distributed tests, so they only run on nightly (2.6).

There are still some failing AQT tests. @jerryzh168 perhaps you have some ideas?
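For reference, a minimal sketch of how a distributed test could be gated to nightly (2.6+) as described above, assuming the TORCH_VERSION_AT_LEAST_2_6 flag from torchao/utils.py; the test class and method names are placeholders, not the actual torchao tests:

import unittest

import torch

from torchao.utils import TORCH_VERSION_AT_LEAST_2_6

class TestTensorParallel(unittest.TestCase):
    @unittest.skipIf(not TORCH_VERSION_AT_LEAST_2_6, "tensor parallel path requires PyTorch 2.6 nightly")
    @unittest.skipIf(not torch.cuda.is_available(), "requires CUDA")
    def test_tp_int8_weight_only(self):
        ...  # distributed test body elided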

@jerryzh168
Contributor

Sorry, just saw this. The AQT-related errors seem to be related to inductor freezing. cc @leslie-fang-intel, can you take a look?

@leslie-fang-intel
Collaborator


@gau-nernst did you mean the failure of test/integration/test_integration.py::TestSaveLoadMeta::test_save_load_int8woqtensors_2_cpu? I think the PR to fix it (pytorch/pytorch#136353) landed in PyTorch main but was not cherry-picked to the 2.5 release branch. Can we skip these tests in the PyTorch 2.5 regression test and keep them for main?
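A minimal sketch of what such a skip could look like, assuming the TORCH_VERSION_AT_LEAST_2_5 and TORCH_VERSION_AT_LEAST_2_6 flags from torchao/utils.py; the decorator name is illustrative, not the one actually used in the test suite:

import unittest

from torchao.utils import TORCH_VERSION_AT_LEAST_2_5, TORCH_VERSION_AT_LEAST_2_6

# Skip only on the 2.5 release; keep running on main/nightly (2.6+),
# where pytorch/pytorch#136353 has already landed.
skip_if_torch_2_5 = unittest.skipIf(
    TORCH_VERSION_AT_LEAST_2_5 and not TORCH_VERSION_AT_LEAST_2_6,
    "Inductor CPP backend regression on PyTorch 2.5, fixed in main by pytorch/pytorch#136353",
)

The decorator could then be applied to the affected save/load and freezing tests.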

@gau-nernst
Collaborator Author

@leslie-fang-intel These are the failing tests (and their variants):

test/integration/test_integration.py::TestSubclass::test_int8_weight_only_quant_subclass_api_1_cpu
test/integration/test_integration.py::TestSubclass::test_int8_weight_only_quant_with_freeze_0_cpu
test/integration/test_integration.py::TestSaveLoadMeta::test_save_load_int8woqtensors_1_cpu

Not sure if they all have the same root cause. Can you add the appropriate test-skipping logic to my branch, since you would know best what to skip? (Not sure if you can push to my branch.)

Also, do we need to add a doc somewhere about this?

@leslie-fang-intel
Collaborator


Seems I can't directly push to your branch. I stacked a commit here: leslie-fang-intel@1fdfa42, and it seems the torch 2.5 tests are all green now, as in #1156.

There are 2 issues behind these failures: (1) inductor freezing causes a functional regression, and (2) an Inductor CPP backend regression that was fixed by pytorch/pytorch#136353.

@gau-nernst
Collaborator Author

To make sure I understand it correctly:

For the first problem:

Inductor freezing caused functional regression
Note that the workaround is also required for torch.compile with freezing (torch._inductor.config.freezing=True) until pytorch/pytorch#136265 is fixed.

Users have to use unwrap_tensor_subclass to get correct results? If this is the case, shouldn't the test do this instead of skipping it in 2.5?

For the second problem:

Inductor CPP backend regression fixed in pytorch/pytorch#136353

It's a bug in PyTorch 2.5 that won't be fixed until there is a bug fix patch 2.5.1, if any? This seems pretty serious. Apart from adding a doc on this, I wonder if there is anything else we can do? cc @jerryzh168 @msaroufim

@leslie-fang-intel
Collaborator

Users have to use unwrap_tensor_subclass to get correct results? If this is the case, shouldn't the test do this instead of skipping it in 2.5?

Sure, updated the skipping logic to use unwrap_tensor_subclass with freezing before torch 2.6.

It's a bug in PyTorch 2.5 that won't be fixed until there is a bug fix patch 2.5.1, if any?

How do we mark this PR, pytorch/pytorch#136353, to be cherry-picked into 2.5.1?
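For readers following along, a hedged sketch of the freezing workaround discussed above: unwrap tensor subclasses before compiling with inductor freezing on PyTorch versions before 2.6. unwrap_tensor_subclass comes from torchao.utils; the int8 weight-only quantization calls and the toy model below are illustrative assumptions, not the exact code in the failing tests:

import torch

from torchao.quantization import int8_weight_only, quantize_
from torchao.utils import unwrap_tensor_subclass

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).eval()
quantize_(model, int8_weight_only())

# Workaround until pytorch/pytorch#136265 is fixed: replace tensor-subclass
# parameters with plain tensors so that inductor freezing handles them correctly.
unwrap_tensor_subclass(model)

torch._inductor.config.freezing = True
compiled = torch.compile(model)
out = compiled(torch.randn(2, 1024))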

@gau-nernst mentioned this pull request on Oct 25, 2024
@gau-nernst
Collaborator Author

Replaced by #1156

Successfully merging this pull request may close these issues: [CI] Add CI test for PyTorch 2.5.0rc