
Add PyTorch 2.5 to regression test #1100

Closed
wants to merge 13 commits

Conversation

gau-nernst
Collaborator

Fixes #888


pytorch-bot bot commented Oct 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1100

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit 3503d30 with merge base eb1fb3a:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Oct 17, 2024
@msaroufim
Member

Sorry for making a change to a draft PR, but 2.2 is just way too old now, so let's just get rid of it.

@gau-nernst
Collaborator Author

No worries, I was about to ask you about 2.2 too. Just wanted to see if/which tests will fail with 2.5 (and sadly there are failing tests 😢)

@jerryzh168
Contributor

jerryzh168 commented Oct 17, 2024

I think I've seen this error before; we'd just need to restrict all the tensor parallel tests to 2.6 and later.

Probably changing the two if not TORCH_VERSION_AT_LEAST_2_5: checks to use TORCH_VERSION_AFTER_2_6 (which needs to be added to torchao/utils.py as well) will fix it.
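For context, a rough sketch of the kind of 2.6 version gate being discussed, assuming it would sit next to the existing flags in torchao/utils.py; the helper name and the exact comparison below are assumptions, and the real implementation may differ:

import torch
from packaging.version import parse

def _torch_version_at_least(min_version: str) -> bool:
    # Compare against "<min_version>.dev0" so nightly builds such as
    # "2.6.0.dev20241017" also count as 2.6+ (PEP 440 orders .devN releases
    # below the corresponding final release).
    return parse(torch.__version__) >= parse(min_version + ".dev0")

TORCH_VERSION_AT_LEAST_2_6 = _torch_version_at_least("2.6.0")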

@gau-nernst
Collaborator Author

@jerryzh168 feel free to push the changes directly to this PR (or make a separate PR if you prefer that).

I already added the TORCH_VERSION_AT_LEAST_2_6 flag in my previous BitNet PR.

@gau-nernst
Collaborator Author

@matthewdouglas Do you know why bnb is erroring out in this job? https://github.com/pytorch/ao/actions/runs/11396526050/job/31710520597?pr=1100

PyTorch 2.5 from PyPI (built with CUDA 12.4) + bnb 0.42.0. I'm not sure why the CI installs bnb 0.42 instead of 0.44 (perhaps due to another dependency?).

It seems like bnb 0.42 supports CUDA 12.1 but not CUDA 12.4? I guess we can install torch 2.5 from https://download.pytorch.org/whl/cu121 in that case.

@gau-nernst
Collaborator Author

The torch 2.5 CI job consistently gets a CUDA illegal memory access error. I think the Galore kernel triggers it (once an illegal memory access is encountered, all subsequent CUDA ops will fail). Will try to debug on my local machine.

I have skipped some distributed tests, so they only run on nightly (2.6).

There are still some failing AQT tests. @jerryzh168 perhaps you have some ideas?
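For reference, a minimal sketch of how a distributed test could be gated to nightly (2.6+) as described above, assuming the TORCH_VERSION_AT_LEAST_2_6 flag from torchao/utils.py; the test class and method names are placeholders, not the actual torchao tests:

import unittest

import torch

from torchao.utils import TORCH_VERSION_AT_LEAST_2_6

class TestTensorParallel(unittest.TestCase):
    @unittest.skipIf(not TORCH_VERSION_AT_LEAST_2_6, "tensor parallel path requires PyTorch 2.6 nightly")
    @unittest.skipIf(not torch.cuda.is_available(), "requires CUDA")
    def test_tp_int8_weight_only(self):
        ...  # distributed test body elided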

@jerryzh168
Contributor

Sorry, just saw this. The AQT-related errors seem to be related to inductor freezing. cc @leslie-fang-intel, can you take a look?

@leslie-fang-intel
Collaborator


@gau-nernst did you mean the failure of test/integration/test_integration.py::TestSaveLoadMeta::test_save_load_int8woqtensors_2_cpu? I think the PR to fix it (pytorch/pytorch#136353) landed in PyTorch main but was not cherry-picked to the 2.5 release branch. Can we skip these tests in the PyTorch 2.5 regression test and keep them for main?
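A minimal sketch of what such a skip could look like, assuming the TORCH_VERSION_AT_LEAST_2_5 and TORCH_VERSION_AT_LEAST_2_6 flags from torchao/utils.py; the decorator name is illustrative, not the one actually used in the test suite:

import unittest

from torchao.utils import TORCH_VERSION_AT_LEAST_2_5, TORCH_VERSION_AT_LEAST_2_6

# Skip only on the 2.5 release; keep running on main/nightly (2.6+),
# where pytorch/pytorch#136353 has already landed.
skip_if_torch_2_5 = unittest.skipIf(
    TORCH_VERSION_AT_LEAST_2_5 and not TORCH_VERSION_AT_LEAST_2_6,
    "Inductor CPP backend regression on PyTorch 2.5, fixed in main by pytorch/pytorch#136353",
)

The decorator could then be applied to the affected save/load and freezing tests.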

@gau-nernst
Collaborator Author

@leslie-fang-intel These are the failing tests (and their variants):

test/integration/test_integration.py::TestSubclass::test_int8_weight_only_quant_subclass_api_1_cpu
test/integration/test_integration.py::TestSubclass::test_int8_weight_only_quant_with_freeze_0_cpu
test/integration/test_integration.py::TestSaveLoadMeta::test_save_load_int8woqtensors_1_cpu

Not sure if they all have the same root cause. Can you add the appropriate test-skipping logic to my branch, since you would know best what to skip? (Not sure if you can push to my branch.)

Also, do we need to add a doc somewhere about this?

@leslie-fang-intel
Collaborator


Seems I can't directly push to your branch. I stacked a commit here: leslie-fang-intel@1fdfa42, and it seems the torch 2.5 tests are all green now, as in #1156.

There are 2 issues behind these failures: (1) inductor freezing causes a functional regression, and (2) an Inductor CPP backend regression that was fixed by pytorch/pytorch#136353.

@gau-nernst
Collaborator Author

To make sure I understand it correctly:

For the first problem:

Inductor freezing caused functional regression
Note that the workaround is also required for torch.compile with freezing (torch._inductor.config.freezing=True) until pytorch/pytorch#136265 is fixed.

Users have to use unwrap_tensor_subclass to get correct results? If this is the case, shouldn't the test do this instead of skipping it in 2.5?

For the second problem:

Inductor CPP backend regression fixed in pytorch/pytorch#136353

It's a bug in PyTorch 2.5 that won't be fixed until there is a bug fix patch 2.5.1, if any? This seems pretty serious. Apart from adding a doc on this, I wonder if there is anything else we can do? cc @jerryzh168 @msaroufim

@leslie-fang-intel
Collaborator

Users have to use unwrap_tensor_subclass to get correct results? If this is the case, shouldn't the test do this instead of skipping it in 2.5?

Sure, updated the skipping logic to use unwrap_tensor_subclass with freezing before torch 2.6.

It's a bug in PyTorch 2.5 that won't be fixed until there is a bug fix patch 2.5.1, if any?

How do we mark this PR, pytorch/pytorch#136353, to be cherry-picked into 2.5.1?
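For readers following along, a hedged sketch of the freezing workaround discussed above: unwrap tensor subclasses before compiling with inductor freezing on PyTorch versions before 2.6. unwrap_tensor_subclass comes from torchao.utils; the int8 weight-only quantization calls and the toy model below are illustrative assumptions, not the exact code in the failing tests:

import torch

from torchao.quantization import int8_weight_only, quantize_
from torchao.utils import unwrap_tensor_subclass

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).eval()
quantize_(model, int8_weight_only())

# Workaround until pytorch/pytorch#136265 is fixed: replace tensor-subclass
# parameters with plain tensors so that inductor freezing handles them correctly.
unwrap_tensor_subclass(model)

torch._inductor.config.freezing = True
compiled = torch.compile(model)
out = compiled(torch.randn(2, 1024))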

@gau-nernst mentioned this pull request on Oct 25, 2024
@gau-nernst
Collaborator Author

Replaced by #1156

Successfully merging this pull request may close these issues: [CI] Add CI test for PyTorch 2.5.0rc