Add PyTorch 2.5 to regression test #1100
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1100
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures as of commit 3503d30 with merge base eb1fb3a. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Sorry for making a change to a draft PR, but 2.2 is just way too old now, so let's just get rid of it. |
No worries, I was about to ask you about 2.2 too. Just wanted to see if/which tests will fail with 2.5 (and sadly there are failing tests 😢) |
I think I've seen the error before. I think we'd just need to restrict all the tensor parallel tests to 2.6 and later; probably changing line 325 in 6b52996 to use TORCH_VERSION_AFTER_2_6 (which needs to be added to torchao/utils.py as well) will fix it.
|
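For reference, a minimal sketch of what such a guard could look like, assuming the new flag mirrors the existing version helpers in torchao/utils.py (the flag computation and the test class below are illustrative, not the actual torchao code):

```python
# Illustrative sketch only: the real flag would live in torchao/utils.py
# next to the existing TORCH_VERSION_* helpers; the names here are assumed.
import unittest

import torch
from packaging.version import parse

# True for torch 2.6 releases and for 2.6 nightly/dev builds.
TORCH_VERSION_AFTER_2_6 = parse(torch.__version__).release >= (2, 6)


@unittest.skipIf(not TORCH_VERSION_AFTER_2_6, "tensor parallel tests require torch >= 2.6")
class TestTensorParallelGuard(unittest.TestCase):  # hypothetical test class
    def test_guard_is_active(self):
        self.assertTrue(TORCH_VERSION_AFTER_2_6)
```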
@jerryzh168 feel free to push the changes directly to this PR (or make a separate PR if you prefer that). I already added |
@matthewdouglas Do you know why bnb is erroring out in this job? https://github.com/pytorch/ao/actions/runs/11396526050/job/31710520597?pr=1100 PyTorch 2.5 from PyPI (built with CUDA 12.4) + bnb 0.42.0. I'm not sure why the CI installs bnb 0.42 instead of 0.44 (perhaps due to another dependency?). It seems like bnb 0.42 supports CUDA 12.1 but not CUDA 12.4? I guess we can install torch 2.5 from https://download.pytorch.org/whl/cu121 in that case. |
The torch 2.5 CI job consistently gets a CUDA illegal memory access. I think the GaLore kernel triggers it? (Once an illegal memory access is encountered, all subsequent CUDA ops will fail.) I will try to debug it on my local machine. I have skipped some distributed tests, so they only run on nightly (2.6). There are still some failing AQT tests. @jerryzh168 perhaps you have some ideas? |
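As a side note for anyone reproducing this locally: a common way to localize an illegal memory access is to force synchronous kernel launches so the error surfaces at the offending op instead of at a later, unrelated CUDA call (running the failing test under NVIDIA's compute-sanitizer works similarly). A small sketch, with a made-up placeholder for the suspect op:

```python
# Debugging aid, not a fix: CUDA_LAUNCH_BLOCKING must be set before CUDA is
# initialized, so set it before importing torch.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402


def run_suspect_op():
    # Placeholder for the op under suspicion (e.g. a GaLore fused kernel).
    x = torch.randn(1024, 1024, device="cuda")
    return x @ x


out = run_suspect_op()
# With blocking launches, an asynchronous error is raised here, close to its source.
torch.cuda.synchronize()
```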
Sorry, just saw this. The AQT-related errors seem to be related to inductor freezing. cc @leslie-fang-intel, can you take a look? |
@gau-nernst did you mean the failure of |
@leslie-fang-intel These are the failing tests (and their variants)
Not sure if they all have the same root cause. Can you add the appropriate skip test logic in my branch, since you would know best what to skip? (Not sure if you can push to my branch) Also, do we need to add a doc somewhere about this? |
Seems I can't directly push to your branch. I stacked a commit here: leslie-fang-intel@1fdfa42, and it seems the torch 2.5 testing is all green now, as in #1156. There are 2 issues regarding these failures:
|
To make sure I understand it correctly:
For this problem, users have to use
For the 2nd problem:
It's a bug in PyTorch 2.5 that won't be fixed until there is a bug-fix patch 2.5.1, if any? This seems pretty serious. Apart from adding a doc on this, I wonder if there is anything else we can do? cc @jerryzh168 @msaroufim |
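For reference, and purely as an assumption about what the first issue refers to: inductor freezing is normally toggled either via the TORCHINDUCTOR_FREEZING environment variable or via the inductor config flag, roughly as in the sketch below (not necessarily the exact workaround discussed above):

```python
# Rough sketch (assumed, not the exact workaround from this thread): the two
# usual ways to turn on inductor freezing for inference-only compilation.
import os

os.environ["TORCHINDUCTOR_FREEZING"] = "1"  # option 1: env var, read when the inductor config loads

import torch  # noqa: E402

torch._inductor.config.freezing = True  # option 2: set the config flag directly

model = torch.nn.Linear(16, 16).eval()
compiled = torch.compile(model)

with torch.no_grad():
    out = compiled(torch.randn(8, 16))
```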
Sure, update the skipping logic by using
How do we mark this PR, pytorch/pytorch#136353, to cherry-pick into 2.5.1? |
Replaced by #1156 |
Fixes #888