Investigate instability of llama models post commit test in main #14474
Comments
Are the failures due to PCC or hangs? @caixunshiren let's investigate... but keep in mind #9370 is still open; they only pushed a work-around.
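For context, a "PCC" failure here means the model output on device no longer correlates closely enough with the reference output, as measured by the Pearson correlation coefficient, while a hang means the test never completes. A minimal sketch of that kind of PCC check (illustrative only, not the repo's actual comparison helper):

```python
import numpy as np

def pcc(expected: np.ndarray, actual: np.ndarray) -> float:
    # Pearson correlation coefficient between the flattened tensors
    return float(np.corrcoef(expected.flatten(), actual.flatten())[0, 1])

# Hypothetical usage: a model test asserts the PCC stays above a threshold, e.g. 0.99.
expected = np.random.rand(32, 128)
actual = expected + np.random.normal(scale=1e-3, size=expected.shape)
assert pcc(expected, actual) > 0.99
```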
Still investigating. Not able to reproduce locally. Shouldn't be a hang or ND PCC, because it passed the flash decode ND PCC tests 🤔
Can we disable the test in post-commit?
We cannot disable llama; it is our most important model. We might instead need to revert @xuncaiTT's commit if it is the culprit. It has been one day; if we do not have a solution, let's revert the optimization and work on it offline.
@cglagovichTT was able to repro a segfault in the llama stress test on a commit before my changes. It is likely that this failure is not related to my commit.
@tt-rkim what is the exact test command that issue is for?
So where did this issue first appear? Did we git bisect? Do we have an exact command to repro the error, even if it's ND?
Yes, it is the exact same test, and it was found using git bisect. @cglagovichTT, would you be able to add more details?
Is this a duplicate of #14475? My repro command is:
The command runs for 1 hour. When the test fails with a segfault, it might be after 20 minutes or 45 minutes of execution. My most recent bisect points me to this commit as the first bad one. Note that it's possible this segfault has existed in one form or another for many weeks, just becoming more or less prevalent as commits change the timing of threads.
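A sketch of how an ND segfault like this can be fed to git bisect: a small wrapper that repeats the repro several times per commit and translates the outcome into the exit codes `git bisect run` understands. The pytest target and repeat count below are placeholders, not the actual repro command.

```python
# bisect_nd.py -- hypothetical helper, invoked as: git bisect run python bisect_nd.py
import subprocess
import sys

REPEATS = 5  # run the flaky test several times per commit
CMD = ["pytest", "models/demos/llama3/tests", "-x"]  # placeholder for the real repro command

for i in range(REPEATS):
    result = subprocess.run(CMD)
    if result.returncode < 0:
        # On Linux, a process killed by a signal (e.g. SIGSEGV) reports a negative return code.
        print(f"run {i}: crashed with signal {-result.returncode}")
        sys.exit(1)    # tell `git bisect run` this commit is bad
    if result.returncode != 0:
        sys.exit(125)  # ordinary test failure: ask bisect to skip this commit
sys.exit(0)            # all repeats survived: treat the commit as good
```

The usual ND caveat applies: a commit reported as good may simply not have crashed in those few repeats, so the bisect result carries some uncertainty.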
We can combine the two if it's the same issue. Also, that's a crazy commit to be causing issues... unless you mean the whole chunk of changes, because @mrshaw01 didn't squash.
Is that related? I mean, how would changes in the unit tests of operations cause post-commit tests for llama models to fail? FYI, details of the PR: #13849.
I agree; I don't believe that your commit caused any problems.
Any progress? This is still failing on main.
@mtairum are you able to repro this issue locally?
No, not locally. My guess is something weird on the VM side, and it happens to fail at llama3-8B. Will debug further.
Have you tried getting access to a VM? You are able to provision one yourself.
https://github.com/tenstorrent/tt-metal/actions/runs/11728678704/job/32673138552 seems to be ND and only on WH cards, but across a variety of WH cards. It seems more common on N150 cards; no GS VMs that I've seen. This is showing up on a lot of post-commit runs on the main branch.
So, I've created this branch https://github.com/tenstorrent/tt-metal/tree/refs/heads/mtairum/debug_llama_post_commit where I'm running just the Llama3 tests (1B / 3B / 8B) on N300 (prior to that I tested on N150 as well). Like @ttmchiou mentioned, these failures are ND and they occur on both N150 and N300. They occur often enough: if you look at the post-commit run history for my branch, 4 out of 11 runs failed on the llama test and 1 failed on post-run. I've been trying to reproduce this locally on the VM. Do you think it could be something related to specific runners? The way I'm building tt-metal on the VM is through the ... I see from the GitHub scripts that the model tests are run from TT-Metal provided by wheels, and not directly from source. Could this be a problem?
@mtairum Hi!
One way to solve this would be to test your runs on the lab machine that uses the wheel.
Execute your tests and see if you can get the ...
FYI, when the test ...
The job: https://github.com/tenstorrent/tt-metal/actions/runs/11758161892/job/32758797445
@uaydonat what's the plan for this P0 bug? It has been open for 2 weeks.
I've tried the wheel; no luck replicating the issue on the lab machine. We do have a few small in-flight changes that could provoke a test failure; I'm testing those on CI now to see if they get past this issue.
I also recommend trying custom test dispatch: https://github.com/tenstorrent/tt-metal/actions/workflows/test-dispatch.yaml as that should replicate an environment close to CI post-commit. It installs directly from source as opposed to using a wheel. Wheels likely aren't the problem, but you could try installing the wheel locally to see if it helps track down the issue.
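One quick sanity check for the wheel-vs-source question is to print where the imported package actually resolves from. A sketch, assuming the Python package is importable as `ttnn` (adjust the module name if the wheel exposes a different one):

```python
# Hypothetical check: is the test importing the wheel or the source tree?
import importlib

mod = importlib.import_module("ttnn")  # assumed package name
print("loaded from:", mod.__file__)
# A path under site-packages means the wheel; a path inside the repo checkout
# means a source / editable install.
```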
A few ideas:
We can target a particular machine you've seen fail with custom test dispatch, as I mentioned above. Please let us know if anything is blocking you from doing this.
I believe your team has the necessary tools to do this without being blocked.
It is not blocking progress on other models, just the new llama codebase delivery. Lowering to P1, but it is still important, and @mtairum is working on this full time.
Can we disable it? This isn't the ONLY llama test in post-commit, I hope?
@mtairum let's disable the ND tests; they are not helping if they fail anyway.
I've disabled the tests for now to avoid the ND failures on the pipeline. Recent small fixes to the model added today didn't change the behaviour. My current plan is to identify a pattern among CI VMs that show consistent hangs, if any.
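For reference, disabling a flaky case in a pytest-based suite is usually just a skip marker with a pointer back to the tracking issue; an illustrative sketch (the test name below is made up, not the real test):

```python
import pytest

# Illustrative only: the real llama test lives elsewhere and is parametrized differently.
@pytest.mark.skip(reason="ND segfault in post-commit, tracked in #14474")
def test_llama_model_inference():
    ...
```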
Thanks! Have you tried https://github.com/tenstorrent/tt-metal/actions/workflows/test-dispatch.yaml to repro on the CI fleet?
@mtairum Have you tried running the test with watcher? Even if you run it on a machine where you're not seeing the hang, it might catch an illegal read/write that could help you isolate the issue.
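A sketch of running the repro under watcher, assuming watcher is controlled by the TT_METAL_WATCHER environment variable (polling interval in seconds); the pytest target below is a placeholder for the real repro command:

```python
# Sketch: launch the repro with watcher polling enabled.
import os
import subprocess

env = dict(os.environ, TT_METAL_WATCHER="10")  # assumed env var; interval in seconds
subprocess.run(["pytest", "models/demos/llama3/tests", "-x"], env=env, check=False)
```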
@mtairum any updates?
Yes. I tried running on CI with watcher enabled and didn't get any extra information. However, I've compiled a list of passing/failing machines based on 44 runs (N150 only). Of all the CI VMs that I got multiple times, each either always passes or always fails. @ttmchiou @tt-rkim @TT-billteng is it possible to get access to ...? Below is the full list based on 44 runs with the machines that pass or fail.
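A per-machine breakdown like the one described ("either always pass or always fail") can be produced with a small aggregation over the run records. A hypothetical sketch; the runner names and data shape are purely illustrative, not the real 44-run data:

```python
# Hypothetical aggregation: which runners always pass, always fail, or are mixed.
from collections import defaultdict

# (runner, passed) pairs collected from the CI run logs -- illustrative data
runs = [("runner-a", True), ("runner-a", True), ("runner-b", False), ("runner-b", False)]

by_runner = defaultdict(list)
for runner, passed in runs:
    by_runner[runner].append(passed)

for runner, results in sorted(by_runner.items()):
    status = "always pass" if all(results) else "always fail" if not any(results) else "mixed"
    print(f"{runner}: {status} ({sum(results)}/{len(results)} passed)")
```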
This table is very interesting. It appears that the problem is machine-specific. Yes, @tt-rkim let's get access to 144, try all the di/dt workarounds, and collect more data; then we will probably need to involve Milos and syseng.
Special note: ... due to commit c46e501
cc: @TT-billteng @bkeith-TT @uaydonat @yieldthought