Investigate instability of llama models post commit test in main #14474

Open · tt-rkim opened this issue Oct 30, 2024 · 37 comments

Labels: bug (Something isn't working), ci-bug (bugs found in CI), gh-workflow, infra-ci (infrastructure and/or CI changes), LLM_bug, P1

@tt-rkim (Collaborator) commented Oct 30, 2024

due to commit c46e501

cc: @TT-billteng @bkeith-TT @uaydonat @yieldthought

@uaydonat (Contributor) commented:

are the failures due to pcc or hangs?

@caixunshiren let's investigate... but keep in mind #9370 is still open, they only pushed a work-around.

@caixunshiren (Contributor) commented:

> are the failures due to pcc or hangs?
>
> @caixunshiren let's investigate... but keep in mind #9370 is still open, they only pushed a work-around.

Still investigating. Not able to reproduce locally. It shouldn't be a hang or ND PCC issue, because it passed the flash decode ND PCC tests 🤔

@TT-billteng (Collaborator) commented:

can we disable the test in post-commit?

@uaydonat (Contributor) commented Nov 1, 2024

We cannot disable llama; it is our most important model.

We might rather need to revert @xuncaiTT's commit if it is the culprit.

It has been one day; if we do not have a solution, let's revert the optimization and work on it offline.

@caixunshiren (Contributor) commented:

@cglagovichTT was able to repro a seg fault on llama stress test on a commit before my changes. It is likely that this failure is not related to my commit.

@TT-billteng (Collaborator) commented Nov 1, 2024

@tt-rkim what is the exact test command that this issue is for?

> @cglagovichTT was able to repro a seg fault on llama stress test on a commit before my changes. It is likely that this failure is not related to my commit.

So where did this issue first appear? Did we git bisect? Do we have an exact command to repro the error, even if it's ND?

@caixunshiren (Contributor) commented:

Yes it is the exact same test. It was done using git bisect. @cglagovichTT would you be able to add more details?

@cglagovichTT (Contributor) commented Nov 1, 2024

Is this a duplicate of #14475?

My repro command is:

FAKE_DEVICE=N300 python -m pytest --count=60 -svv models/demos/llama3/demo/demo.py::test_llama_demo -k "instruct_weights-32"

The command runs for 1 hour. When the test fails with a segfault, it might be after 20 minutes or 45 minutes of execution.
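For anyone else chasing this, here is a minimal sketch, assuming a bash shell and the same environment as the command above, that loops a single-run variant until it dies and records the exit code (a segfault surfaces as 139, i.e. 128 + SIGSEGV):

    # Hypothetical repro loop, not the exact invocation used above.
    i=0
    while true; do
      i=$((i + 1))
      FAKE_DEVICE=N300 python -m pytest -svv \
        models/demos/llama3/demo/demo.py::test_llama_demo -k "instruct_weights-32"
      rc=$?
      if [ "$rc" -ne 0 ]; then
        echo "failed on iteration $i with exit code $rc"
        break
      fi
    done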

My most recent bisect points me to this commit as the first bad one: 0af26ed
I must have had a false negative during my bisect since this commit only changes test code. I'll start another bisect and run tests for 2 hours before determining a commit is good.

Note that it's possible that this segfault has existed in one way or another for many weeks, just becoming more or less prevalent as commits change the timing of threads.
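For the record, one way to automate that kind of bisect is `git bisect run` with a wrapper script that keeps re-running the test before declaring a commit good. This is only a sketch; the script name and the way the 2-hour budget is enforced are assumptions, not what I actually ran:

    #!/usr/bin/env bash
    # bisect_test.sh (hypothetical wrapper for `git bisect run`):
    # exit 0 = good, exit 1 = bad, exit 125 = skip this commit.
    set -u
    ./build_metal.sh || exit 125          # skip commits that fail to build
    end=$((SECONDS + 2 * 60 * 60))        # keep re-running for ~2 hours
    while [ "$SECONDS" -lt "$end" ]; do
      FAKE_DEVICE=N300 python -m pytest -svv \
        models/demos/llama3/demo/demo.py::test_llama_demo -k "instruct_weights-32" \
        || exit 1                         # any failure marks the commit bad
    done
    exit 0

It would be driven with `git bisect start <bad> <good>` followed by `git bisect run ./bisect_test.sh`.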

@tt-rkim (Collaborator, Author) commented Nov 1, 2024

We can combine the two if it's the same issue.

Also, that's a crazy commit to be causing issues... unless you mean the whole chunk of changes that landed because @mrshaw01 didn't squash.

@mrshaw01 (Contributor) commented Nov 1, 2024

Is that related? I mean, how would changes in the unit tests of operations cause post-commit tests for llama models to fail?

FYI, details of the PR: #13849.

@cglagovichTT (Contributor) commented:

> I must have had a false negative during my bisect since this commit only changes test code. I'll start another bisect and run tests for 2 hours before determining a commit is good.

I agree, I don't believe that your commit caused any problems.

@TT-billteng (Collaborator) commented:

any progress? this is still failing on main

@TT-billteng added the bug (Something isn't working) and ci-bug (bugs found in CI) labels Nov 7, 2024
@cglagovichTT (Contributor) commented Nov 7, 2024

@mtairum are you able to repro this issue locally?
https://github.com/tenstorrent/tt-metal/actions/runs/11720766141/job/32647229455

@cglagovichTT assigned mtairum and unassigned caixunshiren Nov 7, 2024
@mtairum (Contributor) commented Nov 7, 2024

No, not locally. My guess is something weird on the VM side, and it happens to fail at llama3-8B.
We've seen this in the past when running multiple tests one after another.

Will debug further.

@tt-rkim (Collaborator, Author) commented Nov 7, 2024

Have you tried getting access to a VM? You are able to provision one yourself.

@ttmchiou (Contributor) commented Nov 7, 2024

Failing tests:
https://github.com/tenstorrent/tt-metal/actions/runs/11728678704/job/32673138552
https://github.com/tenstorrent/tt-metal/actions/runs/11729309675/job/32679768714
https://github.com/tenstorrent/tt-metal/actions/runs/11729454024/job/32675633969
https://github.com/tenstorrent/tt-metal/actions/runs/11727420471/job/32668842127 (N150 + N300 failure here)
https://github.com/tenstorrent/tt-metal/actions/runs/11727271604/job/32668348645

The failures seem to be ND and only on WH cards, though across a variety of WH cards; they seem more common on N150. No GS VMs that I've seen.

This is showing up on a lot of post-commit runs on the main branch.

@mtairum (Contributor) commented Nov 8, 2024

So, I've created this branch https://github.com/tenstorrent/tt-metal/tree/refs/heads/mtairum/debug_llama_post_commit where I'm running just the Llama3 tests (1B / 3B / 8B) on N300 (previously I tested on N150 as well).

Like @ttmchiou mentioned, these failures are ND and they occur on both N150 and N300.
Example of an N300 failure from my branch: https://github.com/tenstorrent/tt-metal/actions/runs/11741524983/job/32710638499

The failures occur often enough; looking at the post-commit run history for my branch, 4 out of 11 runs failed on the llama test and 1 failed on post-run:
https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml?query=branch%3Amtairum%2Fdebug_llama_post_commit

I've been trying to reproduce this locally on the VM 172.27.45.28 (a tt-metal-dev-common-whx2-1) and have run the Llama3 1B/3B/8B test loop more than 15 times so far without a single failure. I should have hit one by now.

Do you think it could be something related to specific runners?

The way I'm building tt-metal on the VM is through the ./build_metal.sh script, and then I run ./tests/scripts/run_python_model_tests.sh directly.

I see from the GitHub scripts that in CI the model tests run against TT-Metal installed from wheels, not directly from source. Could this be a problem?
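For clarity, the source-built path I'm using looks roughly like this (the virtualenv activation line is an assumption about my local setup, not something the CI scripts prescribe):

    ./build_metal.sh
    source python_env/bin/activate               # assumption: the usual dev virtualenv
    ./tests/scripts/run_python_model_tests.sh    # plus whatever arch arguments the script expects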

@dimitri-tenstorrent (Contributor) commented:

@mtairum Hi!
I can't directly see how the wheel could affect this, but the failure is also not very clear:

worker 'gw0' crashed while running 'models/demos/llama3/tests/test_llama_model.py::test_llama_model_inference[wormhole_b0-True-2-quick]'

One way to rule this out would be to run your tests on the lab machine using the wheel:

  1. Remove the source installation so the machine is clean, to make sure we can run a good test and identify whether the wheel is the problem.
  2. Download the wheel from one of your successful runs, e.g. from:
    https://github.com/tenstorrent/tt-metal/actions/runs/11742511348
    (artifact eager-dist-ubuntu-20.04-wormhole_b0)
  3. Install it with pip install $wheel_filename

Then execute your tests and see whether you can reproduce the worker-crash failures when using the wheel.
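A possible concrete version of those steps, where the `gh` CLI calls and the package name in the uninstall step are my assumptions rather than part of the documented flow:

    pip uninstall -y ttnn                        # assumption: the source install exposes the ttnn package
    gh run download 11742511348 -n eager-dist-ubuntu-20.04-wormhole_b0
    pip install ./*.whl
    pytest -svv "models/demos/llama3/tests/test_llama_model.py::test_llama_model_inference[wormhole_b0-True-2-quick]"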

@mrshaw01 (Contributor) commented:

FYI, when the test models-unit-tests (wormhole_b0, N150) / model wormhole_b0 N150 failed, I clicked on "run only failed tests" seven times, but it still failed. However, when I reran the entire "all post commit tests," it passed.

The job: https://github.com/tenstorrent/tt-metal/actions/runs/11758161892/job/32758797445
Hope it helps.

@jvasilje (Collaborator) commented:

@uaydonat what's the plan for this P0 bug? It has been open for 2 weeks.

@mtairum (Contributor) commented Nov 12, 2024

I've tried the wheel; no luck replicating the issue on the lab machine.

We do have a few small in-flight changes that could provoke a test failure: testing those now on CI to see whether they get past this issue.

@tt-rkim (Collaborator, Author) commented Nov 12, 2024

I also recommend trying custom test dispatch: https://github.com/tenstorrent/tt-metal/actions/workflows/test-dispatch.yaml

as that should replicate an environment close to what post-commit CI uses. It installs directly from source as opposed to using a wheel.

Wheels likely won't be a problem, but you could try installing the wheel locally to see if it helps track down the issue.
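If it helps, a dispatch can also be kicked off from the CLI. A sketch only; the workflow's required inputs (target runners, test command, etc.) are not spelled out here and would need to be checked against test-dispatch.yaml:

    gh workflow run test-dispatch.yaml --ref mtairum/debug_llama_post_commit
    # add -f <input>=<value> pairs for whatever inputs the workflow requires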

@uaydonat (Contributor) commented:

A few ideas:

  • replicate the CI environment locally - @esmalTT also had difficulty in the past replicating UNet failures locally until he did this. @evan any feedback?
  • take out a CI machine temporarily for interactive testing to see if it is a machine-dependent issue.
  • lower the clock to 900 MHz to see if the hang goes away
  • turn the di/dt workarounds back on to see if the hangs go away

@tt-rkim (Collaborator, Author) commented Nov 12, 2024

> take out a CI machine temporarily for interactive testing to see if it is a machine-dependent issue.

We can target a particular machine you've seen with custom test dispatch, as I mentioned above. Please let us know if there's anything blocking you from doing this.

> lower the clock to 900 MHz to see if the hang goes away
> turn the di/dt workarounds back on to see if the hangs go away

I believe your team has the necessary tools to do those without being blocked.

@jvasilje (Collaborator) commented:

@uaydonat @tt-rkim is this still a P0 bug? As in, is it actively failing in CI and blocking progress, as it has for the past 2 weeks? If so, someone should be working on this full time and updating the status here every few hours.

@uaydonat (Contributor) commented:

It is not blocking progress on other models, just the new llama codebase delivery.

Lowering to P1, but it is still important, and @mtairum is working on this full time.

@uaydonat added P1 and removed P0 labels Nov 12, 2024
@tt-rkim (Collaborator, Author) commented Nov 12, 2024

Thanks @uaydonat.
Yes, this sounds like P1; thanks @mtairum for taking a look.
Please let us know if you need any more resources.

@TT-billteng (Collaborator) commented:

Can we disable it? This can't be the ONLY llama test in post-commit, I hope?

@uaydonat (Contributor) commented:

@mtairum let's disable the ND tests; they are not helping if they fail anyway.

@mtairum (Contributor) commented Nov 12, 2024

I've disabled the tests for now to avoid the ND failures in the pipeline.
PR merged here: #14968

Recent small fixes to the model added today didn't change the behaviour.

My current plan is to identify a pattern among the CI VMs that show consistent hangs, if any.
Then I'll try to reproduce locally on one of those, following the same build process as CI.

@TT-billteng (Collaborator) commented Nov 12, 2024

> I've disabled the tests for now to avoid the ND failures in the pipeline. PR merged here: #14968
>
> Recent small fixes to the model added today didn't change the behaviour.
>
> My current plan is to identify a pattern among the CI VMs that show consistent hangs, if any. Then I'll try to reproduce locally on one of those, following the same build process as CI.

Thanks!

Have you tried https://github.com/tenstorrent/tt-metal/actions/workflows/test-dispatch.yaml to repro on the CI fleet?

@esmalTT (Contributor) commented Nov 12, 2024

@mtairum Have you tried running the test with watcher? Even if you run it on a machine where you're not seeing the hang, it might catch an illegal read/write that could help you isolate the issue.
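For example, a minimal watcher-enabled run might look like the sketch below; the 10-second polling interval and the log path are what I believe the defaults to be, so treat them as assumptions:

    export TT_METAL_WATCHER=10                   # poll interval in seconds
    pytest -svv models/demos/llama3/tests/test_llama_model.py -k quick
    less generated/watcher/watcher.log           # watcher output, relative to TT_METAL_HOME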

@uaydonat (Contributor) commented:

@mtairum any updates?

@mtairum (Contributor) commented Nov 14, 2024

Yes.

I tried running on CI with watcher enabled and didn't get any extra information.

However, I've compiled a list of passing/failing machines based on 44 runs (N150 only). Of the CI VMs that showed up multiple times, each either always passes or always fails.

@ttmchiou @tt-rkim @TT-billteng is it possible to get access to tt-metal-ci-vm-144 to debug the llama failure? That machine consistently fails (4 out of 4 times).

Below is the full list of passing and failing machines across the 44 runs.

| CI VM | Pass | Fail |
|---|---|---|
| tt-metal-ci-vm-27 | 1 | 0 |
| tt-metal-ci-vm-29 | 1 | 0 |
| tt-metal-ci-vm-31 | 2 | 0 |
| tt-metal-ci-vm-32 | 1 | 0 |
| tt-metal-ci-vm-57 | 0 | 1 |
| tt-metal-ci-vm-58 | 4 | 0 |
| tt-metal-ci-vm-68 | 2 | 0 |
| tt-metal-ci-vm-95 | 1 | 0 |
| tt-metal-ci-vm-98 | 1 | 0 |
| tt-metal-ci-vm-120 | 0 | 2 |
| tt-metal-ci-vm-124 | 0 | 1 |
| tt-metal-ci-vm-126 | 1 | 0 |
| tt-metal-ci-vm-127 | 0 | 1 |
| tt-metal-ci-vm-130 | 0 | 1 |
| tt-metal-ci-vm-133 | 0 | 1 |
| tt-metal-ci-vm-134 | 0 | 3 |
| tt-metal-ci-vm-137 | 0 | 3 |
| tt-metal-ci-vm-138 | 0 | 1 |
| tt-metal-ci-vm-139 | 0 | 1 |
| tt-metal-ci-vm-140 | 0 | 1 |
| tt-metal-ci-vm-143 | 0 | 1 |
| tt-metal-ci-vm-144 | 0 | 4 |
| tt-metal-ci-vm-146 | 6 | 0 |
| tt-metal-ci-vm-149 | 3 | 0 |
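For completeness, a tally like this is easy to regenerate; a sketch assuming the runner name and job result have been exported one pair per line into a hypothetical runs.txt:

    # runs.txt lines look like: "tt-metal-ci-vm-144 failure"
    sort runs.txt | uniq -c | sort -k2 -V        # count pass/fail occurrences per VM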

@uaydonat (Contributor) commented:

This table is very interesting. It appears that the problem is machine-specific.

Yes, @tt-rkim let's get access to 144, try all the di/dt workarounds, and collect more data; then we will probably need to involve Milos and syseng.

@ttmchiou (Contributor) commented:

@tt-rkim is on vacation.
I can handle getting @mtairum access to 144.
Will DM privately to set up access.

@ttmchiou (Contributor) commented:

Special note: VM-144 seems to only have 4 cores, as noticed by @TT-billteng.
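For anyone checking other runners, a quick way to confirm the core count (assuming shell access to the VM):

    nproc                # logical CPU count
    lscpu | head -n 20   # sockets / cores-per-socket / threads breakdown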
