Investigate instability of llama models post commit test in main #14474

Open · tt-rkim opened this issue Oct 30, 2024 · 37 comments

Labels: bug (Something isn't working), ci-bug (bugs found in CI), gh-workflow, infra-ci (infrastructure and/or CI changes), LLM_bug, P1

@tt-rkim (Collaborator) commented Oct 30, 2024

due to commit c46e501

cc: @TT-billteng @bkeith-TT @uaydonat @yieldthought

@uaydonat (Contributor) commented:

are the failures due to pcc or hangs?

@caixunshiren let's investigate... but keep in mind #9370 is still open, they only pushed a work-around.

@caixunshiren (Contributor) commented:

> are the failures due to pcc or hangs?
>
> @caixunshiren let's investigate... but keep in mind #9370 is still open, they only pushed a work-around.

Still investigating. Not able to reproduce locally. It shouldn't be a hang or ND PCC issue, because it passed the flash decode ND PCC tests 🤔

@TT-billteng (Collaborator) commented:

can we disable the test in post-commit?

@uaydonat (Contributor) commented Nov 1, 2024

We cannot disable llama; it is our most important model.

We might rather need to revert @xuncaiTT's commit if it is the culprit.

It has been one day; if we do not have a solution, let's revert the optimization and work on it offline.

@caixunshiren (Contributor) commented:

@cglagovichTT was able to repro a seg fault on llama stress test on a commit before my changes. It is likely that this failure is not related to my commit.

@TT-billteng (Collaborator) commented Nov 1, 2024

@tt-rkim what is the exact test command that this issue is for?

> @cglagovichTT was able to repro a seg fault on llama stress test on a commit before my changes. It is likely that this failure is not related to my commit.

So where did this issue first appear? Did we git bisect? Do we have an exact command to repro the error, even if it's ND?

@caixunshiren (Contributor) commented:

Yes it is the exact same test. It was done using git bisect. @cglagovichTT would you be able to add more details?

@cglagovichTT (Contributor) commented Nov 1, 2024

Is this a duplicate of #14475?

My repro command is:

FAKE_DEVICE=N300 python -m pytest --count=60 -svv models/demos/llama3/demo/demo.py::test_llama_demo -k "instruct_weights-32"

The command runs for 1 hour. When the test fails with a segfault, it might be after 20 minutes or 45 minutes of execution.
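For anyone else chasing this, here is a minimal sketch, assuming a bash shell and the same environment as the command above, that loops a single-run variant until it dies and records the exit code (a segfault surfaces as 139, i.e. 128 + SIGSEGV):

    # Hypothetical repro loop, not the exact invocation used above.
    i=0
    while true; do
      i=$((i + 1))
      FAKE_DEVICE=N300 python -m pytest -svv \
        models/demos/llama3/demo/demo.py::test_llama_demo -k "instruct_weights-32"
      rc=$?
      if [ "$rc" -ne 0 ]; then
        echo "failed on iteration $i with exit code $rc"
        break
      fi
    done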

My most recent bisect points me to this commit as the first bad one: 0af26ed
I must have had a false negative during my bisect since this commit only changes test code. I'll start another bisect and run tests for 2 hours before determining a commit is good.

Note that it's possible that this segfault has existed in one way or another for many weeks, just becoming more or less prevalent as commits change the timing of threads.
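For the record, one way to automate that kind of bisect is `git bisect run` with a wrapper script that keeps re-running the test before declaring a commit good. This is only a sketch; the script name and the way the 2-hour budget is enforced are assumptions, not what I actually ran:

    #!/usr/bin/env bash
    # bisect_test.sh (hypothetical wrapper for `git bisect run`):
    # exit 0 = good, exit 1 = bad, exit 125 = skip this commit.
    set -u
    ./build_metal.sh || exit 125          # skip commits that fail to build
    end=$((SECONDS + 2 * 60 * 60))        # keep re-running for ~2 hours
    while [ "$SECONDS" -lt "$end" ]; do
      FAKE_DEVICE=N300 python -m pytest -svv \
        models/demos/llama3/demo/demo.py::test_llama_demo -k "instruct_weights-32" \
        || exit 1                         # any failure marks the commit bad
    done
    exit 0

It would be driven with `git bisect start <bad> <good>` followed by `git bisect run ./bisect_test.sh`.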

@tt-rkim (Collaborator, Author) commented Nov 1, 2024

We can combine the two if it's the same issue.

Also, that's a crazy commit to be causing issues... unless you mean the whole chunk of changes that landed because @mrshaw01 didn't squash.

@mrshaw01 (Contributor) commented Nov 1, 2024

Is that related? I mean, how would changes in the unit tests of operations cause post-commit tests for llama models to fail?

FYI, details of the PR: #13849.

@cglagovichTT (Contributor) commented:

> I must have had a false negative during my bisect since this commit only changes test code. I'll start another bisect and run tests for 2 hours before determining a commit is good.

I agree, I don't believe that your commit caused any problems.

@TT-billteng (Collaborator) commented:

any progress? this is still failing on main

@TT-billteng added the bug (Something isn't working) and ci-bug (bugs found in CI) labels Nov 7, 2024
@cglagovichTT (Contributor) commented Nov 7, 2024

@mtairum are you able to repro this issue locally?
https://github.com/tenstorrent/tt-metal/actions/runs/11720766141/job/32647229455

@cglagovichTT assigned mtairum and unassigned caixunshiren Nov 7, 2024
@mtairum (Contributor) commented Nov 7, 2024

No, not locally. My guess is something weird on the VM side, and it happens to fail at llama3-8B.
We've seen this in the past when running multiple tests one after another.

Will debug further.

@tt-rkim (Collaborator, Author) commented Nov 7, 2024

Have you tried getting access to a VM? You are able to provision one yourself.

@ttmchiou (Contributor) commented Nov 7, 2024

Failing tests:
https://github.com/tenstorrent/tt-metal/actions/runs/11728678704/job/32673138552
https://github.com/tenstorrent/tt-metal/actions/runs/11729309675/job/32679768714
https://github.com/tenstorrent/tt-metal/actions/runs/11729454024/job/32675633969
https://github.com/tenstorrent/tt-metal/actions/runs/11727420471/job/32668842127 (N150 + N300 failure here)
https://github.com/tenstorrent/tt-metal/actions/runs/11727271604/job/32668348645

The failures seem to be ND and only on WH cards, though across a variety of WH cards; they seem more common on N150. No GS VMs that I've seen.

This is showing up on a lot of post-commit runs on the main branch.

@mtairum (Contributor) commented Nov 8, 2024

So, I've created this branch https://github.com/tenstorrent/tt-metal/tree/refs/heads/mtairum/debug_llama_post_commit where I'm running just the Llama3 tests (1B / 3B / 8B) on N300 (previously I tested on N150 as well).

Like @ttmchiou mentioned, these failures are ND and they occur on both N150 and N300.
Example of an N300 failure from my branch: https://github.com/tenstorrent/tt-metal/actions/runs/11741524983/job/32710638499

The failures occur often enough; looking at the post-commit run history for my branch, 4 out of 11 runs failed on the llama test and 1 failed on post-run:
https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml?query=branch%3Amtairum%2Fdebug_llama_post_commit

I've been trying to reproduce this locally on the VM 172.27.45.28 (a tt-metal-dev-common-whx2-1) and have run the Llama3 1B/3B/8B test loop more than 15 times so far without a single failure. I should have hit one by now.

Do you think it could be something related to specific runners?

The way I'm building tt-metal on the VM is through the ./build_metal.sh script, and then I run ./tests/scripts/run_python_model_tests.sh directly.

I see from the GitHub scripts that in CI the model tests run against TT-Metal installed from wheels, not directly from source. Could this be a problem?
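For clarity, the source-built path I'm using looks roughly like this (the virtualenv activation line is an assumption about my local setup, not something the CI scripts prescribe):

    ./build_metal.sh
    source python_env/bin/activate               # assumption: the usual dev virtualenv
    ./tests/scripts/run_python_model_tests.sh    # plus whatever arch arguments the script expects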

@dimitri-tenstorrent (Contributor) commented:

@mtairum Hi!
I can't directly see how the wheel could affect this, but the failure is also not very clear:

worker 'gw0' crashed while running 'models/demos/llama3/tests/test_llama_model.py::test_llama_model_inference[wormhole_b0-True-2-quick]'

One way to rule this out would be to run your tests on the lab machine using the wheel:

  1. Remove the source installation so the machine is clean, to make sure we can run a good test and identify whether the wheel is the problem.
  2. Download the wheel from one of your successful runs, e.g. from:
    https://github.com/tenstorrent/tt-metal/actions/runs/11742511348
    (artifact eager-dist-ubuntu-20.04-wormhole_b0)
  3. Install it with pip install $wheel_filename

Then execute your tests and see whether you can reproduce the worker-crash failures when using the wheel.
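A possible concrete version of those steps, where the `gh` CLI calls and the package name in the uninstall step are my assumptions rather than part of the documented flow:

    pip uninstall -y ttnn                        # assumption: the source install exposes the ttnn package
    gh run download 11742511348 -n eager-dist-ubuntu-20.04-wormhole_b0
    pip install ./*.whl
    pytest -svv "models/demos/llama3/tests/test_llama_model.py::test_llama_model_inference[wormhole_b0-True-2-quick]"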

@mrshaw01 (Contributor) commented:

FYI, when the test models-unit-tests (wormhole_b0, N150) / model wormhole_b0 N150 failed, I clicked on "run only failed tests" seven times, but it still failed. However, when I reran the entire "all post commit tests," it passed.

The job: https://github.com/tenstorrent/tt-metal/actions/runs/11758161892/job/32758797445
Hope it helps.

@jvasilje (Collaborator) commented:

@uaydonat what's the plan for this P0 bug? It has been open for 2 weeks.

@mtairum (Contributor) commented Nov 12, 2024

I've tried the wheel; no luck replicating the issue on the lab machine.

We do have a few small in-flight changes that could provoke a test failure: testing those now on CI to see whether they get past this issue.

@tt-rkim (Collaborator, Author) commented Nov 12, 2024

I also recommend trying custom test dispatch: https://github.com/tenstorrent/tt-metal/actions/workflows/test-dispatch.yaml

as that should replicate an environment close to what post-commit CI uses. It installs directly from source as opposed to using a wheel.

Wheels likely won't be a problem, but you could try installing the wheel locally to see if it helps track down the issue.
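If it helps, a dispatch can also be kicked off from the CLI. A sketch only; the workflow's required inputs (target runners, test command, etc.) are not spelled out here and would need to be checked against test-dispatch.yaml:

    gh workflow run test-dispatch.yaml --ref mtairum/debug_llama_post_commit
    # add -f <input>=<value> pairs for whatever inputs the workflow requires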

@uaydonat (Contributor) commented:

A few ideas:

  • replicate the CI environment locally - @esmalTT also had difficulty in the past replicating UNet failures locally until he did this. @evan any feedback?
  • take out a CI machine temporarily for interactive testing to see if it is a machine-dependent issue.
  • lower the clock to 900 MHz to see if the hang goes away
  • turn the di/dt workarounds back on to see if the hangs go away

@tt-rkim (Collaborator, Author) commented Nov 12, 2024

> take out a CI machine temporarily for interactive testing to see if it is a machine-dependent issue.

We can target a particular machine you've seen with custom test dispatch, as I mentioned above. Please let us know if there's anything blocking you from doing this.

> lower the clock to 900 MHz to see if the hang goes away
> turn the di/dt workarounds back on to see if the hangs go away

I believe your team has the necessary tools to do those without being blocked.

@jvasilje (Collaborator) commented:

@uaydonat @tt-rkim is this still a P0 bug? As in, is it actively failing in CI and blocking progress, as it has for the past 2 weeks? If so, someone should be working on this full time and updating the status here every few hours.

@uaydonat (Contributor) commented:

It is not blocking progress on other models, just the new llama codebase delivery.

Lowering to P1, but it is still important, and @mtairum is working on this full time.

@uaydonat added P1 and removed P0 labels Nov 12, 2024
@tt-rkim (Collaborator, Author) commented Nov 12, 2024

Thanks @uaydonat.
Yes, this sounds like P1; thanks @mtairum for taking a look.
Please let us know if you need any more resources.

@TT-billteng (Collaborator) commented:

Can we disable it? This can't be the ONLY llama test in post-commit, I hope?

@uaydonat (Contributor) commented:

@mtairum let's disable the ND tests; they are not helping if they fail anyway.

@mtairum (Contributor) commented Nov 12, 2024

I've disabled the tests for now to avoid the ND failures in the pipeline.
PR merged here: #14968

Recent small fixes to the model added today didn't change the behaviour.

My current plan is to identify a pattern among the CI VMs that show consistent hangs, if any.
Then I'll try to reproduce locally on one of those, following the same build process as CI.

@TT-billteng (Collaborator) commented Nov 12, 2024

> I've disabled the tests for now to avoid the ND failures in the pipeline. PR merged here: #14968
>
> Recent small fixes to the model added today didn't change the behaviour.
>
> My current plan is to identify a pattern among the CI VMs that show consistent hangs, if any. Then I'll try to reproduce locally on one of those, following the same build process as CI.

Thanks!

Have you tried https://github.com/tenstorrent/tt-metal/actions/workflows/test-dispatch.yaml to repro on the CI fleet?

@esmalTT (Contributor) commented Nov 12, 2024

@mtairum Have you tried running the test with watcher? Even if you run it on a machine where you're not seeing the hang, it might catch an illegal read/write that could help you isolate the issue.
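For example, a minimal watcher-enabled run might look like the sketch below; the 10-second polling interval and the log path are what I believe the defaults to be, so treat them as assumptions:

    export TT_METAL_WATCHER=10                   # poll interval in seconds
    pytest -svv models/demos/llama3/tests/test_llama_model.py -k quick
    less generated/watcher/watcher.log           # watcher output, relative to TT_METAL_HOME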

@uaydonat (Contributor) commented:

@mtairum any updates?

@mtairum (Contributor) commented Nov 14, 2024

Yes.

I tried running on CI with watcher enabled and didn't get any extra information.

However, I've compiled a list of passing/failing machines based on 44 runs (N150 only). Of the CI VMs that showed up multiple times, each either always passes or always fails.

@ttmchiou @tt-rkim @TT-billteng is it possible to get access to tt-metal-ci-vm-144 to debug the llama failure? That machine consistently fails (4 out of 4 times).

Below is the full list of passing and failing machines across the 44 runs.

| CI VM | Pass | Fail |
|---|---|---|
| tt-metal-ci-vm-27 | 1 | 0 |
| tt-metal-ci-vm-29 | 1 | 0 |
| tt-metal-ci-vm-31 | 2 | 0 |
| tt-metal-ci-vm-32 | 1 | 0 |
| tt-metal-ci-vm-57 | 0 | 1 |
| tt-metal-ci-vm-58 | 4 | 0 |
| tt-metal-ci-vm-68 | 2 | 0 |
| tt-metal-ci-vm-95 | 1 | 0 |
| tt-metal-ci-vm-98 | 1 | 0 |
| tt-metal-ci-vm-120 | 0 | 2 |
| tt-metal-ci-vm-124 | 0 | 1 |
| tt-metal-ci-vm-126 | 1 | 0 |
| tt-metal-ci-vm-127 | 0 | 1 |
| tt-metal-ci-vm-130 | 0 | 1 |
| tt-metal-ci-vm-133 | 0 | 1 |
| tt-metal-ci-vm-134 | 0 | 3 |
| tt-metal-ci-vm-137 | 0 | 3 |
| tt-metal-ci-vm-138 | 0 | 1 |
| tt-metal-ci-vm-139 | 0 | 1 |
| tt-metal-ci-vm-140 | 0 | 1 |
| tt-metal-ci-vm-143 | 0 | 1 |
| tt-metal-ci-vm-144 | 0 | 4 |
| tt-metal-ci-vm-146 | 6 | 0 |
| tt-metal-ci-vm-149 | 3 | 0 |
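For completeness, a tally like this is easy to regenerate; a sketch assuming the runner name and job result have been exported one pair per line into a hypothetical runs.txt:

    # runs.txt lines look like: "tt-metal-ci-vm-144 failure"
    sort runs.txt | uniq -c | sort -k2 -V        # count pass/fail occurrences per VM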

@uaydonat (Contributor) commented:

This table is very interesting. It appears that the problem is machine-specific.

Yes, @tt-rkim let's get access to 144, try all the di/dt workarounds, and collect more data; then we will probably need to involve Milos and syseng.

@ttmchiou (Contributor) commented:

@tt-rkim is on vacation.
I can handle getting @mtairum access to 144.
Will DM privately to set up access.

@ttmchiou (Contributor) commented:

Special note: VM-144 seems to only have 4 cores, as noticed by @TT-billteng.
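For anyone checking other runners, a quick way to confirm the core count (assuming shell access to the VM):

    nproc                # logical CPU count
    lscpu | head -n 20   # sockets / cores-per-socket / threads breakdown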
