Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tilize block not working for less than 32 rows #14609

Open
sjameelTT opened this issue Nov 1, 2024 · 14 comments
Open

Tilize block not working for less than 32 rows #14609

sjameelTT opened this issue Nov 1, 2024 · 14 comments

Comments

@sjameelTT
Copy link
Contributor

To recreate

git checkout sjameel/transpose_tilize_bug

build

TT_METAL_DPRINT_ONE_FILE_PER_RISC=1 TT_METAL_DPRINT_CORES=0,0 pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_transpose.py::test_tranpose_hw_rm_no_padding

It will print these dprint outputs in generated/dprint

Before tilize_block data:

0.535156 0.00390625 0.871094 0.597656 0.679688 0.410156 0.683594 0.601562 0.496094 0.878906 0.929688 0.445312 0.214844 0.582031 0.648438 0.542969 0.0664062 0.535156 0.890625 0.410156 0.984375 0.953125 0.0820312 0.445312 0.738281 0.0898438 0.972656 0.59375 0.472656 0.980469 0.335938 0.78125 

0.613281 0.8125 0.652344 0.328125 0.25 0.0820312 0.144531 0.996094 0.253906 0.578125 0.324219 0.121094 0.980469 0.515625 0.222656 0.800781 0.367188 0.253906 0.480469 0.269531 0.582031 0.503906 0.496094 0.527344 0.257812 0.0703125 0.109375 0.125 0.554688 0.542969 0.992188 0.101562 

0.699219 0.4375 0.804688 0.304688 0.5625 0.425781 0.246094 0.148438 0.136719 0.21875 0.246094 0.269531 0.0976562 0.234375 0.25 0.570312 0.773438 0.0390625 0.757812 0.449219 0.5 0.335938 0.691406 0.109375 0.960938 0.425781 0.597656 0.421875 0.046875 0.875 0.320312 0.574219 

0.949219 0.613281 0.632812 0.0859375 0.0546875 0.71875 0.464844 0.355469 0.289062 0.78125 0.578125 0.949219 0.894531 0.953125 0.359375 0.824219 0.109375 0.421875 0.316406 0.660156 0.0234375 0.726562 0.371094 0.261719 0.460938 0.0546875 0.472656 0.0820312 0.0820312 0.648438 0.03125 0.0625 

0.40625 0.394531 0.078125 0.222656 0.320312 0.589844 0.441406 0.15625 0.816406 0.53125 0.375 0.390625 0.601562 0.535156 0.957031 0.242188 0.867188 0.316406 0.894531 0.339844 0.847656 0.0976562 0.742188 0.292969 0.949219 0.871094 0.53125 0.988281 0.804688 0.746094 0.246094 0.0585938 

0.507812 0.738281 0.199219 0.5 0.96875 0.8125 0.582031 0.234375 0.714844 0.0351562 0.390625 0.824219 0.824219 0.496094 0.527344 0.296875 0.558594 0.988281 0.1875 0.0195312 0.597656 0.53125 0.25 0.382812 0.425781 0.996094 0.0820312 0.703125 0.984375 0.492188 0.464844 0.972656 

0.976562 0.175781 0.859375 0.152344 0.316406 0.0273438 0.582031 0.851562 0.863281 0.925781 0.117188 0.539062 0.75 0.21875 0.0585938 0.890625 0.9375 0.65625 0.554688 0.457031 0.0976562 0.359375 0.414062 0.304688 0.945312 0.441406 0.5625 0.460938 0.890625 0.707031 0.0390625 0 

0.476562 0.5625 0.167969 0.378906 0.402344 0.253906 0.417969 0.113281 0.09375 0.804688 0.625 0.550781 0.21875 0.476562 0.253906 0.953125 0.953125 0.625 0.484375 0.617188 0.589844 0.835938 0.589844 0.675781 0.800781 0.28125 0.242188 0.578125 0.15625 0.554688 0.824219 0.535156 

after tilize (human readable)

0.535156 0.00390625 0.871094 0.597656 0.679688 0.410156 0.683594 0.601562 0.496094 0.878906 0.929688 0.445312 0.214844 0.582031 0.648438 0.542969 0.0664062 0.535156 0.890625 0.410156 0.984375 0.953125 0.0820312 0.445312 0.738281 0.0898438 0.972656 0.59375 0.472656 0.980469 0.335938 0.78125

0.613281 0.8125 0.652344 0.328125 0.25 0.0820312 0.144531 0.996094 0.253906 0.578125 0.324219 0.121094 0.980469 0.515625 0.222656 0.800781 0.367188 0.253906 0.480469 0.269531 0.582031 0.503906 0.496094 0.527344 0.257812 0.0703125 0.109375 0.125 0.554688 0.542969 0.992188 0.101562

0.699219 0.4375 0.804688 0.304688 0.5625 0.425781 0.246094 0.148438 0.136719 0.21875 0.246094 0.269531 0.0976562 0.234375 0.25 0.570312 0.773438 0.0390625 0.757812 0.449219 0.5 0.335938 0.691406 0.109375 0.960938 0.425781 0.597656 0.421875 0.046875 0.875 0.320312 0.574219

0.949219 0.613281 0.632812 0.0859375 0.0546875 0.71875 0.464844 0.355469 0.289062 0.78125 0.578125 0.949219 0.894531 0.953125 0.359375 0.824219 0.109375 0.421875 0.316406 0.660156 0.0234375 0.726562 0.371094 0.261719 0.460938 0.0546875 0.472656 0.0820312 0.0820312 0.648438 0.03125 0.0625

0.40625 0.394531 0.078125 0.222656 0.320312 0.589844 0.441406 0.15625 0.816406 0.53125 0.375 0.390625 0.601562 0.535156 0.957031 0.242188 0.0664062 0.300781 0.101562 0.8125 0.394531 0.527344 0.320312 0.300781 0.03125 0.421875 0.902344 0.753906 0.902344 0.675781 0.00390625 0.00390625

0.507812 0.738281 0.199219 0.5 0.96875 0.8125 0.582031 0.234375 0.714844 0.0351562 0.390625 0.824219 0.824219 0.496094 0.527344 0.296875 0.628906 0.453125 0.808594 0.566406 0.472656 0.972656 0.503906 0.144531 0.0429688 0.839844 0.0390625 0.441406 0.800781 0.988281 0.421875 0.683594

0.976562 0.175781 0.859375 0.152344 0.316406 0.0273438 0.582031 0.851562 0.863281 0.925781 0.117188 0.539062 0.75 0.21875 0.0585938 0.890625 0.597656 0.472656 0.382812 0.117188 0.996094 0.9375 0.382812 0.660156 0.3125 0.078125 0.746094 0.574219 0.367188 0.816406 0.871094 0.179688

0.476562 0.5625 0.167969 0.378906 0.402344 0.253906 0.417969 0.113281 0.09375 0.804688 0.625 0.550781 0.21875 0.476562 0.253906 0.953125 0.652344 0.304688 0.515625 0.171875 0.0234375 0.410156 0.0859375 0.207031 0.875 0.792969 0.800781 0.976562 0.566406 0.761719 0.363281 0.0078125

Bottom 4 rows have incorrect values in the last 8 or so values.

@sjameelTT
Copy link
Contributor Author

@nvelickovicTT @ncvetkovicTT do you guys know when you'll be able to get to this issue?

@nvelickovicTT
Copy link

@sjameelTT we are currently looking into a related issue with tilize on BH. Most likely it is the same underlying cause as in your case, but in any case, we'll know more when we fix the one currently investigated.
I am hoping to figure it out by the end of the week (it's a tricky bug that requires simulator to be used).

@sjameelTT
Copy link
Contributor Author

Thanks @nvelickovicTT , hoping that fixing the other issue solves this one!

@zzigler-tt zzigler-tt added P0 and removed P1 labels Nov 7, 2024
@zzigler-tt
Copy link

Escalating this issue to P0 due to blocking multiple items.

ncvetkovicTT added a commit that referenced this issue Nov 7, 2024
- Issue is with transpose_wh_init_short not perfo-
rming sync between MATH and PACK. One way to
resolve it is to use full transpose_wh_init.
Further investigation is needed on proper solution.
@ncvetkovicTT
Copy link
Contributor

Hey @zzigler-tt @sjameelTT, I have pushed a workaround for this issue to the sjameel/transpose_tilize_bug. The issue is with MATH and PACK not being synced after tilize_blocka and before transpose_wh_init_short. This can be resolved by using the full transpose_wh_init for now, but I am conducting an investigation of how we can fix this properly and why transpose_wh_init_short modifies the content of the intermed0 circular buffer. Please take a look and conduct some more tests on it, or tell me what I can do to test it more. Cheers!

@ntarafdar
Copy link
Contributor

thanks @ncvetkovicTT , let us know if you can get an ETA on how long a proper fix might take. If it is long then we will go with the workaround. If it's a couple days we can wait. Either is fine, it's just knowing the ETA will help us take the right decision.

@ncvetkovicTT
Copy link
Contributor

thanks @ncvetkovicTT , let us know if you can get an ETA on how long a proper fix might take. If it is long then we will go with the workaround. If it's a couple days we can wait. Either is fine, it's just knowing the ETA will help us take the right decision.

You'll have the ETA on Monday <3

@ncvetkovicTT
Copy link
Contributor

ncvetkovicTT commented Nov 10, 2024

TL;DR: I think that I won't find a proper fix next week, so I say that we'll have the solution after 18 Nov.

Hey @sjameelTT @ntarafdar, I tried out some potential solutions for other problems that I thought might help with this one, but to no avail.

To give you a brief overview, there seems to be issues with synchronizing the Tensix cores in the kernel that's in use. I tried moving around some synchronization points in the underlying LLKs, but I have to do proper debugging. Instead of analyzing the LLKs visually and trying to figure out where the sync issue is, I'll probably have to look at some waveforms which will take some time to set up.

Introducing a couple of NOPs in transpose_wh_init_short fixes the issue. The way I would further debug this is to have one dump with a working number of NOPs and another with a failing number of NOPs. I hope this is possible to extract, but more likely there will be a gradual increase of failed runs as I reduce the number of NOPs. Since I have to set everything up, there will be a day or two of overhead work. @nvelickovicTT do you agree on the ETA?

@abhullar-tt
Copy link
Contributor

Could the issue be arising from Blackhole being able to bundle 4 tensix instructions into a single packet?

@ncvetkovicTT
Copy link
Contributor

@abhullar-tt I am not sure, but it seems like we're really missing some STALLWAIT somewhere.

@nvelickovicTT
Copy link

@abhullar-tt are you referring to coalesced writes? That is an interesting perspective, we haven't tried disabling that and testing.
@ncvetkovicTT there is a disable_gathering()enable_gathering() function pair in the code. We can try using it before/after kernel, just for sanity.

@ncvetkovicTT
Copy link
Contributor

Hey @abhullar-tt, @sjameelTT and @zzigler-tt, you can find another branch called ncvetkovic/14609_transpose_tilize_bug which proposes another fix for the issue.

@rtawfik01 remembered that there was already an issue behaving like this. The real cause of the bug are not the out-of-sync PACK/MATH as I initially thought, but the fact that DEST_ACCESS_CFG_remap_addrs_RMW and DEST_ACCESS_CFG_remap_addrs_RMW bits that are accessed in _llk_math_hw_configure_ cannot be toggled due to poor latching of the register. By default we pass false as untilize_en argument when performing unary_op_init_common, and then after tilize_block we want to set it to true when we call pack_untilize_dst_init_short. The value of these bits doesn't really matter for our software, they don't affect the LLKs or anything, but they do make our lives miserable by disallowing us to print the dest properly.

I think that the issue where this was first mentioned is #12234, but I am not sure if this was really fixed or not, do you perhaps know @ttmtrajkovic?

@nvelickovicTT FYI

@rtawfik01
Copy link
Contributor

@ttmtrajkovic just to recap, DEST_ACCESS_CFG_remap_addrs_RMW and DEST_ACCESS_CFG_remap_addrs_RMW are the registers needed to enable strided packer operations. These registers are both set to true currently with pack_untilize and reduce inits, which were the locations pack_untilize was used or fused, but those flags are not set to true in other kernel inits such as binary ops, transpose_wh, etc.

Those flags can only be toggled when both math & pack are fully idle, so that means in fused kernels with pack_untilize, each init function will need syncs for both math & pack to be idle in order to reconfig those bits. Enabling those bits caused issues with destination register debug prints, and also potential increase in power draw for the destination reg. If the debug print issue is fixed, the above bits can be set to true for all ops and it will not impact functionality. Otherwise if we don't want these bits set as true for all ops, then the workaround here is to add CSR waits to make sure all transactions in-flight for both pack & math are complete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants