[BUG] ZEPHYR FATAL ERROR 4: ASSERTION FAIL [aligned_addr == addr] intel_adsp_hda.h:163 #7191
@abonislawski @mwasko looks like an assert in the hda logic
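For context, here is a minimal sketch of the kind of check behind this assertion (not the verbatim Zephyr intel_adsp_hda.h code; the 128-byte value is an assumption taken from the dtsi alignment discussed below). The DMA buffer address handed to the HDA driver must already be aligned, so rounding it down must not change it:

```c
#include <assert.h>
#include <stdint.h>

#define HDA_BUF_ALIGN 128U  /* assumed alignment, per dma-buf-addr-alignment = <128> */

void hda_set_buffer(uintptr_t addr, uint32_t size)
{
	uintptr_t aligned_addr = addr & ~((uintptr_t)HDA_BUF_ALIGN - 1);

	/* This is the condition the firmware trips over: if the caller passes
	 * an unaligned (or outright bogus) buffer address, aligned_addr
	 * differs from addr and the assertion fires. */
	assert(aligned_addr == addr);

	(void)size;
}
```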
This issue happened again in CI. Test ID: https://sof-ci.ostc.intel.com/#/result/planresultdetail/21700
Interesting, because we have the correct alignment in the dtsi: dma-buf-addr-alignment = <128>
I found we are using the wrong alignment for the buffer size in chain_dma.c (the same alignment is used for both address and size instead of separate alignments), but it shouldn't be a problem (at least in this case); see the sketch below.
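To illustrate the point about separate alignments, here is a hedged sketch (not the actual chain_dma.c code; the function and parameter names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((uintptr_t)(a) - 1))

/* The address alignment (from dma-buf-addr-alignment, e.g. 128) and the
 * buffer-size alignment may be two different requirements, so they should
 * be applied independently rather than reusing one value for both. */
void align_dma_buffer(uintptr_t *addr, size_t *size,
		      uint32_t addr_align, uint32_t size_align)
{
	*addr = ALIGN_UP(*addr, addr_align);
	*size = (size_t)ALIGN_UP(*size, size_align);
}
```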
@abonislawski will you send a FW fix, or will this be a topology fix (or both)?
@lgirdwood I'm not working on this issue, I just checked whether we are using the alignment correctly.
Assigning to myself as this is affecting the SOF 2.5 release.
Somewhat hard to hit this issue today, but it seems we have a problem somewhere in host_zephyr_params(). I can see from the oops that the offending address is not valid at all (e.g. the buffer address passed to the DMA driver is 0x12c0). This is not aligned to 128 bytes, but alignment is not the worst problem here; a quick check is sketched below. FYI @abonislawski @juimonen Continuing the debug...
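As a quick illustration of why 0x12c0 trips the check (assuming the 128-byte requirement from the dtsi):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uintptr_t addr = 0x12c0;	/* offending address from the oops */

	/* 0x12c0 & 0x7f == 0x40, so the address is only 64-byte aligned,
	 * not 128-byte aligned - and such a low value is not a plausible
	 * heap buffer address in the first place. */
	printf("aligned to 128? %s\n", (addr & (128U - 1)) == 0 ? "yes" : "no");
	return 0;
}
```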
The very low reproduction rate is slowing down the debug.
So far one patch helps to avoid hitting the problem, but I can't explain why it helps -> https://github.com/kv2019i/sof/commits/202303-buffer-alloc-rework FYI @lyakh @ranj063. So far I've tried various Zephyr heap and stack debugging tools, tried aligning "struct coherent" to a full cacheline (sketched below), and cherry-picked recent DW-DMA fixes from upstream Zephyr, but the panic is still hit (albeit the reproduction rate is low, so long test runs are needed).
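For reference, the cacheline-alignment experiment mentioned above amounts to something like the following sketch (the field layout and the 64-byte cacheline size are assumptions, not the actual SOF definition):

```c
#define CACHELINE_SIZE 64	/* assumed DSP cacheline size */

/* Force the shared object to start on its own cacheline so that no
 * unrelated data shares a line with it. */
struct coherent {
	int shared_state;	/* placeholder field */
} __attribute__((aligned(CACHELINE_SIZE)));
```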
Tested with #7323 -> kernel panic still occurs.
Tested with "CONFIG_INCOHERENT=n" -> panic still occurs with the test case.
With today's mainline ("362a0781f20b6709cf8c609219798f6079a57511 (origin/main) topology2: Remove dma_buffer_size attribute"), the bug no longer shows up as a Zephyr panic; instead there's a plain IPC timeout and nothing in the FW log. But the original test sequence with "~/sof-test/test-case/multiple-pipeline.sh -f p -c 20 -l 25" is still failing.
@lgirdwood We do have cases that affect the TGL HDA tplg on multiple-pipeline-playback-50 tests. We already had this bug open for SOF 2.5, so this is not a regression, but let's keep the door open for a last-minute fix. I now have a suspect under investigation.
It seems chain-dma is somehow related to this. I've noted in the past week that this has only occurred on systems where chain-dma is enabled in the topology. I did a custom test run in Intel CI: without chain-dma the issue is not seen (26927), while the same test with current main and chain-dma enabled hits the issue (26930). UPDATE: I've run many thousands of cycles with hardware identical to what is used in CI, but I've not been able to reproduce the issue. Code analysis of the chain-dma code has not revealed anything yet either. Debug continues...
Today we got an unaligned-access exception on TGL UP Xtreme; we have seen this before though.
Intel internal daily test: planresultdetail/26938?model=TGLU_UP_HDA_IPC4ZPH&testcase=multiple-pipeline-capture-50
@fredoh9 wrote:
Let's track this as a separate error (do we have a bug for this?). The PC points to the Zephyr dmic interrupt handler and the value of
The PIF exception occurs on the OUTSTAT0 register access. I'd need the matching ELF image to be sure, but I think A8=0x1 is the value from dmic->reg_base (which explains the unaligned-access exception). @singalsu @juimonen @abonislawski @softwarecki any ideas? It seems we get an interrupt with a foobar dmic context. Anyway, until we have concrete data that this is related to this bug, I'd track it in a separate bug and keep #7191 focused on the aligned_addr assertion.
@kv2019i at this point I can only think that something is writing over the dmic reg_base (i.e. memory corruption); it is read from DT at compile time into the dmic driver struct, and I think nothing in the dmic code should write there.
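To illustrate the failure mode being discussed (hypothetical names; OUTSTAT0_OFFSET and the struct layout are not the real SOF/Zephyr definitions): if reg_base gets overwritten with a value like 0x1, the next MMIO read in the interrupt handler becomes an unaligned access and raises the exception.

```c
#include <stdint.h>

#define OUTSTAT0_OFFSET 0x0	/* illustrative offset only */

struct dmic_dev {
	uintptr_t reg_base;	/* normally filled from devicetree at build time */
};

uint32_t dmic_read_outstat0(const struct dmic_dev *dmic)
{
	/* If memory corruption leaves reg_base == 0x1, this 32-bit read is
	 * unaligned and the CPU raises a load/store alignment exception
	 * right here, inside the interrupt handler. */
	return *(volatile uint32_t *)(dmic->reg_base + OUTSTAT0_OFFSET);
}
```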
In today's CI test (ID: 27130), this issue happened when testing check-playback-100times on MTL-NOCODEC; the reproduction rate is almost 100%. It should be a regression. I did a quick bisect; it points to 07ed14b mtrace
@kv2019i, please help to verify it; I can open a new bug if needed.
@keqiaozhang Trying to reproduce on one of the CI machines, but no luck so far, so I'm not seeing that 100% repro rate. But let's keep this in this bug as the assert is the same. 07ed14b has a Zephyr baseline update (to 3.4.0-rc2), so it could bring in some change.
Could reproduction be affected by this?
@kv2019i, this issue happened again in today's daily test, and I can also easily reproduce it. Devices: jf-mtlp-rvp-nocodec-6 / jf-mtlp-rvp-nocodec-4 (Test ID: 27145)
@keqiaozhang Ack, I can easily hit this now. There's still an unexpected connection to the binary build that I can't explain. If I modify the code a bit (like removing one of the commits of #7726 -> these only add code that is run after a panic has happened), the issue no longer happens. The only explanation I can think of is that the .text section size changes and this impacts the FW memory map. Debug ongoing.
This didn't make the v2.6 cut, moving to v2.7. |
After a lot of MTL test failures across all PRs for a couple of days (with empty mtraces since #7726), today's https://sof-ci.01.org/softestpr/PR1054/build419/devicetest/index.html is mostly green. Go figure.
@marc-hb This is starting to look like a cache coherency issue, and the reproduction rate is very sensitive to any change in the code. E.g. the rimage change merged on Friday seems to make this bug disappear again (#7756). The rimage change was not intended, so I sent a fixup PR and ta-daa, we have #7191 triggering again at a high rate -> #7782.
Indeed MTLP_RVP_NOCODEC is mostly red again in https://sof-ci.01.org/sofpr/PR7782/build9265/devicetest/index.html (#7782) EDIT: lucky us, we even have an error message in mtrace
Other MTL configurations are green. It seems it is MTLP_RVP_NOCODEC that is affected every time. Older report: https://sof-ci.01.org/sofpr/PR7773/build9201/devicetest/index.html
Status update:
Debugging with @lyakh, we finally have a hypothesis that explains what happens and why e.g. #7786 helps:
At step t5, the writes from t4 are not seen in the prefetched data and end up being discarded. In the case of this bug, buffer->stream.addr is written with two values: NULL at t1 and the allocated heap address at t4. The t4 write gets lost due to the prefetch and we end up asserting on an invalid buffer address. A simplified illustration is sketched below.
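A simplified sketch of the hypothesized sequence (illustrative code, not the actual SOF buffer allocation path; the struct layout is a stand-in, and the t1/t4/t5 labels follow the comment above):

```c
#include <stddef.h>

struct stream { void *addr; };
struct comp_buffer { struct stream stream; };

void buffer_setup(struct comp_buffer *buffer, void *heap_addr)
{
	buffer->stream.addr = NULL;		/* t1: initial write */

	/* ... in between, the HW prefetcher speculatively pulls the cache
	 * line holding buffer->stream.addr ... */

	buffer->stream.addr = heap_addr;	/* t4: real heap address written */
}

void buffer_use(const struct comp_buffer *buffer)
{
	/* t5: if the stale prefetched line is consumed instead of the
	 * updated one, the t4 write is effectively lost and addr is still
	 * NULL (or garbage), which trips the aligned_addr == addr assert. */
	void *addr = buffer->stream.addr;
	(void)addr;
}
```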
Further experiments with prefetcher configuration today confirm the analysis and hypothesis of #7191 (comment). Let's wait for some time to see how the results look, but the expectation is now that this bug no longer occurs after #7786. Based on these findings, we've surveyed the rest of the codebase and e.g. #7808 was submitted.
Confirmed that this issue no longer exists in CI. Closing this bug. |
Describe the bug
Observed this issue in the CI daily test. From the kernel message there is an IPC timeout (0x13020004), but mtrace recorded a fatal exception that caused a Zephyr kernel panic on CPU0.
DMESG: see dmesg.txt below.
MTRACE: see mtrace.txt below.
To Reproduce
~/sof-test/test-case/multiple-pipeline.sh -f p -c 20 -l 25
Reproduction Rate
This bug has occurred three times in CI.
Environment
dmesg.txt
mtrace.txt