Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dma: detect stuck and dead streams #9

Open
wants to merge 1 commit into
base: main2
Choose a base branch
from

Conversation

zack-vii
Copy link

@zack-vii zack-vii commented Oct 4, 2024

Sometimes when the stream gets stuck or crashes on a acq400 carrier. It seems to be te best to reboot the system, e.g. ssh in and reboot if that is still possible of with the REBOOT knob on site 0. However, I find myself with a now stuck system that is trapped in an infinite loop in acq400_drv.c _acq420_continuous_dma_stop(). So with the usb terminal i read the message "WAITING for work task" periodically. Since the login level already shut down there is no way of intervention that I know of and the device is rendered nonoperational until the next physical access (due to space limitation we could not fit remote controlled power strips everywhere). Is it possible to add a timeout or something that breaks out of the loop upon reboot?

One solution could be to detect a stucked dma, e.g. by adding a counter to adev that is incremented on each cycle of the for loop in axi64_dual_data_loop().
_acq420_continuous_dma_stop() may check if the counter has changes after each 1sec timeout. if not it can assume the stream is stuck or dead. it could flag abort and continue. The stream in axi64_dual_data_loop may not like it but has the chance to do a safe -abort when detecting a set abort flag. The abort flag could be cleared when entering axi64_dual_data_loop.

This solution is presented here.

@petermilne
Copy link

Hello Timo "WAITING for work task" is unrecoverable as you found. It's a result of data overload (something is overloading the A9 side of the system). That's either because the data rate is simply too high, or possibly because of a misconfiguration - maybe the A9 is doing a lot more work than it needs to be doing. It's not something to fix in normal operation.
If you would send the output of /usr/local/CARE/diagsdump that would help is diagnose. And to be sure, please add the output of "head /var/log/acq400_streamd_0.log". Then ideally, we make sure it never happens.

Your host-side timeout code is interesting. We'll try it out. I want to understand why we have two flags "task_active" and "task_alive", maybe doing the same job.

And you wanted to know how to reboot the box once it's in the "WAITING for work task" state. So even if you can log in, and normal "reboot" is ineffective. And of course, if the box didn't let you log in, you couldn't even do that. The correct industrial-strength way to handle this is to enable the watchdog on boot, then service regularly. That can be done with low level program, if the kernel stops, the service stops and the watchdog forces reset. I'm guessing simply running "reboot" would be enough to stop the watchdog service. Or you could set up an external fail-safe watchdog, to cover the case that - the system is wedged, the OS is still running, but the login is disabled; in this method, you set up a handshake with an external agent under your control; This covers both cases: internal system fail, and "supervisor system says stop". see 4GUG 31.1 "WAITING for work task" periodically.

@zack-vii
Copy link
Author

zack-vii commented Oct 5, 2024

Hi Peter,
I added a new field because task_active is already used as a boolean stating if the task is running. it work now both ways. when set to 0 while in the loop the task would know that it was aborted. The other one, task_alive is the heartbeat that is checked for an update to se if the stream produced at least one update per 1sec. In case one walk through the loop is expected to take longer.. maybe there is a wait-check loop in the axi64_dual_data_loop in which one would add a adev->task_alive++; but since it does not get to check the stop request this does not seem to be the case. technically it would be possible to squash the two variables together. This would require the update to never update to 0 as this would indicate "task is done". Also one would try to avoid the chance to accidentally misinterpret a perfect buffer-overflow (e.g. + (1<<32) for an int32 ) as an unchanged value so the counter must have enough bits. One would say this should be true for 31 bits already. Then one would use the task_active for the active state task_active&1 != 0 and heartbeat task_active += 2. You would not want to check for task_active!=0 because that would give you a race condition in case the task is aborted.

About the watchdog .. I am already using the watchdog.. maybe this is now preventing it from shutting down? The idea was to trigger a reboot when ssh becomes unavailable. The service is periodically attempting to ssh to itself just to echo the next wait period into the watchdog timer. The loop would break when it fails to open a ssh connection. But that would mean that ssh from within still works and the service is still running?! I could add a message that is printed for each period and check if that message keeps being printed. I will play around with the watchdog some more.. basically it could as well try to ping some known always running server e.g. our ntp server. Once it succeeds it can be considered online and it would only feed the watchdog when a ping responds..

@petermilne
Copy link

petermilne commented Oct 6, 2024 via email

@zack-vii
Copy link
Author

zack-vii commented Oct 6, 2024

Well the problem arises during our attempt to work around the bandwidth limitation of the acq1001 carrier by programming the gpg with a duty cycle that would drop the average under 20MB/s which usually seems to be a safe value. However. the requirement is an active phase of at least 100ms at 40MHz with 2 channels. I was trying to use the RTM mode with a translen of 0x40000 every 2 seconds. I am using the two wire mode (is it possible to reduce the channels to two with that approach, I did not get it to work). This would leave me with 4M42B = 32MB of data to buffer .. I added a | pv -qB 64MB in acq400_streamd whihc should have ensured the writing process to have no issues however eventually the stream crashes:

[  606.192060] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[  606.192065] Modules linked in: emlog xilinx_axidma acq480 acq420fmc pl330 dmaengine314 xilinx_devcfg
[  606.192094] CPU: 0 PID: 28387 Comm: acq400_stream Not tainted 4.14.0-acq400-xilinx #9
[  606.192098] Hardware name: Xilinx Zynq Platform
[  606.192104] task: cde82080 task.stack: cd548000
[  606.192408] PC is at _acq400_continuous_read_update_hb0+0xb0/0x150 [acq420fmc]
[  606.192412] LR is at 0xc0601734
[  606.192418] pc : [<bf02403c>]    lr : [<c0601734>]    psr: a0000013
[  606.192423] sp : cd549ea8  ip : c0601734  fp : 00000000
[  606.192428] r10: 00000000  r9 : 00000050  r8 : ee9e8860
[  606.192434] r7 : cdef7038  r6 : ee9e89c8  r5 : ee9e89c8  r4 : ee9e8860
[  606.192439] r3 : ffff89d4  r2 : ed8b0514  r1 : ed8b04d4  r0 : ef175810
[  606.192446] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  606.192452] Control: 18c5387d  Table: 0df0404a  DAC: 00000051
[  606.192459] Process acq400_stream (pid: 28387, stack limit = 0xcd548210)
[  606.192465] Stack: (0xcd549ea8 to 0xcd54a000)
[  606.192476] 9ea0:                   ee9e8860 ee9e89c8 00000006 be908ab8 ee9e89c8 bf024110
[  606.192487] 9ec0: ee9e8860 00000000 bf061dc8 cdde2000 be908ab8 bf0242f0 ee9e89c8 eea819c0
[  606.192499] 9ee0: bf0ba0d4 ee9e89c8 00000000 cdde2000 bf024184 00000000 be908ab8 cd549f88
[  606.192510] 9f00: cd548000 c01d4ab0 0004c298 00000014 00000001 00000000 00000000 cd549f10
[  606.192520] 9f20: 00000000 00000000 00000000 00000000 00000014 00000000 00000000 eea13780
[  606.192531] 9f40: 00000000 cdde2000 cd549f88 00000050 cdde2000 00000050 00000000 be908ab8
[  606.192542] 9f60: cd549f88 c01d4c3c cdde2000 be908ab8 cdde2000 cdde2000 be908ab8 00000050
[  606.192553] 9f80: c01079a4 c01d5068 00000000 00000000 00000050 0004c120 0004c12c 00200000
[  606.192564] 9fa0: 00000003 c01077e0 0004c120 0004c12c 00000086 be908ab8 00000050 00000000
[  606.192575] 9fc0: 0004c120 0004c12c 00200000 00000003 0004c144 0002bc7c 00200000 00000000
[  606.192586] 9fe0: 000400a4 be908a98 0001e620 b6eec670 60040010 00000086 00000000 00000000
[  606.193124] [<bf02403c>] (_acq400_continuous_read_update_hb0 [acq420fmc]) from [<bf024110>] (_acq400_continuous_read+0x34/0xa8 [acq420fmc])
[  606.193650] [<bf024110>] (_acq400_continuous_read [acq420fmc]) from [<bf0242f0>] (acq400_continuous_read+0x16c/0x198 [acq420fmc])
[  606.193928] [<bf0242f0>] (acq400_continuous_read [acq420fmc]) from [<c01d4ab0>] (__vfs_read+0x1c/0x118)
[  606.193944] [<c01d4ab0>] (__vfs_read) from [<c01d4c3c>] (vfs_read+0x90/0xfc)
[  606.193956] [<c01d4c3c>] (vfs_read) from [<c01d5068>] (SyS_read+0x3c/0x74)
[  606.193976] [<c01d5068>] (SyS_read) from [<c01077e0>] (ret_fast_syscall+0x0/0x48)
[  606.193988] Code: e35c0000 01a0c00e e3530002 9a000000 (e7f001f2) 
[  606.193996] ---[ end trace 115b2554966ec1c7 ]---
[  606.206379] acq420 40000000.acq2006sc: acq400sc_onStop
[  606.453564] Unable to handle kernel NULL pointer dereference at virtual address 00000003
[  606.461672] pgd = c0004000
[  606.464385] [00000003] *pgd=00000000
[  606.467972] Internal error: Oops - BUG: 805 [#2] PREEMPT SMP ARM
[  606.473980] Modules linked in: emlog xilinx_axidma acq480 acq420fmc pl330 dmaengine314 xilinx_devcfg
[  606.483148] CPU: 1 PID: 28394 Comm: acq400.0.ai Tainted: G      D         4.14.0-acq400-xilinx #9
[  606.492018] Hardware name: Xilinx Zynq Platform
[  606.496554] task: ef12e040 task.stack: eeaac000
[  606.501442] PC is at _getEmpty+0x40/0xbc [acq420fmc]
[  606.506773] LR is at getEmpty+0x1c/0x30 [acq420fmc]
[  606.511656] pc : [<bf03bd98>]    lr : [<bf03be30>]    psr: a0060013
[  606.517921] sp : eeaadf20  ip : 00000000  fp : ee9e890c
[  606.523153] r10: 00000062  r9 : 00000062  r8 : bf064080
[  606.528401] r7 : bf05d000  r6 : bf061dc8  r5 : ed8b04c0  r4 : ee9e8860
[  606.534936] r3 : ed8b04d4  r2 : 00000003  r1 : ee9e89d4  r0 : ee9e8860
[  606.541470] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  606.548615] Control: 18c5387d  Table: 2d8d404a  DAC: 00000051
[  606.554365] Process acq400.0.ai (pid: 28394, stack limit = 0xeeaac210)
[  606.560899] Stack: (0xeeaadf20 to 0xeeaae000)
[  606.565263] df20: ee9e89b8 ee9e8860 bf061dc8 bf03be30 ee9e8860 00000000 bf061dc8 bf021e18
[  606.573450] df40: 00000001 00000001 cd549d80 00000000 ef12e040 c01495e8 eeaadf58 eeaadf58
[  606.581644] df60: 00000000 cde47700 cde405c0 eeaac000 00000000 ee9e8860 bf021a00 cd549d80
[  606.589859] df80: cde4771c c0133718 eeaac000 cde405c0 c01335e8 00000000 00000000 00000000
[  606.598045] dfa0: 00000000 00000000 00000000 c01078b0 00000000 00000000 00000000 00000000
[  606.606235] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  606.614419] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[  606.623222] [<bf03bd98>] (_getEmpty [acq420fmc]) from [<bf03be30>] (getEmpty+0x1c/0x30 [acq420fmc])
[  606.632901] [<bf03be30>] (getEmpty [acq420fmc]) from [<bf021e18>] (axi64_data_loop+0x418/0x49c [acq420fmc])
[  606.643050] [<bf021e18>] (axi64_data_loop [acq420fmc]) from [<c0133718>] (kthread+0x130/0x148)
[  606.651689] [<c0133718>] (kthread) from [<c01078b0>] (ret_from_fork+0x14/0x24)
[  606.658923] Code: e2801f5d e5902178 e5803178 e8830006 (e5823000) 
[  606.665103] ---[ end trace 115b2554966ec1c8 ]---

This is on a 712 fw.

/usr/local/CARE/diagsdump does not exist.

@petermilne
Copy link

petermilne commented Oct 7, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants