dma: detect stuck and dead streams #9
base: main2
Conversation
Hello Timo

"WAITING for work task" is unrecoverable, as you found. It's a result of data overload (something is overloading the A9 side of the system). That's either because the data rate is simply too high, or possibly because of a misconfiguration - maybe the A9 is doing a lot more work than it needs to. It's not something to fix in normal operation.

Your host-side timeout code is interesting. We'll try it out. I want to understand why we have two flags, "task_active" and "task_alive", which may be doing the same job.

You also wanted to know how to reboot the box once it's in the "WAITING for work task" state, since even if you can log in, a normal "reboot" is ineffective. And of course, if the box doesn't let you log in, you can't even do that.

The correct industrial-strength way to handle this is to enable the watchdog on boot, then service it regularly. That can be done with a low-level program: if the kernel stops, the service stops and the watchdog forces a reset. I'm guessing simply running "reboot" would be enough to stop the watchdog service.

Or you could set up an external fail-safe watchdog, to cover the case where the system is wedged and the OS is still running, but login is disabled. In this method, you set up a handshake with an external agent under your control. This covers both cases: internal system failure, and "supervisor system says stop". See 4GUG 31.1.
Hi Peter, about the watchdog: I am already using the watchdog. Maybe this is now preventing it from shutting down? The idea was to trigger a reboot when ssh becomes unavailable. The service periodically attempts to ssh to itself just to echo the next wait period into the watchdog timer, and the loop would break when it fails to open an ssh connection. But that would mean that ssh from within still works and the service is still running?! I could add a message that is printed each period and check whether that message keeps being printed. I will play around with the watchdog some more. Basically it could just as well try to ping some known always-running server, e.g. our NTP server: once the ping succeeds, the box can be considered online, and the service would only feed the watchdog when a ping responds.
Hello Timo
On 05/10/2024 02:13, Timo Schroeder wrote:
Hi Peter,
I added a new field because task_active is already used as a boolean stating whether the task is running. It now works both ways: when set to 0 while in the loop, the task knows it was aborted. The other one, task_alive, is the heartbeat that is checked for an update, to see if the stream produced at least one update per second. In case one walk through the loop is expected to take longer, maybe there is a wait-check loop in axi64_dual_data_loop in which one would add an adev->task_alive++; but since it does not get to check the stop request, this does not seem to be the case.

Technically it would be possible to squash the two variables together. This would require the heartbeat update to never produce 0, as 0 would indicate "task is done". One would also want to avoid any chance of accidentally misinterpreting a perfect counter overflow (e.g. +(1<<32) for an int32) as an unchanged value, so the counter must have enough bits; 31 bits should already be enough. Then one would use task_active for the active state (`task_active & 1 != 0`) and for the heartbeat (`task_active += 2`). You would not want to check for task_active != 0, because that would give you a race condition in case the task is aborted.
About the watchdog .. I am already using the watchdog.. maybe this is now preventing it from shutting down?
No, exactly the opposite. Shutdown is not possible because we've started a kernel thread we cannot stop.
It's a bad idea to make it stoppable, because the system is unusable now anyway. Ideally, the system would PANIC and reboot itself .. but the problem then is that, unless it's a well-configured site with full remote logging, the cause of the fault will be lost.
The idea was to trigger a reboot when ssh becomes unavailable. The service periodically attempts to ssh to itself just to echo the next wait period into the watchdog timer, and the loop would break when it fails to open an ssh connection. But that would mean that ssh from within still works and the service is still running?! I could add a message that is printed each period and check whether that message keeps being printed. I will play around with the watchdog some more. Basically it could just as well try to ping some known always-running server, e.g. our NTP server: once the ping succeeds, the box can be considered online, and the service would only feed the watchdog when a ping responds.
That could work, provided the ping timeout is not in the same ballpark as the watchdog timeout!
You could tackle the problem directly by including the search term in your WDT service loop as well
eg
```
while [ 1 ]; do
    # do we have networking?
    ping -c 1 -w2 $NTPSERVER || break
    # did we trip a fault in the device driver?
    dmesg | grep -q "WAITING for work" && break
    sleep 10
    wget http://localhost/cgi-bin/watchdog.cgi?WDT_RESTART
done
```
I'm curious why this is happening. Remember there are two data paths in the system:
- ARM : ADC -> FPGA -> DMA -> DRAM
- MGT : ADC -> FPGA -> MGT
The fault ONLY occurs if the ARM path is overloaded. Which is somewhat irrelevant to your use-case, since you're not even using the data from the ARM path!
There's an easy fix for this: decimate the ARM path data:
set.site 1 nacc 10 # decimate by 10. MUST apply to all sites*
* if nacc is present in /mnt/local/sysconfig/site-1-peers, then all sites will update automatically.
I'm curious, under what conditions is it happening?
Please send output from /usr/local/CARE/diagsdump
Cheers
Peter
--
Peter Milne
Director of Sales
www.d-tacq.com
Well, the problem arises during our attempt to work around the bandwidth limitation of the acq1001 carrier by programming the gpg with a duty cycle that would drop the average under 20MB/s, which usually seems to be a safe value. However, the requirement is an active phase of at least 100ms at 40MHz with 2 channels. I was trying to use the RTM mode with a translen of 0x40000 every 2 seconds. I am using the two-wire mode (is it possible to reduce the channels to two with that approach? I did not get it to work). This would leave me with 4M*4*2B = 32MB of data to buffer. I added a `| pv -qB 64MB` in acq400_streamd, which should have ensured the writing process had no issues; however, eventually the stream crashes.
This is on a 712 fw. /usr/local/CARE/diagsdump does not exist.
Hello Timo
This is on ACQ1001? As you've found, this is not normally used for streaming ACQ480, since the data rate is just too great.
Your attempt to reduce the duty cycle is a good idea and, on the face of it, should work.
This is on a 712 fw.
/usr/local/CARE/diagsdump does not exist.
Hmm, we've had it for some time:
commit 4f3024a
Author: Peter Milne ***@***.***>
Date: Thu Apr 15 17:03:57 2021 +0100
add diagsdump service at 54556
new file: CARE/diagsdump
modified: init/inetd.conf
That far predates 712
commit 1f57412e0350858dd91e87f90be4e9c62143c9b5 (tag: V712)
Author: Peter Milne ***@***.***>
Date: Wed Sep 25 15:38:55 2024 +0100
712
What FPGA do you have? If it's a -2D, this is not necessary for 40MHz operation.
2-wire mode will indeed give you 4 channels rather than two. It will not allow 2-channel ops, and the data order will be strange.
Instead, I'd recommend running in 1-wire mode, then running a SUBSET on the output data before it streams to port 4210 (you are streaming data from 4210?).
4GUG: https://www.d-tacq.com/resources/d-tacq-4G-acq4xx-UserGuide-r47.pdf
10.4.3 Strategies for reducing Wire Rate
STREAM_OPTS="--subset=2"
Ultimately, though, with ACQ1001, we're up against the slowness of the ZYNQ "GEM" network interface.
We have new, better alternatives, eg
ACQ1102: Streams HUDP at 100MB/s or Aurora/PCIe at 400MB/s, costs the same as ACQ1002R.
Best regards
Peter
On 06/10/2024 22:01, Timo Schroeder wrote:
Well, the problem arises during our attempt to work around the bandwidth limitation of the acq1001 carrier by programming the gpg with a duty cycle that would drop the average under 20MB/s, which usually seems to be a safe value. However, the requirement is an active phase of at least 100ms at 40MHz with 2 channels. I was trying to use the RTM mode with a translen of 0x40000 every 2 seconds. I am using the two-wire mode (is it possible to reduce the channels to two with that approach? I did not get it to work). This would leave me with 4M*4*2B = 32MB of data to buffer. I added a `| pv -qB 64MB` in acq400_streamd, which should have ensured the writing process had no issues; however, eventually the stream crashes:
```
[ 606.192060] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[ 606.192065] Modules linked in: emlog xilinx_axidma acq480 acq420fmc pl330 dmaengine314 xilinx_devcfg
[ 606.192094] CPU: 0 PID: 28387 Comm: acq400_stream Not tainted 4.14.0-acq400-xilinx #9
[ 606.192098] Hardware name: Xilinx Zynq Platform
[ 606.192104] task: cde82080 task.stack: cd548000
[ 606.192408] PC is at _acq400_continuous_read_update_hb0+0xb0/0x150 [acq420fmc]
[ 606.192412] LR is at 0xc0601734
[ 606.192418] pc : [<bf02403c>] lr : [<c0601734>] psr: a0000013
[ 606.192423] sp : cd549ea8 ip : c0601734 fp : 00000000
[ 606.192428] r10: 00000000 r9 : 00000050 r8 : ee9e8860
[ 606.192434] r7 : cdef7038 r6 : ee9e89c8 r5 : ee9e89c8 r4 : ee9e8860
[ 606.192439] r3 : ffff89d4 r2 : ed8b0514 r1 : ed8b04d4 r0 : ef175810
[ 606.192446] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 606.192452] Control: 18c5387d Table: 0df0404a DAC: 00000051
[ 606.192459] Process acq400_stream (pid: 28387, stack limit = 0xcd548210)
[ 606.192465] Stack: (0xcd549ea8 to 0xcd54a000)
[ 606.192476] 9ea0: ee9e8860 ee9e89c8 00000006 be908ab8 ee9e89c8 bf024110
[ 606.192487] 9ec0: ee9e8860 00000000 bf061dc8 cdde2000 be908ab8 bf0242f0 ee9e89c8 eea819c0
[ 606.192499] 9ee0: bf0ba0d4 ee9e89c8 00000000 cdde2000 bf024184 00000000 be908ab8 cd549f88
[ 606.192510] 9f00: cd548000 c01d4ab0 0004c298 00000014 00000001 00000000 00000000 cd549f10
[ 606.192520] 9f20: 00000000 00000000 00000000 00000000 00000014 00000000 00000000 eea13780
[ 606.192531] 9f40: 00000000 cdde2000 cd549f88 00000050 cdde2000 00000050 00000000 be908ab8
[ 606.192542] 9f60: cd549f88 c01d4c3c cdde2000 be908ab8 cdde2000 cdde2000 be908ab8 00000050
[ 606.192553] 9f80: c01079a4 c01d5068 00000000 00000000 00000050 0004c120 0004c12c 00200000
[ 606.192564] 9fa0: 00000003 c01077e0 0004c120 0004c12c 00000086 be908ab8 00000050 00000000
[ 606.192575] 9fc0: 0004c120 0004c12c 00200000 00000003 0004c144 0002bc7c 00200000 00000000
[ 606.192586] 9fe0: 000400a4 be908a98 0001e620 b6eec670 60040010 00000086 00000000 00000000
[ 606.193124] [<bf02403c>] (_acq400_continuous_read_update_hb0 [acq420fmc]) from [<bf024110>] (_acq400_continuous_read+0x34/0xa8 [acq420fmc])
[ 606.193650] [<bf024110>] (_acq400_continuous_read [acq420fmc]) from [<bf0242f0>] (acq400_continuous_read+0x16c/0x198 [acq420fmc])
[ 606.193928] [<bf0242f0>] (acq400_continuous_read [acq420fmc]) from [<c01d4ab0>] (__vfs_read+0x1c/0x118)
[ 606.193944] [<c01d4ab0>] (__vfs_read) from [<c01d4c3c>] (vfs_read+0x90/0xfc)
[ 606.193956] [<c01d4c3c>] (vfs_read) from [<c01d5068>] (SyS_read+0x3c/0x74)
[ 606.193976] [<c01d5068>] (SyS_read) from [<c01077e0>] (ret_fast_syscall+0x0/0x48)
[ 606.193988] Code: e35c0000 01a0c00e e3530002 9a000000 (e7f001f2)
[ 606.193996] ---[ end trace 115b2554966ec1c7 ]---
[ 606.206379] acq420 40000000.acq2006sc: acq400sc_onStop
[ 606.453564] Unable to handle kernel NULL pointer dereference at virtual address 00000003
[ 606.461672] pgd = c0004000
[ 606.464385] [00000003] *pgd=00000000
[ 606.467972] Internal error: Oops - BUG: 805 [#2] PREEMPT SMP ARM
[ 606.473980] Modules linked in: emlog xilinx_axidma acq480 acq420fmc pl330 dmaengine314 xilinx_devcfg
[ 606.483148] CPU: 1 PID: 28394 Comm: acq400.0.ai Tainted: G D 4.14.0-acq400-xilinx #9
[ 606.492018] Hardware name: Xilinx Zynq Platform
[ 606.496554] task: ef12e040 task.stack: eeaac000
[ 606.501442] PC is at _getEmpty+0x40/0xbc [acq420fmc]
[ 606.506773] LR is at getEmpty+0x1c/0x30 [acq420fmc]
[ 606.511656] pc : [<bf03bd98>] lr : [<bf03be30>] psr: a0060013
[ 606.517921] sp : eeaadf20 ip : 00000000 fp : ee9e890c
[ 606.523153] r10: 00000062 r9 : 00000062 r8 : bf064080
[ 606.528401] r7 : bf05d000 r6 : bf061dc8 r5 : ed8b04c0 r4 : ee9e8860
[ 606.534936] r3 : ed8b04d4 r2 : 00000003 r1 : ee9e89d4 r0 : ee9e8860
[ 606.541470] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 606.548615] Control: 18c5387d Table: 2d8d404a DAC: 00000051
[ 606.554365] Process acq400.0.ai (pid: 28394, stack limit = 0xeeaac210)
[ 606.560899] Stack: (0xeeaadf20 to 0xeeaae000)
[ 606.565263] df20: ee9e89b8 ee9e8860 bf061dc8 bf03be30 ee9e8860 00000000 bf061dc8 bf021e18
[ 606.573450] df40: 00000001 00000001 cd549d80 00000000 ef12e040 c01495e8 eeaadf58 eeaadf58
[ 606.581644] df60: 00000000 cde47700 cde405c0 eeaac000 00000000 ee9e8860 bf021a00 cd549d80
[ 606.589859] df80: cde4771c c0133718 eeaac000 cde405c0 c01335e8 00000000 00000000 00000000
[ 606.598045] dfa0: 00000000 00000000 00000000 c01078b0 00000000 00000000 00000000 00000000
[ 606.606235] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 606.614419] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[ 606.623222] [<bf03bd98>] (_getEmpty [acq420fmc]) from [<bf03be30>] (getEmpty+0x1c/0x30 [acq420fmc])
[ 606.632901] [<bf03be30>] (getEmpty [acq420fmc]) from [<bf021e18>] (axi64_data_loop+0x418/0x49c [acq420fmc])
[ 606.643050] [<bf021e18>] (axi64_data_loop [acq420fmc]) from [<c0133718>] (kthread+0x130/0x148)
[ 606.651689] [<c0133718>] (kthread) from [<c01078b0>] (ret_from_fork+0x14/0x24)
[ 606.658923] Code: e2801f5d e5902178 e5803178 e8830006 (e5823000)
[ 606.665103] ---[ end trace 115b2554966ec1c8 ]---
```
This is on a 712 fw.
/usr/local/CARE/diagsdump does not exist.
Sometimes the stream gets stuck or crashes on an acq400 carrier. It then seems best to reboot the system, e.g. ssh in and reboot if that is still possible, or with the REBOOT knob on site 0. However, I find myself with a stuck system that is trapped in an infinite loop in acq400_drv.c _acq420_continuous_dma_stop(). On the USB terminal I read the message "WAITING for work task" periodically. Since the login level has already shut down, there is no way of intervention that I know of, and the device is rendered nonoperational until the next physical access (due to space limitations we could not fit remote-controlled power strips everywhere). Is it possible to add a timeout or something that breaks out of the loop upon reboot?
One solution could be to detect a stuck DMA, e.g. by adding a counter to adev that is incremented on each cycle of the for loop in axi64_dual_data_loop().
_acq420_continuous_dma_stop() may then check whether the counter has changed after each 1-sec timeout. If not, it can assume the stream is stuck or dead; it can flag abort and continue. The stream in axi64_dual_data_loop may not like it, but has the chance to do a safe abort when it detects the abort flag set. The abort flag could be cleared when entering axi64_dual_data_loop.
This solution is presented here.