You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
While integrating uC-USBH using the ATSAMx HCD for a client, I was running some tests where an external hub with devices on two ports was continuously disconnected from and reconnected to the root hub every few seconds. After a few hours (about 2000 disconnects/reconnects, give or take), I found that the system would crash and run into a reset.
So I turned on logging and got (again, after a few hours):
The USBH_HUB_ISR() fails was unique and seemed to be where things started to go wrong, so I set a breakpoint:
Breakpoint 1, USBH_HUB_ISR (p_ep=0x20006504 <USBH_HUB_Arr+1320>, p_buf=0x200065bc <USBH_HUB_Arr+1504>, buf_len=<optimized out>, xfer_len=<optimized out>, p_arg=0x20006504 <USBH_HUB_Arr+1320>, err=USBH_ERR_HC_IO) at uC-USBH/Source/usbh_hub.c:929
929 USBH_PRINT_LOG("USBH_HUB_ISR() fails. Err=%d errcnt=%d\r\n",
(gdb) bt
#0 USBH_HUB_ISR (p_ep=0x20006504 <USBH_HUB_Arr+1320>, p_buf=0x200065bc <USBH_HUB_Arr+1504>, buf_len=<optimized out>,
xfer_len=<optimized out>, p_arg=0x20006504 <USBH_HUB_Arr+1320>, err=USBH_ERR_HC_IO)
at uC-USBH/Source/usbh_hub.c:929
#1 0x000145fe in USBH_URB_Notify (p_urb=0x20003438 <hal::(anonymous namespace)::App_USBH_AsyncTaskStk+3964>)
at uC-USBH/Source/usbh_core.c:4189
#2 USBH_URB_Complete (p_urb=p_urb@entry=0x20006520 <USBH_HUB_Arr+1348>)
at uC-USBH/Source/usbh_core.c:3498
#3 0x000146f8 in USBH_AsyncTask (p_arg=<optimized out>)
at uC-USBH/Source/usbh_core.c:5248
#4 0x00017834 in OSTaskQuery (prio=4 '\004', p_task_data=0x6060606)
at uC-OS2/Source/os_task.c:1100
#5 0x00000000 in ?? ()
Then I stepped into USBH_HUB_Disconn:
Breakpoint 3, USBH_HUB_Disconn (p_class_dev=0x20006504 <USBH_HUB_Arr+1320>) at uC-USBH/Source/usbh_hub.c:642
642 {
(gdb) bt
#0 USBH_HUB_Disconn (p_class_dev=0x20006504 <USBH_HUB_Arr+1320>)
at uC-USBH/Source/usbh_hub.c:642
#1 0x00013c96 in USBH_ClassDrvDisconn (p_dev=p_dev@entry=0x20005910 <USBH_Host+6308>)
at uC-USBH/Source/usbh_class.c:504
#2 0x0001480c in USBH_DevDisconn (p_dev=p_dev@entry=0x20005910 <USBH_Host+6308>)
at uC-USBH/Source/usbh_core.c:1041
#3 0x00015746 in USBH_HUB_EventProcess ()
at uC-USBH/Source/usbh_hub.c:1046
#4 USBH_HUB_EventTask (p_arg=<optimized out>)
at uC-USBH/Source/usbh_hub.c:293
#5 0x00017834 in OSTaskQuery (prio=4 '\004', p_task_data=0x20006508 <USBH_HUB_Arr+1324>)
at uC-OS2/Source/os_task.c:1100
#6 0x00000000 in ?? ()
And this is where the code goes into infinite recursion:
#0 USBH_HUB_Disconn (p_class_dev=0x20006504 <USBH_HUB_Arr+1320>)
at uC-USBH/Source/usbh_hub.c:642
#1 0x00013c96 in USBH_ClassDrvDisconn (p_dev=p_dev@entry=0x20005910 <USBH_Host+6308>)
at uC-USBH/Source/usbh_class.c:504
#2 0x0001480c in USBH_DevDisconn (p_dev=p_dev@entry=0x20005910 <USBH_Host+6308>)
at uC-USBH/Source/usbh_core.c:1041
#3 0x0001532c in USBH_HUB_Uninit (p_hub_dev=0x20006504 <USBH_HUB_Arr+1320>)
at uC-USBH/Source/usbh_hub.c:756
#4 USBH_HUB_Disconn (p_class_dev=0x20006504 <USBH_HUB_Arr+1320>)
at uC-USBH/Source/usbh_hub.c:649
#5 0x00013c96 in USBH_ClassDrvDisconn (p_dev=p_dev@entry=0x20005910 <USBH_Host+6308>)
at uC-USBH/Source/usbh_class.c:504
#6 0x0001480c in USBH_DevDisconn (p_dev=p_dev@entry=0x20005910 <USBH_Host+6308>)
at uC-USBH/Source/usbh_core.c:1041
#7 0x0001532c in USBH_HUB_Uninit (p_hub_dev=0x20006504 <USBH_HUB_Arr+1320>)
at uC-USBH/Source/usbh_hub.c:756
#8 USBH_HUB_Disconn (p_class_dev=0x20006504 <USBH_HUB_Arr+1320>)
at uC-USBH/Source/usbh_hub.c:649
#9 0x00013c96 in USBH_ClassDrvDisconn (p_dev=p_dev@entry=0x20005910 <USBH_Host+6308>)
at uC-USBH/Source/usbh_class.c:504
#10 0x0001480c in USBH_DevDisconn (p_dev=p_dev@entry=0x20005910 <USBH_Host+6308>)
at uC-USBH/Source/usbh_core.c:1041
#11 0x0001532c in USBH_HUB_Uninit (p_hub_dev=0x20006504 <USBH_HUB_Arr+1320>)
at uC-USBH/Source/usbh_hub.c:756
#12 USBH_HUB_Disconn (p_class_dev=0x20006504 <USBH_HUB_Arr+1320>)
at uC-USBH/Source/usbh_hub.c:649
#13 0x00013c96 in USBH_ClassDrvDisconn (p_dev=p_dev@entry=0x20005910 <USBH_Host+6308>)
at uC-USBH/Source/usbh_class.c:504
#14 0x0001480c in USBH_DevDisconn (p_dev=p_dev@entry=0x20005910 <USBH_Host+6308>)
at uC-USBH/Source/usbh_core.c:1041
#15 0x0001532c in USBH_HUB_Uninit (p_hub_dev=0x20006504 <USBH_HUB_Arr+1320>)
at uC-USBH/Source/usbh_hub.c:756
#16 USBH_HUB_Disconn (p_class_dev=0x20006504 <USBH_HUB_Arr+1320>)
at uC-USBH/Source/usbh_hub.c:649
#17 0x00013c96 in USBH_ClassDrvDisconn (p_dev=p_dev@entry=0x20005910 <USBH_Host+6308>)
at uC-USBH/Source/usbh_class.c:504
#18 0x0001480c in USBH_DevDisconn (p_dev=p_dev@entry=0x20005910 <USBH_Host+6308>)
at uC-USBH/Source/usbh_core.c:1041
#19 0x00015746 in USBH_HUB_EventProcess ()
at uC-USBH/Source/usbh_hub.c:1046
#20 USBH_HUB_EventTask (p_arg=<optimized out>)
at uC-USBH/Source/usbh_hub.c:293
#21 0x00017834 in OSTaskQuery (prio=4 '\004', p_task_data=0x20006508 <USBH_HUB_Arr+1324>)
at uC-OS2/Source/os_task.c:1100
#22 0x00000000 in ?? ()
I've created gists for dumps of USBH_HUB_Arr and USBH_Host in case these are useful.
Basically, USBH_HUB_EventProcess calls USBH_DevDisconn, which calls USBH_ClassDrvDisconn, which for some reason recurses into USBH_HUB_Disconn while iterating the ports and ends up with the same hub pointer. In frame 19 (USBH_HUB_EventProcess), p_hub_dev is 0x20006504, which corresponds to USBH_HUB_Arr[5]. This hub's DevPtrList[0] is 0x20005910, which corresponds to USBH_Host.DevList[8]. This device has a ClassDevPtr that points back at USBH_HUB_Arr[5], which gets passed to USBH_HUB_Disconn and via USBH_HUB_Uninit back to USBH_DevDisconn, which then repeats the same cycle, recursing until it falls off the end of the stack frame.
This is already strange by itself and might point to a different bug than the one I'm about to describe.
What really made my me scratch my head was that the failure happened relatively reliably after a certain number of reconnects. So I incrementally instrumented the code with more logging and at some point saw that the root hub's reference counter field (RefCnt) was slowly incrementing. An increment happened on average every 8 reconnects. So it took some time until the RefCnt overflowed from 255 to 0, and the infinite recursion above seems to be a result of that. I don't fully understand what's going on, but I'm assuming that some part of the code is under the impression that the root hub is no longer in use and puts it back into the free list, whereas some other part of the code is still actively using it.
In any case, I was now looking further into these unexpected increments of the reference counter, and these are actually quite easy to explain. When a hardware interrupt occurs, an event is queued using USBH_HUB_RH_Event, and this event is later processed asynchronously. Now, USBH_HUB_RH_Event increments the associated hub's RefCnt field and then adds it to the event list formed by USBH_HUB_HeadPtr and USBH_HUB_TailPtr. However, it seems to assume that it won't be called more than once for the same hub before that hub has been removed from the event list again. So if USBH_HUB_RH_Event is called twice for the same hub, it'll increment the RefCnt for the hub twice and set USBH_HUB_TailPtr->NxtPtr to point to the hub; in case of just a single root hub, that means the hub is now linking to itself. Due to the way USBH_HUB_EventProcess checks the queue, this circular reference doesn't cause any immediate issues, though (at least not with a single root hub; things will likely go very wrong as soon as a second root hub is involved). USBH_HUB_EventProcess will only see a single event and it will consequently only decrement the RefCnt once.
The above happens when disconnect/connect hardware interrupts arrive in quick succession. It seems to me that USBH_HUB_RH_Event is probably even right to assume it can only be called once for the same hub as there is a call p_rh_drv_api->IntDis(p_hc_drv) right before the RefCnt increment. I'm assuming this is supposed to disable root hub hardware interrupts until the current event has been asynchronously processed. However, at least for the ATSAMx HCD, neither IntDis nor InEn do anything:
Instead, quite a bit of interrupt enabling/disabling is going on in USBH_ATSAMX_ISR_Handler.
At this point, while I seem to understand roughly what's going on, I'm unsure about how this should be fixed. I didn't find any documented rationale for why IntEn and IntDis are unimplemented. At least the implementation for Renesas_USBHS seems to also leave them empty.
From looking at the rest of the code, it seems most likely that a fix for this issue should prevent new interrupts from being queued while there are still pending events in the queue.
The text was updated successfully, but these errors were encountered:
While integrating uC-USBH using the ATSAMx HCD for a client, I was running some tests where an external hub with devices on two ports was continuously disconnected from and reconnected to the root hub every few seconds. After a few hours (about 2000 disconnects/reconnects, give or take), I found that the system would crash and run into a reset.
So I turned on logging and got (again, after a few hours):
The
USBH_HUB_ISR() fails
was unique and seemed to be where things started to go wrong, so I set a breakpoint:Then I stepped into
USBH_HUB_Disconn
:And this is where the code goes into infinite recursion:
I've created gists for dumps of
USBH_HUB_Arr
andUSBH_Host
in case these are useful.Basically,
USBH_HUB_EventProcess
callsUSBH_DevDisconn
, which callsUSBH_ClassDrvDisconn
, which for some reason recurses intoUSBH_HUB_Disconn
while iterating the ports and ends up with the same hub pointer. In frame 19 (USBH_HUB_EventProcess
),p_hub_dev
is0x20006504
, which corresponds toUSBH_HUB_Arr[5]
. This hub'sDevPtrList[0]
is0x20005910
, which corresponds toUSBH_Host.DevList[8]
. This device has aClassDevPtr
that points back atUSBH_HUB_Arr[5]
, which gets passed toUSBH_HUB_Disconn
and viaUSBH_HUB_Uninit
back toUSBH_DevDisconn
, which then repeats the same cycle, recursing until it falls off the end of the stack frame.This is already strange by itself and might point to a different bug than the one I'm about to describe.
What really made my me scratch my head was that the failure happened relatively reliably after a certain number of reconnects. So I incrementally instrumented the code with more logging and at some point saw that the root hub's reference counter field (
RefCnt
) was slowly incrementing. An increment happened on average every 8 reconnects. So it took some time until theRefCnt
overflowed from 255 to 0, and the infinite recursion above seems to be a result of that. I don't fully understand what's going on, but I'm assuming that some part of the code is under the impression that the root hub is no longer in use and puts it back into the free list, whereas some other part of the code is still actively using it.In any case, I was now looking further into these unexpected increments of the reference counter, and these are actually quite easy to explain. When a hardware interrupt occurs, an event is queued using
USBH_HUB_RH_Event
, and this event is later processed asynchronously. Now,USBH_HUB_RH_Event
increments the associated hub'sRefCnt
field and then adds it to the event list formed byUSBH_HUB_HeadPtr
andUSBH_HUB_TailPtr
. However, it seems to assume that it won't be called more than once for the same hub before that hub has been removed from the event list again. So ifUSBH_HUB_RH_Event
is called twice for the same hub, it'll increment theRefCnt
for the hub twice and setUSBH_HUB_TailPtr->NxtPtr
to point to the hub; in case of just a single root hub, that means the hub is now linking to itself. Due to the wayUSBH_HUB_EventProcess
checks the queue, this circular reference doesn't cause any immediate issues, though (at least not with a single root hub; things will likely go very wrong as soon as a second root hub is involved).USBH_HUB_EventProcess
will only see a single event and it will consequently only decrement theRefCnt
once.The above happens when disconnect/connect hardware interrupts arrive in quick succession. It seems to me that
USBH_HUB_RH_Event
is probably even right to assume it can only be called once for the same hub as there is a callp_rh_drv_api->IntDis(p_hc_drv)
right before theRefCnt
increment. I'm assuming this is supposed to disable root hub hardware interrupts until the current event has been asynchronously processed. However, at least for the ATSAMx HCD, neitherIntDis
norInEn
do anything:Instead, quite a bit of interrupt enabling/disabling is going on in
USBH_ATSAMX_ISR_Handler
.At this point, while I seem to understand roughly what's going on, I'm unsure about how this should be fixed. I didn't find any documented rationale for why
IntEn
andIntDis
are unimplemented. At least the implementation forRenesas_USBHS
seems to also leave them empty.From looking at the rest of the code, it seems most likely that a fix for this issue should prevent new interrupts from being queued while there are still pending events in the queue.
The text was updated successfully, but these errors were encountered: