Leader runs out of buffer #8797

EricSTMicro · 2023-02-23T08:24:36Z


Hello,

I am facing an issue concerning the platform we are developing.

On one side I have an FTD (Leader); on the other side I have 10 children.
When I start all of them at pretty much the same time, the Leader quickly runs out of buffer. I get the following messages:

[0000186327] [REGION UNDEF] [I] MeshForwarder-: Received IPv6 UDP msg, len:84, chksum:77fc, ecn:no, from:2ad3f2898261c551, sec:no, prio:net, rss:-40.0
[0000186328] [REGION UNDEF] [I] MeshForwarder-:     src:[fe80:0:0:0:28d3:f289:8261:c551]:19788
[0000186329] [REGION UNDEF] [I] MeshForwarder-:     dst:[ff02:0:0:0:0:0:0:2]:19788
[0000186332] [REGION UNDEF] [I] Mle-----------: Receive Parent Request (fe80:0:0:0:28d3:f289:8261:c551)
[0000186333] [REGION UNDEF] [I] Message-------: No available message buffer
[0000186334] [REGION UNDEF] [W] Mle-----------: Failed to send Parent Response: NoBufs

or

[0000185136] [REGION UNDEF] [I] MeshForwarder-: Prepping indir tx IPv6 UDP msg, len:83, chksum:5c6c, ecn:no, to:0xa41d, sec:yes, prio:net
[0000185137] [REGION UNDEF] [N] MeshForwarder-: Evicting IPv6 UDP msg, len:83, chksum:3477, ecn:no, sec:yes, error:NoBufs, prio:net
[0000185139] [REGION UNDEF] [N] MeshForwarder-: src:[fe80:0:0:0:2456:4fa5:b12a:173f]:19788
[0000185140] [REGION UNDEF] [N] MeshForwarder-: dst:[fe80:0:0:0:6478:529c:b841:3bf]:19788

The CLI command bufferinfo returns the following:

bufferinfo
total: 128
free: 0
6lo send: 0 0 0
6lo reas: 0 0 0
ip6: 0 0 0
mpl: 0 0 0
mle: 64 128 6147
coap: 0 0 0
coap secure: 0 0 0
application coap: 0 0 0
Done

So all the buffers are indeed taken, and the Leader can neither store new incoming messages nor send any messages.
But this seems to be the consequence rather than the root cause. By the time the free buffer count starts dropping to 0, the Leader has already stopped answering solicitations (parent requests, etc.) from the end devices.
I think that because frames are not handled, they remain in the buffers; the end devices keep sending frames, and the buffers fill up with messages.
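To see at a glance which queue is holding the buffers, the bufferinfo dump can be scanned programmatically. The following is a minimal sketch, not part of OpenThread: the names QueueInfo, parse_queue_line and find_largest_queue are hypothetical, and it assumes the three numbers on each queue line are message count, buffers used and total bytes, which is how the OpenThread CLI documentation appears to describe them.

```c
#include <stdio.h>
#include <string.h>

/* One queue line from `bufferinfo`, e.g. "mle: 64 128 6147". */
typedef struct {
    char     name[32];
    unsigned messages; /* assumed meaning: messages in the queue     */
    unsigned buffers;  /* assumed meaning: buffers used by the queue */
    unsigned bytes;    /* assumed meaning: total bytes in the queue  */
} QueueInfo;

/* Parse "<name>: <a> <b> <c>". Returns 1 on success, 0 otherwise.
 * Lines such as "total: 128" or "Done" are rejected. */
static int parse_queue_line(const char *line, QueueInfo *out)
{
    const char *colon = strrchr(line, ':');
    size_t      len;

    if (colon == NULL) return 0;
    len = (size_t)(colon - line);
    if (len == 0 || len >= sizeof(out->name)) return 0;

    if (sscanf(colon + 1, "%u %u %u", &out->messages, &out->buffers, &out->bytes) != 3) return 0;

    memcpy(out->name, line, len);
    out->name[len] = '\0';
    return 1;
}

/* Scan a whole dump and store the queue using the most buffers in *worst.
 * Returns the number of queue lines that were parsed. */
static int find_largest_queue(const char *dump, QueueInfo *worst)
{
    char line[128];
    int  count = 0;

    memset(worst, 0, sizeof(*worst));

    while (*dump != '\0') {
        size_t    n    = strcspn(dump, "\n");
        size_t    copy = n < sizeof(line) - 1 ? n : sizeof(line) - 1;
        QueueInfo q;

        memcpy(line, dump, copy);
        line[copy] = '\0';

        if (parse_queue_line(line, &q)) {
            count++;
            if (q.buffers > worst->buffers) *worst = q;
        }

        dump += n;
        if (*dump == '\n') dump++;
    }
    return count;
}
```

Fed the dump above, this would single out the mle queue, which under the stated assumption holds 64 messages in all 128 buffers.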

Do you have any idea what could cause the device to suddenly stop handling incoming frames?

I first suspected the alarms, the buffer handling at platform level, the buffer size, or RAM corruption, but none of these seems to be the cause.
After the problem happens, if I do a thread stop, then ifconfig up and thread start, the buffers are freed but immediately fill up again with no Tx possible, so it doesn't solve the problem.

jwhui · 2023-02-23T17:23:13Z

Maintainer

It appears that MLE packets are getting stuck in the indirect transmission queue. There could be a number of things going on here. Some example possibilities:

Radio driver on Leader is not able to deliver indirect transmission.
MTD is not properly polling for indirect transmissions.

Complete logs from both the FTD and MTD should provide more visibility into the issue.

0 replies

EricSTMicro · 2023-02-24T14:33:00Z

Author

Hello Jonathan,

This issue has been reproduced on Nordic NRF52840 and Silabs BRD4166A, but not on NXP (which has a proprietary stack, if I'm right).

I will try to get logs from an end device.
The end devices are just sending MLE Parent Requests and then a data request every 500 ms.

I first thought it was an issue at platform level, but since I can reproduce it on other vendors' platforms, it seems more likely to be OpenThread related.

Also, I was able to narrow down the origin of the issue, which always appears after a Tx retry:

[0000112508] [REGION UNDEF] [I] Mac-----------: Frame tx attempt 1/16 failed, error:ChannelAccessFailure, len:83, seqnum:116, type:Data, src:7ec08793586ca402, dst:9a8db0f000f5c1
(0000000 > 0000)        [TXa:x74|00111000|..|x74,0014]  f2ad -> c1a3    (16)
[0000112524] [REGION UNDEF] [I] Mac-----------: Frame tx attempt 2/16 failed, error:NoAck, len:83, seqnum:116, type:Data, src:7ec08793586ca402, dst:9a8db0f000f5c1a3, sec:yes, ac
                        [RXa:x7b|B0(0/1)|IT:001|185,0]  f2ad -> a402    (16)
[0000112532] [REGION UNDEF] [I] MeshForwarder-: Received IPv6 UDP msg, len:83, chksum:7ace, ecn:no, from:52ef3b075e11dfcd, sec:yes, prio:net, rss:-44.0
                        [RXa:xba|B1(0/2)|IT:002|171,0]  f2ad -> a402    (16)
[0000112551] [REGION UNDEF] [I] MeshForwarder-: Received IPv6 UDP msg, len:83, chksum:5245, ecn:no, from:960d87b4c1b3cfca, sec:yes, prio:net, rss:-46.0
                        [RXa:xa2|B0(0/3)|IT:004|192,0]  f2ad -> a402    (16)

As you can see, it tried to send the frame twice but did not retry a third time. After that, the device can no longer transmit (except announce and advertisement messages, if those events occur before the buffers are full). The number of retries before the failure varies, but the problem always occurs right after a retry.
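When comparing many such captures, it can help to extract the attempt counter and error from each Mac log line automatically. Below is a minimal sketch that assumes the line format shown in the logs above ("Frame tx attempt N/M failed, error:..."); TxFailure and parse_tx_failure are hypothetical helper names, not OpenThread APIs.

```c
#include <stdio.h>
#include <string.h>

/* Fields extracted from one "Frame tx attempt" log line. */
typedef struct {
    unsigned attempt;     /* e.g. 1 in "attempt 1/16"     */
    unsigned maxAttempts; /* e.g. 16 in "attempt 1/16"    */
    char     error[32];   /* e.g. "ChannelAccessFailure"  */
} TxFailure;

/* Returns 1 if the line contains a failed tx attempt record, 0 otherwise. */
static int parse_tx_failure(const char *line, TxFailure *out)
{
    const char *p = strstr(line, "Frame tx attempt ");

    if (p == NULL) return 0;

    /* %31[^,] reads the error name up to the next comma (at most 31 chars). */
    return sscanf(p, "Frame tx attempt %u/%u failed, error:%31[^,]",
                  &out->attempt, &out->maxAttempts, out->error) == 3;
}
```

Tracking the highest attempt number seen per seqnum would then show where the retries stop. Note that this only sees failed attempts, since those are the ones that appear in the log excerpt.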

2 replies

abtink Feb 24, 2023
Collaborator

Just a quick point on retries and logs:

At INFO log level, frame tx failures are logged (with the error) but not successful frame tx.
DEBG log level provides more details (all tx/rx frames) but it is verbose.
So the log snippet indicates that tx attempts 1 and 2 failed, but attempt 3 was likely successful.

jwhui Feb 24, 2023
Maintainer

Can you provide the DEBG-level logs?

EricSTMicro · 2023-02-27T09:40:16Z

Author

Hello,

Unfortunately, the DEBG level causes our platform to crash: there is too much data to print.

I am also using a sniffer, and in the Wireshark logs I see that there was no 3rd try here. The 2nd frame attempt in the logs I gave is the last frame sent by our platform.

The issue seems to be the same as in this thread: https://groups.google.com/g/openthread-users/c/FGcMZSRjQfs

It gives me some pointers, but I'm not sure I understand the fix.

1 reply

jwhui Feb 27, 2023
Maintainer

If messages are getting stuck in the MLE queue for long periods of time, check to make sure Mle::HandleDelayedResponseTimer() is being called.
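The effect of a timer handler that never runs can be illustrated with a toy model. This is a simplified sketch and not OpenThread code: DelayedQueue, queue_schedule and queue_handle_timer are hypothetical names, and POOL_SIZE is an arbitrary small value standing in for the 128 buffers reported above. Time wrap-around is ignored for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

#define POOL_SIZE 8u /* hypothetical pool size, kept small for illustration */

/* A fixed pool of buffers holding responses that wait for their send time. */
typedef struct {
    uint32_t sendTime[POOL_SIZE];
    bool     inUse[POOL_SIZE];
    unsigned freeCount;
} DelayedQueue;

static void queue_init(DelayedQueue *q)
{
    for (unsigned i = 0; i < POOL_SIZE; i++) q->inUse[i] = false;
    q->freeCount = POOL_SIZE;
}

/* Queue a delayed response. Fails when no buffer is free,
 * which corresponds to the NoBufs errors in the logs. */
static bool queue_schedule(DelayedQueue *q, uint32_t sendTime)
{
    for (unsigned i = 0; i < POOL_SIZE; i++) {
        if (!q->inUse[i]) {
            q->inUse[i]    = true;
            q->sendTime[i] = sendTime;
            q->freeCount--;
            return true;
        }
    }
    return false;
}

/* Stand-in for the timer handler: "sends" every response that is due
 * and releases its buffer. Returns the number of responses sent. */
static unsigned queue_handle_timer(DelayedQueue *q, uint32_t now)
{
    unsigned sent = 0;

    for (unsigned i = 0; i < POOL_SIZE; i++) {
        if (q->inUse[i] && q->sendTime[i] <= now) {
            q->inUse[i] = false;
            q->freeCount++;
            sent++;
        }
    }
    return sent;
}
```

In this model, if queue_handle_timer() is never invoked (for example because the platform alarm never fires), every queue_schedule() call consumes a buffer until the pool is empty, and each later request fails. A single handler call after the due times releases the buffers again.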

EricSTMicro · 2023-02-28T17:01:15Z

Author

Hello Jonathan, Abtin,

We found the issue on our platform. It was related to the config OPENTHREAD_CONFIG_PLATFORM_USEC_TIMER_ENABLE, which caused trouble in the alarm management: there were too many CSMA backoff timers for our platform to handle under heavy traffic.

We had enabled it initially because OPENTHREAD_CONFIG_MAC_CSL_RECEIVER_ENABLE was enabled on our FTD, and the code below requires it in that case:

#if !OPENTHREAD_CONFIG_PLATFORM_USEC_TIMER_ENABLE
#error "Microsecond timer OPENTHREAD_CONFIG_PLATFORM_USEC_TIMER_ENABLE is required for "\
    "OPENTHREAD_CONFIG_MAC_CSL_RECEIVER_ENABLE"
#endif

But in fact the FTD doesn't need OPENTHREAD_CONFIG_MAC_CSL_RECEIVER_ENABLE, so we removed it as well, and now it is working nicely.
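For reference, the resulting FTD configuration might look like the fragment below. This is a sketch under assumptions: where these macros are set (a project-specific config header, as assumed here, or build-system flags) depends on the platform port, and the combined #if guard is only a paraphrase of the check quoted above, not a copy of the OpenThread source.

```c
/* Hypothetical project-specific OpenThread config header for the FTD (Leader). */

/* The CSL receiver is not needed on this FTD, so it is disabled. */
#define OPENTHREAD_CONFIG_MAC_CSL_RECEIVER_ENABLE 0

/* With CSL receiver off, the microsecond timer is no longer required. */
#define OPENTHREAD_CONFIG_PLATFORM_USEC_TIMER_ENABLE 0

/* Paraphrase of the dependency check: enabling the CSL receiver
 * without the microsecond timer would fail the build. */
#if OPENTHREAD_CONFIG_MAC_CSL_RECEIVER_ENABLE && !OPENTHREAD_CONFIG_PLATFORM_USEC_TIMER_ENABLE
#error "OPENTHREAD_CONFIG_PLATFORM_USEC_TIMER_ENABLE is required for OPENTHREAD_CONFIG_MAC_CSL_RECEIVER_ENABLE"
#endif
```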

So sorry for the disturbance and thank you for all your quick answers :)

Eric

1 reply

jwhui Feb 28, 2023
Maintainer

Glad to hear you resolved your issue!
