Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move to a stack-overflow-resilient error handler #48

Merged
merged 2 commits into from
Nov 1, 2024

Conversation

hlef
Copy link
Collaborator

@hlef hlef commented Oct 29, 2024

The TCP/IP stack cannot currently recover from a crash due to a stack overflow in the TCP/IP compartment. This is due to a limitation of the implementation of the switcher, which cannot trigger the error handler on stack overflow. Further, the current implementation of the error handler runs at the point in the stack where the compartment crashed. This implies that the error handler runs on a potentially very small stack, making it possible that it will itself crash due to a stack overflow. This limitation is documented in #31.

This commit addresses these limitations using David's recently-merged stackless error handler mechanism: CHERIoT-Platform/cheriot-rtos#301.

We also address a bug in the reset, which happens when an error handler on a user thread triggers a crash of the IP thread. In this case, previously improperly tested, we fail to release a lock, resulting in a failure of the reset.

More information in each commit.

@hlef hlef requested a review from davidchisnall October 29, 2024 19:15
lib/tcpip/tcpip-internal.h Outdated Show resolved Hide resolved
lib/tcpip/tcpip-internal.h Outdated Show resolved Hide resolved
lib/tcpip/tcpip-internal.h Outdated Show resolved Hide resolved
lib/tcpip/tcpip_error_handler.h Show resolved Hide resolved
lib/tcpip/tcpip_error_handler.h Show resolved Hide resolved
hlef added 2 commits October 31, 2024 13:40
The TCP/IP stack cannot currently recover from a crash due to a stack
overflow in the TCP/IP compartment.  This is due to a limitation of the
implementation of the switcher, which cannot trigger the error handler
on stack overflow. Further, the current implementation of the error
handler runs at the point in the stack where the compartment crashed.
This implies that the error handler runs on a potentially very small
stack, making it possible that it will itself crash due to a stack
overflow.

This commit addresses these limitations using David's recently-merged
stackless error handler mechanism: CHERIoT-Platform/cheriot-rtos#301

The error handler is now declared at thread entry points (API entry
points and driver entry point for user threads, and IP thread start for
the IP thread). We can thus eliminate the big
`compartment_error_handler`.

The new error handler runs at the top of the stack, which eliminates all
issues mentioned earlier (provided the stack is large-enough, which
should be checked through `STACK_CHECK` - a separate problem).

We keep a single error handler for user threads and the IP thread
(`reset_network_stack_state`), differentiated through a `isIpThread`
argument. Though we could separate this into two distinct error handlers
for user threads and for the IP thread, it would result in a lot of
duplicated error-prone code.

Note that we do not "reinstall context" or "force unwind" anymore in the
error handler. This is all handled under the hood by the new error
handler, and we simply exit the error handler once done, which happens
to be the place where we would leave the compartment in the force unwind
case, or the place where we would reinstall the context for the IP
thread.

Signed-off-by: Hugo Lefeuvre <[email protected]>
If a user thread crashes and the error handler of the user thread
triggers a crash in the IP thread, the current code will block forever
because the IP thread does not release the `ipThreadLockState`.

This problem stems from the fact that we only tested user thread crashes
in a setting where the IP thread cleanly returned from `prvIPTask`
(without a crash, i.e., without triggering the IP thread error handler).

Address this by releasing the `ipThreadLockState` in both cases. Add
comments to explain.

Note: this is a pre-existing bug to the stack-overflow-resilient error
handler, so separate commit.

Signed-off-by: Hugo Lefeuvre <[email protected]>
@hlef hlef force-pushed the hlefeuvre/stack-overflow-resilient-error-handler branch from c7e8a22 to 6f0296b Compare October 31, 2024 20:40
@hlef
Copy link
Collaborator Author

hlef commented Oct 31, 2024

Addressed the comments. Planning to merge once I re-tested the PR later today (no access to the FPGA right now).

@hlef hlef merged commit 45f6872 into main Nov 1, 2024
2 checks passed
@hlef hlef deleted the hlefeuvre/stack-overflow-resilient-error-handler branch November 1, 2024 00:01
@hlef hlef mentioned this pull request Nov 1, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants