make signal handler less greedy: only handle signals from expected memory ranges #23

Merged
spoonincode merged 4 commits into main from limit_signal_handler
Mar 22, 2024

spoonincode
Member

EOS VM uses page protection for guarding memory accesses and interrupting execution. Currently, when EOS VM starts execution it sets up its signal handler to treat any fault that occurs before execution completes as a WASM access violation error. This means faults that occur inside WASM execution and faults that occur in any host function the WASM calls are both reported and treated as a recoverable access violation.

Because EOS VM captures SIGBUS (wholly unnecessary on Linux, but needed on macOS), a substantial number of unrecoverable system errors occurring in host functions (rare corner cases, but still very real) are instead treated as a recoverable access violation, as if the WASM had simply accessed out-of-bounds memory in its sandbox. These can include an IO error on the DB file, an IO error when swapping, running out of disk space, an unrecoverable ECC error, running out of free huge pages (in heap mode with huge pages enabled), and maybe more. These unrecoverable system errors should not be handled as a recoverable WASM memory violation.

Removing SIGBUS from being handled on Linux would generally resolve this problem, though if a host function had a defect causing a SIGSEGV it would fall into the same improper handling. So for a more thorough solution, the signal handler now only handles SIGSEGV/SIGBUS/SIGFPE on given memory ranges: the WASM code and the WASM memory. Faults that occur outside these ranges are forwarded to the next handler (or kill the application if EOS VM's handler is the last in the chain). This behavior is similar to how EOS VM OC's handler operates. I've also removed SIGBUS from being handled on Linux entirely to resolve the exceptionally unlikely scenario of catching an ECC failure inside of WASM memory.
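
For illustration, the range check and forwarding described above could look roughly like the sketch below. This is not the code from this PR: `mem_range`, `g_code_range`, `g_memory_range`, `g_recover_point`, `g_prev`, and the function names are all hypothetical, and it assumes a POSIX system where `si_addr` carries the faulting address (for SIGFPE, the address of the faulting instruction).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <setjmp.h>
#include <signal.h>

// Hypothetical description of one region the handler is responsible for.
struct mem_range {
    std::uintptr_t begin;
    std::size_t    size;

    bool contains(std::uintptr_t addr) const {
        // Written this way to avoid overflow in begin + size.
        return addr >= begin && addr - begin < size;
    }
};

// Hypothetical per-execution state, set before entering WASM.
mem_range        g_code_range{0, 0};    // generated WASM code
mem_range        g_memory_range{0, 0};  // WASM linear memory plus guard pages
sigjmp_buf       g_recover_point;       // filled by sigsetjmp before execution
const int        kMaxSignal = 64;
struct sigaction g_prev[kMaxSignal];    // previously installed handlers

// SIGBUS is only handled where it is needed (macOS in this sketch).
#if defined(__APPLE__)
const int kHandledSignals[] = {SIGSEGV, SIGBUS, SIGFPE};
#else
const int kHandledSignals[] = {SIGSEGV, SIGFPE};
#endif

// True only for faults at addresses the sandbox owns.
bool is_sandbox_fault(std::uintptr_t addr) {
    return g_code_range.contains(addr) || g_memory_range.contains(addr);
}

// Hand the signal to whatever was installed before this handler.
void forward_signal(int sig, siginfo_t* info, void* uctx) {
    const struct sigaction& prev = g_prev[sig];
    if (prev.sa_flags & SA_SIGINFO) {
        prev.sa_sigaction(sig, info, uctx);
    } else if (prev.sa_handler == SIG_DFL) {
        // Restore the default disposition and return: the faulting
        // instruction re-executes and the process is terminated.
        sigaction(sig, &prev, nullptr);
    } else if (prev.sa_handler != SIG_IGN) {
        prev.sa_handler(sig);
    }
}

extern "C" void sandbox_signal_handler(int sig, siginfo_t* info, void* uctx) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(info->si_addr);
    if (sig > 0 && sig < kMaxSignal && is_sandbox_fault(addr)) {
        // Recoverable WASM fault: jump back to the recovery point.
        siglongjmp(g_recover_point, sig);
    }
    if (sig > 0 && sig < kMaxSignal) {
        forward_signal(sig, info, uctx);  // not ours: let the next handler decide
    }
}

// Install the handler for each handled signal, remembering the previous one.
bool install_sandbox_handler() {
    struct sigaction sa{};
    sa.sa_sigaction = sandbox_signal_handler;
    sa.sa_flags     = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    for (int sig : kHandledSignals) {
        if (sigaction(sig, &sa, &g_prev[sig]) != 0) return false;
    }
    return true;
}
```

The handler body does nothing beyond an address comparison before either jumping out or forwarding, which keeps it close to trivial. A caller would be expected to set the two ranges and call `sigsetjmp(g_recover_point, 1)` before entering WASM, so that the signal mask is restored on recovery.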

Of course, this means that if one of the above system errors occurs, nodeos will now simply be killed, whereas before it would potentially get stuck in some wedged state that was still cleanly stoppable. While that might sound bad, it's a good thing: we should only be recovering from errors we know we can properly recover from.

This behavior is one theory for the cause of AntelopeIO/leap#2242: some fault is masquerading as an access violation due to the current greediness of the handlers.

since we're going to longjmp out of this function, probably best to stay as trivial as possible
@spoonincode spoonincode merged commit aa8bd0a into main Mar 22, 2024
10 checks passed
@spoonincode spoonincode deleted the limit_signal_handler branch March 22, 2024 15:03