fix(ws_transport): correct split header bytes (IDFGH-13859) #14706

bryghtlabs-richard · 2024-10-10T22:36:53Z

Not ready to submit, prototype fix only. Posting for discussion.

Description

Prototype workaround for the WS parser failing when WS frame headers get split across reads under a very specific timing and spacing pattern.

This fix is functional, but may rarely exceed the expected timeout.

I'm not sure if this is the right layer to fix it. Perhaps the underlying TCP and TLS transports should be fixed instead, so that they always wait up to the max timeout before returning fewer bytes than requested?
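One way the timeout overrun mentioned above could be bounded is to track a single deadline across retries instead of handing the full timeout to every retry. The following is only a sketch under assumptions: `timed_read_fn`, `now_ms_fn`, `read_exact_deadline`, and the `sim` helpers are hypothetical names for illustration and are not taken from transport_ws.c or the ESP-IDF API. It also assumes a non-negative timeout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lower-layer read: waits up to timeout_ms and returns the
 * number of bytes read (possibly fewer than len), 0 on timeout, <0 on error. */
typedef int (*timed_read_fn)(void *ctx, char *buf, int len, int timeout_ms);

/* Hypothetical monotonic millisecond clock supplied by the caller. */
typedef int64_t (*now_ms_fn)(void *ctx);

/* Retry short reads until len bytes have arrived, but never wait longer
 * than timeout_ms in total. Returns len on success, 0 once the total
 * budget is used up, or the lower layer's negative error code. */
int read_exact_deadline(timed_read_fn rd, now_ms_fn now, void *ctx,
                        char *buf, int len, int timeout_ms)
{
    assert(rd && now && buf && len >= 0 && timeout_ms >= 0);
    const int64_t deadline = now(ctx) + timeout_ms;
    int got = 0;
    while (got < len) {
        int64_t left = deadline - now(ctx);
        if (left <= 0) {
            return 0;                 /* total budget used up */
        }
        int r = rd(ctx, buf + got, len - got, (int)left);
        if (r <= 0) {
            return r;                 /* timeout or error from below */
        }
        got += r;
    }
    return got;
}

/* Simulated transport for demonstration: delivers one byte per call and
 * advances a fake clock by step_ms each time. */
struct sim { int64_t t_ms; int step_ms; };

int64_t sim_now(void *ctx)
{
    return ((struct sim *)ctx)->t_ms;
}

int sim_read(void *ctx, char *buf, int len, int timeout_ms)
{
    struct sim *s = ctx;
    (void)len;
    if (s->step_ms > timeout_ms) {    /* data would arrive too late */
        s->t_ms += timeout_ms;
        return 0;
    }
    s->t_ms += s->step_ms;
    buf[0] = 'x';
    return 1;
}
```

With this bookkeeping, each retry only gets the time that is left, so the call as a whole stays within the caller's timeout. A loop that passes the full timeout to every retry can wait up to one timeout per retry in the worst case.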

Testing

To test, a printout was added when fallback is triggered. When we tried to make this failure more consistent, we could only get it to around 10% of the time. It's network-timing and traffic-burst specific.

Checklist

Before submitting a Pull Request, please ensure the following:

🚨 This PR does not introduce breaking changes.
All CI checks (GH Actions) pass.
Documentation is updated as needed.
Tests are updated or added as necessary.
Code is well-commented, especially in complex areas.
Git history is clean — commits are squashed to the minimum necessary.

github-actions · 2024-10-10T22:37:39Z

	Messages
📖	🎉 Good Job! All checks are passing!

👋 Hello bryghtlabs-richard, we appreciate your contribution to this project!

📘 Please review the project's Contributions Guide for key guidelines on code, documentation, testing, and more.

🖊️ Please also make sure you have read and signed the Contributor License Agreement for this project.

Click to see more instructions ...

This automated output is generated by the PR linter DangerJS, which checks if your Pull Request meets the project's requirements and helps you fix potential issues.

DangerJS is triggered with each push event to a Pull Request and modifies the contents of this comment.

Please consider the following:
- Danger mainly focuses on the PR structure and formatting and can't understand the meaning behind your code or changes.
- Danger is not a substitute for human code reviews; it's still important to request a code review from your colleagues.
- To manually retry these Danger checks, please navigate to the Actions tab and re-run the last Danger workflow.

Review and merge process you can expect ...

We do welcome contributions in the form of bug reports, feature requests and pull requests via this public GitHub repository.

This GitHub project is a public mirror of our internal git repository.

1. An internal issue has been created for the PR, and we assign it to the relevant engineer.
2. They review the PR and either approve it or ask you for changes or clarifications.
3. Once the GitHub PR is approved, we synchronize it into our internal git repository.
4. In the internal git repository we do the final review, collect approvals from core owners and make sure all the automated tests are passing.
- At this point we may do some adjustments to the proposed change, or extend it by adding tests or documentation.
5. If the change is approved and passes the tests it is merged into the default branch.
6. On the next sync from the internal git repository, the merged change will appear in this public GitHub repository.

Generated by 🚫 dangerJS against 093ea00

filzek · 2024-10-15T19:03:20Z

I occasionally encounter an issue with the websocket where a semaphore is halted for 10 seconds. It seems that the underlying transport_ws layer is experiencing performance problems and sometimes gets stuck. Additionally, there are other issues in the transport layer that close the connection with errno 118/119. We haven't been able to solve these problems yet.

We need to conduct more tests. We will try to check and print the payload sizes to determine what is causing the issue.

bryghtlabs-richard · 2024-10-15T19:19:22Z

Could be related? I'm not familiar with when that semaphore is held. What file is that in?

The problem I've run into is whenever the next layer down's read function returns early (before the timeout, with fewer bytes than expected). When that happens, transport_ws's parser gets off by a few bytes, then may misinterpret the payload as a header. Worst case, the following payload starts with byte 126 or 127, which are the RFC 6455 codes for "extended payload length". When that happens, transport_ws may try to read a large (125B - 4GB) frame; if the other side isn't sending a lot of data, the read may take far too long to complete.
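To illustrate the misparse described above, here is a minimal sketch of RFC 6455 payload-length decoding. The function name `ws_decode_payload_len` is hypothetical and is not the parser in transport_ws.c. It only shows how the 7-bit length field selects an extended length when it holds 126 or 127.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of transport_ws.c: decode the payload
 * length of an RFC 6455 frame header held in buf[0..n).
 * Returns the number of header bytes consumed (2, 4 or 10, mask key not
 * included), or 0 if more bytes are needed. */
size_t ws_decode_payload_len(const uint8_t *buf, size_t n, uint64_t *payload_len)
{
    assert(buf != NULL && payload_len != NULL);
    if (n < 2) {
        return 0;
    }
    uint8_t len7 = buf[1] & 0x7F;          /* low 7 bits; top bit is MASK */
    if (len7 < 126) {
        *payload_len = len7;               /* length fits in the base header */
        return 2;
    }
    size_t ext = (len7 == 126) ? 2 : 8;    /* 16-bit or 64-bit extended length */
    if (n < 2 + ext) {
        return 0;
    }
    uint64_t len = 0;
    for (size_t i = 0; i < ext; i++) {
        len = (len << 8) | buf[2 + i];     /* network byte order (big-endian) */
    }
    *payload_len = len;
    return 2 + ext;
}
```

As a made-up example, if the ASCII payload bytes `'4'`, `'~'`, `'A'`, `'B'` are parsed as a header, `'~'` is byte 126, so the next two bytes are taken as a 16-bit length and the function reports a 16706-byte (0x4142) frame that the peer never sent.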

bryghtlabs-richard · 2024-10-15T19:26:29Z

Are you using esp_websocket_client? It looks like that lock would be held across the call to esp_websocket_client_recv(), which can hang while parsing garbage. Are you using a binary payload by chance? That's likely to be worse - we're sending EngineIO, so our misinterpreted payload-as-header bytes are usually ASCII numbers.

filzek · 2024-10-15T20:00:01Z

> Are you using esp_websocket_client? It looks like that lock would be held across the call to esp_websocket_client_recv(), which can hang while parsing garbage. Are you using a binary payload by chance? That's likely to be worse - we're sending EngineIO, so our misinterpreted payload-as-header bytes are usually ASCII numbers.

We are using text for both sending and receiving. The problem sometimes occurs during both receiving and sending, and we haven't been able to identify the source to reproduce it consistently. We can reproduce it by dropping the Wi-Fi connection and through other external actions, but this doesn't reveal the origin of the problem.

CLAassistant · 2024-10-29T22:35:04Z

All committers have signed the CLA.

david-cermak

LGTM, otherwise. Thanks for the fixes!

components/tcp_transport/transport_ws.c

When the underlying transport returns header, length, or mask bytes early, call the underlying transport again. This keeps the WS parser from getting offset when the server sends a burst of frames in which the last WS header is split across packet boundaries, so fewer bytes than needed may be available.
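The approach in this description can be sketched as a "read exactly N bytes" loop over a lower layer that may return short reads. This is an illustration only: `read_fn_t`, `read_exact`, and the `chunked_src` fake transport are hypothetical names and do not reflect the actual change to transport_ws.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature standing in for the lower layer's read call:
 * returns bytes read (possibly fewer than len), 0 on timeout, <0 on error. */
typedef int (*read_fn_t)(void *ctx, char *buf, int len);

/* Keep calling the lower layer until exactly len bytes have arrived.
 * Returns len on success, or the lower layer's 0 / negative result. */
int read_exact(read_fn_t read_fn, void *ctx, char *buf, int len)
{
    assert(read_fn != NULL && buf != NULL && len >= 0);
    int got = 0;
    while (got < len) {
        int r = read_fn(ctx, buf + got, len - got);
        if (r <= 0) {
            return r;                 /* timeout or error: give up */
        }
        got += r;                     /* short read: ask again for the rest */
    }
    return got;
}

/* Fake lower layer for demonstration: hands out at most `chunk` bytes per
 * call, as if the header were split across packet boundaries. */
struct chunked_src { const char *data; int remaining; int chunk; };

int chunked_read(void *ctx, char *buf, int len)
{
    struct chunked_src *s = ctx;
    int n = len < s->chunk ? len : s->chunk;
    if (n > s->remaining) {
        n = s->remaining;
    }
    memcpy(buf, s->data, (size_t)n);
    s->data += n;
    s->remaining -= n;
    return n;                         /* 0 once the source is drained */
}
```

A parser that reads its 2-byte header, extended length, and mask key through such a helper always sees complete fields, so a split header can no longer shift the parse position.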

david-cermak

Thanks for the updates, LGTM!

espressif-bot added the Status: Opened label Oct 10, 2024

github-actions bot changed the title ~~fix(ws_transport): correct split header bytes~~ fix(ws_transport): correct split header bytes (IDFGH-13859) Oct 10, 2024

espressif-bot assigned euripedesrocha Oct 11, 2024

bryghtlabs-richard mentioned this pull request Oct 25, 2024

[websocket] parser gets out of sync when header/size/mask not received contiguously (IDFGH-13951) espressif/esp-protocols#679

Open

3 tasks

bryghtlabs-richard force-pushed the fix/ws-incontiguous-bytestream branch from cde804b to 46bacc3 October 29, 2024 22:30

bryghtlabs-richard force-pushed the fix/ws-incontiguous-bytestream branch from 655317f to 46bacc3 October 29, 2024 22:36

david-cermak reviewed Oct 30, 2024

bryghtlabs-richard force-pushed the fix/ws-incontiguous-bytestream branch from 46bacc3 to 093ea00 October 30, 2024 13:39

bryghtlabs-richard mentioned this pull request Oct 30, 2024

Extend websocket error logs to include transport failure reason (IDFGH-13978) #14806

Open

6 tasks

david-cermak approved these changes Oct 30, 2024

espressif-bot added the Status: Done and Resolution: NA labels and removed the Status: Opened label Nov 13, 2024
