Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ws_transport): correct split header bytes (IDFGH-13859) #14706

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

bryghtlabs-richard
Copy link
Contributor

@bryghtlabs-richard bryghtlabs-richard commented Oct 10, 2024

Not ready to submit, prototype fix only. Posting for discussion.

Description

Prototype workaround for WS failing to parse when WS frames get split across a very specific timing and spacing pattern.

This fix is functional, but may rarely exceed the expected timeout.

I'm not sure if this is the right layer to fix it - perhaps the underlying TCP and TLS transports should be fixed instead to always wait up to the max timeout before returning failed bytes?

Related

Fixes #14704

Testing

To test, a printout was added when fallback is triggered. When we tried to make this failure more consistent, we could only get it to around 10% of the time. It's network-timing and traffic-burst specific.


Checklist

Before submitting a Pull Request, please ensure the following:

  • 🚨 This PR does not introduce breaking changes.
  • All CI checks (GH Actions) pass.
  • Documentation is updated as needed.
  • Tests are updated or added as necessary.
  • Code is well-commented, especially in complex areas.
  • Git history is clean — commits are squashed to the minimum necessary.

Copy link

github-actions bot commented Oct 10, 2024

Messages
📖 🎉 Good Job! All checks are passing!

👋 Hello bryghtlabs-richard, we appreciate your contribution to this project!


📘 Please review the project's Contributions Guide for key guidelines on code, documentation, testing, and more.

🖊️ Please also make sure you have read and signed the Contributor License Agreement for this project.

Click to see more instructions ...


This automated output is generated by the PR linter DangerJS, which checks if your Pull Request meets the project's requirements and helps you fix potential issues.

DangerJS is triggered with each push event to a Pull Request and modify the contents of this comment.

Please consider the following:
- Danger mainly focuses on the PR structure and formatting and can't understand the meaning behind your code or changes.
- Danger is not a substitute for human code reviews; it's still important to request a code review from your colleagues.
- To manually retry these Danger checks, please navigate to the Actions tab and re-run last Danger workflow.

Review and merge process you can expect ...


We do welcome contributions in the form of bug reports, feature requests and pull requests via this public GitHub repository.

This GitHub project is public mirror of our internal git repository

1. An internal issue has been created for the PR, we assign it to the relevant engineer.
2. They review the PR and either approve it or ask you for changes or clarifications.
3. Once the GitHub PR is approved, we synchronize it into our internal git repository.
4. In the internal git repository we do the final review, collect approvals from core owners and make sure all the automated tests are passing.
- At this point we may do some adjustments to the proposed change, or extend it by adding tests or documentation.
5. If the change is approved and passes the tests it is merged into the default branch.
5. On next sync from the internal git repository merged change will appear in this public GitHub repository.

Generated by 🚫 dangerJS against 093ea00

@espressif-bot espressif-bot added the Status: Opened Issue is new label Oct 10, 2024
@github-actions github-actions bot changed the title fix(ws_transport): correct split header bytes fix(ws_transport): correct split header bytes (IDFGH-13859) Oct 10, 2024
@filzek
Copy link

filzek commented Oct 15, 2024

I occasionally encounter an issue with the websocket where a semaphore is halted for 10 seconds. It seems that the underlying transport_ws layer is experiencing performance problems and sometimes gets stuck. Additionally, there are other issues in the transport layer that close the connection with errno 118/119. We haven't been able to solve these problems yet.

We need to conduct more tests. We will try to check and print the payload sizes to determine what is causing the issue.

@bryghtlabs-richard
Copy link
Contributor Author

Could be related? I'm not familiar with when that semaphore is held. What file is that in?

The problem I've run into is whenevr the next layer down's read function returns early(before the timeout with fewer bytes than expected). When that happens, transport_ws's parser gets off a few bytes, then may misinterpret the payload as header. Worst-case, the following payload starts with byte 126 or 127, which are RFC6455 codes for "extended payload length". When that happens, transport_ws may try to read a large(125B - 4GB) frame - if the other side isn't sending a lot of data, this may take approximately far too long for the read to complete.

@bryghtlabs-richard
Copy link
Contributor Author

Are you using esp_websocket_client? It looks like that lock would be held across the call to esp_websocket_client_recv(), which can hang while parsing garbage. Are you using a binary payload by chance? That's likely to be worse - we're sending EngineIO, so our misinterpreted payload-as-header bytes are usually ASCII numbers.

@filzek
Copy link

filzek commented Oct 15, 2024

Are you using esp_websocket_client? It looks like that lock would be held across the call to esp_websocket_client_recv(), which can hang while parsing garbage. Are you using a binary payload by chance? That's likely to be worse - we're sending EngineIO, so our misinterpreted payload-as-header bytes are usually ASCII numbers.

We are using text for both sending and receiving. The problem sometimes occurs during both receiving and sending, and we haven't been able to identify the source to reproduce it consistently. We can reproduce it by dropping the Wi-Fi connection and through other external actions, but this doesn't reveal the origin of the problem.

@CLAassistant
Copy link

CLAassistant commented Oct 29, 2024

CLA assistant check
All committers have signed the CLA.

Copy link
Collaborator

@david-cermak david-cermak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, otherwise. Thanks for the fixes!

components/tcp_transport/transport_ws.c Outdated Show resolved Hide resolved
When the underlying transport returns header,
length, or mask bytes early, again call the
underlying transport.

This solves the WS parser getting offset when
the server sends a burst of frames where the
last WS header is split across packet boundaries,
so fewer than the needed bytes may be available.
Copy link
Collaborator

@david-cermak david-cermak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, LGTM!

@espressif-bot espressif-bot added Status: Done Issue is done internally Resolution: NA Issue resolution is unavailable and removed Status: Opened Issue is new labels Nov 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Websocket parser fails when header/size/mask not received contiguously. (IDFGH-13857)
6 participants