Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

athenad memory leak #34079

Closed
sshane opened this issue Nov 21, 2024 · 10 comments · Fixed by #34101
Closed

athenad memory leak #34079

sshane opened this issue Nov 21, 2024 · 10 comments · Fixed by #34101
Labels
Milestone

Comments

@sshane
Copy link
Contributor

sshane commented Nov 21, 2024

It was using 15% of memory, causing low memory events. I had not rebooted the device in a few days iirc and had uploaded many rlogs.

I tried to attach with memray, but instead the process just crashed so I didn't get any information about this. @deanlee could this be similar to the leak you found in athena previously?

@sshane sshane added the bug label Nov 21, 2024
@sshane sshane added this to the 0.9.8 milestone Nov 21, 2024
@deanlee
Copy link
Contributor

deanlee commented Nov 22, 2024

It's hard to identify the exact issue, as the code is unclear and difficult to understand. However, multiple factors could be causing the resource leak. One possibility is that the athenad's implementation might cause garbage collector to accumulate objects over time, leading to memory issues during long runs.
additionally, thread race conditions related to queue and other global variable (like cur_upload_items, cancelled_uploads,...) usage, as well as bad practices like not calling task_done() after queue.get(), could also contribute to issues.

@deanlee
Copy link
Contributor

deanlee commented Nov 23, 2024

The following issues and fixes address potential resource leaks and long-term accumulation risks:

  1. Memory Leaks in Python requests: A common issue occurs when requests keep references to response objects in memory unnecessarily, leading to memory leaks by preventing garbage collection. athenad: fix memory leak by closing Response objects #34101
  2. Unclosed WebSocket: Not closing the WebSocket after a main loop restart can cause resource leaks by keeping connections open. athenad: close websocket before starting next loop iteration #34085
  3. Thread Safety in Upload Handling: Lack of proper synchronization for upload_queue, cancelled_uploads, and cur_upload_items can lead to race conditions, resulting in thread termination, main loop restarts, and inconsistent or corrupted data, as well as queue bloat. athenad: fix thread safety issues in upload handing #34084
  4. Thread Safety in ws_manage: The Python socket object is not thread-safe. A mutex should be used to protect socket.setoption while ws_send or ws_recv are sending or receiving messages, preventing race conditions and ensuring safe multithreaded operations.

@sshane
Copy link
Contributor Author

sshane commented Dec 5, 2024

This is still happening after uploading files

@sshane sshane reopened this Dec 5, 2024
@sshane
Copy link
Contributor Author

sshane commented Dec 5, 2024

I see this after uploading rlogs:

lock                            120       +47
cell                            828       +47
builtin_function_or_method     1685       +43
list                           1776       +38
tuple                          4715       +30
ReferenceType                  2736       +30
Event                            75       +15
socket                           24       +15
Condition                        54       +14
deque                            62       +14
method                          109       +13
function                       7974       +13
SSLContext                       14        +8
WebSocket                        13        +8
sock_opt                         13        +8
frame_buffer                     13        +8
continuous_frame                 13        +8
SplitResult                      14        +8
SSLSocket                        14        +8
_SSLSocket                       14        +8

@deanlee
Copy link
Contributor

deanlee commented Dec 6, 2024

The issue may not just be with the upload process—WebSocket instance growth could indicate deeper resource management problems.

@deanlee
Copy link
Contributor

deanlee commented Dec 6, 2024

@sshane: Do we really need the ws_manage thread? It doesn’t handle exceptions, and changing socket options while the WebSocket is sending or receiving could cause race conditions or undefined behavior. If socket options need to be set, a better approach would be to close the connection, reconnect, and configure the socket before restarting ws_recv/ws_send threads.Also, storing sock = ws.sock at the start of the thread and assuming it's valid throughout the loop may not be safe.

It may not be directly related to this issue, but I recommend removing this thread until we have a more robust solution.

@deanlee
Copy link
Contributor

deanlee commented Dec 6, 2024

The current athenad implementation lacks protections for concurrent data access in a multi-threaded environment. Besides PR #34084, which fixes race conditions for global variables, the UploadQueueCache class has multi-threading issues. It is accessed in both uploadFilesToUrls and upload_handler, which run in different threads, but UploadQueueCache is not thread-safe.

@deanlee
Copy link
Contributor

deanlee commented Dec 7, 2024

@sshane: The current implementation does not preserve the order between received commands and their responses. Responses may be sent out of order.

Commands are placed in the recv_queue and processed by multiple jsonrpc_handler threads, with responses added to the send_queue after handling. Due to the multithreaded nature, the order of responses may differ from the order of received commands.

Is this behavior by design, or should it be considered an issue?

@sshane sshane changed the title Possible athenad memory leak athenad memory leak Dec 12, 2024
@jakethesnake420
Copy link
Contributor

jakethesnake420 commented Dec 13, 2024

@sshane: The current implementation does not preserve the order between received commands and their responses. Responses may be sent out of order.

Commands are placed in the recv_queue and processed by multiple jsonrpc_handler threads, with responses added to the send_queue after handling. Due to the multithreaded nature, the order of responses may differ from the order of received commands.

Is this behavior by design, or should it be considered an issue?

The order of responses does not matter if handled by the backend server correctly since the JsonRPC spec handles this. The 'requester' sends a request with a unique ID which could just be a ns timestamp for example. When the 'responder' (athena in this case) responds, it will also send the ID in the payload. https://www.jsonrpc.org/specification#response_object. This is done automatically so its opaque if you just look at the athena code.

Commas connect backend server should be doing the generation of pseudo unique ids and buffering the responses in a hashmap. Although as seen here https://github.com/commaai/connect/blob/79880f1203269eb145bc87863808e6cfca797ed1/src/actions/files.js#L147
it looks like the id is always zero so it might be worth double checking the backend code that it handles this properly.

@sshane
Copy link
Contributor Author

sshane commented Jan 17, 2025

Can't seem to reproduce this by re-uploading a large number of rlogs

@sshane sshane closed this as not planned Won't fix, can't repro, duplicate, stale Jan 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants