Avoid the issue of connections not being released. #2915

Open · wants to merge 1 commit into master
Conversation

dh-cloud (Contributor)

Enable TCP Keep-Alive to detect whether the remote peer is still reachable. Keep-Alive sends periodic probes to check that the peer is still active; if the peer is unreachable, the connection is closed, preventing resource leaks and avoiding the maintenance of stale connections.
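The change itself is a one-line Netty option (`b.childOption(ChannelOption.SO_KEEPALIVE, true)`). As a hedged, Netty-free sketch of the same socket option, the JDK exposes it directly on java.net.Socket; the class name KeepAliveDemo below is illustrative and not part of the PR:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    // Enables SO_KEEPALIVE on a freshly connected socket and reads the flag
    // back to confirm the OS accepted it. This is the per-connection effect of
    // Netty's childOption(ChannelOption.SO_KEEPALIVE, true) on accepted channels.
    static boolean enableKeepAlive() throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort())) {
            client.setKeepAlive(true);
            return client.getKeepAlive();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("SO_KEEPALIVE enabled: " + enableKeepAlive());
    }
}
```

Note that the flag only opts the connection into keep-alive: when and how often probes are actually sent is governed by OS-level settings (e.g. tcp_keepalive_time on Linux), a point raised later in this review.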


codecov-commenter commented Nov 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.85%. Comparing base (cfd6889) to head (977291b).
Report is 36 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2915      +/-   ##
============================================
- Coverage     77.87%   77.85%   -0.03%     
+ Complexity    13578    13532      -46     
============================================
  Files          1015     1015              
  Lines         59308    59231      -77     
  Branches       6835     6833       -2     
============================================
- Hits          46184    46112      -72     
+ Misses        10817    10812       -5     
  Partials       2307     2307              


// Keep-Alive sends periodic probes to check if the remote peer is still active.
// If the remote peer is unreachable, the connection will be closed, preventing resource leaks and
// avoiding the maintenance of stale connections.
b.childOption(ChannelOption.SO_KEEPALIVE, true);
andreachild (Contributor) commented Nov 27, 2024

Hello @dh-cloud, thanks for noticing this issue with potential resource leaks in the gremlin-server.

In the master branch we have switched to HTTP (away from the WebSockets used in 3.7), which means there will be many more short-lived connections compared to WebSockets' long-lived ones. When activity decreases we should aim to release resources, and thus not use the keep-alive option to keep connections active.

There is still a potential resource leak, but we can instead fix it using the existing idleConnectionTimeout setting. The issue is that during the switch from WebSockets to HTTP, we neglected to enable idle connection monitoring for HTTP 😢. What do you think about changing this PR to enable idle connection monitoring instead?

See the code references below for what I'm referring to.

dh-cloud (Contributor, Author)

@andreachild
Yes, I agree that we could also add handling for idle connections, but the original intent of this PR was to tear down broken connections, especially in scenarios like the following: a client establishes a long-lived connection with the gremlin-server, and then the client's network card is suddenly unplugged to simulate a client-side network failure. In this case, netstat -antp on the server still shows the connection as established, but pinging the client's address fails, and the connection is never destroyed. I have reproduced this scenario.

Feel free to let me know if you need any further adjustments!

andreachild (Contributor) commented Nov 28, 2024

I agree we should protect the server against broken connections. My concern is that enabling SO_KEEPALIVE could interfere with idle connection handling, if that is enabled, by keeping alive connections that should be considered idle and cleaned up. The SO_KEEPALIVE settings are OS-dependent, so when keep-alive pings kick in, and how often they are sent, would vary by OS.

For example, assume we fix HttpChannelizer to return true from boolean supportsIdleMonitor(), Settings.idleConnectionTimeout is 30 seconds, and the OS is configured to start sending keep-alive pings after 10 seconds, once per second:

  1. server receives a request and establishes a connection and sends a response
  2. client stops sending requests
  3. after 10 seconds the server starts sending keep alive pings which resets the idle connection time back to zero
  4. keep alive pings are sent every second
  5. after 30 seconds there is still no traffic but the connection is not considered idle because of the keep alive pings
  6. connection is not cleaned up as it is not considered idle

What do you think about one of the following solutions:

  • change SO_KEEPALIVE to be enabled only if idle connection detection is disabled
  • change idle connection detection to be enabled by default so that broken connections will be cleaned up as the idle timeout is reached

My preference would be the latter option as it would allow consistent control of the timeouts at the application level instead of relying on the varying OS-level settings.

dh-cloud (Contributor, Author)

I fully understand your point. For the requirement to disconnect abnormal connections, I believe we can consider two dimensions:

  1. When boolean supportsIdleMonitor() is false, i.e., idle connection detection is disabled, we can enable SO_KEEPALIVE as a fallback to prevent broken connections from staying open.
  2. When supportsIdleMonitor() is true, we need to improve the handling logic for idle connections.

In the TinkerPop master source code, I noticed that settings.idleConnectionTimeout and settings.keepAliveInterval both default to 0. As a result, the IdleStateHandler added in AbstractChannelizer has no effect.

if (supportsIdleMonitor()) {
    // Both settings default to 0 ms, which disables the corresponding idle checks.
    final int idleConnectionTimeout = (int) (settings.idleConnectionTimeout / 1000);
    final int keepAliveInterval = (int) (settings.keepAliveInterval / 1000);
    pipeline.addLast(new IdleStateHandler(idleConnectionTimeout, keepAliveInterval, 0));
}

Choosing specific values for idleConnectionTimeout and keepAliveInterval is a complex task with significant impact, as they control the idle threshold for connections. Different use cases have different requirements, so it is challenging to provide a one-size-fits-all recommendation.
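For illustration, a hedged sketch of what enabling these might look like in gremlin-server.yaml; the values below are arbitrary examples, not recommendations, and per the / 1000 conversion in the snippet above they are expressed in milliseconds:

```yaml
# Illustrative values only; both settings default to 0, i.e. disabled.
idleConnectionTimeout: 30000  # close a connection after 30 s of idleness
keepAliveInterval: 60000      # emit a server keep-alive ping after 60 s without writes
```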

For this PR, I force-pushed a new commit implementing the fallback measure only when boolean supportsIdleMonitor() is false.

I hope this helps! If you need any further adjustments, feel free to let me know.

@dh-cloud dh-cloud force-pushed the fix-20241125-ConnectionsKeepAlive branch from d30c72c to 8ddd859 on November 29, 2024 02:45
andreachild (Contributor)

VOTE +1

kenhuuu (Contributor) commented Dec 2, 2024

Based on information I've recently read (which could be inaccurate), I think the SO_KEEPALIVE option and the IdleTimeoutHandler can be used together. TCP probes shouldn't be surfaced by recv(), so they shouldn't reset the idle timer. If this can be confirmed (via a small test), then I think it's probably OK to just add the SO_KEEPALIVE option regardless of whether the idle monitor is supported.

That being said, I'm still a bit skeptical about the real-world usefulness of this PR, given that Netty can't control things like the keep-alive period; those require OS-level options to be set. This just ends up using OS defaults, which users may or may not find useful. So while I still support merging this PR, I think it will ultimately be replaced by the idle monitor.

Again, it would be great if you could confirm that setting this option doesn't affect the idle monitor, in which case you can just always have this option on.

dh-cloud (Contributor, Author) commented Dec 5, 2024

Regarding using SO_KEEPALIVE and IdleTimeoutHandler together, I conducted a small test. The results show that SO_KEEPALIVE does not interfere with IdleTimeoutHandler's idle timer. I also studied the IdleTimeoutHandler source code and the related TCP keep-alive documentation. Based on my understanding, and as you mentioned, SO_KEEPALIVE is a mechanism of the operating system's TCP stack; Netty at the application layer does not observe the probes, so they do not affect IdleTimeoutHandler. My test steps were as follows:

  1. In the gremlin-server.yaml configuration file, set settings.keepAliveInterval to 60 seconds.
  2. Create a client that connects to the gremlin-server using a long-lived connection, without actively releasing the connection, to simulate an idle connection scenario.
  3. Observe the intervals of the WRITER_IDLE event logs printed by the gremlin-server. Over an observation period of 8 hours, the WRITER_IDLE event was printed 60 * 8 = 480 times, at 60-second intervals.

In short, SO_KEEPALIVE does not affect IdleTimeoutHandler.
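The 8-hour test above against a real gremlin-server cannot be reproduced here, but the independence of the two mechanisms can be sketched with JDK sockets alone: enabling SO_KEEPALIVE on an otherwise silent connection does not stop an application-level idle timer from firing. The class name, the use of setSoTimeout as a crude stand-in for an idle timer, and the 500 ms value are all illustrative assumptions; real OS keep-alive probes would not even begin within such a short window.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class KeepAliveVsIdleDemo {
    // Returns true if the application-level idle timer fires even though
    // SO_KEEPALIVE is enabled, i.e. keep-alive does not feed the idle timer.
    static boolean idleTimerFiresDespiteKeepAlive() throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort());
             Socket accepted = server.accept()) {
            accepted.setKeepAlive(true); // OS-level keep-alive probes enabled
            accepted.setSoTimeout(500);  // crude stand-in for an idle timeout (ms)
            try {
                accepted.getInputStream().read(); // peer sends nothing
                return false;                     // unexpected: data arrived
            } catch (SocketTimeoutException expected) {
                return true;                      // idle timer fired regardless
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("idle timer fired: " + idleTimerFiresDespiteKeepAlive());
    }
}
```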

kenhuuu (Contributor) commented Dec 5, 2024

Thanks a lot for confirming.

It would be nice if you could add an entry to the CHANGELOG. If you want, you can also change the code back to the original way you had it, since TCP keep-alive can be active together with the idle timeout.

VOTE +1

Enable TCP Keep-Alive to detect if the remote peer is still reachable.
Keep-Alive sends periodic probes to check if the remote peer is still active.
If the remote peer is unreachable, the connection will be closed, preventing
resource leaks and avoiding the maintenance of stale connections.
@dh-cloud dh-cloud force-pushed the fix-20241125-ConnectionsKeepAlive branch from 8ddd859 to 977291b on December 5, 2024 06:12
4 participants