Avoid the issue of connections not being released. #2915

Open · wants to merge 1 commit into master
Conversation

dh-cloud (Contributor)

Enable TCP Keep-Alive to detect whether the remote peer is still reachable. Keep-Alive sends periodic probes to check that the peer is still active; if the peer is unreachable, the connection is closed, preventing resource leaks and avoiding the maintenance of stale connections.
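The change itself is a one-line Netty option (`b.childOption(ChannelOption.SO_KEEPALIVE, true)`). As a hedged, Netty-free sketch of the same socket option, the JDK exposes it directly on java.net.Socket; the class name KeepAliveDemo below is illustrative and not part of the PR:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    // Enables SO_KEEPALIVE on a freshly connected socket and reads the flag
    // back to confirm the OS accepted it. This is the per-connection effect of
    // Netty's childOption(ChannelOption.SO_KEEPALIVE, true) on accepted channels.
    static boolean enableKeepAlive() throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort())) {
            client.setKeepAlive(true);
            return client.getKeepAlive();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("SO_KEEPALIVE enabled: " + enableKeepAlive());
    }
}
```

Note that the flag only opts the connection into keep-alive: when and how often probes are actually sent is governed by OS-level settings (e.g. tcp_keepalive_time on Linux), a point raised later in this review.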


codecov-commenter commented Nov 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.85%. Comparing base (cfd6889) to head (977291b).
Report is 36 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2915      +/-   ##
============================================
- Coverage     77.87%   77.85%   -0.03%     
+ Complexity    13578    13532      -46     
============================================
  Files          1015     1015              
  Lines         59308    59231      -77     
  Branches       6835     6833       -2     
============================================
- Hits          46184    46112      -72     
+ Misses        10817    10812       -5     
  Partials       2307     2307              


// Keep-Alive sends periodic probes to check if the remote peer is still active.
// If the remote peer is unreachable, the connection will be closed, preventing resource leaks and
// avoiding the maintenance of stale connections.
b.childOption(ChannelOption.SO_KEEPALIVE, true);
andreachild (Contributor) commented Nov 27, 2024

Hello @dh-cloud, thanks for noticing this issue with potential resource leaks in the gremlin-server.

In the master branch we have switched to HTTP (away from the WebSockets used in 3.7), which means there will be many more short-lived connections compared to WebSockets' long-lived ones. When activity decreases we should aim to release resources, and thus not use the keep-alive option to keep connections active.

There is still a potential resource leak, but we can instead fix it using the existing idleConnectionTimeout setting. The issue is that during the switch from WebSockets to HTTP, we neglected to enable idle connection monitoring for HTTP 😢. What do you think about changing this PR to enable idle connection monitoring instead?

See the code references below for what I'm referring to.

dh-cloud (Contributor, Author)

@andreachild
Yes, I agree that we could also add handling for idle connections, but the original intent of this PR was to tear down broken connections, especially in scenarios like the following: a client establishes a long-lived connection with the gremlin-server, and then the client's network card is suddenly unplugged to simulate a client-side network failure. In this case, netstat -antp on the server still shows the connection as established, but pinging the client's address fails, and the connection is never destroyed. I have reproduced this scenario.

Feel free to let me know if you need any further adjustments!

andreachild (Contributor) commented Nov 28, 2024

I agree we should protect the server against broken connections. My concern is that enabling SO_KEEPALIVE could interfere with idle connection handling, if that is enabled, by keeping alive connections that should be considered idle and cleaned up. The SO_KEEPALIVE settings are OS-dependent, so when keep-alive pings kick in, and how often they are sent, would vary by OS.

For example, assume we fix HttpChannelizer to return true from boolean supportsIdleMonitor(), Settings.idleConnectionTimeout is 30 seconds, and the OS is configured to start sending keep-alive pings after 10 seconds, once per second:

  1. server receives a request and establishes a connection and sends a response
  2. client stops sending requests
  3. after 10 seconds the server starts sending keep alive pings which resets the idle connection time back to zero
  4. keep alive pings are sent every second
  5. after 30 seconds there is still no traffic but the connection is not considered idle because of the keep alive pings
  6. connection is not cleaned up as it is not considered idle

What do you think about one of the following solutions:

  • change SO_KEEPALIVE to be enabled only if idle connection detection is disabled
  • change idle connection detection to be enabled by default so that broken connections will be cleaned up as the idle timeout is reached

My preference would be the latter option as it would allow consistent control of the timeouts at the application level instead of relying on the varying OS-level settings.

dh-cloud (Contributor, Author)

I fully understand your point. For the requirement to disconnect abnormal connections, I believe we can consider two dimensions:

  1. When boolean supportsIdleMonitor() is false, i.e., idle connection detection is disabled, we can enable SO_KEEPALIVE as a fallback to prevent broken connections from staying open.
  2. When supportsIdleMonitor() is true, we need to improve the handling logic for idle connections.

In the TinkerPop master source code, I noticed that settings.idleConnectionTimeout and settings.keepAliveInterval both default to 0. As a result, the IdleStateHandler added in AbstractChannelizer has no effect.

if (supportsIdleMonitor()) {
    // Both settings default to 0 ms, which disables the corresponding idle checks.
    final int idleConnectionTimeout = (int) (settings.idleConnectionTimeout / 1000);
    final int keepAliveInterval = (int) (settings.keepAliveInterval / 1000);
    pipeline.addLast(new IdleStateHandler(idleConnectionTimeout, keepAliveInterval, 0));
}

Choosing specific values for idleConnectionTimeout and keepAliveInterval is a complex task with significant impact, as they control the idle threshold for connections. Different use cases have different requirements, so it is challenging to provide a one-size-fits-all recommendation.
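For illustration, a hedged sketch of what enabling these might look like in gremlin-server.yaml; the values below are arbitrary examples, not recommendations, and per the / 1000 conversion in the snippet above they are expressed in milliseconds:

```yaml
# Illustrative values only; both settings default to 0, i.e. disabled.
idleConnectionTimeout: 30000  # close a connection after 30 s of idleness
keepAliveInterval: 60000      # emit a server keep-alive ping after 60 s without writes
```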

For this PR, I force-pushed a new commit implementing the fallback measure only when boolean supportsIdleMonitor() is false.

I hope this helps! If you need any further adjustments, feel free to let me know.

@dh-cloud dh-cloud force-pushed the fix-20241125-ConnectionsKeepAlive branch from d30c72c to 8ddd859 on November 29, 2024 02:45
andreachild (Contributor)

VOTE +1

kenhuuu (Contributor) commented Dec 2, 2024

Based on information I've recently read (which could be inaccurate), I think the SO_KEEPALIVE option and the IdleTimeoutHandler can be used together. TCP probes shouldn't be surfaced by recv(), so they shouldn't reset the idle timer. If this can be confirmed (via a small test), then I think it's probably OK to just add the SO_KEEPALIVE option regardless of whether the idle monitor is supported.

That being said, I'm still a bit skeptical about the real-world usefulness of this PR, given that Netty can't control things like the keep-alive period; those require OS-level options to be set. This just ends up using OS defaults, which users may or may not find useful. So while I still support merging this PR, I think it will ultimately be replaced by the idle monitor.

Again, it would be great if you could confirm that setting this option doesn't affect the idle monitor, in which case you can just always have this option on.

dh-cloud (Contributor, Author) commented Dec 5, 2024

Regarding using SO_KEEPALIVE and IdleTimeoutHandler together, I conducted a small test. The results show that SO_KEEPALIVE does not interfere with IdleTimeoutHandler's idle timer. I also studied the IdleTimeoutHandler source code and the related TCP keep-alive documentation. Based on my understanding, and as you mentioned, SO_KEEPALIVE is a mechanism of the operating system's TCP stack; Netty at the application layer does not observe the probes, so they do not affect IdleTimeoutHandler. My test steps were as follows:

  1. In the gremlin-server.yaml configuration file, set settings.keepAliveInterval to 60 seconds.
  2. Create a client that connects to the gremlin-server using a long-lived connection, without actively releasing the connection, to simulate an idle connection scenario.
  3. Observe the intervals of the WRITER_IDLE event logs printed by the gremlin-server. Over an observation period of 8 hours, the WRITER_IDLE event was printed 60 * 8 = 480 times, at 60-second intervals.

In short, SO_KEEPALIVE does not affect IdleTimeoutHandler.
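The 8-hour test above against a real gremlin-server cannot be reproduced here, but the independence of the two mechanisms can be sketched with JDK sockets alone: enabling SO_KEEPALIVE on an otherwise silent connection does not stop an application-level idle timer from firing. The class name, the use of setSoTimeout as a crude stand-in for an idle timer, and the 500 ms value are all illustrative assumptions; real OS keep-alive probes would not even begin within such a short window.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class KeepAliveVsIdleDemo {
    // Returns true if the application-level idle timer fires even though
    // SO_KEEPALIVE is enabled, i.e. keep-alive does not feed the idle timer.
    static boolean idleTimerFiresDespiteKeepAlive() throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort());
             Socket accepted = server.accept()) {
            accepted.setKeepAlive(true); // OS-level keep-alive probes enabled
            accepted.setSoTimeout(500);  // crude stand-in for an idle timeout (ms)
            try {
                accepted.getInputStream().read(); // peer sends nothing
                return false;                     // unexpected: data arrived
            } catch (SocketTimeoutException expected) {
                return true;                      // idle timer fired regardless
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("idle timer fired: " + idleTimerFiresDespiteKeepAlive());
    }
}
```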

kenhuuu (Contributor) commented Dec 5, 2024

Thanks a lot for confirming.

It would be nice if you could add an entry to the CHANGELOG. If you want, you can also change the code back to the original way you had it, since TCP keep-alive can be active together with the idle timeout.

VOTE +1

Enable TCP Keep-Alive to detect if the remote peer is still reachable.
Keep-Alive sends periodic probes to check if the remote peer is still active.
If the remote peer is unreachable, the connection will be closed, preventing
resource leaks and avoiding the maintenance of stale connections.
@dh-cloud dh-cloud force-pushed the fix-20241125-ConnectionsKeepAlive branch from 8ddd859 to 977291b on December 5, 2024 06:12
4 participants