Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Application hang after Neo4j connection did not respond #1504

Open
pattanmzy opened this issue Nov 14, 2023 · 11 comments
Open

Application hang after Neo4j connection did not respond #1504

pattanmzy opened this issue Nov 14, 2023 · 11 comments

Comments

@pattanmzy
Copy link

pattanmzy commented Nov 14, 2023

Hi,

I have a Spring Boot application with Neo4j AuraDB. The application was working fine initially. However, occasionally we observed that queries to the Neo4j AuraDB hang and did not respond after a long period of time, e.g. 10 - 20 minutes. When this happened, we saw the following message in the logs.

[Neo4jDriverIO-2-1] - Closing connection pool towards p-4dc8adc8-7ed2-7.production-orch-0361.neo4j.io:7687, it has no active connections and is not in the routing table registry.

We generated a few thread dumps every 30s. We observed that those threads still hang and the number of threads waiting for response from neo4j connection kept piling up.
There are no other related errors in the log.

Connection uri: neo4j+s://.databases.neo4j.io
Spring Boot starter version: 3.0.8
Spring Data Neo4j version: 7.0.7
Neo4j java driver version: 5.9.0
Neo4j version: 5
Neo4j Aura Professional on AWS
Operating System: Alpine Linux v3.18 on Kubernetes EKS
Java version: openjdk version "17.0.9"

Expected behavior

Database query should execute successfully

Actual behavior

Connection to AuraDB did not respond causing thread to hang.

Thread dumps
ThreadDumps.zip

Could you please advise what could be the root cause of this issue.

Do let me know if you need more info.

Thank you.

@injectives
Copy link
Contributor

@pattanmzy, thanks for your message. Could you let us know if you are using reactive API by any chance please?

@pattanmzy
Copy link
Author

Thanks @injectives for your reply.

The application does not use reactive API but it has dependency on Spring Boot Actuator for monitoring purpose. From the thread dumps, I can see that the Actuator HealthEndpoint uses the ReactiveHealthContributors to perform health check. There were many similar threads hanging waiting for response from Actuator health check. Is this the cause of the problem?

Extracted from thread dump
"http-nio-8080-exec-1" #52 daemon prio=5 os_prio=0 cpu=2079.61ms elapsed=84690.92s tid=0x00007f8bc5150fd0 nid=0x3a waiting on condition [0x00007f8bbe453000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for <0x00000000db562688> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:715)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly([email protected]/AbstractQueuedSynchronizer.java:1047)
at java.util.concurrent.CountDownLatch.await([email protected]/CountDownLatch.java:230)
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:90)
at reactor.core.publisher.Mono.block(Mono.java:1712)
at org.springframework.boot.actuate.autoconfigure.health.HealthEndpointConfiguration$AdaptedReactiveHealthContributors$1.getHealth(HealthEndpointConfiguration.java:160)
at org.springframework.boot.actuate.health.HealthEndpointWebExtension.getHealth(HealthEndpointWebExtension.java:94)
at org.springframework.boot.actuate.health.HealthEndpointWebExtension.getHealth(HealthEndpointWebExtension.java:47)
at org.springframework.boot.actuate.health.HealthEndpointSupport.getLoggedHealth(HealthEndpointSupport.java:172)
at org.springframework.boot.actuate.health.HealthEndpointSupport.getContribution(HealthEndpointSupport.java:145)
at org.springframework.boot.actuate.health.HealthEndpointSupport.getAggregateContribution(HealthEndpointSupport.java:156)
at org.springframework.boot.actuate.health.HealthEndpointSupport.getContribution(HealthEndpointSupport.java:141)
at org.springframework.boot.actuate.health.HealthEndpointSupport.getHealth(HealthEndpointSupport.java:110)
at org.springframework.boot.actuate.health.HealthEndpointSupport.getHealth(HealthEndpointSupport.java:81)
at org.springframework.boot.actuate.health.HealthEndpointWebExtension.health(HealthEndpointWebExtension.java:80)
at org.springframework.boot.actuate.health.HealthEndpointWebExtension.health(HealthEndpointWebExtension.java:69)
at jdk.internal.reflect.GeneratedMethodAccessor104.invoke(Unknown Source)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke([email protected]/Method.java:568)
at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:281)
at org.springframework.boot.actuate.endpoint.invoke.reflect.ReflectiveOperationInvoker.invoke(ReflectiveOperationInvoker.java:74)
at org.springframework.boot.actuate.endpoint.annotation.AbstractDiscoveredOperation.invoke(AbstractDiscoveredOperation.java:60)
at org.springframework.boot.actuate.endpoint.web.servlet.AbstractWebMvcEndpointHandlerMapping$ServletWebOperationAdapter.handle(AbstractWebMvcEndpointHandlerMapping.java:321)

@injectives
Copy link
Contributor

injectives commented Nov 21, 2023

Is this the cause of the problem?

If I remember correctly, the healthcheck uses the driver session run method. If subscription is cancelled (for whatever reason), it might be a cause of such issue, but this may also be something else.

Neo4j java driver version: 5.9.0

I suggest trying the latest driver version 5.14.0 as 5.11.0 introduced the cancellation support, see the release notes.

@pattanmzy
Copy link
Author

Thanks for the feedback. I have updated the application to use the latest driver version. I will continue to monitor the application and will provide an update here if the problem still persist.

@pattanmzy
Copy link
Author

Hi @injectives ,

The connection issue happened again a few days after I upgraded the Neo4j Java driver to the latest version, i.e. 5.14.0. From the attached thread dumps that were generated at 30s to 1 min intervals, we observed that there were application initiated queries to Neo4j AuraDB queries as well as Actuator health check threads hang and did not respond.

There were errors such as connection to the database terminated, connection read timed out, server is no longer available, failed to obtain connection towards WRITE server, etc. found in the application logs. Please find the details in the application and trace logs attached.

Spring Boot starter version: 3.0.12
Spring Data Neo4j version: 7.0.11
Neo4j java driver version: 5.14.0

Thread dumps
ThreadDumps.zip

Logs
application-log.txt
trace-log.txt

Could you please look into this? Thank you.

@injectives
Copy link
Contributor

Hello.

Sure. The thread dumps show the http threads are in WAITING state, while the driver threads are actually in RUNNABLE state. It also seems like the main issue is around begin transaction area.

The application-log.txt and trace-log.txt show mostly connectivity issues, which could be because of some server-side maintenance or some other factor. Although, I am a little concerned about this one: org.neo4j.driver.exceptions.ConnectionReadTimeoutException: Connection read timed out due to it taking longer than the server-supplied timeout value via configuration hint. - it basically means that the server did not respond within the timeout it suggested to the driver upon connection.

The driver debug logs might reveal more context on this issue. Would you be happy to enable and supply those please? The driver logging can be configured though the following config option.

As a side question, could you let me know if you set the target database explicitly in the driver session please? See the following config option for more details. If possible, it would be better to set this to narrow down the issue search area a little bit.

And one more general question please. Has your application been using Aura for a while before the issue started happening on a certain date or is it a new deployment? I am just trying to understand if this issue just started happening on a certain date, which could mean some server change.

Thank you.

@pattanmzy
Copy link
Author

@injectives

The driver debug logs might reveal more context on this issue. Would you be happy to enable and supply those please?

I have set the logging level to debug. I will provide the logs when the issue happens again.

As a side question, could you let me know if you set the target database explicitly in the driver session please?

The target database was not set previously. It's now set.

Has your application been using Aura for a while before the issue started happening on a certain date or is it a new deployment?

It is a new deployment of less than 2 months. This issue started happening since the beginning.

@pattanmzy
Copy link
Author

Hi @injectives ,

The issue happened again on December 26th. The thread dumps show similar issue as before. There's no error found in the application logs. However, there were warning messages on Actuator Health contributor taking long time to respond in the console log as below. There were also no application activity related log messages after this time. It appears that the issue started to happen around this time.

2023-12-26 11:20:34 WARN  [http-nio-8080-exec-2] - Health contributor org.springframework.boot.actuate.autoconfigure.health.HealthEndpointConfiguration$AdaptedReactiveHealthContributors$1 (neo4j) took 30988ms to respond
2023-12-26 11:20:34 WARN  [http-nio-8080-exec-3] - Health contributor org.springframework.boot.actuate.autoconfigure.health.HealthEndpointConfiguration$AdaptedReactiveHealthContributors$1 (neo4j) took 61001ms to respond

The driver debug logs are attached.

Thread dumps
ThreadDumps_20231226.zip

Driver Debug Log
neo4j.log.zip

Console log
console.log

Appreciate if you could take a look at it. Thank you.

@injectives
Copy link
Contributor

@pattanmzy, thanks a lot for providing the vital details. We will review soon and will get back to you.

@shenhj2016
Copy link

After view the neo4j.log.zip,i have found the same issue in my application.The application is running in AWS.
Log detail:{
"log": "Suppressed: org.neo4j.driver.exceptions.DiscoveryException: Failed to update routing table with server '58534a99.databases.neo4j.io(54.216.115.14):7687'.",
"stream": "stdout",
"time": "2024-02-21T08:51:21.223046235Z"
}
neo4j-driver version is :5.15.0, and it have been update to 5.18.0 which is the latest version.

@injectives
Copy link
Contributor

We have done a significant networking refactoring in the driver that is available in the latest 5.27.0 driver. Is there any chance you could try it to see if the issue is still present please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants