-
Also seeing the same issue on version 336.
-
This generally suggests that workers are not sending heartbeats to the coordinator frequently enough. This can happen due to GC pauses or network connectivity issues. Check whether your workers are often at 100% CPU and going through GC pauses (via the GC logs or JMX metrics).
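For the JMX side, here is a minimal sketch of the kind of number to look at. It uses the standard `java.lang.management` beans and only reads the JVM it runs in; pointing it at a worker would need a remote JMX connection, which is not shown. The class name `GcPauseSnapshot` and the 5 second sampling window are made up for illustration, and the reported value is the JVM's approximate accumulated collection time, not an exact pause measurement.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration; not part of Presto/Trino.
final class GcPauseSnapshot {
    // Sums the approximate accumulated GC time (ms) across all collectors
    // of the current JVM. Collectors that report -1 (undefined) are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcMillis();
        Thread.sleep(5_000); // sampling window, chosen arbitrarily
        long after = totalGcMillis();
        // A large delta relative to the window hints at heavy GC activity.
        System.out.println("GC time during window (ms): " + (after - before));
    }
}
```

If the GC time in a window gets close to the window length itself, the node is likely spending long stretches paused and could plausibly miss a heartbeat.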
-
The CPU usage doesn't look out of the ordinary. What would be the best way to check if it's a network issue? Is there something specific I should be looking for in the http-request.log?
-
When looking at the http-request.log on the worker nodes around the time of the error, we see that the following request is missing from the coordinator node's IP: However, the logs still show this request from the IPs of all the other worker nodes.
-
@hashhar, any chance of changing https://github.com/trinodb/trino/blob/393/core/trino-main/src/main/java/io/trino/metadata/DiscoveryNodeManager.java#L151-L158 and https://github.com/trinodb/trino/blob/393/core/trino-main/src/main/java/io/trino/metadata/DiscoveryNodeManager.java#L265-L267 so that a worker is only marked as 'missing' after 2 consecutive failed polls? This would reduce query failures when there is a one-off, temporary network blip.
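To make the idea concrete, here is a rough sketch of the "N consecutive misses" rule on its own. It is not taken from `DiscoveryNodeManager` and does not use any Trino APIs; the class `MissingNodeDebouncer`, its `onPoll` method, and the plain string node ids are all hypothetical names invented for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not Trino code.
final class MissingNodeDebouncer {
    private final int threshold;
    private final Set<String> knownNodes = new HashSet<>();
    private final Map<String, Integer> consecutiveMisses = new HashMap<>();

    MissingNodeDebouncer(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.threshold = threshold;
    }

    // Feed the set of node ids that answered the latest poll.
    // Returns the nodes that have now missed `threshold` polls in a row.
    Set<String> onPoll(Set<String> respondingNodes) {
        knownNodes.addAll(respondingNodes);
        Set<String> missing = new HashSet<>();
        for (String node : knownNodes) {
            if (respondingNodes.contains(node)) {
                // Any successful poll resets the streak.
                consecutiveMisses.remove(node);
            } else {
                int misses = consecutiveMisses.merge(node, 1, Integer::sum);
                if (misses >= threshold) {
                    missing.add(node);
                }
            }
        }
        return missing;
    }
}
```

With a threshold of 2, a worker that drops out of a single poll and returns on the next one is never reported as missing. The trade-off is that a worker that is really gone is detected one poll interval later.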
-
Hey
Every so often our queries fail with an intermittent Presto error saying io.prestosql.spi.PrestoException: No nodes available to run query. On average we get this error every 2 days.
When looking at the coordinator's server.log at around that time, we can see the following:
node-state-poller-0 io.prestosql.metadata.DiscoveryNodeManager Previously active node is missing
When running this query: "SELECT node_id,state,coordinator FROM system.runtime.nodes"
At one minute we can see our coordinator and all our worker nodes. The next minute we can see only the coordinator node,
and the minute after that we see the coordinator and all the worker nodes again.
Our Presto version is 336.
Does anyone have any suggestions on what could be going on or where to look?
Thanks