🤔[question] Can not connect to master node #9201

Open
2 tasks done
monody1 opened this issue Apr 19, 2024 · 6 comments
monody1 commented Apr 19, 2024

Issue Description:

Following a period of system instability around 20:29 on April 18th, 2024, several agents experienced WebSocket failures, followed by multiple unsuccessful reconnection attempts and eventual crashes. The issue affected multiple agents (IDs ranging from server2 to server18), each showing a similar pattern of i/o timeout errors and write: broken pipe messages.

Key Points:

  • At 20:29, WebSocket connections to many agents failed almost simultaneously, with errors such as read loop: reading message: write tcp 172.19.0.3:8080->10.14.4.X:XXXXX: i/o timeout.
  • After each failure, the master drained and eventually removed the agent, and also logged that the agent is "past reconnect period, it must restart".
  • At 02:49 on April 19th, several hours after the initial incident, the master was still logging websocket handler error: error while reading initial startup message: websocket: close 1001 (going away), which points to an ongoing connectivity or configuration problem.
  • After the crash, the agents seem unable to maintain a stable connection: connection upgrades fail repeatedly, and the logs keep reporting that agents must restart because they are past the reconnect period.

Master log:

[2024-04-18 16:09:34] allocation cleaned up and removed from cache  component="allocation-service"
<error> [2024-04-18 20:25:43] WebSocket failed, awaiting reconnect: master-agent-ws-server13  component="agent" error="read loop: reading message: write tcp 172.19.0.3:8080->10.14.4.13:41924: i/o timeout" id="server13" resource-pool="default"
<info> [2024-04-18 20:25:46] draining agent: server13  component="agent-state-state" id="server13"
<error> [2024-04-18 20:29:14] WebSocket failed, awaiting reconnect: master-agent-ws-server17  component="agent" error="read loop: reading message: write tcp 172.19.0.3:8080->10.14.4.17:52482: i/o timeout" id="server17" resource-pool="default"
<error> [2024-04-18 20:29:14] WebSocket failed, awaiting reconnect: master-agent-ws-server9  component="agent" error="read loop: reading message: write tcp 172.19.0.3:8080->10.14.4.9:50014: i/o timeout" id="server9" resource-pool="default"
<error> [2024-04-18 20:29:14] WebSocket failed, awaiting reconnect: master-agent-ws-server18  component="agent" error="read loop: reading message: write tcp 172.19.0.3:8080->10.14.4.18:52322: i/o timeout" id="server18" resource-pool="default"
<error> [2024-04-18 20:29:14] WebSocket failed, awaiting reconnect: master-agent-ws-server10  component="agent" error="read loop: reading message: read tcp 172.19.0.3:8080->10.14.4.10:55986: i/o timeout" id="server10" resource-pool="default"
<info> [2024-04-18 20:29:14] draining agent: server9  component="agent-state-state" id="server9"
<info> [2024-04-18 20:29:14] draining agent: server17  component="agent-state-state" id="server17"
<error> [2024-04-18 20:29:14] WebSocket failed, awaiting reconnect: master-agent-ws-server2  component="agent" error="read loop: reading message: write tcp 172.19.0.3:8080->10.14.4.2:50498: i/o timeout" id="server2" resource-pool="default"
<error> [2024-04-18 20:29:14] WebSocket failed, awaiting reconnect: master-agent-ws-server3  component="agent" error="read loop: reading message: read tcp 172.19.0.3:8080->10.14.4.3:33858: i/o timeout" id="server3" resource-pool="default"
<error> [2024-04-18 20:29:14] agent crashed  address="10.14.4.13" component="agent" error="agent failed to reconnect by deadline" id="server13" resource-pool="default" started="true"
<info> [2024-04-18 20:29:14] draining agent: server18  component="agent-state-state" id="server18"
<error> [2024-04-18 20:29:14] WebSocket failed, awaiting reconnect: master-agent-ws-server5  component="agent" error="gracefully closing: sending close: write tcp 172.19.0.3:8080->10.14.4.5:56504: i/o timeout" id="server5" resource-pool="default"
<info> [2024-04-18 20:29:14] draining agent: server5  component="agent-state-state" id="server5"
<error> [2024-04-18 20:29:14] WebSocket failed, awaiting reconnect: master-agent-ws-server8  component="agent" error="read loop: reading message: write tcp 172.19.0.3:8080->10.14.4.8:42202: i/o timeout" id="server8" resource-pool="default"
<info> [2024-04-18 20:29:14] draining agent: server8  component="agent-state-state" id="server8"
<error> [2024-04-18 20:29:14] WebSocket failed, awaiting reconnect: master-agent-ws-server11  component="agent" error="read loop: reading message: write tcp 172.19.0.3:8080->10.14.4.11:44262: i/o timeout" id="server11" resource-pool="default"
<info> [2024-04-18 20:29:14] draining agent: server11  component="agent-state-state" id="server11"
<info> [2024-04-18 20:29:14] draining agent: server2  component="agent-state-state" id="server2"
<info> [2024-04-18 20:29:14] draining agent: server3  component="agent-state-state" id="server3"
<error> [2024-04-18 20:29:14] WebSocket failed, awaiting reconnect: master-agent-ws-server4  component="agent" error="read loop: reading message: write tcp 172.19.0.3:8080->10.14.4.4:43872: i/o timeout" id="server4" resource-pool="default"
<info> [2024-04-18 20:29:14] draining agent: server4  component="agent-state-state" id="server4"
<info> [2024-04-18 20:29:14] restoring agent id: server13  component="agents" reconnect="true"
<info> [2024-04-18 20:29:14] draining agent: server10  component="agent-state-state" id="server10"
<info> [2024-04-18 20:29:14] removing agent: server13  component="agent" id="server13" resource-pool="default"
<error> [2024-04-18 20:29:14] error upgrading connection to WebSocket  component="agent" error="websocket connection error: write tcp 172.19.0.3:8080->10.14.4.13:44710: write: broken pipe" id="server13" resource-pool="default"
<error> [2024-04-18 20:29:14] agent crashed  address="10.14.4.13" component="agent" error="error upgrading connection to WebSocket: websocket connection error: write tcp 172.19.0.3:8080->10.14.4.13:44710: write: broken pipe" id="server13" resource-pool="default" started="true"
<info> [2024-04-18 20:29:14] removing agent: server13  component="agent" id="server13" resource-pool="default"
<info> [2024-04-18 20:29:14] restoring agent id: server8  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] error upgrading connection to WebSocket  component="agent" error="websocket connection error: write tcp 172.19.0.3:8080->10.14.4.8:41236: write: broken pipe" id="server8" resource-pool="default"
<error> [2024-04-18 20:29:14] agent crashed  address="10.14.4.8" component="agent" error="error upgrading connection to WebSocket: websocket connection error: write tcp 172.19.0.3:8080->10.14.4.8:41236: write: broken pipe" id="server8" resource-pool="default" started="true"
<info> [2024-04-18 20:29:14] removing agent: server8  component="agent" id="server8" resource-pool="default"
<info> [2024-04-18 20:29:14] restoring agent id: server9  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] error upgrading connection to WebSocket  component="agent" error="websocket connection error: write tcp 172.19.0.3:8080->10.14.4.9:49090: write: broken pipe" id="server9" resource-pool="default"
<error> [2024-04-18 20:29:14] agent crashed  address="10.14.4.9" component="agent" error="error upgrading connection to WebSocket: websocket connection error: write tcp 172.19.0.3:8080->10.14.4.9:49090: write: broken pipe" id="server9" resource-pool="default" started="true"
<info> [2024-04-18 20:29:14] removing agent: server9  component="agent" id="server9" resource-pool="default"
<info> [2024-04-18 20:29:14] restoring agent id: server18  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] error upgrading connection to WebSocket  component="agent" error="websocket connection error: write tcp 172.19.0.3:8080->10.14.4.18:36846: write: broken pipe" id="server18" resource-pool="default"
<error> [2024-04-18 20:29:14] agent crashed  address="10.14.4.18" component="agent" error="error upgrading connection to WebSocket: websocket connection error: write tcp 172.19.0.3:8080->10.14.4.18:36846: write: broken pipe" id="server18" resource-pool="default" started="true"
<info> [2024-04-18 20:29:14] removing agent: server18  component="agent" id="server18" resource-pool="default"
<info> [2024-04-18 20:29:14] restoring agent id: server2  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] error upgrading connection to WebSocket  component="agent" error="websocket connection error: write tcp 172.19.0.3:8080->10.14.4.2:57652: write: broken pipe" id="server2" resource-pool="default"
<error> [2024-04-18 20:29:14] agent crashed  address="10.14.4.2" component="agent" error="error upgrading connection to WebSocket: websocket connection error: write tcp 172.19.0.3:8080->10.14.4.2:57652: write: broken pipe" id="server2" resource-pool="default" started="true"
<info> [2024-04-18 20:29:14] removing agent: server2  component="agent" id="server2" resource-pool="default"
<info> [2024-04-18 20:29:14] restoring agent id: prod1-master-2  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] socket not nil when WebSocketRequest received  component="agent" error="websocket already connected" id="prod1-master-2" resource-pool="default"
<error> [2024-04-18 20:29:14] agent: websocket already connected
<info> [2024-04-18 20:29:14] restoring agent id: server17  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] error upgrading connection to WebSocket  component="agent" error="websocket connection error: write tcp 172.19.0.3:8080->10.14.4.17:53870: write: broken pipe" id="server17" resource-pool="default"
<error> [2024-04-18 20:29:14] agent crashed  address="10.14.4.17" component="agent" error="error upgrading connection to WebSocket: websocket connection error: write tcp 172.19.0.3:8080->10.14.4.17:53870: write: broken pipe" id="server17" resource-pool="default" started="true"
<info> [2024-04-18 20:29:14] removing agent: server17  component="agent" id="server17" resource-pool="default"
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<info> [2024-04-18 20:29:14] restoring agent id: server5  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] error upgrading connection to WebSocket  component="agent" error="websocket connection error: write tcp 172.19.0.3:8080->10.14.4.5:49782: write: broken pipe" id="server5" resource-pool="default"
<error> [2024-04-18 20:29:14] agent crashed  address="10.14.4.5" component="agent" error="error upgrading connection to WebSocket: websocket connection error: write tcp 172.19.0.3:8080->10.14.4.5:49782: write: broken pipe" id="server5" resource-pool="default" started="true"
<info> [2024-04-18 20:29:14] removing agent: server5  component="agent" id="server5" resource-pool="default"
<info> [2024-04-18 20:29:14] restoring agent id: server4  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] error upgrading connection to WebSocket  component="agent" error="websocket connection error: write tcp 172.19.0.3:8080->10.14.4.4:58242: write: broken pipe" id="server4" resource-pool="default"
<error> [2024-04-18 20:29:14] agent crashed  address="10.14.4.4" component="agent" error="error upgrading connection to WebSocket: websocket connection error: write tcp 172.19.0.3:8080->10.14.4.4:58242: write: broken pipe" id="server4" resource-pool="default" started="true"
<info> [2024-04-18 20:29:14] removing agent: server4  component="agent" id="server4" resource-pool="default"
<info> [2024-04-18 20:29:14] restoring agent id: server11  component="agents" reconnect="true"
<info> [2024-04-18 20:29:14] going to try to reattach containers ([])  component="agent" id="server11" resource-pool="default"
<info> [2024-04-18 20:29:14] agent reconnected  component="agent" id="server11" resource-pool="default"
<info> [2024-04-18 20:29:14] enabling agent: server11  component="agent-state-state" id="server11"
<info> [2024-04-18 20:29:14] restoring agent id: server10  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] error upgrading connection to WebSocket  component="agent" error="websocket connection error: write tcp 172.19.0.3:8080->10.14.4.10:43416: write: broken pipe" id="server10" resource-pool="default"
<error> [2024-04-18 20:29:14] agent crashed  address="10.14.4.10" component="agent" error="error upgrading connection to WebSocket: websocket connection error: write tcp 172.19.0.3:8080->10.14.4.10:43416: write: broken pipe" id="server10" resource-pool="default" started="true"
<info> [2024-04-18 20:29:14] removing agent: server10  component="agent" id="server10" resource-pool="default"
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] WebSocket failed, awaiting reconnect: master-agent-ws-server11  component="agent" error="read loop: reading message: websocket: close 1006 (abnormal closure): unexpected EOF" id="server11" resource-pool="default"
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<info> [2024-04-18 20:29:14] draining agent: server11  component="agent-state-state" id="server11"
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<info> [2024-04-18 20:29:14] restoring agent id: server11  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] error upgrading connection to WebSocket  component="agent" error="websocket connection error: write tcp 172.19.0.3:8080->10.14.4.11:39988: write: broken pipe" id="server11" resource-pool="default"
<error> [2024-04-18 20:29:14] agent crashed  address="10.14.4.11" component="agent" error="error upgrading connection to WebSocket: websocket connection error: write tcp 172.19.0.3:8080->10.14.4.11:39988: write: broken pipe" id="server11" resource-pool="default" started="true"
<info> [2024-04-18 20:29:14] removing agent: server11  component="agent" id="server11" resource-pool="default"
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<info> [2024-04-18 20:29:14] restoring agent id: prod1-master-2  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] socket not nil when WebSocketRequest received  component="agent" error="websocket already connected" id="prod1-master-2" resource-pool="default"
<error> [2024-04-18 20:29:14] agent: websocket already connected
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<info> [2024-04-18 20:29:14] restoring agent id: prod1-master-2  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] socket not nil when WebSocketRequest received  component="agent" error="websocket already connected" id="prod1-master-2" resource-pool="default"
<error> [2024-04-18 20:29:14] agent: websocket already connected
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<info> [2024-04-18 20:29:14] restoring agent id: prod1-master-2  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] socket not nil when WebSocketRequest received  component="agent" error="websocket already connected" id="prod1-master-2" resource-pool="default"
<error> [2024-04-18 20:29:14] agent: websocket already connected
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<info> [2024-04-18 20:29:14] restoring agent id: server3  component="agents" reconnect="true"
<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart
<info> [2024-04-18 20:29:14] going to try to reattach containers ([])  component="agent" id="server3" resource-pool="default"
<info> [2024-04-18 20:29:14] agent reconnected  component="agent" id="server3" resource-pool="default"
<info> [2024-04-18 20:29:14] enabling agent: server3  component="agent-state-state" id="server3"
<info> [2024-04-18 20:29:14] agent connected ip: 10.14.4.3 resource pool: default slots: 2  component="agent" id="server3" resource-pool="default"
<info> [2024-04-18 20:29:14] resource pool is empty; using default resource pool: default  component="agents"
<info> [2024-04-18 20:29:14] resource pool is empty; using default resource pool: default  component="agents"
<info> [2024-04-18 20:29:14] agent connected ip: 10.14.4.5 resource pool: default slots: 2  component="agent" id="server5" resource-pool="default"
<info> [2024-04-18 20:29:14] adding device: cuda0 (Tesla V100S-PCIE-32GB) on server5  component="agent-state-state" id="server5"
<info> [2024-04-18 20:29:14] adding device: cuda1 (Tesla V100S-PCIE-32GB) on server5  component="agent-state-state" id="server5"
<info> [2024-04-18 20:29:14] adding agent: server5  component="agent" id="server5" resource-pool="default"
<info> [2024-04-18 20:29:14] agent connected ip: 10.14.4.17 resource pool: default slots: 2  component="agent" id="server17" resource-pool="default"
<info> [2024-04-18 20:29:14] adding device: cuda0 (Tesla V100S-PCIE-32GB) on server17  component="agent-state-state" id="server17"
<info> [2024-04-18 20:29:14] adding device: cuda1 (Tesla V100S-PCIE-32GB) on server17  component="agent-state-state" id="server17"
<info> [2024-04-18 20:29:14] adding agent: server17  component="agent" id="server17" resource-pool="default"
<info> [2024-04-18 20:29:15] resource pool is empty; using default resource pool: default  component="agents"
<info> [2024-04-18 20:29:15] resource pool is empty; using default resource pool: default  component="agents"
<warning> [2024-04-18 20:29:15] failed to get agent state for agent server18  component="agent" error="agent state is not available: agent not started" id="server18" resource-pool="default"
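
Because the same errors repeat across many agents, a per-agent tally can make the pattern easier to read. The following is only a sketch: the regular expressions are an assumption based on the lines pasted above (an optional `<level>` tag, a bracketed timestamp, then the message with `key="value"` fields), and `summarize` is a hypothetical helper, not part of Determined.

```python
import re
from collections import Counter

# Assumed line shape, taken from the master log excerpt above:
#   <error> [2024-04-18 20:29:14] agent crashed  address="..." id="server13" ...
LINE_RE = re.compile(r'^(?:<(?P<level>\w+)>\s*)?\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$')
# Require whitespace before id= so that fields such as allocation-id= are not matched.
ID_RE = re.compile(r'(?:^|\s)id="(?P<id>[^"]+)"')


def summarize(lines):
    """Count error-level lines per agent id; lines without an id are counted under None."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "error":
            continue
        id_match = ID_RE.search(m.group("msg"))
        counts[id_match.group("id") if id_match else None] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '<error> [2024-04-18 20:29:14] agent crashed  address="10.14.4.13" component="agent" '
        'error="agent failed to reconnect by deadline" id="server13" resource-pool="default" started="true"',
        '<info> [2024-04-18 20:29:14] draining agent: server18  component="agent-state-state" id="server18"',
        '<error> [2024-04-18 20:29:14] agent is past reconnect period, it must restart',
    ]
    for agent_id, n in summarize(sample).items():
        print(agent_id, n)
```

Lines that carry no `id` field, such as "agent is past reconnect period, it must restart", are grouped under `None`, so they still show up in the tally.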

Log from one of the agents:

INFO[2024-04-18T14:47:21Z] transitioning state from RUNNING to TERMINATED  component=container cproto-id=0f517327-25cf-45e2-8de1-09519e37f4e0 stop="container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)"
TRAC[2024-04-18T20:26:37Z] websocket inbox closed                        component=agent
ERRO[2024-04-18T20:26:37Z] socket disconnected                           component=agent error="read loop: reading message: read tcp 10.14.4.18:52322->10.14.4.3:8080: i/o timeout"
TRAC[2024-04-18T20:26:37Z] reconnecting master socket...                 component=agent
INFO[2024-04-18T20:26:37Z] connecting to master at: ws://10.14.4.3:8080/agents?id=server18&version=0.30.0&resource_pool=&reconnect=true&hostname=server18  component=agent
ERRO[2024-04-18T20:27:22Z] error reconnecting to master                  component=agent error="error dialing master: read tcp 10.14.4.18:36846->10.14.4.3:8080: i/o timeout"
INFO[2024-04-18T20:27:27Z] connecting to master at: ws://10.14.4.3:8080/agents?id=server18&version=0.30.0&resource_pool=&reconnect=true&hostname=server18  component=agent
ERRO[2024-04-18T20:28:12Z] error reconnecting to master                  component=agent error="error dialing master: read tcp 10.14.4.18:43004->10.14.4.3:8080: i/o timeout"
INFO[2024-04-18T20:28:17Z] connecting to master at: ws://10.14.4.3:8080/agents?id=server18&version=0.30.0&resource_pool=&reconnect=true&hostname=server18  component=agent
ERRO[2024-04-18T20:29:02Z] error reconnecting to master                  component=agent error="error dialing master: read tcp 10.14.4.18:45396->10.14.4.3:8080: i/o timeout"
INFO[2024-04-18T20:29:07Z] connecting to master at: ws://10.14.4.3:8080/agents?id=server18&version=0.30.0&resource_pool=&reconnect=true&hostname=server18  component=agent
WARN[2024-04-18T20:29:14Z] received ErrAgentMustReconnect, exiting       component=agent
TRAC[2024-04-18T20:29:14Z] detaching container manager                   component=agent
TRAC[2024-04-18T20:29:14Z] cleaning up docker client                     component=agent
TRAC[2024-04-18T20:29:14Z] cleaning up socket                            component=agent
TRAC[2024-04-18T20:29:14Z] attempting graceful close                     component=websocket id=356f077a-7402-46a0-bee7-ac22a053e314 name=agent-server18 remote-addr="10.14.4.3:8080"
TRAC[2024-04-18T20:29:14Z] attempting forceful close                     component=websocket id=356f077a-7402-46a0-bee7-ac22a053e314 name=agent-server18 remote-addr="10.14.4.3:8080"
TRAC[2024-04-18T20:29:14Z] socket closed                                 component=websocket id=356f077a-7402-46a0-bee7-ac22a053e314 name=agent-server18 remote-addr="10.14.4.3:8080"
FATA[2024-04-18T20:29:14Z] agent is past reconnect period, it must restart
WARN[2024-04-18T20:29:15Z] no configuration file at /etc/determined/agent.yaml, skipping
INFO[2024-04-18T20:29:15Z] agent configuration: {"config_file":"","log":{"level":"trace","color":true},"master_host":"10.14.4.3","master_port":8080,"agent_id":"server18","label":"","resource_pool":"","container_master_host":"","container_master_port":0,"slot_type":"auto","visible_gpus":"","security":{"tls":{"enabled":false,"skip_verify":false,"master_cert":"","master_cert_name":"","client_cert":"","client_key":""}},"debug":false,"artificial_slots":0,"image_root":"","tls":false,"tls_cert":"","tls_key":"","api_enabled":false,"bind_ip":"0.0.0.0","bind_port":9090,"http_proxy":"","https_proxy":"","ftp_proxy":"","no_proxy":"","agent_reconnect_attempts":5,"agent_reconnect_backoff":5,"container_runtime":"docker","singularity_options":{"allow_network_creation":false},"podman_options":{"allow_network_creation":false},"container_auto_remove_disabled":false,"hooks":{"on_connection_lost":null},"fluent":{"image":"","port":0,"container_name":""}}
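
The timestamps in the server18 log above show a regular rhythm, which the short calculation below makes explicit. It is a sketch using only timestamps copied from that log. Reading the first gap as a dial timeout is an assumption; the second gap matches the `agent_reconnect_backoff` value of 5 shown in the agent configuration.

```python
from datetime import datetime

# Timestamps copied from the server18 agent log above.
attempts = [  # "connecting to master at: ..."
    "2024-04-18T20:26:37Z",
    "2024-04-18T20:27:27Z",
    "2024-04-18T20:28:17Z",
    "2024-04-18T20:29:07Z",
]
failures = [  # "error reconnecting to master"
    "2024-04-18T20:27:22Z",
    "2024-04-18T20:28:12Z",
    "2024-04-18T20:29:02Z",
]
exit_time = "2024-04-18T20:29:14Z"  # "received ErrAgentMustReconnect, exiting"


def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def seconds_between(earlier, later):
    return int((parse(later) - parse(earlier)).total_seconds())


# Time each dial spent before failing (assumed to be a dial timeout).
dial_durations = [seconds_between(a, f) for a, f in zip(attempts, failures)]
# Pause between a failure and the next attempt.
backoffs = [seconds_between(f, a) for f, a in zip(failures, attempts[1:])]
# Time from the first reconnect attempt until the agent exited.
total = seconds_between(attempts[0], exit_time)

print(dial_durations)  # [45, 45, 45]
print(backoffs)        # [5, 5, 5]
print(total)           # 157
```

So, in this excerpt, each attempt cost about 50 seconds, and the agent exited roughly two and a half minutes after the socket dropped.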

Manual agent restart (master reaction):

<info> [2024-04-19 02:48:59] adding agent: server5  component="agent" id="server5" resource-pool="default"
<error> [2024-04-19 02:49:19] agent crashed  address="10.14.4.5" component="agent" error="agent failed to reconnect by deadline" id="server5" resource-pool="default" started="true"
<info> [2024-04-19 02:49:19] removing agent: server5  component="agent" id="server5" resource-pool="default"
<error> [2024-04-19 02:49:27] websocket handler error: error while reading initial startup message: websocket: close 1001 (going away)
<error> [2024-04-19 02:49:33] websocket handler error: error while reading initial startup message: websocket: close 1001 (going away)
<info> [2024-04-19 02:53:52] resources are requested by Checkpoint GC (Experiment 6) (Allocation ID: 6.ad77961d-6c59-4ecc-8e3c-8fe3961d7514.1)  allocation-id="6.ad77961d-6c59-4ecc-8e3c-8fe3961d7514.1" component="resource-pool" name="default" restore="false" restoring="false"
<info> [2024-04-19 02:53:52] allocated resources to Checkpoint GC (Experiment 6)  component="resource-pool" name="default"
<info> [2024-04-19 02:53:52] 1 resources allocated  restore="false" task-id="6.ad77961d-6c59-4ecc-8e3c-8fe3961d7514" task-type="CHECKPOINT_GC"
<info> [2024-04-19 02:53:52] resources are requested by Checkpoint GC (Experiment 7) (Allocation ID: 7.afbba4f0-ec22-48f9-acf1-b0d5c7c932d3.1)  allocation-id="7.afbba4f0-ec22-48f9-acf1-b0d5c7c932d3.1" component="resource-pool" name="default" restore="false" restoring="false"
<info> [2024-04-19 02:53:52] starting container  allocation-id="6.ad77961d-6c59-4ecc-8e3c-8fe3961d7514.1" component="agent" container-id="0218474a-94a9-49c9-9399-9eb3187ccf33" id="server13" resource-pool="default" slots="0" task-id="6.ad77961d-6c59-4ecc-8e3c-8fe3961d7514" task-type="CHECKPOINT_GC"
<info> [2024-04-19 02:53:53] allocated resources to Checkpoint GC (Experiment 7)  component="resource-pool" name="default"
<info> [2024-04-19 02:53:53] 1 resources allocated  restore="false" task-id="7.afbba4f0-ec22-48f9-acf1-b0d5c7c932d3" task-type="CHECKPOINT_GC"
<info> [2024-04-19 02:53:53] starting container  allocation-id="7.afbba4f0-ec22-48f9-acf1-b0d5c7c932d3.1" component="agent" container-id="93ad97b4-0b7f-4134-8db3-dbadc450bd5e" id="server13" resource-pool="default" slots="0" task-id="7.afbba4f0-ec22-48f9-acf1-b0d5c7c932d3" task-type="CHECKPOINT_GC"
<error> [2024-04-19 02:53:55] websocket handler error: error while reading initial startup message: websocket: close 1001 (going away)
<error> [2024-04-19 02:54:00] websocket handler error: error while reading initial startup message: websocket: close 1001 (going away)
<error> [2024-04-19 02:54:15] websocket handler error: error while reading initial startup message: websocket: close 1001 (going away)
<info> [2024-04-19 02:55:40] websocket closed gracefully, awaiting reconnect: master-agent-ws-server5  component="agent" id="server5" resource-pool="default"
<info> [2024-04-19 02:55:40] draining agent: server5  component="agent-state-state" id="server5"
<info> [2024-04-19 02:55:54] resource pool is empty; using default resource pool: default  component="agents"
<info> [2024-04-19 02:55:54] agent connected ip: 10.14.4.5 resource pool: default slots: 2  component="agent" id="server5" resource-pool="default"
<info> [2024-04-19 02:55:54] adding device: cuda0 (Tesla V100S-PCIE-32GB) on server5  component="agent-state-state" id="server5"
<info> [2024-04-19 02:55:54] adding device: cuda1 (Tesla V100S-PCIE-32GB) on server5  component="agent-state-state" id="server5"
<info> [2024-04-19 02:55:54] adding agent: server5  component="agent" id="server5" resource-pool="default"
<error> [2024-04-19 02:56:05] agent crashed  address="10.14.4.5" component="agent" error="agent failed to reconnect by deadline" id="server5" resource-pool="default" started="true"

Manual agent restart (agent reaction):

INFO[2024-04-19T02:55:54Z] Nvidia driver version: 535.54.03
INFO[2024-04-19T02:55:54Z] detected compute devices:
INFO[2024-04-19T02:55:54Z]      cuda0 (Tesla V100S-PCIE-32GB)
INFO[2024-04-19T02:55:54Z]      cuda1 (Tesla V100S-PCIE-32GB)
TRAC[2024-04-19T02:55:54Z] setting up docker runtime                     component=agent
INFO[2024-04-19T02:55:54Z] couldn't process ~/.docker/config.json can't read Docker config: open /root/.docker/config.json: no such file or directory  component=docker-client
INFO[2024-04-19T02:55:54Z] can't find any docker credential stores, continuing without them  component=docker-client
INFO[2024-04-19T02:55:54Z] can't find any auths in ~/.docker/config.json, continuing without them  component=docker-client
TRAC[2024-04-19T02:55:54Z] setting up container manager                  component=agent
TRAC[2024-04-19T02:55:54Z] reattaching containers                        component=agent
DEBU[2024-04-19T02:55:54Z] reattachContainers: expected survivors: []    component=container-manager
DEBU[2024-04-19T02:55:54Z] reattachContainers: running containers: []    component=container-manager
TRAC[2024-04-19T02:55:54Z] iterating expected survivors and seeing if they were found  component=container-manager
TRAC[2024-04-19T02:55:54Z] sending SIGKILL to running containers that were not reattached  component=container-manager
TRAC[2024-04-19T02:55:54Z] writing agent started message                 component=agent
TRAC[2024-04-19T02:55:54Z] watching for ws requests and system events    component=agent

Checklist

  • Did you search the docs for a solution?
  • Did you search github issues to find if somebody asked this question before?

ioga (Contributor) commented Apr 19, 2024

hello, sorry to hear this.

The agents seem unable to maintain stable connections post-crash, with repeated failures to upgrade connections and repeated logs of agents needing to restart due to passing the reconnect period.

did they work fine after the restart?

we generally suggest having an automated restart configured (e.g. through systemd units), because there are some networking errors that the agent process cannot recover from by reconnecting; in those cases it has to quit and restart.

monody1 (Author) commented Apr 21, 2024

Restarting the agents didn't work, but after restarting the master node, the connection is now stable. Btw, if I use systemd, how can systemd detect the state of the nodes (both master and agent) to trigger a restart?
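
On the systemd question: as far as I understand, a process supervisor does not inspect the state of the master or the agents. It reacts to the service process exiting, which is what the agent does after logging "agent is past reconnect period, it must restart"; systemd's `Restart=` setting works on that principle. The loop below is a minimal Python sketch of restart-on-exit, not a replacement for systemd. `supervise` and its parameters are hypothetical names, and the child command is a stand-in for the agent process.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay_seconds=0.0):
    """Run cmd and restart it each time it exits with a non-zero code.

    Stops on a zero exit code or after max_restarts restarts, and
    returns the list of exit codes that were observed.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0 or attempt == max_restarts:
            break
        # Wait before restarting, similar in spirit to a restart delay.
        time.sleep(delay_seconds)
    return exit_codes


if __name__ == "__main__":
    # Stand-in for the agent: a child process that always exits with code 1.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(failing, max_restarts=2))
```

Note that this kind of supervision only helps when the process actually exits. In this thread the agent did exit and come back, yet the connection only became stable after the master was restarted, so a restart policy on the agents alone may not have been enough here.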

ioga (Contributor) commented Apr 22, 2024

we will try to repro and investigate internally.

ioga (Contributor) commented Apr 24, 2024

we can't repro it. can you please provide more master logs, covering what happened between the connectivity issues and when you manually restarted it?

monody1 (Author) commented Apr 29, 2024

I forgot to save the logs before restarting. Here are the logs from after the restart:
det-agent-server5_partial.log
determined_determined-master_partial.log

ioga (Contributor) commented Apr 29, 2024

Unfortunately, the logs from before the restart are specifically what we need to see. If this issue happens again, please retain those logs and share them with us.
