Recently on satp3 we had a crash in sorunlib that seems to have been the result of the ACU agent disappearing/restarting. There are a few things going on here, but I wanted to capture this in sorunlib to see if we can make the error a bit more informative. Other repos are probably implicated here too, though.
Mostly a note to myself, but my hope here is to try to reproduce this error state by running a local ocs/sorunlib system, removing an agent, and then trying to command it.
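A rough sketch of what I have in mind, assuming a local site config with crossbar running and an agent registered under instance-id 'acu' (e.g. the ACU simulator); the exact call and parameter names here are just placeholders:

```python
from ocs.ocs_client import OCSClient

# Assumes a local ocs site config (OCS_CONFIG_DIR set, crossbar running) with
# an agent registered as instance-id 'acu'. Stop or remove the agent's
# container first, then run this.
try:
    acu = OCSClient('acu')    # building the client may already fail here
    acu.go_to(az=180, el=50)  # parameter names assumed; otherwise fails here
except Exception as e:
    # Today this surfaces as a terse one-line ControlClientError with little
    # context; the goal is to catch it and report something more useful.
    print(type(e).__name__, e)
```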
I'm not really sure why the agent dropped offline. It definitely did fully restart though, based on the uptime of the Docker container. This led me to check the host manager logs. It looks like the HM's crossbar connection dropped, which maybe triggered the reset? If the HM agent is restarted, its child agents also restart.
hm-daq-satp3-docker logs:
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:26+0000 update:5 Status is now "done".
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:26+0000 manager:0 Requesting termination of ACUAgent[d]:acu
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:34+0000 start called for update
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:34+0000 update:6 Status is now "starting".
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:34+0000 update:6 Status is now "running".
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:34+0000 update:6 Update requested.
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:34+0000 update:6 Status is now "done".
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:36+0000 manager:0 Requested launch for ACUAgent[d]:acu
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:37+0000 manager:0 Launched ACUAgent[d]:acu
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:55+0000 dropping connection to peer tcp4:127.0.0.1:8005 with abort=True: WebSocket ping timeout (peer did not respond with pong in time)
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:55+0000 session left: CloseDetails(reason=<wamp.close.transport_lost>, message='WAMP transport was lost without closing the session 9000460411239835 before')
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:55+0000 transport disconnected
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:56+0000 waiting for reconnect for 9.997699499130249 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:56+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7f4eb55f9ae0> in 1.7342787278181295 seconds.
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:57+0000 waiting for reconnect for 8.767275333404541 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:58+0000 waiting for reconnect for 7.644135236740112 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:59+0000 waiting for reconnect for 6.328954696655273 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:00+0000 waiting for reconnect for 5.243591785430908 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:01+0000 dropping connection to peer tcp4:127.0.0.1:8005 with abort=True: WebSocket opening handshake timeout (peer did not finish the opening handshake in time)
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:01+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7f4eb55f9ae0> in 1.5560060794216635 seconds.
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:01+0000 waiting for reconnect for 4.065957069396973 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:03+0000 waiting for reconnect for 2.8923914432525635 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:04+0000 waiting for reconnect for 1.7263340950012207 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:05+0000 waiting for reconnect for 0.6516845226287842 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:05+0000 transport connected
Sep 22 03:52:13 daq-satp3 systemd[1]: ocs-hostmanager-docker.service: Deactivated successfully.
Sep 22 03:52:13 daq-satp3 systemd[1]: ocs-hostmanager-docker.service: Consumed 3d 20h 37min 59.814s CPU time.
Sep 22 03:52:13 daq-satp3 systemd[1]: ocs-hostmanager-docker.service: Scheduled restart job, restart counter is at 1.
Sep 22 03:52:13 daq-satp3 systemd[1]: Stopped OCS HostManager for daq-satp3.
Sep 22 03:52:13 daq-satp3 systemd[1]: ocs-hostmanager-docker.service: Consumed 3d 20h 37min 59.814s CPU time.
Sep 22 03:52:13 daq-satp3 systemd[1]: Started OCS HostManager for daq-satp3.
The ACU agent logs just jump from seemingly normal operation to a fresh instance, without any indication that it tried to shut down or anything.
acu agent logs:
2023-09-22T03:45:49Z 2023-09-22T03:45:49+0000 go_to:89 Move complete.
2023-09-22T03:45:49Z 2023-09-22T03:45:49+0000 go_to:89 Status is now "done".
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 start called for generate_scan
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 generate_scan:90 Status is now "starting".
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 generate_scan:90 Status is now "running".
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 Moving to start position, az=301.0814, el=72.1193
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 Azimuth.state=INIT dt= 0.000 dist= 2.370
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 Elevation.state=INIT dt= 0.000 dist= 0.000
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 Elevation.state=WAIT_MOVING dt= 0.101 dist= 0.000
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 Elevation.state=WAIT_STILL dt= 0.203 dist= 0.001
2023-09-22T03:45:52Z 2023-09-22T03:45:52+0000 Azimuth.state=WAIT_MOVING dt= 0.100 dist= 2.264
2023-09-22T03:45:52Z 2023-09-22T03:45:52+0000 Azimuth.state=WAIT_STILL dt= 0.201 dist= 2.163
2023-09-22T03:45:52Z 2023-09-22T03:45:52+0000 Elevation.state=DONE dt= 1.116 dist= 0.000
2023-09-22T03:45:54Z 2023-09-22T03:45:54+0000 Azimuth.state=DONE dt= 2.569 dist= 0.004
2023-09-22T03:45:54Z 2023-09-22T03:45:54+0000 Azimuth_mode has changed to ProgramTrack
2023-09-22T03:45:54Z 2023-09-22T03:45:54+0000 Elevation_mode has changed to Stop
2023-09-22T03:45:55Z 2023-09-22T03:45:55+0000 scan mode=go, line_buffer=0, track_free=10000
2023-09-22T03:56:50Z Args: ['--instance-id', 'acu', '--site-hub', 'ws://127.0.0.1:8005/ws', '--site-http', 'http://127.0.0.1:8005/call']
2023-09-22T03:56:50Z Installed OCS Plugins: ['socs', 'ocs']
2023-09-22T03:56:51Z Renaming this process to: "ocs-agent:acu"
2023-09-22T03:56:51Z 2023-09-22T03:56:51+0000 Using OCS version 0.10.2
2023-09-22T03:56:51Z 2023-09-22T03:56:51+0000 ocs: starting <class 'ocs.ocs_agent.OCSAgent'> @ satp3.acu
2023-09-22T03:56:51Z 2023-09-22T03:56:51+0000 log_file is apparently None
2023-09-22T03:56:51Z 2023-09-22T03:56:51+0000 transport connected
This happened again today when the pysmurf-controller agents all went offline -- at the end of a scan, when sorunlib tried to stop the streams, we saw just:
I'm not sure why the traceback ends up being just this one line with no other information, but we need to handle cases where an agent is not available for clients to command, and likely raise a separate exception to get a full traceback.
Full tracebacks get printed in Nextline now, so that part is resolved, but more generally we need to handle agent commands failing during shutdown specifically; I'll open a new issue for that. To resolve the remaining issue here, perhaps we should implement some error handling for ocs.client_http.ControlClientError that prints an informative message about the agent being unavailable.
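Something roughly like this (a minimal sketch, not the actual sorunlib code; the helper name and message wording are made up):

```python
from ocs.client_http import ControlClientError


def run_op(op, instance_id, **kwargs):
    """Call a client operation (e.g. acu.go_to), re-raising the terse
    ControlClientError we get when no agent is registered under that
    instance-id as something more informative for the operator."""
    try:
        return op(**kwargs)
    except ControlClientError as e:
        raise RuntimeError(
            f"Could not command agent '{instance_id}'; it appears to be "
            f"offline or restarting. Check the agent and HostManager logs. "
            f"Original error: {e}"
        ) from e


# Hypothetical usage, with an existing OCSClient named `acu`:
# run_op(acu.go_to, 'acu', az=180, el=50)
```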