Recently on satp3 we had a crash in sorunlib that seems to have been the result of the ACU agent disappearing/restarting. There are a few things going on here, but I wanted to capture this in sorunlib to see if we can make the error a bit more informative. Other repos are probably implicated here too, though.
Mostly a note to myself, but my hope here is to try to reproduce this error state by running a local ocs/sorunlib system, removing an agent, and then trying to command it.
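A rough sketch of what I have in mind, assuming a local site config with crossbar running and an agent registered under instance-id 'acu' (e.g. the ACU simulator); the exact call and parameter names here are just placeholders:

```python
from ocs.ocs_client import OCSClient

# Assumes a local ocs site config (OCS_CONFIG_DIR set, crossbar running) with
# an agent registered as instance-id 'acu'. Stop or remove the agent's
# container first, then run this.
try:
    acu = OCSClient('acu')    # building the client may already fail here
    acu.go_to(az=180, el=50)  # parameter names assumed; otherwise fails here
except Exception as e:
    # Today this surfaces as a terse one-line ControlClientError with little
    # context; the goal is to catch it and report something more useful.
    print(type(e).__name__, e)
```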
I'm not really sure why the agent dropped offline. It definitely did fully restart though, based on the uptime of the Docker container. This led me to check the host manager logs. It looks like the HM's crossbar connection dropped, which maybe triggered the reset? If the HM agent is restarted, its child agents also restart.
hm-daq-satp3-docker logs:
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:26+0000 update:5 Status is now "done".
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:26+0000 manager:0 Requesting termination of ACUAgent[d]:acu
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:34+0000 start called for update
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:34+0000 update:6 Status is now "starting".
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:34+0000 update:6 Status is now "running".
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:34+0000 update:6 Update requested.
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:34+0000 update:6 Status is now "done".
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:36+0000 manager:0 Requested launch for ACUAgent[d]:acu
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-15T18:16:37+0000 manager:0 Launched ACUAgent[d]:acu
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:55+0000 dropping connection to peer tcp4:127.0.0.1:8005 with abort=True: WebSocket ping timeout (peer did not respond with pong in time)
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:55+0000 session left: CloseDetails(reason=<wamp.close.transport_lost>, message='WAMP transport was lost without closing the session 9000460411239835 before')
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:55+0000 transport disconnected
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:56+0000 waiting for reconnect for 9.997699499130249 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:56+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7f4eb55f9ae0> in 1.7342787278181295 seconds.
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:57+0000 waiting for reconnect for 8.767275333404541 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:58+0000 waiting for reconnect for 7.644135236740112 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:51:59+0000 waiting for reconnect for 6.328954696655273 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:00+0000 waiting for reconnect for 5.243591785430908 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:01+0000 dropping connection to peer tcp4:127.0.0.1:8005 with abort=True: WebSocket opening handshake timeout (peer did not finish the opening handshake in time)
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:01+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7f4eb55f9ae0> in 1.5560060794216635 seconds.
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:01+0000 waiting for reconnect for 4.065957069396973 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:03+0000 waiting for reconnect for 2.8923914432525635 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:04+0000 waiting for reconnect for 1.7263340950012207 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:05+0000 waiting for reconnect for 0.6516845226287842 more seconds
Sep 22 03:52:07 daq-satp3 launcher-daq-hm-daq-satp3-docker.sh[559]: 2023-09-22T03:52:05+0000 transport connected
Sep 22 03:52:13 daq-satp3 systemd[1]: ocs-hostmanager-docker.service: Deactivated successfully.
Sep 22 03:52:13 daq-satp3 systemd[1]: ocs-hostmanager-docker.service: Consumed 3d 20h 37min 59.814s CPU time.
Sep 22 03:52:13 daq-satp3 systemd[1]: ocs-hostmanager-docker.service: Scheduled restart job, restart counter is at 1.
Sep 22 03:52:13 daq-satp3 systemd[1]: Stopped OCS HostManager for daq-satp3.
Sep 22 03:52:13 daq-satp3 systemd[1]: ocs-hostmanager-docker.service: Consumed 3d 20h 37min 59.814s CPU time.
Sep 22 03:52:13 daq-satp3 systemd[1]: Started OCS HostManager for daq-satp3.
The ACU agent logs just jump from seemingly normal operation to a fresh instance, without any indication that it tried to shut down or anything.
acu agent logs:
2023-09-22T03:45:49Z 2023-09-22T03:45:49+0000 go_to:89 Move complete.
2023-09-22T03:45:49Z 2023-09-22T03:45:49+0000 go_to:89 Status is now "done".
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 start called for generate_scan
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 generate_scan:90 Status is now "starting".
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 generate_scan:90 Status is now "running".
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 Moving to start position, az=301.0814, el=72.1193
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 Azimuth.state=INIT dt= 0.000 dist= 2.370
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 Elevation.state=INIT dt= 0.000 dist= 0.000
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 Elevation.state=WAIT_MOVING dt= 0.101 dist= 0.000
2023-09-22T03:45:51Z 2023-09-22T03:45:51+0000 Elevation.state=WAIT_STILL dt= 0.203 dist= 0.001
2023-09-22T03:45:52Z 2023-09-22T03:45:52+0000 Azimuth.state=WAIT_MOVING dt= 0.100 dist= 2.264
2023-09-22T03:45:52Z 2023-09-22T03:45:52+0000 Azimuth.state=WAIT_STILL dt= 0.201 dist= 2.163
2023-09-22T03:45:52Z 2023-09-22T03:45:52+0000 Elevation.state=DONE dt= 1.116 dist= 0.000
2023-09-22T03:45:54Z 2023-09-22T03:45:54+0000 Azimuth.state=DONE dt= 2.569 dist= 0.004
2023-09-22T03:45:54Z 2023-09-22T03:45:54+0000 Azimuth_mode has changed to ProgramTrack
2023-09-22T03:45:54Z 2023-09-22T03:45:54+0000 Elevation_mode has changed to Stop
2023-09-22T03:45:55Z 2023-09-22T03:45:55+0000 scan mode=go, line_buffer=0, track_free=10000
2023-09-22T03:56:50Z Args: ['--instance-id', 'acu', '--site-hub', 'ws://127.0.0.1:8005/ws', '--site-http', 'http://127.0.0.1:8005/call']
2023-09-22T03:56:50Z Installed OCS Plugins: ['socs', 'ocs']
2023-09-22T03:56:51Z Renaming this process to: "ocs-agent:acu"
2023-09-22T03:56:51Z 2023-09-22T03:56:51+0000 Using OCS version 0.10.2
2023-09-22T03:56:51Z 2023-09-22T03:56:51+0000 ocs: starting <class 'ocs.ocs_agent.OCSAgent'> @ satp3.acu
2023-09-22T03:56:51Z 2023-09-22T03:56:51+0000 log_file is apparently None
2023-09-22T03:56:51Z 2023-09-22T03:56:51+0000 transport connected
This happened again today when the pysmurf-controller agents all went offline -- at the end of a scan, when sorunlib tried to stop the streams, we saw just:
I'm not sure why the traceback ends up being just this one line with no other information, but we need to handle cases where an agent is not available for clients to command, and likely raise a separate exception to get a full traceback.
Full tracebacks get printed in Nextline now, so that part is resolved, but more generally we need to handle agent commands failing during shutdown specifically; I'll open a new issue for that. To resolve the remaining issue here, perhaps we should implement some error handling for ocs.client_http.ControlClientError that prints an informative message about the agent being unavailable.
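Something roughly like this (a minimal sketch, not the actual sorunlib code; the helper name and message wording are made up):

```python
from ocs.client_http import ControlClientError


def run_op(op, instance_id, **kwargs):
    """Call a client operation (e.g. acu.go_to), re-raising the terse
    ControlClientError we get when no agent is registered under that
    instance-id as something more informative for the operator."""
    try:
        return op(**kwargs)
    except ControlClientError as e:
        raise RuntimeError(
            f"Could not command agent '{instance_id}'; it appears to be "
            f"offline or restarting. Check the agent and HostManager logs. "
            f"Original error: {e}"
        ) from e


# Hypothetical usage, with an existing OCSClient named `acu`:
# run_op(acu.go_to, 'acu', az=180, el=50)
```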