Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to call SwitchController DBus method to bluechi-agent #950

Closed
dofmind opened this issue Sep 20, 2024 · 4 comments · Fixed by #961
Closed

Failed to call SwitchController DBus method to bluechi-agent #950

dofmind opened this issue Sep 20, 2024 · 4 comments · Fixed by #961
Assignees
Labels
bug Something isn't working jira Issues that are synced to Jira
Milestone

Comments

@dofmind
Copy link
Contributor

dofmind commented Sep 20, 2024

Describe the bug

I have an issue where calling SwitchController DBus method on bluechi-agent fails with a timeout when bluechi-agent never connected to bluechi-controller. However, if bluechi-agent has ever connected to bluechi-controller and retries to connect to bluechi-controller, calling SwitchController DBus method on bluechi-agent succeeds.

To Reproduce

Prerequisite

  • host 192.168.16.101 exists on local network and is running bluechi-controller.
  • host 192.168.16.111 does not exist on local network.

Failure case

  1. start bluechi-agent with host 192.168.16.111
root@42dot-ak7:~# cat /etc/bluechi/agent.conf 
[bluechi-agent]
NodeName=ak7_master_main
ControllerHost=192.168.16.111
HeartbeatInterval=2000
root@42dot-ak7:~# systemctl start bluechi-agent
root@42dot-ak7:~# journalctl -f -u bluechi-agent
Sep 20 15:02:53 42dot-ak7 systemd[1]: Started BlueChi systemd service controller agent daemon.
Sep 20 15:02:53 42dot-ak7 bluechi-agent[3062]: Starting bluechi-agent 0.8.0-1
Sep 20 15:02:53 42dot-ak7 bluechi-agent[3062]: Connecting to controller on tcp:host=192.168.16.111,port=842
Sep 20 15:02:56 42dot-ak7 bluechi-agent[3062]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Sep 20 15:02:56 42dot-ak7 bluechi-agent[3062]: Initial controller connection failed, retrying
Sep 20 15:02:56 42dot-ak7 bluechi-agent[3062]: Trying to connect to controller (try 1)
Sep 20 15:02:56 42dot-ak7 bluechi-agent[3062]: Connecting to controller on tcp:host=192.168.16.111,port=842
Sep 20 15:02:59 42dot-ak7 bluechi-agent[3062]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Sep 20 15:02:59 42dot-ak7 bluechi-agent[3062]: Trying to connect to controller (try 2)
Sep 20 15:02:59 42dot-ak7 bluechi-agent[3062]: Connecting to controller on tcp:host=192.168.16.111,port=842
Sep 20 15:03:02 42dot-ak7 bluechi-agent[3062]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Sep 20 15:03:02 42dot-ak7 bluechi-agent[3062]: Trying to connect to controller (try 3)
Sep 20 15:03:02 42dot-ak7 bluechi-agent[3062]: Connecting to controller on tcp:host=192.168.16.111,port=842
Sep 20 15:03:05 42dot-ak7 bluechi-agent[3062]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
...
  1. call SwitchController DBus method with host 192.168.16.101
root@42dot-ak7:~# dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.101,port=842'
Error org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
  1. busctl introspect command also is failed
root@42dot-ak7:~# busctl introspect org.eclipse.bluechi.Agent /org/eclipse/bluechi
Failed to introspect object /org/eclipse/bluechi of service org.eclipse.bluechi.Agent: Connection timed out

Success case

  1. start bluechi-agent with host 192.168.16.101
  2. call SwitchController DBus method with host 192.168.16.111
  3. call SwitchController DBus method with host 192.168.16.101
root@42dot-ak7:~# cat /etc/bluechi/agent.conf 
[bluechi-agent]
NodeName=ak7_master_main
ControllerHost=192.168.16.101
HeartbeatInterval=2000
root@42dot-ak7:~# systemctl start bluechi-agent
root@42dot-ak7:~#
root@42dot-ak7:~# dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.111,port=842'
method return time=1726812297.956766 sender=:1.38 -> destination=:1.40 serial=9 reply_serial=2
root@42dot-ak7:~#
root@42dot-ak7:~# dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.101,port=842'
method return time=1726812313.323543 sender=:1.38 -> destination=:1.41 serial=14 reply_serial=2
root@42dot-ak7:~# journalctl -f -u bluechi-agent
Sep 20 15:04:34 42dot-ak7 systemd[1]: Started BlueChi systemd service controller agent daemon.
Sep 20 15:04:34 42dot-ak7 bluechi-agent[3347]: Starting bluechi-agent 0.8.0-1
Sep 20 15:04:34 42dot-ak7 bluechi-agent[3347]: Connecting to controller on tcp:host=192.168.16.101,port=842
Sep 20 15:04:34 42dot-ak7 bluechi-agent[3347]: Connected to controller as 'ak7_master_main'
Sep 20 15:04:54 42dot-ak7 bluechi-agent[3347]: CONTROLLER ADDRESS changed to tcp:host=192.168.16.111,port=842
Sep 20 15:04:54 42dot-ak7 bluechi-agent[3347]: Disconnected from controller
Sep 20 15:04:54 42dot-ak7 bluechi-agent[3347]: Connecting to controller on tcp:host=192.168.16.111,port=842
Sep 20 15:04:57 42dot-ak7 bluechi-agent[3347]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Sep 20 15:04:57 42dot-ak7 bluechi-agent[3347]: Trying to connect to controller (try 1)
Sep 20 15:04:57 42dot-ak7 bluechi-agent[3347]: Connecting to controller on tcp:host=192.168.16.111,port=842
Sep 20 15:05:01 42dot-ak7 bluechi-agent[3347]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Sep 20 15:05:01 42dot-ak7 bluechi-agent[3347]: Trying to connect to controller (try 2)
Sep 20 15:05:01 42dot-ak7 bluechi-agent[3347]: Connecting to controller on tcp:host=192.168.16.111,port=842
Sep 20 15:05:04 42dot-ak7 bluechi-agent[3347]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Sep 20 15:05:04 42dot-ak7 bluechi-agent[3347]: Trying to connect to controller (try 3)
Sep 20 15:05:04 42dot-ak7 bluechi-agent[3347]: Connecting to controller on tcp:host=192.168.16.111,port=842
Sep 20 15:05:07 42dot-ak7 bluechi-agent[3347]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Sep 20 15:05:07 42dot-ak7 bluechi-agent[3347]: Trying to connect to controller (try 4)
Sep 20 15:05:07 42dot-ak7 bluechi-agent[3347]: Connecting to controller on tcp:host=192.168.16.111,port=842
Sep 20 15:05:10 42dot-ak7 bluechi-agent[3347]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Sep 20 15:05:10 42dot-ak7 bluechi-agent[3347]: Trying to connect to controller (try 5)
Sep 20 15:05:10 42dot-ak7 bluechi-agent[3347]: Connecting to controller on tcp:host=192.168.16.111,port=842
Sep 20 15:05:13 42dot-ak7 bluechi-agent[3347]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Sep 20 15:05:13 42dot-ak7 bluechi-agent[3347]: CONTROLLER ADDRESS changed to tcp:host=192.168.16.101,port=842
Sep 20 15:05:13 42dot-ak7 bluechi-agent[3347]: Disconnected from controller
Sep 20 15:05:13 42dot-ak7 bluechi-agent[3347]: Connecting to controller on tcp:host=192.168.16.101,port=842
Sep 20 15:05:13 42dot-ak7 bluechi-agent[3347]: Connected to controller as 'ak7_master_main'

Expected behavior

When bluechi-agent never connected to bluechi-controller, if call SwitchController DBus method with host 192.168.16.101, bluechi-agent must connect to bluechi-controller of host 192.168.16.101.

root@42dot-ak7:~# dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.101,port=842'
method return time=1726812313.323543 sender=:1.38 -> destination=:1.41 serial=14 reply_serial=2
root@42dot-ak7:~# journalctl -f -u bluechi-agent
...
Sep 20 15:05:10 42dot-ak7 bluechi-agent[3347]: Trying to connect to controller (try 5)
Sep 20 15:05:10 42dot-ak7 bluechi-agent[3347]: Connecting to controller on tcp:host=192.168.16.111,port=842
Sep 20 15:05:13 42dot-ak7 bluechi-agent[3347]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Sep 20 15:05:13 42dot-ak7 bluechi-agent[3347]: CONTROLLER ADDRESS changed to tcp:host=192.168.16.101,port=842
Sep 20 15:05:13 42dot-ak7 bluechi-agent[3347]: Disconnected from controller
Sep 20 15:05:13 42dot-ak7 bluechi-agent[3347]: Connecting to controller on tcp:host=192.168.16.101,port=842
Sep 20 15:05:13 42dot-ak7 bluechi-agent[3347]: Connected to controller as 'ak7_master_main'
@dofmind dofmind added the bug Something isn't working label Sep 20, 2024
@mkemel mkemel self-assigned this Sep 26, 2024
@mkemel mkemel added the jira Issues that are synced to Jira label Sep 26, 2024
@mkemel mkemel added this to the v0.9 milestone Sep 26, 2024
@mkemel
Copy link
Member

mkemel commented Sep 29, 2024

I was able to reproduce the bug.
One thing that I have noticed, is it behaves this way only when host 192.168.16.111 does not exist on local network. When it does exist, but not running bluechi-controller - there is no problem, and the bluechi-agent's DBus API works as expected.

The reason for that is that the reconnect mechanism seats on the heartbeat timer, and as long as agent->connection_state == AGENT_CONNECTION_STATE_RETRY - it will try to reconnect on each heartbeat. The heartbeat timer is set by default to 2000ms.
On the other hand, when the host to which the agent is trying to connect to does not exist on the network, the Register method call fails by timeout, which also takes around 2 seconds or a bit more. As a result, the event_loop is always busy with trying to reconnect, and does not handle any incoming API calls. When a host does exist, but not listening on port 842 - the connection is refused immediately, so no problem occurs.

There are a couple of ways to mitigate this:

  1. The easiest way to handle it on the user level is to increase HeartbeatInterval config value. Setting it to be 5000 for example fixed the issue, but I guess lower values of 4000 or even 3000 will do the trick.
  2. Decreasing the priority for the heartbeat timer event also fixes the issue.

    I have set priority to be 50 and it made the SwitchController request work right away. For introspect though it took some time to return.
    mkemel@55e1457
    I didn't find how to increase API calls' priority instead, though that would also work if possible. I wonder if decreasing the heartbeat timer priority won't have an impact when we are having a lot of requests or something like that.
  3. We can hardcode something in agent_heartbeat_timer_callback to mitigate that specific issue.

@alexlarsson @engelmi @mwperina WDYT?

@mkemel
Copy link
Member

mkemel commented Sep 30, 2024

Another option:
4. Make the "Register" method call asynchronous

@engelmi
Copy link
Member

engelmi commented Oct 9, 2024

@dofmind Thank you for raising the issue!

@mkemel Nice analysis!
Based on the logs it really is the Register call that blocks the event loop, preventing any other request to be handled. Apparently, the opening (and starting) the D-Bus doesn't involve any (blocking) networking operation.
Regarding the different options you listed to solve/mitigate this issue:

  1. Since users of BlueChi can configure the interval, this is wouldn't be a good choice
  2. This "just" makes sure the API request is queued earlier than the heartbeat task. In addition, if the agent is busy with a lot of commands from controller, the heartbeat is queued late (if ever) - and since we disconnect on thresholds this might have bad implications.
  3. I am not sure what we can change in the callback to mitigate this specific issue - Disabling the heartbeat? Probably bad idea.
  4. Since the Register API call is blocking, making it async seems like the way to go. We shouldn't have any long-running task in the queue anyway in order to prevent any blockage.

So option 4 - having the Register API call async seems like the way to go.

@mkemel
Copy link
Member

mkemel commented Oct 9, 2024

So I'm on it.

mkemel added a commit to mkemel/bluechi that referenced this issue Oct 16, 2024
The "Register" agent->controller method call could get stuck
when the IP address configured as ControllerHost does not exist.
This timeout of about 2 seconds could completely occupy the event
loop and would not allow other calls on the DBus API of the Agent.

Making the "Register" method call asynchronous we allow the event
loop to stay unoccupied and be able to accept SwitchController
request.

Resolves: eclipse-bluechi#950
Signed-off-by: Mark Kemel <[email protected]>
mkemel added a commit to mkemel/bluechi that referenced this issue Oct 16, 2024
The "Register" agent->controller method call could get stuck
when the IP address configured as ControllerHost does not exist.
This timeout of about 2 seconds could completely occupy the event
loop and would not allow other calls on the DBus API of the Agent.

Making the "Register" method call asynchronous we allow the event
loop to stay unoccupied and be able to accept SwitchController
request.

Resolves: eclipse-bluechi#950
Signed-off-by: Mark Kemel <[email protected]>
mkemel added a commit to mkemel/bluechi that referenced this issue Oct 16, 2024
The "Register" agent->controller method call could get stuck
when the IP address configured as ControllerHost does not exist.
This timeout of about 2 seconds could completely occupy the event
loop and would not allow other calls on the DBus API of the Agent.

Making the "Register" method call asynchronous we allow the event
loop to stay unoccupied and be able to accept SwitchController
request.

Resolves: eclipse-bluechi#950
Signed-off-by: Mark Kemel <[email protected]>
mkemel added a commit to mkemel/bluechi that referenced this issue Oct 29, 2024
The "Register" agent->controller method call could get stuck
when the IP address configured as ControllerHost does not exist.
This timeout of about 2 seconds could completely occupy the event
loop and would not allow other calls on the DBus API of the Agent.

Making the "Register" method call asynchronous we allow the event
loop to stay unoccupied and be able to accept SwitchController
request.

Resolves: eclipse-bluechi#950
Signed-off-by: Mark Kemel <[email protected]>
mkemel added a commit to mkemel/bluechi that referenced this issue Oct 29, 2024
The "Register" agent->controller method call could get stuck
when the IP address configured as ControllerHost does not exist.
This timeout of about 2 seconds could completely occupy the event
loop and would not allow other calls on the DBus API of the Agent.

Making the "Register" method call asynchronous we allow the event
loop to stay unoccupied and be able to accept SwitchController
request.

Resolves: eclipse-bluechi#950
Signed-off-by: Mark Kemel <[email protected]>
mkemel added a commit to mkemel/bluechi that referenced this issue Oct 30, 2024
The "Register" agent->controller method call could get stuck
when the IP address configured as ControllerHost does not exist.
This timeout of about 2 seconds could completely occupy the event
loop and would not allow other calls on the DBus API of the Agent.

Making the "Register" method call asynchronous we allow the event
loop to stay unoccupied and be able to accept SwitchController
request.

Resolves: eclipse-bluechi#950
Signed-off-by: Mark Kemel <[email protected]>
mkemel added a commit to mkemel/bluechi that referenced this issue Oct 30, 2024
The "Register" agent->controller method call could get stuck
when the IP address configured as ControllerHost does not exist.
This timeout of about 2 seconds could completely occupy the event
loop and would not allow other calls on the DBus API of the Agent.

Making the "Register" method call asynchronous we allow the event
loop to stay unoccupied and be able to accept SwitchController
request.

Resolves: eclipse-bluechi#950
Signed-off-by: Mark Kemel <[email protected]>
mkemel added a commit to mkemel/bluechi that referenced this issue Oct 30, 2024
The "Register" agent->controller method call could get stuck
when the IP address configured as ControllerHost does not exist.
This timeout of about 2 seconds could completely occupy the event
loop and would not allow other calls on the DBus API of the Agent.

Making the "Register" method call asynchronous we allow the event
loop to stay unoccupied and be able to accept SwitchController
request.

Resolves: eclipse-bluechi#950
Signed-off-by: Mark Kemel <[email protected]>
mkemel added a commit to mkemel/bluechi that referenced this issue Oct 30, 2024
The "Register" agent->controller method call could get stuck
when the IP address configured as ControllerHost does not exist.
This timeout of about 2 seconds could completely occupy the event
loop and would not allow other calls on the DBus API of the Agent.

Making the "Register" method call asynchronous we allow the event
loop to stay unoccupied and be able to accept SwitchController
request.

Resolves: eclipse-bluechi#950
Signed-off-by: Mark Kemel <[email protected]>
mkemel added a commit to mkemel/bluechi that referenced this issue Oct 30, 2024
The "Register" agent->controller method call could get stuck
when the IP address configured as ControllerHost does not exist.
This timeout of about 2 seconds could completely occupy the event
loop and would not allow other calls on the DBus API of the Agent.

Making the "Register" method call asynchronous we allow the event
loop to stay unoccupied and be able to accept SwitchController
request.

Resolves: eclipse-bluechi#950
Signed-off-by: Mark Kemel <[email protected]>
mkemel added a commit that referenced this issue Oct 30, 2024
The "Register" agent->controller method call could get stuck
when the IP address configured as ControllerHost does not exist.
This timeout of about 2 seconds could completely occupy the event
loop and would not allow other calls on the DBus API of the Agent.

Making the "Register" method call asynchronous we allow the event
loop to stay unoccupied and be able to accept SwitchController
request.

Resolves: #950
Signed-off-by: Mark Kemel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working jira Issues that are synced to Jira
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants