Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to trigger SwitchController DBus method when bluechi-agent tries reconnecting to bluechi-controller #966

Closed
dofmind opened this issue Oct 24, 2024 · 4 comments · Fixed by #961
Assignees
Labels
bug Something isn't working

Comments

@dofmind
Copy link
Contributor

dofmind commented Oct 24, 2024

Describe the bug

I have an issue where calling SwitchController DBus method on bluechi-agent fails with a timeout when bluechi-agent tries reconnecting to bluechi-controller after bluechi-agent disconnects to bluechi-controller by network disconnection of node that bluechi-controller is running.

To Reproduce

There are two host 192.168.16.101, 192.168.16.102 that are running bluechi-controller.
There is a host 192.168.16.103 that is running bluechi-agent.
Three hosts are connected to the same local network.

  1. Start bluechi-agent on host 192.168.16.103 and the bluechi-agent connected to bluechi-controller of host 192.168.16.101
Oct 24 14:55:47 42dot-ak7 bluechi-agent[636331]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 24 14:55:47 42dot-ak7 bluechi-agent[636331]: Connected to controller as 'ak7_master_main'
  1. Unplug network cable of host 192.168.16.101, then bluechi-agent disconnects to bluechi-controller of host 192.168.16.101
Oct 24 14:56:51 42dot-ak7 bluechi-agent[636331]: Did not receive heartbeat from controller since '2500'ms. Disconnecting it...
Oct 24 14:56:51 42dot-ak7 bluechi-agent[636331]: Disconnected from controller
Oct 24 14:56:51 42dot-ak7 bluechi-agent[636331]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 24 14:57:56 42dot-ak7 bluechi-agent[636331]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Oct 24 14:57:56 42dot-ak7 bluechi-agent[636331]: Trying to connect to controller (try 1)
Oct 24 14:57:56 42dot-ak7 bluechi-agent[636331]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 24 14:57:59 42dot-ak7 bluechi-agent[636331]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Oct 24 14:57:59 42dot-ak7 bluechi-agent[636331]: Trying to connect to controller (try 2)
Oct 24 14:57:59 42dot-ak7 bluechi-agent[636331]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 24 14:58:02 42dot-ak7 bluechi-agent[636331]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
...
  1. Trigger the SwitchController on host 192.168.16.103 to switch controller to 192.168.16.102 before the first "Registering as 'ak7_master_main' failed: Transport endpoint is not connected" message appears in, then i get an error with timeout.
root@42dot-ak7:~# dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.102,port=842'
Error org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Expected behavior

Under the above conditions, if we trigger SwitchController on host 192.168.16.103 to switch the controller to 192.168.16.102, the SwitchController trigger should succeed.

root@42dot-ak7:~# dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.102,port=842'
method return time=1729749456.721574 sender=:1.899 -> destination=:1.1191 serial=338 reply_serial=2
@dofmind dofmind added the bug Something isn't working label Oct 24, 2024
@dofmind
Copy link
Contributor Author

dofmind commented Oct 25, 2024

This is similar with #950. However, it takes too long time than failure case of #950 issue until the first connection is failed. Also if i don't trigger SwitchController until the first connection is failed, after the first "Registering as ..." message appears in, triggering SwitchController is successful, but if i triggered SwitchController before the first connection is failed, after the first "Registering as ..." message appears in, triggering SwitchController is also failed with timeout.

@dofmind
Copy link
Contributor Author

dofmind commented Oct 25, 2024

I tested this after applying PR - #961, but got the same result.
I triggered SwitchController DBus method right after "Trying to connect to controller (try 1)" message appeared.

Oct 25 15:19:10 42dot-ak7 bluechi-agent[677329]: Did not receive heartbeat from controller since '2500'ms. Disconnecting it...
Oct 25 15:19:10 42dot-ak7 bluechi-agent[677329]: Disconnected from controller
Oct 25 15:19:10 42dot-ak7 bluechi-agent[677329]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 25 15:19:10 42dot-ak7 bluechi-agent[677329]: Agent connection attempt failed, retrying
Oct 25 15:19:10 42dot-ak7 bluechi-agent[677329]: Trying to connect to controller (try 1)
Oct 25 15:20:16 42dot-ak7 bluechi-agent[677329]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 25 15:20:16 42dot-ak7 bluechi-agent[677329]: Agent connection attempt failed, retrying
Oct 25 15:20:16 42dot-ak7 bluechi-agent[677329]: Trying to connect to controller (try 2)
Oct 25 15:20:19 42dot-ak7 bluechi-agent[677329]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 25 15:20:19 42dot-ak7 bluechi-agent[677329]: Agent connection attempt failed, retrying
Oct 25 15:20:19 42dot-ak7 bluechi-agent[677329]: Trying to connect to controller (try 3)
Oct 25 15:20:22 42dot-ak7 bluechi-agent[677329]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 25 15:20:22 42dot-ak7 bluechi-agent[677329]: Agent connection attempt failed, retrying
Oct 25 15:20:22 42dot-ak7 bluechi-agent[677329]: Trying to connect to controller (try 4)
...

@engelmi
Copy link
Member

engelmi commented Oct 28, 2024

I was able to reproduce this issue and I think it is related to #950 - the TCP mechanism for retransmitting the TCP SYN packages keeps the "connection alive" so we have to forcefully terminate it in cases such as switching the controller. So it should be solved by #961, but the current solution there doesn't fix the problem yet. We are working on it - unfortunately, its a bit more tricky than initially anticipated.

@mkemel
Copy link
Member

mkemel commented Oct 31, 2024

@dofmind the fix for this issue and for #950 was merged, it works well on our side now, please make sure that it works well on your end as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants