edgehub does not reconnect through AMQPS after being offline for a few days #6421
Comments
@emilm Thank you for reporting this and for the repro steps. You are right that the scenario where the local clock gets synchronized after some delay should work. I am opening a bug ticket for this, and it will be fixed.
Thanks, @vipeller! Do you have a suggestion for a workaround in the meantime that can be implemented in code? It works when iotedged is restarted (not the hub or agent, or any of my services). It could be a little difficult to restart a systemd service from within a container.
@emilm Sorry, I have not dug into the code yet. I need to understand the root cause before I can suggest a workaround for the interim. I'll come back with more information later.
Hi @emilm, just an update that I have started working on this. So far I have checked the source code, which seems to be written so that when the connection fails (for any reason, including an invalid token), it regenerates the token. I tried it from Visual Studio by manipulating the time, and it recovers after the correct time is set.
Some more updates: I tried to repro it on Ubuntu 20.04. One difference is that I use the 1.2 services (as I already had them installed), but with the 1.14 agent and hub. It works differently: if I start the edge with the time set back (by a week), I see the errors you see. But for some reason, when I set the time (using timedatectl set-ntp 1), the containers get recreated immediately (and therefore everything works afterwards). I'll keep trying with the repro.
Hello! Thanks for trying to reproduce. I use 1.1.13: https://github.com/Azure/azure-iotedge/releases/download/1.1.13/iotedge-1.1.13.tar.gz and mcr.microsoft.com/azureiotedge-hub:1.1.13-linux-arm32v7. Set the date to 5+ days behind before starting iotedged and any container, then start them. I would like to try 1.1.14, but that release seems to be packaged differently, so it doesn't have the same structure.
Hey @emilm, did you manage to make any headway with this?
Hello @rahulrana-XOM! As you can see, these are quite clunky and dirty solutions, and they are time-consuming and costly to implement. By the way, how large does the clock drift have to be for it to be invalid?
@emilm I've been trying to repro it. Last time I used a Raspberry Pi for 32-bit, but either it works the way I test it, or it works differently than you described. On the Pi I tested two ways; the first step in both cases is to set the time back 5 days. If I then restart the device, the edge daemon tries to connect to IoT Hub, which fails with a token problem, and in the end no modules are started. If I understand correctly, this is not what happens to you. If I don't restart the entire device and only stop the containers, they come back up showing the same error message you have. But when I turn time sync on and the time gets synced, both recover. A call would be best, so we can look closely at how you do things; maybe that would help me repro it on my side. If you shoot me a mail (the address is my GitHub ID at the MS domain), I can set up a call.
Thanks a lot, @vipeller! I am definitely not ruling out that I am doing something wrong. The symptoms could be similar while the underlying problem is different. But I can consistently reproduce it. I will shoot you a mail, thanks!
This issue is being marked as stale because it has been open for 30 days with no activity.
Expected Behavior
It should reauthenticate through TPM when booting
Current Behavior
It fails to get new token in edgeHub
Steps to Reproduce
Provide a detailed set of steps to reproduce the bug.
Context (Environment)
Embedded
Output of iotedge check
Device Information
Runtime Versions
Logs
aziot-edged logs
edge-agent logs
edge-hub logs
```
<6> 2022-06-07 13:16:59.296 +00:00 [INF] - Error getting cloud connection for device /$edgeHub
Microsoft.Azure.Devices.Client.Exceptions.UnauthorizedException: error(condition:amqp:unauthorized-access,description:Put token failed. status-code: 401, status-description: The specified SAS token has an invalid signature. It does not match either the primary or secondary key..)
 ---> Microsoft.Azure.Amqp.AmqpException: Put token failed. status-code: 401, status-description: The specified SAS token has an invalid signature. It does not match either the primary or secondary key..
   at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
   at Microsoft.Azure.Amqp.AmqpCbsLink.<>c__DisplayClass4_0.b__1(IAsyncResult a)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout)
   --- End of inner exception stack trace ---
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTCbsLink.SendTokenAsync(ICbsTokenProvider tokenProvider, Uri namespaceAddress, String audience, String resource, String[] requiredClaims, TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpAuthenticationRefresher.InitLoopAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpIoTConnection.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpConnectionHolder.CreateRefresherAsync(DeviceIdentity deviceIdentity, TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.EnsureSessionAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.AmqpIoT.AmqpUnit.OpenAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.OpenAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>c__DisplayClass23_0.<b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func`1 asyncOperation)
   at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.<>c__DisplayClass33_0.<b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.EnsureOpenedAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.InternalClient.OpenAsync()
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 144
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ModuleClientWrapper.OpenAsync() in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ModuleClientWrapper.cs:line 63
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.<>c__DisplayClass30_0.<b__0>d.MoveNext() in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ConnectivityAwareClient.cs:line 195
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.InvokeFunc[T](Func`1 func, String operation, Boolean useForConnectivityCheck) in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ConnectivityAwareClient.cs:line 188
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.InvokeFunc[T](Func`1 func, String operation, Boolean useForConnectivityCheck) in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ConnectivityAwareClient.cs:line 188
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.OpenAsync() in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ConnectivityAwareClient.cs:line 64
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection.ConnectToIoTHub(ITokenProvider newTokenProvider) in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnection.cs:line 134
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection.CreateNewCloudProxyAsync(ITokenProvider newTokenProvider) in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnection.cs:line 117
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection.Create(IIdentity identity, Action`2 connectionStatusChangedHandler, ITransportSettings[] transportSettings, IMessageConverterProvider messageConverterProvider, IClientProvider clientProvider, ICloudListener cloudListener, ITokenProvider tokenProvider, TimeSpan idleTimeout, Boolean closeOnIdleTimeout, TimeSpan operationTimeout, String productInfo, Option`1 modelId) in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnection.cs:line 99
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnectionProvider.Connect(IClientCredentials clientCredentials, Action`2 connectionStatusChangedHandler) in /mnt/vss/_work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnectionProvider.cs:line 135
<6> 2022-06-07 13:17:03.487 +00:00 [INF] - Entering periodic task to reauthenticate connected clients
```
Additional Information
This has happened between several iotedge releases - it fails to reconnect after being offline for a few days. I am not sure what I am doing wrong.
I use iot-device-client 1.34.3 from Maven calling open() until it succeeds...
I have persistent storage of the container states
I am not using an RTC, but I save the date regularly to a file and load it at the earliest possible point after boot. The clock would still be days behind today's date until systemd-timesyncd syncs the time. Could it be that it gets the token before the time is synced? Shouldn't it try to get a token again if it fails? I am 99% sure this is what is happening, since I am able to reproduce it with the clock set in the past. It should seriously recover from this, in my opinion.
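To illustrate the suspicion above, here is a minimal Python sketch of how an IoT Hub style SAS token takes its expiry (`se`) from the local clock, and why a retry only helps if the token is regenerated from the current clock on each attempt. It is an illustration under stated assumptions, not the edgeHub implementation: it signs with a local symmetric key, whereas this device authenticates through the TPM, and `connect_with_fresh_token`, `try_connect`, and `clock` are hypothetical names introduced here. The log above reports an invalid signature rather than an expired token, so whether this matches the actual root cause is not established.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def generate_sas_token(resource_uri, key_b64, now, ttl_seconds=3600):
    """Build a SAS token whose expiry is derived from `now` (Unix seconds)."""
    expiry = int(now) + ttl_seconds
    encoded_uri = urllib.parse.quote(resource_uri, safe="")
    # The signed string is the URL-encoded resource URI, a newline, and the expiry.
    to_sign = f"{encoded_uri}\n{expiry}".encode("utf-8")
    digest = hmac.new(base64.b64decode(key_b64), to_sign, hashlib.sha256).digest()
    signature = urllib.parse.quote(base64.b64encode(digest), safe="")
    return f"SharedAccessSignature sr={encoded_uri}&sig={signature}&se={expiry}"


def token_expiry(token):
    """Extract the `se` field (expiry in Unix seconds) from a SAS token."""
    fields = dict(part.split("=", 1) for part in token.split(" ", 1)[1].split("&"))
    return int(fields["se"])


def connect_with_fresh_token(try_connect, clock, resource_uri, key_b64,
                             attempts=5, delay_seconds=0.0):
    """Hypothetical retry loop: rebuild the token from clock() on every attempt.

    `try_connect(token)` stands in for whatever opens the connection and
    returns True on success. A token cached from the first attempt would keep
    the expiry computed while the clock was still wrong.
    """
    for _ in range(attempts):
        token = generate_sas_token(resource_uri, key_b64, clock())
        if try_connect(token):
            return token
        time.sleep(delay_seconds)
    return None
```

With a clock five days behind and a one-hour lifetime, the `se` value is already in the past from the service's point of view, so the token is rejected no matter how often it is resent. Once the clock is corrected, the next regenerated token carries a valid expiry.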
Deployment config for edgeHub / edgeAgent