
fix(discovery-client): allow discovery client to track network interfaces #7608

Merged
merged 2 commits into edge from discovery_track-iface-changes on Apr 14, 2021

Conversation

Contributor

@mcous mcous commented Apr 7, 2021

overview

Previously, #5933 added network interface checking to enable Discovery Client restarts when the computer's active network interface list changed. This was required because:

  • The discovery client's mDNS browser binds to network interfaces when it's created, so advertisements coming in on new interfaces would never be caught
  • Plugging an OT-2 into the app via "USB" actually brings up an entire USB-to-Ethernet (U2E) network interface

The implementation relied on the app-shell module to check the system's network interfaces and, if they changed, restart the entire discovery client. Due either to limitations in that implementation or to changes in the discovery client since then, this functionality isn't working as well as we thought it was, or has broken entirely, as confirmed by real-world testing.

This worked, but only because the Discovery Client was also re-sending its discovery query on an interval, which ensured that once the DC picked up a new interface, it would (eventually) issue a discovery query on that interface. This was not an intentional effect of the re-query functionality, so when the re-query was removed per #5985 and #6193 added fairly aggressive discovery cache cleanup, robot rediscovery over the U2E interface became much less reliable.

This PR fixes the "hotplug" functionality, but on purpose this time. Related to #6585, in that this problem appears to be a root cause of the app never seeing an updated robot.

changelog

Rather than rely on its caller, this PR pulls the network interface check directly into the Discovery Client in order to:

  • Compare the system's network interface list directly with the interfaces that the mDNS browser has bound to (a rough sketch of this comparison follows these lists)
  • Restart the mDNS browser rather than the entire Discovery Client on changes

It also:

  • Reintroduces periodic rediscovery queries, but this time on an exponentially backed-off interval, topping out at once every 2 minutes

This brings the following benefits:

  • As a standalone application / library, the Discovery Client can support U2E robot hotplugs on its own
    • This has very positive implications for stuff like factory QC tools and support tools
  • Instead of restarting on "any" change, we now restart if we know the browser is not bound to known system interfaces (or if it's bound to too many)
    • The goal here is to restart more reliably and have some retry built in if we fail to bind to an interface on the first go or something
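As an illustration only (not the PR's actual code), the comparison could look roughly like the sketch below, built on Node's os.networkInterfaces(). The extra / missing naming mirrors the log entry shown in the review steps; compareInterfaces and BoundInterface are hypothetical names.

import { networkInterfaces } from 'os'

// Hypothetical shape for "what the mDNS browser is currently bound to"
export interface BoundInterface {
  name: string
  address: string
}

// Compare the browser's bound interfaces against what the OS reports right now.
// "extra" = bound to something the OS no longer lists; "missing" = the OS lists
// an interface/address the browser is not bound to. Either one means a restart.
export function compareInterfaces(bound: BoundInterface[]) {
  const system: BoundInterface[] = Object.entries(networkInterfaces()).flatMap(
    ([name, addresses]) =>
      (addresses ?? [])
        // assumption for this sketch: only external IPv4 addresses matter
        .filter(a => !a.internal && a.family === 'IPv4')
        .map(a => ({ name, address: a.address }))
  )

  const key = (iface: BoundInterface): string => `${iface.name}/${iface.address}`
  const systemKeys = new Set(system.map(key))
  const boundKeys = new Set(bound.map(key))

  const extra = bound.filter(iface => !systemKeys.has(key(iface)))
  const missing = system.filter(iface => !boundKeys.has(key(iface)))

  return { interfacesMatch: extra.length === 0 && missing.length === 0, extra, missing }
}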

Review requests

Basic test procedure with Discovery Client CLI:

  1. Make sure your app and OT-2 can only talk over U2E, and that the OT-2 is not yet plugged into the PC
    • e.g. disconnect your OT-2 from Wi-Fi
  2. Build and start the discovery client CLI
    make -C discovery-client && yarn discovery browse --logLevel debug
  3. Once the DC is running, plug the OT-2 in via U2E
  4. Robot should eventually show up

Basic test procedure with app:

  1. Make sure your app and OT-2 can only talk over U2E, and that the OT-2 is not yet plugged into the PC
    • e.g. disconnect your OT-2 from Wi-Fi
  2. Clear the app's discovery cache with "More > Network & System > Clear Discovered Robots List"
  3. Close the app and re-open via the command line so you can see the logs
    make -C app dev
  4. Wait a minute or so, and make sure the discovery spinner is not running
  5. Watch for a log entry that looks like:
    • debug: Restarting mDNS due to network interface mismatch { extra: [], missing: [ ... ] }
  6. The robot should show up as connectable in the list without restarting discovery via the button

Risk assessment

Medium. There's a chance here that this mDNS browser restart strategy is overly aggressive in a way that could result in either excessive network load, missing valid advertisements due to frequent restarts, or both.

I've added app logging to record when these restarts happen, so we should carefully study these logs while testing and, if possible, stress test this change somehow.

If we saw the log happening every 5 seconds (its poll interval) on a given computer, that would be a red flag and force us to reconsider this approach.

…ace changes

Rather than rely on its caller, this change allows the Discovery Client to check the system's
network interfaces to determine if its mDNS browser needs to be restarted to bind on any newly added
interfaces (or to flush out any removed interfaces). This change is necessary to increase Discovery
Client reliability when robots connected to U2E adapters are plugged into and unplugged from the user's
computer.
@mcous mcous added the fix (PR fixes a bug), robot-svcs (Robot Services squad, formerly CPX, Core Platform Experience), and discovery-client labels Apr 7, 2021
@mcous mcous requested a review from a team April 7, 2021 21:57
@mcous mcous requested review from a team as code owners April 7, 2021 21:57
@mcous mcous requested review from SyntaxColoring and b-cooper and removed request for a team April 7, 2021 21:57
Contributor

@SyntaxColoring SyntaxColoring left a comment

  • The basic idea of this PR—move responsibility of keeping track of interfaces to inside the discovery client—sounds great to me.

  • I went through the basic test procedure. What I saw matched what you described.

  • I've only very briefly looked at the code, so far.

  • I've added app logging to record when these restarts happen, so we should carefully study these logs and, if possible, stress test this change somehow.

    How will we do this? I thought only robot logs get forwarded to us over the Internet?

if (!interfacesMatch) {
Contributor

Is it right to use the set of interface names changing across polls as the criterion for restarting the mDNS browser, like this?

For example, this sequence:

  1. Poll the OS's current set of network interface names.
  2. An interface goes down.
  3. That interface comes back up.
  4. Again poll the OS's current set of network interface names.

The set of network interface names doesn't change between steps 1 and 4. But the interface still goes down and comes back up.

I guess this hinges on what it really means for the mDNS browser to "bind to network interfaces when it's created." What's it latching onto? An IP? An OS handle to a hardware device? Is the handle invalidated when the hardware device is removed?

Member

"network interfaces" are logical interfaces to hardware, with many-to-many bindings to actual hardware and occasionally fully logical implementations - consider a loopback iface, or a bridge iface that really only exists to route packets between two other ifaces. Depending on the os, names may be unique plug-to-plug; more often, they won't be. interfaces is a wrapper around node's os.networkInterfaces in particular, which appears to elide any os-provided unique ids for ifaces even if they do happen to exist. Looks like the implementation boils down to getifaddrs(3) on linux and similar on other platforms, so it'll get any iface with a bound address (and no others, which probably isn't a problem for us because the robot won't send mdns packets over an iface with no address anyway).

meanwhile, the problem we're trying to get around is

  • mdns-js will open a socket on the appropriate port
  • if the iface goes away, the socket will go away (and it only maybe gets reported)
  • even if it does get reported, we're presumably not really detecting the problem
  • so we kind of need to brute force it

this all adds up to: yeah, you could have a kind of problem with this implementation where the iface goes away and comes back all in between polls and you never detect the difference, but the socket has in fact gone away. In practice I think this may not actually happen because local IP namespaces take forever to establish; but it may be worth looking into whether there's an error we can hook that will actually tell us when the socket goes away, if it exists (which it may not).

Contributor Author

@mcous mcous Apr 8, 2021

So this is mostly problematic on link-local; I wonder if, in the absence of anything better, we could check the IP address of the interface, too? The computer's link-local address will likely change (I think?) on a U2E hotplug.

Looking into it a little bit more, we have access to the UDP socket, so we could (presumably) listen for the error and close events ourselves
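A minimal sketch of that idea, assuming the browser exposes its underlying dgram socket (watchSocket and onTrouble are hypothetical names). Note that the follow-up testing below found these events don't actually fire when the interface disappears.

import type { Socket } from 'dgram'

// Hypothetical: hook the socket's own lifecycle events instead of (or alongside)
// polling the interface list.
function watchSocket(socket: Socket, onTrouble: (reason: string) => void): void {
  socket.once('error', error => onTrouble(`socket error: ${error.message}`))
  socket.once('close', () => onTrouble('socket closed'))
}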

Contributor

The computer's link-local address will likely change (I think?) on a U2E hotplug

Anecdotally, I think I've seen the IP address not change (and been surprised by that). But if the IP address doesn't change, maybe that also suggests that the original binding wasn't invalidated, either? Hmm.

Member

@sfoster1 sfoster1 Apr 9, 2021

I think being able to get the udp socket's events will help. What we expect to happen is

  • the network interface goes away
  • any ongoing selects or reads should return an EIO or similar error
  • that should bubble up through the Node runtime and trigger the socket's event callbacks

The normal danger with sockets is that you can't really detect when the other side stops listening especially on udp, but because what we're detecting is the local interface going away, I'd hope that we get an error. That said, it's absolutely in the "hope so!" category, and may be platform-dependent.

Contributor Author

@mcous mcous Apr 9, 2021

I've done some testing on macOS with our underlying mdns-js and the Node.js UDP sockets it exposes. Results are... not ideal.

  • A Node.js UDP socket does not emit any sort of event on the socket itself if the interface goes away
    • A given send will fail with EADDRNOTAVAIL, though
    • Unlike socket-level events, we do not have a convenient way to capture send errors from mdns-js
    • Additionally, the mDNS library really only sends once for the initial discovery query at boot
  • The race condition @sfoster1 described above is definitely possible and repeatable; if the interface goes down and comes back up within an interface-check poll interval, we basically lose the socket for receiving mDNS packets

Based on this testing, the TL;DR is that we'd have to write our own mDNS browser (possibly / probably not in Node.js) to actually cover our bases here. I don't want to do that (right now), so here are my thoughts:

  • A 5-second polling interval for network interface changes is actually fairly effective, so I'll bring the interval down from 10 seconds to 5
    • Given the typical timing of interface bring-up on both ends of the U2E adapter, I have not been able to sneak an iface up/down/up through this interval
    • A polling interval in the range of ~1 minute is definitely big enough to sneak through, though
    • Our existing slow poll of network interfaces is 30 seconds, which feels big enough to sneak through
  • If a robot reboots and comes back with the same IP address, we're fine because of the discovery client's HTTP health polling loop
  • I think I will add IP address checking to the network interface comparison, just to cover an extra base if something sneaks past the 5 second poll interval

That leaves us with one case where we're in trouble:

  • Robot reboots and self-assigns a new IP address
  • PC self-assigns the same IP address
  • The interface up/down/up on the computer-side sneaks past the 5-second check interval

In that case, we "lose" the robot. However, we only lose it until the user manually triggers a discovery re-check. Combined with the fact that full-page restart spinners won't stick around forever (#7589), I think I'm willing to call that good enough until we can write our own mDNS browser to handle this properly.
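Putting those pieces together, a rough sketch (not the merged code) of the resulting poll; compareInterfaces is the hypothetical helper sketched in the PR description above, and the other names are placeholders:

// Placeholders for this sketch; wire these to the real discovery client pieces.
declare function compareInterfaces(
  bound: Array<{ name: string; address: string }>
): { interfacesMatch: boolean; extra: unknown[]; missing: unknown[] }
declare function getBoundInterfaces(): Array<{ name: string; address: string }>
declare function restartMdnsBrowser(): void
declare const log: { debug: (message: string, meta?: unknown) => void }

const INTERFACE_POLL_INTERVAL_MS = 5000

const pollId = setInterval(() => {
  const { interfacesMatch, extra, missing } = compareInterfaces(getBoundInterfaces())

  if (!interfacesMatch) {
    // mirrors the log entry called out in the review steps above
    log.debug('Restarting mDNS due to network interface mismatch', { extra, missing })
    restartMdnsBrowser()
  }
}, INTERFACE_POLL_INTERVAL_MS)

// call clearInterval(pollId) when the discovery client stops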

Member

Agreed, I think that's good enough. I think we can test whether it's a Node problem or a "that's how sockets work" problem that we don't get an error when listening on a socket whose iface goes away, but I also think that's not super relevant.

@mcous mcous added the WIP label Apr 8, 2021
@mcous mcous requested a review from a team as a code owner April 9, 2021 20:57
Contributor Author

mcous commented Apr 9, 2021

Alright, this has been an... adventure. I think this is finally ready to go. Here are the latest updates:

  1. Shortened the OS network interface poll to 5 seconds
    • This is quick enough to catch robot reboots reliably in my testing
    • Technically, a race condition still exists here, but it's quite difficult to resolve without writing our own mDNS browser
  2. Added an explicit re-query call on an (exponentially backed-off) interval
    • (1) by itself doesn't resolve the issue, because the interface comes up on the PC side before it comes up on the OT-2 side
    • So, there's a failure mode where the PC interface comes up, then the PC issues its mDNS query, then the robot comes up, and the robot misses the query, so it never advertises itself

With (1) and (2) combined, we get very reliable robot hotplugs. There's probably some timing tuning we could do, but as is, it will eventually pick up the robot.
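For illustration, the backed-off re-query in (2) can be as simple as the sketch below. The 2-minute cap comes from the PR description; the 5-second starting delay and the names scheduleRequery / sendQuery are assumptions.

const MAX_REQUERY_INTERVAL_MS = 120_000 // top out at once every 2 minutes

// Schedule a query after the initial delay, then keep re-sending with an
// exponentially growing delay. Returns a function that cancels the schedule.
function scheduleRequery(sendQuery: () => void, initialDelayMs = 5000): () => void {
  let delay = initialDelayMs
  let timeoutId: ReturnType<typeof setTimeout>

  const tick = (): void => {
    sendQuery()
    delay = Math.min(delay * 2, MAX_REQUERY_INTERVAL_MS)
    timeoutId = setTimeout(tick, delay)
  }

  timeoutId = setTimeout(tick, delay)
  return () => clearTimeout(timeoutId)
}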


As an aside, it turns out this is a regression introduced by #6193 and by the removal of the functionality called out in #5985. The "old" discovery client re-queried the mDNS browser on an interval for an entirely different purpose, inadvertently ensuring that the network interface checking added by #5933 actually solved the problem.

When the re-query behavior was removed, the network interface checking that was in place was not sufficient to allow robots to be hotplugged.

@mcous mcous removed the WIP label Apr 9, 2021
// this method can throw (without emitting), so we need to patch this up
const originalServiceTypeFromString = ServiceType.prototype.fromString

ServiceType.prototype.fromString = function (...args) {
  try {
    originalServiceTypeFromString.apply(this, args)
  } catch (error) {
    console.warn(error)
  }
}
Member

should we try and upstream this?

Contributor Author

Yeah probably. Upstream has not been touched in a while and it's an old issue, but worth a shot

Member

@sfoster1 sfoster1 left a comment

Nice! Looks good, hopefully this will be enough going forward.

@SyntaxColoring SyntaxColoring self-requested a review April 12, 2021 13:55
Contributor

@SyntaxColoring SyntaxColoring left a comment

This is mostly over my head now, but one thing:

Added an explicit re-query call on an (exponentially backed-off) interval

I don't understand why this is necessary. Even if the OT-2 misses the first query, shouldn't it automatically announce itself when it detects its network interface has come up?

Whenever a Multicast DNS responder starts up, wakes up from sleep, receives an indication of a network interface "Link Change" event, or has any other reason to believe that its network connectivity may have changed in some relevant way, it MUST perform the two startup steps below: Probing (Section 8.1) and Announcing (Section 8.3).

https://tools.ietf.org/html/rfc6762#section-8

@mcous mcous added the qa (QA input / review required) label Apr 13, 2021
Contributor Author

mcous commented Apr 14, 2021

QA approved for more thorough testing in release test suite

@mcous mcous merged commit c526b19 into edge Apr 14, 2021
@mcous mcous deleted the discovery_track-iface-changes branch April 14, 2021 17:10