
fix(discovery-client): allow discovery client to track network interfaces #7608

Merged
merged 2 commits into edge from discovery_track-iface-changes on Apr 14, 2021

Conversation

Contributor

@mcous mcous commented Apr 7, 2021

overview

Previously, #5933 added network interface checking to enable Discovery Client restarts when the computer's active network interface list changed. This was required because:

  • The discovery client's mDNS browser binds to network interfaces when it's created, so advertisements coming in on new interfaces would never be caught
  • Plugging an OT-2 into the app via "USB" actually brings up an entire USB-to-Ethernet (U2E) network interface

The implementation relied on the app-shell module to check the system's network interfaces and, if they changed, restart the entire discovery client. Due either to limitations in that implementation or to changes in the discovery client since then, this functionality isn't working as well as we thought it was, or has broken entirely, as confirmed by real-world testing.

This worked, but only because the Discovery Client was also re-sending its discovery query on an interval, which ensured that once the DC picked up a new interface, it would (eventually) issue a discovery query on that interface. This was not an intentional effect of the re-query functionality, so when the re-query was removed per #5985 and #6193 added fairly aggressive discovery cache cleanup, robot rediscovery over the U2E interface became much less reliable.

This PR fixes the "hotplug" functionality, but on purpose this time. Related to #6585, in that this problem appears to be a root cause of the app never seeing an updated robot.

changelog

Rather than rely on its caller, this PR pulls the network interface check directly into the Discovery Client in order to:

  • Compare the system's network interface list directly with the interfaces that the mDNS browser has bound to (a rough sketch of this comparison follows these lists)
  • Restart the mDNS browser rather than the entire Discovery Client on changes

It also:

  • Reintroduces periodic rediscovery queries, but this time on an exponentially backed-off interval, topping out at once every 2 minutes

This brings the following benefits:

  • As a standalone application / library, the Discovery Client can support U2E robot hotplugs on its own
    • This has very positive implications for stuff like factory QC tools and support tools
  • Instead of restarting on "any" change, we now restart if we know the browser is not bound to known system interfaces (or if it's bound to too many)
    • The goal here is to restart more reliably and have some retry built in if we fail to bind to an interface on the first go or something
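As an illustration only (not the PR's actual code), the comparison could look roughly like the sketch below, built on Node's os.networkInterfaces(). The extra / missing naming mirrors the log entry shown in the review steps; compareInterfaces and BoundInterface are hypothetical names.

import { networkInterfaces } from 'os'

// Hypothetical shape for "what the mDNS browser is currently bound to"
export interface BoundInterface {
  name: string
  address: string
}

// Compare the browser's bound interfaces against what the OS reports right now.
// "extra" = bound to something the OS no longer lists; "missing" = the OS lists
// an interface/address the browser is not bound to. Either one means a restart.
export function compareInterfaces(bound: BoundInterface[]) {
  const system: BoundInterface[] = Object.entries(networkInterfaces()).flatMap(
    ([name, addresses]) =>
      (addresses ?? [])
        // assumption for this sketch: only external IPv4 addresses matter
        .filter(a => !a.internal && a.family === 'IPv4')
        .map(a => ({ name, address: a.address }))
  )

  const key = (iface: BoundInterface): string => `${iface.name}/${iface.address}`
  const systemKeys = new Set(system.map(key))
  const boundKeys = new Set(bound.map(key))

  const extra = bound.filter(iface => !systemKeys.has(key(iface)))
  const missing = system.filter(iface => !boundKeys.has(key(iface)))

  return { interfacesMatch: extra.length === 0 && missing.length === 0, extra, missing }
}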

Review requests

Basic test procedure with Discovery Client CLI:

  1. Make sure your app and OT-2 can only talk over U2E, and that the OT-2 is not yet plugged into the PC
    • e.g. disconnect your OT-2 from Wi-Fi
  2. Build and start the discovery client CLI
    make -C discovery-client && yarn discovery browse --logLevel debug
  3. Once the DC is running, plug the OT-2 in via U2E
  4. Robot should eventually show up

Basic test procedure with app:

  1. Make sure your app and OT-2 can only talk over U2E, and that the OT-2 is not yet plugged into the PC
    • e.g. disconnect your OT-2 from Wi-Fi
  2. Clear the app's discovery cache with "More > Network & System > Clear Discovered Robots List"
  3. Close the app and re-open via the command line so you can see the logs
    make -C app dev
  4. Wait a minute or so, and make sure the discovery spinner is not running
  5. Watch for a log entry that looks like:
    • debug: Restarting mDNS due to network interface mismatch { extra: [], missing: [ ... ] }
  6. The robot should show up as connectable in the list without restarting discovery via the button

Risk assessment

Medium. There's a chance here that this mDNS browser restart strategy is overly aggressive in a way that could result in either excessive network load, missing valid advertisements due to frequent restarts, or both.

I've added app logging to record when these restarts happen, so we should carefully study these logs while testing and, if possible, stress test this change somehow.

If we saw the log happening every 5 seconds (its poll interval) on a given computer, that would be a red flag and force us to reconsider this approach.

…ace changes

Rather than rely on its caller, this change allows the Discovery Client to check the system's
network interfaces to determine if its mDNS browser needs to be restarted to bind on any newly added
interfaces (or to flush out any removed interfaces). This change is necessary to increase Discovery
Client reliability when robots connected to U2E adapters are plugged into and unplugged from the user's
computer.
@mcous mcous added the fix (PR fixes a bug), robot-svcs (Robot Services squad, formerly CPX, Core Platform Experience), and discovery-client labels Apr 7, 2021
@mcous mcous requested a review from a team April 7, 2021 21:57
@mcous mcous requested review from a team as code owners April 7, 2021 21:57
@mcous mcous requested review from SyntaxColoring and b-cooper and removed request for a team April 7, 2021 21:57
Contributor

@SyntaxColoring SyntaxColoring left a comment

  • The basic idea of this PR—move responsibility of keeping track of interfaces to inside the discovery client—sounds great to me.

  • I went through the basic test procedure. What I saw matched what you described.

  • I've only very briefly looked at the code, so far.

  • I've added app logging to record when these restarts happen, so we should carefully study these logs and, if possible, stress test this change somehow.

    How will we do this? I thought only robot logs get forwarded to us over the Internet?

if (!interfacesMatch) {
Contributor

Is it right to use the set of interface names changing across polls as the criterion for restarting the mDNS browser, like this?

For example, this sequence:

  1. Poll the OS's current set of network interface names.
  2. An interface goes down.
  3. That interface comes back up.
  4. Again poll the OS's current set of network interface names.

The set of network interface names doesn't change between steps 1 and 4. But the interface still goes down and comes back up.

I guess this hinges on what it really means for the mDNS browser to "bind to network interfaces when it's created." What's it latching onto? An IP? An OS handle to a hardware device? Is the handle invalidated when the hardware device is removed?

Member

"network interfaces" are logical interfaces to hardware, with many-to-many bindings to actual hardware and occasionally fully logical implementations - consider a loopback iface, or a bridge iface that really only exists to route packets between two other ifaces. Depending on the os, names may be unique plug-to-plug; more often, they won't be. interfaces is a wrapper around node's os.networkInterfaces in particular, which appears to elide any os-provided unique ids for ifaces even if they do happen to exist. Looks like the implementation boils down to getifaddrs(3) on linux and similar on other platforms, so it'll get any iface with a bound address (and no others, which probably isn't a problem for us because the robot won't send mdns packets over an iface with no address anyway).

meanwhile, the problem we're trying to get around is

  • mdns-js will open a socket on the appropriate port
  • if the iface goes away, the socket will go away (and it only maybe gets reported)
  • even if it does get reported, we're presumably not really detecting the problem
  • so we kind of need to brute force it

this all adds up to: yeah, you could have a kind of problem with this implementation where the iface goes away and comes back all in between polls and you never detect the difference, but the socket has in fact gone away. In practice I think this may not actually happen because local IP namespaces take forever to establish; but it may be worth looking into whether there's an error we can hook that will actually tell us when the socket goes away, if it exists (which it may not).

Contributor Author

@mcous mcous Apr 8, 2021

So this is mostly problematic on link-local; I wonder if, in the absence of anything better, we could check the IP address of the interface, too? The computer's link-local address will likely change (I think?) on a U2E hotplug.

Looking into it a little bit more, we have access to the UDP socket, so we could (presumably) listen for the error and close events ourselves
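A minimal sketch of that idea, assuming the browser exposes its underlying dgram socket (watchSocket and onTrouble are hypothetical names). Note that the follow-up testing below found these events don't actually fire when the interface disappears.

import type { Socket } from 'dgram'

// Hypothetical: hook the socket's own lifecycle events instead of (or alongside)
// polling the interface list.
function watchSocket(socket: Socket, onTrouble: (reason: string) => void): void {
  socket.once('error', error => onTrouble(`socket error: ${error.message}`))
  socket.once('close', () => onTrouble('socket closed'))
}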

Contributor

The computer's link-local address will likely change (I think?) on a U2E hotplug

Anecdotally, I think I've seen the IP address not change (and been surprised by that). But if the IP address doesn't change, maybe that also suggests that the original binding wasn't invalidated, either? Hmm.

Member

@sfoster1 sfoster1 Apr 9, 2021

I think being able to get the udp socket's events will help. What we expect to happen is

  • the network interface goes away
  • any ongoing selects or reads should return an EIO or similar error
  • that should bubble up through the Node runtime and trigger the socket's event callbacks

The normal danger with sockets is that you can't really detect when the other side stops listening especially on udp, but because what we're detecting is the local interface going away, I'd hope that we get an error. That said, it's absolutely in the "hope so!" category, and may be platform-dependent.

Contributor Author

@mcous mcous Apr 9, 2021

I've done some testing on macOS with our underlying mdns-js and the Node.js UDP sockets it exposes. Results are... not ideal.

  • A Node.js UDP socket does not emit any sort of event on the socket itself if the interface goes away
    • A given send will fail with EADDRNOTAVAIL, though
    • Unlike socket-level events, we do not have a convenient way to capture send errors from mdns-js
    • Additionally, the mDNS library really only sends once for the initial discovery query at boot
  • The race condition @sfoster1 described above is definitely possible and repeatable; if the interface goes down and comes back up within an interface-check poll interval, we basically lose the socket for receiving mDNS packets

Based on this testing, the TL;DR is that we'd have to write our own mDNS browser (possibly / probably not in Node.js) to actually cover our bases here. I don't want to do that (right now), so here are my thoughts:

  • A 5-second polling interval for network interface changes is actually fairly effective, so I'll bring the interval down from 10 seconds to 5
    • Given the typical timing of interface bring-up on both ends of the U2E adapter, I have not been able to sneak an iface up/down/up through this interval
    • A polling interval in the range of ~1 minute is definitely big enough to sneak through, though
    • Our existing slow poll of network interfaces is 30 seconds, which feels big enough to sneak through
  • If a robot reboots and comes back with the same IP address, we're fine because of the discovery client's HTTP health polling loop
  • I think I will add IP address checking to the network interface comparison, just to cover an extra base if something sneaks past the 5 second poll interval

That leaves us with one case where we're in trouble:

  • Robot reboots and self-assigns a new IP address
  • PC self-assigns the same IP address
  • The interface up/down/up on the computer-side sneaks past the 5-second check interval

In that case, we "lose" the robot. However, we only lose it until the user manually triggers a discovery re-check. Combined with the fact that full-page restart spinners won't stick around forever (#7589), I think I'm willing to call that good enough until we can write our own mDNS browser to handle this properly.
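Putting those pieces together, a rough sketch (not the merged code) of the resulting poll; compareInterfaces is the hypothetical helper sketched in the PR description above, and the other names are placeholders:

// Placeholders for this sketch; wire these to the real discovery client pieces.
declare function compareInterfaces(
  bound: Array<{ name: string; address: string }>
): { interfacesMatch: boolean; extra: unknown[]; missing: unknown[] }
declare function getBoundInterfaces(): Array<{ name: string; address: string }>
declare function restartMdnsBrowser(): void
declare const log: { debug: (message: string, meta?: unknown) => void }

const INTERFACE_POLL_INTERVAL_MS = 5000

const pollId = setInterval(() => {
  const { interfacesMatch, extra, missing } = compareInterfaces(getBoundInterfaces())

  if (!interfacesMatch) {
    // mirrors the log entry called out in the review steps above
    log.debug('Restarting mDNS due to network interface mismatch', { extra, missing })
    restartMdnsBrowser()
  }
}, INTERFACE_POLL_INTERVAL_MS)

// call clearInterval(pollId) when the discovery client stops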

Member

Agreed, I think that's good enough. I think we can test whether it's a Node problem or a "that's how sockets work" problem that we don't get an error when listening on a socket whose iface goes away, but I also think that's not super relevant.

@mcous mcous added the WIP label Apr 8, 2021
@mcous mcous requested a review from a team as a code owner April 9, 2021 20:57
Contributor Author

mcous commented Apr 9, 2021

Alright, this has been an... adventure. I think this is finally ready to go. Here are the latest updates:

  1. Shortened the OS network interface poll to 5 seconds
    • This is quick enough to catch robot reboots reliably in my testing
    • Technically, a race condition still exists here, but it's quite difficult to resolve without writing our own mDNS browser
  2. Added an explicit re-query call on an (exponentially backed-off) interval
    • (1) by itself doesn't resolve the issue, because the interface comes up on the PC side before it comes up on the OT-2 side
    • So, there's a failure mode where the PC interface comes up, then the PC issues its mDNS query, then the robot comes up, and the robot misses the query, so it never advertises itself

With (1) and (2) combined, we get very reliable robot hotplugs. There's probably some timing tuning we could do, but as is, it will eventually pick up the robot.
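For illustration, the backed-off re-query in (2) can be as simple as the sketch below. The 2-minute cap comes from the PR description; the 5-second starting delay and the names scheduleRequery / sendQuery are assumptions.

const MAX_REQUERY_INTERVAL_MS = 120_000 // top out at once every 2 minutes

// Schedule a query after the initial delay, then keep re-sending with an
// exponentially growing delay. Returns a function that cancels the schedule.
function scheduleRequery(sendQuery: () => void, initialDelayMs = 5000): () => void {
  let delay = initialDelayMs
  let timeoutId: ReturnType<typeof setTimeout>

  const tick = (): void => {
    sendQuery()
    delay = Math.min(delay * 2, MAX_REQUERY_INTERVAL_MS)
    timeoutId = setTimeout(tick, delay)
  }

  timeoutId = setTimeout(tick, delay)
  return () => clearTimeout(timeoutId)
}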


As an aside, it turns out this is a regression introduced by #6193 and by the removal of the functionality called out in #5985. The "old" discovery client re-queried the mDNS browser on an interval for an entirely different purpose, inadvertently ensuring that the network interface checking added by #5933 actually solved the problem.

When the re-query behavior was removed, the network interface checking that was in place was not sufficient to allow robots to be hotplugged.

@mcous mcous removed the WIP label Apr 9, 2021
// this method can throw (without emitting), so we need to patch this up
const originalServiceTypeFromString = ServiceType.prototype.fromString

ServiceType.prototype.fromString = function (...args) {
  try {
    originalServiceTypeFromString.apply(this, args)
  } catch (error) {
    console.warn(error)
  }
}
Member

should we try and upstream this?

Contributor Author

Yeah probably. Upstream has not been touched in a while and it's an old issue, but worth a shot

Member

@sfoster1 sfoster1 left a comment

Nice! Looks good, hopefully this will be enough going forward.

@SyntaxColoring SyntaxColoring self-requested a review April 12, 2021 13:55
Contributor

@SyntaxColoring SyntaxColoring left a comment

This is mostly over my head now, but one thing:

Added an explicit re-query call on an (exponentially backed-off) interval

I don't understand why this is necessary. Even if the OT-2 misses the first query, shouldn't it automatically announce itself when it detects its network interface has come up?

Whenever a Multicast DNS responder starts up, wakes up from sleep, receives an indication of a network interface "Link Change" event, or has any other reason to believe that its network connectivity may have changed in some relevant way, it MUST perform the two startup steps below: Probing (Section 8.1) and Announcing (Section 8.3).

https://tools.ietf.org/html/rfc6762#section-8

@mcous mcous added the qa (QA input / review required) label Apr 13, 2021
Contributor Author

mcous commented Apr 14, 2021

QA approved for more thorough testing in release test suite

@mcous mcous merged commit c526b19 into edge Apr 14, 2021
@mcous mcous deleted the discovery_track-iface-changes branch April 14, 2021 17:10