Measure QUIC and TCP dial backs #117
base: master
Conversation
Thoughts:
- I'm concerned about the amount of extra work hydras will end up doing. We already pack as many heads as possible onto a VM, we may need to increase their spec.
- How long should this be recorded for? I think ipfs-search.com uses (or was at least thinking of using) their own deployed hydra nodes. If we're not going to need to collect the stats forever, we could deploy a branch with this code for our own purposes and not affect other users. Otherwise it might be an idea to put this behind a feature flag?
p := ev.Peer
addrs := hd.Host.Peerstore().Addrs(p)

// dial back on quic if peer advertises a quic address
It would be good to somehow exclude inbound dials from other hydras. Although I don't think that's going to be possible... but you could probably exclude heads from the same hydra.
Oh you mean, if the inbound dial is from a head in our hydra, don't attempt this dial back ?
yes
If we're waiting for identify to complete we could check the user agent and exclude hydras
@aschmahmann I agree. Will put in a fix.
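For reference, a minimal sketch of the user-agent check (the helper name and the "hydra-booster" match string are assumptions, not code from this PR):

```go
package hydra

import (
	"strings"

	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peer"
)

// isHydraHead reports whether the remote peer looks like another hydra head,
// based on the agent string that identify stored in the peerstore. The helper
// name and the "hydra-booster" substring are assumptions about the agent string.
func isHydraHead(h host.Host, p peer.ID) bool {
	av, err := h.Peerstore().Get(p, "AgentVersion")
	if err != nil {
		return false
	}
	s, ok := av.(string)
	return ok && strings.Contains(s, "hydra-booster")
}
```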
hydra/hydra.go
Outdated
"go.opencensus.io/stats" | ||
"go.opencensus.io/tag" | ||
"go.uber.org/atomic" |
sync/atomic? I don't think we need to pull in another external dependency. atomic.AddUint32 should be easy enough?
Agreed.
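For illustration, the standard-library version could look roughly like this (the variable and function names only mirror the counters in the diff):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// nTCPConns mirrors the counter in the diff, backed by the standard library
// instead of go.uber.org/atomic.
var nTCPConns uint32

func recordTCPDialBack() {
	atomic.AddUint32(&nTCPConns, 1)
}

func main() {
	recordTCPDialBack()
	fmt.Println(atomic.LoadUint32(&nTCPConns)) // prints 1
}
```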
hydra/hydra.go
Outdated
if mafmt.TCP.Matches(a) {
	nTCPConns.Add(1)

	if err := hd.TcpDialBackHost.Connect(hdCtx, peer.AddrInfo{ID: p}); err == nil {
Should we log the error if one occurs? Or have a sanity check that this is a timeout error (net.Error with Timeout() == true)?
The only other error I could imagine is a QUIC version negotiation error, which should be rare, as we've deployed draft-29 for quite a while now. If there are other errors, we should probably look into that.
Hmmm... are you sure that dial errors caused by NATs would all be timeout errors? Aren't there NATs out there that send ICMP messages for unsolicited requests?
Hadn't thought of ICMP. In UDP, you only get ICMP messages for connected sockets (which we don't use); not sure how this works in TCP. It would actually be an interesting thing to know how those dials fail.
Another reason I'm asking for error logging is that I recently went through the QUIC error logs and discovered that a lot of dials were failing when connecting to our old bootstrap nodes. We managed to make sense of this, but it was unexpected at first, so I'd suggest there's value in logging these errors (or maybe exporting the error string to Grafana if we want a single source of truth? Not sure how to best do this).
@Stebalien Would you know what we'd see in the logs for these dial failures if the peers were NAT'd? My only concern is bloating the logs, because large parts of the network are still undialable.
Really, you can get any error. You might get a TCP reset from the NAT, you might get an ICMP rejection, you might get a timeout.
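For reference, a sketch of how a dial-back failure could be bucketed before logging or exporting it, assuming plain net errors; the helper and labels are illustrative, not code from this PR:

```go
package main

import (
	"context"
	"errors"
	"log"
	"net"
	"syscall"
	"time"
)

// classifyDialError buckets a dial-back failure before logging it or
// exporting it as a metric label.
func classifyDialError(err error) string {
	var netErr net.Error
	switch {
	case err == nil:
		return "ok"
	case errors.As(err, &netErr) && netErr.Timeout():
		return "timeout" // typical for a NAT that silently drops packets
	case errors.Is(err, syscall.ECONNREFUSED):
		return "refused" // TCP RST or ICMP rejection surfaced by the OS
	default:
		return "other"
	}
}

func main() {
	// 192.0.2.1 is a TEST-NET address, so this dial is expected to fail.
	d := net.Dialer{Timeout: 200 * time.Millisecond}
	_, err := d.DialContext(context.Background(), "tcp", "192.0.2.1:4001")
	log.Printf("dial back failed: %v (class=%s)", err, classifyDialError(err))
}
```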
case evt := <-subs.Out():
	ev := evt.(event.EvtPeerIdentificationCompleted)
	p := ev.Peer
	addrs := hd.Host.Peerstore().Addrs(p)
Which addresses will the peer store contain? Only the address obtained via identify, or also (potentially local interface) addresses that the peer might have announced?
@marten-seemann Both.
We receive all addresses advertised by a peer over the Identify protocol and add them all to the peerstore. The addresses a peer listens on for its local interfaces are also sent by it over the Identify protocol.
That will lead to a lot of dial failures then which are unrelated to NATting, won't it?
We consider a dial to be a failure here ONLY if we exhaust all addresses and still aren't able to dial a peer. That's how the Swarm dial logic works.
Are we meant to measure whether AutoNAT is working, or how many of the nodes are actually dialable? If the latter, then I'm concerned that if there are AutoNAT bugs we'll run into issues, because the list of addresses does not contain the peer's actual remote address. If it's the former, then this seems reasonable.
In this case, I think we want to know "is the peer dialable period".
@aschmahmann It's the latter. What do you mean by the peer's actual remote address? You mean the remote address we'd see on a connection to it? That is usually discovered by the peer using AutoNAT, after which it includes it in its set of dialable addresses, and that works pretty well right now.
Yep, I mean the remote address the peer dials out from and that we see when they connect to us. I agree that if everything with AutoNAT is working fine and we don't connect to the hydra nodes before learning our addresses via AutoNAT then this should be good. However, if either of those conditions isn't true then we may get some false reports of NAT'd nodes.
Have addressed your review. Please take a look.
Any thoughts on the second bullet here?: #117 (review)
@alanshaw Have made the changes. On the resource consumption question, how about we deploy on one hydra, monitor, and then deploy on more if there are no concerns? I'd like this to be a long-running measurement, as NAT traversal is going to be a long-term endeavor.
Any idea on the test failures? I'll deploy this branch to a hydra when I get a sec and ping you when done.
@alanshaw I think the test code was broken. Have pushed a fix. Also, do I need to make any changes to the
@Stebalien Can we use Transports instead of Hosts here for dial back?
hydra/hydra.go
Outdated
// dial back on quic if peer advertises a quic address
for _, a := range addrs {
	if mafmt.QUIC.Matches(a) {
I'm not sure offhand how this (or the TCP equivalent) interacts with circuit relay addresses. I don't know how many of those nodes are around, but we may want to count them separately.
I think this will also catch p2p circuit addresses. Instead, I think we need to look at the last protocol in the address.
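For reference, checking the last protocol of an address could look roughly like this; the helper is illustrative and assumes relayed peerstore entries end in the p2p-circuit component:

```go
package main

import (
	"fmt"

	ma "github.com/multiformats/go-multiaddr"
)

// endsInCircuit reports whether an address's last component is p2p-circuit,
// i.e. the peer would only be reachable through a relay on that address.
func endsInCircuit(a ma.Multiaddr) bool {
	protos := a.Protocols()
	return len(protos) > 0 && protos[len(protos)-1].Code == ma.P_CIRCUIT
}

func main() {
	direct, _ := ma.NewMultiaddr("/ip4/203.0.113.7/udp/4001/quic")
	relayed, _ := ma.NewMultiaddr("/ip4/203.0.113.7/tcp/4001/p2p-circuit")
	fmt.Println(endsInCircuit(direct), endsInCircuit(relayed)) // false true
}
```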
Really, we can probably just try and fail. The failure should be pretty fast.
Oh, you mean the failure will be fast if they are all TCP addrs? That makes sense.
However, we still need to filter out relay addresses here and below for TCP, right?
Yes. But we'll ignore relay addresses as well if we simply don't enable the relay transport. Basically, if we construct two bare-bones hosts (maybe use the blank host?) and only enable one transport, all addresses that don't match that transport will fail.
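A rough sketch of such a bare-bones, QUIC-only dial-back host, assuming the go-libp2p option names of this PR's era (not code from this PR): with only one transport registered, Connect on a peer that advertises no QUIC address fails quickly, which matches the "just try and fail" behaviour above.

```go
package main

import (
	"context"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/host"
	quic "github.com/libp2p/go-libp2p-quic-transport"
)

// newQUICDialBackHost builds a minimal host that can only dial QUIC. Because
// no other transport is configured, TCP and relayed addresses of the target
// peer simply fail to dial. The option set is assumed from the go-libp2p API
// of this PR's era and may differ in newer releases.
func newQUICDialBackHost(ctx context.Context) (host.Host, error) {
	return libp2p.New(ctx,
		libp2p.Transport(quic.NewTransport),
		libp2p.NoListenAddrs,
		libp2p.DisableRelay(),
	)
}

func main() {
	h, err := newQUICDialBackHost(context.Background())
	if err != nil {
		panic(err)
	}
	defer h.Close()
}
```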
Other than not dialing other heads, this looks correct.
Changes LGTM
Closes libp2p/go-libp2p#1011.
@alanshaw I'm not sure how to place the code for the new measures in the hydra-boosters/ui package like we've done for other measures. How do I get the new measures to flow to the Prometheus UI?

@marten-seemann @Stebalien Does this look correct from a libp2p / what-we're-trying-to-accomplish POV?
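For reference, a minimal OpenCensus sketch of how a new measure is typically exposed; the measure and view names here are hypothetical, and the Prometheus exporter wiring is assumed to already exist in the hydra-booster metrics/ui code:

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

var (
	// quicDialBacks is an illustrative measure; the real names live in the
	// hydra-booster metrics package.
	quicDialBacks = stats.Int64("quic_dial_backs", "Number of successful QUIC dial backs", stats.UnitDimensionless)

	quicDialBacksView = &view.View{
		Name:        "quic_dial_backs_total",
		Measure:     quicDialBacks,
		Description: "Total successful QUIC dial backs",
		Aggregation: view.Count(),
	}
)

func main() {
	// Registering the view is what makes the metric visible to the
	// already-configured OpenCensus Prometheus exporter.
	if err := view.Register(quicDialBacksView); err != nil {
		log.Fatal(err)
	}
	// Record one data point; in the PR this would happen after a dial back.
	stats.Record(context.Background(), quicDialBacks.M(1))
}
```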