Skip to content
This repository has been archived by the owner on Jun 11, 2024. It is now read-only.

Measure QUIC and TCP dial backs #117

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

aarshkshah1992
Copy link

@aarshkshah1992 aarshkshah1992 commented Nov 6, 2020

Closes libp2p/go-libp2p#1011.

@alanshaw I'm not sure how to place the code for the new measures in the hydra-boosters/ui package like we've done for other measures. How do I get the new measures to flow to the Prometheus UI ?

@marten-seemann @Stebalien Does this look correct from a libp2p/what we're trying to accomplish POV ?

Copy link
Member

@alanshaw alanshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts:

  • I'm concerned about the amount of extra work hydras will end up doing. We already pack as many heads as possible onto a VM, we may need to increase their spec.
  • How long should this be recorded for? I think ipfs-search.com uses (or were at least thinking of using) their own deployed hydra nodes. If we're not going to need to collect the stats forever we could deploy a branch with this code for our purposes and not affect other users. Otherwise it might be an idea to have this behind a feature flag?

head/head.go Outdated Show resolved Hide resolved
head/head.go Outdated Show resolved Hide resolved
head/head.go Outdated Show resolved Hide resolved
head/head.go Outdated Show resolved Hide resolved
hydra/hydra.go Outdated Show resolved Hide resolved
p := ev.Peer
addrs := hd.Host.Peerstore().Addrs(p)

// dial back on quic if peer advertises a quic address
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to somehow exclude inbound dials from other hydras. Although I don't think that's going to be possible... but you could probably exclude heads from the same hydra.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh you mean, if the inbound dial is from a head in our hydra, don't attempt this dial back ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're waiting for identify to complete we could check the useragent and exclude hydras

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aschmahmann I agree. Will put in a fix.

hydra/hydra.go Outdated Show resolved Hide resolved
hydra/hydra.go Outdated
"go.opencensus.io/stats"
"go.opencensus.io/tag"
"go.uber.org/atomic"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sync/atomic? I don't think we need to pull in another external dependency. atomic.AddUint32 should be easy enough?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

hydra/hydra.go Outdated
if mafmt.TCP.Matches(a) {
nTCPConns.Add(1)

if err := hd.TcpDialBackHost.Connect(hdCtx, peer.AddrInfo{ID: p}); err == nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we log the error if one occurs? Or have a sanity check that this is a timeout error (net.Error with Timeout() == true?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only other error I could imagine is a QUIC version negotiation error, which should be rare, as we've deployed draft-29 for quite a while now. If there are other errors, we should probably look into that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... are you sure that dial errors because of NATS would all be timeout errors ? Aren't there NATs out there that send ICMP messages for unsolicited requests ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hadn't thought of ICMP. In UDP, you only get ICMP messages for connected sockets (which we don't use), not sure how this works in TCP. Actually, it would actually be an interesting thing to know how those dials fail.

Another reason I'm asking for error logging is that I recently went through the QUIC error logs and discovered that a lot of dials were failing when connection to our old Bootstrap nodes. We managed to make sense of this, but it was unexpected at first, so I suggest that there's value in logging these errors (or maybe exporting the error string to Grafana if we want a single source of truth? Not sure how to best do this).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stebalien Would you know what we'd see in the logs for these dial failures if the peers were NATT'd ? My only concerns is bloating the logs because large parts of the network are still undialable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really, you can get any error. You might get a TCP reset from the NAT, you might get an ICMP rejection, you might get a timeout.

case evt := <-subs.Out():
ev := evt.(event.EvtPeerIdentificationCompleted)
p := ev.Peer
addrs := hd.Host.Peerstore().Addrs(p)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which addresses will the peer store contain? Only the address obtained via identify, or also (potentially local interface) addresses that the peer might have announced?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marten-seemann Both.

We receive all address advertised by a peer on Identify protocol -> add them all to the peerstore . The addresses a peer listens on for it's local interfaces are also sent by it over the Identify protocol.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will lead to a lot of dial failures then which are unrelated to NATting, won't it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We consider a dial to be a failure here ONLY if we exhaust all addresses and still aren't able to dial a peer. That's how the Swarm dial logic works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we meant to measure if AutoNAT is working or how many of the nodes are actually dialable? If the latter then I'm concerned that if there are AutoNAT bugs then we'll run into issues because the list of addresses does not the peer's actual remote address. If it's the former then this seems reasonable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, I think we want to know "is the peer dialable period".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aschmahmann It's the latter. What do you mean by the peer's actual remote address ? You mean the remote address we'd see on a connection to it ? That is usually discovered by the peer using AutoNAT post which it includes it in it' s set of dialable addresses and works pretty well right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I mean the remote address the peer dials out from and that we see when they connect to us. I agree that if everything with AutoNAT is working fine and we don't connect to the hydra nodes before learning our addresses via AutoNAT then this should be good. However, if either of those conditions aren't true then we may get some false reports of NAT'd nodes.

Copy link
Author

@aarshkshah1992 aarshkshah1992 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alanshaw @marten-seemann

Have addressed your review. Please take a look.

head/head.go Outdated Show resolved Hide resolved
head/head.go Outdated Show resolved Hide resolved
head/head.go Outdated Show resolved Hide resolved
head/head.go Outdated Show resolved Hide resolved
hydra/hydra.go Outdated Show resolved Hide resolved
case evt := <-subs.Out():
ev := evt.(event.EvtPeerIdentificationCompleted)
p := ev.Peer
addrs := hd.Host.Peerstore().Addrs(p)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We consider a dial to be a failure here ONLY if we exhaust all addresses and still aren't able to dial a peer. That's how the Swarm dial logic works.

hydra/hydra.go Outdated
if mafmt.TCP.Matches(a) {
nTCPConns.Add(1)

if err := hd.TcpDialBackHost.Connect(hdCtx, peer.AddrInfo{ID: p}); err == nil {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stebalien Would you know what we'd see in the logs for these dial failures if the peers were NATT'd ? My only concerns is bloating the logs because large parts of the network are still undialable.

Copy link
Member

@alanshaw alanshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on the second bullet here?: #117 (review)

hydra/hydra.go Outdated Show resolved Hide resolved
hydra/hydra.go Outdated Show resolved Hide resolved
hydra/hydra.go Show resolved Hide resolved
metrics/definitions.go Outdated Show resolved Hide resolved
@aarshkshah1992
Copy link
Author

@alanshaw Have made the changes.

On the resource consumption questions, how about we deploy on one hydra, monitor and then deploy on more if there are no concerns. I'd like for this to be along running measurement as NAT traversal is going to be a long term endeavor.

Copy link
Member

@alanshaw alanshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea on the test failures? I'll deploy this branch to a hydra when I get a sec and ping you when done.

@aarshkshah1992
Copy link
Author

@alanshaw I think the test code was broken. Have pushed a fix.

Also, do I need to make any changes to the ui package to show these stats correctly on the UI ?

@aarshkshah1992
Copy link
Author

@Stebalien Can we use Transports instead of Hosts here for dial back ?

p := ev.Peer
addrs := hd.Host.Peerstore().Addrs(p)

// dial back on quic if peer advertises a quic address
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're waiting for identify to complete we could check the useragent and exclude hydras

hydra/hydra.go Outdated

// dial back on quic if peer advertises a quic address
for _, a := range addrs {
if mafmt.QUIC.Matches(a) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure off hand how this (or the TCP equivalent) interacts with circuit relay addresses, I don't know how many of those nodes are around but we may want to count them separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this also catch p2p circuit addresses. Instead, I think we need to look at the last protocol in the address.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really, we can probably just try and fail. The failure should be pretty fast.

Copy link
Author

@aarshkshah1992 aarshkshah1992 Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stebalien

Oh you mean the failure will be fast if there are all TCP addrs ? That makes sense.
However, we still need to filter out Relay addresses here and below for TCP, right ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But we'll ignore relay addresses as well if we simply don't enable the relay transport. Basically, if we construct two bare-bones hosts (maybe use the blank host?) and only enable one transport, all addresses that don't match that transport will fail.

case evt := <-subs.Out():
ev := evt.(event.EvtPeerIdentificationCompleted)
p := ev.Peer
addrs := hd.Host.Peerstore().Addrs(p)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we meant to measure if AutoNAT is working or how many of the nodes are actually dialable? If the latter then I'm concerned that if there are AutoNAT bugs then we'll run into issues because the list of addresses does not the peer's actual remote address. If it's the former then this seems reasonable.

Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than not dialing other heads, this looks correct.

case evt := <-subs.Out():
ev := evt.(event.EvtPeerIdentificationCompleted)
p := ev.Peer
addrs := hd.Host.Peerstore().Addrs(p)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, I think we want to know "is the peer dialable period".

hydra/hydra.go Outdated

// dial back on quic if peer advertises a quic address
for _, a := range addrs {
if mafmt.QUIC.Matches(a) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this also catch p2p circuit addresses. Instead, I think we need to look at the last protocol in the address.

hydra/hydra.go Outdated

// dial back on quic if peer advertises a quic address
for _, a := range addrs {
if mafmt.QUIC.Matches(a) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really, we can probably just try and fail. The failure should be pretty fast.

Copy link
Member

@alanshaw alanshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NAT traversal: Quantify number of NATT'd and thus undialable peers
5 participants