Measure QUIC and TCP dial backs #117
base: master
Conversation
Thoughts:
- I'm concerned about the amount of extra work hydras will end up doing. We already pack as many heads as possible onto a VM, we may need to increase their spec.
- How long should this be recorded for? I think ipfs-search.com uses (or was at least thinking of using) their own deployed hydra nodes. If we're not going to need to collect the stats forever, we could deploy a branch with this code for our own purposes and not affect other users. Otherwise it might be an idea to put this behind a feature flag?
p := ev.Peer
addrs := hd.Host.Peerstore().Addrs(p)

// dial back on quic if peer advertises a quic address
It would be good to somehow exclude inbound dials from other hydras. Although I don't think that's going to be possible... but you could probably exclude heads from the same hydra.
Oh you mean, if the inbound dial is from a head in our hydra, don't attempt this dial back ?
yes
If we're waiting for identify to complete we could check the user agent and exclude hydras
@aschmahmann I agree. Will put in a fix.
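For reference, a minimal sketch of the user-agent check (the helper name and the "hydra-booster" match string are assumptions, not code from this PR):

```go
package hydra

import (
	"strings"

	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peer"
)

// isHydraHead reports whether the remote peer looks like another hydra head,
// based on the agent string that identify stored in the peerstore. The helper
// name and the "hydra-booster" substring are assumptions about the agent string.
func isHydraHead(h host.Host, p peer.ID) bool {
	av, err := h.Peerstore().Get(p, "AgentVersion")
	if err != nil {
		return false
	}
	s, ok := av.(string)
	return ok && strings.Contains(s, "hydra-booster")
}
```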
hydra/hydra.go
Outdated
"go.opencensus.io/stats" | ||
"go.opencensus.io/tag" | ||
"go.uber.org/atomic" |
sync/atomic? I don't think we need to pull in another external dependency. atomic.AddUint32 should be easy enough?
Agreed.
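For illustration, the standard-library version could look roughly like this (the variable and function names only mirror the counters in the diff):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// nTCPConns mirrors the counter in the diff, backed by the standard library
// instead of go.uber.org/atomic.
var nTCPConns uint32

func recordTCPDialBack() {
	atomic.AddUint32(&nTCPConns, 1)
}

func main() {
	recordTCPDialBack()
	fmt.Println(atomic.LoadUint32(&nTCPConns)) // prints 1
}
```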
hydra/hydra.go
Outdated
if mafmt.TCP.Matches(a) {
	nTCPConns.Add(1)

	if err := hd.TcpDialBackHost.Connect(hdCtx, peer.AddrInfo{ID: p}); err == nil {
Should we log the error if one occurs? Or have a sanity check that this is a timeout error (net.Error with Timeout() == true)?
The only other error I could imagine is a QUIC version negotiation error, which should be rare, as we've deployed draft-29 for quite a while now. If there are other errors, we should probably look into that.
Hmmm... are you sure that dial errors caused by NATs would all be timeout errors? Aren't there NATs out there that send ICMP messages for unsolicited requests?
Hadn't thought of ICMP. In UDP, you only get ICMP messages for connected sockets (which we don't use); not sure how this works in TCP. It would actually be an interesting thing to know how those dials fail.
Another reason I'm asking for error logging is that I recently went through the QUIC error logs and discovered that a lot of dials were failing when connecting to our old bootstrap nodes. We managed to make sense of this, but it was unexpected at first, so I'd suggest there's value in logging these errors (or maybe exporting the error string to Grafana if we want a single source of truth? Not sure how to best do this).
@Stebalien Would you know what we'd see in the logs for these dial failures if the peers were NAT'd? My only concern is bloating the logs, because large parts of the network are still undialable.
Really, you can get any error. You might get a TCP reset from the NAT, you might get an ICMP rejection, you might get a timeout.
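For reference, a sketch of how a dial-back failure could be bucketed before logging or exporting it, assuming plain net errors; the helper and labels are illustrative, not code from this PR:

```go
package main

import (
	"context"
	"errors"
	"log"
	"net"
	"syscall"
	"time"
)

// classifyDialError buckets a dial-back failure before logging it or
// exporting it as a metric label.
func classifyDialError(err error) string {
	var netErr net.Error
	switch {
	case err == nil:
		return "ok"
	case errors.As(err, &netErr) && netErr.Timeout():
		return "timeout" // typical for a NAT that silently drops packets
	case errors.Is(err, syscall.ECONNREFUSED):
		return "refused" // TCP RST or ICMP rejection surfaced by the OS
	default:
		return "other"
	}
}

func main() {
	// 192.0.2.1 is a TEST-NET address, so this dial is expected to fail.
	d := net.Dialer{Timeout: 200 * time.Millisecond}
	_, err := d.DialContext(context.Background(), "tcp", "192.0.2.1:4001")
	log.Printf("dial back failed: %v (class=%s)", err, classifyDialError(err))
}
```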
case evt := <-subs.Out():
	ev := evt.(event.EvtPeerIdentificationCompleted)
	p := ev.Peer
	addrs := hd.Host.Peerstore().Addrs(p)
Which addresses will the peer store contain? Only the address obtained via identify, or also (potentially local interface) addresses that the peer might have announced?
@marten-seemann Both.
We receive all addresses advertised by a peer over the Identify protocol and add them all to the peerstore. The addresses a peer listens on for its local interfaces are also sent by it over the Identify protocol.
That will lead to a lot of dial failures then which are unrelated to NATting, won't it?
We consider a dial to be a failure here ONLY if we exhaust all addresses and still aren't able to dial a peer. That's how the Swarm dial logic works.
Are we meant to measure whether AutoNAT is working, or how many of the nodes are actually dialable? If the latter, then I'm concerned that if there are AutoNAT bugs we'll run into issues, because the list of addresses does not contain the peer's actual remote address. If it's the former, then this seems reasonable.
In this case, I think we want to know "is the peer dialable period".
@aschmahmann It's the latter. What do you mean by the peer's actual remote address? You mean the remote address we'd see on a connection to it? That is usually discovered by the peer using AutoNAT, after which it includes it in its set of dialable addresses, and that works pretty well right now.
Yep, I mean the remote address the peer dials out from and that we see when they connect to us. I agree that if everything with AutoNAT is working fine and we don't connect to the hydra nodes before learning our addresses via AutoNAT then this should be good. However, if either of those conditions isn't true then we may get some false reports of NAT'd nodes.
Have addressed your review. Please take a look.
Any thoughts on the second bullet here?: #117 (review)
@alanshaw Have made the changes. On the resource consumption question, how about we deploy on one hydra, monitor, and then deploy on more if there are no concerns? I'd like this to be a long-running measurement, as NAT traversal is going to be a long-term endeavor.
Any idea on the test failures? I'll deploy this branch to a hydra when I get a sec and ping you when done.
@alanshaw I think the test code was broken. Have pushed a fix. Also, do I need to make any changes to the
@Stebalien Can we use Transports instead of Hosts here for dial back?
hydra/hydra.go
Outdated
// dial back on quic if peer advertises a quic address
for _, a := range addrs {
	if mafmt.QUIC.Matches(a) {
I'm not sure offhand how this (or the TCP equivalent) interacts with circuit relay addresses. I don't know how many of those nodes are around, but we may want to count them separately.
I think this will also catch p2p circuit addresses. Instead, I think we need to look at the last protocol in the address.
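For reference, checking the last protocol of an address could look roughly like this; the helper is illustrative and assumes relayed peerstore entries end in the p2p-circuit component:

```go
package main

import (
	"fmt"

	ma "github.com/multiformats/go-multiaddr"
)

// endsInCircuit reports whether an address's last component is p2p-circuit,
// i.e. the peer would only be reachable through a relay on that address.
func endsInCircuit(a ma.Multiaddr) bool {
	protos := a.Protocols()
	return len(protos) > 0 && protos[len(protos)-1].Code == ma.P_CIRCUIT
}

func main() {
	direct, _ := ma.NewMultiaddr("/ip4/203.0.113.7/udp/4001/quic")
	relayed, _ := ma.NewMultiaddr("/ip4/203.0.113.7/tcp/4001/p2p-circuit")
	fmt.Println(endsInCircuit(direct), endsInCircuit(relayed)) // false true
}
```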
Really, we can probably just try and fail. The failure should be pretty fast.
Oh, you mean the failure will be fast if they are all TCP addrs? That makes sense.
However, we still need to filter out relay addresses here and below for TCP, right?
Yes. But we'll ignore relay addresses as well if we simply don't enable the relay transport. Basically, if we construct two bare-bones hosts (maybe use the blank host?) and only enable one transport, all addresses that don't match that transport will fail.
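A rough sketch of such a bare-bones, QUIC-only dial-back host, assuming the go-libp2p option names of this PR's era (not code from this PR): with only one transport registered, Connect on a peer that advertises no QUIC address fails quickly, which matches the "just try and fail" behaviour above.

```go
package main

import (
	"context"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/host"
	quic "github.com/libp2p/go-libp2p-quic-transport"
)

// newQUICDialBackHost builds a minimal host that can only dial QUIC. Because
// no other transport is configured, TCP and relayed addresses of the target
// peer simply fail to dial. The option set is assumed from the go-libp2p API
// of this PR's era and may differ in newer releases.
func newQUICDialBackHost(ctx context.Context) (host.Host, error) {
	return libp2p.New(ctx,
		libp2p.Transport(quic.NewTransport),
		libp2p.NoListenAddrs,
		libp2p.DisableRelay(),
	)
}

func main() {
	h, err := newQUICDialBackHost(context.Background())
	if err != nil {
		panic(err)
	}
	defer h.Close()
}
```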
Other than not dialing other heads, this looks correct.
Changes LGTM
Closes libp2p/go-libp2p#1011.
@alanshaw I'm not sure how to place the code for the new measures in the hydra-boosters/ui package like we've done for other measures. How do I get the new measures to flow to the Prometheus UI?

@marten-seemann @Stebalien Does this look correct from a libp2p / what-we're-trying-to-accomplish POV?
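For reference, a minimal OpenCensus sketch of how a new measure is typically exposed; the measure and view names here are hypothetical, and the Prometheus exporter wiring is assumed to already exist in the hydra-booster metrics/ui code:

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

var (
	// quicDialBacks is an illustrative measure; the real names live in the
	// hydra-booster metrics package.
	quicDialBacks = stats.Int64("quic_dial_backs", "Number of successful QUIC dial backs", stats.UnitDimensionless)

	quicDialBacksView = &view.View{
		Name:        "quic_dial_backs_total",
		Measure:     quicDialBacks,
		Description: "Total successful QUIC dial backs",
		Aggregation: view.Count(),
	}
)

func main() {
	// Registering the view is what makes the metric visible to the
	// already-configured OpenCensus Prometheus exporter.
	if err := view.Register(quicDialBacksView); err != nil {
		log.Fatal(err)
	}
	// Record one data point; in the PR this would happen after a dial back.
	stats.Record(context.Background(), quicDialBacks.M(1))
}
```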