Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipld: Manual providing #375

Merged
merged 20 commits into from
Jun 11, 2021
Merged

ipld: Manual providing #375

merged 20 commits into from
Jun 11, 2021

Conversation

Wondertan
Copy link
Member

@Wondertan Wondertan commented May 28, 2021

Note: those PR does not only manual providing but I chose the name for the biggest change.

This PR:

  • Provides networking tests in IPLD. Probably closes ipfs: improve ipfs tests #349, not sure if that's enough
  • Makes light client serve IPFS API by default(should be changed later to be configurable)
  • Disables Bitswap providing and reproviding.
  • Extends PutBlock with logger and code to manually provide Row and Cols to DHT
  • Introduces provider in p2p/ipld that handles point above. NOTE: We need to extract it into an independent singleton. We can do this along with https://github.com/lazyledger/lazyledger-core/issues/393, so the code for provider is not final.
  • Starts using ContentRouting interface from libp2p in Node(gained from IpfsNode which is now returned instead of io.Closer in ipfs.APIProvider)
  • Update Mock IPFS provider and sets nil content routing. That's needed for tests.
  • Does tricks to make CI happy

@Wondertan Wondertan force-pushed the hlib/zero-bootstrapping branch 3 times, most recently from 8dc07ee to aa9649b Compare May 31, 2021 00:48
@Wondertan Wondertan changed the title node bootstrapping experiments ipld: Manual providing Jun 4, 2021
@Wondertan Wondertan force-pushed the hlib/zero-bootstrapping branch from eec6a12 to 6ecd31a Compare June 4, 2021 20:30
@Wondertan Wondertan force-pushed the hlib/zero-bootstrapping branch from e6e353c to f200949 Compare June 4, 2021 22:06
@Wondertan Wondertan force-pushed the hlib/zero-bootstrapping branch from f200949 to dc1179d Compare June 4, 2021 22:08
@Wondertan Wondertan marked this pull request as ready for review June 4, 2021 22:18
@Wondertan
Copy link
Member Author

Wondertan commented Jun 4, 2021

I[2021-06-04|22:15:14.320] Waiting for initial height 1000...           
E[2021-06-04|22:15:34.382] chain stalled at unknown height              
Error: Process completed with exit code 1.

Does anybody have any idea what could be the reason or encountered something similar before?

@evan-forbes
Copy link
Member

evan-forbes commented Jun 4, 2021

Does anybody have any idea what could be the reason or encountered something similar before?

Unfortunately, I have gotten that error quite a few times for different reasons. The e2e test is failing locally for me and producing the same logs as the CI e2e test, which is actually good news, as that means it's not a mysterious CI issue. Per the logs

validator02    | E[2021-06-04|22:15:02.310] Error on broadcastTxCommit                   module=rpc err="timed out waiting for tx to be included in a block"

This happens to all validators very early on, and all of the other tests are passing. The normal tests don't ever transfer data over a network, which leads me to believe that there's some issue with the block data actually being transferred over the network via IPFS. Could be something else though, I'm not sure.

@codecov-commenter
Copy link

codecov-commenter commented Jun 6, 2021

Codecov Report

Merging #375 (0b87d05) into master (6a4b0a7) will increase coverage by 0.00%.
The diff coverage is 66.66%.

@@           Coverage Diff           @@
##           master     #375   +/-   ##
=======================================
  Coverage   61.74%   61.75%           
=======================================
  Files         262      262           
  Lines       22933    22981   +48     
=======================================
+ Hits        14161    14192   +31     
- Misses       7265     7281   +16     
- Partials     1507     1508    +1     
Impacted Files Coverage Δ
cmd/tendermint/commands/light.go 17.90% <0.00%> (-0.12%) ⬇️
consensus/replay_file.go 0.00% <0.00%> (ø)
p2p/ipld/validate.go 54.54% <ø> (ø)
node/node.go 55.92% <63.63%> (-0.48%) ⬇️
p2p/ipld/write.go 65.00% <66.00%> (+5.00%) ⬆️
consensus/state.go 68.16% <85.71%> (+0.11%) ⬆️
p2p/ipld/sample.go 78.94% <0.00%> (-5.27%) ⬇️
blockchain/v0/pool.go 80.98% <0.00%> (-2.67%) ⬇️
blockchain/v0/reactor.go 63.06% <0.00%> (-2.26%) ⬇️
p2p/conn/connection.go 78.17% <0.00%> (-0.56%) ⬇️
... and 6 more

@Wondertan
Copy link
Member Author

Fortunately, I unintentionally fixed e2e by implementing proper context logic for PutBlock in consensus.

evan-forbes added a commit that referenced this pull request Jun 7, 2021
Copy link
Member

@evan-forbes evan-forbes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool, and it helps to fulfill the vision of breaking apart and optimizing only what we need from IPFS. 🚀 I left some minor comments and questions

Probably a stupid question, but I admittedly don't know. Is this PR an optimization or is this a necessary step towards block propagation via IPFS?

I also applied the typical job worker pool template as a minor refactor https://github.com/lazyledger/lazyledger-core/blob/898f83dbd8b3cb2e011d18a3be2e728e80186d2c/p2p/ipld/write.go#L87-L117
I do think there are some easy readability gain there, but we definitely don't have to use that exact code.

ref #397

p2p/ipld/write.go Show resolved Hide resolved
p2p/ipld/write.go Outdated Show resolved Hide resolved
}

func (p *provider) provided() {
if atomic.AddInt32(&p.total, -1) == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is replicating the logic of a channel in a less traditional way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may look unfamiliar if you are not comfortable with the atomic package. I tend to think about reducing thread/goroutine blocking when I write async code, so I use atomic where possible. Furthermore, for this case using atomics makes code smaller.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm missing something here, how does it block less or make the code smaller?

While it is my opinion that we should use the standard way of doing things when possible, especially if it's a quick fix, whether we utilize the logic of a channel or re implement it here is arbitrary, and I'm leaving it up to you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I use atomic where possible.

Quoting the atomic package documentation:

These functions require great care to be used correctly. Except for special, low-level applications, synchronization is better done with channels or the facilities of the sync package. Share memory by communicating; don't communicate by sharing memory.

https://golang.org/pkg/sync/atomic/

I feel like the code should contain a very clear argument (e.g. as a comment in the code) why this was preferred over channels. Not only would that help answer @evan-forbes question but it would also would the code speak for itself.

It may look unfamiliar if you are not comfortable with the atomic package.

The atomic package will be unfamiliar to most go engineers as they likely won't need it during their career.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Share memory by communicating; don't communicate by sharing memory.

That's a famous Rob Pike quote and I agree with it. (Don't think mentioning here is a good idea).

I feel like the code should contain a very clear argument (e.g. as a comment in the code) why this was preferred over channels.

Ok, there is no important reason to use atomics in this case. That's just my preference. I agree that channels would be more readable for a bigger audience. I will rewrite this to use channels in singleton Provider or maybe consider using some simple 3rd-party lib for worker pool, like https://github.com/gammazero/workerpool, as it matches ideally for our use-case. However, I don't see any reason to rewrite this now, as this is not the final code and it works well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Don't think mentioning here is a good idea).

It's just part of the atomic package docs.

I will rewrite this to use channels in singleton Provider or maybe consider using some simple 3rd-party lib for worker pool, like https://github.com/gammazero/workerpool, as it matches ideally for our use-case.

If we can do this simply without introducing another dependency that is preferred.

I don't see any reason to rewrite this now, as this is not the final code and it works well.

I think it wasn't always clear which parts are temporary and will be changed asap. Note that there is always the risk that we just forget about this if not done in the PR, or immediately afterwards.

p2p/ipld/write.go Outdated Show resolved Hide resolved
node/node.go Show resolved Hide resolved
Comment on lines 1128 to 1141
cs.proposalCtx, cs.proposalCancel = context.WithCancel(context.TODO())
go func() {
cs.Logger.Info("Putting Block to IPFS", "height", block.Height)
err = ipld.PutBlock(cs.proposalCtx, cs.dag, block, cs.croute, cs.Logger)
if err != nil {
if errors.Is(err, context.Canceled) {
cs.Logger.Error("Putting Block didn't finish in time and was terminated", "height", block.Height)
return
}
cs.Logger.Error("Failed to put Block to IPFS", "err", err, "height", block.Height)
return
}
cs.Logger.Info("Finished putting block to IPFS", "height", block.Height)
}()
Copy link
Member

@evan-forbes evan-forbes Jun 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blocking while posting a proposal block to IPFS seems logical, but it's a big change. We can do this, right? cc @liamsi @adlerjohn

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it's a big change.

Why??? Previously that was just a block-saving function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@evan-forbes, I guess you miss the thing that now PutBlock can take minutes due to blocking providing what can freeze consensus.

Btw, I hope you understand why we need that context logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have been clearer, that's my mistake. We've discussed posting proposal block data to IPFS sync and async a few times previously, so I'm bringing to everyone's attention that this PR changes that. I understand that it is necessary, I too ran the experiments to find out exactly how long it takes. However, this detail is not in the description of this PR or documented anywhere else, so I just wanted to make sure that everyone is on the same page. 🙂

Why??? Previously that was just a block-saving function

We currently panic if something goes wrong, and obvi don't move on with the rest of the proposal process until we are sure that the block is posted to the embedded IPFS node. While entirely necessary, this is a change in consensus, which is inherently a big deal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general remark:

Btw, I hope you understand why we need that context logic.

This could easily be misunderstood as condescending. Please be careful with this and try to always be helpful to and constructive with each other.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it's a big change.

Why??? Previously that was just a block-saving function

It is actually a big change:

Although we did not manually "provide" previously, the PutBlock method has always been more than "just block-saving". We implicitly were using whatever the IPFS team decided as a good default for providing on the public IPFS network. Now, we chose a different "providing strategy" which makes this block longer. Hence, you've changed this to non-blocking. And this intentional and good but it should be clearly communicated why to the reviewer, and more so, in the code itself.

Also:

While entirely necessary, this is a change in consensus, which is inherently a big deal.

A few comments on this that point to deeper underlying issues in my opinion:

  1. The specification should capture these things on a high level and be the authority here. This is why I think, this is actually a pretty high priority right now as this will inform the implementation on a high level: Pseudocode for core-consensus protocol celestia-specs#176 I think it is important to start this with the next iteration (cc @adlerjohn we can tag-team on this).

  2. The question if PutBlock needs to blocking or not depends a bit on the question if we consider a proposal decided or not if providing has not finished. We should be looking at the tendermint code here: IIRC, it just puts the parts into a queue for processing without actually waiting for them to be broadcasted. If that is right, I think we are basically still doing the same, and handling PutBlock in a go-routine should be fine. Note that currently we still have both: Parts gossiping and IPFS providing. So at least in its current form, it's actually not such a big deal as validators still reply on the tendermint block gossiping anyways. IMO, we can merge this as is. But should carefully consider this when we actually replace the current block gossiping mechanism.

  3. We need to update the ADRs accordingly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

condescending

Thanks for the new word I didn't know before. Translation tells me that you are right if we just consider selected code in the GH comment, bc there is a code comment which explains the idea behind the context logic and it's not part of selected GH comments. Although. asking like "Do you understand" would sound much softer and friendlier

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hence, you've changed this to non-blocking. And this intentional and good but it should be clearly communicated why to the reviewer, and more so, in the code itself.

I thought that my investigation comment on the timeout issue and multiple sync discussions explained that well 😞

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So at least in its current form, it's actually not such a big deal as validators still reply on the tendermint block gossiping anyways. IMO, we can merge this as is. But should carefully consider this when we actually replace the current block gossiping mechanism.

I thought this through well enough along with experiments to convince myself that's the best way to go. If you go through the commit history, you will see that I changed that part at least 3 times in the code. That should also work well enough, in case we start gossiping in PurBlock. In some way, I am even kinda proud of this code, specifically for how many cases it covers and how it handles context. Currently, I think that's a final state for that unless we decide to take another very different way to work with IPFS related components, which is very unlikely.

ipfs/embedded.go Show resolved Hide resolved
ipfs/mock.go Show resolved Hide resolved
@Wondertan
Copy link
Member Author

Is this PR an optimization or is this a necessary step towards block propagation via IPFS?

@evan-forbes, this is not an optimization by original intent, though it introduces some. It's necessary cause we need more control over providing. Previously, Bitswap did that in an async and throttled way which does not allow us to track if providing for a block is finished(async) and slowed down providing in general(throttled).

@Wondertan
Copy link
Member Author

@evan-forbes, regarding worker pool. Itt may have some better readability if looking at worker logic specifically, but If you look at PutBlock, new code decreases readability there indeed and I think that's higher PutBlock readability is more important.This comment is also related.

Note that the provider code is not final, as mentioned in the root comment. It is not final first of all, as we are missing reproviding now, but there are also other improvements that could be made.

@evan-forbes
Copy link
Member

evan-forbes commented Jun 8, 2021

@Wondertan

Itt may have some better readability if looking at worker logic specifically, but If you look at PutBlock, new code decreases readability there indeed and I think that's higher PutBlock readability is more important.This comment is also related.

While there are few more variables declared in the PutBlock code using the worker pool template, it is more explicit, which is actually why I think it's more readable. There's very little hidden from what is actually going on. Some of those vars could be tucked away in the worker struct, but they were purposefully left in PutBlock to show the reader precisely what is going on.

It also has the benefit of calling the provide func in parallel, as it does the opposite of the current async code. It has a single goroutine feed jobs to many workers, where the current code has many goroutines each feed a single job to single worker goroutine. that's not true, my mistake

Like you said, it's not final code and probably arbitrary. The worker pool code is only a suggestion, so I'm definitely leaving it up to you. More importantly, if we plan on changing it later, we should write that down in #393 or #395

Copy link
Member

@evan-forbes evan-forbes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides a few minor nitpicks, this PR is awesome and ready to go! 🚀

I know I keep saying it, but I reeeaalllly like any PR that is giving us more control over what happens under the hood in IPFS

great work, @Wondertan

Copy link
Member

@liamsi liamsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before I dive into the weeds here, some comments:

As it stands, with the closes keyword in here it will actually close that issue. Can you please clarify your doubts? "Probably", "not sure if that's enough" and capture them somewhere here (and later in a followup issue)?

  • Makes light client serve IPFS API by default(should be changed later to be configurable)

Why? Is this only temporary to help with debugging?

  • Introduces provider in p2p/ipld that handles point above. NOTE: We need to extract it into an independent singleton. We can do this along with #393, so the code for provider is not final.

A singelton should only be used if everything else does make less sense. I see singeltons as some form of global variables or state and they should be avoided where possible.

Copy link
Member

@liamsi liamsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a beautiful change. Especially, as it disentangles storing and providing to some extent (previously we treated the IPFS Api more like a black-box). But also some of the review comments require minor clarifications (in the code).

p2p/ipld/write.go Outdated Show resolved Hide resolved
p2p/ipld/write.go Show resolved Hide resolved
cmd/tendermint/commands/light.go Show resolved Hide resolved
Comment on lines +83 to +84
cs := NewState(thisConfig.Consensus, state, blockExec, blockStore,
mempool, dag, ipfs.MockRouting(), evpool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: I prefer to break this consistently. i.e. one line per param. Otherwise, it looks like these are logical groups that belong together (thisConfig.Consensus, state, blockExec, blockStore) and (mempool, dag, ipfs.MockRouting(), evpool).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer those two actually, I just followed the pattern in the code. Let's change that in follow-up RP's more connected to consensus code.

Comment on lines 1128 to 1141
cs.proposalCtx, cs.proposalCancel = context.WithCancel(context.TODO())
go func() {
cs.Logger.Info("Putting Block to IPFS", "height", block.Height)
err = ipld.PutBlock(cs.proposalCtx, cs.dag, block, cs.croute, cs.Logger)
if err != nil {
if errors.Is(err, context.Canceled) {
cs.Logger.Error("Putting Block didn't finish in time and was terminated", "height", block.Height)
return
}
cs.Logger.Error("Failed to put Block to IPFS", "err", err, "height", block.Height)
return
}
cs.Logger.Info("Finished putting block to IPFS", "height", block.Height)
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general remark:

Btw, I hope you understand why we need that context logic.

This could easily be misunderstood as condescending. Please be careful with this and try to always be helpful to and constructive with each other.

// generating the roots will start adding the data to IPFS
eds.RowRoots()
eds.ColumnRoots()
var provideWorkers = 32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: why isn't this const?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason. Perhaps. because it's rapidly written code.
Let's keep it as var for now as it does not hurt and later constant in proper Provider

p2p/ipld/write.go Show resolved Hide resolved
ipfs/defaults.go Show resolved Hide resolved
p2p/ipld/write.go Show resolved Hide resolved
Comment on lines 1128 to 1141
cs.proposalCtx, cs.proposalCancel = context.WithCancel(context.TODO())
go func() {
cs.Logger.Info("Putting Block to IPFS", "height", block.Height)
err = ipld.PutBlock(cs.proposalCtx, cs.dag, block, cs.croute, cs.Logger)
if err != nil {
if errors.Is(err, context.Canceled) {
cs.Logger.Error("Putting Block didn't finish in time and was terminated", "height", block.Height)
return
}
cs.Logger.Error("Failed to put Block to IPFS", "err", err, "height", block.Height)
return
}
cs.Logger.Info("Finished putting block to IPFS", "height", block.Height)
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it's a big change.

Why??? Previously that was just a block-saving function

It is actually a big change:

Although we did not manually "provide" previously, the PutBlock method has always been more than "just block-saving". We implicitly were using whatever the IPFS team decided as a good default for providing on the public IPFS network. Now, we chose a different "providing strategy" which makes this block longer. Hence, you've changed this to non-blocking. And this intentional and good but it should be clearly communicated why to the reviewer, and more so, in the code itself.

Also:

While entirely necessary, this is a change in consensus, which is inherently a big deal.

A few comments on this that point to deeper underlying issues in my opinion:

  1. The specification should capture these things on a high level and be the authority here. This is why I think, this is actually a pretty high priority right now as this will inform the implementation on a high level: Pseudocode for core-consensus protocol celestia-specs#176 I think it is important to start this with the next iteration (cc @adlerjohn we can tag-team on this).

  2. The question if PutBlock needs to blocking or not depends a bit on the question if we consider a proposal decided or not if providing has not finished. We should be looking at the tendermint code here: IIRC, it just puts the parts into a queue for processing without actually waiting for them to be broadcasted. If that is right, I think we are basically still doing the same, and handling PutBlock in a go-routine should be fine. Note that currently we still have both: Parts gossiping and IPFS providing. So at least in its current form, it's actually not such a big deal as validators still reply on the tendermint block gossiping anyways. IMO, we can merge this as is. But should carefully consider this when we actually replace the current block gossiping mechanism.

  3. We need to update the ADRs accordingly.

@Wondertan
Copy link
Member Author

Wondertan commented Jun 11, 2021

As it stands, with the closes keyword in here it will actually close that issue. Can you please clarify your doubts? "Probably", "not sure if that's enough" and capture them somewhere here (and later in a followup issue)?

My doubt is DOD for of #349. I personally think that the networking tests I introduced are enough for that, but there is no e2e for IPFS related things. It's not obvious how to make them for me.

Why? Is this only temporary to help with debugging?

While experimenting, I turned that on manually multiples times in the code, until I decided just to commit that. We can keep this for now and fix in #353

A singelton should only be used if everything else does make less sense. I see singeltons as some form of global variables or state and they should be avoided where possible.

Imagine that as Providing service, similar to e.g. Consensus component. We don't need many Providers.

Co-authored-by: Ismail Khoffi <[email protected]>
@Wondertan Wondertan force-pushed the hlib/zero-bootstrapping branch from 57f2dd2 to 48f281c Compare June 11, 2021 16:10
@Wondertan
Copy link
Member Author

WARNING: DATA RACE

That's interesting! Fixing...

Copy link
Member

@liamsi liamsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really dope 🚀 Great work!

@Wondertan
Copy link
Member Author

@evan-forbes, should I merge this before or after NodeProvider. WDYT?

@Wondertan
Copy link
Member Author

My current guess is to merge this first due to CI fix and than apply NodeProvider on top with ease

@evan-forbes
Copy link
Member

I agree, merging this first makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ipfs: improve ipfs tests
4 participants