ipld: Manual providing #375
Conversation
Does anybody have any idea what could be the reason, or has anyone encountered something similar before?
Unfortunately, I have gotten that error quite a few times for different reasons. The e2e test is failing locally for me and producing the same logs as the CI e2e test, which is actually good news, as it means it's not a mysterious CI issue. Per the logs, this happens to all validators very early on, and all of the other tests are passing. The normal tests don't ever transfer data over a network, which leads me to believe that there's some issue with the block data actually being transferred over the network via IPFS. Could be something else though, I'm not sure.
Codecov Report
```diff
@@           Coverage Diff            @@
##           master     #375    +/-  ##
========================================
  Coverage   61.74%   61.75%
========================================
  Files         262      262
  Lines       22933    22981    +48
========================================
+ Hits        14161    14192    +31
- Misses       7265     7281    +16
- Partials     1507     1508     +1
```
Fortunately, I unintentionally fixed e2e by implementing proper context logic for PutBlock in consensus.
This is really cool, and it helps to fulfill the vision of breaking apart and optimizing only what we need from IPFS. 🚀 I left some minor comments and questions.
Probably a stupid question, but I admittedly don't know: is this PR an optimization, or is it a necessary step towards block propagation via IPFS?
I also applied the typical job worker pool template as a minor refactor: https://github.com/lazyledger/lazyledger-core/blob/898f83dbd8b3cb2e011d18a3be2e728e80186d2c/p2p/ipld/write.go#L87-L117
I do think there are some easy readability gains there, but we definitely don't have to use that exact code (see the sketch below).
ref #397
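For illustration, here is a minimal sketch of that job/worker-pool shape, with hypothetical names and a stubbed "provide" step rather than the exact code from the linked template: jobs flow through a channel and a fixed set of workers drains it.

```go
package main

import "fmt"

// provideAll fans work out to a fixed number of workers. The names and the
// stubbed provide step are illustrative only.
func provideAll(cids []string, workers int) {
	jobs := make(chan string)
	done := make(chan struct{})

	for i := 0; i < workers; i++ {
		go func() {
			for c := range jobs {
				fmt.Println("providing", c) // a real worker would announce c on the DHT
				done <- struct{}{}
			}
		}()
	}

	go func() {
		for _, c := range cids {
			jobs <- c
		}
		close(jobs) // lets workers exit their range loops
	}()

	for range cids {
		<-done // wait until every job has been processed
	}
}

func main() {
	provideAll([]string{"cid-a", "cid-b", "cid-c"}, 2)
}
```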
```go
func (p *provider) provided() {
	if atomic.AddInt32(&p.total, -1) == 0 {
```
I feel like this is replicating the logic of a channel in a less traditional way.
It may look unfamiliar if you are not comfortable with the atomic package. I tend to think about reducing thread/goroutine blocking when I write async code, so I use atomic where possible. Furthermore, for this case using atomics makes the code smaller.
I'm missing something here: how does it block less or make the code smaller?
While it is my opinion that we should use the standard way of doing things when possible, especially if it's a quick fix, whether we utilize the logic of a channel or re-implement it here is arbitrary, and I'm leaving it up to you.
> so I use atomic where possible.

Quoting the atomic package documentation:

> These functions require great care to be used correctly. Except for special, low-level applications, synchronization is better done with channels or the facilities of the sync package. Share memory by communicating; don't communicate by sharing memory.

https://golang.org/pkg/sync/atomic/
I feel like the code should contain a very clear argument (e.g. as a comment in the code) for why this was preferred over channels. Not only would that help answer @evan-forbes' question, but it would also let the code speak for itself.

> It may look unfamiliar if you are not comfortable with the atomic package.

The atomic package will be unfamiliar to most Go engineers, as they likely won't need it during their career.
> Share memory by communicating; don't communicate by sharing memory.

That's a famous Rob Pike quote and I agree with it. (I don't think mentioning it here is a good idea.)

> I feel like the code should contain a very clear argument (e.g. as a comment in the code) for why this was preferred over channels.

Ok, there is no important reason to use atomics in this case; that's just my preference. I agree that channels would be more readable for a bigger audience. I will rewrite this to use channels in a singleton Provider, or maybe consider using some simple 3rd-party lib for a worker pool, like https://github.com/gammazero/workerpool, as it matches our use case ideally. However, I don't see any reason to rewrite this now, as this is not the final code and it works well.
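For reference, usage of gammazero/workerpool is indeed tiny; a hedged sketch of how it could wrap the provide calls (the provide step itself is a stand-in, not this PR's code):

```go
package main

import (
	"fmt"

	"github.com/gammazero/workerpool"
)

func main() {
	wp := workerpool.New(32) // at most 32 provide operations run concurrently
	for _, c := range []string{"cid-a", "cid-b", "cid-c"} {
		c := c // capture the loop variable for the closure
		wp.Submit(func() {
			fmt.Println("providing", c) // a real task would announce c via the content routing
		})
	}
	wp.StopWait() // blocks until all submitted tasks have finished
}
```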
> (I don't think mentioning it here is a good idea.)

It's just part of the atomic package docs.

> I will rewrite this to use channels in a singleton Provider, or maybe consider using some simple 3rd-party lib for a worker pool, like https://github.com/gammazero/workerpool, as it matches our use case ideally.

If we can do this simply without introducing another dependency, that is preferred.

> I don't see any reason to rewrite this now, as this is not the final code and it works well.

I think it wasn't always clear which parts are temporary and will be changed ASAP. Note that there is always the risk that we just forget about this if it's not done in this PR, or immediately afterwards.
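To make the comparison concrete, here is a minimal sketch of the same completion signal expressed with sync.WaitGroup instead of the atomic counter. The provider type and method names are assumptions mirroring the snippet above, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// provider mirrors the snippet above, but lets sync.WaitGroup own the
// "decrement and check for zero" logic instead of atomic.AddInt32.
type provider struct {
	wg sync.WaitGroup
}

func (p *provider) start(total int) {
	p.wg.Add(total) // one unit per share/CID to provide
}

// provided replaces: if atomic.AddInt32(&p.total, -1) == 0 { ... }
func (p *provider) provided() {
	p.wg.Done()
}

func (p *provider) waitProvided() {
	p.wg.Wait() // unblocks once every provided() call has come in
}

func main() {
	p := &provider{}
	p.start(4)
	for i := 0; i < 4; i++ {
		go p.provided()
	}
	p.waitProvided()
	fmt.Println("all provides accounted for")
}
```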
consensus/state.go
Outdated
```go
cs.proposalCtx, cs.proposalCancel = context.WithCancel(context.TODO())
go func() {
	cs.Logger.Info("Putting Block to IPFS", "height", block.Height)
	err = ipld.PutBlock(cs.proposalCtx, cs.dag, block, cs.croute, cs.Logger)
	if err != nil {
		if errors.Is(err, context.Canceled) {
			cs.Logger.Error("Putting Block didn't finish in time and was terminated", "height", block.Height)
			return
		}
		cs.Logger.Error("Failed to put Block to IPFS", "err", err, "height", block.Height)
		return
	}
	cs.Logger.Info("Finished putting block to IPFS", "height", block.Height)
}()
```
Not blocking while posting a proposal block to IPFS seems logical, but it's a big change. We can do this, right? cc @liamsi @adlerjohn
> but it's a big change.
Why??? Previously that was just a block-saving function
@evan-forbes, I guess you're missing that PutBlock can now take minutes, due to blocking providing, which can freeze consensus.
Btw, I hope you understand why we need that context logic.
I should have been clearer, that's my mistake. We've discussed posting proposal block data to IPFS sync and async a few times previously, so I'm bringing to everyone's attention that this PR changes that. I understand that it is necessary; I too ran the experiments to find out exactly how long it takes. However, this detail is not in the description of this PR or documented anywhere else, so I just wanted to make sure that everyone is on the same page. 🙂

> Why??? Previously that was just a block-saving function

We currently panic if something goes wrong, and obviously don't move on with the rest of the proposal process until we are sure that the block is posted to the embedded IPFS node. While entirely necessary, this is a change in consensus, which is inherently a big deal.
A general remark:

> Btw, I hope you understand why we need that context logic.

This could easily be misunderstood as condescending. Please be careful with this and try to always be helpful to and constructive with each other.
> but it's a big change.

> Why??? Previously that was just a block-saving function

It is actually a big change: although we did not manually "provide" previously, the PutBlock method has always been more than "just block-saving". We implicitly were using whatever the IPFS team decided was a good default for providing on the public IPFS network. Now we chose a different "providing strategy", which makes this block longer. Hence, you've changed this to non-blocking. And this is intentional and good, but it should be clearly communicated why to the reviewer, and more so, in the code itself.
Also:

> While entirely necessary, this is a change in consensus, which is inherently a big deal.

A few comments on this that point to deeper underlying issues, in my opinion:

- The specification should capture these things on a high level and be the authority here. This is why I think this is actually a pretty high priority right now, as it will inform the implementation on a high level: Pseudocode for core-consensus protocol celestia-specs#176. I think it is important to start this with the next iteration (cc @adlerjohn, we can tag-team on this).
- The question of whether PutBlock needs to be blocking or not depends a bit on whether we consider a proposal decided if providing has not finished. We should be looking at the tendermint code here: IIRC, it just puts the parts into a queue for processing without actually waiting for them to be broadcasted. If that is right, I think we are basically still doing the same, and handling PutBlock in a goroutine should be fine. Note that currently we still have both: parts gossiping and IPFS providing. So at least in its current form, it's actually not such a big deal, as validators still rely on the tendermint block gossiping anyway. IMO, we can merge this as is, but we should carefully consider this when we actually replace the current block gossiping mechanism.
- We need to update the ADRs accordingly.
> condescending

Thanks for the new word, I didn't know it before. Translation tells me that you are right if we just consider the selected code in the GH comment, because there is a code comment which explains the idea behind the context logic and it's not part of the selected GH comments. Although, asking "Do you understand?" would sound much softer and friendlier.
> Hence, you've changed this to non-blocking. And this is intentional and good, but it should be clearly communicated why to the reviewer, and more so, in the code itself.

I thought that my investigation comment on the timeout issue and multiple sync discussions explained that well 😞
> So at least in its current form, it's actually not such a big deal, as validators still rely on the tendermint block gossiping anyway. IMO, we can merge this as is, but we should carefully consider this when we actually replace the current block gossiping mechanism.

I thought this through well enough, along with experiments, to convince myself that this is the best way to go. If you go through the commit history, you will see that I changed that part at least 3 times in the code. It should also work well enough in case we start gossiping in PutBlock. In some way, I am even kinda proud of this code, specifically for how many cases it covers and how it handles context. Currently, I think that's the final state for it, unless we decide to take another, very different way to work with IPFS-related components, which is very unlikely.
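To illustrate the context logic under discussion, here is a self-contained sketch; putBlock is a stand-in for ipld.PutBlock, and the flow is an assumption based on the snippet above, not the PR's exact code. A slow provide runs in a goroutine and is terminated via cancellation instead of freezing consensus:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// putBlock stands in for ipld.PutBlock: it honors ctx so that providing
// can be cut off instead of blocking the proposal path for minutes.
func putBlock(ctx context.Context, height int64) error {
	select {
	case <-time.After(2 * time.Second): // pretend providing is slow
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	proposalCtx, proposalCancel := context.WithCancel(context.Background())
	done := make(chan struct{})

	go func() {
		defer close(done)
		if err := putBlock(proposalCtx, 42); err != nil {
			if errors.Is(err, context.Canceled) {
				fmt.Println("putting block didn't finish in time and was terminated")
				return
			}
			fmt.Println("failed to put block:", err)
			return
		}
		fmt.Println("finished putting block")
	}()

	time.Sleep(100 * time.Millisecond) // e.g. consensus moved on to the next height
	proposalCancel()                   // terminate the stale PutBlock
	<-done
}
```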
@evan-forbes, this is not an optimization by original intent, though it introduces some. It's necessary because we need more control over providing. Previously, Bitswap did that in an async and throttled way, which does not allow us to track whether providing for a block is finished (async) and slowed down providing in general (throttled).
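As background for "more control over providing": manual providing boils down to announcing each CID through the content-routing API yourself rather than relying on Bitswap's background provider. A hedged sketch against the standard libp2p routing.ContentRouting interface; this is not the PR's exact code, and the CID list is assumed to come from the block's DAG:

```go
package ipldprovide

import (
	"context"
	"fmt"

	"github.com/ipfs/go-cid"
	"github.com/libp2p/go-libp2p-core/routing"
)

// provideBlock announces every CID of a block's DAG on the DHT directly,
// so the caller knows exactly when providing has finished, unlike
// Bitswap's async, throttled provider.
func provideBlock(ctx context.Context, croute routing.ContentRouting, cids []cid.Cid) error {
	for _, c := range cids {
		// announce == true publishes the provider record to the network
		if err := croute.Provide(ctx, c, true); err != nil {
			return fmt.Errorf("providing %s: %w", c, err)
		}
	}
	return nil
}
```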
@evan-forbes, regarding the worker pool: it may have somewhat better readability if you look at the worker logic specifically, but if you look at… Note that the…
While there are a few more variables declared in the…
Like you said, it's not final code and probably arbitrary. The worker pool code is only a suggestion, so I'm definitely leaving it up to you. More importantly, if we plan on changing it later, we should write that down in #393 or #395.
Besides a few minor nitpicks, this PR is awesome and ready to go! 🚀
I know I keep saying it, but I reeeaalllly like any PR that gives us more control over what happens under the hood in IPFS.
Great work, @Wondertan
Before I dive into the weeds here, some comments:

> - Provides networking tests in IPLD. Probably closes ipfs: improve ipfs tests #349, not sure if that's enough

As it stands, with the closes keyword in here, it will actually close that issue. Can you please clarify your doubts ("Probably", "not sure if that's enough") and capture them somewhere here (and later in a follow-up issue)?

> - Makes light client serve IPFS API by default (should be changed later to be configurable)

Why? Is this only temporary to help with debugging?

> - Introduces provider in p2p/ipld that handles the point above. NOTE: We need to extract it into an independent singleton. We can do this along with #393, so the code for provider is not final.

A singleton should only be used if everything else makes less sense. I see singletons as a form of global variables or state, and they should be avoided where possible.
I think this is a beautiful change, especially as it disentangles storing and providing to some extent (previously we treated the IPFS API more like a black box). But some of the review comments also require minor clarifications (in the code).
```go
cs := NewState(thisConfig.Consensus, state, blockExec, blockStore,
	mempool, dag, ipfs.MockRouting(), evpool)
```
Minor nit: I prefer to break this consistently, i.e. one line per param. Otherwise, it looks like these are logical groups that belong together: (thisConfig.Consensus, state, blockExec, blockStore) and (mempool, dag, ipfs.MockRouting(), evpool).
I actually prefer those two; I just followed the pattern in the code. Let's change that in follow-up PRs more connected to the consensus code.
```go
// generating the roots will start adding the data to IPFS
eds.RowRoots()
eds.ColumnRoots()

var provideWorkers = 32
```
Nit: why isn't this a const?
No reason, perhaps because it's rapidly written code.
Let's keep it as a var for now, as it does not hurt, and make it a constant later in the proper Provider.
My doubt is the DoD of #349. I personally think that the networking tests I introduced are enough for that, but there is no e2e for IPFS-related things, and it's not obvious to me how to write them.
While experimenting, I turned that on manually multiple times in the code, until I decided just to commit it. We can keep this for now and fix it in #353.
Imagine it as a Providing service, similar to e.g. the Consensus component. We don't need many Providers.
That's interesting! Fixing...
This is really dope 🚀 Great work!
@evan-forbes, should I merge this before or after…
My current guess is to merge this first, due to the CI fix, and then apply NodeProvider on top with ease.
I agree, merging this first makes sense.
This PR: