
GetOrchestrator slowness when PM Sender data is not cached #2849

Merged — 6 commits merged into master on Nov 6, 2023

Conversation

@stronk-dev (Contributor) commented Aug 30, 2023

What does this pull request do?

Titan Node had some curious insights into the relationship between an Orchestrator's Eth RPC provider and the response times to GetOrchestrator requests, and started investigating. @ad-astra-video assisted, and they found that during a GetOrchestrator request it was possible for a blocking RPC call to be made to check the float of the Broadcaster.

This commit is pulled from @ad-astra-video's fork of go-livepeer. It increases the timeout for how long B info stays cached and adds explicit calls to refresh that data in the background every 10 minutes.

My questions for the person reviewing this:

  • Are these explicit update calls on the cleanupInterval required, or do the block watchers already keep all relevant B data up to date as long as the PM sender is cached?

        case <-ticker.C:
            sm.smgr.UpdateSenderInfos()
            sm.smgr.UpdateClaimedReserves(sm.cfg.Claimant)
            sm.cleanup()

  • If there is data we can keep cached this way, what would be a sensible cleanupInterval, and are there any other calls we might want to make? (See the sketch after this list.)
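To make the question concrete, here is a minimal, self-contained sketch of the ticker-driven background-refresh pattern this PR adds to the cleanup loop. The function shape, the refresh/cleanup callbacks, and the demo interval are illustrative stand-ins rather than the actual go-livepeer implementation; only the UpdateSenderInfos / UpdateClaimedReserves / cleanup call names come from the snippet above.

```go
package main

import (
	"context"
	"log"
	"time"
)

// startCleanupLoop runs cache maintenance on a fixed interval. On each tick it
// first refreshes the cached Broadcaster (PM sender) data in the background,
// then evicts stale entries, so the GetOrchestrator request path never has to
// make a blocking eth RPC call itself.
func startCleanupLoop(ctx context.Context, interval time.Duration, refresh, cleanup func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			refresh() // placeholder for UpdateSenderInfos / UpdateClaimedReserves
			cleanup() // placeholder for the existing cache eviction pass
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// The PR uses a 10-minute cleanupInterval; a short interval is used here
	// only so this demo produces output quickly.
	go startCleanupLoop(ctx, 200*time.Millisecond,
		func() { log.Println("refreshing SenderInfo / ClaimedReserve in the background") },
		func() { log.Println("evicting expired PM sender cache entries") },
	)

	time.Sleep(time.Second)
}
```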

@stronk-dev changed the title from "move SenderInfo and ClaimedReserve to background refreshes in cleanup…" to "GetOrchestrator slowness when PM Sender data is not cached" on Aug 30, 2023
@ad-astra-video (Collaborator)

Want to add the log lines pulled from the Titan Node orchestrator while testing this. These log lines have since been removed, but they show where in the code the OrchestratorInfo response generation slowed down. You can also see that every other step of generating the response takes 4 ms of the 484 ms total.

When autoAdjustPrice is not set to false, the same SenderInfo and ClaimedReserve calls are the slowdown point.

log (2).txt

@Titan-Node

Been running this update for 48 hours; all streams and node stats look fine.

@thomshutt (Contributor)

Thanks @stronk-dev, I'm going to get someone to start looking at this today

@thomshutt (Contributor)

Looks like we have genuine test failures here, could you take a look please @stronk-dev?

@ad-astra-video (Collaborator)

> Looks like we have genuine test failures here, could you take a look please @stronk-dev?

The tests have not been updated yet. We were looking for feedback on moving the eth RPC calls in the OrchestratorInfo response generation to the background, on the change in cache time from 60 seconds to 1 week, and on the cleanup interval change from 1 minute to 10 minutes. There is a blockWatcher that accounts for broadcaster reserve changes, but with the cache as short as 60 seconds the cached data rarely survived long enough for it to matter. In Titan's testing, at the 1-2 minute mark the OrchestratorInfo response would slow down because of the eth RPC calls.

The gist is that this change can significantly speed up the OrchestratorInfo responses required at the start of each stream and periodically during streams. The goal is to provide more reliable response times and, possibly, to shorten the initial session-pool building to speed up first-segment transcoding.
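To illustrate the interaction between the cache TTL and the cleanup interval described above, here is a hedged sketch of a timestamped cache with periodic eviction; the type and field names are hypothetical rather than the go-livepeer ones. With a 60-second TTL and a 1-minute cleanup pass, an entry rarely survives beyond the pass after its last use, so the next request pays the eth RPC cost again; with a one-week TTL the entry stays warm and the blockWatcher can keep it current instead.

```go
package sendercache

import (
	"sync"
	"time"
)

// cachedSender is a hypothetical cache entry holding per-Broadcaster payment
// data together with the time it was last touched.
type cachedSender struct {
	info       string // stand-in for SenderInfo / ClaimedReserve data
	lastAccess time.Time
}

// senderCache evicts entries whose last access is older than ttl.
type senderCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]*cachedSender
}

// cleanup removes every entry that has not been touched within the TTL. With
// ttl = 60 * time.Second this clears almost everything on each 1-minute pass,
// forcing the next request to re-fetch over eth RPC; with ttl set to a week,
// entries persist and the blockWatcher can keep them up to date.
func (c *senderCache) cleanup() {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	for addr, entry := range c.entries {
		if now.Sub(entry.lastAccess) > c.ttl {
			delete(c.entries, addr)
		}
	}
}
```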

@leszko (Contributor) commented Sep 6, 2023

I'm trying to catch up on this issue.

So, if I understand correctly, there is a bug where, when a GetOrchestrator call and an RPC request happen at the same time, the GetOrchestrator response is delayed for some reason? Is that correct?

I'm actually not sure how changing the timeouts solves this issue. Could you provide a more detailed description of that?

About your questions: I don't know, I would need to dig into the code, but first I'm trying to understand the actual issue.

@stronk-dev (Contributor, Author) commented Sep 6, 2023

There's no bug, but because the timeout is only 1 minute, it often happens that an Orchestrator has to make a blocking RPC call in order to respond to a GetOrchestrator request. In that case the Orchestrator needs up-to-date reserve info on the B making the GetOrchestrator request before it can respond.

Depending on the speed, distance, and availability of the L2 RPC provider (which can be especially problematic for regions like Singapore), this causes large fluctuations in response times to GetOrchestrator requests. Sometimes the latency is high enough for the Orchestrator to go above the threshold a B has set for populating its working set of Orchestrators. You can see here that when Titan applied this patch, his response times immediately became excellent.

Doing blocking RPC calls during a request like that is inefficient anyway. It's a good idea to make sure any data an O needs to respond to B requests is always up to date and available. What we're trying to figure out here (see the sketch after this list):

  • Is increasing the timeout enough? Will the block watchers keep any data we need up to date already, or do we require the explicit calls added in this PR to the cleanup-check loop? If so, is there other data we want to pre-fetch?
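For clarity, here is a minimal sketch (with hypothetical names, not the actual go-livepeer code) of the request-path behaviour being described: on a cache hit the handler answers from memory, while a miss blocks on an eth RPC round trip, and that round trip is what shows up as GetOrchestrator latency.

```go
package main

import (
	"context"
	"log"
	"math/big"
	"time"
)

// fetchReserveRPC stands in for the blocking eth RPC call (reading the
// Broadcaster's reserve/float on the L2); the 400 ms delay simulates a
// far-away or overloaded provider. The name and signature are illustrative.
func fetchReserveRPC(ctx context.Context, sender string) (*big.Int, error) {
	select {
	case <-time.After(400 * time.Millisecond):
		return big.NewInt(1e18), nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// reserveCache is a hypothetical per-sender cache of reserve data.
type reserveCache struct {
	entries map[string]*big.Int
}

// maxFloat answers from the cache when possible; a miss blocks the request on
// the RPC call, which is exactly the latency spike seen on GetOrchestrator.
func (c *reserveCache) maxFloat(ctx context.Context, sender string) (*big.Int, error) {
	if v, ok := c.entries[sender]; ok {
		return v, nil // fast path: cached, sub-millisecond
	}
	v, err := fetchReserveRPC(ctx, sender) // slow path: blocking eth RPC
	if err != nil {
		return nil, err
	}
	c.entries[sender] = v
	return v, nil
}

func main() {
	c := &reserveCache{entries: map[string]*big.Int{}}
	ctx := context.Background()

	start := time.Now()
	_, _ = c.maxFloat(ctx, "0xBroadcaster")
	log.Printf("cold cache: %v", time.Since(start)) // pays the RPC round trip

	start = time.Now()
	_, _ = c.maxFloat(ctx, "0xBroadcaster")
	log.Printf("warm cache: %v", time.Since(start)) // served from memory
}
```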

@leszko (Contributor) commented Sep 7, 2023

Ok, I understand, thanks for the explanation Marco.

One thing I don't understand is why GetOrchestrator needs to know anything about the Broadcaster / Sender. Do we use any of this info in getOrchestrator()?

Other than that, my understanding is that, in theory, you should never need to refresh the cache, because it's updated with the block feed here. I'm actually not sure why we're cleaning up this cache at all 🤔 That said, I think this PR is ok; to be on the safe side, we can refresh the cache from time to time.

I'll put some comments inline. Other than that, I'd love an answer to my question above ☝️

@leszko (Contributor) left a comment

Added the comments. In general, I understand the issue and this PR, so I think we should get it merged.

Review threads (outdated, resolved): cmd/livepeer/starter/starter.go, pm/sendermonitor.go
@stronk-dev (Contributor, Author)

> One thing I don't understand is why GetOrchestrator needs to know anything about the Broadcaster / Sender. Do we use any of this info in getOrchestrator()?

It generates B-specific ticket parameters and has a special reject case if the Broadcaster has insufficient reserves to do a payout:

go-livepeer/server/rpc.go

Lines 315 to 326 in d8ad5c7

    func getOrchestrator(orch Orchestrator, req *net.OrchestratorRequest) (*net.OrchestratorInfo, error) {
        addr := ethcommon.BytesToAddress(req.Address)
        if err := verifyOrchestratorReq(orch, addr, req.Sig); err != nil {
            return nil, fmt.Errorf("Invalid orchestrator request: %v", err)
        }
        if _, err := authenticateBroadcaster(addr.Hex()); err != nil {
            return nil, fmt.Errorf("authentication failed: %v", err)
        }
        // currently, orchestrator == transcoder
        return orchestratorInfo(orch, addr, orch.ServiceURI().String())

go-livepeer/server/rpc.go

Lines 348 to 354 in d8ad5c7

    func orchestratorInfo(orch Orchestrator, addr ethcommon.Address, serviceURI string) (*net.OrchestratorInfo, error) {
        priceInfo, err := getPriceInfo(orch, addr)
        if err != nil {
            return nil, err
        }
        params, err := orch.TicketParams(addr, priceInfo)

    func (orch *orchestrator) TicketParams(sender ethcommon.Address, priceInfo *net.PriceInfo) (*net.TicketParams, error) {
        if orch.node == nil || orch.node.Recipient == nil {
            return nil, nil
        }
        ratPriceInfo, err := common.RatPriceInfo(priceInfo)
        if err != nil {
            return nil, err
        }
        params, err := orch.node.Recipient.TicketParams(sender, ratPriceInfo)

go-livepeer/pm/recipient.go

Lines 192 to 201 in f350fba

    func (r *recipient) TicketParams(sender ethcommon.Address, price *big.Rat) (*TicketParams, error) {
        randBytes := RandBytes(32)
        seed := new(big.Int).SetBytes(randBytes)
        faceValue := big.NewInt(0)
        // If price is 0 face value, win prob and EV are 0 because no payments are required
        if price.Num().Cmp(big.NewInt(0)) > 0 {
            var err error
            faceValue, err = r.faceValue(sender)

go-livepeer/pm/recipient.go

Lines 264 to 274 in f350fba

    func (r *recipient) faceValue(sender ethcommon.Address) (*big.Int, error) {
        txCost := r.txCost()
        // faceValue = txCost * txCostMultiplier
        faceValue := new(big.Int).Mul(txCost, big.NewInt(int64(r.cfg.TxCostMultiplier)))
        if faceValue.Cmp(r.cfg.EV) < 0 {
            faceValue = new(big.Int).Mul(r.cfg.EV, evMultiplier)
        }
        // Fetch current max float for sender
        maxFloat, err := r.sm.MaxFloat(sender)

    func (sm *LocalSenderMonitor) MaxFloat(addr ethcommon.Address) (*big.Int, error) {
        sm.mu.Lock()
        defer sm.mu.Unlock()
        sm.ensureCache(addr)
        return sm.maxFloat(addr)
    }

Not sure how important these ticket parameters are for the B, but if the SenderMonitor keeps this data up to date anyway, increasing the timeout would be a nice first step.
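To make the dependency concrete, here is a hedged, self-contained sketch of the faceValue clamping shown in the excerpt above (txCost times txCostMultiplier, raised to EV times evMultiplier when it would otherwise fall below EV). The standalone function shape and parameter types are illustrative rather than the go-livepeer API; the point is that the real code then consults MaxFloat(sender), which is precisely the value that needs cached PM sender data.

```go
package ticketsketch

import "math/big"

// faceValueSketch mirrors the arithmetic in recipient.faceValue above:
// faceValue = txCost * txCostMultiplier, raised to EV * evMultiplier when it
// would otherwise fall below EV. All inputs here are hypothetical numbers.
func faceValueSketch(txCost, ev *big.Int, txCostMultiplier, evMultiplier int64) *big.Int {
	faceValue := new(big.Int).Mul(txCost, big.NewInt(txCostMultiplier))
	if faceValue.Cmp(ev) < 0 {
		faceValue = new(big.Int).Mul(ev, big.NewInt(evMultiplier))
	}
	// The real implementation next fetches MaxFloat(sender), which is why a
	// cold PM sender cache turns ticket-parameter generation into a blocking
	// eth RPC call.
	return faceValue
}
```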

@stronk-dev (Contributor, Author)

In order to get this PR ready for merge, I think there's just one minor question at the moment:

Since handleRoundEvent already calls GetSenderInfo and ClaimedReserve on a new round, should the newly added code in this PR that calls these functions simply be dropped from startCleanupLoop? It seems like everything should be fine without it; otherwise we could still add the extra timed loop to refresh that data specifically.
cc @leszko

@leszko (Contributor) commented Nov 2, 2023

> startCleanupLoop

Yes, in my opinion, we can remove it from startCleanupLoop.

@stronk-dev (Contributor, Author)

Modified. I guess the added print for when a sender's info gets cleared could be nice to keep in there.

Review thread (outdated, resolved): pm/sendermonitor.go
@leszko (Contributor) commented Nov 6, 2023

Waiting for the CI build and then I'll merge! Thanks for the PR @stronk-dev

codecov bot commented Nov 6, 2023

Codecov Report

Merging #2849 (26be254) into master (063b1ea) will not change coverage.
The diff coverage is 100.00000%.

Impacted file tree graph

@@              Coverage Diff              @@
##              master       #2849   +/-   ##
=============================================
  Coverage   56.53269%   56.53269%           
=============================================
  Files             89          89           
  Lines          19456       19456           
=============================================
  Hits           10999       10999           
  Misses          7849        7849           
  Partials         608         608           
Files                             Coverage Δ
cmd/livepeer/starter/starter.go   4.73422% <ø> (ø)
pm/sendermonitor.go               88.81579% <100.00000%> (ø)

Continue to review full report in Codecov by Sentry.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 063b1ea...26be254.


@leszko leszko merged commit 8053582 into master Nov 6, 2023
18 checks passed
@leszko leszko deleted the md/adastra-fix branch November 6, 2023 12:23
eliteprox pushed a commit to eliteprox/go-livepeer that referenced this pull request Feb 21, 2024