BAAS-28597: Add counter for limiter wait calls #117
@@ -213,6 +214,10 @@ func (self *Runtime) Ticks() uint64 {
	return self.ticks
}

func (self *Runtime) LimiterWaitCount() uint64 {
	return self.limiterWaitCount
@arahmanan the baas counterpart will be to log the LimiterWaitCount when the function execution finishes, similar to how we log Ticks(). (This could also be a Prometheus counter, or perhaps I can piggyback on the Ticks log and include limiterWaitCount. We can discuss that in the BAAS PR though.)
Oh, I like the idea of adding it to the ticks log! If we do that, adding a new Prometheus metric won't be necessary.
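A rough sketch of what piggybacking on the ticks log could look like. The `runtimeStats` type, the `formatExecutionLog` helper, and the log wording are all hypothetical stand-ins; the thread does not show the actual BAAS logging code.

```go
package main

import "fmt"

// runtimeStats is a hypothetical stand-in for the counters exposed by Runtime.
type runtimeStats struct {
	ticks            uint64
	limiterWaitCount uint64
}

func (s *runtimeStats) Ticks() uint64            { return s.ticks }
func (s *runtimeStats) LimiterWaitCount() uint64 { return s.limiterWaitCount }

// formatExecutionLog builds a single log line carrying both counters, so no
// separate metric is needed for the wait count. The wording is illustrative.
func formatExecutionLog(s *runtimeStats) string {
	return fmt.Sprintf("function execution finished: ticks=%d limiter_wait_count=%d",
		s.Ticks(), s.LimiterWaitCount())
}

func main() {
	s := &runtimeStats{ticks: 120000, limiterWaitCount: 2}
	fmt.Println(formatExecutionLog(s))
}
```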
vm.go
Outdated
@@ -617,6 +617,7 @@ func (vm *vm) run() {
	ctx = context.Background()
}

vm.r.limiterWaitCount++
We aren't actually waiting every time we call WaitN. To determine whether we are actually waiting, take a look at this as an example. We'll only wait if the delay is > 0.
[opt] In addition to the number of times we waited, I think it would also be valuable to track the total delay for the entire execution.
Oh, I see the distinction. Does that come down to timing?
We have a rate limit of 10_000_000 and a burst value of 250_000, which I think is effectively 50_000 when you account for the burst divisor.
Aside from a lot of process yielding, I'm not sure I fully understand in what scenario we actually wait, and for how long.
If we use 10_000_000 ticks in half a second would we wait another half a second because that is the amount left until we can execute 50_000 ticks (provided by fillBucket)?
If we use 10_000_000 ticks with only 1ms left in the "second" window, would we only wait that 1ms before we can execute 50_000 ticks?
If the above is true, I assume we could do something like:
reservation := vm.r.limiter.ReserveN(time.Now(), vm.r.limiterTicksLeft)
// count reservation.Delay()
// increment wait counter if Delay > 0
if r := vm.r.limiter.ReserveN(time.Now(), vm.r.limiterTicksLeft); r.OK() {
	waitDelayMS := r.Delay().Milliseconds()
	if waitDelayMS > 0 {
		vm.r.limiterWaitTotalMS += waitDelayMS
		vm.r.limiterWaitCount++
	}
}
if waitErr := vm.r.limiter.WaitN(ctx, vm.r.limiterTicksLeft); waitErr != nil {
From the comment above WaitN:
> WaitN blocks until lim permits n events to happen.
In other words, we aren't going to wait if the limit allows those n events to be processed. To be able to determine if we're actually waiting or not, you want to do something along these lines:
now := time.Now()
r := vm.r.limiter.ReserveN(now, vm.r.limiterTicksLeft)
if !r.OK() {
	panic("")
}
// Wait if necessary
delay := r.DelayFrom(now)
if delay > 0 {
	vm.r.limiterWaitCount++
	err := util.Sleep(ctx, delay)
	if err != nil {
		r.Cancel()
		panic(err)
	}
}
> If we use 10_000_000 ticks in half a second would we wait another half a second because that is the amount left until we can execute 50_000 ticks (provided by fillBucket)?
Not exactly. We would wait approximately until enough time has passed to process the next 50k ticks, i.e. 50_000 / 10_000_000 = 0.005s = 5ms. You can find more about that here.
> If we use 10_000_000 ticks with only 1ms left in the "second" window, would we only wait that 1ms before we can execute 50_000 ticks?
This can't happen. The rate limiter won't process more than 10_000_000 ticks/s, so it would just take 1s to process the first 10MM ticks. In other words, since we process 50k ticks at a time, a function can process at most 50k ticks every 5ms.
These docs do a decent job of explaining how this works. Let me know if you have more questions about it. I'm also happy to hop on a quick call.
LGTM pending benchmark results and one super minor comment.
vm.go
Outdated
delay := r.DelayFrom(now)
if delay > 0 {
	vm.r.limiterWaitCount++
	vm.r.limiterWaitTotalTime += delay.Nanoseconds()
[nit] Thoughts on changing limiterWaitTotalTime to be of type time.Duration? That makes what's returned by LimiterWaitTotalTime a little more explicit without having to read the docstring.
Ooo I like that! Great idea
vm.go
Outdated
select {
case <-ctx.Done():
	panic(ctx.Err())
[opt] I don't believe this extra check is necessary if delay == 0. We already interrupt the VM here when the context times out or is canceled.
Hmm, this was added in response to this test failing. Though I still have test failures, so I need to investigate what is happening.
Sorry, I'm not sure I follow. Did you figure out what caused the test to fail? I still think we can get rid of this extra select statement. Correct me if I'm wrong.
Let me know if you'd like to chat about this.
Specifically the TestFunctionExecTimeLimit test is what never finishes.
I see. [opt] The wait function has the select statement before it performs the reserve/wait operations. Should we do the same here? i.e. have this select statement right after we define ctx.
Yeah I think that's a good idea to stay consistent.
…at might be why it was failing
vm.go
Outdated
now := time.Now()
r := vm.r.limiter.ReserveN(now, vm.r.limiterTicksLeft)
if !r.OK() {
	panic("failed to make reservation")
Suggested change:
-	panic("failed to make reservation")
+	panic(context.DeadlineExceeded)
This will keep the same behavior as the if strings.Contains(waitErr.Error(), "would exceed") { check. That being said, I don't believe this can ever happen at the moment.
LGTM! Just an optional comment.