BAAS-28597: Add counter for limiter wait calls #117

Merged · 13 commits into mongodb-forks:realm · Mar 11, 2024

Conversation

Calvinnix (Author):

No description provided.

@@ -213,6 +214,10 @@ func (self *Runtime) Ticks() uint64 {
 	return self.ticks
 }
 
+func (self *Runtime) LimiterWaitCount() uint64 {
+	return self.limiterWaitCount
+}

Calvinnix (Author):

@arahmanan the BAAS counterpart will be to log the LimiterWaitCount when the function execution finishes, similar to how we log Ticks(). (This could also be a Prometheus counter potentially, or perhaps I can piggyback on the Ticks log and include limiterWaitCount... we can discuss in the BAAS PR though.)

arahmanan:

oh I like the idea of adding it to the ticks log! If we do that, adding a new prom metric won't be necessary.
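For context, a minimal sketch of what that logging could look like on the BAAS side. Everything here is illustrative only: the package, logger, message, field names, and the execStats interface are assumptions; only Ticks() and the LimiterWaitCount() accessor added in this PR come from the source.

package baaslog

import "log/slog"

// execStats is the minimal surface this sketch needs from the runtime;
// Ticks() already exists and LimiterWaitCount() is added in this PR.
type execStats interface {
	Ticks() uint64
	LimiterWaitCount() uint64
}

// logExecutionFinished is a hypothetical helper: it emits the new counter
// on the same log line as the existing tick count once execution finishes.
func logExecutionFinished(rt execStats) {
	slog.Info("function execution finished",
		"ticks", rt.Ticks(),
		"limiterWaitCount", rt.LimiterWaitCount(),
	)
}

Whether this stays a log field or also becomes a Prometheus counter is the question deferred to the BAAS PR.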

runtime.go (outdated, resolved)
vm.go (outdated)
@@ -617,6 +617,7 @@ func (vm *vm) run() {
 		ctx = context.Background()
 	}
 
+	vm.r.limiterWaitCount++

arahmanan:

We aren't actually waiting every time we call WaitN. To determine whether we are actually waiting, take a look at this as an example: we only wait if the delay is > 0.

[opt] In addition to the number of times we waited, I think it would also be valuable to track the total delay for the entire execution.

Calvinnix (Author):

Oh, I see the distinction. Does that come down to timing?

We have a rate limit of 10_000_000 and a burst value of 250_000, which I think is effectively 50_000 when you account for the burst divisor.

Aside from a lot of process yielding, I'm not sure I fully understand what scenario we actually wait and for how long.

If we use 10_000_000 ticks in half a second would we wait another half a second because that is the amount left until we can execute 50_000 ticks (provided by fillBucket)?

If we use 10_000_000 ticks with only 1ms left in the "second" window, would we only wait that 1ms before we can execute 50_000 ticks?

Calvinnix (Author):

If the above is true I assume we could do something like:

	reservation := vm.r.limiter.ReserveN(time.Now(), vm.r.limiterTicksLeft)
	// accumulate reservation.Delay() into a total-wait counter
	// increment the wait counter only if Delay() > 0

Calvinnix (Author):

			if r := vm.r.limiter.ReserveN(time.Now(), vm.r.limiterTicksLeft); r.OK() {
				waitDelayMS := r.Delay().Milliseconds()
				if waitDelayMS > 0 {
					vm.r.limiterWaitTotalMS += waitDelayMS
					vm.r.limiterWaitCount++
				}
			}
			if waitErr := vm.r.limiter.WaitN(ctx, vm.r.limiterTicksLeft); waitErr != nil {

arahmanan:

From the comment above WaitN:

WaitN blocks until lim permits n events to happen.

In other words, we aren't going to wait if the limit allows those n events to be processed. To be able to determine if we're actually waiting or not, you want to do something along these lines:

	now := time.Now()
	r := vm.r.limiter.ReserveN(now, vm.r.limiterTicksLeft)
	if !r.OK() {
		panic("")
	}

	// Wait only if the reservation requires it
	delay := r.DelayFrom(now)
	if delay > 0 {
		vm.r.limiterWaitCount++

		err := util.Sleep(ctx, delay)
		if err != nil {
			r.Cancel()
			panic(err)
		}
	}

> If we use 10_000_000 ticks in half a second would we wait another half a second because that is the amount left until we can execute 50_000 ticks (provided by fillBucket)?

Not exactly. We would approximately wait until enough time has passed to process the next 50k ticks. i.e. (50_000 / 10_000_000) => 0.005s => 5ms. You can find more about that here.

> If we use 10_000_000 ticks with only 1ms left in the "second" window, would we only wait that 1ms before we can execute 50_000 ticks?

This can't happen. The rate limiter won't process more than 10_000_000 ticks / s. So it would just take 1s to process the first 10MM ticks. In other words, since we process 50k ticks at a time, a function can process at most 50k ticks every 5ms.

These docs do a decent job at explaining how this works. Let me know if you have more questions about it. I'm happy to also hop on a quick call.
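For concreteness, here is a small standalone sketch of that arithmetic using the standard golang.org/x/time/rate package with the numbers quoted above (10_000_000 ticks/s, burst of 250_000, ticks consumed in 50_000-tick chunks). This assumes the fork's limiter follows the same token-bucket semantics as the standard package.

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// 10M ticks per second, burst of 250k, as discussed above.
	lim := rate.NewLimiter(rate.Limit(10_000_000), 250_000)
	now := time.Now()

	// Drain the initial burst so the next reservation has to wait.
	lim.ReserveN(now, 250_000)

	// Reserving the next 50k-tick chunk now carries a delay of
	// 50_000 / 10_000_000 s = 5ms, matching the math above.
	r := lim.ReserveN(now, 50_000)
	fmt.Println(r.DelayFrom(now)) // prints 5ms
}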

arahmanan left a comment:

LGTM pending benchmark results and one super minor comment.

vm.go (outdated)
	delay := r.DelayFrom(now)
	if delay > 0 {
		vm.r.limiterWaitCount++
		vm.r.limiterWaitTotalTime += delay.Nanoseconds()

arahmanan:

[nit] thoughts on changing limiterWaitTotalTime to be of type time.Duration? That makes what's returned by LimiterWaitTotalTime a little more explicit without having to read the docstring.

Calvinnix (Author):

Ooo I like that! Great idea
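A rough sketch of the change being agreed on here, assuming field and accessor names matching the diff (the comment wording and exact placement are mine):

// runtime.go: storing the total as a time.Duration makes the unit part of
// the type, so the accessor no longer needs a docstring to explain it.
func (self *Runtime) LimiterWaitTotalTime() time.Duration {
	return self.limiterWaitTotalTime
}

On the vm.go side, delay is already a time.Duration, so the accumulation becomes vm.r.limiterWaitTotalTime += delay with no Nanoseconds() conversion.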

vm.go (outdated)
Comment on lines 636 to 638:
	select {
	case <-ctx.Done():
		panic(ctx.Err())

arahmanan:

[opt] I don't believe this extra check is necessary if delay == 0. We already interrupt the VM here, when the context times out / is canceled.

Calvinnix (Author):

Hmm, this was added in response to this test failing. I still have test failures, though, so I need to investigate what is happening.

arahmanan:

Sorry, I'm not sure I follow. Did you figure out what caused the test to fail? I still think we can get rid of this extra select statement. Correct me if I'm wrong.

Calvinnix (Author):

Ok, tests are finally passing with these changes; the test-api-other and test-vm-goja tasks do not finish without the extra context check (Evergreen).

This check also exists in the WaitN function, so we're just pulling that over here.
[screenshot of the context check inside WaitN]

Calvinnix (Author):

Let me know if you'd like to chat about this.

Calvinnix (Author), Feb 28, 2024:

Specifically, the TestFunctionExecTimeLimit test is the one that never finishes.

arahmanan:

I see. [opt] The wait function has the select statement before it performs the reserve/wait operations. Should we do the same here? i.e. have this select statement right after we define ctx.

Calvinnix (Author):

Yeah I think that's a good idea to stay consistent.
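Putting this thread together, the shape being converged on would look roughly like the following in the run loop. This is a sketch, not the exact merged diff: the if ctx == nil guard is inferred from the hunk quoted above, util.Sleep is the context-aware sleep helper referenced earlier in the review, and limiterWaitTotalTime is assumed to already be a time.Duration per the nit above.

	if ctx == nil {
		ctx = context.Background()
	}

	// Bail out immediately if the context is already done; this mirrors
	// the check WaitN performed before it was replaced by an explicit
	// reservation.
	select {
	case <-ctx.Done():
		panic(ctx.Err())
	default:
	}

	now := time.Now()
	r := vm.r.limiter.ReserveN(now, vm.r.limiterTicksLeft)
	if !r.OK() {
		panic("failed to make reservation")
	}

	// Only count (and sleep) when the reservation actually forces a wait.
	if delay := r.DelayFrom(now); delay > 0 {
		vm.r.limiterWaitCount++
		vm.r.limiterWaitTotalTime += delay

		if err := util.Sleep(ctx, delay); err != nil {
			r.Cancel()
			panic(err)
		}
	}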

vm.go (outdated)
	now := time.Now()
	r := vm.r.limiter.ReserveN(now, vm.r.limiterTicksLeft)
	if !r.OK() {
		panic("failed to make reservation")

arahmanan:

Suggested change:
-	panic("failed to make reservation")
+	panic(context.DeadlineExceeded)

This will keep the same behavior as the if strings.Contains(waitErr.Error(), "would exceed") { check. That being said, I don't believe this can ever happen at the moment.
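As a side note on when !r.OK() can fire at all: with the standard golang.org/x/time/rate semantics, a reservation only comes back not-OK when the requested n exceeds the limiter's burst (for a finite rate), which is presumably why this branch is unreachable today. A tiny illustration, assuming the fork's limiter matches that behavior:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	lim := rate.NewLimiter(rate.Limit(10_000_000), 250_000)

	// A request within the burst always gets a reservation
	// (possibly with a delay attached).
	fmt.Println(lim.ReserveN(time.Now(), 250_000).OK()) // true

	// A request larger than the burst can never be satisfied at once.
	fmt.Println(lim.ReserveN(time.Now(), 300_000).OK()) // false
}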

arahmanan left a comment:

LGTM! Just an optional comment.


@Calvinnix merged commit 06a8a67 into mongodb-forks:realm on Mar 11, 2024.
2 of 6 checks passed.