Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test approach for detecting issues #93

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

storozhukBM
Copy link
Contributor

The problem in the previous approach was that mock time literally didn't run between limite.Take() calls, so now.Sub(oldState.last) would always be 0 that can be problematic for some issues detection.

This PR adds time increments between limite.Take() calls in (r *runnerImpl) startTaking. Unfortunately our library for clock mocking github.com/benbjohnson/clock has some races reported here benbjohnson/clock#44 and here benbjohnson/clock#42.

So I decided to use the time mocking approach that Ian Lance Taylor used 19 days ago in https://github.com/golang/time repository here golang/time@579cf78

Note: this tests approach detects an error in int64 based limiter implementation that the previous test didn't manage to catch.

@storozhukBM storozhukBM marked this pull request as ready for review June 27, 2022 20:38
@codecov
Copy link

codecov bot commented Jun 27, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (b62b799) to head (31dcb6c).
Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #93   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            4         4           
  Lines           95        97    +2     
=========================================
+ Hits            95        97    +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

"github.com/stretchr/testify/assert"
)

// testTime is a fake time used for testing.
type testTime struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is copied over from golang/time@579cf78, could we:
(1) put this is a separate file
(2) attribute the source clearly
(3) call out differences in the implementation? For example, I see that advanceUnlocked is different than the reference implementation, but I cannot easily tell why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@rabbbit
Copy link
Contributor

rabbbit commented Jun 27, 2022

Thanks!

This looks complicated, it'll take me some time to grok this (sorry) - I would expect maybe some time over the weekend.

It's interesting that benbjohnson/clock#42 was fixed by @abhinav, looks like the clock bit us a few times.

Note: this tests approach detects an error in int64 based limiter implementation that the previous test didn't manage to catch.

@storozhukBM - would you like to partially revert #91? We could add your new implementation to the repo, but not make it the default (yet). This way we could track the (1) clock change (2) bug fix in the same PR eventually. I can also do this, but I thought you might want the addition to be attributed to you in git log :)

@storozhukBM
Copy link
Contributor Author

storozhukBM commented Jun 28, 2022

would you like to partially revert #91? We could add your new implementation to the repo, but not make it the default (yet). This way we could track the (1) clock change (2) bug fix in the same PR eventually. I can also do this, but I thought you might want the addition to be attributed to you in git log :)

2 Questions:
Should I do the revert in a separate PR or this one?
Should I fix the bug in the int64 based implementation right away, since I already have a fix?

@rabbbit
Copy link
Contributor

rabbbit commented Jun 29, 2022 via email

@storozhukBM
Copy link
Contributor Author

@rabbbit

Let's do the following thing:

  1. I'll create a PR with an almost complete revert of Revert "New atomic-based implementation squeezed into int64" #91, except making int64-based rate limiter a default implementation.
  2. After its merge, I'll rebase this branch with a new master, so this branch will start to fail.
  3. Then, I'll make a second PR that fixes a known bug.
  4. After the fix PR gets merged, I'll rebase this branch again, and tests will show that the bug is fixed.

@storozhukBM
Copy link
Contributor Author

@rabbbit
First step PR #94

rabbbit pushed a commit that referenced this pull request Jul 2, 2022
This limiter was introduced and merged in the following PR #85
Later @twelsh-aw found an issue with this implementation #90
So @rabbbit reverted this change in #91

Our tests did not detect this issue, so we have a separate PR #93 that enhances our tests approach to detect potential errors better.
With this PR, we want to restore the int64-based atomic rate limiter implementation as a non-default rate limiter and then check that #93 will detect the bug.
Right after it, we'll open a subsequent PR to fix this bug.
@rabbbit
Copy link
Contributor

rabbbit commented Jul 2, 2022 via email

storozhukBM added a commit to storozhukBM/ratelimit that referenced this pull request Jul 2, 2022
This limiter was introduced and merged in the following PR uber-go#85
Later @twelsh-aw found an issue with this implementation uber-go#90
So @rabbbit reverted this change in uber-go#91

Our tests did not detect this issue, so we have a separate PR uber-go#93 that enhances our tests approach to detect potential errors better.
With this PR, we want to restore the int64-based atomic rate limiter implementation as a non-default rate limiter and then check that uber-go#93 will detect the bug.
Right after it, we'll open a subsequent PR to fix this bug.
@storozhukBM storozhukBM force-pushed the fix_tests_approach_for_issue_detection branch from 9d5dd18 to 1053ca8 Compare July 2, 2022 19:40
@storozhukBM
Copy link
Contributor Author

@rabbbit

Now we can see that new test approach detects issues with atomic_int64 implementation.

rabbbit added a commit that referenced this pull request Jul 10, 2022
In #93 @storozhukBM adds test parallization, which is unrelated
to the actual change. I'd rather have this as a separate PR.

The test currently take ~12 seconds on my laptop, so I see why we might
want to run them in parallel. There are other parallelization
opportunities (we could paralelize each of the subtests for each
of the rate-limiters), but this is a good start.

Before:
```
> go test ./... -race -count 1
ok      go.uber.org/ratelimit   12.739s
```

After:
```
> go test ./... -race -count 1
ok      go.uber.org/ratelimit   6.178s
```
@rabbbit rabbbit mentioned this pull request Jul 10, 2022
ratelimit_test.go Outdated Show resolved Hide resolved
rabbbit added a commit that referenced this pull request Jul 10, 2022
In #93 @storozhukBM adds test parallization, which is unrelated
to the actual change. I'd rather have this as a separate PR.

The test currently take ~12 seconds on my laptop, so I see why we might
want to run them in parallel. There are other parallelization
opportunities (we could paralelize each of the subtests for each
of the rate-limiters), but this is a good start.

Before:
```
> go test ./... -race -count 1
ok      go.uber.org/ratelimit   12.739s
```

After:
```
> go test ./... -race -count 1
ok      go.uber.org/ratelimit   6.178s
```
https://github.com/golang/time/commit/579cf78fd858857c0d766e0d63eb2b0ccf29f436

Modified parts:
- advanceUnlocked method uses in-place filtering of timers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the goal of in-place filtering just performance of the fake-clock?

I don't think this feels very important (?) - I'd rather have simpler code that's slightly slower?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, this approach looks simpler as well as more performant, but check the original and give me your opinion, if you think we should keep the original, I don't mind.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just that the whole i, j looks like we would need tests for tests (off by one errors).

I think I would rather pull the other version as is, and then push a separate PR to fix the copying with clear changes. If we worried about the performance though, wouldn't we want to sort once on insert, rather than advance?

But:

  • see my other comment below
  • I understand if fixing the time is not your primary focus for now :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I can go back to the prev version.

ratelimit_mock_time_test.go Outdated Show resolved Hide resolved
ratelimit_test.go Show resolved Hide resolved
Comment on lines 93 to 95
for {
move()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I think this needs a much larger explanation - perhaps as comments in code?

Am I reading this right that this is an infinite loop, trying to move to infinitely move to the next timer ... by recovering from a panic?

Irrespective of goroutine leaks, using panics for signaling - do you think the maxDuration is not useful here? It seems we should be able to just pass the duration into advance, theoretically?

Copy link
Contributor Author

@storozhukBM storozhukBM Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Irrespective of goroutine leaks

I think 1 goroutine leak per test case is not a big deal, but if you want, I can add some exit logic here.

using panics for signaling

Code from Ian just panics when you try to run the next timer, and there is nothing to run. We can potentially change that behavior to not handle panics, but I wanted to limit changes in that code.

Sorry, but I think this needs a much larger explanation - perhaps as comments in code?

I'll write some description here first and when we clarify all questions about it, I'll move it into code comments:

testTime internally has a queue of all timers currently registered within the clock. In our case, there will be timers for goroutines that try to consume permissions and timers for goroutines that should run test asserts.
When we call testTime.advanceToTimer() it basically shifts time to the next registered timer, runs it, and also runs runtime.Gosched() multiple times to give a chance for other goroutines to run and react to what happened in the current timer. Usually, when we run runtime.Gosched() other goroutines will observe the new state, make their moves, and potentially create new timers inside this clock to wait for some
other state change.

If testTime.advanceToTimer() panics, it means that there are no timers to run, and it can happen if the test is already finished or it didn't manage to set up the full test case, so we just sleep a bit to allow and try again to run timers.

Do you think the maxDuration is not useful here?

No, we can use that if we want to get out of the currently infinite loop, we can check on every iteration if testTime is already passed maxDuration.

It seems we should be able to just pass the duration into advance, theoretically?

Yes, but since we advance time in different goroutines in this approach, it is necessary for us to do it one step at a time and give other goroutines the opportunity to make their steps as well. There also can be situations where we start to advance time in this goroutine before test goroutines spawned all permission consumers and asserting goroutines, so we need to just wait for everything to setup, that is what we do when panic happens.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So perhaps naively, but do we need both

  • advanceToTimer
  • advance(Unlocked?)

Our test runner knows exactly how much time needs to move (maxDuration). We can just pass it into advance. advanceUnlocked already does all the runtime.Gosched needed.

And soit seem to me that advance should theoretically be enough?
If the problem is the initial goroutine setup, should we just add the 16*runtime.Gosched at the beginning of advance to let the scheduler kick in?
This would also help with r.clock.advance(time.Nanosecond) you have below.

I understand this kinda modifies the code we copy-paste, but I'd feel much more comfortable with we decreased the API by 50% here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rabbbit

Right now, it is hard for me to make changes because tests do not pass until we merge #95

I agree with the comments and changes that you propose right now, but they don't change the approach in general.
I propose to merge #95, and then I can apply your suggested changes, or we can merge this PR as it is right now, and I'll make these improvements in subsequent PRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged #95. I might still want to re-review it later on, it feels like some simplification is possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update this branch with new version and this should potentially fix all tests, event though I'm not sure how your new test for initial calls will work 🤞

storozhukBM pushed a commit to storozhukBM/ratelimit that referenced this pull request Jul 13, 2022
In uber-go#93 @storozhukBM adds test parallization, which is unrelated
to the actual change. I'd rather have this as a separate PR.

The test currently take ~12 seconds on my laptop, so I see why we might
want to run them in parallel. There are other parallelization
opportunities (we could paralelize each of the subtests for each
of the rate-limiters), but this is a good start.

Before:
```
> go test ./... -race -count 1
ok      go.uber.org/ratelimit   12.739s
```

After:
```
> go test ./... -race -count 1
ok      go.uber.org/ratelimit   6.178s
```
rabbbit pushed a commit that referenced this pull request Jul 13, 2022
This PR fixes the issue found by @twelsh-aw with int64 based implementation #90

Our tests did not detect this issue, so we have a separate PR #93 that enhances our tests approach to detect potential errors better.
@storozhukBM
Copy link
Contributor Author

storozhukBM commented Jul 13, 2022

@rabbbit
After merging with the recent main branch, all tests pass, except TestInitial for mutex rate limiter for some reason. I'll investigate what happens with it when I have time in the next few days.

@rabbbit
Copy link
Contributor

rabbbit commented Jul 13, 2022 via email

@storozhukBM
Copy link
Contributor Author

@rabbbit

It's possible that mutex behaviour was broken, but we didn't detect it
because of the broken clock :|

No, the use of advanceToTimer approach removes the issue. I think just advance(time.Duration) is not sufficient with this testTime implementation. I'll figure it out.

@storozhukBM
Copy link
Contributor Author

@rabbbit
I think I addressed all your comments, so please take another look

Copy link
Contributor

@rabbbit rabbbit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost SGTM:

  • I don't think we need the GetName (?)
  • I don't like the amount of time.Sleep but perhaps it's unavoidable

I've been wondering about some simplified fake clock library - given the amount of issues we've seen it seems it must be possible to make it simpler -__-

type testRunner interface {
getName() string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this?

The name is included in the test-name automatically via tt.run (?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

"github.com/stretchr/testify/assert"
)

const advanceBackoffDuration = 5 * time.Millisecond
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why we need this? Is the gosched not enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can notice that gosched calls happen under lock. It means that all goroutines that would like to get the current time from this clock or register some additional timer won't be able to do so, even if we run these gosched calls. In our case, it means that most "startTaking" goroutines will never work while we step through timers, but they need the most time resource in our tests run.

So we actually need both types of pauses between runs. One is using gosched under lock for all goroutines that were waiting for the timer to click and are not blocked on mutex of this clock.
The other one is for all goroutines that are blocked on mutex of this clock and need to make some progress between timer advances.

@@ -102,6 +115,7 @@ func (r *runnerImpl) startTaking(rls ...Limiter) {
for _, rl := range rls {
rl.Take()
}
r.clock.advance(time.Nanosecond, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we need this one as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this one imitates time progressions between rl.Take() calls. With a real clock, when you call rl.Take() do something and call rl.Take() again time.Now() will move forward in time, so we want to imitate that.

@storozhukBM
Copy link
Contributor Author

I've been wondering about some simplified fake clock library - given the amount of issues we've seen it seems it must be possible to make it simpler -__-

If you take a look at the libraries we have used in the past, like https://github.com/benbjohnson/clock/ it is MUCH more complex, has concurrency bugs, and `time. Sleep's as well here

I researched every library that I was able to find on Github, and this approach from golang/time is the most straightforward I've seen.

One more consideration that we should take into account is that we can find a lib or implementation that will pass all our tests but is going to detect the real bug. After we change the time library, we should use mutation testing at least manually introduce some bugs into the existing rate limiter implementations that we have and make sure that our tests with the new clock library detect the issue.

@storozhukBM
Copy link
Contributor Author

Hey @rabbbit, what do you think? Should we move forward with this PR?

@rabbbit
Copy link
Contributor

rabbbit commented Oct 31, 2022

We spoke about this briefly in #101 and #90 - I'm not sure about merging this:

  • without in an depth review
  • without a plan for this library

For now I'll put this as "on-hold". This means that the library is effectively in code-freeze, as our test clock implementation is buggy.

@storozhukBM
Copy link
Contributor Author

@rabbbit can we find someone form Uber to make a depth review?

rabbbit added a commit that referenced this pull request Jul 9, 2023
There were quite a few fixes in that repo, let's pull them in.

We've had a few mock-clock bugs before (#90), I'm hoping that with the
new versions #93 won't be necessary. I'll try reverting #95 on a branch
temporarily to see if it would have been caught.
rabbbit added a commit that referenced this pull request Mar 4, 2024
There were quite a few fixes in that repo, let's pull them in.

We've had a few mock-clock bugs before (#90), I'm hoping that with the
new versions #93 won't be necessary. I'll try reverting #95 on a branch
temporarily to see if it would have been caught.
rabbbit added a commit that referenced this pull request May 1, 2024
There were quite a few fixes in that repo, let's pull them in.

We've had a few mock-clock bugs before (#90), I'm hoping that with the
new versions #93 won't be necessary. I'll try reverting #95 on a branch
temporarily to see if it would have been caught.
@storozhukBM
Copy link
Contributor Author

We maybe can get something in standard library for this golang/go#67434

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants