Skip values (moreso, RandomRuns) we have already tested #207

Draft · wants to merge 8 commits into base: master
Conversation

@Janiczek (Collaborator) commented Oct 13, 2022:

WIP

Closes #11 without needing a major version bump, since we don't add a new Failure variant.

Allows FuzzOptions.runs * 2 skips of values already seen.
This, ironically, speeds up our fuzzer distribution machinery as well.

Exhaustive checking (#188) would help us save even more unneeded work (we'd know we can stop skipping) -- and I want to do that -- but I'm not trying that in this PR. The skipping works reasonably well even without it.


Should be working already! Now just measuring whether this has any noticeable impact, with tests/randomized-tests.sh.
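To make the skipping idea concrete, here is a rough sketch of the loop in Python. This is purely illustrative -- the actual implementation is in Elm, and the names `fuzz_loop`, `generate`, `run_test`, and `skip_multiplier` are all made up for this sketch:

```python
def fuzz_loop(generate, run_test, runs, skip_multiplier=2):
    """Run `runs` test executions, skipping inputs we've already tested.

    `generate` returns a (random_run, value) pair. At most
    runs * skip_multiplier duplicates are skipped; once the skip
    budget is exhausted, duplicates are run anyway (mirroring the
    FuzzOptions.runs * 2 budget described above).
    """
    seen = set()  # RandomRuns we've already run the test function on
    skips_allowed = runs * skip_multiplier
    runs_elapsed = 0
    runs_skipped = 0
    while runs_elapsed < runs:
        random_run, value = generate()
        key = tuple(random_run)
        if key in seen and runs_skipped < skips_allowed:
            runs_skipped += 1  # skip only the test function call
            continue
        seen.add(key)
        run_test(value)
        runs_elapsed += 1
    return runs_elapsed, runs_skipped
```

Note that only the call to the test function is skipped; generation still happens for every iteration, which is why skipping trades a bit of generation overhead for more diverse inputs.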

@gampleman (Contributor) left a comment:

Do we have any concerns about memory use, particularly for large run count scenarios?

Comment on lines 22 to 28
gen : List Int -> Fuzzer a -> GenResult a
gen randomRun fuzzer =
generate (PRNG.hardcoded (RandomRun.fromList randomRun)) fuzzer


genR : Int -> Fuzzer a -> GenResult a
genR seed fuzzer =
@gampleman (Contributor):

I think these could use a more descriptive name 😛

@Janiczek (Collaborator, author):

Sorry, these shouldn't be in this PR 😅 I just wanted to check stuff in my REPL, and forgot about them when `git add .`-ing.

@Janiczek (Collaborator, author):

Hopefully they're now all gone!

@@ -282,7 +309,7 @@ allSufficientlyCovered c state normalizedDistributionCount =
True

AtLeast n ->
-            Test.Distribution.Internal.sufficientlyCovered state.runsElapsed count (n / 100)
+            Test.Distribution.Internal.sufficientlyCovered (state.runsElapsed + state.runsSkipped) count (n / 100)
@gampleman (Contributor):

Does it make sense to include the skipped runs in the distribution coverage? I'm not saying no, but I think it's a question with a non-obvious answer...

@Janiczek (Collaborator, author):

I think the answer should be yes: we already know what the fuzzed value was when skipping, so we can count it into the fuzzer distribution. What we skipped was running the test function.

@gampleman (Contributor):

But is it valuable information? Imagine that I want to classify even/odd, and my generator generates the following:

0, 1, 2, 4, 1, 6, 1, 8, 1, 10, 1, 12, 1

Then I'll see that I had a 50%/50% distribution of even/odd, but does it give me any more confidence about what I'm testing?
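For what it's worth, the counts in that example sequence can be checked directly (plain Python, just illustrating the point being made):

```python
# The generated sequence from the even/odd example above.
values = [0, 1, 2, 4, 1, 6, 1, 8, 1, 10, 1, 12, 1]

# Counting every generated value: looks roughly balanced.
even = sum(1 for v in values if v % 2 == 0)  # 7
odd = len(values) - even                     # 6

# Counting unique values only: heavily skewed towards even.
unique = set(values)
even_unique = sum(1 for v in unique if v % 2 == 0)  # 7
odd_unique = len(unique) - even_unique              # 1
```

So the all-values view reports roughly even/odd parity, while the unique-values view reveals a 7:1 skew -- which is the crux of the disagreement below.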

@Janiczek (Collaborator, author):

I believe fuzzer distribution is about classifying what the fuzzer gives you, regardless of the test function (`... -> Expectation`). I.e. you could just as well run `Fuzz.examples 10000 myFuzzer` in the REPL, do some statistics on that, and get the same result.

Test skipping only deals with skipping the test function, not skipping the fuzzing. This PR shouldn't change the behaviour of the fuzzer distribution reporting.

I believe I understand your issue, but it seems to me it's unrelated to this PR.

@gampleman (Contributor) commented Oct 14, 2022:

> it seems to me it's unrelated to this PR

That's fair enough. I think the main relation here is that the implementation here could implement both concerns by simultaneously being simpler. However, if you don't feel like wading into those waters in this PR, that's fine.

@Janiczek (Collaborator, author) commented Oct 14, 2022:

> I think the main relation here is that the implementation here could implement both concerns by simultaneously being simpler.

I think you're suggesting we remove skipped values from the distribution counts. But wouldn't that invalidate the claim that the distribution counts deal with what the fuzzer returns (again, thinking about Fuzz.examples 10000 myFuzzer)? That would change it into "unique values the fuzzer returns". I don't know if that's helpful.

It would basically change

Distribution report:
====================
  2-19:         90.1%  (27007x)  ███████████████████████████░░░
  1:               5%   (1485x)  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  20:            4.9%   (1468x)  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  <1, >20:         0%      (0x)  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

into

Distribution report:
====================
  2-19:           90%  (18x)  ███████████████████████████░░░
  1:               5%   (1x)  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  20:              5%   (1x)  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  <1, >20:         0%   (0x)  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

That to me is a different feature 🤷

@gampleman (Contributor):

Unique values is what we care about though isn't it? So distribution of unique values is the valuable thing that we should be highlighting to developers.

Also, I disagree that distributions are just about the Fuzzer -- they are about giving you confidence that your fuzz test is adequately sampling the problem space and not just focusing on some particular corner of it.


Additionally there could be some nice highlighting of overproduction of duplicates:

Distribution report:
====================
  2-19:           90%  (18x)  ███████████████████████████░░░
  1:               5%   (1x)  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ⚠️ Fuzzer produces 1480 duplicated values
  20:              5%   (1x)  █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
  <1, >20:         0%   (0x)  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

or some such, but that's perhaps just overcomplicating things for little benefit.

@Janiczek (Collaborator, author):

> Unique values is what we care about though isn't it? So distribution of unique values is the valuable thing that we should be highlighting to developers.

That's where our understanding differs currently -- I'll think about it and give the idea a chance! Perhaps you're right.

}
, failure = Just failure
}

Nothing ->
if state.runsElapsed < c.runsNeeded then
if state.runsElapsed < c.runsNeeded && state.runsSkipped < c.skipsAllowed then
@gampleman (Contributor):

Shouldn't there be some sort of distinct failure when the skipsAllowed is exceeded? That basically means your Fuzzer is kind of rubbish, no?

@Janiczek (Collaborator, author):

skipsAllowed exceeded could mean you've exhausted all possible values (or at least, finding new values is improbable). Consider Fuzz.bool or Fuzz.intRange 1 10: over 100 runs those could skip 98 times and 90 times respectively (depending on your settings).
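Those numbers are easy to sanity-check with a quick simulation (plain Python; `count_skips` is a made-up helper for illustration, not part of the library):

```python
import random

def count_skips(domain_size, runs, rng):
    """How many of `runs` uniform draws repeat an already-seen value."""
    seen = set()
    skips = 0
    for _ in range(runs):
        v = rng.randrange(domain_size)
        if v in seen:
            skips += 1
        else:
            seen.add(v)
    return skips

# A Fuzz.bool-like fuzzer: only 2 possible values, so ~98 of 100 runs repeat.
bool_skips = count_skips(2, 100, random.Random(0))

# A Fuzz.intRange 1 10-like fuzzer: 10 possible values, so ~90 of 100 repeat.
int_skips = count_skips(10, 100, random.Random(0))
```

Since there are at most `domain_size` unique values, the repeat count is always at least `runs - domain_size`, which is exactly why a tiny domain exhausting its skip budget is expected rather than a sign of a broken fuzzer.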

@Janiczek (Collaborator, author) commented Oct 14, 2022:

For this reason I don't think we should fail. If anything, we could stop early, but let's do that when implementing exhaustiveness. This way, skipping basically only lets you fuzz more diverse inputs and avoids wasting time on things you already know the answer to.

@gampleman (Contributor):

I guess we can punt on this question until we implement exhaustiveness. Because in those cases the test runner should just switch to exhaustive checking, so this kind of failure should be reserved for more pathological fuzzers:

Fuzz.floatRange 0 1
    |> Fuzz.andThen
        (\p ->
            if p < 0.9 then
                Fuzz.constant 0

            else
                Fuzz.intRange 0 100
        )

which would just be kind of bad, but not really exhaustive...

@@ -75,25 +75,19 @@ fromExpectation labels expectation summary =
expectation
|> Test.Runner.getDistributionReport
|> Runner.String.Distribution.report labels

summaryWithDistribution : Summary
@Janiczek (Collaborator, author):

Changes in this module improve the visuals of how we present the fuzzer distribution tables in our homebrew test runner. (This is not exposed to users in any way.)

@Janiczek (Collaborator, author):

> Do we have any concerns about memory use, particularly for large run count scenarios?

I can try to measure that.

For now, here are runtime counts. I've tried this over multiple repos and the result is roughly the same everywhere: skipping doesn't give any performance benefit (it adds overhead, if anything); all it does is give you more diverse inputs.

[Screenshots: runtime measurement tables, 2022-10-14]

@gampleman (Contributor):

What number of runs are those charts for?

@Janiczek (Collaborator, author) commented Oct 14, 2022:

@gampleman 50 runs per configuration

@gampleman (Contributor):

I wonder if 50 is too low to:

a) manifest all that much duplication
b) exert significant memory pressure from potentially large sets of generated values

Could we also test with say 10,000 runs?

@Janiczek (Collaborator, author) commented Oct 14, 2022:

> Could we also test with say 10,000 runs?

Each run of the elm-test test suite (where various fuzzers differ; some have runs=100, some runs=10000) is about 10s on average. Getting 10k test-suite runs would take me 27 hours straight (unless somehow parallelized).
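The 27-hour figure follows directly from the averages just mentioned (plain arithmetic):

```python
# Back-of-the-envelope check of the 27-hour estimate above.
suite_runs = 10_000       # desired number of full test-suite runs
seconds_per_run = 10      # observed average per suite run
hours = suite_runs * seconds_per_run / 3600
# hours is roughly 27.8
```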

EDIT: sorry if I previously led you to believe the number 50 was for the runs=... configuration!
Basically, what the charts above show is: I run something akin to `for SEED in {1..50}; do time elm-test --seed=$SEED; done` and collect the numbers into a table. I do this for various configurations (master, this PR with multiplier 1, 2, 5, 10) to compare the effect on runtime across a large test suite.

@gampleman (Contributor):

Ah OK, that makes sense. I was a bit surprised you were doing fewer than the default 100...

@Janiczek (Collaborator, author):

Out of curiosity I'll try another set of 50 seeds × {master, skip01, skip02, skip05, skip10} with the --fuzz=10000 option active, to see whether perhaps there is some difference there. But I'd only expect performance savings where the overhead of generating and skipping extra values is lower than that of running the test on the would-be-skipped values -- this PR is not about short-circuiting 🙂

@Janiczek (Collaborator, author):

Tried the same thing with defaultRuns = 10000, as mentioned above.

[Screenshots: runtime measurement tables for the defaultRuns = 10000 experiment]

So e.g. in test-suite runs taking ~18s, the skip x2 approach added ~1.5s to the runtime.

@gampleman (Contributor):

Yeah, I think we would need some nice test for the F-metric to be able to also see the upside of a PR like this. FWIW I don't think that's too bad a perf degradation, and it ultimately helps with the mission of actually finding bugs, so 👍

@Janiczek (Collaborator, author):

> F-metric

I think I vaguely recall this being mentioned in our discussion of quasirandom numbers? I'm a stats noob -- could you please share some Wikipedia/paper links for its definition? My googling returns F-score (not sure that's what you're talking about) and F-metric as some kind of web-development term (that definitely isn't it) :)

@gampleman (Contributor) commented Oct 17, 2022:

It's a horrible name, since it's almost impossible to google for. I mentioned it in here:

> [F-metric is] defined as the number of test cases the test system needs to generate before a defect is uncovered

So I suspect we could build a benchmark of fuzz tests that fail in various circumstances, run them with a high number of runs, and report how many runs it took to find each failure condition. I suspect this PR would then show a potentially substantial improvement on such a benchmark.

You could also report not the number of runs but the time it took to find each bug. That would be a fairly sensible performance metric, since it's in some ways more meaningful than how long it takes to generate and run some arbitrary number of cases.
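That benchmark idea could be sketched like this (Python, purely illustrative; `f_metric` and the toy failing property are made up for this sketch):

```python
def f_metric(generate, property_fails, max_runs=10_000):
    """Number of generated cases before the first failure, or None
    if no failure is found within max_runs."""
    for n in range(1, max_runs + 1):
        if property_fails(generate()):
            return n
    return None

# Toy example: the fourth generated case is the first one that
# violates the property, so the F-metric here is 4.
cases = iter([3, 7, 12, 1000, 5])
n = f_metric(lambda: next(cases), lambda v: v % 1000 == 0)
# n == 4
```

A skipping fuzzer should tend to lower this number, since duplicated cases that cannot reveal a new defect no longer consume runs.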

Successfully merging this pull request may close these issues: "Hash fuzzed input to detect duplicates?"