
Investigation: Are we using the right statistics to show improvement in our benchmarks? #688

mdboom opened this issue Jun 24, 2024 · 4 comments

mdboom (Contributor) commented Jun 24, 2024

Based on a conversation I had with @brandtbucher, I feel it's time to reinvestigate the various methods we use to arrive at an overall improvement number for our benchmarks. To summarize, we currently provide the following (each sketched in code below the list):

  • The geometric mean of the means of each of the benchmarks (which reduces the distribution of each benchmark to a single mean before combining them). EDIT: This is the "classic" geometric mean number from pyperf.
  • The HPT method, for which we show the reliability and the improvement at the 99th percentile
  • The "overall mean", which combines the distributions of all of the benchmarks together and computes an overall mean over that (this is currently only exposed on the plots)
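
As a rough point of reference, here is a minimal sketch (not the actual reporting pipeline) of how those three numbers could be computed from raw timings. The sample data, the pooling strategy for the "overall mean", and the percentile stand-in for HPT are all assumptions on my part; in particular, HPT proper is a hierarchical statistical procedure, and the plain quantile below is only a rough approximation of "improvement at the 99th percentile".

```python
import numpy as np

# Hypothetical raw timings (seconds): benchmark name -> samples.
base = {"nbody": [1.20, 1.22, 1.19], "json_loads": [0.80, 0.81, 0.79]}
head = {"nbody": [1.00, 1.01, 0.99], "json_loads": [0.78, 0.80, 0.79]}

# 1. "Classic" geometric mean (the pyperf-style number): reduce each
#    benchmark to a single mean, then take the geometric mean of the
#    per-benchmark speedups.
speedups = np.array([np.mean(base[b]) / np.mean(head[b]) for b in base])
geometric_mean = np.exp(np.log(speedups).mean())

# 2. "Overall mean": pool samples from every benchmark into one
#    distribution and take a single mean over it.  (One plausible way to
#    pool: normalize each head sample by the base mean for that benchmark.)
pooled = np.concatenate([np.mean(base[b]) / np.array(head[b]) for b in base])
overall_mean = pooled.mean()

# 3. HPT-like percentile: a plain low quantile over the per-benchmark
#    speedups, used here only as a rough stand-in for the real HPT test.
hpt99_stand_in = np.quantile(speedups, 0.01)

print(f"GM: {geometric_mean:.3f}  overall mean: {overall_mean:.3f}  "
      f"HPT-99-ish: {hpt99_stand_in:.3f}")
```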

There are a few puzzling things:

  • We've had a number of obvious incremental improvements in recent months, but these don't seem to be "stacking" over time, at least in the HPT or geometric mean numbers
  • Some single-PR changes might show a massive improvement in a handful of benchmarks and only minimal regressions in a few others, which seems like a strong signal to accept the change, yet the HPT number will say 1.00x improvement. This may not be a "bug", but it perhaps shows the limitations of HPT as a way to give a thumbs up/down on a specific change.

We are now in a good position, with a lot of data collected over a long period. I should play with the different statistical methods we have to see which are truly the most valuable for the following goals (which may require different solutions):

a) understand if a change is helpful
b) show how far we've come
c) reduce measurement noise

brandtbucher (Member) commented

I think that a useful invariant would be that if we have two independent changes with "headline" numbers A and B vs some common base, then landing both changes should result in a new headline number that's equal to A * B, regardless of the "shape" of the results for each. I have a nagging worry that we might have statistical situations where two "one percent" improvements could combine to a one percent (or even a zero percent) improvement.

mdboom (Contributor, Author) commented Jun 25, 2024

I think that a useful invariant would be that if we have two independent changes with "headline" numbers A and B vs some common base, then landing both changes should result in a new headline number that's equal to A * B, regardless of the "shape" of the results for each.

I agree with what you are saying, but in practice it does seem like one change could hide inside another, e.g. both changes create better cache locality, and when you put them together you don't get that win "twice". I think a looser invariant is that if A > 1 and B > 1, then the combined change should also be > 1, and I'm not even sure we are currently meeting that basic invariant.
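
As a toy illustration of both invariants (hypothetical numbers, and the idealized assumption that combining two independent changes multiplies their per-benchmark speedups): under that assumption the geometric mean composes exactly, while a conservative low-percentile summary generally does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-benchmark speedups for two independent changes A and B,
# each measured against the same base C (centered around ~1% improvement).
a = rng.lognormal(mean=0.01, sigma=0.05, size=50)
b = rng.lognormal(mean=0.01, sigma=0.05, size=50)
combined = a * b  # idealization: independent effects multiply per benchmark

def gm(x):
    """Geometric mean."""
    return np.exp(np.log(x).mean())

def p01(x):
    """A conservative 1st-percentile summary (not the real HPT)."""
    return np.quantile(x, 0.01)

# The geometric mean composes: headline(A and B) == headline(A) * headline(B).
print(gm(combined), gm(a) * gm(b))   # equal up to floating-point error

# A low-percentile summary does not compose the same way in general.
print(p01(combined), p01(a) * p01(b))
```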

mdboom (Contributor, Author) commented Jun 25, 2024

I created longitudinal plots that show 4 different aggregation methods together:

  • HPT 99: HPT at the 99th percentile
  • HPT 90: HPT at the 90th percentile
  • GM: Geometric mean (pyperf method)
  • Overall mean: The mean of all of the benchmark difference samples

Here is that comparison for 3.14.x with the JIT against 3.13.0b3 (main Linux machine only):

[figure: evaluate-314 -- HPT 99, HPT 90, GM, and overall mean over time, 3.14.x with JIT vs. 3.13.0b3]

And here is the same comparison for the classic 3.11.x against 3.10.4 (main Linux machine only):

[figure: evaluate-311 -- the same four methods over time, 3.11.x vs. 3.10.4]

It's nice to see that they are all more-or-less parallel with some offset, and while you can see HPT reducing variation (as designed), the other alternatives aren't uselessly noisy either. It's tempting to use the "overall mean" because it's the most favourable, but that feels like cherry-picking.

We don't quite have all the data to measure Brandt's suggestion. However, we can test the following for each of the methods: for two adjacent commits A and B and a common base C, if B:A > 1, then B:C must be > A:C. The only method where this doesn't hold true is the overall mean method.
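
Here is a sketch of that consistency check with made-up numbers, covering only the two non-HPT methods. It also uses a simplified "overall mean" (an arithmetic mean of per-benchmark speedups rather than a mean over pooled samples), so it only illustrates the shape of the check, not our exact computation.

```python
import numpy as np

def gm(speedups):
    return np.exp(np.log(speedups).mean())

def arithmetic_mean(speedups):
    # Simplified stand-in for the "overall mean" (the real one pools samples).
    return speedups.mean()

METHODS = {"GM": gm, "overall mean (simplified)": arithmetic_mean}

def check(a_vs_c, b_vs_c):
    """Report methods where f(B:A) > 1 but f(B:C) <= f(A:C)."""
    b_vs_a = b_vs_c / a_vs_c  # per-benchmark B:A, given the common base C
    return {
        name for name, f in METHODS.items()
        if f(b_vs_a) > 1 and f(b_vs_c) <= f(a_vs_c)
    }

# Toy per-benchmark speedups against the common base C.
a_vs_c = np.array([1.0, 10.0])
b_vs_c = np.array([2.0, 8.0])
print(check(a_vs_c, b_vs_c))  # {'overall mean (simplified)'}
```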

Lastly, I experimented with bringing the same nuance we have in the benchmarking plots to the longitudinal ones -- it's possible to show violin plots for each of the entries like this (again 3.11.x vs. 3.10.4):

[figure: violins-311 -- violin plots per entry, 3.11.x vs. 3.10.4]

(Imagine the x axis is dates -- it's a little tricky to make that work...)

This plot is interesting because it clearly shows where the "mean" improvement is, but also that there are a significant number of specific use cases where you can do much better than that -- I do sort of find it helpful to see that.
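
For what it's worth, one way to get the violins onto a real date axis is to convert the dates to Matplotlib date numbers and pass them as positions. This is only a sketch with made-up data; `samples_by_date` is an invented name for a mapping of commit dates to speedup samples.

```python
import datetime

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: commit date -> flattened array of speedup samples.
rng = np.random.default_rng(0)
samples_by_date = {
    datetime.date(2022, 3, 1) + datetime.timedelta(days=30 * i):
        rng.lognormal(mean=0.02 * i, sigma=0.15, size=400)
    for i in range(8)
}

dates = sorted(samples_by_date)
positions = mdates.date2num(dates)          # violin positions in date units

fig, ax = plt.subplots()
ax.violinplot(
    [samples_by_date[d] for d in dates],
    positions=positions,
    widths=20,                              # width measured in days
    showmedians=True,
)
ax.xaxis_date()                             # interpret x values as dates
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))
ax.axhline(1.0, color="gray", linestyle=":")
ax.set_ylabel("Speedup vs. 3.10.4")
fig.autofmt_xdate()
plt.show()
```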

Anyway, there are still more things to look at here -- just wanted to provide a braindump and get some feedback in the meantime.

mdboom (Contributor, Author) commented Jun 25, 2024

A ridgeline plot seems kind of useful for visualizing the improvement of Python 3.11 over 3.10:

[figure: violins-311 -- ridgeline plot of speedups, 3.11 vs. 3.10]
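
A minimal way to build such a ridgeline (a sketch assuming SciPy and Matplotlib, with invented benchmark names and data): one KDE row per speedup distribution, stacked with vertical offsets.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical speedup samples (3.11 vs. 3.10), one distribution per row.
rng = np.random.default_rng(1)
rows = {
    "nbody": rng.normal(1.6, 0.05, 200),
    "regex_dna": rng.normal(1.3, 0.08, 200),
    "json_loads": rng.normal(1.1, 0.03, 200),
}

xs = np.linspace(0.8, 2.0, 400)
fig, ax = plt.subplots()
for offset, (name, samples) in enumerate(rows.items()):
    density = gaussian_kde(samples)(xs)
    density = density / density.max() * 0.8   # normalize each row's height
    ax.fill_between(xs, offset, offset + density, alpha=0.7)
    ax.text(xs[0], offset + 0.1, name)
ax.axvline(1.0, color="gray", linestyle=":")
ax.set_xlabel("Speedup (3.11 vs. 3.10)")
ax.set_yticks([])
plt.show()
```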
