Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: parametrize bench mark tests #4974

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

cm-iwata
Copy link
Contributor

@cm-iwata cm-iwata commented Dec 30, 2024

In the previous implementation, it was necessary to adjust the timeout value every time a benchmark test added.
By parametrizing the benchmark tests, the time required for each test becomes predictable, eliminating the need to adjust the timeout value

Changes

Parametrize the test by the list of criterion benchmarks.

By parametrizing the tests, git clone will executed for each parameter in here.

with TemporaryDirectory() as tmp_dir:
dir_a = git_clone(Path(tmp_dir) / a_revision, a_revision)
result_a = test_runner(dir_a, True)
if b_revision:
dir_b = git_clone(Path(tmp_dir) / b_revision, b_revision)
else:
# By default, pytest execution happens inside the `tests` subdirectory. Pass the repository root, as
# documented.
dir_b = Path.cwd().parent
result_b = test_runner(dir_b, False)
comparison = comparator(result_a, result_b)
return result_a, result_b, comparison

To run all parametrized tests with single git close would require major revisions to git_ab_test, so this PR does not address that issue.

Reason

close #4832

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

In the previous implementation, it was necessary to adjust the timeout
value every time a benchmark test added.
By parametrizing the benchmark tests, the time required for each test
becomes predictable, eliminating the need to adjust the timeout value

Signed-off-by: Tomoya Iwata <[email protected]>
@pb8o pb8o added the python Pull requests that update Python code label Jan 8, 2025
Copy link

codecov bot commented Jan 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.93%. Comparing base (dfb45dc) to head (3222c42).

Current head 3222c42 differs from pull request most recent head a10045b

Please upload reports for the commit a10045b to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4974      +/-   ##
==========================================
+ Coverage   83.10%   83.93%   +0.82%     
==========================================
  Files         245      248       +3     
  Lines       26723    27791    +1068     
==========================================
+ Hits        22209    23327    +1118     
+ Misses       4514     4464      -50     
Flag Coverage Δ
5.10-c5n.metal 84.51% <ø> (+0.94%) ⬆️
5.10-m5n.metal 84.49% <ø> (+0.94%) ⬆️
5.10-m6a.metal 83.78% <ø> (+1.02%) ⬆️
5.10-m6g.metal 80.61% <ø> (+1.04%) ⬆️
5.10-m6i.metal 84.50% <ø> (+0.95%) ⬆️
5.10-m7g.metal 80.61% <ø> (+1.04%) ⬆️
6.1-c5n.metal 84.51% <ø> (+0.94%) ⬆️
6.1-m5n.metal 84.50% <ø> (+0.95%) ⬆️
6.1-m6a.metal 83.78% <ø> (+1.02%) ⬆️
6.1-m6g.metal 80.61% <ø> (+1.04%) ⬆️
6.1-m6i.metal 84.49% <ø> (+0.94%) ⬆️
6.1-m7g.metal 80.61% <ø> (+1.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pb8o
pb8o previously approved these changes Jan 8, 2025
)

executables = []
for line in stdout.split("\n"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could be stdout.splitlines()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@roypat roypat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think like this we're no longer doing an A/B-test, we're just benchmarking the same binary compiled from the PR branch twice (e.g. comparing the PR results to themselves)

@pytest.mark.no_block_pr
@pytest.mark.timeout(900)
def test_no_regression_relative_to_target_branch():
@pytest.mark.timeout(600)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from the buildkite run, it seems like the longest duration of one of these is 150s for the queue benchmarks, so I think we can actually drop this timeout marker altogether and just rely on the default timeout specified in pytest.ini (which is 300s)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix in 215af23

Comment on lines 29 to 32
_, stdout, _ = cargo(
"bench",
f"--all --quiet --target {platform.machine()}-unknown-linux-musl --message-format json --no-run",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhh, I don't think this does what we want. We precompile the executables ones (from the PR branch), and then we use this precompiled executable for both A and B runs. What need to do though is compile each benchmark twice, ones from the main branch and once from the PR branch, so that this test does a meaningful comparison :/ That's why in #4832 I suggested to use --list-only or something: determine the names of the benchmarks here, and then compile them twice in _run_criterion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I think I misunderstood a bit how it works.

Let me confirm the modifications.
First, run cargo bench --all -- --list to generate parameters to pass to pytest.parametrize.
I will get the following output:

root@90de30508db0:/firecracker# cargo bench --all -- --list Finished `bench` profile [optimized] target(s) in 0.10s Running benches/block_request.rs (build/cargo_target/release/deps/block_request-2e4b90407b22a8d0) request_parse: benchmark Running benches/cpu_templates.rs (build/cargo_target/release/deps/cpu_templates-cd18fd51dbad16f4) Deserialization test - Template size (JSON string): [2380] bytes. Serialization test - Template size: [72] bytes. deserialize_cpu_template: benchmark serialize_cpu_template: benchmark Running benches/memory_access.rs (build/cargo_target/release/deps/memory_access-741f97a7c9c33391)
page_fault: benchmark

page_fault #2: benchmark

Running benches/queue.rs (build/cargo_target/release/deps/queue-b2dfffbab00c4157)
next_descriptor_16: benchmark

queue_pop_16: benchmark

queue_add_used_16: benchmark

queue_add_used_256: benchmark

And then, I will get benchmark name...for example queue_pop_16.

Finally run a command like cargo bench --all -- queue_pop_16 in _run_criterion.
Is this correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's pretty much it! The main point is that the compilation of the benchmarks needs to happen in _run_criterion, because we actually have to compile them twice, once for the pull request target, and once for the pull request head.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix in d42d39f

Since it would be very slow if git clone and build executable for each parameter, so I adjusted fixture so that git clone is executed only once.
033ca8f

use `splitlines()` instead of `split("\n")`.

Signed-off-by: Tomoya Iwata <[email protected]>
No longer need to set individual timeout values,
Because parameterized performance tests.

Signed-off-by: Tomoya Iwata <[email protected]>
In the previous implementation, same binary that built in the PR branch
execute twice,
which was not a correct A/B test. This has been fixed.

Signed-off-by: Tomoya Iwata <[email protected]>
In the previous implementation, git clone executed
for each parameter of the parametize test.
This has a large overhead, adjusted it so that
fixtures only called once per class.

Signed-off-by: Tomoya Iwata <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python Pull requests that update Python code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parametrize test_benchmarks.py test by criterion benchmarks
3 participants