
nvFuser Python Benchmarks

Priya Mishra edited this page Jan 16, 2025 · 10 revisions

The Python benchmarks use pytest-benchmark and torch.profiler. Most of the C++ benchmarks have been ported to Python. The key differences compared to the C++ interface are:

  1. Validation: Python benchmarks validate the nvFuser output against the torch output to verify correctness.
  2. PyTorch baselines (torch.compile and eager): Python benchmarks can also benchmark other executors, such as torch.compile and eager mode.
  3. Kernel timing: Python benchmarks use CUPTI (through torch.profiler) for accurate, low-overhead kernel measurements.

Adding a benchmark

To benchmark any target function, use run_benchmark (python_benchmarks/core.py):

run_benchmark(benchmark, target_function, function_inputs, iobytes=None)

Arguments:

  • benchmark: The pytest-benchmark fixture. Pytest passes it to every function that is run as a benchmark.
  • target_function: The function to benchmark.
  • function_inputs: List of inputs to target_function.
  • iobytes (Optional): Overrides the automatically computed IO bytes. By default, the IO bytes are computed from the inputs and outputs of the target function. Pass this argument for any executor other than nvFuser whose inputs/outputs differ from those of nvFuser. See PR #1725.

Example:

# Parametrize over any number of arguments (e.g., input sizes, dtypes)
@pytest.mark.parametrize("param1", ...)
@pytest.mark.parametrize("param2", ...)
def test_example_benchmark(
    benchmark, param1, param2, ...
):
    # Set up the function inputs
    run_benchmark(benchmark, target_function, function_inputs)

The benchmark name should start with test_ to be automatically discovered by pytest.
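When the automatically computed IO bytes do not apply (the iobytes argument described above), the value has to be worked out by hand. The following is a minimal sketch of that arithmetic, assuming dense tensors and counting every input as read once and every output as written once. The helper names (tensor_bytes, manual_iobytes) and the dtype table are hypothetical and are not part of python_benchmarks/core.py.

```python
from math import prod

# Bytes per element for a few common dtypes (illustrative table, not from core.py).
DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2}


def tensor_bytes(shape, dtype):
    """Bytes occupied by one dense tensor with the given shape and dtype name."""
    return prod(shape) * DTYPE_BYTES[dtype]


def manual_iobytes(input_specs, output_specs):
    """Total bytes read plus bytes written for one call (hypothetical helper)."""
    return sum(tensor_bytes(shape, dtype) for shape, dtype in input_specs + output_specs)


# An elementwise op on a 1024 x 4096 float16 tensor reads one tensor
# and writes one tensor of the same size.
spec = ((1024, 4096), "float16")
iobytes = manual_iobytes([spec], [spec])
```

The result could then be passed as run_benchmark(benchmark, target_function, function_inputs, iobytes=iobytes).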

Executing benchmarks

  • Running a benchmark file: NVFUSER_DISABLE=kernel_reuse pytest [options] <benchmark-file>

  • Running the complete benchmark suite: NVFUSER_DISABLE=kernel_reuse pytest [options] python_benchmarks/

  • Sharding: Pytest is memory-intensive, which can cause CPU out-of-memory (OOM) errors when running a large number of tests. Sharding is recommended when running the complete benchmark suite. We use pytest-shard in our CI. To execute shard i out of n total shards:

    NVFUSER_DISABLE=kernel_reuse pytest --shard-id=i --num-shards=n [options], where i is in {0..n-1}.

  • Running a subset of the inputs for any benchmark: NVFUSER_DISABLE=kernel_reuse pytest <benchmark-file> --benchmark-num-inputs=10. This will randomly sample 10 input sizes to run the given benchmark.

Note: It is recommended to disable kernel reuse to get reliable performance measurements in all benchmarks.
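To run every shard in turn, the sharded command above can be generated for each shard id. The following is a minimal sketch, assuming pytest and pytest-shard are installed and that the suite lives in python_benchmarks/. The helper names (shard_commands, shard_env) are hypothetical; the flags and the environment variable are the ones listed above.

```python
import os
import shlex


def shard_commands(num_shards, target="python_benchmarks/", extra_options=()):
    """Return one pytest argument list per shard; shard ids run from 0 to num_shards - 1."""
    return [
        ["pytest", f"--shard-id={i}", f"--num-shards={num_shards}", *extra_options, target]
        for i in range(num_shards)
    ]


def shard_env():
    """Copy of the current environment with kernel reuse disabled, as recommended above."""
    env = dict(os.environ)
    env["NVFUSER_DISABLE"] = "kernel_reuse"
    return env


commands = shard_commands(4, extra_options=["--benchmark-autosave"])
for argv in commands:
    # Print the equivalent shell command for each shard.
    print("NVFUSER_DISABLE=kernel_reuse", shlex.join(argv))
    # To actually launch a shard: subprocess.run(argv, env=shard_env(), check=True)
```

Running each shard as a separate process keeps the memory used by one pytest invocation from accumulating across the whole suite.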

Some useful options for running benchmarks:

Pytest/Pytest-benchmark options:

  • Filtering benchmarks: -k <filter>
  • Saving benchmarks: --benchmark-save=NAME, --benchmark-autosave, --benchmark-json=PATH
  • Debugging: --benchmark-verbose

Custom command-line options:

  • Disable output validation: --disable-validation. Skips the output validation in the nvFuser benchmarks.
  • Disable benchmarking: --disable-benchmarking. Skips the nvFuser benchmarking. This is useful for testing only the correctness of fusion definitions, without benchmarking the fusions.
  • Run eager mode benchmarks: --benchmark-eager
  • Run torch.compile mode benchmarks: --benchmark-torchcompile
  • Setting custom rounds / warmup-rounds: --benchmark-rounds and --benchmark-warmup-rounds can be used to override the default values (rounds=10, warmup_rounds=1)
  • Running a subset of input sizes: --benchmark-num-inputs=n randomly samples n input sizes out of the complete input set to run the benchmark. This is useful for testing new changes.
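The effect of --benchmark-num-inputs can be pictured as sampling without replacement from the full list of input sizes. The following is only an illustration of that idea, not the code used by the benchmark suite. The function name, the optional seed, and the example sizes are assumptions.

```python
import random


def sample_input_sizes(all_sizes, num_inputs=None, seed=None):
    """Pick num_inputs distinct sizes at random; keep all sizes if num_inputs is unset or too large."""
    sizes = list(all_sizes)
    if num_inputs is None or num_inputs >= len(sizes):
        return sizes
    # A seed is used here only to make the illustration reproducible.
    return random.Random(seed).sample(sizes, num_inputs)


# Hypothetical 2D input sizes: 4 batch sizes x 4 hidden sizes = 16 combinations.
all_sizes = [(b, h) for b in (16, 64, 256, 1024) for h in (768, 1024, 4096, 8192)]
subset = sample_input_sizes(all_sizes, 10, seed=0)
```

Because the subset is random, results from a sampled run cover different sizes from run to run and are best treated as a quick check rather than a full comparison.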

Resources:

  1. Pytest: https://docs.pytest.org/en/stable/
  2. Pytest-benchmark: https://pytest-benchmark.readthedocs.io/en/latest/index.html
  3. Pytest-shard: https://pypi.org/project/pytest-shard/