
Benchmark: matmul ukernel vs direct codegen #415

Open · wants to merge 1 commit into main
Conversation

@newling (Contributor) commented Jun 13, 2024

End-to-end script, run locally (on nuc50).

Direct codegen

Benchmark summary
=================
m: 2048
k: 4096
n: 2048
number of operations (2*m*n*k): 3.43597e+10
execution times for all 10 runs [s]:
    0.0586853 0.0539645 0.0539398 0.0539536 0.0540862 0.0540921 0.054008 0.0540939 0.0540261 0.0540442
teraops/second over all 10 runs:
    0.585491 0.63671 0.637002 0.636839 0.635277 0.635208 0.636197 0.635187 0.635984 0.635771
mean time over runs: 0.0544894 [s]
minimum time over runs: 0.0539398 [s]
max teraops/second: 0.637002 [teraops/second]

Using ukernel

Benchmark summary
=================
m: 2048
k: 4096
n: 2048
number of operations (2*M*N*K): 3.43597e+10
execution times for all 10 runs [s]:
    0.0219439 0.0181417 0.0181282 0.0177506 0.0187402 0.0183464 0.0187545 0.0181528 0.0177086 0.0180813
teraops/second over all 10 runs:
    1.5658 1.89397 1.89537 1.93569 1.83348 1.87283 1.83208 1.89281 1.94029 1.90029
mean time over runs: 0.0185748 [s]
minimum time over runs: 0.0177086 [s]
max teraops/second: 1.94029 [teraops/second]

So the ukernel approach is currently 3x faster end to end. This is a lower bound, though: the core ukernel itself is probably more than 3x faster. Consider:

total-time-ukernel = time-in-ukernel + other-time
total-time-dcg = time-in-dcg + other-time

where other-time is the same in the two experiments, since only the instruction memory differs (the DMA data movement is identical). We observed that

total-time-ukernel / total-time-dcg = 1/3

so that

time-in-ukernel = 1/3 * (time-in-dcg + other-time) - other-time

and hence

time-in-ukernel / time-in-dcg = 1/3 - 2/3 * (other-time / time-in-dcg) < 1/3

as other-time is the same in both approaches (data movement between DDR <-> memtile <-> core is identical).

I think the theoretical max on this phoenix machine is 4 teraops/second, so the ukernel approach is at roughly 50% of the theoretical max.

Two extremes:

  1. All time is in the ukernel, i.e. other-time = 0. Then time-in-dcg = 3 * time-in-ukernel.
  2. 50% of the time is in the ukernel (i.e. the ukernel itself is 100% efficient), so other-time = time-in-ukernel. Then time-in-dcg = 5 * time-in-ukernel.

So the core ukernel is between 3x and 5x faster than direct codegen.


// We will time the run, and print it.
auto start = std::chrono::high_resolution_clock::now();
auto run = kernel(bo_instr, instr_v.size(), bo_a, bo_b, bo_c);
@nirvedhmeshram (Contributor) commented Jun 14, 2024
Was this kernel generated with the latest iree-amd-aie? If so, this is the wrong way to run it. You can copy the changes made in the tests by Xilinx/mlir-aie#1517.

Comment on lines +143 to +147
BASE_COMPILATION_FLAGS="-iree-hal-target-backends=amd-aie \
-iree-amd-aie-peano-install-dir=${PEANO} \
-iree-amd-aie-mlir-aie-install-dir=${MLIR_AIE_INSTALL} \
-iree-amd-aie-vitis-install-dir=${VITIS} \
-iree-amd-aie-show-invoked-commands"


This is missing the path to IREE:

-iree-amd-aie-install-dir=$IREE_INSTALL_DIR

Contributor Author:

OK, thanks. This is a recent flag addition AFAIK. Maybe I should push for this PR to be landed so it doesn't go stale.


Yes, very recent (it just started failing for me). I don't think having this test in CI would be entirely out of the question. It's a good reference for people trying to do some performance analysis (like myself).

3 participants