
Benchmark: matmul ukernel vs direct codegen #415

Open · wants to merge 1 commit into main
Conversation

@newling (Contributor) commented Jun 13, 2024

End-to-end script, run locally (on nuc50).

Direct codegen

Benchmark summary
=================
m: 2048
k: 4096
n: 2048
number of operations (2*m*n*k): 3.43597e+10
execution times for all 10 runs [s]:
    0.0586853 0.0539645 0.0539398 0.0539536 0.0540862 0.0540921 0.054008 0.0540939 0.0540261 0.0540442
teraops/second over all 10 runs:
    0.585491 0.63671 0.637002 0.636839 0.635277 0.635208 0.636197 0.635187 0.635984 0.635771
mean time over runs: 0.0544894 [s]
minimum time over runs: 0.0539398 [s]
max teraops/second: 0.637002 [teraops/second]

Using ukernel

Benchmark summary
=================
m: 2048
k: 4096
n: 2048
number of operations (2*M*N*K): 3.43597e+10
execution times for all 10 runs [s]:
    0.0219439 0.0181417 0.0181282 0.0177506 0.0187402 0.0183464 0.0187545 0.0181528 0.0177086 0.0180813
teraops/second over all 10 runs:
    1.5658 1.89397 1.89537 1.93569 1.83348 1.87283 1.83208 1.89281 1.94029 1.90029
mean time over runs: 0.0185748 [s]
minimum time over runs: 0.0177086 [s]
max teraops/second: 1.94029 [teraops/second]

So the ukernel approach is currently 3x faster end to end. This is a lower bound, though: the core ukernel itself is probably more than 3x faster. Consider:

total-time-ukernel = time-in-ukernel + other-time
total-time-dcg = time-in-dcg + other-time

where other-time is the same in the two experiments, since only the instruction memory differs (the DMA data movement is identical). We observed that

total-time-ukernel / total-time-dcg = 1/3

so that

time-in-ukernel = 1/3 * (time-in-dcg + other-time) - other-time

and hence

time-in-ukernel / time-in-dcg = 1/3 - 2/3 * (other-time / time-in-dcg) < 1/3

as other-time is the same in both approaches (data movement between DDR <-> memtile <-> core is identical).

I think the theoretical max on this phoenix machine is 4 teraops/second, so the ukernel approach is at roughly 50% of the theoretical max.

Two extremes:

  1. All time is in the ukernel, i.e. other-time = 0. Then time-in-dcg = 3 * time-in-ukernel.
  2. 50% of the time is in the ukernel (i.e. the ukernel itself is 100% efficient), so other-time = time-in-ukernel. Then time-in-dcg = 5 * time-in-ukernel.

So the core ukernel is between 3x and 5x faster than direct codegen.


// We will time the run, and print it.
auto start = std::chrono::high_resolution_clock::now();
auto run = kernel(bo_instr, instr_v.size(), bo_a, bo_b, bo_c);
@nirvedhmeshram (Contributor) commented Jun 14, 2024
Was this kernel generated with the latest iree-amd-aie? If so, this is the wrong way to run it. You can copy the changes made in the tests by Xilinx/mlir-aie#1517.

Comment on lines +143 to +147
BASE_COMPILATION_FLAGS="-iree-hal-target-backends=amd-aie \
-iree-amd-aie-peano-install-dir=${PEANO} \
-iree-amd-aie-mlir-aie-install-dir=${MLIR_AIE_INSTALL} \
-iree-amd-aie-vitis-install-dir=${VITIS} \
-iree-amd-aie-show-invoked-commands"


This is missing the path to IREE:

-iree-amd-aie-install-dir=$IREE_INSTALL_DIR

Contributor Author:

OK, thanks. This is a recent flag addition AFAIK. Maybe I should push for this PR to be landed so it doesn't go stale.


Yes, very recent (it just started failing for me). I don't think having this test in CI would be entirely out of the question. It's a good reference for people trying to do some performance analysis (like myself).

3 participants