Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invasively print kernel execution time, percolate to CI output #892

Merged
merged 4 commits into from
Nov 14, 2024

Conversation

newling
Copy link
Contributor

@newling newling commented Nov 11, 2024

{4C086AF3-4CB3-4F77-960C-ABEC14FCC4E6}

I'll remove some spacing, but above is where you can see the kernel execution time in the CI logs.

Alternatives considered:

  1. Suggestion here: Invasively print kernel execution time, percolate to CI output #892 (comment) very nearly worked, but it resulted in large amounts of trace information being written to terminal, which I was worried would swamp the CI logs. Played around with IREE_TRACING_CONSOLE_FILE but not really designed for this.

  2. Tried tracy, but there is no simple text file dumped with tracy (as far as I can tell) so this didn't work

  3. Tried making a standalone test based on https://github.com/nod-ai/iree-amd-aie/blob/main/runtime/src/iree-amd-aie/driver/xrt/cts/matmul_dispatch_test.cc but this got more complicated than I could handle.

FWIW running the m=n=512 k=4096 matmul in a loop (ie having warm up runs) does not effect the time to run the kernel, at least that is what I've observed.

If we're brave we can use the execution time as a threshold above which the test fails (thoughts on this?)

We could also create an artifact and upload these numbers to table and make nice plots of our progress (although I'd prefer someone more experienced with this stuff to take over from me)

Comment on lines 210 to 215
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
double time_in_milliseconds = elapsed_seconds.count() * 1000;
std::cout << "\n\n\nTotal time was " << time_in_milliseconds << " [ms]\n";
std::cerr << "\n\n\nTotal time was " << time_in_milliseconds << " [ms]\n";

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't know what the intent is here but this seems like not a good idea - you don't want this kind of stuff in the HAL itself...

besides with -DIREE_ENABLE_RUNTIME_TRACING=ON you already get timing for any HAL function that uses a trace zone (IREE_TRACE_ZONE_BEGIN(z0);)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent is to measure AIE kernel time, without any runtime syncs/copies/setup overheads. Therefore, this is not the same as the entire function. However, I agree that we shouldn't add this kind of logic here in the HAL. I think an alternative would be to create/extend a separate benchmark module that takes a vmfb and measures this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way to measure smaller sets of calls, ie within trace zones, is just to add another trace zone

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know this is not not a good idea, and that we should be doing this with a standalone benchmark tool (or with additional trace zones). But I'm struggling to build the 'right' solution, still working through the complexity of hal/xrt-lite/zones (and sparsity of documentation + examples)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try this

IREE_TRACE_ZONE_BEGIN_NAMED(z1, "kernel-dispatch);
hwq->issue_command(ebuf.get_exec_buf_bo());
hwq->wait_command(ebuf.get_exec_buf_bo(), 0);
IREE_TRACE_ZONE_END(z1);

and then build with

-DIREE_ENABLE_RUNTIME_TRACING=ON
-DIREE_TRACING_PROVIDER=console

which is (briefly) mentioned here.

still working through the complexity of hal/xrt-lite/zones (and sparsity of documentation + examples)

Well I didn't like hoard/hide such docs - there simply aren't any. I had the same complaint and Ben graciously wrote the null HAL to help me. So I studied that and the current HAL (and others) and bashed my head and this is what we got.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, thanks @makslevental

@newling newling changed the title [WIP] Flag to run.py to print time in kernel. Invasively print kernel execution time, percolate to CI output Nov 14, 2024
@yzhang93
Copy link
Contributor

LGTM. Thanks for adding this feature.

If we're brave we can use the execution time as a threshold above which the test fails (thoughts on this?)

This is a good idea. Main IREE also has some performance check in ci. We'd just need to be careful about the threshold.

We could also create an artifact and upload these numbers to table and make nice plots of our progress (although I'd prefer > someone more experienced with this stuff to take over from me)

Yeah, I think this feature can be added later. There are some folks in the group have this experience.

@newling newling merged commit f07a535 into nod-ai:main Nov 14, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants