-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Invasively print kernel execution time, percolate to CI output #892
Conversation
auto end = std::chrono::high_resolution_clock::now(); | ||
std::chrono::duration<double> elapsed_seconds = end - start; | ||
double time_in_milliseconds = elapsed_seconds.count() * 1000; | ||
std::cout << "\n\n\nTotal time was " << time_in_milliseconds << " [ms]\n"; | ||
std::cerr << "\n\n\nTotal time was " << time_in_milliseconds << " [ms]\n"; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't know what the intent is here but this seems like not a good idea - you don't want this kind of stuff in the HAL itself...
besides with -DIREE_ENABLE_RUNTIME_TRACING=ON
you already get timing for any HAL function that uses a trace zone (IREE_TRACE_ZONE_BEGIN(z0);
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent is to measure AIE kernel time, without any runtime syncs/copies/setup overheads. Therefore, this is not the same as the entire function. However, I agree that we shouldn't add this kind of logic here in the HAL. I think an alternative would be to create/extend a separate benchmark module that takes a vmfb and measures this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way to measure smaller sets of calls, ie within trace zones, is just to add another trace zone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I know this is not not a good idea, and that we should be doing this with a standalone benchmark tool (or with additional trace zones). But I'm struggling to build the 'right' solution, still working through the complexity of hal/xrt-lite/zones (and sparsity of documentation + examples)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try this
IREE_TRACE_ZONE_BEGIN_NAMED(z1, "kernel-dispatch);
hwq->issue_command(ebuf.get_exec_buf_bo());
hwq->wait_command(ebuf.get_exec_buf_bo(), 0);
IREE_TRACE_ZONE_END(z1);
and then build with
-DIREE_ENABLE_RUNTIME_TRACING=ON
-DIREE_TRACING_PROVIDER=console
which is (briefly) mentioned here.
still working through the complexity of hal/xrt-lite/zones (and sparsity of documentation + examples)
Well I didn't like hoard/hide such docs - there simply aren't any. I had the same complaint and Ben graciously wrote the null
HAL to help me. So I studied that and the current HAL (and others) and bashed my head and this is what we got.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol, thanks @makslevental
runtime/src/iree-amd-aie/driver/xrt-lite/direct_command_buffer.cc
Outdated
Show resolved
Hide resolved
908a851
to
2dc0404
Compare
2dc0404
to
5d85f11
Compare
runtime/src/iree-amd-aie/driver/xrt-lite/direct_command_buffer.cc
Outdated
Show resolved
Hide resolved
LGTM. Thanks for adding this feature.
This is a good idea. Main IREE also has some performance check in ci. We'd just need to be careful about the threshold.
Yeah, I think this feature can be added later. There are some folks in the group have this experience. |
I'll remove some spacing, but above is where you can see the kernel execution time in the CI logs.
Alternatives considered:
Suggestion here: Invasively print kernel execution time, percolate to CI output #892 (comment) very nearly worked, but it resulted in large amounts of trace information being written to terminal, which I was worried would swamp the CI logs. Played around with
IREE_TRACING_CONSOLE_FILE
but not really designed for this.Tried tracy, but there is no simple text file dumped with tracy (as far as I can tell) so this didn't work
Tried making a standalone test based on https://github.com/nod-ai/iree-amd-aie/blob/main/runtime/src/iree-amd-aie/driver/xrt/cts/matmul_dispatch_test.cc but this got more complicated than I could handle.
FWIW running the m=n=512 k=4096 matmul in a loop (ie having warm up runs) does not effect the time to run the kernel, at least that is what I've observed.
If we're brave we can use the execution time as a threshold above which the test fails (thoughts on this?)
We could also create an artifact and upload these numbers to table and make nice plots of our progress (although I'd prefer someone more experienced with this stuff to take over from me)