Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge kernels in existing XCLBIN #418

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from
Draft

Conversation

nirvedhmeshram
Copy link
Contributor

@nirvedhmeshram nirvedhmeshram commented Jun 14, 2024

This PR uses a new xclbin generation tool from Xilinx/mlir-aie#1561 that allows us to merge kernel PDIs into the same XCLBIN.
A mlir-aie wheel bump is needed to use the tool.
An e2e example which forms multiple (three) dispatches is added.

Copy link
Contributor

@newling newling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool that the e2e test works! Just to confirm: these matmuls form 3 different dispatches?

@@ -350,11 +368,24 @@ LogicalResult AIETargetBackend::serializeExecutable(
{
SmallVector<StringRef> cmdEnvRefs{cmdEnv.begin(), cmdEnv.end()};
int result = llvm::sys::ExecuteAndWait(cmdArgs[0], cmdArgs, cmdEnvRefs);
if (result != 0)
if (result != 0 && AttemptingMerge) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we detect before hand if the same kernel appears twice? Just because it'll be better to know why we failed. My concern is that we'll stop being able to merge for some other reason, and not know about it.

Copy link
Contributor Author

@nirvedhmeshram nirvedhmeshram Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I correctly understand the case you are thinking about but if I understand correctly you are thinking some kernel cannot be merged in the existing XCLBIN and if a later kernel is same as the one that couldnt be we will just not be able to merge that either?

If what I interpret is correct then a few reasons why this is not of concern.

  1. IREE has deduplication passes after dispatch formation so same kernel is never appearing twice.
  2. The success of the merge is independent of the content in the kernel. The only case that I am aware of when it will fail is when we exceed the number of PDIs that xclbin supports. We have one PDI per kernel for now until we get to super kernels. I believe the PDI limit is 16. So my idea here is the 17th kernel will fail and then start a new xclbin, and the process repeats for the 33rd kernel and so on. And as far as the xclbin utility is concerned all 33 of these kernels (PDIs) could have same content and that fact wouldn't matter. A bit hard to test this though and we arent at that scale anyway now so havent worked on testing that.

@nirvedhmeshram
Copy link
Contributor Author

Very cool that the e2e test works! Just to confirm: these matmuls form 3 different dispatches?

Yes that is correct.

@nirvedhmeshram
Copy link
Contributor Author

nirvedhmeshram commented Jun 18, 2024

On further investigating I realized that with the wheel the CI is using, no actual merging is taking place, you can see this error in the CI test added

Expected<T> must be checked before access or destruction.
Expected<T> value was in success state. (Note: Expected<T> values in success mode must still be checked prior to being destroyed).

This causes the merging to fail and it goes through the regular new xclbin per dispatch path.
This is not what I see when I build my own mlir-aie. However, given I am seeing same performance with and without the merging achieved from this PR I think I will hit pause on this for now and come back to it later perhaps after this PR lands,
#422

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants