[Performance] Optimizations for matmul #764
Comments
For the first point, I didn't see a significant performance change after switching from single buffering to double buffering. However, performance increases significantly when the L1/L2 tile sizes are increased (this requires single buffering to avoid exceeding the memory limit). Here are some comparison results on the matmul shapes from VAE. Each execution time is the average of 10 runs. Current parameter settings:
Now use single buffer:
Now increase tile sizes:
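The trade-off between buffer count and tile size can be sketched with a rough footprint estimate. This is only an illustration: the `TileConfig` struct, the 4-byte accumulator width, and the capacity passed to `fits` are assumptions, not values taken from the compiler or the hardware documentation. Only the bf16 input width (2 bytes) comes from this issue.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of one matmul tile held in a local memory.
struct TileConfig {
  int64_t m, n, k;       // tile sizes of the matmul
  int64_t inputBuffers;  // 1 = single buffer, 2 = double buffer (ping-pong)
};

// Rough footprint: `inputBuffers` copies of the A (m x k) and B (k x n) tiles
// plus one C (m x n) tile. bf16 inputs are 2 bytes; the 4-byte output element
// is an assumption made for this sketch.
int64_t tileFootprintBytes(const TileConfig &c, int64_t inputElemBytes = 2,
                           int64_t outputElemBytes = 4) {
  int64_t inputs = c.inputBuffers * (c.m * c.k + c.k * c.n) * inputElemBytes;
  int64_t output = c.m * c.n * outputElemBytes;
  return inputs + output;
}

// True if the tile fits into a memory of `capacityBytes` (hypothetical value).
bool fits(const TileConfig &c, int64_t capacityBytes) {
  return tileFootprintBytes(c) <= capacityBytes;
}
```

With an assumed 64 KiB capacity, a 64x64x128 tile fits when single-buffered (49152 bytes) but not when double-buffered (81920 bytes), which is the kind of constraint described above.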
To add more details on 2), see for example this piece of control code for a 128x128x128 matmul after the
Here
Or after canonicalization:
After this transformation, the source access pattern is left with only 2 dimensions, so now the
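As a rough illustration of how an access pattern can end up with fewer dimensions, the sketch below merges adjacent dimensions that are contiguous with respect to each other. It is not the pass's actual implementation; `AccessPattern` and `foldContiguousDims` are hypothetical names, and strides are counted in elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical strided access pattern, outermost dimension first.
struct AccessPattern {
  std::vector<int64_t> sizes;
  std::vector<int64_t> strides;
};

// Merge an outer dimension with the next inner one whenever
// outerStride == innerSize * innerStride. In that case
// i * outerStride + j * innerStride == (i * innerSize + j) * innerStride,
// so both dimensions collapse into one of size outerSize * innerSize.
AccessPattern foldContiguousDims(const AccessPattern &in) {
  AccessPattern out;
  for (std::size_t i = 0; i < in.sizes.size(); ++i) {
    int64_t size = in.sizes[i];
    int64_t stride = in.strides[i];
    if (!out.sizes.empty() && out.strides.back() == size * stride) {
      out.sizes.back() *= size;     // absorb the inner dimension
      out.strides.back() = stride;  // merged dimension steps by the inner stride
    } else {
      out.sizes.push_back(size);
      out.strides.push_back(stride);
    }
  }
  return out;
}
```

For example, sizes `[2, 64, 32]` with strides `[8192, 128, 1]` fold into sizes `[128, 32]` with strides `[128, 1]`, while a tiled pattern such as sizes `[4, 32, 32]` with strides `[32, 128, 1]` cannot be folded and keeps all three dimensions.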
Optimizing 2) would also reduce compile time. For the larger matmul above the pass
I have point 2) optimized and working correctly for most shapes. However, the tests with a large K size (>=1024) have numerical issues. Here's a simplified version of the code (with just the L3 to L2 DMA addressing change) that I made for testing purposes: #809. Note that if I disable the second LoopSubsumptionPass (/DmaComposition), all the tests pass, which means the changes within

Here's the IR dump for 128x128x256 (worked) and 128x128x1024 (failed) for comparison. @jtuyls do you have any idea about this?

UPDATE: This is currently solved by not subsuming loop iterations for large K sizes (>=1024), since doing so would exceed the size limit after inserting new dimensions.
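The guard described in the update can be sketched as follows. This is a simplified model, not the code of the actual pass: `DmaPattern`, `DmaLimits`, and `trySubsumeLoop` are hypothetical names, and the limit values are placeholders that depend on the real device and DMA channel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical strided DMA access pattern, outermost dimension first.
struct DmaPattern {
  std::vector<int64_t> sizes;
  std::vector<int64_t> strides;
};

// Hypothetical hardware limits; the real values are device specific.
struct DmaLimits {
  std::size_t maxDims;  // maximum number of addressing dimensions
  int64_t maxSize;      // maximum size allowed for the inserted dimension
};

// Try to fold a loop `for (iv = 0; iv < tripCount; ++iv)` whose body issues
// the DMA at offset `iv * ivStride` into the DMA itself, as a new outermost
// dimension. Returns false and leaves the pattern untouched if the result
// would exceed the limits, in which case the loop has to be kept.
bool trySubsumeLoop(DmaPattern &p, int64_t tripCount, int64_t ivStride,
                    const DmaLimits &limits) {
  if (p.sizes.size() + 1 > limits.maxDims) return false;
  if (tripCount > limits.maxSize) return false;
  p.sizes.insert(p.sizes.begin(), tripCount);
  p.strides.insert(p.strides.begin(), ivStride);
  return true;
}
```

Returning `false` instead of emitting an out-of-range pattern keeps the original loop, which mirrors the workaround of not subsuming loop iterations when the inserted dimensions would exceed the limit.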
…812) Pack/unpack ops change the data layout, so after conversion to DMA ops the DMA addressing dimensions are expanded/collapsed and transposed. Previously, all dimension transpositions were on the source side of the DMA ops. This PR adds an option for the transposition to happen on the target side instead. In applications, we can choose transposition on the source or the target for pack and unpack ops based on performance, hardware DMA requirements, etc. The motivation comes from [this discussion](#764 (comment)), and this PR moves the DMA optimization logic to an early pass where the DMA ops are converted. Note that the default options are not changed in this PR (they will be enabled in a separate PR along with other DMA optimization changes), but I have tested all four combinations locally to make sure the generated DMAs are correct and work e2e. The options can be set, for example, as follows:

```
AMDAIEConvertToDmaOptions dmaOptions;
dmaOptions.packTransposeOnSource = false;
dmaOptions.unpackTransposeOnSource = true;
passManager.addPass(createAMDAIEConvertToDmaPass(dmaOptions));
```
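To illustrate why the transposition can live on either side, here is a small host-side sketch of a 2-D strided copy. It does not model the actual DMA hardware or the pass; `stridedCopy`, `transposeOnSource`, and `transposeOnTarget` are hypothetical helpers. Both variants produce the same transposed data and differ only in which side uses the non-linear strides.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Copy a 2-D iteration space element by element. Source and target share the
// (i, j) iteration space, but each side has its own strides (in elements).
std::vector<int> stridedCopy(const std::vector<int> &src, int64_t size0,
                             int64_t size1, int64_t srcStride0,
                             int64_t srcStride1, int64_t dstStride0,
                             int64_t dstStride1) {
  std::vector<int> dst(src.size(), 0);
  for (int64_t i = 0; i < size0; ++i)
    for (int64_t j = 0; j < size1; ++j)
      dst[i * dstStride0 + j * dstStride1] = src[i * srcStride0 + j * srcStride1];
  return dst;
}

// Transpose a row-major rows x cols matrix by gathering from the source with
// strides (1, cols) and writing the target linearly.
std::vector<int> transposeOnSource(const std::vector<int> &src, int64_t rows,
                                   int64_t cols) {
  return stridedCopy(src, cols, rows, 1, cols, rows, 1);
}

// Same transpose, but reading the source linearly and scattering into the
// target with strides (1, rows).
std::vector<int> transposeOnTarget(const std::vector<int> &src, int64_t rows,
                                   int64_t cols) {
  return stridedCopy(src, rows, cols, cols, 1, 1, rows);
}
```

Since the results are identical, the choice between the two can be driven by which side tolerates the strided pattern better, for example in terms of performance or hardware DMA requirements as mentioned above.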
This issue is used as a tracker for ideas and discussions to improve performance for matmul ops. The data type for all these matmuls is bf16.
Some existing ideas include:
@jtuyls Feel free to add more points and details.