Point of reference:
-
OK, so I think the discussion can be summarized as follows:
So we have two options ahead of us: I would start with b), and then we can refactor things into a).
-
Intro
This is about this ticket: https://github.com/ROCmSoftwarePlatform/rocMLIR-internal/issues/1075
After some investigation, I found out that different libraries pipeline the loop in different ways. I summed up my findings here:
https://confluence.amd.com/display/MLSE/Pipelinining
My main question is: how do we move from our hardcoded software pipeline to a position where we can choose among the methods above?
The idea I have is to write a normal loop during the gridwise pass, and then to create a "pipeline pass" that can restructure the loop in the way we want.
If you look at the Tensile pipeline, they pipeline the mfma/lds_read (so that they can run in parallel). So I think the pass should go after we do the threadwise lowering. This has the additional benefit that the blockwiseToThreadwise pass will also be simplified (because it is now the pipeline pass's responsibility to decide the loop structure).
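To make the "normal loop first, restructure later" idea concrete, here is a minimal Python sketch. It is not rocMLIR code: the stage names, the `pipeline_schedule` helper and the per-stage "lag" encoding are all hypothetical and only illustrate what a pipeline pass would decide. A stage with lag `d` executes iteration `i - d` at step `i`, which produces the usual prologue, steady state and epilogue.

```python
from typing import List, Sequence, Tuple

# Hypothetical encoding: (stage name, lag). A stage with lag d runs
# iteration (step - d) during a given step of the pipelined loop.
Stage = Tuple[str, int]


def pipeline_schedule(stages: Sequence[Stage], n_iters: int) -> List[List[Tuple[str, int]]]:
    """For each step, list the (stage, iteration) pairs issued in that step."""
    if n_iters <= 0 or not stages:
        return []
    max_lag = max(lag for _, lag in stages)
    steps = []
    for step in range(n_iters + max_lag):
        ops = []
        for name, lag in stages:
            iteration = step - lag
            if 0 <= iteration < n_iters:  # outside this range we are in prologue/epilogue
                ops.append((name, iteration))
        steps.append(ops)
    return steps


def dot_naive(a: Sequence[int], b: Sequence[int]) -> int:
    """The 'normal' loop: load and compute for the same iteration, back to back."""
    acc = 0
    for k in range(len(a)):
        x, y = a[k], b[k]  # load stage
        acc += x * y       # compute stage
    return acc


def dot_pipelined(a: Sequence[int], b: Sequence[int]) -> int:
    """Same loop after pipelining by hand with the compute stage lagging by one."""
    n = len(a)
    if n == 0:
        return 0
    acc = 0
    x, y = a[0], b[0]            # prologue: load for iteration 0
    for k in range(1, n):
        nx, ny = a[k], b[k]      # load for iteration k ...
        acc += x * y             # ... issued alongside compute for iteration k - 1
        x, y = nx, ny
    acc += x * y                 # epilogue: compute for the last iteration
    return acc


if __name__ == "__main__":
    for step, ops in enumerate(pipeline_schedule([("load", 0), ("mfma", 1)], 3)):
        print(step, ops)
```

Changing the lags (or adding stages such as a separate `lds_read`) changes the loop structure without touching the loop body, which is the flexibility a dedicated pass would give over a hardcoded pipeline.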
Possible structure
So the structure I have in mind is:
Implementation strategy
Please note that the analysis pass in point 3 could also be useful in other parts of the code (e.g., if we approximate the occupancy of the kernel, that estimate can be used by the grid layout heuristic).
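As a rough illustration of the kind of occupancy approximation such an analysis could provide, here is a small Python sketch. Everything in it is an assumption: the `HardwareLimits` fields and the numbers in the example are placeholders rather than values for any specific GPU, and the model ignores register allocation granularity and other limits a real analysis would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareLimits:
    """Hypothetical per-compute-unit limits; real values depend on the target."""
    vgprs_per_simd: int
    lds_bytes_per_cu: int
    simds_per_cu: int
    max_waves_per_simd: int


def estimate_waves_per_simd(
    hw: HardwareLimits,
    vgprs_per_wave: int,
    lds_bytes_per_workgroup: int,
    waves_per_workgroup: int,
) -> int:
    """Coarse occupancy estimate: the tightest of the register, LDS and hardware caps."""
    # Register limit: how many waves fit in one SIMD's register file.
    by_regs = hw.vgprs_per_simd // max(vgprs_per_wave, 1)

    # LDS limit: how many workgroups fit in one CU, converted to waves per SIMD.
    if lds_bytes_per_workgroup > 0:
        workgroups_by_lds = hw.lds_bytes_per_cu // lds_bytes_per_workgroup
        by_lds = (workgroups_by_lds * waves_per_workgroup) // hw.simds_per_cu
    else:
        by_lds = hw.max_waves_per_simd  # no LDS use, so LDS is not a constraint

    return min(hw.max_waves_per_simd, by_regs, by_lds)


if __name__ == "__main__":
    # Placeholder numbers, chosen only to exercise the formula.
    hw = HardwareLimits(vgprs_per_simd=512, lds_bytes_per_cu=65536,
                        simds_per_cu=4, max_waves_per_simd=8)
    print(estimate_waves_per_simd(hw, vgprs_per_wave=128,
                                  lds_bytes_per_workgroup=32768,
                                  waves_per_workgroup=4))
```

A heuristic could compare this estimate across candidate configurations and prefer the one that keeps more waves in flight.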
What do you guys think?