-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unify accelerator and non-accelator paths #1544
base: develop
Are you sure you want to change the base?
Conversation
I'll note that kpack should actually be used in the non-accelerator path |
Like, while you're here, the FMA/dot product-based path should absolutely handle kpack And if there isn't room for it in the perf config, a new perf config version might be in order |
The fundamental observation, IMO, is that a a FMA is a Then, with that, you can use the general accelerator lowering, which is pretty independent of exactly what the accelerator does. (For practical reasons, you might want to announce the accelerator as mPerThread x kbase x nPerThread ... but also, mRepeats and nRepeats exist for a reason and ought to cover most of this) |
I plan to start by unifying the lowering paths in the way they are currently implemented, in order to get a better understanding of each lowerting level. And to make the verification easier, using existing tests. After that, I can make make additional refactorisation and improvement(We can open a separate issue based on you comments above.) I spirit of my current aproach do you have any aditional feedback? I would wery much appreciate it. I'm not quite sure I'm on the right path. |
void emitThreadwiseLoop(OpBuilder &b, Location loc, Value argA, Value argB, | ||
Value bufferC, ValueRange regCOffset) override; | ||
|
||
virtual Value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to repeat virtual
here
} // namespace rock | ||
} // namespace mlir | ||
|
||
#endif // MLIR_FMA_INSN_GROUP_H |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: newline
StrAttr:$arch, | ||
Rock_GemmFeaturesAttr:$features, | ||
RockTuningParamAttrInterface:$params)> { | ||
let assemblyFormat = [{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a temporary thing, right? Why's this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that this OP will replace threadwise_gemm and threadwise_accel_gemm. I am creating a new operation so I don't have to delete the existing code.
@@ -130,27 +131,110 @@ Value AccelEmitter::generateThreadwiseViewBufferC(PatternRewriter &b, | |||
return viewC; | |||
} | |||
|
|||
// ************************** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird comment format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... though that's file style. Maybe worth fixing to the //=---------------=//
style seen elsewhere in the codebase both here and below
void FmaEmitter::emitThreadwiseLoop(OpBuilder &b, Location loc, Value argA, Value argB, | ||
Value bufferC, ValueRange regCOffset){ | ||
|
||
Type dataType = fmaInsn.argTypeA; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a git clang-format main
in order
int64_t inNPerThread, bool doSwapThreadIterSubDimsForM, | ||
bool doSwapThreadIterSubDimsForN){ | ||
|
||
//TO-DO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... Weird
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be reasonably straightforward?
TopDownTMBuilder td(b, names, sizes, loc); | ||
// Convert the normalizedView to a real view by ignoring | ||
// the names contained in `ignoreNames` and letting the rest pass through | ||
for (pos = 0; pos < names.size(); pos++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: move unsigned pos
into the for loop.
... bigger nit: auto [pos, name] : llvm::enumerate(names))
// Loop properties | ||
auto computeStart = llvm::to_vector(op.getComputeIndices()); | ||
|
||
if(isMfma || isWmma){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? Don't WMMA and MFMA go down threadwise_gemm
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is temporary for parts that are not unified yet. The goal is to have it removed at the end of the unification paths.
OpBuilder::InsertionGuard guard(b); | ||
b.setInsertionPointToStart(gemmLoop.getBody()); | ||
|
||
auto coordsA = gemmLoop.getLowerCoords(/*domain=*/0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this case? What prevents you from using emitAccelLoop
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is caused by different ways of creating gemmLoop and accelLoop TransformingForOp. I want to try to equalize these approaches by refactoring the blockwise level.
This PR is created for tracking the unification paths ticket.
The current commit is focused on the threadwise unification.
NormalizeView and TransformingForOp components remain non-unified(hardcoded).
For normalizeView, the kPack parameter is problematic, whereas for TransfromingForOp the differing indexing methond is the concern.
I am currently investigating whether they should be integrated through the AccelEmitter or if there is a possibility for refactoring them through a blockwise path unification.