Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify accelerator and non-accelator paths #1544

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

djramic
Copy link
Contributor

@djramic djramic commented Jun 6, 2024

This PR is created for tracking the unification paths ticket.
The current commit is focused on the threadwise unification.
NormalizeView and TransformingForOp components remain non-unified(hardcoded).
For normalizeView, the kPack parameter is problematic, whereas for TransfromingForOp the differing indexing methond is the concern.
I am currently investigating whether they should be integrated through the AccelEmitter or if there is a possibility for refactoring them through a blockwise path unification.

@djramic djramic requested a review from krzysz00 June 6, 2024 02:17
@djramic djramic requested review from jerryyin and sjw36 as code owners June 6, 2024 02:17
@krzysz00
Copy link
Collaborator

krzysz00 commented Jun 6, 2024

I'll note that kpack should actually be used in the non-accelerator path

@krzysz00
Copy link
Collaborator

krzysz00 commented Jun 6, 2024

Like, while you're here, the FMA/dot product-based path should absolutely handle kpack

And if there isn't room for it in the perf config, a new perf config version might be in order

@krzysz00
Copy link
Collaborator

krzysz00 commented Jun 6, 2024

The fundamental observation, IMO, is that a a FMA is a 1 x kbase * kbase x 1 multiplication (I say K here because this could also be easily extended to dot products, so you'd have, for example, kbase = 2 for fp16.

Then, with that, you can use the general accelerator lowering, which is pretty independent of exactly what the accelerator does.

(For practical reasons, you might want to announce the accelerator as mPerThread x kbase x nPerThread ... but also, mRepeats and nRepeats exist for a reason and ought to cover most of this)

@djramic
Copy link
Contributor Author

djramic commented Jun 7, 2024

I plan to start by unifying the lowering paths in the way they are currently implemented, in order to get a better understanding of each lowerting level. And to make the verification easier, using existing tests. After that, I can make make additional refactorisation and improvement(We can open a separate issue based on you comments above.) I spirit of my current aproach do you have any aditional feedback? I would wery much appreciate it. I'm not quite sure I'm on the right path.

void emitThreadwiseLoop(OpBuilder &b, Location loc, Value argA, Value argB,
Value bufferC, ValueRange regCOffset) override;

virtual Value
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to repeat virtual here

} // namespace rock
} // namespace mlir

#endif // MLIR_FMA_INSN_GROUP_H
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: newline

StrAttr:$arch,
Rock_GemmFeaturesAttr:$features,
RockTuningParamAttrInterface:$params)> {
let assemblyFormat = [{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a temporary thing, right? Why's this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that this OP will replace threadwise_gemm and threadwise_accel_gemm. I am creating a new operation so I don't have to delete the existing code.

@@ -130,27 +131,110 @@ Value AccelEmitter::generateThreadwiseViewBufferC(PatternRewriter &b,
return viewC;
}

// **************************
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird comment format

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... though that's file style. Maybe worth fixing to the //=---------------=// style seen elsewhere in the codebase both here and below

void FmaEmitter::emitThreadwiseLoop(OpBuilder &b, Location loc, Value argA, Value argB,
Value bufferC, ValueRange regCOffset){

Type dataType = fmaInsn.argTypeA;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a git clang-format main in order

int64_t inNPerThread, bool doSwapThreadIterSubDimsForM,
bool doSwapThreadIterSubDimsForN){

//TO-DO
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... Weird

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be reasonably straightforward?

TopDownTMBuilder td(b, names, sizes, loc);
// Convert the normalizedView to a real view by ignoring
// the names contained in `ignoreNames` and letting the rest pass through
for (pos = 0; pos < names.size(); pos++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: move unsigned pos into the for loop.

... bigger nit: auto [pos, name] : llvm::enumerate(names))

// Loop properties
auto computeStart = llvm::to_vector(op.getComputeIndices());

if(isMfma || isWmma){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? Don't WMMA and MFMA go down threadwise_gemm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is temporary for parts that are not unified yet. The goal is to have it removed at the end of the unification paths.

OpBuilder::InsertionGuard guard(b);
b.setInsertionPointToStart(gemmLoop.getBody());

auto coordsA = gemmLoop.getLowerCoords(/*domain=*/0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this case? What prevents you from using emitAccelLoop here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is caused by different ways of creating gemmLoop and accelLoop TransformingForOp. I want to try to equalize these approaches by refactoring the blockwise level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants