
Tiling for uneven tile sizes #138

Open
giuseros opened this issue Jan 4, 2022 · 12 comments

Comments

@giuseros
Collaborator

giuseros commented Jan 4, 2022

Hi all,
Now that I finally have some satisfying numbers with my specific 2048^3 experiment (I will push the latest transforms as soon as possible), I thought it was time to try out more exotic sizes.

This is what I found, assuming micro-tiles of 8x8 and k_c (i.e., reduction axis tiling) equal to 512.

When M==2047

Everything is not too bad here. The packing is fine, while the micro-kernel contains if-else statements to avoid stepping out of bounds of the output matrix C.

When N == 2047

Same story. Packing is fine, and if we use split-vector-transfers-to=vector-transfers, performance is not affected (if we use none instead, performance decreases slightly; to be investigated). This is not super important right now, but @nicolasvasilache have you thought about the possibility of splitting the macro-kernel loop (into a main + remainder loop) instead of adding if-else clauses inside the micro-kernel? How hard would it be?
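
(To illustrate what I mean, a rough C-style sketch; micro_kernel, n_r, etc. are just placeholders, not the actual generated code:)

// Today (roughly): one loop, with a bounds check that ends up inside the
// micro-kernel when N is not a multiple of the tile size n_r.
for (int n = 0; n < N; n += n_r) {
  int width = (n + n_r <= N) ? n_r : N - n;  // if-else in the hot path
  micro_kernel(n, width);
}

// What I am asking about: split into a main loop over full tiles plus a
// remainder, so the main body stays branch-free.
int n_main = N - N % n_r;
for (int n = 0; n < n_main; n += n_r)
  micro_kernel(n, n_r);                // always a full tile, no check needed
if (n_main < N)
  micro_kernel(n_main, N - n_main);    // remainder tile handled once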

When K == 2047

We have issues here: about a 15% performance hit.

The problem is that the micro-kernel loop is now a dynamic loop. This is good per se, because we don't want to spend time adding and multiplying zeros, but it has the drawback that the pipelining pass won't work for dynamic bounds. There is a check here:

bool LoopPipelinerInternal::initializeLoopInfo(
    ForOp op, const PipeliningOption &options) {
  forOp = op;
  // Bail out unless the lower bound, upper bound, and step are all constants.
  auto upperBoundCst =
      forOp.getUpperBound().getDefiningOp<arith::ConstantIndexOp>();
  auto lowerBoundCst =
      forOp.getLowerBound().getDefiningOp<arith::ConstantIndexOp>();
  auto stepCst = forOp.getStep().getDefiningOp<arith::ConstantIndexOp>();
  if (!upperBoundCst || !lowerBoundCst || !stepCst)
    return false;

which makes the pass bail out. @ThomasRaoux I haven't started looking at it yet, but how hard do you think it would be to extend the pass to dynamic loop bounds?

Thanks,
Giuseppe

@nicolasvasilache
Contributor

Speaking of exotic sizes, I did some experiments in the past 2 weeks and found that tile size == 1 for K (i.e., vector.contract of vector<ax1xf32> * vector<1xbxf32>) yields some nice perf on X86 and gives slightly different tradeoffs between streaming loads, packing size, unrolling, and #registers.

In particular, see https://github.com/google/iree-llvm-sandbox/blob/main/python/examples/matmul/bench.py#L62

I do not have recommendations re pipelining yet but I added a simple transformation for targeted unrolling and will play with pipelining on my end too soonish.

@giuseros
Collaborator Author

giuseros commented Jan 4, 2022

Funny, your tile sizes:

 tile_sizes1=[128, 384, 512],
 tile_sizes2=[12, 32, 1],

are not too dissimilar to mine:

 tile_sizes1=[192, 8192, 512],
 tile_sizes2=[12, 8, 1],

I also found that using k_r>1 makes everything worse. From what I found, it is better to unroll the micro-kernel and then pipeline afterwards, and I guess you are also going in that direction. Please let me know about your pipelining findings :)

@ThomasRaoux
Contributor

@ThomasRaoux I haven't started looking at it yet, but how hard do you think it would be to extend the pass to dynamic loop bounds?

It depends on the case we want to handle. If we have a dynamic loop where we know the number of iterations is greater than the number of stages, it is a fairly simple change. If we also want to support the case where the number of iterations is potentially low (a case that wouldn't benefit from pipelining but that we may still want to support for functionality), it is much harder, as we need to handle the case where the kernel loop is never executed, which would generate suboptimal code.

As @nicolasvasilache pointed out, we could pick a tile size of 1 for K and let the backend potentially do the dynamic unroll. Another solution would be to peel off iterations to get to a number of iterations divisible by the tile size.

@nicolasvasilache
Contributor

You could also add a level of tiling just K with peeling.
That should give you a full/partial separation and you could pipeline in the static case.
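
Roughly the shape I have in mind (C-style sketch, names are placeholders):

// Peel the K remainder so the main region only sees full k_c tiles.
int k_main = K - K % k_c;

// Main region: the inner loop has a static trip count of k_c, so it can be
// unrolled and pipelined with constant bounds.
for (int k0 = 0; k0 < k_main; k0 += k_c)
  for (int k = k0; k < k0 + k_c; ++k)
    micro_kernel(k);

// Peeled remainder: small dynamic loop, left unpipelined.
for (int k = k_main; k < K; ++k)
  micro_kernel(k);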

@giuseros
Collaborator Author

giuseros commented Jan 4, 2022

Just to be on the same page, I am calling k_c the outer-tiling on K (in the example above 512) and k_r the inner tiling on K (in the example above 1).

I am not after full functionality, and I am happy that for very small loops (k_c ~ 1) pipelining is disabled (I don't expect small tile sizes like that to be beneficial to performance).

When you say "simple", do you mean simple for you or simple in general also for a newbie like me? :)

I am not sure about X86, but on AArch64 pipelining is disabled in the back-end because apparently it does not work super well. What I gathered from talking with back-end engineers here is that it would be better to pipeline in MLIR and let the back-end optimize straight-line blocks (which is already quite hard, it seems).

@giuseros
Collaborator Author

giuseros commented Jan 4, 2022

@nicolasvasilache Ah, I see what you mean (maybe). I could tile & peel K by 512 so that we have a full separation when I apply pipelining. Doesn't seem like a bad idea; I will also try this out, thanks!

@ThomasRaoux
Contributor

Just to be on the same page, I am calling k_c the outer-tiling on K (in the example above 512) and k_r the inner tiling on K (in the example above 1).

I am not after full functionality, and I am happy that for very small loops (k_c ~ 1) pipelining is disabled (I don't expect small tile sizes like that to be beneficial to performance).

What actually matters is the number of iterations, so it is k_c/k_r in your case. The tricky part where small loops may happen is if you end up with a remainder outer tile that is very small. Either we need to support that (but we would have to generate an inefficient prologue/epilogue and add significant logic to the pipelining) or we need to make sure this never happens.

When I wrote the transformation I was defensive and wanted to make sure nobody accidentally generates incorrect code; that's why I added a check to make sure the number of iterations is large enough.

When you say "simple", do you mean simple for you or simple in general also for a newbie like me? :)

It's not a one-line change, but it shouldn't be very complicated even for someone new to the code. It doesn't change the overall logic. The main difference is that we need to change the loop bounds usage here to be Values instead of constants and update the associated math to generate ops; the rest of the logic stays unchanged. We also need to add an option so that the user guarantees that the number of iterations is large enough, to make sure nobody mis-compiles by mistake.
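
Roughly the kind of change I mean (illustrative sketch only, not an actual patch; the exact ops and accessors may differ):

// Keep the bounds as Values instead of requiring constants, and emit the
// trip-count math as arith ops.
OpBuilder b(forOp.getOperation());
Location loc = forOp.getLoc();
Value lb = forOp.getLowerBound();
Value ub = forOp.getUpperBound();
Value step = forOp.getStep();
// tripCount = ceildiv(ub - lb, step), now computed at runtime.
Value range = b.create<arith::SubIOp>(loc, ub, lb);
Value tripCount = b.create<arith::CeilDivSIOp>(loc, range, step);
// Quantities like (tripCount - numStages) used by the prologue/epilogue then
// become ops on tripCount rather than compile-time constants, and the user
// guarantees tripCount >= numStages via the new option.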

@giuseros
Collaborator Author

giuseros commented Jan 4, 2022

Ok, I have implemented it in the way you suggested, and you were right: it was simple, and performance is back on track.

Unfortunately, I think we might have situations where we have to take care of edge cases. For instance, when K=513 and k_c=512, the last tile would be just one vector (while I am unrolling by u and pipelining with s<=u stages). What do you think? @nicolasvasilache this would be problematic also for the peeling solution, right?

I will jot down some experiments and let you guys know.

@giuseros
Collaborator Author

giuseros commented Jan 5, 2022

Ok, so I played a bit with everything, and this is the problem.

The problem

When we tile with uneven sizes, the loop becomes dynamic. Considering the loop over K, when tiling by k_c the first iterations cover a full k_c tile, but the last one covers whatever the remainder length is.

The problem is that if the remainder is quite small (smaller than the number of stages), we might execute more operations than necessary.

The solution

I came up with two solutions:
a) A don't-care solution, where it's fine to execute more operations than necessary. This holds in our case, for instance, because the tile size is still k_c even though we are executing <k_c iterations. However, this solution is possibly suboptimal.
b) A loop-switch solution, where if the number of iterations is bigger than the number of stages we pipeline; otherwise we "disable" the pipelining, i.e., we revert to the original loop (see the sketch right after this list).
In theory there could be a more fine-grained solution where we put if-else around the different stages, but I didn't implement this (not only because of the complexity, but also because I am not sure it would make sense performance-wise).
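
A rough sketch of solution b) in pseudocode (ceil_div, run_pipelined_loop, etc. are just placeholders, not what the pass actually emits):

// Loop versioning: pipeline only when the trip count is large enough.
int tripCount = ceil_div(ub - lb, step);
if (tripCount >= numStages) {
  // Pipelined version: prologue, steady-state kernel loop, epilogue.
  run_pipelined_loop(lb, ub, step);
} else {
  // Fall back to the original, unpipelined loop.
  for (int i = lb; i < ub; i += step)
    original_body(i);
}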

I added an option to enable solution a), to enable solution b), or to disable dynamic loops altogether (i.e., disable the pipelining if the loop has non-constant bounds).

@ThomasRaoux this is unfortunately in the main codebase. If this solution sounds fine, do you think it would be OK to open a PR? It would be my first on the main repo, so would you mind if I added you as a reviewer?

Thanks,
Giuseppe

@ThomasRaoux
Contributor

The solution

I came up with two solutions:
a) A don't-care solution, where it's fine to execute more operations than necessary. This holds in our case, for instance, because the tile size is still k_c even though we are executing <k_c iterations. However, this solution is possibly suboptimal.
b) A loop-switch solution, where if the number of iterations is bigger than the number of stages we pipeline; otherwise we "disable" the pipelining, i.e., we revert to the original loop.
In theory there could be a more fine-grained solution where we put if-else around the different stages, but I didn't implement this (not only because of the complexity, but also because I am not sure it would make sense performance-wise).

Solution a) is potentially incorrect, right? Wouldn't executing more iterations potentially cause out-of-bounds accesses?

Solution b) sounds like either doing loop peeling (basically peeling off the K % s iterations of the original loop) or loop versioning (after tiling, versioning the loop: if (numIterations < s) original_loop else pipelined_loop). My gut feeling is that peeling would be more efficient, but both basically duplicate the loop. I don't think this should be a big concern in terms of perf, but it does increase code size.

I added an option to enable solution a), to enable solution b), or to disable dynamic loops altogether (i.e., disable the pipelining if the loop has non-constant bounds).

@ThomasRaoux this is unfortunately in the main codebase. If this solution sounds fine, do you think it would be OK to open a PR? It would be my first on the main repo, so would you mind if I added you as a reviewer?

Yes, patches are welcome and I'm happy to review them. Note that mlir doesn't use github PRs (https://mlir.llvm.org/getting_started/Contributing/#contributing-code).

@giuseros
Collaborator Author

giuseros commented Jan 6, 2022

Hi @ThomasRaoux ,
Yes to many of your points. Sorry, I still lack the proper terminology :). About a): it's true, it might be incorrect, but there are cases where the user might not care (like mine). Option b) is exactly loop versioning and (correct again) it increases the code size.

I am not sure about loop peeling. Actually, peeling by K%s is already what happens with pipelining, right? The first s-1 stages are executed in the prologue of the loop. The problem is exactly that some of that prologue should not be executed. I could have added logic to decide which iterations to execute, but I felt this would have impacted performance. So peeling vs versioning seemed to me like the usual time/space tradeoff.

Thanks for the link, I will give it a go!

@ThomasRaoux
Contributor

Actually, peeling by K%s is already what happens with pipelining, right? The first s-1 stages are executed in the prologue of the loop.

Sorry, I meant peeling off the K % k_c iterations. If you do that before tiling, you ensure that the number of iterations is divisible by k_c, so you only have full tiles remaining.
You would end up with:

// Main loop over round_down(K, k_c) iterations: every k_c tile is full,
// so it can be tiled and pipelined with static bounds.
for (k = 0; k < round_down(K, k_c); k++) {
  ...
}
// Remainder small loop that won't get tiled and pipelined:
for (k = round_down(K, k_c); k < K; k++) {
  ...
}

Anyway it sounds like we agree.

BTW if you have some good improvements to the pipeliner feel free to upstream them anyway when you have time, it might be useful next time :)
