Implementing different software pipelining policies in rocMLIR #1226

giuseros · 2023-09-05T10:53:35Z

giuseros
Sep 5, 2023
Collaborator

Intro

This is about this ticket: https://github.com/ROCmSoftwarePlatform/rocMLIR-internal/issues/1075

After some investigation, I found out that different libraries pipeline the loop in different ways. I summed up my findings here:
https://confluence.amd.com/display/MLSE/Pipelinining

My main question is: how do we move from our hardcoded software pipeline to a position where we can choose among the methods above?

The idea I have is to write a normal loop during the gridwise pass, and then to create a "pipeline pass" that can restructure the loop in the way we want.

If you look at the Tensile pipeline, they pipeline the mfma/lds_read (so that they can run in parallel). So I think the pass should go after we do the threadwise lowering. This has the additional benefit that the blockwiseToThreadwise pass will also be simplified (because it is now the pipeline pass's responsibility to decide the loop structure).

Possible structure

So the structure I have in mind is:

GridwiseToBlockwise (no pipelining)
BlockwiseToThreadwise
ThreadwiseLowering (normal straight loops)
PipelinePass : this will decide the loop structure and the barrier placement

Implementation strategy

1. As a first cut, we could simply introduce an NFC change to implement the pipeline via a pass (instead of hardcoding it in the gridwiseToBlockwise pass).
2. Then we can introduce the 2 additional pipelining strategies, disabled by default.
3. To pick among the different pipelining strategies, we can either a) let the tuner decide, or b) introduce an analysis pass to come up with a heuristic to select one or another.

Please note that the analysis pass in point 3 might also be beneficial in other parts of the code (e.g., if we approximate the occupancy of the kernel, this can be used by the grid layout heuristic).

What do you guys think?

giuseros · 2023-09-05T10:54:02Z

giuseros
Sep 5, 2023
Collaborator Author

cc @manupak , @krzysz00 , @sjw36

4 replies

manupak Sep 5, 2023
Collaborator

I agree with the pass ordering.

Let me sketch out what I had in mind for a better pipelining scheme within rocMLIR.

My idea has been that in kernel lowering we define 'stages', which basically have the form:
read from some memory --> compute --> store to some memory. Let's give it the (admittedly bad) name rock.stage for the sake of this discussion.

rock.stage (ins: [input memrefs]) (outs: [output memrefs]) {
    rock.threadwise_read_into // we can make these work for register memrefs as well
    rock.transforming_for // to do compute
    rock.threadwise_write_all // we can make these work for register memrefs as well
}

So we'd end up having a kernel be a composition of potentially pipelineable stages, say:

rock.stage0 
rock.stage1
...
rock.stageN

More generally one could have parallel stages as well :

rock.stage0
rock.stage1 {rock.stage1a | rock.stage1b}
rock.stage2
...
rock.stageN

What we do today (also known as "stream" pipelining) is overlap stages that do not share any memory they read from or write to, which is quite a reasonable thing to do because it does not need double buffering.

However, if we need to pipeline consecutive stages, then the memories they share will need double buffering.
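The overlap rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Stage` class, the buffer names, and `overlap_plan` are hypothetical and do not correspond to any existing rocMLIR API. Two stages are treated as freely overlappable when neither touches a buffer the other writes; otherwise the conflicting buffers are reported as candidates for double buffering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical stand-in for a rock.stage: the buffers it reads and writes."""
    name: str
    ins: frozenset
    outs: frozenset


def shared_buffers(a: Stage, b: Stage) -> frozenset:
    """Buffers that prevent overlapping `a` and `b` without extra copies."""
    # One stage writes what the other reads, or both write the same buffer.
    return (a.outs & b.ins) | (a.ins & b.outs) | (a.outs & b.outs)


def overlap_plan(a: Stage, b: Stage):
    """Return ("overlap", {}) or ("double_buffer", conflicting buffers)."""
    conflicts = shared_buffers(a, b)
    if not conflicts:
        return ("overlap", frozenset())
    return ("double_buffer", conflicts)


# Illustrative stages; the buffer names are made up for this example.
global_read = Stage("global_read", frozenset({"global_A"}), frozenset({"regs_A"}))
lds_write = Stage("lds_write", frozenset({"regs_A"}), frozenset({"lds_A"}))
gemm = Stage("gemm", frozenset({"lds_A"}), frozenset({"acc"}))

print(overlap_plan(global_read, gemm))  # no shared buffer: can overlap as-is
print(overlap_plan(lds_write, gemm))    # consecutive stages share lds_A
```

In this toy model the global read and the gemm can overlap directly (the "stream" case), while the LDS write and the gemm share `lds_A` and would need a second LDS buffer to be pipelined.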

From a theoretical perspective, to build an optimal thread pipeline we just need to know the latency of each stage. However, that could be suboptimal in terms of overall GPU performance: to achieve optimal pipelining we might end up creating higher register pressure and/or higher LDS utilization. Hence the optimal threadwise pipeline might not lead to optimal GPU performance if the kernel is compute bound.

Towards the optimal thread pipeline

Tuning

Tuning is always a possibility for deciding where to break the stages and fold them. However, it would tend toward a brute-force approach here.

Static heuristic

It would be nice if we could come up with one. It could be an outcome of an empirical evaluation of the kernel configurations we care about.

Profile driven

Another option is to add an instrumentation run that profiles the performance of each stage, to decide where we should fold to create the pipeline.

Towards the optimal GPU pipeline

I consider this to be the harder problem of the two.
I personally think this deviates from the optimal thread pipeline, specifically due to occupancy.

Just a thought:
One option here is to treat this as a multi-variate optimization over {latency, LDS usage, register pressure}, construct a Pareto-optimal curve of solutions that tries to minimize all three, and then pick one of the Pareto-optimal solutions based on either tuning or a heuristic.
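A minimal sketch of the Pareto selection step, assuming each candidate pipelining has already been given an estimate for the three metrics. The `Candidate` type, the candidate names, and all the numbers are invented purely for illustration; they are not measurements.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical pipelining candidate; lower is better for every metric."""
    name: str
    latency: float   # estimated cycles per main-loop iteration
    lds_bytes: int   # LDS usage
    vgprs: int       # register pressure


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on all metrics and better on one."""
    am, bm = a[1:], b[1:]
    return all(x <= y for x, y in zip(am, bm)) and any(
        x < y for x, y in zip(am, bm)
    )


def pareto_front(cands):
    """Keep only candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Made-up numbers, only to exercise the functions.
candidates = [
    Candidate("no_pipelining", 100.0, 16384, 64),
    Candidate("stream", 80.0, 16384, 96),
    Candidate("double_buffer", 70.0, 32768, 96),
    Candidate("wasteful", 90.0, 32768, 128),
]

front = pareto_front(candidates)
print([c.name for c in front])
```

The tuner or a heuristic would then only have to choose among the surviving candidates, rather than among all possible ways of folding the stages.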

Summary

I think we are saying the same thing in terms of short-term goals here. However, I'd like an extensible abstraction.
E.g.: I suppose you have a few ideas of how to 'stage' the kernel. My suggestion is to break it down into stages that are as small as possible and consider a few (2 or 3) distinct points to pipeline. We could even consider adding an IR marker, rock.pipeline_barrier, to denote these.

manupak Sep 5, 2023
Collaborator

I agree with the pass ordering.

A correction: the pipeline pass should go above threadwise lowering, and I claim we should move threadwise_read_into and threadwise_write_all into the threadwise lowering pass.

giuseros Sep 5, 2023
Collaborator Author

I think I agree with most of what you said. As we agreed, we can create utilities to create/merge/split stages. Re ordering: OK, so that should be:

GridwiseToBlockwise (no pipelining)
BlockwiseToThreadwise (normal straight loops)
PipelinePass : this will decide the loop structure and the barrier placement and will operate on threadwise operations grouped in stages.
ThreadwiseLowering

giuseros Sep 7, 2023
Collaborator Author

A bit more on this one.

The current status

In the current state of our code we:

pipeline the gridwise loop manually (i.e., creating affine ops)
schedule the blockwise gemm loops manually (i.e., creating affine ops)

From what I understand, we want to automate both, because somehow pipelining and loop scheduling are connected. My main question is: how to do that without degrading current performance?

So my first step would be to implement a design that reproduces the current situation, and then move to more sophisticated pipelining strategies.

What IR do we produce today?

This is the IR that we produce after the gridwiseToBlockwise pass.

rock.threadwise_read_into
rock.threadwise_read_into
rock.threadwise_transpose
rock.threadwise_transpose
rock.threadwise_write_all
rock.threadwise_write_all
affine.for %arg3 = 0 to 12 { // main for loop over k
    rock.threadwise_read_into
    rock.threadwise_read_into
    rock.lds_barrier
    rock.blockwise_gemm
    rock.lds_barrier
    rock.threadwise_transpose
    rock.threadwise_transpose
    rock.threadwise_write_all
    rock.threadwise_write_all
}

And after the blockwiseToThreadwise pass we have the blockwise IR in this form:

affine.for %arg3 = 0 to 12 { // main for loop over k
    rock.threadwise_read_into
    rock.threadwise_read_into
    rock.threadwise_transpose
    rock.threadwise_transpose
    rock.threadwise_write_all
    rock.threadwise_write_all
    rock.lds_barrier
    affine.for %arg4 = 0 to 1 { // mRepeat
        affine.for %arg5 = 0 to 2 { // load A[mRepeat,:]
            memref.load
            memref.load
            memref.store
            memref.store
        }
        affine.for %arg5 = 0 to 1 { // nRepeat
            affine.for %arg6 = 0 to 2 { // load B[nRepeat,:]
                memref.load
                memref.load
                memref.store
                memref.store
            }
            rock.accel_gemm (A,B)
        }
    }
}

Note how the internal loop is following a given schedule to mirror the loop that CK does in their code.

A new set of passes

My understanding is that we want to get rid of the manual pipelining/scheduling which should be the result of our pipelining/loop_scheduling passes. So we should start from this IR:

affine.for %arg3 = 0 to 12 { // main for loop over k
    rock.threadwise_read_into
    rock.threadwise_read_into
    rock.threadwise_transpose
    rock.threadwise_transpose
    rock.threadwise_write_all
    rock.threadwise_write_all
    rock.lds_barrier
    rock.blockwise_gemm
}

And after those passes I think the optimal threadwise IR (before threadwise lowering) should be:

rock.threadwise_read_into
rock.threadwise_read_into
rock.threadwise_transpose
rock.threadwise_transpose
rock.threadwise_write_all
rock.threadwise_write_all
affine.for %arg3 = 0 to 12 { // main for loop over k
    rock.threadwise_read_into
    rock.threadwise_read_into
    rock.lds_barrier
    affine.for %arg4 = 0 to 1 { // mRepeat
        affine.for %arg5 = 0 to 2 { // load A[mRepeat,:]
            rock.threadwise_read_into
        }
        affine.for %arg5 = 0 to 1 { // nRepeat
            affine.for %arg6 = 0 to 2 { // load B[nRepeat,:]
                rock.threadwise_read_into
            }
            rock.accel_gemm (A,B)
        }
    }
    rock.threadwise_transpose
    rock.threadwise_transpose
    rock.lds_barrier
    rock.threadwise_write_all
    rock.threadwise_write_all
}

And this should be lowered exactly to our current IR.
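To make the restructuring concrete, here is a toy Python model of the transformation from the straight loop to the prologue-plus-loop form. It is a sketch under assumptions, not the rocMLIR implementation: scalars stand in for tiles, a multiply-accumulate stands in for the gemm, and the pipelined version runs n-1 iterations followed by an explicit epilogue, which the pseudo-IR above does not spell out.

```python
def straight_loop(tiles_a, tiles_b):
    """Unpipelined loop: load tile k into LDS, then compute on it."""
    acc = 0
    for k in range(len(tiles_a)):
        lds = (tiles_a[k], tiles_b[k])  # global read + LDS write
        # barrier: LDS now holds tile k
        acc += lds[0] * lds[1]          # stand-in for blockwise_gemm
    return acc


def pipelined_loop(tiles_a, tiles_b):
    """Same computation with the load for tile k+1 hoisted above the gemm on tile k."""
    n = len(tiles_a)
    acc = 0
    # Prologue: bring tile 0 into LDS before entering the loop.
    lds = (tiles_a[0], tiles_b[0])
    for k in range(n - 1):
        # Global read for the NEXT tile goes to registers; on hardware this
        # can be in flight while the gemm below is running.
        regs = (tiles_a[k + 1], tiles_b[k + 1])
        # barrier: LDS holds tile k
        acc += lds[0] * lds[1]
        # barrier: every thread is done reading LDS, safe to overwrite
        lds = regs
    # Epilogue: compute on the last tile left in LDS.
    acc += lds[0] * lds[1]
    return acc


a = list(range(1, 13))       # 12 tiles, matching the trip count above
b = list(range(12, 0, -1))
print(straight_loop(a, b) == pipelined_loop(a, b))
```

A pipeline pass would have to perform this rewrite on the IR while keeping both barriers, since they are what makes reusing a single LDS buffer safe.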

So we have blockwiseIR -> pass1 -> pass2 -> passN -> threadwiseIR

So my question is: what are the passes we need to move from blockwiseIR to threadwiseIR? I think @manupak is suggesting to wrap parts of the code into stages:

affine.for %arg3 = 0 to 12 { // main for loop over k
    rock.stage {
        rock.threadwise_read_into
        rock.threadwise_read_into
        rock.threadwise_transpose
        rock.threadwise_transpose
    }
    rock.stage {
        rock.threadwise_write_all
        rock.threadwise_write_all
    }
    rock.lds_barrier
    rock.stage {
        rock.blockwise_gemm
    }
}

But then how do we move from this IR to the one we currently generate? I mean, how do we get from the IR above to the optimal threadwise IR?

sjw36 · 2023-09-06T15:10:21Z

sjw36
Sep 6, 2023
Maintainer

Point of reference:
The Triton SW pipelining pass is based on SCF LoopPipelining: mlir/lib/Dialect/SCF/Transforms/LoopPipelining.cpp
At a lower level, Triton's SW pipelining simply walks a loop looking for loads into shared memory that can be peeled off.
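As a rough illustration of that walk, the toy model below scans a loop body for loads into shared memory whose operands are all available before the body runs. This is not Triton's actual implementation: the `Op` class, the `kind` strings, and the "operands depend only on the induction variable or loop invariants" criterion are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """Hypothetical loop-body operation in a toy IR."""
    name: str
    kind: str             # "load_to_shared", "compute", ...
    operands: tuple = ()  # names of the values this op consumes


def peelable_loads(body, induction_var="k", loop_invariants=()):
    """Names of shared-memory loads that depend only on values known
    before the iteration starts, so they could be issued one iteration early."""
    available = {induction_var, *loop_invariants}
    return [
        op.name
        for op in body
        if op.kind == "load_to_shared"
        and all(v in available for v in op.operands)
    ]


body = [
    Op("load_a", "load_to_shared", ("ptr_a", "k")),
    Op("load_b", "load_to_shared", ("ptr_b", "k")),
    Op("mma", "compute", ("load_a", "load_b", "acc")),
    Op("load_c", "load_to_shared", ("mma",)),  # depends on in-loop compute
]

print(peelable_loads(body, loop_invariants=("ptr_a", "ptr_b")))
```

Here `load_a` and `load_b` are candidates for peeling, while `load_c` is not because it consumes a value produced inside the same iteration.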

0 replies

giuseros · 2023-09-13T14:06:59Z

giuseros
Sep 13, 2023
Collaborator Author

Ok, so I think the discussion can be wrapped up in this sense:

The schedule, i.e., the order of the loops, is an input to the pipeline pass. The schedule can be manually constructed, or constructed via a set of transform passes
Once we get the schedule, we can use the pipeline pass to pipeline the loop

So we have two options ahead of us:
a) We convert our manual schedule into a set of transform passes and, once the loop is scheduled, we use the transform.pipeline pass
b) We keep our manual schedule and use a pipeline pass to pipeline it, possibly reusing the SCF loop pipelining that Simon mentioned

I would start with b) and then we can refactor things into a)

0 replies
