Software pipelining #1177

zhanglx13 · 2023-07-31T04:41:22Z

zhanglx13
Jul 31, 2023
Collaborator

I would like to discuss how we implement software pipelining in rocMLIR.
In the graphs, light color means global_load is issued but data is not ready. Dark color means data is committed.

Double-buffer case

Since we do global_load before lds_barrier at the beginning of the loop, at the moment of global_load we need two sets of VGPRs: one waiting for global_load and the other waiting for ds_write, as shown in the graph below:

If we switch the global_load and lds_barrier, as shown in the graph below, we'll need only one set of VGPRs.

Triple-buffer case

By doing lds_barrier before global_load, we can achieve triple-buffer software pipelining with two sets of VGPRs, as shown below.

Note that at every moment, only two sets of VGPRs are in use.
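To make the VGPR counting above concrete, here is a minimal toy model in Python. It is only a sketch under my own assumptions, not rocMLIR code: a VGPR set becomes live when global_load is issued into it and is freed once ds_write has copied that tile to LDS; "double-buffer" is modeled as prefetching one tile ahead and "triple-buffer" as prefetching two tiles ahead. The function name `peak_vgpr_sets` and its parameters are hypothetical.

```python
def peak_vgpr_sets(barrier_first, prefetch_distance, num_tiles):
    """Return the largest number of VGPR sets live at once in a toy schedule.

    Assumption: a set is live from the moment global_load is issued into it
    until ds_write has copied that tile into LDS.
    """
    # Prologue: loads for the first `prefetch_distance` tiles are in flight.
    live = set(range(min(prefetch_distance, num_tiles)))
    peak = len(live)
    for k in range(num_tiles):
        nxt = k + prefetch_distance
        if barrier_first:
            # lds_barrier; ds_write(k); global_load(nxt)
            steps = [("ds_write", k), ("global_load", nxt)]
        else:
            # global_load(nxt); lds_barrier; ds_write(k)
            steps = [("global_load", nxt), ("ds_write", k)]
        # The barrier itself does not change which sets are live,
        # so it is only reflected in the ordering of the two steps.
        for op, tile in steps:
            if op == "global_load" and tile < num_tiles:
                live.add(tile)
            elif op == "ds_write":
                live.discard(tile)
            peak = max(peak, len(live))
    return peak


for distance in (1, 2):
    for barrier_first in (False, True):
        print(distance, barrier_first,
              peak_vgpr_sets(barrier_first, distance, num_tiles=8))
```

Under these assumptions the model reports 2 sets for load-before-barrier and 1 set for barrier-before-load when prefetching one tile ahead, and 3 versus 2 sets when prefetching two tiles ahead, which matches the counts described above.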

So my questions for @krzysz00 and @sjw36:

Are we using two sets of VGPRs in the current software pipelining implementation of rocMLIR?
Have we considered using more sets of VGPRs to achieve n-buffer software pipelining?

sjw36 · 2023-08-01T15:42:23Z

sjw36
Aug 1, 2023
Maintainer

If there is no transpose/swizzle between global_load and ds_write, then the compiler should use the same registers. The values will be read out by ds_write first, before the global_load copies in new values.
Using more sets of VGPRs would certainly create more register pressure and probably lower occupancy. It would be worthwhile asking CK if they attempted this approach.

1 reply

zhanglx13 Aug 2, 2023
Collaborator Author

In upstream Triton, the number of buffers is a tuning parameter. We can do the same. But asking the CK people should be a reasonable first step.

sjw36 · 2023-08-01T15:44:18Z

sjw36
Aug 1, 2023
Maintainer

I am referring to our approach as a stream rather than a double buffer, since the compute values are always taken from the same buffer in LDS. We stream tiles through VGPRs to the same buffer. We are really relying on more occupancy to get perf.


manupak · 2023-08-02T16:03:15Z

manupak
Aug 2, 2023
Collaborator

Having looked at our code, I'm still not sure we do software pipelining at all.

Let me state the theory first:

for k : 0 to K - 1
   load k -- load here global -> regs --> regs --> lds (there is NO SW pipeline between these internal stages)
   compute k
   store k

Once SW pipelined, it should look like:

load 0
for k : 1 to K - 1
  load k
  compute & accumulate k - 1
compute & accumulate K - 1
store to global

The above is just an example that hides load latency behind compute & accumulate.
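The two loop shapes above can be sketched in Python to check that the pipelined form does the same work in a different order. This is only an illustration: `load` and `compute` are hypothetical stand-ins (a list lookup and a multiply), and sequential Python cannot show the actual latency hiding, only the reordering.

```python
def load(tiles, k):
    """Stand-in for the whole global -> regs -> LDS path: fetch tile k."""
    return tiles[k]


def compute(tile):
    """Stand-in for the per-tile compute: multiply the two tile values."""
    a, b = tile
    return a * b


def naive(tiles):
    """load k, then compute k, in every iteration."""
    acc = 0
    for k in range(len(tiles)):
        tile = load(tiles, k)
        acc += compute(tile)
    return acc


def pipelined(tiles):
    """Prologue load, loop that loads k while computing k - 1, epilogue compute.

    Assumes at least one tile.
    """
    acc = 0
    current = load(tiles, 0)          # prologue: load 0
    for k in range(1, len(tiles)):
        upcoming = load(tiles, k)     # load k
        acc += compute(current)       # compute & accumulate k - 1
        current = upcoming
    acc += compute(current)           # epilogue: compute & accumulate K - 1
    return acc


tiles = [(1, 2), (3, 4), (5, 6)]
print(naive(tiles), pipelined(tiles))
```

Both functions accumulate the same value; the difference is that in `pipelined` the load for iteration k has no dependency on the compute that follows it, which is what would let hardware overlap the two.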

The reason I think ours is not doing software pipelining is that I read our code as follows:

load 0  -- load here global -> regs --> regs --> lds (there is NO SW pipeline between these internal stages)
compute 0
for k : 1 to K - 1
  load k
  compute & accumulate k
store to global

The above just peels the first iteration out of the loop but does not SW pipeline.
I must surely be missing something here...

1 reply

manupak Aug 2, 2023
Collaborator

ok my bad... we do

load1 0;  load here global -> regs
load2 0;  load here regs -> lds
for k : 1 to K - 1
  load1 k;  load here global -> regs
  compute & accumulate k - 1
  load2 k;  load here regs -> lds
compute & accumulate K - 1
store to global
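As a rough illustration of this schedule, the sketch below records the order in which the two load stages and the compute run. It is not rocMLIR code: `stream_pipeline`, `load1`, `load2`, and `compute` are hypothetical names, the single `lds` variable models one LDS buffer that is overwritten each iteration, and barriers are omitted.

```python
def stream_pipeline(tiles):
    """Run the split-load schedule and return (accumulated sum, event trace).

    Assumes at least one tile; each tile is a list of numbers.
    """
    trace = []

    def load1(k):
        # Stage 1: global -> regs
        trace.append(f"load1 {k}")
        return tiles[k]

    def load2(regs, k):
        # Stage 2: regs -> lds (copy into the single LDS buffer)
        trace.append(f"load2 {k}")
        return list(regs)

    def compute(lds, k):
        trace.append(f"compute {k}")
        return sum(lds)

    acc = 0
    regs = load1(0)
    lds = load2(regs, 0)
    for k in range(1, len(tiles)):
        regs = load1(k)               # global load for k is issued first
        acc += compute(lds, k - 1)    # compute on tile k - 1 still in LDS
        lds = load2(regs, k)          # only then overwrite LDS with tile k
    acc += compute(lds, len(tiles) - 1)
    return acc, trace


total, events = stream_pipeline([[1, 2], [3, 4], [5, 6]])
print(total)
print(events)
```

In the trace, `load1 k` always appears before `compute k - 1`, so the global load is in flight during the compute, while `load2 k` follows the compute because it overwrites the buffer the compute reads from.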
