-
I describe our approach as streaming rather than double buffering, because the compute values are always read from the same buffer in LDS. We stream tiles through VGPRs into that one buffer, so we are really relying on higher occupancy to get performance.
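To make the streaming versus double-buffer distinction concrete, here is a minimal Python sketch. It is not rocMLIR code: the function names and the tuple-based trace format are hypothetical, and barriers are left out so that only the choice of LDS buffer is visible.

```python
def streaming(num_tiles):
    """Single LDS buffer: each tile is staged in VGPRs, then written to buffer 0."""
    ops = []
    for k in range(num_tiles):
        ops.append(("global_load", k, "vgpr"))  # global memory -> VGPRs
        ops.append(("ds_write", k, 0))          # VGPRs -> LDS buffer 0, always
        ops.append(("compute", k, 0))           # compute reads LDS buffer 0, always
    return ops


def double_buffered(num_tiles):
    """Two LDS buffers: tile k uses buffer k % 2, so reads and writes alternate."""
    ops = []
    for k in range(num_tiles):
        buf = k % 2
        ops.append(("global_load", k, "vgpr"))
        ops.append(("ds_write", k, buf))
        ops.append(("compute", k, buf))
    return ops


def buffers_read(ops):
    """Set of LDS buffers that the compute steps read from."""
    return {buf for kind, _, buf in ops if kind == "compute"}


if __name__ == "__main__":
    print(buffers_read(streaming(4)))
    print(buffers_read(double_buffered(4)))
```

In the streaming trace every compute step reads the same buffer, which is the property the comment above describes. In the double-buffered trace the compute steps alternate between two buffers.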
-
Having looked at our code, I'm still not sure we do software pipelining at all. Let me state the theory first.
Once software-pipelined, the loop should look like:
The above is just an example that hides load latency behind compute and accumulate.
I think ours is not doing software pipelining because I read our code as follows:
The above just peels the first iteration out of the loop; it does not software-pipeline it.
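The difference between peeling and software pipelining can be sketched in a few lines of Python. This is a generic illustration, not the rocMLIR loop: `naive`, `peeled`, `pipelined`, and `hidden_by_compute` are hypothetical names, and each loop is reduced to an ordered trace of `load` and `compute` steps.

```python
def naive(n):
    """Plain loop: load tile k, then immediately compute on it."""
    ops = []
    for k in range(n):
        ops += [("load", k), ("compute", k)]
    return ops


def peeled(n):
    """First iteration hoisted out of the loop; the order of operations is unchanged."""
    ops = [("load", 0), ("compute", 0)]
    for k in range(1, n):
        ops += [("load", k), ("compute", k)]
    return ops


def pipelined(n):
    """Software-pipelined: the load for tile k+1 is issued before compute on tile k."""
    ops = [("load", 0)]                           # prologue
    for k in range(n - 1):
        ops += [("load", k + 1), ("compute", k)]  # steady state
    ops.append(("compute", n - 1))                # epilogue
    return ops


def hidden_by_compute(ops):
    """Per tile, count compute steps issued between load(k) and compute(k)."""
    pos = {op: i for i, op in enumerate(ops)}
    n = sum(1 for kind, _ in ops if kind == "load")
    return [
        sum(1 for kind, _ in ops[pos[("load", k)] + 1 : pos[("compute", k)]]
            if kind == "compute")
        for k in range(n)
    ]


if __name__ == "__main__":
    for name, fn in [("naive", naive), ("peeled", peeled), ("pipelined", pipelined)]:
        print(name, hidden_by_compute(fn(4)))
```

The peeled trace is identical to the naive one, so no compute is available to cover any load. In the pipelined trace every tile after the first has the previous tile's compute between its load and its use, which is what hides the load latency.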
-
I would like to discuss how we implement software pipelining in rocMLIR.
In the graphs, a light color means `global_load` has been issued but the data is not ready yet. A dark color means the data is committed.
Double-buffer case
We do `global_load` before `lds_barrier` at the beginning of the loop. So at the moment of `global_load` we need two sets of VGPRs: one to wait for `global_load` and the other to wait for `ds_write`. As shown in the graph below:
If we switch `global_load` and `lds_barrier`, as shown in the graph below, we'll need only one set of VGPRs.
Triple-buffer case
By doing `lds_barrier` before `global_load`, we can achieve triple-buffer software pipelining with two sets of VGPRs, as shown below.
Note that at every moment, only two sets of VGPRs are in use.
So my question for @krzysz00 and @sjw36: