systolic-array: Fix bugs for small PE configurations. #45

yaoyuannnn · 2022-01-19T05:52:56Z

To compute a large convolution, we tile the operation into two levels of
"folds". First, the kernels are tiled into (numKerns / peArrayCols) weight
folds (as each PE column will finish a kernel), and within each weight fold,
each output feature map is further tiled into (outputRows * outputCols /
peArrayRows) output folds (as each PE row is responsible for producing a output
channel). Each PE row has a commit unit that collects finished results from the
row and writes the data to the output scratchpad. The current write size is
always set to the line size of the scratchpad, which is not correct for a small
PE column size where the entire row doesn't form a line. This commit fixes this
by taking into account when peArrayCols is smaller than elemsInLine.

Also, this fixes the incorrect initial tensor iterator place for the commit units.

yaoyuannnn · 2022-01-19T05:54:26Z

This fixes the hang in small PE configurations (#43). Unit tests will be added in next PRs.

To compute a large convolution, we tile the operation into two levels of "folds". First, the kernels are tiled into (numKerns / peArrayCols) weight folds (as each PE column will finish a kernel), and within each weight fold, each output feature map is further tiled into (outputRows * outputCols / peArrayRows) output folds (as each PE row is responsible for producing a output channel). Each PE row has a commit unit that collects finished results from the row and writes the data to the output scratchpad. The current write size is always set to the line size of the scratchpad, which is not correct for a small PE column size where the entire row doesn't form a line. This commit fixes this by taking into account when peArrayCols is smaller than elemsInLine. Also, this also fixes the incorrect initial tensor iterator place for the commit units. Change-Id: Iacc69f5d4cfa12512f25a19879b7e1aaa292a9d9

xyzsam · 2022-01-19T17:20:03Z

src/systolic_array/commit.cpp

-  for (int i = 0; i < inputs.size() / elemsPerLine; i++) {
-    for (int j = 0; j < elemsPerLine; j++) {
+  for (int i = 0; i < ceil(float(inputs.size()) / elemsPerLine); i++) {
+    for (int j = 0; j < elemsPerLine && j < inputs.size(); j++) {


I think you could add a var for numElemsToWrite = min(elemsPerLine, inputs.size()).

Yes. Added a var for that. I also used a slightly different way to visit all the inputs. Doing it by keeping track of remaining unvisted inputs seems to be easier to handle the boundary.

xyzsam · 2022-01-19T17:20:26Z

src/systolic_array/commit.cpp

  //
  // There are two cases where the commit unit will never see some of output
  // pixels ready: 1) The commit unit is not used at all, which means the whole
  // PE row is left idle. 2) Some of the PE columns are left idle due to a lack
  // of weights. In this case, we should do a writeback once all the "active"
  // columns have produced outputs.
-  for (int i = 0; i < inputs.size() / elemsPerLine; i++) {
-    for (int j = 0; j < elemsPerLine; j++) {
+  for (int i = 0; i < ceil(float(inputs.size()) / elemsPerLine); i++) {


Add a var for numLines = ceil(float(...)).

After the refactoring, this seems no longer needed.

xyzsam · 2022-01-19T17:22:20Z

src/systolic_array/commit.cpp

@@ -163,14 +167,14 @@ void Commit::localSpadCallback(PacketPtr pkt) {
      // If the outputs are finished, do the activation function before we send
      // the outputs back to the scratchpad.
      activationFunc(lineSlotPtr->getDataPtr<uint8_t>(),
-                     elemsPerLine,
+                     pkt->getSize() / accel.elemSize,


If pkt size is not a multiple of accel.elemSize, what happens? Is that possible? If not, please add an assertion or fatal warning to enforce it.

The pkt size will always be a multiple of accel.elemSize. Added the assert.

xyzsam · 2022-01-19T17:24:09Z

src/systolic_array/commit.cpp

  // Copy data from the buffer for the collected data.
-  for (int i = 0; i < elemsPerLine; i++) {
+  for (int i = 0; i < reqSize / accel.elemSize; i++) {


In general, integer divides make me nervous, because they're so easy to get wrong in C++ and the compiler often won't even warn you. In general, I would extract divisions into temporary variables so that any bugs can only happen in one place. In this case, the divide is actually unnecessary, because this is the same as min(elemsPerLine, inputs.size()).

I agree. I changed this function to take two arguments: the start index of the elements, and the number of elements to write. This is clearer to me.

xyzsam · 2022-01-19T17:27:29Z

src/systolic_array/commit.cpp

@@ -34,7 +36,7 @@ void Commit::setParams() {
      { 0, 0, 0, 0 },
      { 1, accel.outputRows, accel.outputCols, accel.peArrayCols });
  // Move the iterator to the correct starting place.
-  iter += { 0, 0, 0, accel.lineSize / accel.elemSize * id };
+  iter += { 0, 0, id, 0 };


Can you elaborate why this is the correct place now? I'm a bit confused by the previous code - were we relying on the iterator math to carry over the addition from dim[0] to dim[1]?

I was confused by the previous code too.. Yes seems like it was relying on the carryover from the channel dimension to the width dimension. However, since one PE row is responsible for producing an output channel, commit unit i should start at [0, 0, i, 0] and once it finishes that channel, it advances to [0, 0, 2*i, 0].

Forgot to mention here the layout format used in the systolic array is NHWC.

xyzsam · 2022-01-20T16:56:26Z

src/systolic_array/commit.cpp

+       remainingElems -= elemsPerLine) {
+    // Check if we have collected all the pixels for a writeback.
+    int elemsToWrite = std::min(elemsPerLine, remainingElems);
+    int index = inputs.size() - remainingElems;


I'd pick a more descriptive name than "index". Something that tells me it represents the "start" of something would be better (based on how it's used in queueCommitRequest).

Yes. Changed the name to start.

xyzsam · 2022-01-20T16:57:21Z

src/systolic_array/commit.cpp

+    int elemsToWrite = std::min(elemsPerLine, remainingElems);
+    int index = inputs.size() - remainingElems;
+    if (isLineComplete(index, elemsToWrite))
+      queueCommitRequest(index, elemsToWrite);


Is moving this up here functionally equivalent? IsLineComplete reads outputBuffer, which is modified by the loop below.

Yes it's functionally equivalent but I figured moving this up here is better in terms of modeling the hardware. Yes outputBuffer is modified by the loop below, but the hardware should see that and create a commit request in the next cycle (if all elements in the line are collected).

Change-Id: I18ff5af077753a776f382ad386529cace5b9ed8e

yaoyuannnn changed the title ~~systolic-array: Fix bugs for small PE configuration.~~ systolic-array: Fix bugs for small PE configurations. Jan 19, 2022

yaoyuannnn force-pushed the fix_systolic_array branch from 93cfb1f to 7cc5300 Compare January 19, 2022 05:56

xyzsam reviewed Jan 19, 2022

View reviewed changes

yaoyuannnn force-pushed the fix_systolic_array branch 2 times, most recently from e50e9f7 to 229c1db Compare January 20, 2022 06:37

xyzsam reviewed Jan 20, 2022

View reviewed changes

systolic-array: Address comments.

bf2e4aa

Change-Id: I18ff5af077753a776f382ad386529cace5b9ed8e

yaoyuannnn force-pushed the fix_systolic_array branch from 229c1db to bf2e4aa Compare January 21, 2022 03:52

yaoyuannnn requested a review from xyzsam January 21, 2022 03:53

xyzsam approved these changes Jan 21, 2022

View reviewed changes

yaoyuannnn merged commit d4efbee into harvard-acc:master Jan 21, 2022

yaoyuannnn deleted the fix_systolic_array branch January 21, 2022 06:26

mateo-vm mentioned this pull request Feb 10, 2022

Systolic arrays sizes #43

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

systolic-array: Fix bugs for small PE configurations. #45

systolic-array: Fix bugs for small PE configurations. #45

yaoyuannnn commented Jan 19, 2022 •

edited

Loading

yaoyuannnn commented Jan 19, 2022

xyzsam Jan 19, 2022

yaoyuannnn Jan 20, 2022 •

edited

Loading

xyzsam Jan 19, 2022

yaoyuannnn Jan 20, 2022

xyzsam Jan 19, 2022

yaoyuannnn Jan 20, 2022

xyzsam Jan 19, 2022

yaoyuannnn Jan 20, 2022

xyzsam Jan 19, 2022

yaoyuannnn Jan 20, 2022 •

edited

Loading

yaoyuannnn Jan 20, 2022

xyzsam Jan 20, 2022

yaoyuannnn Jan 21, 2022

xyzsam Jan 20, 2022

yaoyuannnn Jan 21, 2022

systolic-array: Fix bugs for small PE configurations. #45

systolic-array: Fix bugs for small PE configurations. #45

Conversation

yaoyuannnn commented Jan 19, 2022 • edited Loading

yaoyuannnn commented Jan 19, 2022

Choose a reason for hiding this comment

yaoyuannnn Jan 20, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yaoyuannnn Jan 20, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yaoyuannnn commented Jan 19, 2022 •

edited

Loading

yaoyuannnn Jan 20, 2022 •

edited

Loading

yaoyuannnn Jan 20, 2022 •

edited

Loading