-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
systolic-array: Fix bugs for small PE configurations. #45
Conversation
This fixes the hang in small PE configurations (#43). Unit tests will be added in next PRs. |
To compute a large convolution, we tile the operation into two levels of "folds". First, the kernels are tiled into (numKerns / peArrayCols) weight folds (as each PE column will finish a kernel), and within each weight fold, each output feature map is further tiled into (outputRows * outputCols / peArrayRows) output folds (as each PE row is responsible for producing a output channel). Each PE row has a commit unit that collects finished results from the row and writes the data to the output scratchpad. The current write size is always set to the line size of the scratchpad, which is not correct for a small PE column size where the entire row doesn't form a line. This commit fixes this by taking into account when peArrayCols is smaller than elemsInLine. Also, this also fixes the incorrect initial tensor iterator place for the commit units. Change-Id: Iacc69f5d4cfa12512f25a19879b7e1aaa292a9d9
93cfb1f
to
7cc5300
Compare
src/systolic_array/commit.cpp
Outdated
for (int i = 0; i < inputs.size() / elemsPerLine; i++) { | ||
for (int j = 0; j < elemsPerLine; j++) { | ||
for (int i = 0; i < ceil(float(inputs.size()) / elemsPerLine); i++) { | ||
for (int j = 0; j < elemsPerLine && j < inputs.size(); j++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could add a var for numElemsToWrite = min(elemsPerLine, inputs.size())
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Added a var for that. I also used a slightly different way to visit all the inputs. Doing it by keeping track of remaining unvisted inputs seems to be easier to handle the boundary.
src/systolic_array/commit.cpp
Outdated
// | ||
// There are two cases where the commit unit will never see some of output | ||
// pixels ready: 1) The commit unit is not used at all, which means the whole | ||
// PE row is left idle. 2) Some of the PE columns are left idle due to a lack | ||
// of weights. In this case, we should do a writeback once all the "active" | ||
// columns have produced outputs. | ||
for (int i = 0; i < inputs.size() / elemsPerLine; i++) { | ||
for (int j = 0; j < elemsPerLine; j++) { | ||
for (int i = 0; i < ceil(float(inputs.size()) / elemsPerLine); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a var for numLines = ceil(float(...))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the refactoring, this seems no longer needed.
@@ -163,14 +167,14 @@ void Commit::localSpadCallback(PacketPtr pkt) { | |||
// If the outputs are finished, do the activation function before we send | |||
// the outputs back to the scratchpad. | |||
activationFunc(lineSlotPtr->getDataPtr<uint8_t>(), | |||
elemsPerLine, | |||
pkt->getSize() / accel.elemSize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If pkt size is not a multiple of accel.elemSize, what happens? Is that possible? If not, please add an assertion or fatal warning to enforce it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pkt size will always be a multiple of accel.elemSize. Added the assert.
src/systolic_array/commit.cpp
Outdated
// Copy data from the buffer for the collected data. | ||
for (int i = 0; i < elemsPerLine; i++) { | ||
for (int i = 0; i < reqSize / accel.elemSize; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, integer divides make me nervous, because they're so easy to get wrong in C++ and the compiler often won't even warn you. In general, I would extract divisions into temporary variables so that any bugs can only happen in one place. In this case, the divide is actually unnecessary, because this is the same as min(elemsPerLine, inputs.size())
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I changed this function to take two arguments: the start index of the elements, and the number of elements to write. This is clearer to me.
@@ -34,7 +36,7 @@ void Commit::setParams() { | |||
{ 0, 0, 0, 0 }, | |||
{ 1, accel.outputRows, accel.outputCols, accel.peArrayCols }); | |||
// Move the iterator to the correct starting place. | |||
iter += { 0, 0, 0, accel.lineSize / accel.elemSize * id }; | |||
iter += { 0, 0, id, 0 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate why this is the correct place now? I'm a bit confused by the previous code - were we relying on the iterator math to carry over the addition from dim[0] to dim[1]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was confused by the previous code too.. Yes seems like it was relying on the carryover from the channel dimension to the width dimension. However, since one PE row is responsible for producing an output channel, commit unit i should start at [0, 0, i, 0] and once it finishes that channel, it advances to [0, 0, 2*i, 0].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to mention here the layout format used in the systolic array is NHWC.
e50e9f7
to
229c1db
Compare
src/systolic_array/commit.cpp
Outdated
remainingElems -= elemsPerLine) { | ||
// Check if we have collected all the pixels for a writeback. | ||
int elemsToWrite = std::min(elemsPerLine, remainingElems); | ||
int index = inputs.size() - remainingElems; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd pick a more descriptive name than "index". Something that tells me it represents the "start" of something would be better (based on how it's used in queueCommitRequest).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Changed the name to start
.
src/systolic_array/commit.cpp
Outdated
int elemsToWrite = std::min(elemsPerLine, remainingElems); | ||
int index = inputs.size() - remainingElems; | ||
if (isLineComplete(index, elemsToWrite)) | ||
queueCommitRequest(index, elemsToWrite); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is moving this up here functionally equivalent? IsLineComplete reads outputBuffer, which is modified by the loop below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's functionally equivalent but I figured moving this up here is better in terms of modeling the hardware. Yes outputBuffer is modified by the loop below, but the hardware should see that and create a commit request in the next cycle (if all elements in the line are collected).
Change-Id: I18ff5af077753a776f382ad386529cace5b9ed8e
229c1db
to
bf2e4aa
Compare
To compute a large convolution, we tile the operation into two levels of
"folds". First, the kernels are tiled into (numKerns / peArrayCols) weight
folds (as each PE column will finish a kernel), and within each weight fold,
each output feature map is further tiled into (outputRows * outputCols /
peArrayRows) output folds (as each PE row is responsible for producing a output
channel). Each PE row has a commit unit that collects finished results from the
row and writes the data to the output scratchpad. The current write size is
always set to the line size of the scratchpad, which is not correct for a small
PE column size where the entire row doesn't form a line. This commit fixes this
by taking into account when peArrayCols is smaller than elemsInLine.
Also, this fixes the incorrect initial tensor iterator place for the commit units.