Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

systolic-array: Fix bugs for small PE configurations. #45

Merged
merged 2 commits into from
Jan 21, 2022

Conversation

yaoyuannnn
Copy link
Member

@yaoyuannnn yaoyuannnn commented Jan 19, 2022

To compute a large convolution, we tile the operation into two levels of
"folds". First, the kernels are tiled into (numKerns / peArrayCols) weight
folds (as each PE column will finish a kernel), and within each weight fold,
each output feature map is further tiled into (outputRows * outputCols /
peArrayRows) output folds (as each PE row is responsible for producing a output
channel). Each PE row has a commit unit that collects finished results from the
row and writes the data to the output scratchpad. The current write size is
always set to the line size of the scratchpad, which is not correct for a small
PE column size where the entire row doesn't form a line. This commit fixes this
by taking into account when peArrayCols is smaller than elemsInLine.

Also, this fixes the incorrect initial tensor iterator place for the commit units.

@yaoyuannnn
Copy link
Member Author

This fixes the hang in small PE configurations (#43). Unit tests will be added in next PRs.

@yaoyuannnn yaoyuannnn changed the title systolic-array: Fix bugs for small PE configuration. systolic-array: Fix bugs for small PE configurations. Jan 19, 2022
To compute a large convolution, we tile the operation into two levels of
"folds". First, the kernels are tiled into (numKerns / peArrayCols) weight
folds (as each PE column will finish a kernel), and within each weight fold,
each output feature map is further tiled into (outputRows * outputCols /
peArrayRows) output folds (as each PE row is responsible for producing a output
channel). Each PE row has a commit unit that collects finished results from the
row and writes the data to the output scratchpad. The current write size is
always set to the line size of the scratchpad, which is not correct for a small
PE column size where the entire row doesn't form a line. This commit fixes this
by taking into account when peArrayCols is smaller than elemsInLine.

Also, this also fixes the incorrect initial tensor iterator place for the
commit units.

Change-Id: Iacc69f5d4cfa12512f25a19879b7e1aaa292a9d9
for (int i = 0; i < inputs.size() / elemsPerLine; i++) {
for (int j = 0; j < elemsPerLine; j++) {
for (int i = 0; i < ceil(float(inputs.size()) / elemsPerLine); i++) {
for (int j = 0; j < elemsPerLine && j < inputs.size(); j++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could add a var for numElemsToWrite = min(elemsPerLine, inputs.size()).

Copy link
Member Author

@yaoyuannnn yaoyuannnn Jan 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Added a var for that. I also used a slightly different way to visit all the inputs. Doing it by keeping track of remaining unvisted inputs seems to be easier to handle the boundary.

//
// There are two cases where the commit unit will never see some of output
// pixels ready: 1) The commit unit is not used at all, which means the whole
// PE row is left idle. 2) Some of the PE columns are left idle due to a lack
// of weights. In this case, we should do a writeback once all the "active"
// columns have produced outputs.
for (int i = 0; i < inputs.size() / elemsPerLine; i++) {
for (int j = 0; j < elemsPerLine; j++) {
for (int i = 0; i < ceil(float(inputs.size()) / elemsPerLine); i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a var for numLines = ceil(float(...)).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the refactoring, this seems no longer needed.

@@ -163,14 +167,14 @@ void Commit::localSpadCallback(PacketPtr pkt) {
// If the outputs are finished, do the activation function before we send
// the outputs back to the scratchpad.
activationFunc(lineSlotPtr->getDataPtr<uint8_t>(),
elemsPerLine,
pkt->getSize() / accel.elemSize,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If pkt size is not a multiple of accel.elemSize, what happens? Is that possible? If not, please add an assertion or fatal warning to enforce it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pkt size will always be a multiple of accel.elemSize. Added the assert.

// Copy data from the buffer for the collected data.
for (int i = 0; i < elemsPerLine; i++) {
for (int i = 0; i < reqSize / accel.elemSize; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, integer divides make me nervous, because they're so easy to get wrong in C++ and the compiler often won't even warn you. In general, I would extract divisions into temporary variables so that any bugs can only happen in one place. In this case, the divide is actually unnecessary, because this is the same as min(elemsPerLine, inputs.size()).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I changed this function to take two arguments: the start index of the elements, and the number of elements to write. This is clearer to me.

@@ -34,7 +36,7 @@ void Commit::setParams() {
{ 0, 0, 0, 0 },
{ 1, accel.outputRows, accel.outputCols, accel.peArrayCols });
// Move the iterator to the correct starting place.
iter += { 0, 0, 0, accel.lineSize / accel.elemSize * id };
iter += { 0, 0, id, 0 };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate why this is the correct place now? I'm a bit confused by the previous code - were we relying on the iterator math to carry over the addition from dim[0] to dim[1]?

Copy link
Member Author

@yaoyuannnn yaoyuannnn Jan 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused by the previous code too.. Yes seems like it was relying on the carryover from the channel dimension to the width dimension. However, since one PE row is responsible for producing an output channel, commit unit i should start at [0, 0, i, 0] and once it finishes that channel, it advances to [0, 0, 2*i, 0].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to mention here the layout format used in the systolic array is NHWC.

@yaoyuannnn yaoyuannnn force-pushed the fix_systolic_array branch 2 times, most recently from e50e9f7 to 229c1db Compare January 20, 2022 06:37
remainingElems -= elemsPerLine) {
// Check if we have collected all the pixels for a writeback.
int elemsToWrite = std::min(elemsPerLine, remainingElems);
int index = inputs.size() - remainingElems;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd pick a more descriptive name than "index". Something that tells me it represents the "start" of something would be better (based on how it's used in queueCommitRequest).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Changed the name to start.

int elemsToWrite = std::min(elemsPerLine, remainingElems);
int index = inputs.size() - remainingElems;
if (isLineComplete(index, elemsToWrite))
queueCommitRequest(index, elemsToWrite);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is moving this up here functionally equivalent? IsLineComplete reads outputBuffer, which is modified by the loop below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's functionally equivalent but I figured moving this up here is better in terms of modeling the hardware. Yes outputBuffer is modified by the loop below, but the hardware should see that and create a commit request in the next cycle (if all elements in the line are collected).

Change-Id: I18ff5af077753a776f382ad386529cace5b9ed8e
@yaoyuannnn yaoyuannnn merged commit d4efbee into harvard-acc:master Jan 21, 2022
@yaoyuannnn yaoyuannnn deleted the fix_systolic_array branch January 21, 2022 06:26
@mateo-vm mateo-vm mentioned this pull request Feb 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants