Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Systolic arrays sizes #43

Open
mateo-vm opened this issue Dec 13, 2021 · 7 comments
Open

Systolic arrays sizes #43

mateo-vm opened this issue Dec 13, 2021 · 7 comments
Assignees
Labels

Comments

@mateo-vm
Copy link

I am testing the systolic array accelerator, and I am having problems when changing the array sizes. For the default configuration (8x8), everything works fine.

For smaller sizes, it never ends. It gets stuck in a loop, and never leaves. The following is an example of a loop for a 4x4 systolic array. This happens even for the base test.c.

963152000: system.systolic_array_acc.input_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.input_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.input_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.input_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.weight_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.weight_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.weight_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963152000: system.systolic_array_acc.weight_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: global: Weight fold barrier, arrived: 1.
963153000: global: Weight fold barrier, arrived: 2.
963153000: global: evaluate
963153000: system.systolic_array_acc.input_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.input_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.input_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.input_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.weight_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.weight_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.weight_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963153000: system.systolic_array_acc.weight_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: global: Weight fold barrier, arrived: 3.
963154000: global: Weight fold barrier, arrived: 4.
963154000: global: evaluate
963154000: system.systolic_array_acc.input_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.input_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.input_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.input_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.weight_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.weight_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.weight_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963154000: system.systolic_array_acc.weight_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: global: Weight fold barrier, arrived: 5.
963155000: global: Weight fold barrier, arrived: 6.
963155000: global: evaluate
963155000: system.systolic_array_acc.input_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.input_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.input_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.input_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.weight_fetch0: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.weight_fetch1: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.weight_fetch2: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963155000: system.systolic_array_acc.weight_fetch3: Fetch queue occupied space: 0 / 32, allFetched: 1, allConsumed: 1, arrived at barrier: 1.
963156000: global: Weight fold barrier, arrived: 7.
963156000: global: Weight fold barrier, arrived: 8.
963156000: global: All have arrived at the weight fold barrier.

As for the bigger SAs, there are two cases. In the base test.c, SAs up to a size of 16 (not included) work as expected. For bigger sizes, I am getting a segmentation fault.

If I want to simulate bigger layers (e.g., ResNet's conv3), for bigger SAs I am getting the following error:

fatal: Streaming out premature data!

How can these errors be fixed?

@yaoyuannnn yaoyuannnn self-assigned this Dec 15, 2021
@yaoyuannnn yaoyuannnn added the bug label Dec 15, 2021
@xyzsam
Copy link
Member

xyzsam commented Dec 20, 2021

Yuan did you get a chance to look into this? IIRC we've tested out both 4x4 and 16x16 arrays in the past.

@yaoyuannnn
Copy link
Member

Yuan did you get a chance to look into this? IIRC we've tested out both 4x4 and 16x16 arrays in the past.

Sam, I haven't taken a look. It could be bugs introduced by more recent changes. Will look into this this week.

@mateo-vm
Copy link
Author

Hi! Would there be any update regarding this issue?

@yaoyuannnn
Copy link
Member

Sorry for late response. After a bit of investigation, I think for the 4x4 PE configuration, the hang is because of a bug in the commit unit (which collects data from the PEs and writes results to the local SRAM). It currently assumes the number of PE columns is larger than the writeback line size. I will upload a fix for this tomorrow.

@xyzsam
Copy link
Member

xyzsam commented Feb 7, 2022

@mateo-vm: did this resolve your problems?

@yaoyuannnn
Copy link
Member

Hey Sam, the MRs didn't fix the bug for 16x16 configuration. Sorry been quite busy lately, but will get down to this soon.

@mateo-vm
Copy link
Author

@mateo-vm: did this resolve your problems?

Yes, thank you very much. At the end I focused on arrays up to 8x8, so #45 really helped. On my side is solved, but I won't close the issue in case you still want to solve the 16x16 problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants