[HW] Generic optimizations on hazard handling + Barber's Pole + OpQueue Fix #190

mp-17 · 2022-12-05T19:41:10Z

This PR was broken down into two different PRs: #202 and #203.

The only original contribution left here is the Barber's Pole layout. This needs to be cleaned up before merging.

Heterogeneous PR with optimizations and fixes.

Changelog

Fixed

Decouple cmdBuffer and dataBuffer depth parameters in the operand queues

Added

Add support for Barber's Pole VRF Layout

Changed

Handle WAW and WAR vload hazards in the VLDU without stalling the main sequencer
Reductions are no longer treated as widening instructions with respect to WAW hazards in the operand requesters
slide1x instructions are no longer stalled in the main sequencer; the hazard is handled downstream

Checklist

Automated tests pass
Changelog updated
Code style guideline is observed

Before this commit, all hazards (RAW, WAR, WAW) were handled by the operand requesters, which throttle access to the source register elements. Even when the hazard is a WAR or WAW, the suboptimal but efficient way to deal with it is to slow down the source register fetch. This cannot work for an instruction that has no vector source registers, such as a load, so all instructions without vector source operands were stalled in the sequencer. Loads are very common, and stalling in the main sequencer means that every instruction after the load is also stalled and cannot start executing. Loads are therefore now issued, and the hazard check is done inside the VLDU: the write-back request is masked until no hazard remains on that load instruction.
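The masking idea can be illustrated with a small behavioral model. This is only a sketch in Python, not the RTL in this PR: the class `LoadWriteback`, its `pending_hazards` set, and the `retire` method are hypothetical names chosen for illustration, and the real VLDU presumably tracks hazards differently (for example as a bit vector).

```python
from dataclasses import dataclass, field


@dataclass
class LoadWriteback:
    """Toy model of a vector load whose VRF write-back is gated by hazards.

    pending_hazards holds the ids of older in-flight instructions that
    this load has a WAR/WAW hazard with (hypothetical representation).
    """
    pending_hazards: set = field(default_factory=set)

    def retire(self, instr_id: int) -> None:
        # An older instruction completed: its hazard on this load is cleared.
        self.pending_hazards.discard(instr_id)

    def writeback_request(self, data_valid: bool) -> bool:
        # The request is masked while any hazard is still outstanding,
        # even if load data is already available.
        return data_valid and not self.pending_hazards
```

In this model the load is issued immediately and only its write-back waits, so younger instructions are not held back behind it in the sequencer.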

With the Barber's Pole layout, the PEs can almost always increment the address by 1 when writing back new data into the VRF. Only the Slide Unit needs special treatment, as its start address comes with an offset. Remember that the VRF layout should also be consistent across different LMUL settings, i.e. when LMUL > 1 and we pass from reg N to reg N+1, we must also take into account that reg N+1 has a different starting position for element 0.
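To make the layout idea concrete, here is a minimal Python sketch of one possible Barber's Pole mapping. It is an assumption for illustration only and is not the address calculation used in this PR: `NR_BANKS`, `WORDS_PER_REG`, and the functions `plain_addr`, `barber_addr`, `group_addr`, and `bank` are all hypothetical. The sketch assumes that each register occupies a fixed number of words per lane (a multiple of the bank count) and that the start position of register `vid` is rotated by `vid % NR_BANKS`.

```python
NR_BANKS = 8        # assumed number of VRF banks per lane
WORDS_PER_REG = 16  # assumed words per register per lane (multiple of NR_BANKS)


def plain_addr(vid: int, w: int) -> int:
    """Unrotated layout: word w of register vid."""
    return vid * WORDS_PER_REG + w


def barber_addr(vid: int, w: int) -> int:
    """Rotated layout: element 0 of register vid starts vid % NR_BANKS words in.

    Within the register's own region the address increments by 1 from one
    word to the next, except at the single wrap-around point.
    """
    start = vid % NR_BANKS
    return vid * WORDS_PER_REG + (start + w) % WORDS_PER_REG


def group_addr(vid: int, w: int) -> int:
    """Register group (LMUL > 1): moving on to the next register also
    switches to that register's own start position."""
    return barber_addr(vid + w // WORDS_PER_REG, w % WORDS_PER_REG)


def bank(addr: int) -> int:
    return addr % NR_BANKS
```

Under these assumptions, word 0 of registers 0 to 3 falls into bank 0 for every register in the plain layout, but into banks 0, 1, 2, 3 in the rotated layout, so the first elements of several registers can be accessed without colliding on one bank.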

Slide1Up/Down were blocked in the main sequencer when they had specific hazards. Now, these hazards are handled downstream by stalling for one cycle and then continuing with the usual protocol. WAW hazards for widening instructions are also handled better now, by discriminating between real widening instructions and reductions.

hoggur2000 · 2023-04-25T04:40:55Z

Greetings,
I hope this message finds you well. I am writing to inquire about the status of the ongoing project. Specifically, I am seeking assistance with VRF support to enable simultaneous reading of values across multiple banks. In addition, I am considering implementing the barber's pole as a potential solution.
After reviewing the current Ara code, I have identified the code related to bank port assignment and address assignment in the operand_requester.sv file located in the hardware/src/lane directory. I am curious to know if this particular code implements the barber's pole functionality in the hardware. If not, could you kindly direct me to where the barber's pole function is implemented?
Additionally, I am interested in understanding the challenges associated with implementing this aspect in the hardware, and the necessary considerations for implementing multiple registers to read values simultaneously.
I am eagerly looking forward to the opportunity to implement this functionality and await your response.
Thank you for your time and assistance.

mp-17 · 2023-04-25T12:57:45Z

Hello @hoggur2000,

Barber's Pole is currently not implemented in Ara, but a draft version of it can be found in this PR. The logic to calculate the addresses with Barber's Pole is here:

https://github.com/pulp-platform/ara/blob/12e5768ce302bba2c5d82d42deceb590ff3f013f/hardware/include/ara_vaddr.svh

But the implementation is not 100% bug-free. Have a look at the file and, if you have questions, I will be happy to answer! :-)

Best,
Matteo

hoggur2000 · 2023-05-10T01:35:06Z

Hi Matteo,
Thank you so much for taking the time to answer my question in such detail.
I really appreciated your detailed explanation of barber's pole location. It was incredibly helpful and cleared up a lot of confusion for me.
I was wondering if you have identified the cause for the current issues with achieving a 100% bug-free state? If yes, could you kindly share the reason(s) behind this?
Thanks again for your help.

mp-17 force-pushed the opt/generics branch 2 times, most recently from 7ca9031 to cf45273 · December 12, 2022 12:30

mp-17 added 6 commits December 12, 2022 13:32

[hardware] 🐛 Decouple cmdBuffer and dataBuffer depths in opQueues

e9a9da3

[hardware] Parametrize addrgen queue depth

630ef8e

[CHANGELOG] Update Changelog

fd801c4

mp-17 force-pushed the opt/generics branch from cf45273 to fd801c4 · December 12, 2022 12:45

mp-17 added 2 commits December 12, 2022 15:40

[hardware] 🐛 vstart should consider Barber's Pole layout

a199654

DEBUG: retrigger the CI

12e5768

mp-17 closed this Jun 25, 2024

mp-17 mentioned this pull request Jun 25, 2024

[HW] Faster hazard handling for VLDU and SLDU #203

Draft

5 tasks
