[HW] Generic optimizations on hazard handling + Barber's Pole + OpQueue Fix #190

Closed
wants to merge 8 commits into from

mp-17
Collaborator

@mp-17 mp-17 commented Dec 5, 2022

This PR has been split into two separate PRs: #202 and #203.

The only contribution that remains unique to this PR is the Barber's Pole layout, which needs to be cleaned up before merging.

Heterogeneous PR with optimizations and fixes.

Changelog

Fixed

  • Decouple cmdBuffer and dataBuffer depth parameters in the operand queues

Added

  • Add support for Barber's Pole VRF Layout

Changed

  • Handle WAW and WAR vload hazards in the VLDU without stalling the main sequencer
  • Reductions are no longer treated as widening instructions when handling WAW hazards in the operand requesters
  • slide1x instructions are no longer stalled in the main sequencer; the hazard is handled downstream

Checklist

  • Automated tests pass
  • Changelog updated
  • Code style guideline is observed

@mp-17 mp-17 force-pushed the opt/generics branch 2 times, most recently from 7ca9031 to cf45273 Compare December 12, 2022 12:30
Before this commit, all hazards (RAW, WAR, WAW) were handled
by the operand requesters, which throttle access to the source register elements.
Even when the hazard is a WAR/WAW, the suboptimal but efficient way to
deal with it is to slow down the source register fetch.
This cannot work if an instruction has no source registers, as is the
case for load instructions. Therefore, all the instructions without
vector source operands were stalled in the sequencer.
Loads are very common, and stalling in the main sequencer means
that all the instructions after the load are also stalled and cannot
start their execution.
Loads are therefore now issued, and the hazard check is done inside
the VLDU. The write-back request is masked until there are no more
hazards on that load instruction.
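
The masking described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the RTL in this PR: `PendingLoad`, `writeback_allowed`, and the set-based hazard bookkeeping are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PendingLoad:
    """A vector load buffered in the load unit (hypothetical model)."""
    insn_id: int
    # IDs of older in-flight instructions that still read (WAR) or
    # write (WAW) this load's destination register.
    hazards: set = field(default_factory=set)


def writeback_allowed(load, data_valid, in_flight):
    """Return True once the load may raise its VRF write-back request.

    The request is masked while any instruction the load depends on
    is still in flight; the sequencer itself is never stalled.
    """
    unresolved = load.hazards & in_flight
    return data_valid and not unresolved


# Load 3 overwrites a register that instruction 1 is still reading.
load = PendingLoad(insn_id=3, hazards={1})
print(writeback_allowed(load, True, {1, 2, 3}))  # masked: 1 is still running
print(writeback_allowed(load, True, {2, 3}))     # 1 has retired: request goes out
```

In this model only the load's own write-back waits; younger instructions are free to issue, which is the point of moving the check out of the main sequencer.
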
With the Barber's Pole layout, the PEs can almost always increment
the address by 1 when writing back new data into the VRF.
Only the Slide Unit needs special treatment, as its
start address comes with an offset.
Remember that the VRF layout should also be consistent
across different LMUL settings, i.e., when LMUL > 1 and
we pass from reg N to reg N+1, we must also take into
account that reg N+1 has a different starting position
for element 0.
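
To make the LMUL remark concrete, here is a minimal sketch of one possible rotation scheme, in which register N starts in bank N modulo the number of banks. It is an illustrative assumption and not the address logic in `ara_vaddr.svh`; `NUM_BANKS`, `WORDS_PER_REG`, and both function names are hypothetical.

```python
NUM_BANKS = 8       # assumed number of VRF banks per lane
WORDS_PER_REG = 16  # assumed words one vector register occupies in a lane


def barber_pole_location(vreg, word):
    """Map (register, word index) to (bank, row) with a per-register rotation."""
    if not 0 <= word < WORDS_PER_REG:
        raise ValueError("word index out of range")
    bank = (vreg + word) % NUM_BANKS  # start bank is shifted by the register index
    row = vreg * (WORDS_PER_REG // NUM_BANKS) + word // NUM_BANKS
    return bank, row


def group_location(base_vreg, word):
    """Locate a word of a register group (LMUL > 1) that starts at base_vreg."""
    vreg = base_vreg + word // WORDS_PER_REG
    return barber_pole_location(vreg, word % WORDS_PER_REG)


# Element 0 of consecutive registers lands in different banks.
print([barber_pole_location(v, 0)[0] for v in range(4)])

# Crossing from reg 4 to reg 5 inside a group: the bank does not simply step by 1.
print(group_location(4, WORDS_PER_REG - 1), group_location(4, WORDS_PER_REG))
```

Within one register the bank index advances by one per word, but at the boundary between reg N and reg N+1 the next word restarts from the rotated start bank of reg N+1, so a unit walking a register group has to account for that jump.
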
Slide1Up/Down were blocked in the main sequencer when they had specific
hazards. Now, these hazards are handled downstream by stalling for
1 cycle and then continuing with the usual protocol.
WAW hazards for widening instructions are also handled better now,
by discriminating between real widening instructions and reductions.
@hoggur2000

Greetings,
I hope this message finds you well. I am writing to inquire about the status of the ongoing project. Specifically, I am seeking assistance with VRF support to enable simultaneous reading of values across multiple banks. In addition, I am considering implementing the barber's pole as a potential solution.
After reviewing the current Ara code, I have identified the code related to bank port assignment and address assignment in the operand_requester.sv file located in the hardware/src/lane directory. I am curious to know if this particular code implements the barber's pole functionality in the hardware. If not, could you kindly direct me to where the barber's pole function is implemented?
Additionally, I am interested in understanding the challenges associated with implementing this aspect in the hardware, and the necessary considerations for implementing multiple registers to read values simultaneously.
I am eagerly looking forward to the opportunity to implement this functionality and await your response.
Thank you for your time and assistance.

@mp-17
Collaborator Author

mp-17 commented Apr 25, 2023

Hello @hoggur2000,

Barber's Pole is currently not implemented in Ara, but a draft version of it can be found in this PR. The logic to calculate the addresses with Barber's Pole is here:

https://github.com/pulp-platform/ara/blob/12e5768ce302bba2c5d82d42deceb590ff3f013f/hardware/include/ara_vaddr.svh

But the implementation is not 100% bug-free. Have a look at the file and, if you have questions, I will be happy to answer! :-)

Best,
Matteo

@hoggur2000

Hi Matteo,
Thank you so much for taking the time to answer my question in such detail.
I really appreciated your detailed explanation of barber's pole location. It was incredibly helpful and cleared up a lot of confusion for me.
I was wondering if you have identified the cause for the current issues with achieving a 100% bug-free state? If yes, could you kindly share the reason(s) behind this?
Thanks again for your help.
