[HW] A series of fixes #353

Draft
wants to merge 20 commits into
base: main


mp-17
Collaborator

@mp-17 mp-17 commented Aug 26, 2024

Description of PR that completes issue here...

Changelog

Fixed

  • Description of changes

Added

  • Description of changes

Changed

  • Description of changes

Checklist

  • Automated tests pass
  • Changelog updated
  • Code style guideline is observed

Please check our contributing guidelines before opening a Pull Request.

MaistoV and others added 19 commits June 27, 2024 13:40
If LMUL_X has X > 1, Ara injects one reshuffle at a time for each register
between Vn and V(n+X-1) that has an EEW mismatch.
Each of these reshuffles targets a different register Vm with LMUL_1, but,
from the point of view of the hazard checks on the next instruction that
depends on Vn with LMUL_X, they all target the same register (Vn with LMUL_X).

We cannot just inject one macro reshuffle, since the registers between
Vn and V(n+X-1) can have different encodings. So, we need finer-grained
reshuffles, which mess up the dependency tracking.

For example,
vst @, v0 (LMUL_8)
will use the registers from v0 to v7. If they are all reshuffled, we
will end up with 8 reshuffle instructions that will get IDs from
0 to 7. The store will then see a dependency on the reshuffle ID that
targets v0 only. This is wrong, since if the store opreq is faster than
the slide opreq once the v0-reshuffle is over, it will violate the RAW
dependency.
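
The mismatch in the example above can be made concrete with a small software model. This is only a sketch of the bookkeeping described in this message, not Ara's RTL: the function names and the dictionary of reshuffle IDs are hypothetical, and it assumes IDs are handed out in injection order starting from 0.

```python
# Toy model of the hazard described above. Not Ara's RTL: every name here
# is hypothetical and only mirrors the wording of the commit message.

def inject_reshuffles(base, lmul, needs_reshuffle):
    """Inject one LMUL_1 reshuffle per mismatched register, from Vn upwards.

    Returns a dict mapping each reshuffled register to its reshuffle ID,
    assuming IDs are assigned in injection order starting from 0.
    """
    ids = {}
    next_id = 0
    for vreg in range(base, base + lmul):
        if needs_reshuffle(vreg):
            ids[vreg] = next_id
            next_id += 1
    return ids


def tracked_dependencies(base, ids):
    """Dependencies the next instruction sees: only the reshuffle on Vn."""
    return {ids[base]} if base in ids else set()


def required_dependencies(base, lmul, ids):
    """Dependencies the next instruction actually needs: the whole group."""
    return {ids[v] for v in range(base, base + lmul) if v in ids}


# Store on v0 with LMUL_8, with all eight registers needing a reshuffle.
ids = inject_reshuffles(0, 8, lambda vreg: True)
tracked = tracked_dependencies(0, ids)
required = required_dependencies(0, 8, ids)
missed = required - tracked  # reshuffles the store is not waiting for
print(sorted(missed))
```

In this model the store waits only for the reshuffle that targets v0, so the reshuffles on the remaining registers of the group are left untracked.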

To avoid this, the safest, though least optimal, fix is to just
wait in WAIT_IDLE after a reshuffle with LMUL > 1.

There are many possible optimizations to this:
 1) When LMUL > 1, check whether we reshuffled more than one register.
If we reshuffled one register only, we can skip the WAIT_IDLE.
 2) Check if all the X registers need to be reshuffled (common case).
If this is the case, inject a single large reshuffle with LMUL_X and
skip WAIT_IDLE.
 3) Instead of waiting in WAIT_IDLE, inject the reshuffles starting from
V(n+X-1) instead of Vn. This will automatically adjust the dependency
check and slightly speed up the whole operation.
Signed-off-by: Moritz Imfeld <[email protected]>
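
The three optimizations listed in the commit message above could be sketched as a planning step. This is a hedged illustration, not the sequencer's logic: `plan_reshuffles` and `plan_reversed` are hypothetical names, a reshuffle is modelled as a `(register, lmul)` pair, and option 2 assumes a single LMUL_X reshuffle is valid whenever all X registers need one, as the message suggests.

```python
# Hypothetical planning sketch for the optimizations listed above.
# A reshuffle is modelled as a (vreg, lmul) tuple; none of this is Ara's RTL.

def plan_reshuffles(base, lmul, mismatched):
    """Return (reshuffles, wait_idle) for the register group Vn..V(n+X-1).

    mismatched is the set of registers in the group with an EEW mismatch.
    """
    todo = [v for v in range(base, base + lmul) if v in mismatched]
    if len(todo) <= 1:
        # Option 1: at most one register reshuffled, no need for WAIT_IDLE.
        return [(v, 1) for v in todo], False
    if len(todo) == lmul:
        # Option 2: the whole group is reshuffled, use one LMUL_X reshuffle.
        return [(base, lmul)], False
    # Conservative fix from this commit: fine-grained reshuffles + WAIT_IDLE.
    return [(v, 1) for v in todo], True


def plan_reversed(base, lmul, mismatched):
    """Option 3: inject from V(n+X-1) down to Vn, so Vn's reshuffle is last."""
    return [(v, 1) for v in range(base + lmul - 1, base - 1, -1)
            if v in mismatched]


print(plan_reshuffles(0, 8, {3}))            # single register, no wait
print(plan_reshuffles(0, 8, set(range(8))))  # whole group, one big reshuffle
print(plan_reshuffles(0, 8, {0, 1}))         # partial group, conservative fix
print(plan_reversed(0, 4, {0, 1, 3}))        # reversed injection order
```

With the reversed order, the reshuffle on Vn is injected last, so a dependency on it presumably covers the earlier ones as long as reshuffles complete in injection order; that ordering assumption is not stated in the message and would need to be checked against the hardware.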
@mp-17 mp-17 mentioned this pull request Aug 26, 2024

3 participants