[SW] Initial support for compilation in Linux environment #312
If LMUL_X has X > 1, Ara injects one reshuffle at a time for each register between Vn and V(n+X-1) that has an EEW mismatch. Each of these reshuffles targets a different Vm with LMUL_1, but from the point of view of the hazard checks on the next instruction that depends on Vn with LMUL_X, they all target the same register (Vn with LMUL_X). We cannot just inject one macro reshuffle, since the registers between Vn and V(n+X-1) can have different encodings. So we need finer-grain reshuffles, which mess up the dependency tracking.

For example, vst @, v0 (LMUL_8) uses the registers from v0 to v7. If they are all reshuffled, we end up with 8 reshuffle instructions with IDs from 0 to 7. The store then sees a dependency only on the reshuffle ID that targets v0. This is wrong: if the store opreq is faster than the slide opreq once the v0 reshuffle is over, the RAW dependency is violated.

To avoid this, the safest (and least efficient) fix is to just wait in WAIT_IDLE after a reshuffle with LMUL > 1.

There are many possible optimizations:
1) Check whether, when LMUL > 1, we reshuffled more than one register. If we reshuffled only one register, we can skip the WAIT_IDLE.
2) Check whether all the X registers need to be reshuffled (common case). If so, inject a single large reshuffle with LMUL_X and skip WAIT_IDLE.
3) Instead of waiting in WAIT_IDLE, inject the reshuffles starting from V(n+X-1) rather than from Vn. This automatically adjusts the dependency check and speeds up the whole operation a bit.
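The WAIT_IDLE decision described above can be sketched as a small behavioural model. This is only an illustration in Python, not Ara's RTL: the function names, the `eew_of` dictionary, and the `optimized` flag are all hypothetical, and only optimizations 1 and 2 are modelled.

```python
def mismatched_regs(base, lmul, eew_of, target_eew):
    """Registers in [base, base + lmul) whose stored EEW differs from target_eew.

    Each of these would get its own LMUL_1 reshuffle (hypothetical model).
    """
    return [r for r in range(base, base + lmul) if eew_of[r] != target_eew]


def needs_wait_idle(base, lmul, eew_of, target_eew, optimized=False):
    """Decide whether the dispatcher should wait in WAIT_IDLE after reshuffling."""
    regs = mismatched_regs(base, lmul, eew_of, target_eew)
    if lmul == 1 or not regs:
        # A single reshuffle ID (or none) is tracked correctly by the hazard check.
        return False
    if not optimized:
        # Conservative fix: always wait after a reshuffle with LMUL > 1.
        return True
    if len(regs) == 1:
        # Optimization 1: only one register was reshuffled.
        return False
    if len(regs) == lmul:
        # Optimization 2: all X registers need it, so one LMUL_X reshuffle suffices.
        return False
    return True


# Example: v0..v7 all hold EEW=8 data and the store needs EEW=64.
eew_of = {r: 8 for r in range(32)}
print(needs_wait_idle(0, 8, eew_of, 64))
print(needs_wait_idle(0, 8, eew_of, 64, optimized=True))
```

With the conservative policy the model always waits when more than one register in the group may have been reshuffled; the `optimized` path only shows where the first two proposed shortcuts would apply.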
* Add MMU interface (just mock)
* Refactoring
* Switch from pulp-platform/cva6 to MaistoV/cva6_fork
* Bump axi to v0.39.0
* vstart support for vector unit-stride loads and stores
* vstart support for vector strided loads and stores
* vstart support for valu operations, mask operations not tested
* Preliminary work on vstart support for vector indexed loads and stores
* Minor fixes
* Refactoring
* Explanatory comments
- Restrict mem bus to EW if vstore, vstart > 0, and EW < 64-bit
If vstart > 0 and EW < 64, the situation is similar to when the memory address is misaligned with respect to the memory bus. Because of the VRF byte layout, and since the granularity of each lane's payload to the store unit is 64 bits, all the packets can contain valid data while the beat is not yet complete. So either we calculate in the addrgen the effective length of a burst with unequal beats, or we add a buffer and aligner in the store unit, or we handle the ready signals at byte level, or we simply reduce the effective memory bus to the element width (worst case). We do the latter. Performance is low, but a vstore with vstart > 0 happens after an exception, so the throughput drop should be acceptable.
- Data packets from VRF to STU
Operand requesters now send balanced payloads from all the lanes if vstart > 0. The store unit identifies the valid ones by itself, and only has to handshake balanced payloads.
- Time the STU exception flush with the opqueues
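The cost of restricting the memory bus to the element width can be illustrated with a short Python sketch. It assumes an aligned, unit-stride store and a 128-bit bus in the example; the function names and parameters are hypothetical and do not correspond to signals in the addrgen or store unit.

```python
def effective_bus_bytes(is_store, vstart, ew_bits, bus_bits):
    """Effective memory bus width in bytes under the restriction described above."""
    if is_store and vstart > 0 and ew_bits < 64:
        # Worst case: each beat carries a single element.
        return ew_bits // 8
    return bus_bits // 8


def store_beats(vl, vstart, ew_bits, bus_bits):
    """Beats needed to store elements vstart..vl-1 (aligned, unit-stride assumption)."""
    payload_bytes = (vl - vstart) * (ew_bits // 8)
    width = effective_bus_bytes(True, vstart, ew_bits, bus_bits)
    return -(-payload_bytes // width)  # ceiling division


# 16 elements of 32 bits on a 128-bit bus (assumed width).
print(store_beats(16, 0, 32, 128))  # full-width beats
print(store_beats(16, 3, 32, 128))  # one element per beat after vstart > 0
```

In this model a store resumed with vstart = 3 needs one beat per remaining element, which is the throughput drop the commit message accepts because it only occurs after an exception.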
The vstart signal within the lanes is not the architectural vstart. For all instructions, it is the architectural vstart adjusted to reflect the per-lane "vstart" used for VRF fetch address calculation. Memory instructions, which support arch vstart > 0, can use that vstart signal to resize the number of elements to fetch from the VRF. Slide instructions, instead, further modify the vstart for addressing purposes only, and should not use the vstart signal to resize the number of elements to fetch.
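One way to picture the difference between the architectural vstart and the per-lane value is the sketch below. It assumes a simplified layout in which element i lives in lane i mod NrLanes; the real VRF byte layout and the exact computation in the lanes may differ, and `lane_vstart` is a hypothetical helper, not a function from the code base.

```python
def lane_vstart(arch_vstart, lane_id, nr_lanes):
    """Number of elements a lane skips, assuming element i maps to lane i % nr_lanes."""
    base, rem = divmod(arch_vstart, nr_lanes)
    # Lanes with an index below the remainder hold one extra skipped element.
    return base + (1 if lane_id < rem else 0)


# Architectural vstart = 5 on a 4-lane configuration.
per_lane = [lane_vstart(5, lane, 4) for lane in range(4)]
print(per_lane)
```

Under this assumption the per-lane values always sum to the architectural vstart, which is why a memory instruction can use them to shrink the number of elements it fetches, while a slide that further alters the value for addressing cannot.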
* Added LINUX switch, default LINUX=0
Continue this in #319
#269 rework
Introduce initial support for kernel compilation under Linux environment
Changelog
Fixed
Added
Changed
Checklist
Please check our contributing guidelines before opening a Pull Request.