[SW] Initial support for compilation in Linux environment #312
Commits on Jun 19, 2024
- 665b6ae
- 193f385
- d5e2fb2
- e4c2466
- 2c9d5ba
- 26af2ec
- 7bcfcbf
- 7b0191a
- 7d6da86: [hardware] 🐛 Suboptimal fix to reshuffle with LMUL > 1

  If LMUL_X has X > 1, Ara injects one reshuffle at a time for each register between Vn and V(n+X-1) that has an EEW mismatch. Each of these reshuffles targets a different Vm with LMUL_1, but from the point of view of the hazard checks of the next instruction that depends on Vn with LMUL_X, they all target the same register (Vn with LMUL_X). We cannot simply inject one macro reshuffle, since the registers between Vn and V(n+X-1) can have different encodings; we need finer-grained reshuffles, and these mess up the dependency tracking.

  For example, vst @, v0 (LMUL_8) uses registers v0 to v7. If they all need reshuffling, we end up with 8 reshuffle instructions with IDs from 0 to 7. The store then sees a dependency only on the reshuffle ID that targets v0. This is wrong: if the store operand requester is faster than the slide operand requester once the v0 reshuffle is over, it violates the RAW dependency. To avoid this, the safest and most suboptimal fix is to simply wait in WAIT_IDLE after a reshuffle with LMUL > 1.

  Possible optimizations (see the sketch after this entry):
  1) When LMUL > 1, check whether more than one register was reshuffled. If only one register was reshuffled, WAIT_IDLE can be skipped.
  2) Check whether all X registers need to be reshuffled (the common case). If so, inject one large reshuffle with LMUL_X and skip WAIT_IDLE.
  3) Instead of waiting until idle, inject the reshuffles starting from V(n+X-1) rather than Vn. This automatically adjusts the dependency check and speeds up the whole operation a bit.
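A minimal Python sketch of the hazard described above, assuming a simplified scoreboard in which every injected micro-operation gets a sequential ID and a dependent instruction records only the producer ID of the base register of its LMUL group (function and variable names are illustrative, not Ara's RTL):

```python
# Toy scoreboard model of the LMUL > 1 reshuffle hazard (illustrative names,
# not Ara's actual RTL). Each injected micro-op gets a sequential ID; the
# dependent store only records the producer ID of its base register.

def inject_reshuffles(regs, scoreboard, next_id):
    """Inject one reshuffle per register; record producer IDs in program order."""
    for r in regs:
        scoreboard[r] = next_id      # producer ID for this vector register
        next_id += 1
    return next_id

def store_dependency(base_reg, scoreboard):
    """The store's hazard check only sees the base register of the LMUL group."""
    return scoreboard.get(base_reg)

# vst @, v0 with LMUL_8: v0..v7 all need a reshuffle.
group = [f"v{i}" for i in range(8)]

# Naive order (v0 first): the store waits only on ID 0, while reshuffles
# 1..7 may still be in flight -> potential RAW violation.
sb = {}
inject_reshuffles(group, sb, next_id=0)
print("naive order, store waits on ID:", store_dependency("v0", sb))     # 0

# Optimization 3: inject from v7 down to v0, so the base register's
# reshuffle gets the last (highest) ID. With in-order completion, waiting
# on that ID implies all earlier reshuffles have already retired.
sb = {}
inject_reshuffles(list(reversed(group)), sb, next_id=0)
print("reversed order, store waits on ID:", store_dependency("v0", sb))  # 7
```

With the naive order, the store's recorded dependency (ID 0) retires while reshuffles 1 to 7 may still be in flight; with the reversed order of optimization 3, waiting on the base register's ID also covers all earlier reshuffles, assuming the slide unit completes them in order.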
- 4733f20
- a8426f3
- 2fed184
Commits on Jun 25, 2024
- dd0047b
- b6363d2: Add MMU interface (just a mock); refactoring
- fc8dd42: Switch from pulp-platform/cva6 to MaistoV/cva6_fork; bump axi to v0.39.0
- 310c4da: Support the vstart CSR for operand read, VALU, VLSU
  * vstart support for vector unit-stride loads and stores
  * vstart support for vector strided loads and stores
  * vstart support for VALU operations (mask operations not tested)
  * Preliminary work on vstart support for vector indexed loads and stores
  * Minor fixes
  * Refactoring
  * Explanatory comments
  A sketch of the vstart semantics follows this entry.
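For reference, the RISC-V V specification defines vstart so that elements with index below vstart are left unmodified, execution resumes at vstart, and vstart is reset to zero when the instruction completes. The helper below is an illustrative software model of that behavior for a unit-stride load, not Ara's RTL:

```python
# Reference-model sketch of vstart semantics for a unit-stride vector load
# (RISC-V V spec behavior; illustrative code, not Ara's RTL).

def vle_unit_stride(vreg, memory, base, vl, vstart, eew_bytes=1):
    """Load vl elements of eew_bytes each from memory[base...], resuming at vstart."""
    for i in range(vstart, vl):
        addr = base + i * eew_bytes
        vreg[i] = memory[addr]        # elements [0, vstart) are not modified
    return 0                          # vstart resets to 0 on completion

memory = list(range(100, 164))
vreg = [0] * 8

# Resume after a (hypothetical) trap that left vstart = 3:
new_vstart = vle_unit_stride(vreg, memory, base=0, vl=8, vstart=3)
print(vreg)        # [0, 0, 0, 103, 104, 105, 106, 107]
print(new_vstart)  # 0
```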
- 9dd870f
- d0a026a
- 37faafa
- a3d46d2
- 09b927f: [hardware] Support VLD and VST with vstart > 0
  - Restrict the memory bus to EW if vstore, vstart > 0, and EW < 64 bit.
    If vstart > 0 and EW < 64, the situation is similar to a memory address that is misaligned with respect to the memory bus. Because of the VRF byte layout, and since the granularity of each lane's payload to the store unit is 64 bit, all the packets can contain valid data even though the beat has not been completed. So either we calculate in the addrgen the effective length of a burst with unequal beats, or we add a buffer and aligner in the store unit, or we handle the ready signals at byte level, or we simply reduce the effective memory bus to the element width (worst case). We do the latter; a sketch follows this entry. It is low performance, but a vstore with vstart > 0 only happens after an exception, so the throughput drop should be acceptable.
  - Data packets from VRF to STU: the operand requesters now send balanced payloads from all the lanes if vstart > 0. The store unit identifies the valid ones by itself and only has to handshake balanced payloads.
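A small sketch of the worst-case narrowing chosen above, assuming a 64-bit memory bus whose effective per-beat width is clamped to the element width whenever a store starts with vstart > 0 and EW < 64 bit (the helper and constant names are illustrative, not Ara's addrgen interface):

```python
# Sketch of the "reduce the effective memory bus to the element width" fallback
# for stores resuming with vstart > 0 (illustrative helper, not Ara's addrgen).

BUS_BYTES = 8  # 64-bit memory bus

def effective_bus_bytes(is_store: bool, vstart: int, eew_bytes: int) -> int:
    """Return how many bytes per beat the store unit may actually use."""
    if is_store and vstart > 0 and eew_bytes * 8 < 64:
        return eew_bytes      # worst case: one element per beat
    return BUS_BYTES          # normal case: full bus width

# Normal store, EW = 16 bit: full 8-byte beats.
print(effective_bus_bytes(is_store=True, vstart=0, eew_bytes=2))  # 8
# Store resuming after an exception with vstart = 5, EW = 16 bit: 2-byte beats.
print(effective_bus_bytes(is_store=True, vstart=5, eew_bytes=2))  # 2
```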
- a9411ea: [hardware] 🐛 Flush st-opqueue and reset st-requester upon exception
  - Time the STU exception flush with the operand queues
- 929bcac
- 22031a1
- 62f6d5a
- 0532ffb
- 9be56e9
- f12be75
- 699a2c9: [hardware] 🐛 Don't use vstart to drop elements for slides
  The vstart signal within the lanes is not the architectural vstart. For all instructions, it is the architectural vstart manipulated to reflect the per-lane "vstart" used to calculate VRF fetch addresses. Memory instructions, which support architectural vstart > 0, can also use that vstart signal to resize the number of elements to fetch from the VRF. Slide instructions, instead, further modify vstart for addressing purposes only, and must not use the vstart signal to resize the number of elements to fetch. A rough model follows this entry.
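A rough software model of the distinction above, assuming elements of a vector register are striped round-robin across the lanes (this striping and all names below are assumptions for illustration, not Ara's actual mapping):

```python
# Rough model of the lane-local "vstart" used for VRF fetch addressing in a
# multi-lane VRF with round-robin element striping (illustrative, not Ara's RTL).

NR_LANES = 4

def lane_vstart(arch_vstart: int, lane: int) -> int:
    """Lane-local start index used for VRF address calculation."""
    return arch_vstart // NR_LANES + (1 if (arch_vstart % NR_LANES) > lane else 0)

def lane_elements_to_fetch(vl: int, fetch_start: int, lane: int) -> int:
    """How many elements this lane fetches, starting from fetch_start."""
    total = vl // NR_LANES + (1 if (vl % NR_LANES) > lane else 0)
    return max(total - fetch_start, 0)

arch_vstart, vl = 5, 16
for lane in range(NR_LANES):
    lv = lane_vstart(arch_vstart, lane)
    # Memory ops resume at arch_vstart: use the lane-local vstart both for the
    # fetch address and to shrink the element count.
    mem_cnt = lane_elements_to_fetch(vl, lv, lane)
    # Slides offset the lane vstart for addressing only; the number of elements
    # to fetch must still be derived from vl, not from that modified signal.
    slide_cnt = lane_elements_to_fetch(vl, 0, lane)
    print(f"lane {lane}: lane_vstart={lv}, mem fetch={mem_cnt}, slide fetch={slide_cnt}")
```

The point is only that the lane-local vstart is an addressing quantity; for slides it must not also shrink the per-lane element count.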
- 097049f
- 404ce2f
- 9067762
- 11be9ed
- e385500
- 0a994f9: [apps] Extended sw build for Linux
  * Added a LINUX switch, default LINUX=0