Support multiple outstanding load operations to the Dcache #1348
Conversation
❌ failed run, report available here.
Hi @JeanRochCoulon, it looks like some tests (4/18) in the Thales pipeline are failing on this PR. The following tests are failing:
They fail after only a few seconds, so I guess it is a compilation issue. Could you please share the log? Thank you very much.
The CI ran into trouble. Some jobs failed due to a Spike issue. This should be fixed now. I have relaunched the CI. Let's wait and pray.
❌ failed run, report available here.
1 similar comment
❌ failed run, report available here.
7dcdbac to 1e80742
I had a discussion with @Jbalkind earlier today. He was concerned about the choice of having a default of 2 entries in the load buffer: first, because of the additional area, and second, because of the possible verification hole when introducing this new capability while using the current WT and STD caches. Having 2 entries means that the load_unit can send ID=0 and ID=1, and the caches must mirror that ID in the response. Regarding the first issue, we agreed that the overhead of something like 10 bits (FFs) is negligible, I would say. However, I agree that the second point is effectively important. Of course, this PR passes the functional tests in the GitHub and Thales CIs, but it is not certain that this is enough. I guess that a way of diminishing the risk is to use a single entry in the load buffer by default. However, with my current proposed implementation, when there is a single entry, the load unit can send only one load every two cycles. I do not think this is acceptable. Thus, I will improve the PR a little bit to implement a fall-through mode in the load buffer when the number of entries is 1. If the number of entries is bigger than 1, the load buffer is not fall-through anymore, to ease the timing (no combinational path between the request and the response interfaces). This allows a throughput of one load per cycle either way.
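To illustrate the trade-off being described, here is a minimal sketch of a single-entry buffer with an optional fall-through path. This is not the actual CVA6 load buffer; all names are made up for illustration. With FALL_THROUGH set, a word pushed into an empty buffer is visible at the output in the same cycle, sustaining one transfer per cycle at the price of a combinational path from the push (request) side to the pop (response) side; with FALL_THROUGH cleared there is no such path, but a single entry then limits the rate to one transfer every two cycles.

```systemverilog
// Minimal sketch, not the actual CVA6 code: one-entry buffer with an
// optional fall-through path. Names and interface are illustrative.
module one_entry_buffer #(
  parameter bit          FALL_THROUGH = 1'b0,
  parameter int unsigned WIDTH        = 8
) (
  input  logic             clk_i,
  input  logic             rst_ni,
  // push side (a new request is issued)
  input  logic             push_valid_i,
  output logic             push_ready_o,
  input  logic [WIDTH-1:0] push_data_i,
  // pop side (the entry is consumed, e.g. when a response arrives)
  output logic             pop_valid_o,
  input  logic             pop_ready_i,
  output logic [WIDTH-1:0] pop_data_o
);
  logic             full_q;
  logic [WIDTH-1:0] data_q;

  // A new word is accepted only when the single entry is free.
  assign push_ready_o = !full_q;

  // Fall-through: an empty buffer exposes the incoming word immediately.
  assign pop_valid_o = full_q || (FALL_THROUGH && push_valid_i);
  assign pop_data_o  = full_q ? data_q : push_data_i;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      full_q <= 1'b0;
      data_q <= '0;
    end else if (!full_q) begin
      // Store the incoming word unless it falls through and is consumed now.
      if (push_valid_i && !(FALL_THROUGH && pop_ready_i)) begin
        data_q <= push_data_i;
        full_q <= 1'b1;
      end
    end else if (pop_ready_i) begin
      // The stored word leaves when the consumer accepts it.
      full_q <= 1'b0;
    end
  end
endmodule
```

With two or more registered entries, the same one-per-cycle rate can be reached without the fall-through path, which is the trade-off discussed in this comment.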
❌ failed run, report available here.
1e80742 to 7e93829
I implemented the fall-through mode as explained above (it is only used when the number of entries in the load buffer is 1). However, I kept the default value of 2 entries in all configurations, as I really think that the fall-through mode would create a very tight timing path. I ran some tests on my side using the three different cache subsystems: WB, WT and HPDCACHE. No issue found so far. I ran the tests on both configurations: single-entry load buffer and 2-entry load buffer.
❌ failed run, report available here.
Hello @cfuguet |
Question: will this buffer speed up loads (when the HPDcache is not used)?
core/include/ariane_pkg.sv
Outdated
@@ -667,6 +667,9 @@ package ariane_pkg;
    logic vfp; // is this a vector floating-point instruction?
  } scoreboard_entry_t;

  // Maximum number of inflight memory load requests
  localparam int unsigned NR_LOAD_BUFFER_ENTRIES = cva6_config_pkg::CVA6ConfigNrLoadBufEntries;
To be compliant with the new parametrization strategy, this user parameter should be added to cva6_cfg_t in config_pkg.sv and removed from ariane_pkg.sv.
Thank you @JeanRochCoulon for catching this compliance issue. I will do the modification.
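For reference, a hypothetical sketch of what this compliance change could look like; the field name NrLoadBufEntries is an assumption for illustration, not the actual CVA6 code. The load-buffer depth would become a field of the cva6_cfg_t structure in config_pkg.sv, and the localparam in ariane_pkg.sv would go away.

```systemverilog
// Hypothetical excerpt of config_pkg.sv: only the new field is shown and
// its name is illustrative.
package config_pkg;
  typedef struct packed {
    // ... existing user parameters ...
    int unsigned NrLoadBufEntries;  // depth of the pending load table
  } cva6_cfg_t;
endpackage
```

The load unit would then read this value from its cva6_cfg_t parameter instead of referencing the ariane_pkg localparam.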
No. If the WT and WB caches are not modified to support multiple load misses, this does not provide any benefit. It is useful in a cache like the HPDcache that can accept multiple outstanding loads while concurrently processing one or multiple misses.
This is a tricky choice. First, there is currently an increase of 300 gates (according to the CI's report). This is mainly because I kept a default of 2 entries in the load buffer to enable a throughput of 1 load per cycle. The advantage of having two entries is that we get this 1 load/cycle throughput without any timing path between the request and the response. We can remove this overhead by using 1 entry by default, but if we still want a throughput of 1 load/cycle, this requires a timing path between the request and the response. This is a RAM-to-RAM path, thus potentially critical. Before my modifications, the load unit was somewhat fragile. I would say that it was working because of some lucky circumstances: it was indeed able to provide a throughput of 1 load/cycle in case of a cache hit, but in case of a miss, it expected the data cache to apply back-pressure by setting the ready signal to 0. If the data cache is able to accept new requests during a miss, the original implementation overwrites the load buffer and everything breaks. That is why, in my humble opinion, keeping the original implementation is not a good thing (even if it works because the current WT and WB caches luckily block a given requester after a miss). We can consider that the ready/valid protocol was not respected. In summary, I would suggest accepting the increase of 300 gates with a 2-entry load buffer to keep the 1 load/cycle throughput without any impact on timing. If this increase is not acceptable, we can use a 1-entry load buffer, but with an important impact on timing. What do you think @JeanRochCoulon, @Jbalkind, @zarubaf?
The tiny increase seems reasonable to me. My concern is just to make sure that we don't have other assumptions about only issuing 1 load that could be violated. I think that's for people beyond just César to think about.
Thank you for the explanations, @cfuguet. 300 gates, that's quite small. I approve.
7e93829 to 5443a3d
❌ failed run, report available here.
The ID in the request from the load/store unit must be mirrored by the Dcache in the response. This makes it possible to match a given response to its corresponding request. Responses can be given (by the Dcache) in a different order than that of the requests. This modification introduces a pending load table that tracks outstanding load operations to the Dcache. The depth of this table is a parameter in the target configuration package. Signed-off-by: Cesar Fuguet <[email protected]>
Signed-off-by: Cesar Fuguet <[email protected]>
5443a3d to 85b3898
✔️ successful run, report available here.
Well, everything looks good now: (1) I made the modification to comply with the new parametrization strategy; (2) I've slightly increased the allowed number of gates. @JeanRochCoulon, it's up to you to merge the pull request :) Thank you!
This modification introduces a pending load table that tracks outstanding load operations to the Dcache. The depth of this table is a parameter in the target configuration package.
The ID in the request from the load/store unit must be mirrored by the Dcache in the response. This makes it possible to match a given response to its corresponding request. Responses can be given (by the Dcache) in a different order than that of the requests.
To be able to send one load request per cycle, the pending load table shall have at least 2 entries. This avoids a combinational path between the request and the response interfaces. The impact on area is minor anyway, because each entry of the buffer is roughly 10 bits (FFs).
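As an illustration of the ID-mirroring scheme described above, here is a minimal sketch under assumed signal and module names (not the actual CVA6 pending load table): each outstanding load occupies one table entry, the entry index is sent to the Dcache as the transaction ID, and the Dcache must return that same ID with its possibly out-of-order response so the load unit can retire the matching entry.

```systemverilog
// Minimal sketch, names are illustrative only.
module pending_load_table_sketch #(
  parameter int unsigned NR_ENTRIES = 2,
  parameter int unsigned ID_WIDTH   = (NR_ENTRIES > 1) ? $clog2(NR_ENTRIES) : 1
) (
  input  logic                clk_i,
  input  logic                rst_ni,
  // allocation: a new load request is sent to the Dcache
  input  logic                alloc_valid_i,
  output logic                alloc_ready_o,
  output logic [ID_WIDTH-1:0] alloc_id_o,   // ID attached to the request
  // deallocation: a Dcache response arrives, carrying the mirrored ID
  input  logic                resp_valid_i,
  input  logic [ID_WIDTH-1:0] resp_id_i
);
  logic [NR_ENTRIES-1:0] busy_q;

  // Pick the first free entry; its index becomes the request ID.
  always_comb begin
    alloc_ready_o = 1'b0;
    alloc_id_o    = '0;
    for (int unsigned i = 0; i < NR_ENTRIES; i++) begin
      if (!busy_q[i]) begin
        alloc_ready_o = 1'b1;
        alloc_id_o    = ID_WIDTH'(i);
        break;
      end
    end
  end

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      busy_q <= '0;
    end else begin
      // Mark the entry busy when the request leaves, free it when the
      // response with the mirrored ID comes back (in any order).
      if (alloc_valid_i && alloc_ready_o) busy_q[alloc_id_o] <= 1'b1;
      if (resp_valid_i)                   busy_q[resp_id_i]  <= 1'b0;
    end
  end
endmodule
```

Because allocation only ever picks a currently free entry and a response only ever targets a busy one, the same index cannot be allocated and freed in the same cycle, which keeps the bookkeeping simple.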