
Support multiple outstanding load operations to the Dcache #1348

Merged

Conversation

@cfuguet (Contributor) commented Aug 28, 2023

This modification introduces a pending load table that tracks outstanding load operations to the Dcache. The depth of this table is a parameter in the target configuration package.

The ID in the request from the load/store unit must be mirrored by the Dcache in the response. This makes it possible to match a given response to its corresponding request. Responses can be given (by the Dcache) in a different order than that of the requests.

To be able to send one load request per cycle, the pending load table shall have at least 2 entries. This avoids a combinational path between the request and the response interfaces. The impact on area is minor anyway, because each entry of the buffer takes roughly 10 bits (FFs).
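For illustration only, a minimal sketch of such a pending load table is given below. The module, signal, and parameter names are assumptions made for this example; it is not the actual CVA6 code.

  // Illustrative sketch only -- not the actual CVA6 implementation.
  // An entry is allocated when the Dcache accepts a load request and is freed
  // when the response carrying the same (mirrored) ID comes back, possibly
  // out of order. DEPTH >= 2 is assumed here.
  module pending_load_table #(
    parameter int unsigned DEPTH         = 2,
    parameter int unsigned TRANS_ID_BITS = 5
  ) (
    input  logic                     clk_i,
    input  logic                     rst_ni,
    // request side (load unit -> Dcache)
    input  logic                     req_valid_i,
    output logic                     req_ready_o,
    input  logic [TRANS_ID_BITS-1:0] req_trans_id_i,   // scoreboard transaction ID
    output logic [$clog2(DEPTH)-1:0] req_dcache_id_o,  // ID mirrored by the Dcache
    // response side (Dcache -> load unit)
    input  logic                     rsp_valid_i,
    input  logic [$clog2(DEPTH)-1:0] rsp_dcache_id_i,
    output logic [TRANS_ID_BITS-1:0] rsp_trans_id_o
  );
    logic [DEPTH-1:0]                    valid_q;
    logic [DEPTH-1:0][TRANS_ID_BITS-1:0] trans_id_q;

    // allocate the lowest free entry; its index is the ID sent to the Dcache
    always_comb begin
      req_ready_o     = 1'b0;
      req_dcache_id_o = '0;
      for (int unsigned i = 0; i < DEPTH; i++) begin
        if (!valid_q[i]) begin
          req_ready_o     = 1'b1;
          req_dcache_id_o = i[$clog2(DEPTH)-1:0];
          break;
        end
      end
    end

    // the response names an entry; return the scoreboard ID stored in it
    assign rsp_trans_id_o = trans_id_q[rsp_dcache_id_i];

    always_ff @(posedge clk_i or negedge rst_ni) begin
      if (!rst_ni) begin
        valid_q <= '0;
      end else begin
        if (req_valid_i && req_ready_o) begin
          valid_q[req_dcache_id_o]    <= 1'b1;
          trans_id_q[req_dcache_id_o] <= req_trans_id_i;
        end
        if (rsp_valid_i) valid_q[rsp_dcache_id_i] <= 1'b0;
      end
    end
  endmodule

The real entries presumably carry a few more bookkeeping bits than this sketch, in line with the "roughly 10 bits (FFs)" per entry mentioned above.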

@github-actions (bot)

❌ failed run, report available here.

@cfuguet (Contributor, Author) commented Aug 29, 2023

Hi @JeanRochCoulon, it looks like some tests (4/18) in Thales's pipeline are failing on this PR. The following tests are failing:

  • Environment check
  • Compliance cv64a6_imafdc_sv39
  • Riscv-test cv64a6_imafdc_sv39 (physical)
  • ASIC Synthesis cv32a6_embedded

They fail after only a few seconds, so I guess it is a compilation issue. Could you please share the log?

Thank you very much

@JeanRochCoulon (Contributor)

The CI ran into trouble. Some jobs failed due to a Spike issue. This should be fixed now. I have relaunched the CI. Let's wait and pray.

@github-actions (bot)

❌ failed run, report available here.

1 similar comment from @github-actions (bot).

@cfuguet cfuguet force-pushed the cfuguet_support_multiple_loads branch from 7dcdbac to 1e80742 Compare August 30, 2023 17:38
@cfuguet (Contributor, Author) commented Aug 30, 2023

I had a discussion with @Jbalkind earlier today. He was concerned about the choice of a default of 2 entries in the load buffer: first, because of the additional area, and second, because of a possible verification hole when introducing this new capability while using the current WT and STD caches. Having 2 entries means that the load unit can send ID=0 and ID=1, and the caches must mirror that ID in the response.

Regarding the first point, we agreed that an overhead of roughly 10 bits (FFs) is negligible. However, I agree that the second point is important. This PR passes the functional tests in the GitHub and Thales CIs, but it is not certain that this is enough.

One way to reduce the risk is to use a single entry in the load buffer by default. However, with my current implementation, when there is a single entry the load unit can only send one load every two cycles, which I do not think is acceptable. I will therefore improve the PR to implement a fall-through mode in the load buffer when the number of entries is 1. When the number of entries is greater than 1, the load buffer is not fall-through, which eases timing (no combinational path between the request and the response interfaces). This allows a throughput of one load per cycle either way.
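As an illustration of that plan, and reusing the hypothetical names of the sketch above, the request-ready logic of that sketch could be split per configuration roughly as follows (again an assumption-laden sketch, not the PR's actual code):

  // Illustrative only. DEPTH == 1: the single entry freed by a response can be
  // re-allocated in the same cycle (fall-through), keeping one load per cycle
  // at the cost of a combinational path from the response interface to the
  // request interface. DEPTH > 1: ready depends only on registered state, so
  // no such path exists, and one load per cycle still works because a second
  // entry is free while the first waits for its response.
  if (DEPTH == 1) begin : gen_fall_through
    assign req_ready_o = !valid_q[0] || rsp_valid_i;
  end else begin : gen_registered
    assign req_ready_o = !(&valid_q);  // at least one entry is free
  end
  // Note: with fall-through, the sequential update must give priority to the
  // new allocation when the same entry is freed and re-allocated in one cycle.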

@github-actions (bot)

❌ failed run, report available here.

@cfuguet cfuguet force-pushed the cfuguet_support_multiple_loads branch from 1e80742 to 7e93829 Compare August 30, 2023 22:15
@cfuguet (Contributor, Author) commented Aug 30, 2023

I implemented the fall-through mode as explained above (only used when the number of entries in the load buffer is 1). However, I kept the default value of 2 entries in all configurations, because I really think that the fall-through mode creates a very tight timing path: the request valid signal depends on the response valid signal, the response valid signal depends on the output of the cache directory SRAM (hit logic), and the request valid signal then drives the chip-select logic of the directory and data SRAMs. These are therefore RAM-to-RAM paths.

I ran some tests on my side using the three different cache subsystems: WB, WT and HPDCACHE. No issues found so far. I ran the tests with both configurations: single-entry load buffer and 2-entry load buffer.

@github-actions (bot)

❌ failed run, report available here.

@JeanRochCoulon (Contributor) commented Aug 31, 2023

Hello @cfuguet,
I appreciate your responsiveness!
First, I saw that the synthesis failed; that is because the gate count is higher. This could be fixed by gating your modification with the HPDCache configuration, or by increasing the expected gate count value in https://github.com/openhwgroup/cva6/blob/master/.gitlab-ci/expected_synth.yml
Second, I would be pleased to know your HPDcache insertion roadmap. In how many PRs do you plan to insert the HPDCache? And when? Thanks. We can discuss it outside GitHub if needed.

@JeanRochCoulon (Contributor) left a review comment

Question: will this buffer speed up the loads (when the HPDcache is not used)?

@@ -667,6 +667,9 @@ package ariane_pkg;
     logic vfp; // is this a vector floating-point instruction?
   } scoreboard_entry_t;
 
+  // Maximum number of inflight memory load requests
+  localparam int unsigned NR_LOAD_BUFFER_ENTRIES = cva6_config_pkg::CVA6ConfigNrLoadBufEntries;

@JeanRochCoulon (Contributor) commented on the diff

To be compliant with the new parametrization strategy, this user parameter should be added to cva6_cfg_t in config_pkg.sv and removed from ariane_pkg.sv.
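For illustration, moving the parameter into cva6_cfg_t could look roughly like this; the field name is an assumption for this sketch, and the many existing fields of the structure are omitted:

  // Hypothetical sketch -- not the actual CVA6 code. The user parameter becomes
  // a field of cva6_cfg_t in config_pkg.sv instead of a localparam in
  // ariane_pkg.sv, and each target configuration package fills it in.
  package config_pkg;
    typedef struct packed {
      // ... existing configuration fields ...
      int unsigned NrLoadBufEntries;  // depth of the pending load table
    } cva6_cfg_t;
  endpackage

  // In a target configuration package (illustrative):
  //   localparam config_pkg::cva6_cfg_t CVA6Cfg = '{
  //     // ...
  //     NrLoadBufEntries: CVA6ConfigNrLoadBufEntries
  //   };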

@cfuguet (Contributor, Author) replied

Thank you @JeanRochCoulon for catching this compliance issue. I will do the modification.

@cfuguet (Contributor, Author) commented Aug 31, 2023

Question: will this buffer speed up the loads (when the HPDcache is not used)?

No. If the WT and WB caches are not modified to support multiple load misses, this does not provide any benefit. It is useful with a cache like the HPDcache, which can accept multiple outstanding loads while concurrently processing one or more misses.

@cfuguet (Contributor, Author) commented Aug 31, 2023

I saw that the synthesis failed; that is because the gate count is higher. This could be fixed by gating your modification with the HPDCache configuration. Or by increasing the expected gate count value

This is a tricky choice.

First, there is currently an increase of 300 gates (according to the CI's report). This is mainly because I kept 2 entries in the load buffer by default to enable a throughput of 1 load per cycle. The advantage of having two entries is that we get this 1 load/cycle throughput without any timing path between the request and the response.

We could remove this overhead by using 1 entry by default. But if we still want a throughput of 1 load/cycle, this requires a timing path between the request and the response. This is a RAM-to-RAM path, thus potentially critical.

Before my modifications, the load unit was somewhat fragile. I would say that it worked thanks to some lucky circumstances. It was indeed able to provide a throughput of 1 load/cycle in case of a cache hit, but in case of a miss it expected the data cache to apply back-pressure by driving the ready signal to 0. If the data cache were able to accept new requests during a miss, the original implementation would overwrite the load buffer and everything would break. That is why, in my humble opinion, keeping the original implementation is not a good thing (even if it works, because the current WT and WB caches luckily block a given requester after a miss). We can consider that the ready/valid protocol was not respected.
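To make the broken assumption concrete, here is an illustrative way to state the invariant the original single-entry implementation silently relied on (signal names are assumptions, not the real load-unit interface):

  // Illustrative sketch only. Count loads accepted by the Dcache but not yet
  // answered; the old code implicitly assumed this never exceeds one (i.e. the
  // cache back-pressures on a miss), otherwise its single buffer entry was
  // overwritten. With the pending load table, the bound becomes the table depth.
  logic [3:0] outstanding_q;
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) outstanding_q <= '0;
    else outstanding_q <= outstanding_q
                          + ((req_valid && req_ready) ? 4'd1 : 4'd0)
                          - (rsp_valid ? 4'd1 : 4'd0);
  end
  assert property (@(posedge clk_i) disable iff (!rst_ni)
                   outstanding_q <= NR_LOAD_BUFFER_ENTRIES);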

In summary, I would suggest accepting the increase of 300 gates with a 2-entry load buffer to keep the 1 load/cycle throughput without any impact on timing. If this increase is not acceptable, we can use a 1-entry load buffer, but with a significant impact on timing.

What do you think @JeanRochCoulon, @Jbalkind, @zarubaf ?

@Jbalkind (Contributor)

The tiny increase seems reasonable to me. My concern is just that we make sure that we don't have other assumptions about only issuing 1 load that could be violated. I think that's for people beyond just César to think about.

@JeanRochCoulon (Contributor)

Thank you for the explanations, @cfuguet. 300 gates, that's quite small. I approve.

@cfuguet cfuguet force-pushed the cfuguet_support_multiple_loads branch from 7e93829 to 5443a3d Compare September 1, 2023 07:42
@github-actions (bot) commented Sep 1, 2023

❌ failed run, report available here.

The ID in the request from the load/store unit must be mirrored by the Dcache in the response. This makes it possible to match a given response to its corresponding request. Responses can be given (by the Dcache) in a different order than that of the requests.

This modification introduces a pending load table that tracks outstanding load operations to the Dcache. The depth of this table is a parameter in the target configuration package.
Signed-off-by: Cesar Fuguet <[email protected]>
@cfuguet cfuguet force-pushed the cfuguet_support_multiple_loads branch from 5443a3d to 85b3898 Compare September 1, 2023 12:00
@github-actions (bot) commented Sep 1, 2023

✔️ successful run, report available here.

@cfuguet (Contributor, Author) commented Sep 1, 2023

Well, everything looks good now: (1) I made the modification to comply with the new parametrization strategy; (2) I slightly increased the allowed number of gates.

@JeanRochCoulon, it's up to you to merge the pull request :)

Thank you !

@JeanRochCoulon JeanRochCoulon merged commit d9ad16b into openhwgroup:master Sep 2, 2023
6 checks passed
@cfuguet cfuguet deleted the cfuguet_support_multiple_loads branch September 2, 2023 19:13