Support multiple outstanding load operations to the Dcache #1348
Conversation
❌ failed run, report available here.
Hi @JeanRochCoulon, it looks like some tests (4/18) in the Thales pipeline are failing on this PR. The following tests are failing:
They fail after only a few seconds, so I guess it is a compilation issue. Could you please share the log? Thank you very much.
The CI ran into trouble. Some jobs failed due to a Spike issue. This should be fixed now. I have relaunched the CI. Let's wait and pray.
❌ failed run, report available here.
1 similar comment
❌ failed run, report available here.
7dcdbac to 1e80742
I had a discussion with @Jbalkind earlier today. He was concerned about the choice of having a default of 2 entries in the load buffer: first, because of the additional area, and second, because of the possible verification hole when introducing this new capability while using the current WT and STD caches. Having 2 entries means that the load_unit can send ID=0 and ID=1, and the caches must mirror that ID in the response. Regarding the first issue, we agreed that the overhead of something like 10 bits (FFs) is negligible, I would say. However, I agree that the second point is effectively important. Of course, this PR passes the functional tests in the GitHub and Thales CIs, but it is not certain that this is enough. I guess that a way of diminishing the risk is to use a single entry in the load buffer by default. However, with my current proposed implementation, when there is a single entry, the load unit can send only one load every two cycles. I do not think this is acceptable. Thus, I will improve the PR a little bit to implement a fall-through mode in the load buffer when the number of entries is 1. If the number of entries is bigger than 1, the load buffer is not fall-through anymore, to ease the timing (no combinational path between the request and the response interfaces). This allows a throughput of one load per cycle either way.
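To illustrate the trade-off being described, here is a minimal sketch of a single-entry buffer with an optional fall-through path. This is not the actual CVA6 load buffer; all names are made up for illustration. With FALL_THROUGH set, a word pushed into an empty buffer is visible at the output in the same cycle, sustaining one transfer per cycle at the price of a combinational path from the push (request) side to the pop (response) side; with FALL_THROUGH cleared there is no such path, but a single entry then limits the rate to one transfer every two cycles.

```systemverilog
// Minimal sketch, not the actual CVA6 code: one-entry buffer with an
// optional fall-through path. Names and interface are illustrative.
module one_entry_buffer #(
  parameter bit          FALL_THROUGH = 1'b0,
  parameter int unsigned WIDTH        = 8
) (
  input  logic             clk_i,
  input  logic             rst_ni,
  // push side (a new request is issued)
  input  logic             push_valid_i,
  output logic             push_ready_o,
  input  logic [WIDTH-1:0] push_data_i,
  // pop side (the entry is consumed, e.g. when a response arrives)
  output logic             pop_valid_o,
  input  logic             pop_ready_i,
  output logic [WIDTH-1:0] pop_data_o
);
  logic             full_q;
  logic [WIDTH-1:0] data_q;

  // A new word is accepted only when the single entry is free.
  assign push_ready_o = !full_q;

  // Fall-through: an empty buffer exposes the incoming word immediately.
  assign pop_valid_o = full_q || (FALL_THROUGH && push_valid_i);
  assign pop_data_o  = full_q ? data_q : push_data_i;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      full_q <= 1'b0;
      data_q <= '0;
    end else if (!full_q) begin
      // Store the incoming word unless it falls through and is consumed now.
      if (push_valid_i && !(FALL_THROUGH && pop_ready_i)) begin
        data_q <= push_data_i;
        full_q <= 1'b1;
      end
    end else if (pop_ready_i) begin
      // The stored word leaves when the consumer accepts it.
      full_q <= 1'b0;
    end
  end
endmodule
```

With two or more registered entries, the same one-per-cycle rate can be reached without the fall-through path, which is the trade-off discussed in this comment.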
❌ failed run, report available here.
1e80742 to 7e93829
I implemented the fall-through mode as explained above (it is only used when the number of entries in the load buffer is 1). However, I kept the default value of 2 entries in all configurations, as I really think that the fall-through mode would create a very tight timing path. I ran some tests on my side using the three different cache subsystems: WB, WT and HPDCACHE. No issue found so far. I ran the tests on both configurations: single-entry load buffer and 2-entry load buffer.
❌ failed run, report available here.
Hello @cfuguet |
Question: will this buffer speed up loads (when the HPDcache is not used)?
core/include/ariane_pkg.sv
Outdated
@@ -667,6 +667,9 @@ package ariane_pkg;
    logic vfp; // is this a vector floating-point instruction?
  } scoreboard_entry_t;

  // Maximum number of inflight memory load requests
  localparam int unsigned NR_LOAD_BUFFER_ENTRIES = cva6_config_pkg::CVA6ConfigNrLoadBufEntries;
To be compliant with the new parametrization strategy, this user parameter should be added to cva6_cfg_t in config_pkg.sv and removed from ariane_pkg.sv.
Thank you @JeanRochCoulon for catching this compliance issue. I will do the modification.
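For reference, a hypothetical sketch of what this compliance change could look like; the field name NrLoadBufEntries is an assumption for illustration, not the actual CVA6 code. The load-buffer depth would become a field of the cva6_cfg_t structure in config_pkg.sv, and the localparam in ariane_pkg.sv would go away.

```systemverilog
// Hypothetical excerpt of config_pkg.sv: only the new field is shown and
// its name is illustrative.
package config_pkg;
  typedef struct packed {
    // ... existing user parameters ...
    int unsigned NrLoadBufEntries;  // depth of the pending load table
  } cva6_cfg_t;
endpackage
```

The load unit would then read this value from its cva6_cfg_t parameter instead of referencing the ariane_pkg localparam.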
No. If the WT and WB caches are not modified to support multiple load misses, this does not provide any benefit. It is useful in a cache like the HPDcache that can accept multiple outstanding loads while concurrently processing one or multiple misses.
This is a tricky choice. First, there is currently an increase of 300 gates (according to the CI's report). This is mainly because I kept a default of 2 entries in the load buffer to enable a throughput of 1 load per cycle. The advantage of having two entries is that we get this 1 load/cycle throughput without any timing path between the request and the response. We can remove this overhead by using 1 entry by default, but if we still want a throughput of 1 load/cycle, this requires a timing path between the request and the response. This is a RAM-to-RAM path, thus potentially critical. Before my modifications, the load unit was somewhat fragile. I would say that it was working because of some lucky circumstances: it was indeed able to provide a throughput of 1 load/cycle in case of a cache hit, but in case of a miss, it expected the data cache to apply back-pressure by setting the ready signal to 0. If the data cache is able to accept new requests during a miss, the original implementation overwrites the load buffer and everything breaks. That is why, in my humble opinion, keeping the original implementation is not a good thing (even if it works because the current WT and WB caches luckily block a given requester after a miss). We can consider that the ready/valid protocol was not respected. In summary, I would suggest accepting the increase of 300 gates with a 2-entry load buffer to keep the 1 load/cycle throughput without any impact on timing. If this increase is not acceptable, we can use a 1-entry load buffer, but with an important impact on timing. What do you think @JeanRochCoulon, @Jbalkind, @zarubaf?
The tiny increase seems reasonable to me. My concern is just to make sure that we don't have other assumptions about only issuing 1 load that could be violated. I think that's for people beyond just César to think about.
Thank you for the explanations, @cfuguet. 300 gates, that's quite small. I approve.
7e93829 to 5443a3d
❌ failed run, report available here.
The ID in the request from the load/store unit must be mirrored by the Dcache in the response. This makes it possible to match a given response to its corresponding request. Responses can be given (by the Dcache) in a different order than that of the requests. This modification introduces a pending load table that tracks outstanding load operations to the Dcache. The depth of this table is a parameter in the target configuration package. Signed-off-by: Cesar Fuguet <[email protected]>
Signed-off-by: Cesar Fuguet <[email protected]>
5443a3d to 85b3898
✔️ successful run, report available here.
Well, everything looks good now: (1) I made the modification to comply with the new parametrization strategy; (2) I've slightly increased the allowed number of gates. @JeanRochCoulon, it's up to you to merge the pull request :) Thank you!
This modification introduces a pending load table that tracks outstanding load operations to the Dcache. The depth of this table is a parameter in the target configuration package.
The ID in the request from the load/store unit must be mirrored by the Dcache in the response. This makes it possible to match a given response to its corresponding request. Responses can be given (by the Dcache) in a different order than that of the requests.
To be able to send one load request per cycle, the pending load table shall have at least 2 entries. This avoids a combinational path between the request and the response interfaces. The impact on area is minor anyway, because each entry of the buffer is roughly 10 bits (FFs).
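As an illustration of the ID-mirroring scheme described above, here is a minimal sketch under assumed signal and module names (not the actual CVA6 pending load table): each outstanding load occupies one table entry, the entry index is sent to the Dcache as the transaction ID, and the Dcache must return that same ID with its possibly out-of-order response so the load unit can retire the matching entry.

```systemverilog
// Minimal sketch, names are illustrative only.
module pending_load_table_sketch #(
  parameter int unsigned NR_ENTRIES = 2,
  parameter int unsigned ID_WIDTH   = (NR_ENTRIES > 1) ? $clog2(NR_ENTRIES) : 1
) (
  input  logic                clk_i,
  input  logic                rst_ni,
  // allocation: a new load request is sent to the Dcache
  input  logic                alloc_valid_i,
  output logic                alloc_ready_o,
  output logic [ID_WIDTH-1:0] alloc_id_o,   // ID attached to the request
  // deallocation: a Dcache response arrives, carrying the mirrored ID
  input  logic                resp_valid_i,
  input  logic [ID_WIDTH-1:0] resp_id_i
);
  logic [NR_ENTRIES-1:0] busy_q;

  // Pick the first free entry; its index becomes the request ID.
  always_comb begin
    alloc_ready_o = 1'b0;
    alloc_id_o    = '0;
    for (int unsigned i = 0; i < NR_ENTRIES; i++) begin
      if (!busy_q[i]) begin
        alloc_ready_o = 1'b1;
        alloc_id_o    = ID_WIDTH'(i);
        break;
      end
    end
  end

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      busy_q <= '0;
    end else begin
      // Mark the entry busy when the request leaves, free it when the
      // response with the mirrored ID comes back (in any order).
      if (alloc_valid_i && alloc_ready_o) busy_q[alloc_id_o] <= 1'b1;
      if (resp_valid_i)                   busy_q[resp_id_i]  <= 1'b0;
    end
  end
endmodule
```

Because allocation only ever picks a currently free entry and a response only ever targets a busy one, the same index cannot be allocated and freed in the same cycle, which keeps the bookkeeping simple.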