about fault-difftest only diff ecall and page-fault #235

Open
zybzzz opened this issue Dec 16, 2024 · 11 comments

Comments

@zybzzz
Contributor

zybzzz commented Dec 16, 2024

We can observe that when handling runtime exceptions, only ecall and page-fault related exceptions are difftested in the code.

https://github.com/OpenXiangShan/GEM5/blob/38a5253405119e5e63d3cb6e7fad90b691fa111d/src/cpu/o3/commit.cc#L1567C9-L1600C10

While it is true that in most checkpoints, the only runtime exceptions are ecall and page-fault, a broader consideration is that the difftest for runtime exceptions should detect other types of exceptions as well, but the current code doesn't seem to do so.

So I'm wondering why there is currently no difftest for all faults. Is it because the NEMU reference model does not yet fully implement the behavior of other exceptions?

@shinezyy @tastynoob @jueshiwenli

@eastonman
Member

We have difftest of other exceptions on RTL-NEMU cosim. I suppose this is just not implemented.

@zybzzz
Contributor Author

zybzzz commented Dec 16, 2024

> We have difftest of other exceptions on RTL-NEMU cosim. I suppose this is just not implemented.

What is RTL-NEMU cosim? I don't really understand.

I think what you mean is that NEMU already has what is needed to difftest other exceptions, but the corresponding difftest code has not yet been implemented in gem5. Is this correct?

@tastynoob
Collaborator

NEMU has a mechanism called "guide execute": when GEM5 takes a trap, the trap is synced to NEMU.

@eastonman
Member

The code seems to just override NEMU's exception, and only does a compare on ecall.

@eastonman
Member

eastonman commented Dec 16, 2024

Not sure why this is implemented like this. On XiangShan-RTL, I believe only interrupts override NEMU's trap? Other traps are compared between RTL and NEMU.

@jueshiwenli
Collaborator

A page fault is a special case. Due to out-of-order execution, there might be scenarios where gem5 triggers a page fault, but the reference model (NEMU) does not. Both situations are theoretically correct. In such cases, we need to synchronize gem5's information with NEMU and continue execution. At this point, we assume gem5's page fault behavior is correct in theory. To prevent gem5 from stalling, we have defined in the reference model that the same address should not trigger more than five consecutive page faults. However, we still recommend manually checking the behavior at the corresponding location when there is a mismatch in page fault occurrences between the two models to ensure correctness.
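The "no more than five consecutive page faults on the same address" rule can be pictured with a small counter. This is only a sketch under stated assumptions: the class `PageFaultGuard` and its methods are hypothetical names, not NEMU or GEM5 code, and the real reference model may track the limit differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not NEMU code): counts page faults that hit the
// same address back to back and reports when the run exceeds a limit.
class PageFaultGuard {
  public:
    explicit PageFaultGuard(unsigned limit = 5) : limit_(limit) {
        assert(limit_ > 0);
    }

    // Record a page fault at `addr`. Returns true while the run of
    // consecutive faults on this address is still within the limit,
    // and false once the limit has been exceeded.
    bool record(std::uint64_t addr) {
        if (count_ > 0 && addr == lastAddr_) {
            ++count_;
        } else {
            lastAddr_ = addr;
            count_ = 1;
        }
        return count_ <= limit_;
    }

    // A commit without a page fault breaks the run.
    void reset() { count_ = 0; }

    unsigned count() const { return count_; }

  private:
    unsigned limit_;
    std::uint64_t lastAddr_ = 0;
    unsigned count_ = 0;
};
```

With the default limit, the first five calls to `record` on one address return true and the sixth returns false, which is the point where a manual check of the mismatch would be warranted.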

@zybzzz
Contributor Author

zybzzz commented Dec 16, 2024

https://github.com/OpenXiangShan/GEM5/blob/38a5253405119e5e63d3cb6e7fad90b691fa111d/src/cpu/o3/commit.cc#L169C1-L175C60

if (faultNum.find(exception_no) != faultNum.end()) {

I understand guide exec, but guide exec is only applied to page-fault related exceptions (per the if condition above), and Commit::diffInst is only called for ecall.

Other exceptions, such as exception code 2 (illegal instruction), therefore have no difftest mechanism when they are triggered, which is dangerous.
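To make the split described above concrete, here is a minimal sketch, not the actual GEM5 code. The enum `DiffAction`, the function `classifyTrap`, and both sets are hypothetical. The numbers are the base RISC-V privileged-spec exception codes (8/9/11 for ecall from U/S/M mode, 12/13/15 for instruction/load/store page faults); the real `faultNum` set may contain different entries.

```cpp
#include <cstdint>
#include <set>

// What the difftest path does for a given trap (illustrative only).
enum class DiffAction { GuidedExec, Compare, None };

// Assumption: base privileged-spec codes only, no hypervisor codes.
const std::set<std::uint64_t> kPageFaultCodes = {12, 13, 15};
const std::set<std::uint64_t> kEcallCodes = {8, 9, 11};

DiffAction classifyTrap(std::uint64_t exception_no) {
    // Same lookup shape as the quoted faultNum.find(...) check.
    if (kPageFaultCodes.find(exception_no) != kPageFaultCodes.end()) {
        return DiffAction::GuidedExec;  // sync the trap to the reference model
    }
    if (kEcallCodes.find(exception_no) != kEcallCodes.end()) {
        return DiffAction::Compare;     // explicit difftest on ecall
    }
    return DiffAction::None;            // e.g. code 2, illegal instruction
}
```

In this sketch `classifyTrap(2)` falls through to `DiffAction::None`, which is the gap this comment is pointing at.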


My concern is that, for exceptions other than ecall and page fault, without an explicit Commit::diffInst call NEMU's instruction execution falls behind gem5's. For example, gem5's pc is already at the entry address of the exception handler while NEMU's pc is still parked on the instruction that caused the exception.

https://github.com/OpenXiangShan/GEM5/blob/38a5253405119e5e63d3cb6e7fad90b691fa111d/src/cpu/base.cc#L1469C1-L1483C18

However, from what I observe, difftestStep in base.cc checks for a pc mismatch. On a mismatch, NEMU executes one more step and the difftest is performed again, which seems to resolve the pc mismatch between gem5 and NEMU when handling exceptions.
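The catch-up behavior observed here can be sketched as follows. This reflects only the reading of difftestStep given in this comment, not the actual code in base.cc; `ToyRef` and `diffPcWithCatchUp` are hypothetical names, and the reference model is reduced to a pre-recorded list of pcs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy reference model: walks a pre-recorded sequence of pcs.
struct ToyRef {
    std::vector<std::uint64_t> trace;
    std::size_t idx = 0;

    std::uint64_t pc() const {
        assert(idx < trace.size());
        return trace[idx];
    }

    // Advance by one instruction, stopping at the end of the trace.
    void step() {
        if (idx + 1 < trace.size()) ++idx;
    }
};

// Compare pcs; on a mismatch, let the reference take one extra step
// and compare once more. Returns true if the pcs agree in the end.
bool diffPcWithCatchUp(ToyRef &ref, std::uint64_t dutPc) {
    if (ref.pc() == dutPc) return true;
    ref.step();
    return ref.pc() == dutPc;
}
```

If the reference is parked on the faulting instruction and its next pc is the handler entry, a single extra step brings the two models back in sync; a mismatch that survives the extra step is still reported.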

@eastonman @tastynoob @jueshiwenli

@eastonman
Member

> A page fault is a special case. Due to out-of-order execution, there might be scenarios where gem5 triggers a page fault, but the reference model (NEMU) does not. Both situations are theoretically correct. In such cases, we need to synchronize gem5's information with NEMU and continue execution. At this point, we assume gem5's page fault behavior is correct in theory. To prevent gem5 from stalling, we have defined in the reference model that the same address should not trigger more than five consecutive page faults. However, we still recommend manually checking the behavior at the corresponding location when there is a mismatch in page fault occurrences between the two models to ensure correctness.

Why does this happen? Is this related to multi-core page table updates?

@zybzzz
Contributor Author

zybzzz commented Dec 23, 2024

> A page fault is a special case. Due to out-of-order execution, there might be scenarios where gem5 triggers a page fault, but the reference model (NEMU) does not. Both situations are theoretically correct. In such cases, we need to synchronize gem5's information with NEMU and continue execution. At this point, we assume gem5's page fault behavior is correct in theory. To prevent gem5 from stalling, we have defined in the reference model that the same address should not trigger more than five consecutive page faults. However, we still recommend manually checking the behavior at the corresponding location when there is a mismatch in page fault occurrences between the two models to ensure correctness.

Hello,

I am also unsure when this situation occurs and would like to understand the conditions that trigger it. For context, I am using NEMU to design an idealized model for value-predictor research. This idealized model executes instructions ahead of gem5 and obtains their results.

In my design, I expect NEMU to stop automatically when it encounters an exception. However, the situation you described above, where NEMU does not raise an exception in certain page-fault cases, makes my idealized model much more complex and may even make the design infeasible.

Therefore, I'd like to ask under what specific conditions this can occur. Can it happen only in a multi-core scenario, or in a single-core scenario as well? And if gem5 and NEMU only support RVG, with the Vector (V) and Hypervisor (H) extensions disabled, can it still happen?

Thank you.

@jueshiwenli

@jueshiwenli
Collaborator

If there is no sfence.vma, a later load/store that accesses page N may use the stale (invalid) page table entry from before the update, resulting in a page fault.
[Image attachment: diagram of the scenario described above]
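The stale-entry scenario can be reproduced with a toy model. This is a sketch under stated assumptions: `ToyMmu` and its members are hypothetical, translation is reduced to a single valid bit per virtual page number, and the toy TLB is allowed to cache invalid entries. It is not how gem5 or NEMU implement address translation.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy MMU: a page table in "memory" plus a TLB that caches entries.
struct ToyMmu {
    std::unordered_map<std::uint64_t, bool> pageTable;  // vpn -> valid bit
    std::unordered_map<std::uint64_t, bool> tlb;        // cached copies

    // Timing-model view: a cached entry wins, even if it is stale.
    // Returns true if the access raises a page fault.
    bool accessFaults(std::uint64_t vpn) {
        auto hit = tlb.find(vpn);
        if (hit == tlb.end()) {
            auto pte = pageTable.find(vpn);
            bool valid = (pte != pageTable.end()) && pte->second;
            hit = tlb.emplace(vpn, valid).first;
        }
        return !hit->second;
    }

    // Reference-model view: always walks the current page table.
    bool walkFaults(std::uint64_t vpn) const {
        auto pte = pageTable.find(vpn);
        return pte == pageTable.end() || !pte->second;
    }

    // Models sfence.vma by dropping every cached translation.
    void sfenceVma() { tlb.clear(); }
};
```

If page N is touched while invalid, then marked valid in `pageTable` without calling `sfenceVma()`, `accessFaults(N)` still reports a fault while `walkFaults(N)` does not. After `sfenceVma()` the two views agree again.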

@zybzzz
Contributor Author

zybzzz commented Dec 24, 2024

Thank you for your response. I am trying to understand the issue described in the diagram.

As I understand it, this problem seems to stem from the different abstraction levels of the reference model and the actual model. The actual model includes a store buffer, whereas the reference model has no such structure. As a result, the reference model (NEMU) always triggers the store page fault, while in the actual model (gem5) the store page fault may be delayed, leading to the issue described in the diagram.

@jueshiwenli
