Large-scale PowerPC recompiler rework #641
base: main
What would be the scope of switching the x64 emitter over to something like Xbyak? With the current x64 emitter, adding a new instruction or class of instructions means implementing the encoding for those instructions (REX, VEX, EVEX, ModR/M, SIB, etc.) from scratch, then implementing the new instruction itself AND detecting the relevant CPUID flags, when this redundant work could probably just be pushed onto a proven library.
Thanks for pointing out Xbyak, I wasn't aware of it. The assemblers I looked at were always a bit overkill for our purposes, usually focusing on a human-friendly API rather than a simple interface for machine-generated code. We only need a very thin emitter, and Xbyak seems to be exactly that. As part of this rework I also started a new "cleaner" high-performance x86-64 emitter, which I auto-generate from encoding tables. The effort for this is relatively small, but using a premade emitter would certainly cut it down even further. I'll think about it.
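To make the "redundant work" from the question concrete: even the simplest register-to-register form needs a REX prefix and a ModR/M byte computed by hand. Below is a minimal sketch of a table-driven emitter for 64-bit "OP r/m64, r64" forms. The names (RegRegForm, emitRegReg, emitByName) are hypothetical; this is neither Cemu's emitter nor Xbyak's API, and VEX/EVEX encodings, memory operands with SIB bytes and CPUID gating are all left out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// A generator could produce this table from encoding data; here it is written by hand.
// Each entry is an "OP r/m64, r64" form encoded as REX.W + opcode + ModR/M (/r).
struct RegRegForm {
    const char* mnemonic;
    uint8_t opcode;
};

constexpr RegRegForm kRegRegForms[] = {
    {"add", 0x01}, {"or", 0x09}, {"and", 0x21}, {"sub", 0x29}, {"xor", 0x31}, {"mov", 0x89},
};

// Registers use their hardware numbering: 0 = RAX, 1 = RCX, ..., 8 = R8, ..., 15 = R15.
// Both dst and src must be below 16.
void emitRegReg(std::vector<uint8_t>& out, uint8_t opcode, unsigned dst, unsigned src) {
    // REX = 0100WRXB: W=1 selects 64-bit operands, R extends ModR/M.reg (src),
    // B extends ModR/M.rm (dst). X only matters with a SIB byte, which
    // register-direct operands (mod = 11) never need.
    out.push_back(static_cast<uint8_t>(0x48 | ((src >> 3) << 2) | (dst >> 3)));
    out.push_back(opcode);
    // ModR/M: mod = 11 (register direct), reg = low 3 bits of src, rm = low 3 bits of dst.
    out.push_back(static_cast<uint8_t>(0xC0 | ((src & 7) << 3) | (dst & 7)));
}

// Looks up a mnemonic in the table; returns false if it is not a known reg-reg form.
bool emitByName(std::vector<uint8_t>& out, const char* mnemonic, unsigned dst, unsigned src) {
    for (const RegRegForm& f : kRegRegForms) {
        if (std::strcmp(f.mnemonic, mnemonic) == 0) {
            emitRegReg(out, f.opcode, dst, src);
            return true;
        }
    }
    return false;
}
```

For example, emitByName(buf, "add", 0, 1) appends 48 01 C8 (add rax, rcx), and emitByName(buf, "add", 8, 9) appends 4D 01 C8 (add r8, r9). Every further encoding class (VEX for BMI2, EVEX for AVX-512, memory operands) multiplies this kind of code, which is the work a library such as Xbyak already covers.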
Did you drop this project?
Nah, just busy with other stuff. I'll get back to this eventually.
Thanks! ARM64 support would make the Cemu emulator finally done and future-proof!
On ARM64: I've been using oaknut on other projects. It is structured very similarly to Xbyak.
Will this finally fix the lens flare issue in The Wind Waker HD and Twilight Princess HD?
That's a graphical issue. It's unaffected by this CPU rework.
Intermediate commit while I'm still fixing things, but I didn't want to pile too many changes into a single commit.

New: Reworked the PPC->IML converter to first create a graph of basic blocks and then turn those into IML segment(s). This was mainly done to decouple the IML design from PPC-specific knowledge like branch target addresses. The previous design also couldn't preserve cycle counting properly in all cases, since it was based on counting IML instructions. The new solution supports functions with a non-contiguous body. A pretty common example of this is a function that ends with a trailing B instruction to some other place.

Current limitations:
- BL inlining not implemented
- MFTB not implemented
- BCCTR and BCLR are only partially implemented

Undo vcpkg change
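A rough sketch of the first step described above, building a basic-block graph from guest code before any IR exists, might look like the following. The instruction model (GuestInstr, Kind) and function names are hypothetical simplifications, not Cemu's PPC decoder. Blocks are discovered by following branches rather than by scanning one contiguous address range, so a trailing B into another region simply yields another block, and each block records its guest instruction count so that cycle accounting does not depend on how many IR instructions are generated later.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical, heavily simplified guest instruction model (not Cemu's decoder).
enum class Kind { Plain, Branch, CondBranch, Return };
struct GuestInstr {
    Kind kind;
    uint32_t target = 0;  // branch target for Branch/CondBranch
};

struct BasicBlock {
    uint32_t start = 0;
    uint32_t guestInstrCount = 0;  // cycles charged per guest instruction, not per IR op
    std::vector<uint32_t> successors;
};

// Builds the block graph of the function at `entry`. `code` maps guest
// addresses to decoded instructions; every instruction is 4 bytes.
std::map<uint32_t, BasicBlock> buildBlockGraph(const std::map<uint32_t, GuestInstr>& code,
                                               uint32_t entry) {
    // Phase 1: follow control flow from the entry and collect block start addresses.
    std::set<uint32_t> leaders{entry};
    std::set<uint32_t> visited;
    std::vector<uint32_t> worklist{entry};
    while (!worklist.empty()) {
        uint32_t addr = worklist.back();
        worklist.pop_back();
        while (code.count(addr) && visited.insert(addr).second) {
            const GuestInstr& ins = code.at(addr);
            if (ins.kind == Kind::Branch || ins.kind == Kind::CondBranch) {
                leaders.insert(ins.target);  // target may lie far away (non-contiguous body)
                worklist.push_back(ins.target);
            }
            if (ins.kind == Kind::CondBranch) leaders.insert(addr + 4);  // fall-through path
            if (ins.kind == Kind::Branch || ins.kind == Kind::Return) break;
            addr += 4;
        }
    }
    // Phase 2: cut the reachable code into blocks at the leaders.
    std::map<uint32_t, BasicBlock> blocks;
    for (uint32_t start : leaders) {
        if (!code.count(start)) continue;  // target outside the known code; left unresolved
        BasicBlock bb;
        bb.start = start;
        for (uint32_t addr = start;; addr += 4) {
            const GuestInstr& ins = code.at(addr);
            ++bb.guestInstrCount;
            if (ins.kind == Kind::Return) break;
            if (ins.kind == Kind::Branch) {
                bb.successors.push_back(ins.target);
                break;
            }
            if (ins.kind == Kind::CondBranch) {
                bb.successors.push_back(ins.target);
                bb.successors.push_back(addr + 4);
                break;
            }
            if (leaders.count(addr + 4) || !code.count(addr + 4)) {
                if (code.count(addr + 4)) bb.successors.push_back(addr + 4);  // falls into next block
                break;
            }
        }
        blocks.emplace(start, bb);
    }
    return blocks;
}
```

For a function at 0x100 whose conditional branch at 0x104 jumps to a trailing B at 0x110, which in turn continues at 0x200, this produces four blocks (0x100, 0x108, 0x110 and 0x200) without the IR ever needing to know about branch target addresses.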
Instead of having fixed macros for BCCTR/BCCTRL/BCLR/BCLRL, we now have a single macro instruction that takes the jump destination as a register parameter. This also allows us to reuse an already loaded LR register (e.g. one set by MTLR) instead of loading it again from memory. As a necessary prerequisite for this, the register allocator now supports read operations in suffix instructions.
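Why suffix reads matter for register allocation can be illustrated with a generic live-interval computation. In this hypothetical model (Instr, Segment and computeLiveIntervals are illustrative names, not Cemu's IML or allocator), the suffix instruction is an indirect jump that ends the segment and reads the register holding the LR value. If that read were ignored, the interval for the register would end at the MTLR-style move, and an allocator could reuse its host register before the jump executes.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical IR model for illustration only.
struct Instr {
    std::string name;
    std::vector<int> defs;  // virtual registers written
    std::vector<int> uses;  // virtual registers read
};

struct Segment {
    std::vector<Instr> body;
    std::optional<Instr> suffix;  // e.g. an indirect jump that ends the segment
};

// Computes a live interval [first, last] (instruction indices) for each virtual
// register. The suffix instruction sits at index body.size().
std::map<int, std::pair<int, int>> computeLiveIntervals(const Segment& seg) {
    std::map<int, std::pair<int, int>> intervals;
    auto touch = [&](int reg, int idx) {
        auto it = intervals.find(reg);
        if (it == intervals.end()) {
            intervals.emplace(reg, std::make_pair(idx, idx));
        } else {
            it->second.first = std::min(it->second.first, idx);
            it->second.second = std::max(it->second.second, idx);
        }
    };
    int idx = 0;
    for (const Instr& in : seg.body) {
        for (int r : in.uses) touch(r, idx);
        for (int r : in.defs) touch(r, idx);
        ++idx;
    }
    // Reads in the suffix extend intervals to the very end of the segment, so the
    // register stays reserved until the jump. (Suffix writes are not modeled here.)
    if (seg.suffix) {
        for (int r : seg.suffix->uses) touch(r, idx);
    }
    return intervals;
}
```

With a body of "v0 = load" and "v1 = v0" (the MTLR), followed by a suffix jump through v1, the interval of v1 becomes [1, 2] instead of [1, 1], so the loaded value can be consumed by the jump directly instead of being reloaded from memory.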
Also removed the associatedPPCAddress field from IMLInstruction, as it's no longer used.
I consider this PR complete. There is more work that can be done, but it's at a good point to merge, so let's do that. Here is a benchmark:

[Benchmark image: previous PPC JIT in Cemu 2.2]
[Benchmark image: the reworked PPC JIT from this PR]

There have also been some general accuracy improvements, and the top post lists all the under-the-hood changes that were made.
@Exzap I've tried the macOS build and it crashed upon loading pipelines in every single game I've tried. If that helps, I've confirmed that this doesn't happen on 2.2.
@boggydigital Can you post the log for other games as well?
I was able to get it to crash by turning off the BMI2 extension. Unsure if it's directly related to your crashes, but we will see. Working on a fix.
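A JIT that emits BMI2 instructions has to detect them at runtime and keep a working fallback path for CPUs without them. A minimal sketch of the detection, assuming GCC/Clang (cpuid.h) or MSVC (intrin.h) on an x86 host: BMI2 support is reported in CPUID leaf 7, subleaf 0, EBX bit 8. The forceDisable parameter is a hypothetical testing override for exercising the fallback path, not an actual Cemu option.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

// Returns true if the host CPU reports BMI2 (CPUID.(EAX=7,ECX=0):EBX bit 8).
// forceDisable is a hypothetical override that pretends BMI2 is absent.
bool hostHasBMI2(bool forceDisable = false) {
    if (forceDisable) return false;
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];
    __cpuid(regs, 0);  // regs[0] = highest supported standard leaf
    if (regs[0] < 7) return false;
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 8)) != 0;  // EBX bit 8
#elif defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    // __get_cpuid_count returns 0 if leaf 7 is not supported by this CPU.
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
    return (ebx & (1u << 8)) != 0;
#else
    return false;  // not an x86 host
#endif
}
```

Routing every BMI2-dependent emitter decision through one such query makes it straightforward to force the non-BMI2 path on BMI2-capable hardware, which is how a fallback-only crash like the one above can be reproduced.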
Can you grab the latest build and check again, @goeiecool9999 @boggydigital?
No change. It crashes in the same spot.
Tried the latest build. It crashes for me as well.
That fixes it 🥳
Likewise, I can't repro the crash in any of the ~10 titles I've tried. Thank you @Exzap!
What is the real bottleneck here, if CPU emulation was not the problem?
It differs by game, but for the more graphically complex games it's usually the GPU command processor.
Tested on macOS with most first-party titles and didn't encounter any issues compared to main.
Disclaimer: This is a work in progress. I'm opening this draft PR for visibility, so others can track progress and know not to alter the recompiler code. Work on this started in November, and the ETA for completion is somewhere within the next few months, depending on my motivation.
Goals
I originally started work on the recompiler in 2014, and since then I have learned a lot more about state-of-the-art compiler and IR design. While I'm generally happy with the quality of our code translation, some of the design choices I made along the way make it hard to introduce further optimizations or fixes. A lot of the complexity is carried by the x86-64 backend, which means all of it would have to be reimplemented when targeting another architecture.
Overall, the idea is to make both the front-end (PPC to IR) and the back-end (IR to x86-64) as "dumb" as possible so that all the complex logic can be shifted to operate on platform-independent IR, lowering the burden on platform-specific code.
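As a rough illustration of that split, here is a hypothetical three-address IR with one platform-independent optimization pass and a deliberately trivial backend interface. None of these types correspond to Cemu's IML; constant folding merely stands in for whatever analyses actually move into the shared IR layer. The point is that the pass is written once and benefits every backend, while each backend only maps IR operations to host instructions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical three-address IR (not Cemu's IML).
enum class IrOp { LoadConst, Add, Or };
struct IrInstr {
    IrOp op;
    int dst;         // virtual register written
    int srcA, srcB;  // virtual registers read (unused for LoadConst)
    uint32_t imm;    // constant for LoadConst
};

// Platform-independent pass: replace ops whose inputs are known constants.
void foldConstants(std::vector<IrInstr>& code) {
    std::map<int, uint32_t> known;  // vreg -> constant value, if known
    for (IrInstr& in : code) {
        if (in.op == IrOp::LoadConst) {
            known[in.dst] = in.imm;
            continue;
        }
        auto a = known.find(in.srcA);
        auto b = known.find(in.srcB);
        if (a != known.end() && b != known.end()) {
            uint32_t v = (in.op == IrOp::Add) ? a->second + b->second : (a->second | b->second);
            in = IrInstr{IrOp::LoadConst, in.dst, -1, -1, v};
            known[in.dst] = v;
        } else {
            known.erase(in.dst);  // dst now holds a non-constant value
        }
    }
}

// A deliberately "dumb" backend: one IR op in, host code out, no analysis.
struct Backend {
    virtual ~Backend() = default;
    virtual void emit(const IrInstr& in) = 0;
};

// Stand-in for an x86-64 or AArch64 backend; records pseudo-assembly text.
struct TextBackend : Backend {
    std::vector<std::string> lines;
    void emit(const IrInstr& in) override {
        auto r = [](int v) { return "v" + std::to_string(v); };
        switch (in.op) {
            case IrOp::LoadConst:
                lines.push_back("mov " + r(in.dst) + ", " + std::to_string(in.imm));
                break;
            case IrOp::Add:
                lines.push_back("add " + r(in.dst) + ", " + r(in.srcA) + ", " + r(in.srcB));
                break;
            case IrOp::Or:
                lines.push_back("or " + r(in.dst) + ", " + r(in.srcA) + ", " + r(in.srcB));
                break;
        }
    }
};
```

Adding a second target in this model means writing another Backend subclass; foldConstants and any other IR pass are reused unchanged.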
State
Please do not report bugs yet. In fact, I don't recommend trying this out; it's an active construction site.
SHL reg/mem, CL). This is currently handled suboptimally by the final emitter, which moves registers around whenever such an instruction is encountered.

I know a lot of these are pretty abstract, so in the future I might add a few before-vs-after code examples to this text.
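SHL reg/mem, CL only accepts its shift count in CL, so once registers have been assigned, the emitter may have to shuffle them around the instruction. The toy lowering below shows the resulting extra instructions, and why BMI2's SHLX, which accepts the count in any register, avoids the problem. It is a sketch under stated assumptions, not Cemu's emitter: the names are hypothetical, it produces assembly text rather than bytes, and R11 is assumed to be a scratch register the allocator never assigns.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical register set; the enum values index kName and are NOT encodings.
// R11 is assumed to be a scratch register that the allocator never hands out,
// so dst, src and amt are never R11.
enum Reg { RAX, RCX, RDX, RBX, R11 };
const char* const kName[] = {"rax", "rcx", "rdx", "rbx", "r11"};

// Lowers "dst = src << amt" (64-bit) to x86-64 assembly text. Note that x86
// masks the count to its low 6 bits; any guest-specific shift semantics are
// assumed to be handled earlier in the IR.
std::vector<std::string> lowerShl(Reg dst, Reg src, Reg amt, bool hasBMI2) {
    std::vector<std::string> out;
    auto n = [](Reg r) { return std::string(kName[r]); };
    if (hasBMI2) {
        // SHLX dst, src, count: the count may live in any register.
        out.push_back("shlx " + n(dst) + ", " + n(src) + ", " + n(amt));
        return out;
    }
    // Legacy SHL r/m64, CL: the count must be in CL. This simple scheme is
    // correct for any aliasing between dst, src, amt and RCX, at the cost of moves.
    out.push_back("mov r11, " + n(src));                   // copy the value before touching RCX
    if (amt != RCX) out.push_back("xchg rcx, " + n(amt));  // count -> CL, old RCX parked in amt
    out.push_back("shl r11, cl");
    if (amt != RCX) out.push_back("xchg rcx, " + n(amt));  // restore both registers
    out.push_back("mov " + n(dst) + ", r11");
    return out;
}
```

For dst = rcx, src = rax, amt = rdx without BMI2 this emits five instructions (mov, xchg, shl, xchg, mov), compared to a single shlx rcx, rax, rdx with BMI2. Letting the register allocator know about the CL constraint up front is what would make most of these moves unnecessary.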
Q&A
Will this PR add ARM support?
No. But it will make adding a new target architecture a lot easier, and if I'm motivated enough I'll look into adding an AArch64 backend after this is done.
Will this make Cemu faster?
Maybe? After everything is done the recompiler should output faster code, but CPU execution speed generally isn't a bottleneck in Cemu, so it's hard to predict whether there will be an actual difference.
What about the proposed plan to use LLVM?
I did quite a bit of research on that. The biggest downside is that LLVM is still quite JIT-unfriendly and comes with significant bloat. I'm not saying it wouldn't work, but in my opinion the cons outweigh the pros. Plus, we already have a pretty sophisticated recompiler and it would be a waste to throw it away.
On a personal note, I enjoy working on custom solutions more than plugging in libraries, so it's easier for me to stay motivated and make progress. In terms of total effort, both solutions are about the same.