FEX-2312
Read the blog post at FEX-Emu's Site!
We're back with another month of changes. After last month being a bit slower, we're back in the swing of implementing more optimizations and bug fixes. No dilly dallying, let's get right into it!
More optimizations this month!
Once again, this month brings a whole bunch of exciting optimizations! We will lightly go over each of them to explain what changed.
Keep guest SF/ZF/CF/OF flags resident in host NZCV
This is one of the bigger optimizations this month, and it needs a bit of backstory. x86 has a flags register called EFLAGS which contains quite a few assorted bits of information. The flags we care about here are SF, ZF, CF, and OF. These are typically set by ALU operations to describe the result. For example, an integer Add will set ZF if the result was zero, SF if the result has the sign bit set, CF if a carry occurred, and OF if the operation overflowed. These are usually quite cheap for the CPU to calculate in hardware, but calculating each flag manually in software usually takes a few additional instructions.
The original implementation inside of FEX would spend the additional instructions and calculate each flag manually. This
would usually end up adding a dozen or so instructions just for calculating flags. While FEX would typically optimize out the calculations if
their results weren't used, they would still cost CPU time whenever we couldn't.
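To give a feel for why manual flag calculation costs extra instructions, here is a minimal Python sketch that derives the four flags for a 32-bit add by hand. The function name `add32_flags` and the dictionary layout are made up for illustration; this is not FEX's code, only the textbook definition of each flag.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit registers with Python integers


def add32_flags(a: int, b: int) -> dict:
    """Perform a 32-bit add and derive SF/ZF/CF/OF manually.

    Hypothetical helper for illustration; each flag needs its own
    small computation, which is the overhead the post describes.
    """
    a &= MASK32
    b &= MASK32
    full = a + b              # unbounded sum, may exceed 32 bits
    result = full & MASK32    # what the destination register holds

    def sign(x: int) -> int:
        return (x >> 31) & 1  # top bit of a 32-bit value

    return {
        "result": result,
        "ZF": int(result == 0),      # result is zero
        "SF": sign(result),          # result has the sign bit set
        "CF": int(full > MASK32),    # unsigned carry out of bit 31
        # signed overflow: both inputs share a sign the result lacks
        "OF": int(sign(a) == sign(b) and sign(result) != sign(a)),
    }
```

For example, `add32_flags(0x7FFFFFFF, 1)` reports a signed overflow (OF and SF set) with no carry, while `add32_flags(0xFFFFFFFF, 1)` reports a carry and a zero result.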
Luckily ARM also has a flags register called NZCV whose four flags map almost perfectly to the SF, ZF, CF, and OF flags in x86's EFLAGS. This lets us optimize these instruction implementations
to use the ARM flags directly. This has a couple of effects: not only does it remove the extra instructions from our code generation, it also has the
knock-on effect that the flags now live inside NZCV, which reduces memory accesses. A multi-hit combo for improving performance.
While not all x86 instructions map their flags 1:1 onto NZCV, this gives a fairly significant performance uplift in most situations!
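As a rough illustration of the correspondence, the sketch below moves the four NZCV bits into their EFLAGS positions. The bit positions are the architectural ones (N/Z/C/V in bits 31..28 of NZCV; CF, ZF, SF, OF in bits 0, 6, 7, 11 of EFLAGS). The helper name and the `carry_is_inverted` parameter are assumptions for this example. The parameter models one well-known mismatch: after a subtract or compare, ARM sets C when there was no borrow, while x86 sets CF when there was one. The post does not say which specific cases FEX has to special-case.

```python
# Bit positions inside the AArch64 NZCV register.
N_BIT, Z_BIT, C_BIT, V_BIT = 31, 30, 29, 28
# Bit positions inside x86 EFLAGS.
CF_BIT, ZF_BIT, SF_BIT, OF_BIT = 0, 6, 7, 11


def nzcv_to_eflags(nzcv: int, carry_is_inverted: bool = False) -> int:
    """Translate host NZCV bits into guest SF/ZF/CF/OF bits.

    Hypothetical helper, not FEX's API. Set carry_is_inverted=True
    when the flags came from a subtract-style operation, where the
    ARM carry has the opposite meaning of the x86 carry.
    """
    n = (nzcv >> N_BIT) & 1
    z = (nzcv >> Z_BIT) & 1
    c = (nzcv >> C_BIT) & 1
    v = (nzcv >> V_BIT) & 1
    if carry_is_inverted:
        c ^= 1
    return (n << SF_BIT) | (z << ZF_BIT) | (c << CF_BIT) | (v << OF_BIT)
```

The translation is only needed when something actually wants EFLAGS as a packed value; as long as the flags are consumed by a following conditional operation, they can stay in NZCV untouched.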
Dedicate registers for PF/AF
Related to the previous change, x86 has two more flags stored inside of EFLAGS, PF and AF, that don't have a direct equivalent on ARM CPUs. These two
flags are rarely used, but instructions will still generate them. They have the additional problem that they are fairly costly to calculate,
with one of them requiring a GPR population count instruction, which ARM only gained with the new CSSC instruction extension. While in
most cases the result of these flags isn't used, the overhead of calculating them can add up a bit. This is why we are now dedicating two registers to
these flags to reduce their overhead as much as possible!
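For reference, here is a small sketch of how the two flags are defined, using their standard x86 definitions. The function names are made up for this example and say nothing about how FEX stores the values in its dedicated registers. PF is the one that is naturally expressed with a population count.

```python
def parity_flag(result: int) -> int:
    """PF: set when the low 8 bits of the result contain an even
    number of 1 bits. The bit count is the population count step."""
    low_byte = result & 0xFF
    return int(bin(low_byte).count("1") % 2 == 0)


def aux_carry_flag(a: int, b: int, result: int) -> int:
    """AF for an add: set when there was a carry out of bit 3.

    XORing both inputs with the result leaves exactly the carry-in
    of every bit position, so bit 4 holds the carry out of bit 3.
    """
    return ((a ^ b ^ result) >> 4) & 1
```

For example, `0x0F + 0x01 = 0x10` carries out of the low nibble, so `aux_carry_flag(0x0F, 0x01, 0x10)` is 1, and `parity_flag(0x03)` is 1 because two bits are set.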
Misc optimizations
- Optimize BT/BTC/BTS/BTR
- Optimize shifts/rotates
- Optimize selects & branches & more nzcv goodies
- Optimize three sha instructions
- Make "not" not garbage
- Optimize memcpy and memset when direction is compile time constant
With all these optimizations in place this month we have a fairly significant performance uplift!
<-- Geekbench and bytemark graphs -->
While Geekbench is showing a fairly modest 17.6% performance uplift, bytemark is showing up to a 60% performance uplift! Over the course of
the last three months we have had benchmarks improve by over 100%! These improvements can be seen in games as well, with some CPU heavy
games having had their FPS improve by over 2x. A lot of the games we tested have even gone from being CPU limited to GPU limited on our Lenovo X13s
laptops! We are looking forward to new laptops based on Snapdragon X
Elite being released in the middle of next year!
Various bug fixes
In addition to performance improvements, we have some bug fixes this month.
- Fixed corruption in the JIT
  - This caused corruption in x87 heavy games
- Fixed integer multiply corrupting results
  - This corrupted some register state, which was breaking the game Dungeon Defenders
Support extracting erofs images
One of the features that FEXRootFSFetcher was missing was the ability to extract erofs images once downloaded. This was because we didn't know that
erofs-utils provided an application for extracting these images without FUSE. Turns out the developers put an extractor inside of their fsck
application that we had completely missed! Now if a user wants to extract an x86 rootfs image for lower overhead, they can do this directly from our
FEXRootFSFetcher tool.
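For anyone who wants to do the same extraction by hand, the sketch below wraps the `fsck.erofs` extractor in Python. It assumes an erofs-utils version whose `fsck.erofs` supports the `--extract=<dir>` option. The helper names and paths are hypothetical, and this is not necessarily how FEXRootFSFetcher invokes the tool.

```python
import subprocess


def build_extract_command(image_path: str, dest_dir: str) -> list[str]:
    """Build the fsck.erofs command line that unpacks an image.

    Assumes fsck.erofs from erofs-utils is on PATH and supports
    --extract=<dir>; both arguments are caller-supplied paths.
    """
    return ["fsck.erofs", f"--extract={dest_dir}", image_path]


def extract_erofs(image_path: str, dest_dir: str) -> None:
    """Run the extractor and raise CalledProcessError on failure."""
    subprocess.run(build_extract_command(image_path, dest_dir), check=True)
```

Calling `extract_erofs("rootfs.ero", "rootfs")` with your own paths would then unpack the image without needing FUSE.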
Preparation for improving gdbserver
GDBServer is a socket interface that GDB supports for remotely debugging applications. One of the harder things about working on FEX-Emu is
debugging the application that is being emulated. GDBServer is a way to improve this situation, since it lets GDB remotely connect to a FEX process.
There's a bunch of work this month towards cleaning up this interface and getting it to work correctly. While it is still not quite usable for
debugging, we are working towards this so applications can actually be debugged!
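To make "socket interface" a little more concrete: GDB's remote serial protocol sends each packet as `$<payload>#<checksum>`, where the checksum is the sum of the payload bytes modulo 256 written as two hex digits. The sketch below frames and validates such packets. The function names are made up for this example, it ignores the protocol's escaping and run-length encoding, and it is not taken from FEX's GdbServer code.

```python
def frame_packet(payload: str) -> bytes:
    """Wrap a payload as a GDB remote protocol packet: $payload#xx."""
    data = payload.encode("ascii")
    checksum = sum(data) % 256  # modulo-256 sum of the payload bytes
    return b"$" + data + b"#" + f"{checksum:02x}".encode("ascii")


def parse_packet(packet: bytes) -> str:
    """Validate a framed packet and return its payload.

    Simplified: does not handle escaped bytes or run-length encoding.
    """
    if len(packet) < 4 or packet[:1] != b"$" or packet[-3:-2] != b"#":
        raise ValueError("malformed packet")
    data = packet[1:-3]
    expected = int(packet[-2:].decode("ascii"), 16)
    if sum(data) % 256 != expected:
        raise ValueError("checksum mismatch")
    return data.decode("ascii")
```

A server built on this would read bytes from its socket, pass complete packets to `parse_packet`, and answer with `frame_packet` replies such as `frame_packet("OK")`.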
Improvements to WOW64 compatibility for newer WINE
Newer versions of WINE have changed some behaviour around WOW64 support, so this month we have added support for some of this newer behaviour. Thanks
again to Bylaws for implementing this!
FEX rootfs image updates
This month we are updating our rootfs images to incorporate the latest Mesa 23.3.0 release that
occurred a few days ago. We have updated our Ubuntu 22.04, 23.04, 23.10, ArchLinux, and Fedora 38 images with this latest version of Mesa. As usual, if
there are any issues, let us know so we can sort them out.
Raw Changes
FEX Release FEX-2312
- Arm64Emitter
  - Dedicate registers for PF/AF (9b64674)
  - Fixes warning (bf147f4)
- Config
  - Removes Threads option (3c73357)
- Dispatcher
  - Fixes corruption when spilling SRA registers (f6b1434)
- EmulatedFiles
  - Stop relying on O_TMPFILE (2e24f34)
- FEX
  - Only pass CPU tunables to FEXCore and FEXLoader (b4eeb96)
- FEXCore
  - Removes GetProgramStatus (c8ef77c)
  - Removes InitializeContext API (b35fadf)
  - Fixes passing arguments to ABI helpers (d0f54bc)
  - Removes Get/SetCPUState (7de66ac)
  - Optimize memcpy and memset when direction is compile time constant (85a1c1f)
  - Removes FEX_PACKED from CPUState (8e892ec)
  - Moves debug strings to gdbserver (3f02d7c)
  - Start changing how thread creation works (f328fca)
  - Moves more SignalDelegator functions to the frontend (6e8af29)
  - Removes x86 DebugInfo table (8015ce2)
  - Removes GetExitReason (bdf4089)
  - Disables RPRES until AFP is audited and enabled (b027113)
  - Fixes imul returning garbage data (389c6b1)
  - Work around broken preserve_all support in Windows clang (98f9a65)
- FEXLoader
  - Wire up gdbserver in the frontend (efc5eb2)
- FEXRootFSFetcher
  - Supports extracting erofs images (aa1344a)
- GdbServer
  - Switch over to a unix domain socket (0357bb2)
- IR
  - Moves remaining NZCV operations to use DestSize (b619f38)
- IRDumper
  - Fixes missing conditional name (8892580)
- InstCountCI
  - Moves Sonic Mania code to 32-bit file (470615b)
- InstructionCountCI
  - Remove Optimal flags (4a31b61)
- JIT
  - Fixes crash in TestNZ (a8ab8bb)
- OpcodeDispatcher
  - Optimize three sha instructions (0e1e4c1)
  - Make "not" not garbage (3767f36)
- ScopedSignalMask
  - Clean up API and use std::unique_lock/shared_lock (8726c8f)
- Thunks
  - libX11
    - Change lock functions to nullptr by default (db63241)
- Misc
  - Optimize BT/BTC/BTS/BTR (250ffb6)
  - Improvements for WOW64 compat with newer wine (43cf2e4)
  - More GdbServer improvements (1c11509)
  - Optimize shifts/rotates (c1d5fae)
  - Optimize selects & branches & more nzcv goodies (1e2d059)
  - Keep guest SF/ZF/CF/OF flags resident in host NZCV (af32539)
  - Add helper for deriving ops by opcode (3f1f7fa)
- unittests
  - ASM
    - Adds unittest found from Ender Lilies that crashed with NZCV (996a4c0)