FEX-2310
Welcome back to another monthly release for FEX-Emu. You might be thinking that after last month's optimizations we wouldn't have much to show
this month. Well, you would be wrong! We optimized even more! Let's get into it!
More instruction optimizations!
As stated last month, we introduced Instruction Count CI, which has allowed us to do targeted optimizations of our code. Once again we have optimized so
many instructions that it would be impossible to go through each individual change. Check our detailed change log if you want to see all the
instructions optimized. Let's just look at the final benchmark numbers compared to last month.
<- Geekbench 5 versus last month ->
<- Bytemark versus last month ->
Let's talk about the Geekbench 5.4 results first, since they don't look very impressive at first glance. While we are only showing a ~13% performance
improvement, this number is an aggregate of multiple smaller benchmarks. Looking at the breakdown of all the subtests, some have improved by up
to 66%! This is of course because some benchmarks lean on instructions that we optimized more heavily than others. Luckily, this improvement also
carries over to video games.
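To see how a modest aggregate can coexist with a large subtest gain, here is a minimal sketch. It assumes the aggregate behaves like a geometric mean of per-subtest speedup ratios, which is an assumption about how such suite scores are combined. The ratios in the usage note are made-up illustrative values, not measured Geekbench results, and `GeometricMean` is a hypothetical helper.

```cpp
#include <cmath>
#include <vector>

// Geometric mean of per-subtest speedup ratios (1.0 means "unchanged").
// Assumption: the suite's aggregate score is combined roughly like this.
double GeometricMean(const std::vector<double>& ratios) {
  double log_sum = 0.0;
  for (double r : ratios) {
    log_sum += std::log(r);  // summing logs avoids overflow of a long product
  }
  return std::exp(log_sum / static_cast<double>(ratios.size()));
}
```

Calling `GeometricMean({1.66, 1.05, 1.05, 1.05, 1.05, 1.05, 1.05, 1.05})` gives roughly 1.11: a single subtest improving by 66% barely moves the aggregate when the other seven improve by only 5%.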
The Bytemark improvements are a bit harder to make out: some numbers hardly changed at all, while a couple stand out as huge improvements. This
mostly comes down to some very specific instruction optimizations that significantly improved performance in a couple of tests, while the rest don't
benefit as much.
Combining this month's optimizations with last month's, the results become significantly more interesting. Some Geekbench results are showing an
average of 50% to 65% higher performance, sometimes even higher. Some benchmark results show nearly 2x the performance compared to before! These
numbers translate very well to gaming performance, where some games have more than doubled their FPS over the past couple of months.
We're not slowing down either; we still have a ton of optimizations to go on our march to get our emulation close to native performance.
Support preserve_all for interpreter fallbacks
We're calling out this particular optimization for three reasons.
- It improves performance of x87 heavy code
- It only works with the super recently released Clang 17
- Wine packages in FEX's rootfs use x87 heavily in some instances
Let's talk about what this optimization is and how it improves performance. Clang 17 added support for a new function calling ABI called
preserve_all. x86 has supported this ABI for a very long time, but it is a new addition for Arm64. This ABI breaks from the regular AAPCS64
convention: if a function needs to use registers, it must first save pretty much any of them, unlike AAPCS64, where a bunch of registers are free
for the callee to use. This is beneficial for FEX's JIT since we can save significant time by not saving any state when we need to jump out of the
JIT and execute x87 softfloat code.
In particular, this manifests as upwards of a 200% performance improvement in some microbenchmarks around x87 code! While this advantage is quite
significant, the only way to take advantage of it is to compile FEX with Clang 17. Since this compiler release only came out last month, pretty much
no distros have adopted it, so it is unlikely to be used soon. In a few months' time, or years depending on the distro, they should naturally upgrade
their compiler stack and the free performance improvements will follow.
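For readers who have not seen the attribute, here is a minimal sketch of how a helper can opt in to preserve_all only when the compiler knows about it. `SKETCH_PRESERVE_ALL` and `FallbackAdd` are hypothetical names for illustration; this is not FEX's actual fallback code, and a compiler may still ignore the attribute on targets where it is unsupported.

```cpp
#include <cstdint>

// Use preserve_all only where the compiler recognizes the attribute.
// Elsewhere the macro expands to nothing and the function falls back to
// the platform's normal calling convention.
#if defined(__has_attribute)
#  if __has_attribute(preserve_all)
#    define SKETCH_PRESERVE_ALL __attribute__((preserve_all))
#  endif
#endif
#ifndef SKETCH_PRESERVE_ALL
#  define SKETCH_PRESERVE_ALL
#endif

// Hypothetical stand-in for a softfloat helper reached from JIT code.
// With preserve_all the callee is responsible for saving nearly every
// register it touches, so the caller can skip spilling its own state.
SKETCH_PRESERVE_ALL std::uint64_t FallbackAdd(std::uint64_t a, std::uint64_t b) {
  return a + b;
}
```

The caller-side code does not change: `FallbackAdd(2, 3)` is an ordinary call, and the saving simply moves from every call site into the (rarely executed) helper.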
As a fairly major side note to this excursion, we have found that the 32-bit Wine packages from Canonical's repository use x87 heavily in some
instances. This causes some really bad performance issues with some 32-bit games and installers. It is recommended to use Proton where you can,
since it compiles its 32-bit libraries with SSE optimizations instead, which work significantly better.
FEX-Emu may look to provide its own Wine packages in the future with this same optimization in place to help alleviate some of this burden. Until then
it is recommended to use FEX's x87 reduced precision mode to try to alleviate some of the overhead.
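As a rough illustration of what reduced precision gives up, the sketch below compares the mantissa width of a 64-bit double with the host's `long double`. It assumes reduced precision mode means computing x87 operations with 64-bit doubles instead of the 80-bit extended format. The helper names are hypothetical, and `long double` is only the 80-bit x87 format on some hosts (for example x86-64 Linux), so treat that value as host-dependent.

```cpp
#include <limits>

// Mantissa precision in bits. For an IEEE-754 binary64 double this is 53;
// the x87 80-bit extended format carries 64.
constexpr int kDoubleMantissaBits = std::numeric_limits<double>::digits;
constexpr int kLongDoubleMantissaBits = std::numeric_limits<long double>::digits;  // host-dependent

// Returns true if adding `tiny` to 1 is still visible at precision T.
template <typename T>
constexpr bool SurvivesAddToOne(T tiny) {
  return (T(1) + tiny) != T(1);
}
```

For example, `SurvivesAddToOne<double>(std::numeric_limits<double>::epsilon() / 4)` is false: a contribution that small is rounded away at 64-bit precision, which is the kind of difference code relying on full x87 precision could notice.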
Fixes a bug when chrooting in to rootfs
Quite a few months ago, FEX-Emu changed some behaviour around chrooting into the FEX rootfs.
While chrooting isn't generally advised, it's the only option if a user wants to modify the rootfs. We provide some scripts inside of our
rootfs images to facilitate this, but they have been broken for a few months.
We have now fixed this bug in both FEX-Emu and the scripts inside of our rootfs images. So if you want to modify packages inside of the image you will
now be able to do so again. Make sure to update your image to get the new scripts!
Remove x86-64 JIT and Interpreter
This has been a long time coming in the FEX-Emu project. We have had support for an IR interpreter and x86-64 host JIT for compatibility testing since
the project's inception. It has always been the case that if these CPU backends got in the way of the Arm64 JIT, they would be removed.
That time has finally come. Due to some upcoming changes around how flags are represented in FEX's JIT, and the general burden of implementing
FEX's IR operations three times (often undoing an x86->Arm64 translation to go back to x86), maintaining them has been deemed too much of a burden
and they have been removed. This is a necessary step for the performance our Arm64 JIT will be gaining in the coming months!
We are looking forward to future Arm platforms that can take Radeon GPUs through PCIe slots to regain a platform which can test RADV directly, but
until that point we will have to make do with our current devices.
Instruction Count CI on x86-64 hosts
While we removed our x86-64 JIT, we do have a fun addition to our Instruction Count CI. Developers who don't have an Arm64 device handy can now still
run the Instruction Count CI and attempt to optimize implementations. This is as simple as building FEX on an x86-64 device with the Vixl
disassembler and simulator enabled, and you will be able to optimize to your heart's content!
We've got a need for JIT speed! Let's go fast!
Implement first optimizations using 128-bit SVE
This is a fairly minor change, but previously FEX was not using any 128-bit SVE instructions. This is primarily because there aren't really any
SVE-supporting devices in the consumer market, even though Snapdragon hardware theoretically supports it. 128-bit SVE adds a couple of optimizations
that we can use.
- Wide-element shifts
- Index instruction for generating simple index masks
While these are fairly simple initially, they take some translations from six instructions down to one or two, depending on the case. This is a
fairly minor change, but it is good to note that FEX is now taking advantage of SVE if it is available!
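For readers unfamiliar with these two instructions, here is a scalar sketch of what they compute, assuming a 128-bit vector of four 32-bit lanes. It models the instruction semantics only and is not FEX's JIT code; `IndexModel` and `LslWideModel` are hypothetical names.

```cpp
#include <array>
#include <cstdint>

// Scalar model of SVE INDEX: lane i holds base + i * step.
constexpr std::array<std::int32_t, 4> IndexModel(std::int32_t base, std::int32_t step) {
  std::array<std::int32_t, 4> out{};
  for (int i = 0; i < 4; ++i) {
    out[i] = base + i * step;
  }
  return out;
}

// Scalar model of a wide-element left shift: each 32-bit lane is shifted by
// the 64-bit element that overlaps it, so lanes 0-1 use shift[0] and lanes
// 2-3 use shift[1]. Shift amounts of 32 or more produce zero.
constexpr std::array<std::uint32_t, 4> LslWideModel(std::array<std::uint32_t, 4> v,
                                                    std::array<std::uint64_t, 2> shift) {
  std::array<std::uint32_t, 4> out{};
  for (int i = 0; i < 4; ++i) {
    const std::uint64_t s = shift[i / 2];
    out[i] = (s >= 32) ? 0u : (v[i] << s);
  }
  return out;
}
```

The appeal for an x86 emulator is that the shift count is a full 64-bit value and oversized counts zero the lane, which lines up with how x86 vector shifts take their count from a register, so the extra instructions for clamping and broadcasting the count can go away.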
Adds WOW64 frontend
This has been a long time coming, with us adding initial mingw support back in FEX-2305. FEXCore now supports being built with a brand new WOW64 WINE
frontend. While not currently being utilized, this will allow WINE to integrate FEX directly into its WOW64 layer for running both x86 and x86-64
applications on Arm64 host devices.
This is a very substantial change to how WINE integrates with FEX, since today FEX-Emu just runs the full x86-64 WINE process and eats the overhead of
emulating everything WINE needs to do. With the WOW64 layer now implemented, a bunch of the WINE code can be Arm64 native code, and when it needs
to execute application code it just jumps back to the emulator. This is similar to how Windows natively handles its emulation through its "XTA" layer.
Sadly, today this is only wired up to work through the 32-bit x86 part of the layer; we still need to get set up to support Wine when it inevitably
supports WOW64 for x86_64->Arm64.
Big shout out to ByLaws for implementing support for this! We look forward to future Wine integration work landing!
Implement thunking support for wayland-client and zink
We have some improvements to thunking this month! As we are working towards supporting thunking more code, we implemented some features to get
wayland-client thunking wired up. While this support is early, it is enough to get Super Meat Boy up and running using wayland and zink overrides
within a Wayland environment. We look forward to additional thunking improvements going forward so that performance can be improved everywhere.
Raw Changes
FEX Release FEX-2310
- AppConfig
  - Removes Steam config (02da6d6)
- Arm64
  - Fixes inline syscalls (4e9a114)
  - Optimize wide shifts slightly for 64-bit OpSize (f5c4e28)
  - Recover two unused vector vector temporary registers (90f7937)
  - ALUOps
    - Remove spills in PEXT (4604c01)
  - VectorOps
    - Elide moves where applicable in 128-bit VSQXTUN2 (fd1b639)
    - Improve handling of 128-bit vector VInsElement (950a8db)
    - Elide moves in ASIMD VUShrNI2 if possible (b3269f2)
    - Assert VTMP1 and VTMP2 are sequential in VTBL2 (8168a49)
    - Fix SVE aliasing-path move in VSShr (ffb5876)
- CI
  - Run tests with <30s runtime first (e1eb151)
- CPUID
  - Enabled Enhanced REP MOVSB/STOSB (6fe643d)
- Config
  - Fixes core sanitization (da3e172)
- ConstProp
  - Fixes unscaled signed 9-bit range (72d092e)
- DeadContextStoreElimination
  - Silence unused function warning (773e946)
- ELFCodeLoader
  - Expose FSGSBase in getauxval HWCAP2 (fbc4bda)
- FEX
  - Moves Linux utils and adds spdx (ba56e51)
  - Common
    - Adds SPDX identifier (9f5f09b)
  - Tools
    - Adds SPDX identifier (ddf4b5c)
- FEXCore
  - Support CpuState relative vector named constants (3413eb3)
  - Merge Arm64Dispatcher in to Dispatcher (935b3a3)
  - Removes x86 JIT. (879b41c)
  - Removes vestigial Interpreter code (65b6df9)
  - Support preserve_all ABI for interpreter fallbacks (fea72ce)
  - Gut interpreter (7d99eb0)
  - Adds SPDX identifier (67680d7)
  - Implements support for shifted bitwise ops (d578256)
  - Disable Enhanced REP MOVSB if Atomic TSO is enabled (0604336)
  - Defer setting x87 softflow rounding mode until use (9866e23)
  - Minor optimization to StoreRegisterSRA (6fdf2f9)
  - Include
    - Adds SPDX identifier (d86f41e)
  - JitSymbols
    - Buffer writes to reduce overhead (2ea2300)
- FEXServerClient
  - Adds back ServerSocketPath config option (e795ec6)
- FHU
  - Prepend SPDX identifier (3b188b7)
- FileManagement
  - Fix inverted boolean check for procfs/interpreter support (615ab8d)
- Github
  - Changes jobs to have unique names (220761a)
- HostFeatures
  - Fix x86 CLWB support check (6e08ac6)
  - Detect FlagM/2 (98f1487)
- IR
  - Changes Select operation to not have implicit sizes (a8c1720)
  - Changes crc32 operation to always return a 32-bit result. (6d9b524)
  - RCLSE: Partially reenables the RCLSE pass (879fcdc)
- InstCountCI
  - Enable running on x86 hosts (a1a709f)
  - Support f64 reduced precision mode tests (5eed24a)
  - Fail CI if there was any difference. (93aeb15)
  - Adds negative immediate primary tests (c38beff)
  - Add log before compiling instruction (1804b00)
  - Adds missing instructions from Secondary OpSize tables (750d909)
- OpcodeDispatcher
  - Don't mask logic op inputs (d1d3de8)
  - Optimize lock btr (2b7e1d1)
  - Optimize reconstructing FSW (5fc8699)
  - Removes non-explicit SelectCC function (43fd159)
  - Improve output of SHLX/SHRX/SARX (ad8b0c6)
  - Improve output of MULX (e574cfe)
  - Handle RORX corner cases better (647629a)
  - Optimize cmov (c8e7c34)
  - Optimize CRC32 (96bbd01)
  - Optimize 16-bit MOVBE (f84a264)
  - Optimize blendp{s,d} (92824f5)
  - Optimize pins{b,w,d,q} (213d3c4)
  - Optimize pextr{b,w} (d4c6749)
  - Optimize shufpd (655cee0)
  - Implement shufps with VTBL2 in worst case (31d8283)
  - Optimize a bunch of shufps variants (cfe620a)
  - Optimize 32-bit bswap (ebdca02)
  - Optimize NOP vector move (f7e652b)
  - Minor optimization to BT/BTC/BTR/BTS (48521a4)
  - Update 32/64-bit RCL for operating size (950007c)
  - Update 32/64-bit RCR for operating size (d029394)
- PassManager
  - Optimize out CPUID and XGetBV calls (234e029)
- RCLSE
  - Optimize redundant store->load operations (6dc5c0d)
- Scripts
  - Update generate_doc_outline for moved FEXCore (9ff5544)
- Thunks
  - Only build guest target for libfex_thunk_test if FEXLinuxTests are enabled (507cf82)
  - Analyze data layout to detect platform differences (48fa4f1)
  - Fix AddressSanitizer build (1439874)
  - Update Vulkan thunk to v1.3.261.1 (76d4637)
  - Avoid recompiling thunk interfaces on FEXLoader changes (533f359)
  - Minor restructuring and small cleanups (be07254)
  - wayland
    - Add support for APIs required by zink and Super Meat Boy (0d9dce9)
- Tools
  - Fixes usage of waitpid in the face of EINTR (dda5861)
- X87F64
  - Implement FABS with vector instruction (3a25dd6)
  - Use Bfe for rounding mode, FCHS use float instruction (ccfd770)
- Misc
  - Minor AVX optimizations (3ba1c79)
  - Optimize ASCII flags (6b4ff4a)
  - Use adcs (ca87d86)
  - Optimize 8/16-bit CF calculation (8b3881b)
  - Optimize PF calculation in lahf (19a7b51)
  - More opts to the dispatcher + 1 to the JIT (bee9730)
  - Requiem for the x86 jit (86ad35c)
  - Add WOW64 JIT frontend (797c890)
  - Optimize reconstructing x87, harder (5444810)
  - Make x87 FCMOV slightly less terrible (65d558b)
  - Minor/flag opts (8b52308)
  - InstCountCI/VEX_map3: Add missing zeroing vperm2f128/vperm2i128 test cases (3d0b664)
  - Inline constant with PF calculation (9152fb0)
  - Optimize out carry invert for DEC (b6922df)
- unittests
  - Instruct CTest to print output from tests on failure (e32601f)
  - Add test thunk library (000fb2e)
  - ASM
    - Implements tests for vpgatherqd/vgatherqps (ab4642a)
    - Implements tests for vpgatherqq/vgatherqpd (d94e5ce)
    - Implements tests for vpgatherdq/vgatherpq (dad7086)
    - Implements tests for vpgatherdd/vgatherps (85da0f0)