
FEX-2406

@Sonicadvance1 Sonicadvance1 released this 13 Jun 02:32
aa0f2c3

Read the blog post at FEX-Emu's Site!

A little late this month, but a new FEX release has finally landed. This month we have some good optimizations and fixes, so let's get right into it.

A bunch of JIT optimizations

This last month is the culmination of several months of preparatory cleanup work in the FEX JIT. The new register allocator has
landed in FEX, and it is significantly better than our previous RA. Our prior implementation was meant to be a temporary solution when FEX first
started as a project and, as with most temporary code, it became permanent. It was excessively slow: in the best case it ran in quadratic time, and in
the worst case it could take INFINITE time, which resulted in significant stutters or hangs. The new implementation by Alyssa runs in two passes in
linear time, significantly improving performance and removing a ton of bad design decisions from the first implementation.
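
To give a feel for what "two passes in linear time" can look like, here is a minimal sketch of a generic allocator for a single straight-line SSA block. It is not FEX's actual allocator: the `Inst` type, the `AllocateRegisters` name, and the lack of spilling are all assumptions made for illustration. A backward pass records where each value is last used, and a forward pass hands out registers and recycles them at that last use.

```cpp
#include <optional>
#include <vector>

// Hypothetical straight-line SSA block: instruction i defines value i and
// reads the earlier values listed in `sources` (each index must be < i).
struct Inst {
  std::vector<int> sources;
};

// Returns the register chosen for each value, or std::nullopt if the block
// needs more than `num_regs` registers (this sketch does not spill).
std::optional<std::vector<int>> AllocateRegisters(const std::vector<Inst>& block, int num_regs) {
  const int n = static_cast<int>(block.size());

  // Pass 1 (backward): the first reader we meet while walking backward
  // is the last use of that value.
  std::vector<int> last_use(n, -1);
  for (int i = n - 1; i >= 0; --i) {
    for (int src : block[i].sources) {
      if (last_use[src] == -1) last_use[src] = i;
    }
  }

  // Free registers are kept on a stack so taking and returning one is O(1).
  std::vector<int> free_regs;
  for (int r = num_regs - 1; r >= 0; --r) free_regs.push_back(r);

  // Pass 2 (forward): release sources that die here, then assign a register.
  std::vector<int> assigned(n, -1);
  std::vector<bool> released(n, false);
  for (int i = 0; i < n; ++i) {
    for (int src : block[i].sources) {
      if (last_use[src] == i && !released[src]) {
        released[src] = true;  // guards against a source listed twice
        free_regs.push_back(assigned[src]);
      }
    }
    if (free_regs.empty()) return std::nullopt;
    assigned[i] = free_regs.back();
    free_regs.pop_back();
    if (last_use[i] == -1) {  // value is never read, so free it right away
      released[i] = true;
      free_regs.push_back(assigned[i]);
    }
  }
  return assigned;
}
```

Each instruction and each source operand is visited a constant number of times, so the work grows linearly with the size of the block. A real allocator also has to deal with control flow, spilling, and fixed register constraints, none of which are modelled here.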

In addition to the new RA, we also have a bunch of little optimizations spread around that improve performance all over the place. One of the bigger
performance improvements for people with new hardware is enabling the AFP extension and RPRES if supported. Apple supports these in their latest SoC
and the newer Cortex cores also support them. This improves scalar SSE performance by quite a bit. We won't dive into these too much, but the various
optimizations can improve performance by anywhere from 2% to 12% in testing. We're marching ever closer to running applications at near native speeds now.

Add support for 32-bit OpenGL thunking

This is a big feature! 32-bit thunking has been a long time coming and has crossed some significant hurdles towards actually working! One of the
biggest CPU time sinks with games is the amount of time we need to spend in the video driver when running a game. "Thunking" lets us jump directly
into the AArch64 libGL and remove a bunch of that emulation overhead. We have done a bunch of testing with this, but we
expect there will still be some bugs that need to be worked out. As for fun performance improvements, we have seen one game go from 150FPS up to
270FPS, so it's worth trying in some cases.

As a note though, this is only 32-bit OpenGL thunking. 32-bit Vulkan drivers still need to go down the emulation path, so things like DXVK in older
32-bit games won't get these performance improvements.

Default TSO emulation options changed

Over the course of the past couple of months we have been testing the new TSO memory model emulation toggles, and during this time we have determined
that the cost of accurately emulating TSO on vector and memory-copy operations is too high for most hardware. The good news is that, in all the testing
we have done, this doesn't actually cause any problems in any known games. So from this release onward we are by default disabling TSO emulation on these
operations. We may come back and revisit this once hardware ships with FEAT_LRCPC3, which adds new instructions for vector TSO loads and stores.
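
For readers unfamiliar with why TSO emulation costs anything, here is a minimal sketch of the classic message-passing pattern, written with C++ atomics rather than taken from FEX. On x86, ordinary loads and stores already behave like the acquire load and release store below, so guest code can rely on this ordering with plain memory accesses. On Arm the ordering has to be requested explicitly, which is the cost an emulator pays. The function name `RunMessagePassing` is made up for this illustration.

```cpp
#include <atomic>
#include <thread>

// Message passing: the producer writes a payload and then raises a flag;
// the consumer waits for the flag and then reads the payload.
int RunMessagePassing() {
  std::atomic<int> data{0};
  std::atomic<bool> ready{false};
  int observed = -1;

  std::thread producer([&] {
    data.store(42, std::memory_order_relaxed);
    // Release: the payload store above may not be reordered after this store.
    ready.store(true, std::memory_order_release);
  });

  std::thread consumer([&] {
    // Acquire: the payload load below may not be reordered before this load.
    while (!ready.load(std::memory_order_acquire)) {
    }
    observed = data.load(std::memory_order_relaxed);
  });

  producer.join();
  consumer.join();
  return observed;  // always 42 thanks to the release/acquire pairing
}
```

If both flag accesses were relaxed instead, a weakly ordered CPU would be allowed to let the consumer see the flag before the payload. Disabling TSO emulation for a class of operations means accepting that risk for those operations in exchange for speed.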

Users with an older configuration can go into FEXConfig to toggle these options off and enjoy the free speed benefits of not doing accurate emulation
today! As a note, Apple Silicon's hardware TSO bit doesn't suffer the same performance degradation, so once Asahi Linux supports this for
users they will get both accurate emulation and speed!
TSO options in FEX

We still recommend keeping "Unaligned Half-barrier TSO Enabled" turned on, as disabling it can cause significant bugs.

Fix fstatat/statx with NOFOLLOW and JIT bugs

During a livestream, one of our users encountered a bug in FEX-Emu that was breaking Darwinia.
After diving into the game to figure out what it was doing, it turned out that three separate bugs were breaking the game. The first fix,
for the fstatat and statx syscalls, concerned edge-case behaviour with the NOFOLLOW flag. The game was attempting to find the directory that the
executable was living in and being smart in a way that broke FEX.
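
To show what the NOFOLLOW flag changes, here is a small sketch using the standard Linux `fstatat` call. The exact path Darwinia queried is not stated here, so `/proc/self/exe` is only an assumed example of a "where is my executable" lookup, and the helper name `IsSymlink` is made up for this illustration.

```cpp
#include <fcntl.h>
#include <sys/stat.h>

// Reports whether `path` looks like a symbolic link to fstatat.
// With follow == false we pass AT_SYMLINK_NOFOLLOW, so the link itself is
// examined; with follow == true the link target is examined instead.
bool IsSymlink(const char* path, bool follow) {
  struct stat st{};
  const int flags = follow ? 0 : AT_SYMLINK_NOFOLLOW;
  if (fstatat(AT_FDCWD, path, &st, flags) != 0) {
    return false;  // treat errors as "not a symlink" in this sketch
  }
  return S_ISLNK(st.st_mode);
}
```

On Linux, `IsSymlink("/proc/self/exe", false)` reports a symlink, while `IsSymlink("/proc/self/exe", true)` reports the executable it points to. An emulator presumably has to keep both answers consistent with what the guest program would see natively, which is where edge cases like this one can creep in.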

The other bugs were in our optimization passes, where we broke x86 SIB addressing in a couple of ways. We have since added unit tests for
these two bugs, but if you would like to read more you can check out ConstProp fixes for Darwinia.

With these bugs fixed the game now runs correctly under FEX-Emu without issue!

More ARM64EC improvements

This was some cleanup work to make it easier to integrate with what upstream WINE is doing for ARM64EC support. While still not entirely usable
for end-users yet, it is steadily improving and can run real games if the environment is set up correctly. A lot of good work here and we're hoping
for more testing going forward.

NVIDIA Orin CPU errata!

Over the past month or two we had noticed that the NVIDIA Orin platform with its Cortex-A78AE CPU cores was running games markedly worse than our
Snapdragon 8cx Gen 3 platform with Cortex-A78C cores. While these CPU cores are not identical between platforms, they are both based on the Cortex-A78
CPU core design, so they should be relatively close. The NVIDIA Orin runs its cores at a 2.2GHz clock frequency, while the Snapdragon runs its cores at
2.4GHz. A clock speed difference of roughly 9% wasn't accounting for the performance delta we were seeing!

The game we were testing was the PC port of Sonic Adventure 2: Battle. On Orin the board could only achieve 18FPS, while on Snapdragon we were easily
hitting 60FPS with headroom to go higher if VSync was disabled. We were stunned by this absolute performance difference and couldn't pin it down to
a difference in drivers.
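
To put the two gaps side by side, the arithmetic can be written out as a tiny sketch using only the figures quoted above. The constant names are made up for this illustration.

```cpp
// Ratio helper so both comparisons are computed the same way.
constexpr double Ratio(double faster, double slower) {
  return faster / slower;
}

// Clock speeds in GHz: Snapdragon 8cx Gen 3 versus NVIDIA Orin.
constexpr double kClockRatio = Ratio(2.4, 2.2);

// Frame rates in FPS: Snapdragon (VSync-limited) versus Orin.
constexpr double kFrameRateRatio = Ratio(60.0, 18.0);

// The clock advantage is under 10%, while the frame rate gap is over 3x.
static_assert(kClockRatio < 1.10, "clock gap is below 10%");
static_assert(kFrameRateRatio > 3.0, "frame rate gap exceeds 3x");
```

Since the Snapdragon result was capped by VSync, the true frame rate gap is at least this large, which is far more than clock speed alone could explain.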

Turns out we only needed to look at the Cortex-A78AE Software Developer Errata Notice to find out why.

1951502

Atomic instructions with acquire semantics might not be ordered with respect to older stores with release semantics

Under certain conditions, atomic instructions with acquire semantics might not be ordered with respect to older instructions with release semantics. The older instruction could either be a store or store atomic.

This erratum can be avoided by inserting a DMB ST before acquire atomic instructions without release semantics. This can be implemented through execution of the following code at EL3 as soon as possible after boot:

The notice then goes on to give some code that programs the CPU so that it injects these DMB ST instructions before atomic acquires automatically!
This is why this platform has been so weird for performance testing for years! This massive hardware erratum basically deletes any advantage that the
FEAT_LRCPC extension gives FEX, since the hardware falls back to ordering atomics with half-barriers, similar to how FEX already handles unaligned
atomics!

We are now looking to move off of this NVIDIA Orin platform as quickly as possible. It was already old, and now that we have identified a significant
problem around atomic performance, moving is a higher priority. Luckily, over the last few months there have been some great new hardware announcements.
The Snapdragon X Elite devices are shipping soon, NVIDIA has announced a new Jetson AGX Thor platform, the NVIDIA Grace server platform is starting to
become available, and Apple has some new M4 devices that will be interesting! Ideally we will get a new platform that we can plug a Radeon GPU into,
since that is a huge boon to our testing performance, but depending on what is available we may not have that luxury. We'll see as we move on to new
and better platforms!

Video game showcase

Instead of a video showcase from FEX this month, go check out Asahi Lina's YouTube page. She recently did a
couple of live streams fixing issues with the Asahi Linux MicroVM solution for running FEX-Emu on Apple Silicon! She showcases a bunch of games
while covering some of the more technical problems involved with getting FEX-Emu running on that platform.

Be warned, these are very long streams.

Asahi Lina stream Part 1
Asahi Lina stream Part 2

Raw Changes

FEX Release FEX-2406

  • AOTIR

  • Refactor interfaces to clarify ownership flow (3bac767)

  • CMake

  • Remove obsolete Catch2 setting (170204d)

  • CPUID

  • Adds Qualcomm Oryon product name (f7bfecd)

  • Config

  • Change default TSO options (6954ebe)

  • ConstProp

  • remove x86 jit leftover (58614ff)

  • Constprop

  • clean up (14bfe60)

  • FEXCore

  • Fixes the difference between CPL-0 and undefined instructions (efe7c54)

  • Get rid of DeferredSignalFaultAddress and use the InterruptFaultPage (f27f187)

  • ARM64EC x64 entry/exit support (2cae2f2)

  • docs

  • Adds programmer documentation about memory model emulation (a515b70)

  • FEXLoader

  • Cleanup FD extraction from environment variables (9dd6d8e)

  • Changes frontend thread management to wrap FEXCore thread objects (ef6d640)

  • FEXLogging

  • Changes representation of timestamp (9b1b9c2)

  • FEXServer

  • Removes temporary variable allocation (cd249e2)

  • FileManagement

  • Fix fstatat/statx with self and NOFOLLOW (6052b33)

  • Github

  • Support a timeout on checkout (6d3471b)

  • IR

  • drop IRParser (10de2f8)

  • InstCountCI

  • Adds SSE4.2 operations (ad13442)

  • Hardcode the offset to load tests into (663f3d8)

  • InstructionCountCI

  • add bytemark hot block (063b1eb)

  • JIT

  • factor out sub reg size conversion (948938b)

  • VectorOps

  • deduplicate common implementations (e3ec25d)

  • LinuxSyscalls

  • Cleanup envp copying in execve (55bfd63)

  • NFC

  • Fix typo (c5f8ea5)

  • OpcodeDispatcher

  • eliminate some Bfe's (7b4e484)

  • reorder some moves (926eefc)

  • Readme

  • Remove misleading text about x86 hosts being supported (eddb7d1)

  • RegisterAllocationPass

  • drop AVX flag (1fde5d7)

  • Misc

  • Library Forwarding: Add support for 32-bit OpenGL (4dc8648)

  • Find-and-replace OrderedNode* with Ref (c0bab70)

  • optimize shld (9ab0fa0)

  • (8c4860b)

  • Fix segfault when starting TestHarnessRunner with missing arguments (f90d2ae)

  • Optimize logical flags (20d5a26)

  • Track tied sources in the IR (ee96d60)

  • ConstProp fixes for Darwinia (ab0a6bb)

  • Optimize PCMPESTRI flags a bit (35ec54f)

  • Removes warnings (734258e)

  • Delete a big chunk of IR/Passes/* (9d0ff79)

  • (314fea3)

  • (3b7d30d)

  • Optimize asr (32f2dec)

  • Optimize large sign-extended constants (5497240)

  • Optimize sign-extension (0adcc77)

  • Simplify/fix our validation passes (c90036a)

  • Slightly improve pair coalescing + memcpy fix from RA branch (ca70e38)

  • ConstProp, RCLSE: simplifications (9a48310)

  • Allow garbage on more shifts (85776c2)

  • clang-format: left-align escaped newlines (e3e7f02)

  • (048c8de)

  • Use erase-remove idiom to remove element (3eb7a5b)

  • Fix left shift undefined behaviour (5bedf32)

  • Fix exec path where file needs to be ignored (3a7aa83)

  • SRA controlled burn (47242dc)

  • Fix WOW64 frontend with recent wine versions (55d1d6b)

  • Pass compulsory mode argument to open when O_CREAT is used (f70aafb)

  • unittests

  • add XeSS test (a01402d)

  • bextr

  • add SrcSize tests (033b1ce)