FEX-2404
Read the blog post at FEX-Emu's Site!
After last month having an absolute ton of improvements, this month of changes is going to look positively tiny in comparison. We have some good new
options for tinkering with FEX's behaviour and more performance improvements. Let's get into it!
Implement more memory model emulation toggles
The biggest performance hit with FEX's x86 emulation has always been emulating the memory model of x86. ARM has added various extensions over the years to make this emulation faster but it still isn't enough.
- FEAT_LSE - Adds a bunch of atomic memory instructions
  - Original ARMv8.0 doesn't support this. Massive impact on performance.
- FEAT_LSE2 - Adds unaligned atomics (within a 16-byte granule) to improve performance of x86 atomics.
  - Doesn't quite cover unaligned atomics across the full 64-byte cacheline, which x86 supports
- FEAT_LRCPC - Adds new load instructions which match the x86 memory model
- FEAT_LRCPC2 - Adds even more load/store instructions which match x86 instructions
- FEAT_LRCPC3 - Adds even more, including vector load/store instructions
  - No hardware today supports this extension
Even with this set of extensions, emulating x86's memory model can incur nearly a 10x performance hit. The impact is felt most in games because they use vector instructions very heavily, and without the FEAT_LRCPC3 extension there are no vector load/store instructions that match the x86 memory model.
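To make the memory model problem concrete, here is a minimal C++ sketch of the classic message-passing pattern. On x86, ordinary loads and stores already provide this ordering, so guest code can rely on it without fences; on ARM the emulator has to ask for it explicitly. On AArch64, compilers commonly lower the acquire load and release store below to instructions such as LDAR/LDAPR and STLR. The `Mailbox`, `publish`, `consume`, and `run_message_passing` names are made up for illustration and are not FEX code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example type: plain data published through an atomic flag.
struct Mailbox {
    int payload = 0;                 // ordinary (non-atomic) data
    std::atomic<bool> ready{false};  // flag that publishes the payload
};

// Writer: store the data, then publish it with release ordering so the
// payload write cannot be reordered after the flag write.
inline void publish(Mailbox& box, int value) {
    box.payload = value;
    box.ready.store(true, std::memory_order_release);
}

// Reader: spin until the flag is observed with acquire ordering, then read
// the payload. The acquire load keeps the payload read from moving earlier.
inline int consume(Mailbox& box) {
    while (!box.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return box.payload;
}

// Runs the writer and reader on separate threads and returns what the
// reader observed.
inline int run_message_passing(int value) {
    Mailbox box;
    int seen = -1;
    std::thread reader([&] { seen = consume(box); });
    std::thread writer([&] { publish(box, value); });
    writer.join();
    reader.join();
    assert(seen == value);
    return seen;
}
```

With relaxed ordering instead of acquire/release, a weakly ordered CPU would be allowed to let the reader see the flag before the payload. An x86 emulator cannot know which guest accesses depend on this ordering, which is why it has to be conservative about every one of them.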
With this in mind, we are introducing some sub-options around emulating x86's TSO memory model to try and lessen the impact when we can get away with it. These new options can be found in the FEXConfig Hacks tab.
These two new options are only available for toggling when TSO emulation is enabled. If your CPU supports FEAT_LRCPC and FEAT_LRCPC2 then a recommended configuration is to keep the TSO Enabled option enabled, but disable the Vector and Memcpy options.
While this will incur a performance hit compared to disabling TSO emulation, it is significantly more stable to keep TSO emulation on.
If you still need more performance, then it may be beneficial to turn off TSO emulation entirely. It's unstable though! It's incorrect emulation to gain speed!
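The relationship between the main toggle and the two sub-options can be sketched as a small C++ model. This is only an illustration of the dependency described above; the struct and function names are hypothetical and are not FEX's real configuration keys.

```cpp
#include <cassert>

// Hypothetical model of the three toggles. Field names are illustrative only.
struct TsoOptions {
    bool tso_enabled = true;          // main TSO emulation toggle
    bool vector_tso_enabled = false;  // sub-option: vector load/store
    bool memcpy_tso_enabled = false;  // sub-option: REP MOVS / REP STOS
};

// Scalar accesses follow the main toggle directly.
inline bool scalar_tso_active(const TsoOptions& o) {
    return o.tso_enabled;
}

// Sub-options only matter while the main toggle is on.
inline bool vector_tso_active(const TsoOptions& o) {
    return o.tso_enabled && o.vector_tso_enabled;
}

inline bool memcpy_tso_active(const TsoOptions& o) {
    return o.tso_enabled && o.memcpy_tso_enabled;
}

// The configuration suggested in the text for CPUs with FEAT_LRCPC and
// FEAT_LRCPC2: keep TSO on, turn the two sub-options off.
inline TsoOptions recommended_for_lrcpc2() {
    TsoOptions o;
    o.tso_enabled = true;
    o.vector_tso_enabled = false;
    o.memcpy_tso_enabled = false;
    return o;
}
```

In this model, turning off the main toggle makes both sub-options irrelevant, which mirrors why they can only be toggled while TSO emulation is enabled.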
Vector TSO enabled
This option enables emulating the memory model around vector load/store instructions. This has a HUGE performance impact even on the latest-generation hardware.
Memcpy TSO enabled
This option enables emulating the memory model around x86's REP MOVS and REP STOS instructions. These are used for doing memory copying and memory setting respectively.
The impact of this option depends heavily on the application. This is because most memcpy and memset functions actually use vectors to modify the memory.
JIT core improvements
Once again this month there has been a focus on JIT optimizations, although this time it might be hard to see what is improving. Overall, benchmarks
show roughly a 3% performance improvement, and a good share of this month's changes are foundational work to lower JIT compile time
overhead in the coming months. As usual there is too much to dive into each change individually, so we'll just have a list.
- Optimize LOOP/N/E
- Negate more inline constants
- Optimize PF calculation using integers rather than vectors
- Optimize CLC
- Optimize cmpxchg
- A bunch of instructions cleaned up and rewritten to remove small amounts of overhead
- Improves 32-bit address mode accesses
- Implements support for prefetch and rdpid instructions
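As an illustration of the parity flag item in the list above, x86's PF is set when the low 8 bits of a result contain an even number of set bits. One common scalar way to compute it is an XOR fold, sketched below in C++. This shows the general integer technique and is not necessarily the instruction sequence FEX emits; `parity_flag` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the low byte of `result` has an even number of 1 bits,
// which is the condition under which x86 sets PF.
inline bool parity_flag(std::uint64_t result) {
    std::uint32_t b = static_cast<std::uint32_t>(result & 0xFF);
    // Fold the byte onto itself: after these three steps, bit 0 holds the
    // XOR of all eight original bits (1 means an odd number of set bits).
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return (b & 1u) == 0;
}
```

Only the low byte takes part, so `parity_flag(0x100)` is true just like `parity_flag(0)`, while `parity_flag(0x7)` is false because three bits are set.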
Optimize memcpy and memset IR operations when TSO emulation is disabled
Speaking of memory copies: we have now optimized the implementation of the memcpy and memset instructions to be significantly faster. Sometimes a compiler will inline these instructions, which was causing upwards of 5% of CPU time to be spent doing memory copies.
With this optimization in place, we have benchmarked an improvement from 2-3GB/s up to 88GB/s! That'll teach that code to be slow.
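To show why a fast path needs care, here is a C++ sketch of forward REP MOVSB semantics next to a chunked copy. A forward byte-by-byte copy has well-defined behaviour for overlapping buffers (a destination one byte ahead of the source turns the copy into a fill), so a wider copy has to fall back when the buffers are too close together. The 32-byte threshold follows the "overlaps within 32 bytes" commit in this release's change list; everything else, including the function names, is an assumption for illustration and not FEX's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reference semantics of a forward REP MOVSB (direction flag clear):
// one byte at a time, in increasing address order.
inline void rep_movsb_reference(std::uint8_t* dst, const std::uint8_t* src,
                                std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
}

// Hypothetical fast path: copy in 32-byte chunks, falling back to the
// byte loop when the buffers are closer than one chunk apart.
inline void rep_movsb_fast(std::uint8_t* dst, const std::uint8_t* src,
                           std::size_t count) {
    constexpr std::size_t kChunk = 32;
    const auto d = reinterpret_cast<std::uintptr_t>(dst);
    const auto s = reinterpret_cast<std::uintptr_t>(src);
    const std::uintptr_t distance = d > s ? d - s : s - d;

    // Conservative check: any overlap closer than a chunk takes the slow path.
    if (distance < kChunk) {
        rep_movsb_reference(dst, src, count);
        return;
    }

    std::size_t i = 0;
    for (; i + kChunk <= count; i += kChunk) {
        // Load a whole chunk before storing it, like a wide register copy.
        std::uint8_t tmp[kChunk];
        std::memcpy(tmp, src + i, kChunk);
        std::memcpy(dst + i, tmp, kChunk);
    }
    // Copy the remaining tail bytes one at a time.
    rep_movsb_reference(dst + i, src + i, count - i);
}
```

When the destination is at least one chunk ahead of the source, every byte a chunk reads has already been written by an earlier chunk, so the chunked copy produces the same result as the byte loop.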
Fix memory leaks in thread creation
There was a memory leak where FEX would leak some thread stacks when threads shut down. This has now been resolved, which lowers memory usage for
long-running applications that shut down threads. In particular this makes Steam consume less RAM.
We have more memory leaks to solve as we move forward but they are significantly less severe than this.
A ton of small cleanups in the code
This month has had a lot of code cleanup in FEX, but these changes aren't user facing so they aren't very interesting. Let it be known, though, that something
like half the commits this month were cleaning up or restructuring various bits of code, even though that work isn't getting a focus here.
Raw Changes
FEX Release FEX-2404
- Allocator
  - Cleanup StealMemoryRegions implementation (8a3d08e)
- ConstProp
  - drop dead code (202a60b)
- ELFParser
  - Stop using a VLA (32ec4a3)
- Externals
  - Update Catch2 to v3.5.3 (b892da7)
- FEXCore
  - Fixes priority of FEX_APP_CONFIG (7786c23)
  - Move nearly all IR definitions to internal (e2a0953)
  - Moves CodeLoader to frontend (3bed305)
  - Moves CPUBackend definition internal (f6639c3)
  - Remove DebugStore map (2ad170b)
  - Adds more TSO control levers (24fd28e)
  - Removes vestigial mman SMC checking (542f454)
  - Fallback to the memcpy slow path for overlaps within 32 bytes (167896d)
  - Add non-atomic Memcpy and Memset IR fast paths (7dcacfe)
- FEXLoader
  - Add a way to sleep a process on startup (624bc3f)
  - Add some debug-only tracking for FEX owned FDs (79454ed)
- InstcountCI
  - Adds a block that is causing panic spilling (150af80)
- IoctlEmulation
  - Add missing nouveau ioctl (6d94d79)
- JIT
  - Optimize pmovmaskb with a named vector constant (ab8ee64)
- Linux
  - Expose support for v6.8 (e33a76a)
  - Threads
    - Fixes a stack memory leak for pthreads (3d31291)
- OpcodeDispatcher
  - clean up shifts (c43af8e)
  - drop ZeroMultipleFlags (b632f72)
  - eliminate xblock liveness for rcl/rcr (cd9ffd2)
  - eliminate branch in cmpxchg pair (aa26b62)
  - Fixes 32-bit mode LOOP RCX register usage (7698347)
  - optimize LOOP/N/E (8852d94)
  - Implement support for the various prefetch instructions (2a9fcc6)
  - Implement rdpid (ba3029b)
- RA
  - drop dead block interference code (f2d001e)
  - Removes VLA usage (67baff8)
  - Adds RIP when a block panic spills (a8b59c1)
- RCLSE
  - Optimize store-after-store (ca6b2e4)
- Telemetry
  - Allow redirecting directory that data is written to (970d5d5)
  - Adds tracker for non-canonical memory access crash (7f90ca5)
  - Rename old file instead of copying (002ca36)
- Misc
  - Negate more to inline constants (aa8d04c)
  - Minor cleanups around flags (37f2b41)
  - Use scalar integer code to calculate PF (bd0b5ec)
  - Eliminate xblock liveness with rep cmp/lod/scas (e8abc88)
  - rewrite ROL/ROR (29c6281)
  - Fix reference to out of bounds address in offsetof (4214d9b)
  - optimize clc (b1ddd8c)
  - Moves FHU TypeDefines to FEXCore includes (5c29c9d)
  - Optimize cmpxchg with flagm (2a625a4)
  - Eliminate crossblock liveness in xsave/xrstor (d25ace4)
  - rewrite Demon Addition Adjust (DAA) and other demonic opcodes (7b74ca1)
  - Library Forwarding/vulkan: Fix query of vkCreateInstance function pointer (4ea6305)
  - Put <20M in double quotes to avoid truncate error (1450c92)
  - Removes false termux support (0c24aea)
  - Optimize DF representation (cd2a6ce)
  - Library Forwarding: Don't map float/double to fixed-size integers (4e269d8)
  - Disable assert in release (c37a12e)
  - Improve 32bit ld/st addressing mode propagation (ff0c763)
  - Library Forwarding: Fix accidental data copying when converting from host to guest layout (26a6679)
- unittests
  - ASM
    - Adds a test for overlapping memcpy using rep movs (6ce366e)