FEX-2408
Read the blog post at FEX-Emu's Site!
In the beginning, there were integers.
And robots wanted precise math, and so the x87 floating
point unit was created.
And robots wanted faster math, and so SSE was created to
replace x87, and it was good.
Speeding up x87
Although x87 is slow and deprecated, it hasn't disappeared. 64-bit games will
use SSE or even AVX for floating point math, but older 32-bit binaries --
compiled decades ago -- are filled with x87.
FEX aims to support your entire game catalogue. Last release, we added AVX to
support the newest games. This release, we've circled back to the oldest. Old
games ought to run well on new hardware, but if they use x87, performance can
nosedive. Why? Two x87 quirks: 80-bit precision and the stack.
Floating point numbers are typically 32-bit or 64-bit. 32-bit is
faster, while 64-bit enhances precision for numerical computing. Our
target Arm hardware supports both 32-bit and 64-bit, but
x87 adds an unfortunate third mode: 80-bit. Ostensibly, the extra bits of
precision in intermediate calculations minimize the accumulated error of
the final result.
Is that necessary? Careful code can mitigate rounding error without the massive
80-bit hammer, thanks to techniques like the Kahan summation algorithm. New
code doesn't miss the 80-bit hardware.
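For the curious, here is a minimal Python sketch of Kahan summation. It illustrates the technique only; it is not code from FEX or from any game, and the names `naive_sum` and `kahan_sum` are our own. The idea is to carry the rounding error of each addition into the next one, so plain 64-bit floats stay close to the exact sum.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation; every addition may round."""
    total = 0.0
    for value in values:
        total += value
    return total


def kahan_sum(values):
    """Sum floats while carrying each addition's rounding error forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for value in values:
        corrected = value - compensation
        new_total = total + corrected
        # (new_total - total) is what was actually added; subtracting the
        # intended addend leaves the rounding error of this step.
        compensation = (new_total - total) - corrected
        total = new_total
    return total


values = [0.1] * 100_000
reference = math.fsum(values)  # correctly rounded sum, used as ground truth
naive_error = abs(naive_sum(values) - reference)
kahan_error = abs(kahan_sum(values) - reference)
print(f"naive error: {naive_error:.3e}, Kahan error: {kahan_error:.3e}")
```

On this input the compensated sum should land at least as close to the reference as the naive loop does, with no 80-bit arithmetic involved.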
Sadly, the FEX team can't afford a time machine. They're not cheap anymore.
We'll see what happens with Moore's Law eighteen months ago. So we can't teach
game developers in 2005 how to make do with 64-bit floats. All we can do is
slowly emulate 80-bit floats in software, or substitute 64-bit floats and hope
the game won't notice.
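To make the trade-off concrete, here is a rough Python sketch of what "substituting 64-bit" throws away. It decodes the x87 extended format (1 sign bit, 15 exponent bits, a 64-bit significand with an explicit integer bit) and rounds it to a 64-bit float. The function name `f80_to_f64` is hypothetical, and this is not how FEX implements its conversion; it only shows the layout.

```python
import math

F80_EXPONENT_BIAS = 16383


def f80_to_f64(raw: bytes) -> float:
    """Round a 10-byte little-endian x87 extended value to a 64-bit float."""
    if len(raw) != 10:
        raise ValueError("x87 extended values are 10 bytes")
    significand = int.from_bytes(raw[:8], "little")  # 64 bits, explicit integer bit
    sign_exp = int.from_bytes(raw[8:], "little")     # 1 sign bit + 15 exponent bits
    sign = -1.0 if sign_exp & 0x8000 else 1.0
    exponent = sign_exp & 0x7FFF
    if exponent == 0x7FFF:
        # All-ones exponent: infinity if the fraction is zero, otherwise NaN.
        fraction = significand & 0x7FFF_FFFF_FFFF_FFFF
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0:
        exponent = 1  # zero and denormals share the minimum exponent
    # float(int) rounds the 64-bit significand to 53 bits: this is where the
    # extra precision is lost. Values beyond the 64-bit float range make
    # math.ldexp raise OverflowError; a real converter would return infinity.
    return sign * math.ldexp(float(significand), exponent - F80_EXPONENT_BIAS - 63)


one = bytes.fromhex("0000000000000080ff3f")           # exactly 1.0
one_plus_ulp = bytes.fromhex("0100000000000080ff3f")  # 1.0 + 2**-63
minus_two = bytes.fromhex("000000000000008000c0")     # exactly -2.0
print(f80_to_f64(one), f80_to_f64(one_plus_ulp), f80_to_f64(minus_two))
```

The second value differs from 1.0 only in the lowest bit of the 80-bit significand, so it collapses to 1.0 after conversion. Whether a game cares about that bit is exactly the gamble.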
The second problem with x87 is more obscure. In a typical instruction set, each
instruction specifies which registers it accesses. By contrast, x87 arranges
registers in a stack. Instead of a destination register, x87 instructions
push to the stack. Instead of source registers, sources are indexed relative
to the top of the stack. Unlike 80-bit floats, stack machines are alive and
well for virtual machines. Like 80-bit floats, they complicate
emulation.
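As an illustration, the register stack can be modelled in a few lines of Python. This is a toy model under simplifying assumptions: it ignores the tag word, stack faults, and 80-bit precision, and the class name `X87Stack` is ours. It does show why the physical register behind ST(i) depends on the current top.

```python
class X87Stack:
    """Toy model of the x87 register stack: eight slots plus a 3-bit top index."""

    def __init__(self):
        self.slots = [0.0] * 8
        self.top = 0

    def st(self, i):
        """Read ST(i): the slot i entries away from the current top."""
        return self.slots[(self.top + i) & 7]

    def fld(self, value):
        """Push: decrement top (mod 8), then write the new ST(0)."""
        self.top = (self.top - 1) & 7
        self.slots[self.top] = value

    def fadd(self, i):
        """ST(0) = ST(0) + ST(i)."""
        self.slots[self.top] = self.st(0) + self.st(i)

    def fstp(self):
        """Pop: return ST(0), then increment top (mod 8)."""
        value = self.st(0)
        self.top = (self.top + 1) & 7
        return value


stack = X87Stack()
stack.fld(1.5)         # ST(0) = 1.5
stack.fld(2.0)         # ST(0) = 2.0, ST(1) = 1.5
stack.fadd(1)          # ST(0) = 2.0 + 1.5
result = stack.fstp()  # pops 3.5, leaving 1.5 in ST(0)
print(result, stack.st(0), stack.top)
```

Every access goes through `(top + i) & 7`, an index computed at run time. Keeping `slots` in memory makes that easy, which is the slow approach described next.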
Arm instructions specify their registers directly, but we don't know which
registers an x87 instruction will use without knowing the stack top.
Previously, FEX worked around this mismatch by keeping the emulated x87 stack
in memory instead of registers. Arm can indirectly index memory, so this
works. However, it's slow. Because Arm is a RISC architecture, this approach
requires multiple load/store instructions for every x87 arithmetic operation.
We can do better.
Instead of single instructions, we can translate entire x87 code blocks. That
gives us the full context of each instruction. In "good" conditions, that lets
us statically determine the stack layout, so we can translate stack access to
real Arm floating point registers. The stack loads and stores disappear from
our generated code.
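A heavily simplified Python sketch of the idea follows. The mini-IR, the op names, and the function `resolve_stack_slots` are all invented for illustration; FEX's real pass works on its own IR and handles far more cases. The point is that within one block the stack depth is known at translation time, so stack-relative operands can be turned into fixed slot numbers.

```python
def resolve_stack_slots(block):
    """Rewrite stack-relative operands as fixed slot numbers for one block.

    `block` is a list of tuples in a made-up mini-IR:
      ("fld", name)   push a value loaded from `name`
      ("fadd", i)     ST(0) += ST(i)
      ("fstp", name)  store ST(0) to `name` and pop
    Returns None when the layout cannot be resolved statically, in which
    case a translator would fall back to the in-memory stack.
    """
    depth = 0  # number of entries this block has pushed so far
    resolved = []
    for op in block:
        kind = op[0]
        if kind == "fld":
            # The new ST(0) lives in the next free slot.
            resolved.append(("load", depth, op[1]))
            depth += 1
        elif kind == "fadd":
            top = depth - 1
            src = top - op[1]
            if src < 0:
                return None  # reads a value pushed before this block
            resolved.append(("add", top, top, src))
        elif kind == "fstp":
            if depth == 0:
                return None  # pops a value pushed before this block
            depth -= 1
            resolved.append(("store", depth, op[1]))
        else:
            return None  # unknown op: give up on the fast path
    return resolved


block = [("fld", "a"), ("fld", "b"), ("fadd", 1), ("fstp", "out"), ("fstp", "scratch")]
for line in resolve_stack_slots(block):
    print(line)
```

Each resolved slot number can be bound to a real register, so no run-time top index is needed. Blocks that touch values pushed before the block started are the "bad" conditions, where this sketch simply gives up.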
There's another trick we can play. Sometimes games will copy 80-bit floats
without performing any calculations. Translating these copies naïvely is slow
due to 80-bit emulation overhead. However, we can analyze multiple
instructions together to detect 80-bit copies and translate to efficient Arm
code.
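Here is a sketch of that kind of pattern match, again in Python over an invented mini-IR. The op names `fld_m80`, `fstp_m80`, and `copy10` and the function `fuse_f80_copies` are hypothetical. It assumes an 80-bit load immediately followed by an 80-bit store-and-pop moves the ten bytes unchanged, and it ignores stack overflow checks.

```python
def fuse_f80_copies(block):
    """Replace an 80-bit load followed by an 80-bit store-and-pop with one copy.

    `block` is a list of tuples such as ("fld_m80", src) and ("fstp_m80", dst).
    A fused pair becomes ("copy10", dst, src): a plain 10-byte memory copy
    that needs no floating point emulation at all.
    """
    fused = []
    i = 0
    while i < len(block):
        op = block[i]
        following = block[i + 1] if i + 1 < len(block) else None
        if op[0] == "fld_m80" and following is not None and following[0] == "fstp_m80":
            fused.append(("copy10", following[1], op[1]))
            i += 2  # both instructions are consumed by the copy
        else:
            fused.append(op)
            i += 1
    return fused


block = [("fld_m80", "src"), ("fstp_m80", "dst"), ("fld_m80", "x"), ("fadd", 1)]
print(fuse_f80_copies(block))
```

The first pair collapses into a single copy, while the later load stays put because real arithmetic follows it.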
These optimizations combine to a surprisingly large speed-up. To illustrate: a
hot block in Psychonauts swizzles a 4x4 matrix. That's light on arithmetic but
heavy on x87 overhead. These optimizations reduce the translated code from 2340
to 165 instructions. That's a 93% improvement!
Big thanks to Paulo for taming x87, available in
this month's FEX release.
[redacted]
...
"I'm working on a branch that's 10% faster than upstream. Should I mention that
in the blog?"
"No, we don't want to ruin the surprise for next month's release."
...
Raw Changes
FEX Release FEX-2408
- AOTIR
  - Change std::unique_ptr to fextl::unique_ptr (5fe405e)
- ARM64EC
  - Install a custom call checker to bypass NTDLL function patches (2c3e6cb)
  - Set appropriate AFP and SVE256 state on JIT entry/exit (85d1b57)
  - Introduce FEX-side CRT and Windows API replacements (7b1d954)
  - Handle direct syscall instructions (24ea4b7)
  - Improvements to exception flag handling (cadb0a2)
  - Support the JIT API as is used by Windows (7ffd3e5)
- AVX128
  - Optimize blends (4882f10)
  - Optimize all cases of vpermq (69ed39d)
  - Implement support for scalar FMA with AFP (9201ac5)
  - Improve VPERMILPS/PD and VPSHUFD (f8c4c54)
  - Fixes vmovq loading too much data (b9a6cae)
  - Extends 32-bit indexes path for 128-bit operations (7ccb252)
  - Optimize the vpgatherdd/vgatherdps cases that would fall back to ASIMD (22b2669)
  - Optimize QPS/QD variant of gather loads! (3627de4)
  - Extend 32-bit address indices when possible (aad7656)
  - Prescale addresses in gathers if possible (47d077f)
- AllocatorHooks
  - Correct memory API usage on Windows (f98c010)
  - Allocate from the top down on windows (9d0b6ce)
- ArgumentLoader
  - Removes static fextl::vector usage (f81fc4e)
- Arm64
  - Implements support for DAZ using AFP.FIZ (f8c6baa)
  - Implement support for SVE bitperm (b282620)
  - Remove one move if possible in FMA operations (3bea08d)
  - Fixes long signed divide (3d65b70)
- Arm64Emitter
  - Reload STATE before SRA fill on ARM64EC (a742441)
- CMake
  - Add option to build PDB debug info instead of DWARF (49b8dae)
- CPUID
  - Adds a few missing CPU names for new CPU cores (c4ae761)
- CodeEmitter
  - Removes vestigial vixl usage (2da819c)
- Config
  - Little assume non-null check (f40bc13)
  - Converts two LUT maps over linear scan arrays (c5fe872)
  - Removes a static vector initializer (e0c783d)
  - Search more locations for the config directory on Windows (1f59f0e)
- EmulatedFiles
  - Adds a few leaf CPUID flags (d385e49)
  - Fix bad formatting (d1249ec)
- F80
  - Drop dependency on state stored in TLS (c3c2b61)
- FEX
  - Moves HostFeatures querying to the frontend (434bffa)
- FEXCore
  - Pass HostFeatures in to CreateNewContext directly (0ecfc65)
  - Drop deferred signal handling on Windows (1007f87)
  - Add a generic spill/fill-all syscall ABI and use for Windows (10ee963)
  - Removes ThreadManager (403fd62)
  - Refactor ExitHandler slightly (380ba0a)
  - Removes CPUBackendFeatures (1fe497d)
- FixedSizePooledAllocation
  - Fix a race when unclaiming disowned buffers (228009c)
- HostFeatures
  - Removes feature flags always supported by FEX (633f624)
- IR
  - garbage collect premature F80Cmp optimizations (da51169)
- InvalidationTracker
  - Better match Windows code invalidation behaviour (4a3250d)
- Ioctl32
  - Removes static fextl::vector in ioctlemulation (9688f5e)
- LogManager
  - Removes fextl::vector usage (870e395)
- OpcodeDispatcher
  - Force noinline for the function call in the Bind helper (19e8492)
  - Replace hand-written wrapper templates with a generic utility (fc0b233)
  - Fix 8/16-bit rcr masking (bbf8dde)
  - Fixes rotates with zero not zero extending 32-bit result (97329cc)
  - X87
    - use less creative Refs (d92b6a9)
- RA
  - fix interaction between SRA & shuffles (90a6647)
- Scripts
  - Fix issue in aarch64_fit_native (94bb7eb)
  - Workaround deprecated parse_version (c42808c)
  - drop remnant of IR parser (991c694)
- Softfloat
  - Fixes Integer indefinite return for 16-bit signed values (f2d1f2d)
- SpinWaitLock
  - Fixes missing newline in asm (1473129)
- Syscalls
  - Updates for v6.10 (dd26b0c)
- Telemetry
  - Remove VEX flag (5c9bb65)
  - Change how visibility of telemetry values work (b8e864f)
- Threads
  - Setup the stack tracker to not need global initialization (93eead2)
- VDSO
  - Stop using a vector for a static (04592f8)
- WOW64
  - Support the JIT API as used by Windows (03ca3e7)
  - Mark the FEX dll as a wine builtin (635182b)
- Windows
  - Pull in additional method and structure definitions from wine (f75bd2f)
  - Commonise TSOHandlerConfig (2fdd80f)
  - Report as an AMD64 processor when targeting ARM64EC (dbac23b)
- X87
  - save uop in ReconstructFTW (f72cee4)
- Misc
  - Directly use the EC code bitmap for determining page arch (201fe6e)
  - Don't apply the address-size flag to segment addresses (83fedd6)
  - Commonise logging and fallback to a log file for debug output on Windows (dedf4a9)
  - Fix nasm warning in Rounding.asm (069e2ce)
  - Reuse Top in ReconstructFSW_Helper (941fd9c)
  - ASM Tests: X87 Rounding modes (d24d0a9)
  - Fix call to FNINITF64 and refactor (c2092bf)
  - Test running scripts tell ctest of skipped tests (a6cf7fa)
  - json_ir_generator: stop prefixing arguments (6b91e0c)
  - x87 Stack Optimization (d507f4c)
  - Optimize zero x87 flags (77ec950)
  - Add x87 memcpy instcountci tests (e4b7a65)
  - Autogenerate LoweredX87() query, misc json_ir_generator cleanup in the area (17a55fb)
  - X87 Stack Ops Auto-marking (d204155)
  - Remove unused function MmapOverride (924b8c1)
  - ARM64EC frontend (6df51a5)
  - Remove Disabled_Tests file (e65545a)
  - Try to delete RCLSE again (d79b7fc)
  - Enable coverage configuration for FEX (b6e1469)
  - Implement support for SSE4.1/AVX NT loads (e25918d)
  - Fix all the warnings (9a8694c)
  - OpcodeDispatcher: Avoid template monomorphization to reduce FEXLoader binary size (070a914)
  - Drop deferred flag infrastructure (72d6c8e)
  - Tests for X87 FTST (3ef9ea9)
  - VCVT{T,}PD2DQ fixes and optimization (af6a0be)
  - Use nproc only if TEST_JOB_COUNT not specified (4501123)
  - FEXCore ARM64EC CI support (968d5e0)
  - Fix CF with small shifts (9bad09c)
  - Fix 8/16-bit RCR (653bf04)
  - JIT: fix ShiftFlags masking (b77a25b)
  - Fix 16-bit SBB (692c2fa)
- fextl
  - Properly handle nullptr arguments in fextl::default_delete (8381d44)
- github
  - Vixl simulator enable more asm tests (09c4a55)
- man
  - Fixes newline issue with strenum (3d9114b)
- unittests
  - Fixes vpblend unittest (f9bdf0b)
  - Extends vinsert{i,f}128 tests for garbage data (95a9f32)
- x87StackOptimizationPass
  - Default initialise StackMemberInfo members (3e59fc0)