FEX-2309
Read the blog post at FEX-Emu's Site!
Last month we hinted that we didn't get all of the optimizations in that we wanted. There's more of that this month, but we have also had an entire month to
push optimizations in. This month was a whirlwind of optimizations improving performance all over the place, all because of one feature that landed:
Instruction Count continuous integration! Let's dive into what this is.
Instruction Count CI
This is a major feature that we added last month that doesn't directly affect users, but it is such a huge quality-of-life improvement for our developers
that we need to discuss what it is. At its core, InstCountCI is a database (actually JSON) of the x86 instructions that FEX-Emu supports, showing how
each instruction gets converted to Arm64 instructions. It is stored in a textual format so that these instruction implementations are easy to read and
quick to update when an implementation changes. This has had a profound effect on our developers, who can't help but look at poor instruction
implementations and find ways to optimize them.
<- Optimized versus non-optimized picture ->
As you can see in the example, one very complex instruction that was previously not optimal now translates into something much more reasonable.
So far this has nerdsniped at least half a dozen developers into finding more optimal implementations of these instruction translations!
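To give a feel for the format, an entry might look roughly like the following sketch; the exact schema, the chosen instruction, and the register assignments are illustrative assumptions rather than FEX's actual database contents:

```json
{
  "add rax, rbx": {
    "ExpectedInstructionCount": 1,
    "ExpectedArm64ASM": [
      "add x4, x4, x5"
    ]
  }
}
```

When an implementation changes, the expected assembly no longer matches and CI surfaces the diff for review.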
Some design considerations must be understood when looking at FEX's instruction implementations, though. The most important thing to remember
is that these implementations look at each instruction in a vacuum. Instructions are translated as single-instruction entities, so any sort of
multi-instruction optimization is not visible in this CI system. Additionally, this isn't run on hardware in our CI, so
implementations that are close in instruction count may have wildly different performance characteristics depending on the hardware. So while it is a
good guide for getting eyes on the assembly, some knowledge of what the translation is doing is still needed to ensure it's both fast and
correct.
This CI system was used heavily this last month for our next topic.
Optimization Extravaganza!
With InstCountCI in place, we can now quantify optimizations going into the FEX CPU JIT without accidentally compromising the performance of other
instructions. With this in place, we have had an absolute ton of CPU optimizations land in our JIT; enough that going through them all would
take longer than all previous progress reports combined!
Instead of going through each individual change, let's just discuss the main optimizations that have landed. The bulk of the work has
been making the translation from SSE instructions to Arm64's ASIMD instructions more optimal. This is because reasoning about vector
optimizations is easier in this instance, and also because games abuse vector instructions far more heavily than regular desktop applications do. There were
other optimizations as well, like making some flag-generation instructions more optimal and eliminating redundant move instructions!
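Purely as a hypothetical illustration of the kind of redundant-move elimination involved (the register mapping and instruction sequences here are our own sketch, not taken from a specific FEX commit), consider translating `addps xmm0, xmm1` to ASIMD:

```asm
; naive translation: shuffles data through a temporary
mov  v0.16b, v16.16b        ; copy the mapped xmm0 into a scratch register
fadd v0.4s, v0.4s, v17.4s   ; do the 4x float add
mov  v16.16b, v0.16b        ; copy the result back

; optimized translation: operate on the mapped registers directly
fadd v16.4s, v16.4s, v17.4s
```

Removing moves like these is exactly the sort of change that shows up immediately as a smaller instruction count in InstCountCI.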
Let's take a look at the bytemark results.
<- Bytemark graphs ->
There's some surprising uplift in these numbers! Even more so since bytemark shouldn't heavily utilize SSE instructions, so this is mostly coming
from the general optimizations that occurred. Let's take a look at another benchmark for fun.
<- Geekbench 5.4.0 graph ->
Whoa, that is a surprising uplift in one month! Geekbench actually has some
benchmarks that use vector operations, so those can see larger improvements than expected. We should expect even more performance once we
start optimizing more non-vector instruction translations!
As for gaming benchmarks, we're not going to run any in this blog post, but we have been told that due to the various optimizations this month, Portal
performance has gained 30% and Oblivion 50%. These are big improvements towards making games feel better while playing them. The main concern here is that the
Adreno 690 in our Lenovo X13s test system is actually quite unstable during testing, so finding suitable games that are CPU bound without crashing
the kernel driver is surprisingly difficult. Most of the lighter games that don't crash the MSM kernel driver are already running at hundreds of FPS
anyway, so they aren't interesting.
As a fun quirk of optimizing vector operations this month, we have finally landed our first optimizations that use Arm's SVE instruction set while
operating at 128-bit width. It turns out there are a few optimizations that can be done here aside from implementing AVX with the 256-bit version! I'm
sure we will see more of these as we continue optimizing.
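As one illustrative case of what SVE offers even at 128-bit (an assumption on our part, not necessarily the specific optimization that landed): SVE's `index` instruction can materialize a lane-index vector in a single instruction, where ASIMD typically needs a constant loaded from memory:

```asm
; build {0, 1, 2, 3} in 32-bit lanes
; ASIMD: load a precomputed constant from a literal pool
ldr   q16, constant_0_1_2_3

; SVE with a 128-bit vector length: no constant needed
index z16.s, #0, #1
```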
Remove most implicit sized IR operations
Continuing from the last topic, this is one of the main changes that allows us to start working on non-vector instruction optimizations. FEX's IR
around general-purpose ALU operations has a history of using implicitly sized IR operations. This means we would check the size of the incoming data
sources and make an assumption about what the operating size of the whole operation should be. While this worked, it has been an absolute thorn in our side
for years at this point. Any time we made a seemingly innocuous change, it would subtly change the behaviour of some IR operations as a new size
propagated through the stack. Now that all of these operations explicitly state their operating size at generation time, there is less room for error.
This mirrors how our vector operations work: those were explicitly sized from the start and have had significantly fewer issues over
time.
With this change in place we can start optimizing general purpose ALU operations with less worry about breaking the world.
Mingw work
Some more work landed this month towards getting WINE WOW64 support wired up: adding a toolchain file to help
facilitate cross-compiling, no longer saving and restoring the x18 platform register, and various other things. While full support isn't yet merged, there's
a lot of preliminary work landing so we can support this. While this work is very early, it is already showing significant performance improvements
for Windows-native games. A game like BioShock Infinite is already running faster than FEX emulating x86 WINE fully! Look forward to future
improvements and integrations as this gets wired up!
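For context, a minimal CMake toolchain file for MinGW cross-compiling generally looks something like this sketch; the compiler triple and every line here are assumptions, not the contents of the file that actually landed:

```cmake
# Hypothetical MinGW cross-compile toolchain sketch (llvm-mingw style triple assumed).
set(CMAKE_SYSTEM_NAME Windows)
set(CMAKE_SYSTEM_PROCESSOR ARM64)

set(CMAKE_C_COMPILER aarch64-w64-mingw32-clang)
set(CMAKE_CXX_COMPILER aarch64-w64-mingw32-clang++)
set(CMAKE_RC_COMPILER aarch64-w64-mingw32-windres)

# Find programs on the host, but libraries and headers in the target sysroot.
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
```

Such a file is selected at configure time through the `CMAKE_TOOLCHAIN_FILE` variable.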
Raw Changes
FEX Release FEX-2309
- ARM64
  - Optimize vector zeroing (eaed5c4)
- ARMEmitter
  - Handle SVE load and broadcast quadword groups (a9dea29)
  - Handle SVE load and broadcast element group (710a392)
  - Handle load/store multiple structures (scalar plus scalar) groups (eda67eb)
  - Handle SVE ADR (139dd4c)
  - Handle SVE CPY (immediate) (72357e5)
  - Migrate off vixl float utils (5a0a6dd)
  - Handle SVE FCPY (predicated) (8fce133)
  - Remove resolved TODO comment (5de7eee)
  - Handle contiguous first fault load (scalar plus scalar) group (0109e88)
  - Handle SVE FP multiply-add long groups (6f4a23d)
- Arm64
  - Only allocate vixl::Decoder if enabled (2d78b1f)
  - Optimize AES operations by caching a zero register (7f99738)
  - Optimize AESKeyGenAssist (02b891c)
  - Optimize VFMin/VFMax (0819338)
  - Optimize SVE VInsElement (1f2c5fc)
  - Stop abusing orr in LoadConstant (6d562f8)
  - Optimize non-optimal BFI move case (1029bb1)
  - Optimize CacheLine{Clear,Clean} (1343c14)
  - Adds stats to the disassembly (53ac8ab)
  - Implement first SVE-128bit optimization (fe35135)
  - Remove erroneous LoadConstant (c4c7620)
- ConversionOps
  - Remove redundant moves in AdvSIMD VInsGPR (6e4765d)
  - Add missing half-precision conversions to scalar functions (172c8f3)
  - Add scalar support to Vector_FToI (a62ba75)
- EncryptionOps
  - Use MOVI reg, #0 to zero vectors (4286d44)
- VectorOps
  - Remove redundant moves in SVE VExtr when possible (4297e13)
  - Remove redundant moves from VSQXTN2/VSQXTUN2/VSQSHL/VSRSHR (5b8a0f1)
  - Remove redundant moves from SVE variable/immediate/vector shifts when possible (926b8c2)
  - Remove redundant moves from SVE BSL when possible (ec6548e)
  - Remove redundant moves from SVE V{S,U}Min/V{S,U}Max when possible (350bca9)
  - Remove redundant moves from SVE VFMin/VFMax when possible (2264058)
  - Remove moves from SVE VFDiv if possible (da098d8)
  - Remove redundant moves from SVE VURAvg if possible (2501ebc)
  - Remove redundant move in VFRSqrt SVE path (6ad053a)
- Arm64Emitter
  - Stop saving and restoring platform register (affbcd2)
  - Ensure that 128-bit predicate is generated with SVE (49b8b7c)
- CMake
  - Add mingw toolchain file (f1aa620)
- Config
  - Minor changes (3885bc4)
- ConstProp
  - Adds constpool distance heuristic (67a26a0)
  - Fix set-but-not-used mask variable (2f0c690)
- Context
  - Adds helper to reconstruct and consume packed EFLAGS (ea96581)
- Externals
  - Update Catch2 to v2.13.10 (e025d32)
  - Update fmt to 10.1.0 (b0ec419)
- FEX
  - Adds instruction count CI (6c7371a)
  - Create a CommonTools static library (f318203)
- FEXCore
  - Rework X87 tag word handling (8bae58d)
  - Allows disabling telemetry at runtime (d738538)
  - Adds telemetry around legacy segment register setting (a523858)
  - Allow for interrupting the JIT on block entry (fc84f6b)
  - Fixes bug with 32-bit adcx (da17e24)
  - Fixes Arm64 stats disassembly (099f29f)
  - Fixes vector shifts by zero (c77ed78)
- IR
  - Fixes bug in IRDumper without specification (6cb0f52)
- FEXInterpreter
  - Fixes compilation when telemetry is disabled (5cc30bd)
  - Supports procfs/interpreter (016c3c0)
- Filemanagement
  - Optimize GetSelf using a string_view (9f3730f)
- Github
  - Only enable InstCountCI on an ARM platform (df3d4ef)
  - Adds a CI runner for 128-bit SVE testing (969ad9b)
- IR
  - Remove phi nodes (16f826c)
  - Adds printer for OpSize (2c2081c)
  - Adds IR::OpSize to IRDumper (e1c2033)
  - Removes implicit sized add (b40f784)
  - Removes implicit sized bfe (351c0ee)
  - Removes implicit sized and (8239f8a)
  - Removes implicit sized sub (59a4d15)
  - Removes bfi from variable size (227ba9f)
  - Removes implicit sized xor (9b55a34)
  - Removes implicit sized andn (e9a3848)
  - Removes implicit sized or (516f27b)
  - Removes implicit sized lshr (24a9254)
  - Removes implicit sized lshl (8534d3d)
  - Removes implicit sized mul ops (915e520)
  - Removes sext IR helper (a55616e)
  - Removes implicit sized {Create,Extract}ElementPair (6f66153)
  - Convert all Move+Atomic+ALU ops from implicit to explicit size (590345b)
  - Fixes RAValidation for 32-bit applications (547daf8)
  - Implements support for wide scalar shifts (084d102)
  - Allow 128-bit broadcasts in VBroadcastFromMem (425b034)
  - Add VBroadcastFromMem opcode (cfc6368)
  - Adds Option to run the IRDumper with more configurations (ea8fbc6)
- ConstProp
  - Ensure that BFI with constant bitfields can optimize to Andn or Or (a2e5c23)
- InstCountCI
  - Update for previous changes (b801bac)
  - Test rorx at max mask size (e2ac97f)
  - Fix some mislabeled instructions (4443c66)
  - Add newline to end of file (f4f9b20)
  - Support encoding expected Arm64 ASM in JSON (dc0cf98)
  - Update tests from actual ARM64 device (f95ef70)
  - Adds RNG support (dac220a)
  - Disables tests with unsupported configurations (da334fe)
  - Ensure output nasm name doesn't conflict (3c99fb8)
  - Adds primary tables (63f28ea)
  - Adds secondary prefix tables (135b9ac)
  - Adds secondary tables (24d01cd)
  - Adds Primary group tables (103e604)
  - Adds VEX map3 tables (e334278)
  - Adds VEX map group tables (7fe2d3b)
  - Adds FEX map2 tables (871434d)
  - Adds VEX map1 tables (a464798)
  - Adds x87 table (1146c42)
  - Adds H0F38 table (a15934b)
- InstructionCountCI
  - Adds three more instruction tables (6d1fcfc)
- Interpreter
  - Tie SSA data elements to supported vector width (e765bd8)
  - Use alias for temporary vector data (8b9ee99)
- Linux
  - Call exit_group when application tries (0195bb6)
- LogMan
  - Commonise log level to string conversion (b4d1726)
- OpcodeDispatcher
  - Optimize CMC (c27f69d)
  - Fixes NZCV and PF flag compacting (b18592f)
  - Remove final assumptions about small IR operating sizes (8017a91)
  - Cleans up RFLAGS size handling (62fcf6c)
  - Optimize calls with push (7c81a0d)
  - Optimize BLENDV when xmm0 is one of the sources (9fb8c95)
  - Removes erroneous debug log (012750f)
  - Optimize Get{Src,Dst}Size (c1b4c11)
  - Optimizes SSE movmaskps (1d7c280)
  - Optimize PSHUF{LW, HW, D}! (db3dc3e)
  - Optimize 128-bit movmaskpd (bf12f08)
  - Optimize movddup from register (1f7d138)
  - Optimize cvtdq2pd from register source (f36f070)
  - Optimizes movq (30a1a38)
  - Optimize nontemporal moves (631655d)
  - Generate more optimal code for scalar GPR converts (0ef439f)
  - Optimize cvtps2pd (80d871f)
  - Optimize MMX conversion operation (4d58ec1)
  - Optimize addsubp{s,d} using fcadd (200dbdd)
  - Cache named vector constants in the block (df99b7b)
  - Optimize AddSubP{S,D} (ab83ab4)
  - Optimize PMULH{U,}W using new IR operations (66c6f96)
  - Remove redundant moves from PCLMULQDQ and AES operations (1aa2c53)
  - Use new IR ops for pack instructions (ee10153)
  - Optimizes scalar movd/movq (4b06069)
  - Remove redundant moves from remaining AVX ops (b646f4b)
  - Remove redundant moves from VPACKUSOP/VPACKSSOp (a40526a)
  - Remove unnecessary moves from AVX move ops where applicable (86ef6fe)
  - Remove unnecessary moves from AVXExtendVectorElements (6624f50)
  - Remove unnecessary moves in AVXVFCMPOp (cd1f401)
  - Remove redundant moves in AVX blend special cases (76430ba)
  - Optimize MOVHP{S,D} (819fe11)
  - Remove unnecessary moves from AVX inserts (adfd678)
  - Optimize MOVLP{S,D} loads (bb2f710)
  - Remove unnecessary moves from AVX register shifts (5d44a44)
  - Remove redundant moves from AVX immediate shifts (fb65fb2)
  - Remove unnecessary moves from AVX conversion operations (36a5418)
  - Remove unnecessary moves from AVXVariableShiftImpl (ead141f)
  - Remove unnecessary move from VPHMINPOSUW (1414452)
  - Optimize phminposuw (ed7f1b0)
  - Optimize PFNACC (6c7933e)
  - Optimize hsubp (364f084)
  - Remove redundant move from VPSIGN (ffa8f1e)
  - Optimize pmuludq (9df94d8)
  - Remove redundant move in AVXVectorScalarALUOpImpl (71984fc)
  - Remove redundant moves in AVXVectorALUOp (3c88671)
  - Optimize pmaddwd (185e3bf)
  - Optimize phsub (3c49b32)
  - Handle zero immediate shifts better (3a2a576)
  - Optimizes mpsadbw (8ee6262)
  - Optimize SSE/AVX pmaddubsw (34a7fef)
  - Implement support for push IR operation (9e4888c)
  - Improve SHA1MSG1 output (277345d)
  - Minor optimization around clearing flags (3472234)
  - Eliminate redundant moves in {AVX}VectorRound (b973c19)
  - Eliminate unnecessary moves in {AVX}VFCMPOp (4c409ea)
  - Remove unnecessary moves in {AVX}VectorUnaryOp (ac53913)
  - Remove extraneous moves in {V}CVTSD2SS/{V}CVTSS2SD (2224c23)
  - Remove unnecessary moves in {AVX}VectorScalarALUOp (6ce380d)
  - Remove redundant moves from {V}CVTSD2SI/{V}CVTSS2SI (8dade7e)
  - Remove some extraneous MOVs from VMOVSD/VMOVSS (add5bae)
  - Handle broadcasting cases in VPERMQ/VPERMPD (db60a2f)
  - Improve {V}PSRLDQ shift by 0 (f09d9af)
  - Remove unnecessary conditionals in {V}PSLLIOp (461ca6f)
  - Improve VMOVDDUP output (a4a68b4)
  - Improve output of {V}MOVSLDUP/{V}MOVSHDUP (f70b6f3)
  - Remove unused variable in AVXVectorUnaryOpImpl (6580796)
  - Eliminate unnecessary moves in AVXVectorUnaryOpImpl (0e52158)
- Flags
  - Update SHLimm to use Opsize upfront (2d22176)
  - Update ShiftLeft to use Opsize upfront (338cb19)
- SignalDelegator
  - Fix build with telemetry disabled (81046ef)
  - Allow getting the internal configuration (6960fca)
- Syscalls
  - Fix telemetry with exit_group (f0ab960)
- X86Tables
  - Optimize MOVLPD stores (42200bf)
- X8764
  - Ensure frndint uses host rounding mode (7c6660f)
- Misc
  - Defer PF calculation completely (0422895)
  - Remove ABINoPF option (8184c55)
  - Defer second XOR for AF (a8bc6bb)
  - Stop zeroing undefined flags (252ca88)
  - Defer AF extract (486f0ba)
  - Optimize ADD flag calculation (ee8092b)
  - Remove implicit sized IR ops part atomic (bc1e89d)
  - Remove implicit sized IR ops part 1 (5415e4c)
  - IR/Passes/RA: Enable SRA for 32-bit GPRs (6f2b3e7)
  - x86_64/MemoryOps: Fix mislabeled IR op messages (c5f358b)
  - Optimize PSIGN and VBSL (2d7a3a5)
  - Support for Config.json loading on WIN32 (d3f0c7e)
  - Move External/FEXCore/ to FEXCore/ (c9856da)
  - Various warning fixes (9d26af9)