FEX-2311
Read the blog post at FEX-Emu's Site!
Another month gone by and another FEX release out the door! This last month was a bit less busy, as most of our team spent a week in Spain
to take part in XDC 2023! We still had the rest of the month to get some work done though, so let's get
to the changes!
Small bug fixes
This month we fixed a couple of bugs that could have caused spurious crashes! In fact, while testing some upcoming performance optimizations, we fixed
a few unrelated bugs that were crashing Steam periodically! It's always nice to see a bunch of small fixes that just improve the software, even if they
aren't a single big fix.
- Fixes register corruption when jumping out of the JIT
- Fixes a double munmap which would cause spurious pointer unmaps
- Fixes crashes when a program would shut down a thread
- Implements RPRES support and fixes an implementation issue with ARM's new RPRES feature
  - RPRES gives us the ability to do reciprocals in one instruction instead of using ARM's divide instruction.
  - The bug would have caused invalid data to be returned
  - Luckily, no CPU supports this yet
- Fixes an issue with *at syscalls not working with absolute paths
  - This broke Proton's pressure-vessel in weird and unique ways
- Fixes a bug with the named enum argument parser
  - This is used to override CPU features with the FEX_HOSTFEATURES option, so it is typically not hit
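To make the *at syscall item above more concrete: POSIX specifies that when the path given to an *at call is absolute, the directory file descriptor is ignored, so any emulation layer that rewrites or redirects paths has to preserve that behaviour. The sketch below only demonstrates the rule with the host's own `openat`; it is not FEX's code, and `TryOpenAt` is a hypothetical helper name.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: tries to open `path` as a directory relative to
// `dirfd`. Returns 0 on success, or the errno value on failure.
int TryOpenAt(int dirfd, const char* path) {
  const int fd = openat(dirfd, path, O_RDONLY | O_DIRECTORY);
  if (fd < 0) {
    return errno;
  }
  close(fd);
  return 0;
}
```

With an invalid descriptor such as -1, `TryOpenAt(-1, "/")` is expected to succeed because the path is absolute, while `TryOpenAt(-1, ".")` is expected to fail with `EBADF` because a relative path really does need the descriptor.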
32-bit thunking infrastructure
While 32-bit thunking still isn't fully integrated this month, some of the code has been landing to work towards this
goal. In order to do 32-bit thunking the right way, we are spending a bunch of time ensuring that we have a proper data layout analysis system in place
that is based on clang and does a couple of things. This analysis will let us do automatic translation of data structures from 32-bit to 64-bit and
also alert us if something needs to be manually translated. This needs to be in place because otherwise we can end up in a situation where we
unknowingly corrupt data, which would be a nightmare to track down. So this month we gained the ability to annotate our thunk definitions and start having
clang work for us. While not complete, the work so far is enough to get thunking working for 32-bit Super Meat Boy! It's getting there!
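As a rough illustration of why layout analysis matters, consider a struct that contains a pointer and an `unsigned long`. A 32-bit x86 guest lays it out in 12 bytes, while a 64-bit host needs wider fields, so the data has to be repacked rather than passed through. The struct and function names below are hypothetical and hand-written for this sketch; they are not taken from FEX's thunk definitions, which aim to derive this kind of conversion automatically.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical library struct as a 32-bit x86 guest would lay it out.
struct GuestBuffer32 {
  uint32_t data;    // guest pointer: 4 bytes
  uint32_t length;  // guest `unsigned long`: 4 bytes
  int32_t flags;
};
static_assert(sizeof(GuestBuffer32) == 12, "guest layout is 12 bytes");
static_assert(offsetof(GuestBuffer32, flags) == 8, "flags sits at offset 8");

// The same struct as the native host library declares it.
struct HostBuffer {
  void* data;            // pointer-sized on the host
  unsigned long length;  // 8 bytes on 64-bit Linux
  int flags;
};

// Repack field by field: widen the pointer and length, copy the rest.
HostBuffer ToHost(const GuestBuffer32& guest) {
  HostBuffer host{};
  host.data = reinterpret_cast<void*>(static_cast<uintptr_t>(guest.data));
  host.length = guest.length;
  host.flags = guest.flags;
  return host;
}
```

Passing a `GuestBuffer32` directly to a 64-bit library would make it read `length` and `flags` from the wrong offsets, which is exactly the kind of silent corruption the analysis is meant to catch.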
NZCV usage preparation
A big performance improvement that FEX is working on is to use the host CPU's NZCV flags to directly emulate the x86 flags when possible. This is a long and
arduous task, but the performance improvements will be huge once the code lands! A bunch of prep work has landed this month to start down this path, but
we're going to need to let this sit in the oven for a bit longer. Check back next month to see if we get there!
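For a feel of why this mapping is attractive but not trivial, here is a small scalar model of a 32-bit subtraction. ARM's N, Z and V line up with x86's SF, ZF and OF, but after a subtraction ARM's C means "no borrow" while x86's CF means "borrow", so the carry has to be inverted. This is a sketch of the architectural rule only, with hypothetical type and function names; it says nothing about how FEX models flags internally.

```cpp
#include <cstdint>

struct Nzcv {
  bool n, z, c, v;
};

struct X86Flags {
  bool sf, zf, cf, of;
};

// Model of the flags an ARM `subs` produces for 32-bit operands.
Nzcv ArmSubs32(uint32_t a, uint32_t b) {
  const uint32_t result = a - b;
  Nzcv flags{};
  flags.n = (result >> 31) != 0;                    // sign bit of the result
  flags.z = result == 0;                            // result is zero
  flags.c = a >= b;                                 // set when NO borrow occurred
  flags.v = (((a ^ b) & (a ^ result)) >> 31) != 0;  // signed overflow
  return flags;
}

// Translate the ARM flags of a subtraction into the x86 view.
X86Flags ToX86AfterSub(const Nzcv& arm) {
  X86Flags x86{};
  x86.sf = arm.n;
  x86.zf = arm.z;
  x86.cf = !arm.c;  // x86 CF is set when a borrow occurred
  x86.of = arm.v;
  return x86;
}
```

For example, `1 - 2` leaves ARM's C clear, and the translated x86 CF is set, matching what an x86 `sub` would report.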
Minor optimizations
With XDC falling in the middle of the month, most of the bigger work was delayed, so we have a bunch of smaller things this month!
- Minor optimization to bfi/bfxil
  - Removes one or two instructions for some instruction translations
- Optimize atomic fetch operations into plain atomic operations if the result isn't used
  - Removes a couple of instructions if the resulting fetch data isn't used.
- Implements support for ARM's new AFP extension
  - Currently disabled until we can audit the codebase to ensure we aren't corrupting anything
  - Lets us remove an insert after every scalar operation to match SSE behaviour
- Optimize palignr that behaves like a move
  - Compilers shouldn't use this, but now we optimize it to a move
- Optimize pblendw
  - A fairly uncommon instruction, but now its implementation is basically as fast as it can be
- Optimize blendps
  - We had already optimized blendpd last month, so this time was to optimize the 32-bit version
  - Fairly commonly used, so it should improve perf in some games
- Optimize dpps and dppd
  - These instructions do a dot product and a broadcast of their result, but we couldn't find a game using them heavily
  - So while this is now optimal, it is unlikely to affect any real game
- Optimize some 3DNow! instructions
  - 3DNow! is a really old instruction extension that is basically only used in some really old games
  - All of these instruction implementations are basically as fast as we can make them now, which is good!
- Optimize direction flag pointer offset calculation
  - This converts a three instruction calculation down to one and stops using a ternary selection
  - This happens with x86's repeat instructions, which are typically used for memcpy and memset
  - Used a lot, but it is a minimal improvement.
- A few other random bits and bobs!
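To see why palignr can behave like a move, as mentioned in the list above, here is a scalar model of the 128-bit instruction: it concatenates the destination (high half) with the source (low half), shifts the 32-byte value right by the immediate number of bytes, and keeps the low 16 bytes. With an immediate of zero, nothing is shifted and the result is simply the source. `Vec128` and `Palignr` are names made up for this sketch, which models the instruction's semantics rather than FEX's translation of it.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec128 = std::array<uint8_t, 16>;

// Scalar model of 128-bit PALIGNR: result = low 16 bytes of
// ((dst:src) >> (imm * 8)), where src forms the low half.
Vec128 Palignr(const Vec128& dst, const Vec128& src, unsigned imm) {
  Vec128 result{};  // bytes shifted in from beyond the 32-byte value are zero
  for (std::size_t i = 0; i < result.size(); ++i) {
    const std::size_t index = i + imm;  // byte index into the 32-byte concat
    if (index < 16) {
      result[i] = src[index];
    } else if (index < 32) {
      result[i] = dst[index - 16];
    }
  }
  return result;
}
```

An immediate of 0 returns `src` unchanged and an immediate of 16 returns `dst` unchanged, so both cases can be handled without doing any real shuffling.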
AVX optimizations!
While nothing supports our AVX implementation today, we have optimized a smattering of instruction translations so they are ready once hardware
supports what we need.
- 256-bit VExtr, VFCADD, VURAvg, VFDiv, VSMax, VSMin, VUMax, VUMin
- Removes a bunch of truncating moves
  - If we know an AVX instruction is operating at 128-bit width, we can remove a redundant move which speeds things up!
Raw Changes
FEX Release FEX-2311
- ARMEmitter
  - Fix GPR fill mask in FillStaticRegs (3702e51)
- Arm64
  - Minor optimization to bfxil and bfi (8181e53)
- ArmEmitter
  - Adds sized Scalar 1 source and 2 source helpers (2e1389b)
- CPUID
  - Adds some missing cpu core names (ff3f734)
- Config
  - Fixes string enum parser with multiple arguments (bbd20b4)
- External
  - Remove a spurious license (26ee63c)
- FEXCore
  - Removes gdb pause check handler (d4a6b03)
  - Fixes bug in vector ZextAndMaskingElimination pass (a261d99)
  - Removes a warning about assume discarding side-effects (462fff2)
  - Renames raw FLAGS location names to signify they can't be used directly (8dab35c)
  - Implements support for RPRES (b2a8b0c)
  - Support crypto extensions in HostFeatures override (fc70fc3)
- FileLoading
  - Updates helper to load file that is backed by memory (190f7c2)
- IR
  - Changes over to automated IR dispatch generation (6543a80)
- FEXLinuxTests
  - Adds a unittest for eflags and signals around a inlined syscall (b92e716)
  - Compile tests with masm=intel (b45023b)
  - Temporarily limit thunk test execution to 64-bit guests (1ea40ae)
- FEXLoader
  - Query runtime page size (65e8d09)
- GDBServer
  - Preparation work to get this moved to the frontend (e91c5ff)
- GdbServer
  - Fixes returning thread names (1806519)
- IR
  - Print assert code for IR EmitValidation (5431aa5)
  - Optimize unused result atomic fetch mop to just atomic mop (14e5ea1)
  - Adds scalar vector insert operations (6253f4f)
- InstCountCI
  - Update rounds{s,d} classification (a287f2a)
  - Adds two missing variants of movd/movq (8538f5b)
  - Support disabling flagm extensions (6db2125)
  - Adds some multi instruction tests (4834236)
  - Support multiple instructions in the tests (f036a0b)
  - Adds missing atomic tests (a5f82a5)
  - Fixes recursive tests with same filename (9c36d10)
  - Support overriding AFP features (5a3cc7b)
- JIT
  - Implements Print support for vixl sim (5b70209)
- JITArm64
  - Fixes double munmap issue that was causing crashes (8ee5b5c)
  - Fixes bug in rpres scalar operations (8f8f376)
- Linux
  - Fixes issue with *at syscalls with absolute paths not working (cf9c2aa)
  - Fixes warning in 32-bit clock_settime (b4ddf36)
- OpcodeDispatcher
  - Optimize palignr with zero immediate (9612b2f)
  - Optimize pblendw (ef5503f)
  - Optimize blendps (f45722d)
  - Optimize 128-bit DPPS and DPPD (77d9287)
  - Optimize a few 3DNow! operations (4045bfd)
  - Allow garbage in upper bits for more ALU ops (f5822f8)
  - Optimize DF pointer offset calculation (63e4c36)
  - Handle SSE vector moves into themselves a little better (5c93a08)
  - Remove unnecessary 128-bit truncating moves from StoreResult (ef321e4)
  - Put extra LoadSource options in a struct (6d39f36)
  - Remove redundant moves from rorx (efb479f)
  - Optimizes < 32-bit register push (cc558fd)
- TestHarnessRunner
  - Don't hardcode stack allocation to 4096 bytes (0f3d14e)
- Thunks
  - Print error if guest-provided callbacks are called asynchronously (e305a9a)
  - Skip data layout analysis for types that are always assumed compatible (a379d50)
  - Fix function pointer support on 32-bit (39c5ab1)
  - Annotate pointer parameters throughout all thunked libraries (5bf7903)
  - Add new pointer annotations to assist data layout analysis (978f607)
  - Oops deleted an entry point (e0ef32e)
  - Fixes missing vulkan definitions (cb53a70)
- xcb
  - Drop unused and incomplete support for asynchronous callbacks (15c825f)
- VectorOps
  - Handle SVE VExtr a little better (2e69441)
  - Handle SVE VFCADD a little better (1cb8e48)
  - Handle SVE VURAvg a little better (3c5c23b)
  - Handle SVE VFDiv a little better (9379257)
  - Handle SVE VSMax/VSMin and VUMax/VUMin paths a little better (8238de0)
- Misc
  - Preparatory patches for nzcv (5103f2d)
  - Prep commits for NZCV modelling (e018917)
  - (8f246b2)
  - (c612fa8)
- github
  - Enables Vixl simulator on x86 host for instcountci (9ba78c9)
- unittests
  - ASM