Releases: FEX-Emu/FEX
FEX-2311
Read the blog post at FEX-Emu's Site!
Another month gone by and another FEX release out the door! This last month was a bit less busy, as most of our team spent a week in Spain
to take part in XDC 2023! We still had the rest of the month to do some work though, so let's get
to the changes!
Small bug fixes
This month we fixed a couple of bugs which could have caused spurious crashes! In fact, while testing some upcoming performance optimizations, we fixed
a few unrelated bugs that were crashing Steam periodically! Always nice to see a bunch of little fixes that just improve the software, even if they
aren't a single big fix.
- Fix register corruption when jumping out of JIT
- Fixes double munmap which would cause spurious pointer unmaps
- Fixes crashes when a program would shut down a thread
- Implements RPRES support and Fix implementation issue with ARM's new RPRES feature
  - RPRES gives us the ability to do reciprocals in one instruction instead of using ARM's divide instruction.
  - The bug would have caused invalid data to be returned
  - No CPU supports this yet luckily
- Fixed issue with *at syscalls not working with absolute paths
  - Broke Proton's pressure-vessel in weird and unique ways
- Fixes bug with named enum argument parser
  - This is used to override CPU features with the FEX_HOSTFEATURES option so typically not hit
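The *at fix above comes down to a rule of the openat family: when the path is absolute, the directory file descriptor is ignored. Here is a minimal sketch of that rule from the application side, using Python's `os.open` with `dir_fd` (which wraps openat on Linux). The temporary directories and file names are made up for illustration; this is not FEX's syscall handler.

```python
import os
import tempfile

def open_at(dir_fd, path):
    """Open `path` like openat(2): dir_fd only matters for relative paths."""
    return os.open(path, os.O_RDONLY, dir_fd=dir_fd)

def read_and_close(fd):
    try:
        return os.read(fd, 4096)
    finally:
        os.close(fd)

def demo():
    with tempfile.TemporaryDirectory() as base, tempfile.TemporaryDirectory() as other:
        with open(os.path.join(base, "a.txt"), "wb") as f:
            f.write(b"relative")
        absolute = os.path.join(other, "b.txt")
        with open(absolute, "wb") as f:
            f.write(b"absolute")

        dir_fd = os.open(base, os.O_RDONLY | os.O_DIRECTORY)
        try:
            # Relative path: resolved against dir_fd (the `base` directory).
            rel = read_and_close(open_at(dir_fd, "a.txt"))
            # Absolute path: dir_fd must be ignored, even though it points elsewhere.
            ab = read_and_close(open_at(dir_fd, absolute))
        finally:
            os.close(dir_fd)
        return rel, ab
```

An emulator that rewrites guest paths has to keep this behaviour intact, otherwise absolute paths passed to *at syscalls resolve against the wrong directory.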
32-bit thunking infrastructure
32-bit thunking is not yet in place, and this month it still isn't fully integrated, but some of the code has been landing to work towards this
goal. In order to do 32-bit thunking the right way, we are spending a bunch of time ensuring that we have a proper data layout analysis system in place,
based on clang, to do a couple of things. This analysis will let us do automatic translation of data structures from 32-bit to 64-bit and
also alert us if something needs to be manually translated. This needs to be in place because otherwise we can end up in a situation where we
unknowingly corrupt data, which would be a nightmare to find. So this month we now have the ability to annotate our thunk definitions and start having
clang work for us. While not complete, some of this work has already gotten thunking working for 32-bit Super Meat Boy! It's getting there!
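To see why data layout matters here, consider a sketch of a structure holding one integer and one pointer. The structure itself is hypothetical and the layouts are written out by hand with Python's `struct` module; FEX derives this kind of information from clang rather than from anything like this.

```python
import struct

# Hypothetical guest structure: { int32_t id; void *data; }
GUEST32 = struct.Struct("<iI")    # 32-bit guest: 4-byte pointer, no padding
HOST64 = struct.Struct("<i4xQ")   # 64-bit host: 4 padding bytes, 8-byte pointer

def needs_repacking():
    """Layouts with different sizes cannot be passed through unchanged."""
    return GUEST32.size != HOST64.size

def guest_to_host(raw):
    """Repack a guest-layout blob into the host layout."""
    ident, pointer = GUEST32.unpack(raw)
    # The 32-bit pointer is zero-extended into the 64-bit field.
    return HOST64.pack(ident, pointer)
```

Passing the 8-byte guest blob to a 64-bit library unchanged would make it read the pointer from the wrong offset, which is the kind of silent corruption the analysis is meant to catch.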
NZCV usage preparation
A big performance improvement that FEX is working on is to use the CPU's flags to directly emulate the x86 flags when possible. This is a long and
arduous task but the performance improvements will be huge once the code lands! A bunch of prep work this month has landed to start down this path but
we're going to need to let this sit in the oven for a bit longer. Check back next month to see if we get there!
Minor optimizations
With XDC being in the middle of the month, it caused most of the bigger work to be delayed so we have a bunch of smaller things this month!
- Minor optimization to bfi/bfxil
  - Removes one or two instructions for some instruction translations
- Optimize atomic fetch operations into plain atomic operations if the result isn't used
  - Removes a couple of instructions if the resulting fetch data isn't used.
- Implements support for ARM's new AFP extension
  - Currently disabled until we can audit the codebase to ensure we aren't corrupting anything
  - Lets us remove an insert after every scalar operation to match SSE behaviour
- Optimize palignr that behaves like a move
  - Compilers shouldn't use this, but now we optimize it to a move
- Optimize pblendw
  - A fairly uncommon instruction but now its implementation is basically as fast as it can be
- Optimize blendps
  - We had already optimized blendpd last month, so this time was to optimize the 32-bit version
  - Fairly commonly used so should improve perf in some games
- Optimize dpps and dppd
  - These instructions do a dot product and a broadcast of their result but we couldn't find a game using them heavily
  - So while this is now optimal, this is unlikely to affect any real game
- Optimize some 3DNow! instructions
  - 3DNow! is a really old instruction extension that is basically only used in some really old games
  - All of these instruction implementations are basically as fast as we can make them now, which is good!
- Optimize direction flag pointer offset calculation
  - This converts a three instruction calculation down to one and stops using a ternary selection
  - This happens with x86's repeat instructions, which are typically used for memcpy and memset
  - Used a lot but is a minimal improvement.
- A few other random bits and bobs!
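As an illustration of the palignr item in the list above, here is a small model of the 128-bit instruction's semantics. It shows why some immediates reduce to a plain register move. Which immediates FEX actually special-cases is not stated in these notes, so treat `move_source` as an assumption.

```python
def palignr(dest, src, imm):
    """128-bit PALIGNR: concatenate dest:src, shift right by imm bytes, keep the low 16 bytes."""
    assert len(dest) == 16 and len(src) == 16
    # Little-endian byte order: src holds bytes 0..15, dest holds bytes 16..31.
    # The trailing zeros model the zero bytes shifted in from the top.
    combined = src + dest + bytes(16)
    imm = min(imm, 32)  # shifting by 32 or more bytes leaves only zeros
    return combined[imm:imm + 16]

def move_source(imm):
    """Return which operand the result equals when PALIGNR is just a move, else None."""
    if imm == 0:
        return "src"
    if imm == 16:
        return "dest"
    return None
```

With an immediate of 0 the result is exactly the source operand, and with 16 it is exactly the destination, so no byte shuffling is needed in either case.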
AVX optimizations!
While no hardware supports our AVX implementation today, we have optimized a smattering of instruction translations so they are ready once hardware
supports what we need.
- 256-bit VExtr, VFCADD, VURAvg, VFDiv, VSMax, VSMin, VUMax, VUMin
- Removes a bunch of truncating moves
  - If we know an AVX instruction is operating at 128-bit width, we can remove a redundant move which speeds things up!
Raw Changes
FEX Release FEX-2311
- ARMEmitter
  - Fix GPR fill mask in FillStaticRegs (3702e51)
- Arm64
  - Minor optimization to bfxil and bfi (8181e53)
- ArmEmitter
  - Adds sized Scalar 1 source and 2 source helpers (2e1389b)
- CPUID
  - Adds some missing cpu core names (ff3f734)
- Config
  - Fixes string enum parser with multiple arguments (bbd20b4)
- External
  - Remove a spurious license (26ee63c)
- FEXCore
  - Removes gdb pause check handler (d4a6b03)
  - Fixes bug in vector ZextAndMaskingElimination pass (a261d99)
  - Removes a warning about assume discarding side-effects (462fff2)
  - Renames raw FLAGS location names to signify they can't be used directly (8dab35c)
  - Implements support for RPRES (b2a8b0c)
  - Support crypto extensions in HostFeatures override (fc70fc3)
- FileLoading
  - Updates helper to load file that is backed by memory (190f7c2)
- IR
  - Changes over to automated IR dispatch generation (6543a80)
- FEXLinuxTests
  - Adds a unittest for eflags and signals around a inlined syscall (b92e716)
  - Compile tests with masm=intel (b45023b)
  - Temporarily limit thunk test execution to 64-bit guests (1ea40ae)
- FEXLoader
  - Query runtime page size (65e8d09)
- GDBServer
  - Preparation work to get this moved to the frontend (e91c5ff)
- GdbServer
  - Fixes returning thread names (1806519)
- IR
  - Print assert code for IR EmitValidation (5431aa5)
  - Optimize unused result atomic fetch mop to just atomic mop (14e5ea1)
  - Adds scalar vector insert operations (6253f4f)
- InstCountCI
  - Update rounds{s,d} classification (a287f2a)
  - Adds two missing variants of movd/movq (8538f5b)
  - Support disabling flagm extensions (6db2125)
  - Adds some multi instruction tests (4834236)
  - Support multiple instructions in the tests (f036a0b)
  - Adds missing atomic tests (a5f82a5)
  - Fixes recursive tests with same filename (9c36d10)
  - Support overriding AFP features (5a3cc7b)
- JIT
  - Implements Print support for vixl sim (5b70209)
- JITArm64
  - Fixes double munmap issue that was causing crashes (8ee5b5c)
  - Fixes bug in rpres scalar operations (8f8f376)
- Linux
  - Fixes issue with *at syscalls with absolute paths not working (cf9c2aa)
  - Fixes warning in 32-bit clock_settime (https://github.com/FEX-Emu/FEX/commit...
FEX-2310
Read the blog post at FEX-Emu's Site!
Welcome back to another monthly release for FEX-Emu. You might be thinking that after last month's optimizations that we wouldn't have much to show
for this month. Well you would be wrong! We optimized even more! Let's get in to it!
More instruction optimizations!
As stated last month, we introduced Instruction Count CI, which has allowed us to do targeted optimizations of our code. Once again we have optimized so
many instructions that it would be impossible to go through each individual change. Check our detailed change log if you want to see all the
instructions optimized. Let's just look at the final benchmark numbers compared to last month.
<- Geekbench 5 versus last month ->
<- Bytemark versus last month ->
Let's talk about the Geekbench 5.4 results first since they don't look very impressive at first glance. While we are only showing a ~13% performance
improvement, this number is an aggregate of multiple smaller benchmarks. Looking at the breakdown of all the subtests, there are some that have
improved by up to 66%! This is of course because some benchmarks take advantage of instructions that we optimized more heavily than others. Luckily
this improvement also scales to video games.
The Bytemark improvements are a bit hard to make out: some numbers have hardly changed at all while a couple stand out as huge improvements. This
mostly comes down to some very specific instruction optimizations that significantly improved performance in a couple of tests, while the rest don't
show up as much.
With this month's and last month's optimizations combined, the results end up being significantly more interesting. Some Geekbench results are
showing an average of 50% to 65% higher performance, sometimes even higher. Some benchmark results are showing nearly 2x the performance compared to
before! These numbers translate very well to gaming performance, where some games have more than doubled their FPS over the past couple of months.
We're not slowing down either; we still have a ton of optimizations to go on our march to get our emulation close to native performance.
Support preserve_all for interpreter fallbacks
We're calling out this particular optimization for three reasons.
- It improves performance of x87 heavy code
- It only works with the super recently released Clang 17
- wine packages in FEX's rootfs use x87 heavily in some instances.
Let's talk about what this optimization is and how it improves performance. Clang 17 added support for a new function calling ABI called
preserve_all. x86 has supported this ABI for a very long time but it is a new addition for Arm64. This ABI breaks convention from the regular AAPCS64
ABI: if a function needs registers, it must first save pretty much any of them that it uses, unlike AAPCS64 where a bunch of registers are free for
the callee to use. This is beneficial for FEX's JIT since we can save significant time by not saving any state when we need to jump out of the
JIT and execute x87 softfloat code.
In particular this manifests as upwards of a 200% performance improvement in some microbenchmarks around x87 code! While this advantage is quite
significant, the only way to take advantage of it is to compile FEX with Clang 17. Since this compiler release came out only last month, pretty much
no distros have adopted it so it is unlikely to be used soon. In a few months' time, or years depending on the distro, they should naturally upgrade
their compiler stack and the free performance improvement will arrive.
As a fairly major side note to this excursion, FEX has found that the 32-bit wine packages in Canonical's repository use x87
heavily in some instances. This causes some really bad performance issues with some 32-bit games and installers. It is recommended to use Proton where
you can here, since it compiles its 32-bit libraries with SSE optimizations instead, which work significantly better.
FEX-Emu may look to provide its own wine packages in the future with this same optimization in place to help alleviate some of this burden. Until then
it is recommended to use FEX's x87 reduced precision mode to try and alleviate some of the overhead.
Fixes a bug when chrooting in to rootfs
Quite a few months ago, FEX-Emu changed some behaviour around chrooting in to the FEX rootfs.
While chrooting isn't generally advised, it's the only option if a user wants to modify the rootfs. We provide some scripts inside of our
rootfs images to facilitate this, but they have been broken for a few months.
We have now fixed this bug in both FEX-Emu and the scripts inside of our rootfs images. So if you want to modify packages inside of the image you will
now be able to do so again. Make sure to update your image to get the new scripts!
Remove x86-64 JIT and Interpreter
This has been a long time coming in the FEX-Emu project. We have had support for an IR interpreter and x86-64 host JIT for compatibility testing since
the project's inception. It has always been the case that if these CPU backends get in the way of the ARM64 JIT that they would get removed.
That time has finally come. Due to some upcoming changes around how flags are represented in FEX's JIT, and the general burden of implementing
FEX's IR operations three times (often undoing an x86->Arm64 translation to go back to x86), maintaining them has been deemed too much of a burden
and they have been removed. This is a necessary step for our ARM64 JIT to gain the performance improvements that are coming in the next few months!
We are looking forward to future ARM platforms that can take Radeon GPUs through PCIe slots to regain a platform which can test RADV directly, but
until that point we will have to make do with our current devices.
Instruction Count CI on x86-64 hosts
While we removed our x86-64 JIT, we do have a fun addition to our Instruction Count CI. Developers that don't have an Arm64 device handy can now
still run the Instruction Count CI and attempt to optimize implementations. This is as simple as building
FEX on an x86-64 device with the Vixl disassembler and simulator enabled, and you will be able to optimize to your heart's content!
We've got a need for JIT speed! Let's go fast!
Implement first optimizations using 128-bit SVE
This is a fairly minor change but previously FEX was not using any 128-bit SVE instructions. This is primarily because there aren't really any SVE
supporting devices in the consumer market, even though Snapdragon hardware theoretically supports it. 128-bit SVE adds a couple of optimizations that
we can use.
- Wide-element shifts
- Index instruction for generating simple index masks
While these are fairly simple initially, they take some translations from six instructions down to one or two, depending on the case. This is a fairly
minor change, but it is good to note that FEX is now taking advantage of SVE if it is available!
Adds WOW64 frontend
This has been a long time coming, with us adding initial mingw support back in FEX-2305. FEXCore now supports being built with a brand new WOW64 WINE
frontend. While currently not being utilized, this will allow WINE to integrate FEX directly in to its WOW64 layer for running both x86 and x86-64
applications on Arm64 host devices.
This is a very substantial change to how WINE integrates with FEX, since today FEX-Emu just runs the full x86-64 WINE process and eats the overhead of
emulating everything WINE needs to do. With the WOW64 layer now implemented, a bunch of the WINE code can now be Arm64 native code and when it needs
to execute application code it just jumps back to the emulator. This is similar to how Windows natively handles its emulation through its "XTA" layer.
Sadly today this is only wired up to work through the 32-bit x86 part of the layer; we still need to get set up to support Wine when it inevitably
supports Wow64 for x86_64->Arm64.
Big shout out to ByLaws for implementing support for this! We look forward to future Wine integration work landing!
Implement thunking support for wayland-client and zink
We have some improvements to thunking this month! As we are working towards supporting thunking more code, we implemented some features to get
wayland-client thunking wired up. While this support is early, it is enough to get Super Meat Boy up and running using wayland and zink overrides
within a Wayland environment. We look forward to additional thunking improvements going forward so that performance can be improved everywhere.
Raw Changes
FEX Release FEX-2310
- AppConfig
  - Removes Steam config (02da6d6)
- Arm64
  - Fixes inline syscalls (4e9a114)
  - Optimize wide shifts slightly for 64-bit OpSize (f5c4e28)
  - Recover two unused vector vector temporary registers (90f7937)
- ALUOps
  - Remove spills in PEXT (4604c01)
- VectorOps
  - Elide moves where applicable in 128-bit VSQXTUN2 (fd1b639)
  - Improve handling of 128-bit vector VInsElement (950a8db)
  - Elide moves in ASIMD VUShrNI2 if possible (b3269f2)
  - Assert VTMP1 and VTMP2 are sequential in VTBL2 (8168a49)
  - Fix SVE aliasing-path move in VSShr (ffb5876)
- CI
  - Run tests with <30s runtime first (e1eb151)
- CPUID
  - Enabled Enhanced REP MOVSB/STOSB (6fe643d)
- Config
  - Fixes core sanitization (da3e172)
- ConstProp
  - Fixes unscaled signed 9-bit range (72d092e)
- DeadContextStoreElimination
  - Silence unused function warning (773e946)
- ELFCodeLoader
  - Expose FSGSBase in getauxval HWCAP2 (fbc4bda)
- FEX
  - Moves Linux u...
FEX-2309.1
Hotfix patch to fix a bug around accessing files!
FEX-2309
Read the blog post at FEX-Emu's Site!
Last month we hinted that we didn't get all optimizations in that we wanted. There's more of that this month but we have also had an entire month to
push optimizations in. This month was a whirlwind of optimizations improving performance all over the place because of one feature that landed:
Instruction Count continuous integration! Let's dive in to what this is.
Instruction Count CI
This is a major feature that we added last month that doesn't directly affect users, but it is such a huge quality of life improvement for our
developers that we need to discuss what it is. At its core, InstCountCI is a database (actually JSON) of x86 instructions that FEX-Emu supports,
showing how each instruction gets converted to Arm64 instructions. This is in a textual format so that these instruction implementations are easy to
read and quick to update when the implementation changes. This has had a profound effect on our developers, who can't help but look at poor
instruction implementations and find ways to optimize them.
<- Optimized versus non-optimized picture ->
As you can see in the example, one very complex instruction that was not optimal before has now translated in to something much more reasonable.
So far this has nerdsniped at least half a dozen developers in to finding more optimal implementations of these instruction translations!
Some design considerations must be understood when looking at FEX's instruction implementations, though. The most important thing to remember
is that these implementations are looking at the instruction in a vacuum. These are translated as only single instruction entities, so any sort of
multi-instruction optimization is not going to be visible in this CI system. Additionally this isn't getting run on hardware in our CI, so
implementations that are close on instruction count may have wildly different performance characteristics depending on the hardware. So while it is a
good guide for getting eyes on the assembly, there still needs to be some knowledge as to what the translation is doing to ensure it's both fast and
correct.
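To make the idea concrete, here is a minimal sketch of what an instruction-count check could look like. The JSON field names (`expected_count`, `expected_asm`) and the Arm64 listings are invented for illustration and are not FEX's actual schema or output.

```python
import json

# Hypothetical database: x86 instruction -> expected Arm64 translation.
DATABASE = json.loads("""
{
  "pblendw xmm0, xmm1, 0xff": {
    "expected_count": 1,
    "expected_asm": ["mov v16.16b, v17.16b"]
  },
  "xor eax, eax": {
    "expected_count": 2,
    "expected_asm": ["eor w4, w4, w4", "mov w26, #0x1"]
  }
}
""")

def compare(database, generated):
    """Report instructions whose generated Arm64 differs from the recorded listing."""
    changes = {}
    for instruction, entry in database.items():
        asm = generated.get(instruction)
        if asm is None:
            changes[instruction] = "missing"
        elif asm != entry["expected_asm"]:
            delta = len(asm) - entry["expected_count"]
            changes[instruction] = f"changed ({delta:+d} instructions)"
    return changes
```

Because each entry is checked in isolation, a tool like this only sees single-instruction translations, which matches the limitation described above.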
This CI system was used heavily this last month for what our next topic is.
Optimization Extravaganza!
With InstCountCI in place, we can now quantify optimizations going in to the FEX CPU JIT without accidentally compromising performance of other
instructions. With this in place we have had an absolute ton of CPU optimizations land in our JIT, enough that if we went through them all it would
take longer than all of our previous progress reports!
Instead of going through each individual change, let's just discuss the main optimizations that have landed. The bulk of the optimizations have
been making sure the translation from SSE instructions to Arm64's ASIMD instructions is more optimal. This is because reasoning about vector
optimizations is easier in this instance, and also because games more heavily abuse vector instructions than regular desktop applications. There were
other optimizations as well, like some flag generation instructions becoming more optimal and eliminating redundant move instructions!
Let's take a look at the bytemark results.
<- Bytemark graphs ->
There's some surprising uplift in numbers here! Even more so since bytemark shouldn't heavily utilize SSE instructions, so this is more just coming
from general optimizations that occurred. Let's take a look at another benchmark for fun.
<- Geekbench 5.4.0 graph ->
Whoa, that is a surprising uplift in one month! Geekbench actually has some benchmarks that use vector operations, so they can get more improvement
than expected. We should expect even more performance once we start optimizing more non-vector instruction translations!
As for gaming benchmarks, we're not going to do any in this blog post, but we have been told that due to various optimizations this month Portal
performance has gained 30% and Oblivion has gained 50%. Big improvements towards making games feel better when playing them. The main concern here is
that the Adreno 690 in our Lenovo X13s test systems is actually quite unstable during testing, so finding suitable games that are CPU bound without
crashing the kernel driver is surprisingly difficult. Most of the lighter games that don't crash the MSM kernel driver are already running at
hundreds of FPS anyway, so they aren't interesting.
A fun quirk of optimizing vector operations this month: we have finally landed our first optimizations that use ARM's SVE instruction set when
operating at 128-bit width. Turns out there are a few optimizations that can be done here aside from implementing AVX with the 256-bit version! I'm
sure we will see more of these as we continue optimizing.
Remove most implicit sized IR operations
Continuing from the last topic, this is one of the main changes that allows us to start working on non-vector instruction optimizations. FEX's IR
around general purpose ALU operations has a history of using implicit sized IR operations. This means we would check the size of the incoming data
sources and make an assumption for what the operating size of the whole thing should be. While this worked, it has been an absolute thorn in our side
for years at this point. Any time we would make a seemingly innocuous change it would subtly change the behaviour of some IR operations as a new size
propagates through the stack. Now that all of these operations explicitly state their operating size at generation time there is less room for error.
This follows how our vector operations have always worked: they were explicitly sized from the start and have had significantly fewer issues over
time.
With this change in place we can start optimizing general purpose ALU operations with less worry about breaking the world.
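The hazard with implicit sizing can be shown with a toy model. The inference rule used here (take the largest source size) is an assumption made for illustration; the notes don't spell out how FEX's IR inferred sizes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Value:
    bits: int  # payload
    size: int  # size in bytes

def mask(size):
    return (1 << (size * 8)) - 1

def add_implicit(a, b):
    """Operating size is guessed from the sources (assumed rule: the largest one)."""
    size = max(a.size, b.size)
    return Value((a.bits + b.bits) & mask(size), size)

def add_explicit(size, a, b):
    """Operating size is stated by whoever generates the operation."""
    return Value((a.bits + b.bits) & mask(size), size)
```

With two 4-byte sources, `0xFFFFFFFF + 1` wraps to zero either way. If an unrelated change upstream makes one source 8 bytes wide, the implicit version silently becomes a 64-bit add and no longer wraps, while the explicit version keeps its 32-bit behaviour.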
Mingw work
Some more work landed this month towards getting WINE WOW64 support wired up: adding a toolchain file to help
facilitate cross compiling, stopping the saving and restoring of the x18 platform register, and various other things. While full support isn't yet
merged, there's a lot of preliminary work landing so we can support this. While this work is very early, it is already showing significant
performance improvements for Windows native games. A game like Bioshock Infinite is already running faster than with FEX emulating x86 WINE fully!
Look forward to future improvements and integrations as this gets wired up!
Raw Changes
FEX Release FEX-2309
- ARM64
  - Optimize vector zeroing (eaed5c4)
- ARMEmitter
  - Handle SVE load and broadcast quadword groups (a9dea29)
  - Handle SVE load and broadcast element group (710a392)
  - Handle load/store multiple structures (scalar plus scalar) groups (eda67eb)
  - Handle SVE ADR (139dd4c)
  - Handle SVE CPY (immediate) (72357e5)
  - Migrate off vixl float utils (5a0a6dd)
  - Handle SVE FCPY (predicated) (8fce133)
  - Remove resolved TODO comment (5de7eee)
  - Handle contiguous first fault load (scalar plus scalar) group (0109e88)
  - Handle SVE FP multiply-add long groups (6f4a23d)
- Arm64
  - Only allocate vixl::Decoder if enabled (2d78b1f)
  - Optimize AES operations by caching a zero register (7f99738)
  - Optimize AESKeyGenAssist (02b891c)
  - Optimize VFMin/VFMax (0819338)
  - Optimize SVE VInsElement (1f2c5fc)
  - Stop abusing orr in LoadConstant (6d562f8)
  - Optimize non-optimal BFI move case (1029bb1)
  - Optimize CacheLine{Clear,Clean} (1343c14)
  - Adds stats to the disassembly (53ac8ab)
  - Implement first SVE-128bit optimization (fe35135)
  - Remove erroneous LoadConstant (c4c7620)
- ConversionOps
  - Remove redundant moves in AdvSIMD VInsGPR (6e4765d)
  - Add missing half-precision conversions to scalar functions (172c8f3)
  - Add scalar support to Vector_FToI (a62ba75)
- EncryptionOps
  - Use MOVI reg, #0 to zero vectors (4286d44)
- VectorOps
  - Remove redundant moves in SVE VExt...
FEX-2308
Read the blog post at FEX-Emu's Site!
Whoa jeez, another month already? We've had our heads down working hard this last month, trying to make FEX-Emu the greatest x86/x86-64 emulator on
Linux. A huge focus this month is optimizations because of course what we want is to go fast. We're all cats and we've got the zoomies.
Every day we're optimizing
As said before, this month has been an absolute mess of optimizations as we've been optimizing the project as thoroughly as possible. We could spend
another month talking about the optimizations that we did this last month, so let's blast through what we did. First let's show a graph for how much
FEX has improved over this last month.
Look at those numbers! Some benchmarks from bytemark have cracked the 200% mark! While a couple benchmarks do have regressions, we're pretty sure that
we know what they are and they will be rectified soon. These are the sorts of optimizations that can be felt in real games though.
So let's quickly run through some of the optimizations from this last month.
Switch to using half-barriers for memory accesses
When ARM hits an unaligned atomic memory access, we previously wrapped that load or store in two slow barrier instructions. We can now safely only use
one barrier on one half of the instruction! This makes unaligned accesses quite a bit quicker.
Optimize x87 memory accesses
This removes a couple instructions when we access 80-bit floats.
Only clear icache for code
Some large code blocks can generate a decent amount of metadata that doesn't need an icache clear. This can remove a bit of stutter.
Const prop BFI operation
Sometimes when a BFI instruction has constants in it, we can remove the BFI instruction
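For reference, here is a small model of Arm64's BFI (bitfield insert) and the simplest constant case, where both inputs are known and the whole operation folds to a constant. The tuple-based IR is invented for this sketch, and which constant cases FEX actually handles is not stated here.

```python
def bfi(dest, src, lsb, width, reg_bits=64):
    """BFI: copy the low `width` bits of src into dest starting at bit `lsb`."""
    field = (1 << width) - 1
    full = (1 << reg_bits) - 1
    return ((dest & ~(field << lsb)) | ((src & field) << lsb)) & full

def fold_bfi(node):
    """Replace a BFI whose operands are both constants with the computed constant."""
    if node[0] == "bfi":
        _, dest, src, lsb, width = node
        if dest[0] == "const" and src[0] == "const":
            return ("const", bfi(dest[1], src[1], lsb, width))
    return node  # anything else is left untouched
```

For example, `fold_bfi(("bfi", ("const", 0), ("const", 0x3), 4, 2))` yields `("const", 0x30)`, so no BFI needs to be emitted at all.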
Optimize vector TSO loadstores
Vector operations typically need an additional add on their address if the offset can't fit in the instruction encoding's immediate offset. We missed
the optimization for the case in which the immediate offset CAN actually fit. This commonly removes an instruction per vector loadstore.
Use TST instead of CMN
Sometimes these instructions hit a slow path on Cortex-A57 so a minor win there.
Optimize xor reg, reg
x86's universally agreed upon instruction for generating zero in a register is xor. This instruction isn't actually optimal in ARM hardware. We now
emit a move of constant zero which gets optimized to register rename on most ARM hardware.
More instructions optimized
These mostly just make the implementations use fewer instructions, which makes them faster. There will be way more of this in the coming month.
- rotate flag calculations
- phsubsw/phaddsw
- cmpxchg8b/16b
- psad*
- 8-bit, 16-bit rcr
- fcmov
- shld/shrd
- movss
- maskmovdqu
- maskmovq
- phminposuw
- fild
- PF flag calculation optimization
- Optimizing packing RFLAGS
- Optimize ADD/ADC OF flag packing
Fixes bug in SSE4.2 pcmpestri
This was causing Java applications to crash. Now that we fixed a different bug last month, we now have Java working to an extent. It still crashes on
shutdown which is interesting and not all games are expected to work. But good luck testing random Java games!
Pack NZCV flags
This is the first step towards FEX generating x86 flags in a more optimal way. These flags match the ARM flags fairly closely and can be emulated in a
more optimal way if we pack them together. This is likely what causes the regression in bytemark, but since this is an intermediate step it is
expected to go away with the next optimization after this. Look forward to future optimizations that make this faster!
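The correspondence between the two flag sets can be sketched as follows: x86's SF, ZF and OF line up with ARM's N, Z and V, while the carry flag has the opposite meaning after a subtraction (x86 sets CF on borrow, ARM sets C when there is no borrow). PF and AF have no ARM counterpart and are left out. This is a model of the architectural behaviour, not of FEX's code.

```python
def add_nzcv(a, b, bits=32):
    """ARM ADDS on unsigned inputs in [0, 2**bits): returns (result, flags)."""
    mask, sign = (1 << bits) - 1, 1 << (bits - 1)
    result = (a + b) & mask
    flags = {
        "N": bool(result & sign),
        "Z": result == 0,
        "C": (a + b) > mask,                           # carry out
        "V": bool(~(a ^ b) & (a ^ result) & sign),     # signed overflow
    }
    return result, flags

def sub_nzcv(a, b, bits=32):
    """ARM SUBS: C is set when NO borrow occurred."""
    mask, sign = (1 << bits) - 1, 1 << (bits - 1)
    result = (a - b) & mask
    flags = {
        "N": bool(result & sign),
        "Z": result == 0,
        "C": a >= b,
        "V": bool((a ^ b) & (a ^ result) & sign),
    }
    return result, flags

def x86_flags(nzcv, is_sub):
    """Derive the matching x86 flags from ARM's NZCV."""
    carry = (not nzcv["C"]) if is_sub else nzcv["C"]
    return {"SF": nzcv["N"], "ZF": nzcv["Z"], "CF": carry, "OF": nzcv["V"]}
```

Keeping the four flags packed together in ARM's layout means the host's own flag-setting instructions can produce most of the x86 result directly, with only the carry inversion to account for on subtracts.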
Remove weak symbol declarations in thunks
A bug that cropped up in thunks has been a crash that occurs when trying to use thunks from Ubuntu's PPA system. This has been a major thorn in FEX's
side for months because once you rebuild the project locally, it would never reproduce. The problem stems from the fact that clang would decide that
it can inline a "weak" symbol if its implementation is visible. This would only occur on Canonical's ARM builders, potentially due to whatever device
they use to compile the code on. This would cause our thunks to crash almost immediately if a user tried them from the PPA system. We have now worked
around this clang quirk and this will now fix thunks when enabled from the PPA system.
Mingw build work
As part of FEX's effort towards supporting running as a WINE dll, we have been slowly adding support for compiling FEXCore as a Windows DLL.
This month we have removed a bunch of Linux assumptions and API usages from FEXCore and moved it to the frontend FEXInterpreter application. In doing
so, FEXCore can now be compiled using llvm-mingw as a WINE specific DLL. This is completely unusable for users today but sets the groundwork towards
what will eventually become a WoW64 integration in the future. We have also added mingw building of FEXCore to our CI so we ensure it doesn't get
broken.
To be clear, even though this work allows us to compile as a Windows DLL, this doesn't allow us to run under Windows. FEX still does a bunch of things
that are Linux specific inside of the code.
ARMEmitter cleanups
Another improvement that doesn't affect our users but is good to shout out for our developers. @Lioncache has spent a good amount of time this last
month adding missing instructions and aliases to our AArch64 code emitter. While our code emitter covers a
decent amount of the AArch64 instruction space, it takes time to ensure full coverage. Whenever we're writing code for our JIT and an instruction is
missing, it slows down whatever we are working on. So kudos for improving our coverage because it makes everyone's lives easier.
Implement missing accept4, recvmmsg, sendmmsg for 32-bit socketcall
In a recent Steam client update, it started using accept4 for some background task. This would cause it to spam a bunch of logs when failing to
accept some connection. With a simple fix for a few missing system calls, Steam is no longer complaining loudly.
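On 32-bit x86 Linux, socket operations can go through the single socketcall multiplexer, which takes a call number and a pointer to an array of 32-bit arguments. Here is a rough model of decoding the three newly handled calls. The call numbers are the ones from Linux's `linux/net.h`; the flat byte buffer standing in for guest memory and the function names are assumptions for illustration.

```python
import struct

# socketcall numbers from Linux's include/uapi/linux/net.h
SYS_ACCEPT4 = 18
SYS_RECVMMSG = 19
SYS_SENDMMSG = 20

# Name and argument count of each call handled by this sketch.
CALLS = {
    SYS_ACCEPT4: ("accept4", 4),    # fd, addr, addrlen, flags
    SYS_RECVMMSG: ("recvmmsg", 5),  # fd, msgvec, vlen, flags, timeout
    SYS_SENDMMSG: ("sendmmsg", 4),  # fd, msgvec, vlen, flags
}

def decode_socketcall(call, guest_memory, args_addr):
    """Read the 32-bit argument array for `call`; return None if it is not handled."""
    entry = CALLS.get(call)
    if entry is None:
        return None
    name, count = entry
    args = struct.unpack_from(f"<{count}I", guest_memory, args_addr)
    return name, args
```

A call number missing from the table is exactly the situation Steam hit: the request cannot be forwarded, so the application sees a failure.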
Fix variadic packing in X11 thunking
X11 thunking has been broken under WINE for all of FEX's history without any indication as to why. We never had time to look in to this but this last
month we finally hit a game that crashed, which made this easier to debug. This bug occurred because WINE is one of the few applications that pass
more than seven arguments through a few variadic API calls. This triggered a bug in FEX's variadic repacking code once we started packing the
arguments on to the stack. With this fixed, WINE X11 thunking now works in significantly more games. This means that both OpenGL and Vulkan
applications can be thunked under WINE.
Fixes dead context store elimination pass
This optimization pass removes redundant stores to FEX's CPU context state. While this usually doesn't save much, it can improve performance for some
edge cases in FEX's JIT. While this is a performance optimization, it likely won't affect many things.
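The idea behind such a pass can be sketched on a single basic block: a store to a context slot is dead if a later store overwrites the same slot with no load in between. The tuple-based operation format is made up for this sketch, and it assumes every access covers a whole slot (no partial overlaps) and that nothing in the block can fault or call out.

```python
def eliminate_dead_stores(block):
    """Remove context stores that are overwritten before they are ever read."""
    overwritten = set()  # slots stored later with no load in between
    kept = []
    for op in reversed(block):
        kind = op[0]
        if kind == "store":
            slot = op[1]
            if slot in overwritten:
                continue  # a later store wins and nobody reads this one
            overwritten.add(slot)
        elif kind == "load":
            # The slot is read here, so the store before it must stay.
            overwritten.discard(op[1])
        kept.append(op)
    kept.reverse()
    return kept
```

Walking the block backwards means each store only needs to know what happens after it, and the final store to every slot is always kept because the context is observable once the block ends.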
Fix 16-bit POPA instruction
This instruction was accidentally zero extending the 16-bit value in to the 32-bit register. We now insert the 16-bits as expected. This fixes an
issue with OpenAL in some cases.
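The difference between the two behaviours is easy to show. On x86, writing a 16-bit register leaves the upper bits of the containing 32-bit register untouched. The function names below are illustrative only.

```python
def write16_zero_extend(reg32, value16):
    """The buggy behaviour: the upper 16 bits of the register are lost."""
    return value16 & 0xFFFF

def write16_insert(reg32, value16):
    """The expected behaviour: only the low 16 bits are replaced."""
    return (reg32 & 0xFFFF0000) | (value16 & 0xFFFF)
```

With a register holding `0x12345678` and a popped value of `0xABCD`, the insert gives `0x1234ABCD` while the zero extension gives `0x0000ABCD`.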
Raw Changes
- ARMEmitter
  - Add missing atomic aliases (68cb6e6)
  - Add cinc/cinv/csetm aliases (30ab4d3)
  - Add ngc/ngcs aliases (eebcbfd)
  - Add bfc/bfxil aliases (d2bca9b)
  - Add sbfiz/tst/ubfiz aliases (4681061)
  - Finish off remaining SVE Integer Wide Immediate - Unpredicated categories (2fc6542)
  - Implement cmn alias (1ce0ea8)
- Arm64
  - Switch to using half barriers (94273fb)
  - Fixes LR corruption in 128-bit divides (5821175)
  - Optimize {Load,Store}ContextIndexed address generation (536b2ed)
  - Only clear icache for code (0674dfa)
  - Emitter: Handle LD1{}/LDFF1{} Vector + Immediate encodings (421214e)
- Emitter
  - Add remaining missing SVE predicate range assertions (64c7243)
  - Deduplicate some more SVE implementations pt. 2 (0a1820d)
  - Deduplicate some more SVE implementations (9c175da)
  - Reorganize some base opcode and assert locations (0be68a5)
  - Simplify SVE immediate shift helper (e689c6f)
  - Collapse encoding cases for indexed dup (d6697fc)
  - Handle SVE FP convert precision group (e633ef7)
  - Handle SVE FP arithmetic with immediate (predicated) group (754bc18)
  - Handle SVE XAR (842b71c)
  - Add helper...
FEX-2307
Read the blog post at FEX-Emu's Site!
This release we had a bit of a slower month as some larger pieces were being worked on, but we still have some good stuff that is worth talking about.
Implement per-instruction RIP reconstruction
This was a fairly curious bug that FEX encountered. When trying to run the game Ultimate Chicken Horse, the game would crash very early in its startup.
While investigating the game we determined that this was one of the first games we tested that uses what we believe is Unity Engine's AOT scripting
reflection mechanism. This codepath seemingly relies heavily on either tagged pointers or some other
mechanism that causes a SIGSEGV when accessing it the first time. After that point the Unity AOT will catch the SIGSEGV and, depending on the RIP of
the instruction, it will change behaviour. One of the problems with FEX is that on synchronous faults like SIGSEGV, we don't yet support full state
reconstruction. Since it seems like this only relies on RIP being correct, we can fairly easily wire this up and get Ultimate Chicken Horse running!
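One common way to recover a guest RIP from a faulting host address is a sorted table that records where the host code for each guest instruction begins. The sketch below shows that general idea only; it is an assumption about the technique, not a description of FEX's data structures, and the class name and addresses are invented for the example.

```python
import bisect


class RipMap:
    """Maps an offset inside a JIT block back to the guest instruction that produced it."""

    def __init__(self, entries):
        # entries: (host_offset, guest_rip) pairs, where host_offset is the first
        # byte of host code emitted for the guest instruction at guest_rip.
        self._entries = sorted(entries)
        self._offsets = [offset for offset, _ in self._entries]

    def guest_rip_for(self, host_offset: int) -> int:
        # Find the last entry whose host code starts at or before the faulting offset.
        index = bisect.bisect_right(self._offsets, host_offset) - 1
        if index < 0:
            raise ValueError("host offset precedes the start of the block")
        return self._entries[index][1]


block = RipMap([(0, 0x401000), (12, 0x401003), (20, 0x401008)])
print(hex(block.guest_rip_for(15)))
```

A signal handler that knows the faulting host PC and the start of the block could use a lookup like this to report the guest RIP the application expects.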
AVX work completed!
This last month FEX has done the last remaining work to implement AVX. This month the remaining SSE4.2 instructions were finished,
and the prerequisite XSAVE and XRSTOR instructions were implemented. Although the feature is effectively complete, we aren't enabling the
CPUID bit yet. We want to first investigate a potential crash that has cropped up in Java games due to the extension, and additionally we want
to finish up AVX2 work and enable them both in one step! Next month is looking to be the first version with AVX and AVX2 support in the source.
Fix 32-bit robust futex fetching
This issue has been a thorn in our sides for quite a while now. Usually this only ever manifested as an issue if the user was running Steam using
FEX's official PPA binaries in their setup. Once the user tried running Steam, it would
crash with a really obscure message about "Fatal error: futex robust_list not initialized by pthreads." This was something that would then never
reproduce if the code was rebuilt locally.
With a bit of poking around and using a local pbuilder version of FEX we were finally able to reproduce the error. Turns out FEX was writing a 64-bit
pointer back in to the result when the application tried querying the robust list pointer, overwriting part of the stack and corrupting its data.
This falls under one of the circumstances of "How did this ever work!?" but now with it resolved, theoretically Steam should finally work for our
users that are using the PPA build of FEX. Enjoy~!
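To see why a 64-bit store is harmful here, consider a 32-bit guest that reserves only four bytes for the returned pointer. The sketch below models a few bytes of guest stack with a bytearray; the addresses and values are invented, and the helper names are hypothetical.

```python
import struct


def store_head_64(stack: bytearray, addr: int, head: int) -> None:
    """The buggy store: writes 8 bytes even though the guest slot is 4 bytes wide."""
    struct.pack_into("<Q", stack, addr, head)


def store_head_32(stack: bytearray, addr: int, head: int) -> None:
    """The corrected store: writes exactly the 4 bytes a 32-bit guest expects."""
    struct.pack_into("<I", stack, addr, head & 0xFFFFFFFF)


def demo(store):
    stack = bytearray(8)
    struct.pack_into("<I", stack, 4, 0xCAFEBABE)   # unrelated neighbouring stack value
    store(stack, 0, 0x0804A000)                    # pointer result written at offset 0
    return struct.unpack_from("<II", stack, 0)     # (result slot, neighbour)


print([hex(v) for v in demo(store_head_64)])
print([hex(v) for v in demo(store_head_32)])
```

With the 64-bit store the neighbouring value is overwritten by the upper half of the pointer, which is the kind of stack corruption described above.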
Fix application hangs due to mutexes being locked on forks
This has been a very spicy bug that has been haunting FEX for years at this point. Whenever a modern application wants to execute a process it
will use a combination of fork and execve. Fork might end up being a vfork, or might end up being a clone syscall that does the same thing. Regardless,
fork when executing in a threaded environment has some very strict requirements: it basically can only do an execve afterwards. vfork even adds an
additional restriction that it can't corrupt the stack at all because it's sharing memory space.
The problem with this approach is that even if the application is only ever going to call execve after the fact, FEX needs to do a bunch of
bookkeeping or additional JIT emission and execution. This causes the problem that FEX's mutexes might end up being in an unknown state going in to a
fork, which will cause this new child process to hang indefinitely on the mutex.
To work around this issue, FEX will now globally lock all mutexes that matter, do the fork, and then immediately unlock the mutexes on the parent
side. On the child process FEX needs to be a bit mean to these mutexes, resetting them to zero to ensure no thread is holding the mutex. While this is
fairly heavy-handed this dramatically reduces how frequently FEX hangs when fork is used.
Specifically Steam tended to launch a bunch of background processes which would hang indefinitely, causing Proton games or downloads to never
continue. This should pretty much entirely be fixed!
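The lock, fork, then unlock-or-reset pattern can be sketched as follows. This is a minimal illustration of the approach described above using Python's threading locks on Linux, not FEX's implementation; ResettableLock and the function names are hypothetical.

```python
import os
import threading


class ResettableLock:
    """A lock that a forked child can forcibly return to the unlocked state."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self):
        self._lock.acquire()

    def release(self):
        self._lock.release()

    def try_acquire(self) -> bool:
        return self._lock.acquire(blocking=False)

    def reset(self):
        # Heavy-handed: discard the old state and start from an unlocked lock.
        self._lock = threading.Lock()


def fork_with_locks_held(locks):
    for lock in locks:          # take every lock so no other thread is mid-update
        lock.acquire()
    pid = os.fork()
    if pid == 0:
        for lock in locks:      # child: nobody is left to unlock these, so reset them
            lock.reset()
    else:
        for lock in locks:      # parent: unlock immediately and carry on
            lock.release()
    return pid


def run_demo() -> bool:
    locks = [ResettableLock(), ResettableLock()]
    pid = fork_with_locks_held(locks)
    if pid == 0:
        ok = False
        try:
            ok = all(lock.try_acquire() for lock in locks)
        finally:
            os._exit(0 if ok else 1)   # never return into the caller from the child
    _, status = os.waitpid(pid, 0)
    parent_ok = all(lock.try_acquire() for lock in locks)
    return os.WEXITSTATUS(status) == 0 and parent_ok


print(run_demo())
```

Both sides end up with usable locks: the parent because it released them, the child because it reset them.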
Stop using faccessat2 to emulate faccessat
This was an oops on our part. faccessat2 was added in Linux kernel 5.8, so if your device was running an older kernel this syscall would /never/ work.
We didn't notice this since most of our devices are running a new enough kernel that faccessat2 just worked.
Thanks to the user that found this problem!
Handle xattr syscalls with overlayfs rootfs
Turns out that FEX had missed the various syscalls that access files to get xattr information. This was causing weird failures where some applications
would say that a file doesn't exist purely because it was in the rootfs overlay only.
Sadly Linux doesn't support *at variants of these syscalls so they aren't quite as fast as native execution, but that's fine.
Fix conflicting ARM64 register allocation
A couple months ago we added one more register to our register allocation for slightly more optimal register allocation. This broke a game called
Osmos under FEX. This is purely a bug but in resolving it, we likely fixed crashes in various
applications that we didn't notice before. Oops!
RootFS additions
This month we have a couple of new rootfs images on our server that have been hotly requested! We now have an Arch Linux rootfs image and a Fedora 38
rootfs image. These haven't been as thoroughly tested as our Ubuntu images so if you find any problems with them, make sure to let us know on our
Discord.
Raw Changes
- Arm64
- Fixes paranoidtso option for CPUs that support LRCPC/2 (e5189d6)
- Fixes GPR pair allocation to get one pair back (16f7002)
- Fixes register pair conflict. (7c47296)
- Context
- Removes dead AddVirtualMemoryMapping function (c3e123d)
- Emitter
- Adds support for CSSC (8047007)
- External
- Update jemalloc trees (f872199)
- jemalloc
- Updates external jemallocs (1a4d5a1)
- Externals
- Update fmt to 10.0.0 (7ee6fc0)
- FEXServerClient
- Ensure server socket is created with SOCK_CLOEXEC (e86a792)
- FHU
- Workaround libstdc++ version 13+ bug (8a4c5bc)
- IR
- Move VPCMPESTRX REX handling to OpcodeDispatcher (f39163b)
- Pad IROp_Header to be 32-bit in width (fe06f1b)
- JIT
- Implement support for per-instruction RIP reconstruction (9dcc1de)
- Linux
- Fixes hangs due to mutexes locked while fork happens. (e72fa02)
- Handle xattr syscalls with emulated paths. (5a53931)
- Stop using faccessat2 for faccessat emulation (f444b03)
- Remove warning that isn't necessary anymore (cac7985)
- Optimize CalculateHostKernelVersion (1506a19)
- OpcodeDispatcher
- Ensure MXCSR is saved/restored with FXSAVE/FXRSTOR (66d4206)
- Handle XSAVE/XRSTOR (a082161)
- Scripts
- Disable using catchsegv if it doesn't exist (d6c9b54)
- VectorFallbacks
- Fix PCMPSTR fallback ZF/SF flag setting (9b5e1c4)
- Misc
- Some small fixes for android building (d2032da)
- Move config layers to the frontend (2997257)
- unittests
- Add include search path for asm tests (cb8bf1a)
- x32
- Thread
- Fixes robust futex fetching (e652399)
FEX-2306
Read the blog post at FEX-Emu's Site!
Another interesting month of changes for FEX-Emu! While this release is shorter than last, this also only has a month of work rather than two. We had
some great work done this month, including a bunch of plumbing that most people won't notice. Let's see what changed!
Adds support for hardware TSO memory emulation prctl
Emulating the x86 memory model is the number one thing that slows down FEX emulation today. Apple Silicon supports this memory model in hardware which
is why Rosetta on macOS can get amazing performance. With some recent changes from the
Asahi developers, FEX can ask the hardware to enable the TSO emulation bit. If the kernel reports back that hardware TSO memory is enabled, FEX can
take a more lax approach to its memory emulation, getting an automatic speedup for Asahi systems.
Additionally, this is not only a speed-up; it's required for correct emulation. When this feature isn't supported by the hardware, FEX needs to emulate
the memory model using atomics and LRCPC instructions. This absolutely demolishes performance, so it is usually recommended to disable the emulation to
get "free" performance. The issue with this is that it can cause instability in the most awkward and peculiar of ways. We even found out in this last
month that Unity games with their complex buffer management are highly likely to crash due to old cachelines of data hanging around. The only fix is
to emulate the memory model using hardware TSO flags or our atomic/LRCPC path. Sadly, ARM Cortex's hardware LRCPC implementation is barely any
faster than atomics.
Steamwebhelper crash fix
New beta versions of Steam have started relying on AT_EXECFN existing. FEX didn't previously emulate this auxv value, which was causing it to crash. With this fixed, steamwebhelper is now working again.
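For context, the auxiliary vector is a list of (type, value) pairs terminated by AT_NULL, and AT_EXECFN (type 31) carries a pointer to the pathname used to execute the program. The sketch below parses a synthetic 64-bit auxv blob; the blob contents are made up and parse_auxv is a hypothetical helper, not FEX code.

```python
import struct

AT_NULL = 0      # terminates the vector
AT_PAGESZ = 6    # system page size
AT_EXECFN = 31   # pointer to the pathname used to execute the program


def parse_auxv(blob: bytes) -> dict:
    """Parse a 64-bit little-endian auxv blob into {type: value}."""
    entries = {}
    for a_type, a_val in struct.iter_unpack("<QQ", blob):
        if a_type == AT_NULL:
            break
        entries[a_type] = a_val
    return entries


# Synthetic example data: a page size entry, an AT_EXECFN entry, and the terminator.
blob = struct.pack("<QQQQQQ", AT_PAGESZ, 4096, AT_EXECFN, 0x7FFD1000, AT_NULL, 0)
auxv = parse_auxv(blob)
print(AT_EXECFN in auxv, hex(auxv[AT_EXECFN]))
```

A program that looks up AT_EXECFN and finds nothing has to cope with a missing value, which is where an application that assumes it exists can fall over.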
More AVX work!
This has been a long time coming and we're almost there finally. After these changes FEX only needs to fix a few implementation bugs in the string
operations, and implement the XSAVE instructions before allowing AVX emulation. In addition to that, FEX is also almost able to support AVX2, with the
only instructions that still need to be implemented being the gather load instructions!
Implements support for XGETBV
This is a fairly simple instruction as it lets the application query which CPU features are enabled. Necessary for an application to check before
enabling any AVX usage.
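The usual check an application performs looks roughly like the sketch below: confirm via CPUID leaf 1 that OSXSAVE and AVX are reported, then read XCR0 with XGETBV and confirm that both the SSE and AVX state bits are enabled. The bit positions are the architectural ones; the function itself is only an illustration that takes the register values as plain integers.

```python
CPUID_1_ECX_OSXSAVE = 1 << 27   # OS has enabled XSAVE/XGETBV
CPUID_1_ECX_AVX = 1 << 28       # CPU reports AVX
XCR0_SSE = 1 << 1               # XMM state enabled
XCR0_AVX = 1 << 2               # YMM upper-half state enabled


def avx_usable(cpuid_1_ecx: int, xcr0: int) -> bool:
    """Return True only if both the CPU and the OS-enabled state allow AVX."""
    if not cpuid_1_ecx & CPUID_1_ECX_OSXSAVE:
        return False            # XGETBV may not even be executable
    if not cpuid_1_ecx & CPUID_1_ECX_AVX:
        return False
    needed = XCR0_SSE | XCR0_AVX
    return (xcr0 & needed) == needed


print(avx_usable(CPUID_1_ECX_OSXSAVE | CPUID_1_ECX_AVX, 0x7))
```

This is why XGETBV has to work before AVX can be advertised: without it an application has no way to complete the check.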
Handle PCMPESTRM, PCMPISTRM and AVX variants
These are the remaining string instructions that FEX has implemented. While mostly implemented there are still a couple of edge case behaviours that
aren't quite correct and just need to be fixed.
Implement support for deferring asynchronous signals
This change has been a long time coming to make FEX's JIT faster in the face of handling asynchronous signals. The issue is that FEX needs to enter
code regions that are effectively "uninterruptible" until they are complete. This is basically a reentrancy problem where a piece of code executing could
lock a mutex, or non-atomically update a container's data; then when a signal occurs it will jump out of the code and potentially come back to this
corrupted state.
As an initial workaround to this problem, FEX would just disable all signals in each of these "signal-deferring regions." This had the overhead that
every region would have a system call going in to it and then another one coming out of it. If these regions were entered infrequently then
it would be a non-issue; but as is commonly the case, FEX's JIT needs to be wrapped in these signal blocks. If a game is executing a bunch of code,
this means we can be doing thousands of additional system calls per second, which adds up as direct overhead.
With this new change, FEX marks that it is in a signal-deferring region with some very cheap memory accesses and if a signal doesn't occur the
overhead is negligible. In the case that a signal does occur, it will get stored to a queue, FEX will finish its signal-deferring region, and then
come back to handle the signal.
This mostly works because asynchronous signals don't have guarantees about the timeliness of the signal being delivered. Sadly this can result in
signal queue depths being subtly incorrect but we are monitoring the situation to know if any game is affected. All in all this finally makes it so
FEX can be straced without being overwhelmed and improves stutter problems!
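The deferral scheme can be modelled in a few lines. This is a simulation of the idea only (a depth counter plus a pending queue) with signals delivered by an explicit method call; it is not FEX's implementation, and SignalDeferrer is a hypothetical name.

```python
from collections import deque


class SignalDeferrer:
    """Queues signals that arrive inside a deferring region and replays them on exit."""

    def __init__(self, handler):
        self._handler = handler
        self._depth = 0              # cheap in-memory marker instead of a syscall
        self._pending = deque()

    def __enter__(self):
        self._depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._depth -= 1
        if self._depth == 0:
            while self._pending:     # leaving the outermost region: handle the backlog
                self._handler(self._pending.popleft())
        return False

    def deliver(self, signum: int) -> None:
        if self._depth > 0:
            self._pending.append(signum)   # inside a region: defer
        else:
            self._handler(signum)          # outside: handle immediately


handled = []
deferrer = SignalDeferrer(handled.append)
with deferrer:
    deferrer.deliver(10)
    print("inside region:", handled)
print("after region:", handled)
```

When no signal arrives, entering and leaving a region costs only the counter updates, which matches the "negligible overhead" case described above.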
Grand Theft Auto 5 AVX fix
FEX was accidentally reporting support for BMI1 and BMI2 CPU instructions. These extensions require that AVX is also implemented.
This was causing Grand Theft Auto 5 to crash early trying to use AVX. We will now only report these extensions if emulated AVX is supported, which fixes this game.
Make vfork wait for the process to exit
FEX's previous implementation of vfork actually behaved like fork. The difference between these two syscalls is fairly subtle. In the case of vfork, the parent process will end up sleeping until the child either exits or executes execve.
We were instead treating this like a fork, where the parent continues immediately without waiting. While no known issues were encountered, it is good
to ensure this behaviour is correct for future work.
Getdents optimization
This classic syscall is used for querying directory contents. FEX needs to emulate this syscall since 64-bit applications now use a new syscall called
getdents64. FEX's original implementation was fairly slow due to a misunderstanding as to how this syscall worked. It would create a temporary working
buffer and copy data around a couple of times. The new implementation is able to use the buffer provided by the application and do some
minor fixups, making the overhead fairly light. This improves performance when an application is doing heavy folder scanning, which mostly means
it improves Proton startup time.
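As an illustration of what such an in-place fixup can look like, the sketch below converts getdents64-style records to the older getdents layout inside the same buffer. It assumes the x86-64 record layouts noted in the comments, and it is a guess at the general shape of the technique rather than FEX's code; make_dirent64 exists only to build example data.

```python
import struct

# Assumed x86-64 layouts (little-endian, packed as listed):
#   linux_dirent64: u64 d_ino, s64 d_off, u16 d_reclen, u8 d_type, char d_name[]
#   linux_dirent:   u64 d_ino, u64 d_off, u16 d_reclen, char d_name[], ..., u8 d_type (last byte)
DIRENT64_HEADER = struct.Struct("<QqHB")   # 19 bytes, name starts at offset 19
DIRENT_HEADER = struct.Struct("<QQH")      # 18 bytes, name starts at offset 18


def make_dirent64(ino: int, off: int, d_type: int, name: bytes) -> bytes:
    """Hypothetical helper: build one NUL-terminated, 8-byte aligned dirent64 record."""
    reclen = (DIRENT64_HEADER.size + len(name) + 1 + 7) & ~7
    record = bytearray(reclen)
    DIRENT64_HEADER.pack_into(record, 0, ino, off, reclen, d_type)
    record[19:19 + len(name)] = name
    return bytes(record)


def convert_in_place(buf: bytearray, length: int) -> None:
    """Rewrite every dirent64 record in buf[:length] as a linux_dirent record."""
    pos = 0
    while pos < length:
        _, _, reclen, d_type = DIRENT64_HEADER.unpack_from(buf, pos)
        end = pos + reclen
        buf[pos + 18:end - 1] = buf[pos + 19:end]   # shift the name left by one byte
        buf[end - 1] = d_type                       # d_type moves to the record's last byte
        pos = end


def read_dirent(buf: bytearray, pos: int):
    """Decode one converted record as (d_ino, name, d_type, d_reclen)."""
    ino, _, reclen = DIRENT_HEADER.unpack_from(buf, pos)
    name_end = buf.index(0, pos + 18)
    return ino, bytes(buf[pos + 18:name_end]), buf[pos + reclen - 1], reclen


buf = bytearray(make_dirent64(1, 10, 4, b"dir") + make_dirent64(2, 20, 8, b"file.txt"))
convert_in_place(buf, len(buf))
print(read_dirent(buf, 0), read_dirent(buf, 24))
```

Because the record lengths stay the same, no temporary buffer is needed; each record only has its name shifted by one byte and its type byte relocated.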
Minor optimizations
There were a handful of minor optimizations that improve performance so minorly that it falls within noise, but is nice to have.
Optimize ARM64 thunk trampolines
This is a very small optimization that changes an indirect load in to a PC relative load, removing a single data dependency.
Minor x87 FCMOV optimization
FEX was duplicating a mask from a GPR in to a vector register using two instructions and now it only uses one instruction.
Optimize ADC/ADD OF flag calculation
This was a small mistake where a bitwise negate was using two instructions instead of one.
Optimize EFLAG unpacking
Each time FEX was unpacking the EFLAG register it was using four instructions per bit of the flag. This has now been improved to only two, cutting
flag unpacking to 82% of its original size in some edge cases.
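For readers unfamiliar with what "unpacking" means here, the sketch below splits a packed EFLAGS value in to individual flag bits with one shift and one mask per flag. The bit positions are the architectural x86 ones; the code only illustrates the operation and says nothing about the exact ARM64 instructions FEX emits.

```python
# Architectural bit positions of the commonly used x86 status and control flags.
FLAG_BITS = {"CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "DF": 10, "OF": 11}


def unpack_eflags(eflags: int) -> dict:
    """Extract each flag as 0 or 1 using a shift followed by a mask."""
    return {name: (eflags >> bit) & 1 for name, bit in FLAG_BITS.items()}


print(unpack_eflags(0x8C1))
```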
Supported emulated Linux kernel version up to 6.2
FEX used to max out the reported kernel version up to 5.18. Now we can report up to 6.2 with this change. 6.3 is going to be harder since it
introduces a new prctl that FEX needs to work around.
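One simple way to model such a cap is to report the host's version unless it is newer than the highest version the emulator handles. The sketch below assumes that behaviour for illustration; the constant and function names are hypothetical and the real logic may differ.

```python
# Assumption for this sketch: the emulator reports min(host version, newest supported).
MAX_EMULATED_KERNEL = (6, 2, 0)


def reported_kernel_version(host_version: tuple) -> tuple:
    """Clamp a (major, minor, patch) host kernel version to the supported maximum."""
    return min(host_version, MAX_EMULATED_KERNEL)


print(reported_kernel_version((6, 5, 0)))
print(reported_kernel_version((5, 15, 0)))
```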
Video game showcase
As said previously, Unity engine games have issues without TSO emulation. Here is a clip of Hollow Knight running under FEX at full speed on a
Lenovo X13s. Even with the overhead of emulating the x86-TSO memory model, this game runs remarkably well. With x86-TSO emulation disabled this game
would have crashed a few seconds in.
Raw Changes
- AOTIR
- Stop passing a mutex around. It's already guarded (f47caf4)
- ARM64
- Fixes SRA disabled codepath (dc65a5e)
- CPUID
- Only enable BMI1 and BMI2 if AVX is supported (cc7a56b)
- Context
- Remove debug namespace (5be798e)
- ELFCodeLoader
- Fixes missing AT_EXECFN (00dc373)
- FEXConfig
- Removes Emulated CPU cores option (1a9b6a8)
- FEXCore
- Implements support for xgetbv (737f917)
- Support Wine syscalls (77e8be1)
- Convert Core and Telemetry over to fextl::file::File (5674d3a)
- Adds support for hardware x86-TSO prctl (ed69eb9)
- FEXLoader
- Allow simulated kernel version up to 6.2 (ada226b)
- FEXRootFSFetcher
- Support rolling release distros (9473025)
- IRDumper
- Fixes ssa number in arguments. (0f4a5ed)
- InstallFEX
- Updates helper install script for Ubuntu 23.04 (02f15f4)
- Linux
- Make vfork act more similar to how it should. (95b7592)
- OpcodeDispatcher
- Optimize ADC/ADD OF flag calculation (5b58082)
- Optimize EFLAG unpacking (69181d4)
- Handle PCMPESTRM/VPCMPESTRM (182010c)
- Handle PCMPISTRM/VPCMPISTRM (https://...
FEX-2305
Read the blog post at FEX-Emu's Site!
Welcome back to another release of FEX-Emu! We had cancelled last month's release due to a large amount of code churn happening. In order to ensure
the highest quality of stability we were forced to do so. Now we're back with an even lengthier release this month, so buckle up because there were a
large number of changes that happened.
More AVX Work!
These last two months have been a wild ride towards implementing AVX. @Lioncache has been burning down a ton of
instructions to get everything in place for AVX emulation.
New instructions implemented
- PCMPISTRI/VPCMPISTRI
- VPMASKMOVD/VPMASKMOVQ
- VCVTPD2PS/VCVTPS2PD
- VCVTSD2SS/VCVTSS2SD
- PCMPESTRI/VPCMPESTRI
- VMPSADBW
- VPSLLVD/VPSLLVQ
- VPSRLVD/VPSRLVQ
- VCVTSI2SD/VCVTSI2SS
- VPINSRB/VPINSRD/VPINSRQ/VPINSRW
- VPSADBW
- VTESTPD/VTESTPS
- VPMADDUBSW
- VPMOVMSKB
- VMASKMOVPD/VMASKMOVPS
That's a whole bunch of instructions implemented! We have now nearly implemented all the instructions required for AVX.
The two major instructions remaining before AVX can be exposed are the SSE4.2 instructions PCMPISTRM and PCMPESTRM. This is because these two
instructions also have AVX versions, so they are required in order to support AVX.
We are getting really close and once this feature is done, we can quickly move on to finishing support for AVX2, F16C, and the fused
multiply-accumulate extensions. At that point our CPU emulation will be effectively "feature-complete" for everything that games will care about in
the short-term. Exciting times!
llvm-mingw and WINE support
This is a very big change that has been coming down the pipe for a while now. We have been mostly working behind the scenes to get FEX-Emu wired up so
that it can be compiled as a Windows shared library. This last month is where this work has finally come to a head and most of the work is in place
for this.
How this works is that FEX-Emu has a shared-library and static-library that gets compiled called FEXCore. This is where all the CPU emulation
happens and tries to be mostly OS agnostic, while everything that is Linux specific lives in the frontend called FEXInterpreter. It is FEXCore
that can now be compiled as a Windows AArch64 PE library. While this isn't currently useful to end users, it means that WINE can link to this
library for emulating x86/x86-64 on AArch64 platforms. It should be noted that there are still some Linux assumptions strewn about the code, so this
isn't a generic solution for emulation on a true Windows platform. We're writing this support specifically for WINE today.
Converting away from C++ containers that allocate memory
This is the significant change that caused us to cancel last month's release. While @Neobrain was writing code to
support 32-bit library thunking, they had discovered a very big problem. FEX-Emu has long overridden the glibc memory allocation routines in order for
us to ensure that FEX can allocate memory when emulating 32-bit applications. We discovered that this overriding also extends to system libraries that
we load in after the fact. This meant that any time libGL would allocate memory, it would end up being a 64-bit pointer and there was nothing we could
do about it.
The workaround for this problem is to stop overriding the system allocators, which will allow shared libraries to allocate memory that can safely be
used by the 32-bit guest. But this also has the problem that FEX would then run out of memory when executing 32-bit applications. This is due to a
quirk that FEX-Emu needs to allocate all the memory on the system before executing 32-bit applications.
The new workaround is to replace usage of every C++ container that allocates memory with FEX's own container that will use its own allocator. This was
an exceedingly invasive change that touches almost everything in our codebase. With the pain done, FEX now can use its own internal allocators while
system libraries will use the regular glibc allocator as expected. See more about the limitations of this with our
documentation.
Re-enable glibc allocator hooking again
Okay, the previous paragraph was a ruse; FEX-Emu needed to actually override the glibc allocator again. In this case FEX-Emu will actually have three
allocators active at any given moment.
- FEX-Emu uses jemalloc for its internal allocator.
- The system allocator is overridden with another jemalloc allocator.
- The guest application's glibc allocator is untouched.
The problems start occurring when a pointer is shared between thunks and the guest application. If one allocator tries to free a pointer from a
different allocator then fireworks occur. The way around this is to use a jemalloc function to determine if it owns the pointer and choose which
allocator to end up freeing the pointer from. This is particularly painful with X11 thunking because pointers are passed between client and server in
a very laissez faire fashion. This may not stay around in the future but it is a necessary evil for now.
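The ownership check can be pictured with a toy model: each allocator tracks the pointers it handed out, and a free is routed to whichever allocator claims the pointer. The Arena class and dispatch_free function below are hypothetical stand-ins for illustration; they do not represent jemalloc's API.

```python
class Arena:
    """Toy allocator that remembers which pointers it has handed out."""

    def __init__(self, name: str, base: int):
        self.name = name
        self._next = base
        self._live = set()

    def malloc(self, size: int) -> int:
        ptr = self._next
        self._next += max(size, 1)
        self._live.add(ptr)
        return ptr

    def owns(self, ptr: int) -> bool:
        return ptr in self._live

    def free(self, ptr: int) -> None:
        if ptr not in self._live:
            raise ValueError(f"{self.name}: freeing a pointer it does not own")
        self._live.remove(ptr)


def dispatch_free(ptr: int, arenas) -> str:
    """Free ptr through the allocator that owns it and report which one that was."""
    for arena in arenas:
        if arena.owns(ptr):
            arena.free(ptr)
            return arena.name
    raise ValueError("pointer is not owned by any known allocator")


fex_arena = Arena("fex", 0x1000)
system_arena = Arena("system", 0x100000)
pointer = system_arena.malloc(32)
print(dispatch_free(pointer, [fex_arena, system_arena]))
```

Freeing through the wrong arena raises immediately in this model, which is the tame equivalent of the "fireworks" mentioned above.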
JIT Optimizations and improvements
Reclaim static assigned registers on 32-bit
This allows us to use 8 more general purpose registers and 8 more floating point registers with 32-bit applications. Depending on the game this can
improve performance by a decent margin. We have seen upwards of 20% performance uplift in various games due to it.
Fix Visual C++ redistributable crashing
This was a really annoying bug, where every.single.time. that Proton would run, it would try to install the C++ runtime at least four times. The user
would be required to kill the processes after they were installed. This was fairly egregious because we had thought it was fixed months ago and didn't
realize that it wasn't actually fixed. Depending on the version of the Visual C++ redistributable and Proton it would still occur.
Root causing this issue, it turns out that the redistributable uses Windows' structured exception handling to catch the case when it passes a null pointer
to strlen, which results in a SIGSEGV on the Linux side. FEX was incorrectly saving and restoring state when this occurred, which caused it to
infinitely loop and crash. Now that this is fixed, these install correctly and Proton doesn't try doing it on every single run.
Implement REP MOVS as a memcpy
This instruction behaves like a fairly fast memory copy on the CPU. We now convert this over to an internal memory copy operation.
Similar to last month where we converted an instruction to a memset, this instruction being implemented as an IR operation has many times over
performance improvements. In real games this usually translates to a few percentage FPS improvement which is a nice uplift.
Fix restoring of AVX state
While not actually being utilized today (Except due to a bug), @AndreRH found out that we were accidentally failing to
restore AVX register state when a signal handler returned. It's surprising that this wasn't noticed earlier but it could have resulted in some really
bad floating point state.
Remove double syscall overhead on filesystem accesses
When an application accesses a file, FEX first needs to check whether the file exists in the overlayfs style rootfs image we provide. If the
file exists we will redirect the file to be opened from the rootfs instead of the host filesystem. We had an issue that if the file didn't exist, we
would then check for it again by accident before accessing the host file. This would mean that one syscall turned in to three. With this fix in place
we are now only converting it in to two.
If you're running a rootfs image off of a particularly slow drive (or a network share) then this can shave a decent amount of time off of load times.
This was particularly noticeable when running a Proton game under Steam because they will access a ton of files before starting up.
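The intended flow after the fix can be modelled as one existence check against the rootfs followed by the real access on whichever layer was chosen. The sketch below uses sets in place of real filesystems and a counter in place of real syscalls; all names and paths are invented for the example.

```python
def open_with_overlay(path: str, rootfs_files: set, host_files: set, stats: dict):
    """Return the layer that serves `path` (or None), counting simulated syscalls."""
    stats["syscalls"] += 1                 # single existence check against the rootfs
    if path in rootfs_files:
        layer, present = "rootfs", True
    else:
        layer, present = "host", path in host_files
    stats["syscalls"] += 1                 # the open itself, on the chosen layer
    return layer if present else None


stats = {"syscalls": 0}
rootfs = {"/usr/lib/libc.so.6"}
host = {"/home/user/save.dat"}
print(open_with_overlay("/usr/lib/libc.so.6", rootfs, host, stats), stats)
print(open_with_overlay("/home/user/save.dat", rootfs, host, stats), stats)
```

Either way the guest's single open costs two operations here, rather than the three the duplicated check used to cost for files missing from the rootfs.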
Adds default DRM ioctl interface
This is a fairly basic change. Instead of breaking when hitting an unknown ioctl, pass it to the kernel and hope for the best. This is mostly so Asahi
and other drivers can test things under FEX without pushing patches to us for downstream support.
Add support for thunking Wayland
This doesn't affect most users today, but adding support for thunking Wayland means that in the future applications that use it can sanely use this thunk.
SDL applications today might be able to take advantage of it but it is fairly fresh. We're looking forward to the inevitable Wayland and WINE
utilization to let things move away from X11.
Fixed 32-bit clock_nanosleep
There was a fairly nasty implementation detail where a 32-bit application trying to sleep with this syscall would actually consume a CPU core to 100%.
While fairly uncommon, this allows the game Alwa's Awakening to not burn a CPU core while running.
Add a bunch of functions to FEX's ARMEmitter
Not really a user facing feature but our code emitter has gained a bunch of new instruction support. This will be used in the future for our AVX2
implementation and various things. So it's good to have.
Raw Changes
FEX-2303
Read the blog post at FEX-Emu's Site!
Oh jeez, another month already? I guess it's time for another FEX-Emu release. Let's pick a commit, spin the roulette wheel, and hope for the best!
Surely that's how releases work?
Rootfs images are now on a new CDN!
While this is something that doesn't directly impact FEX when running applications, it's a problem that most of our users need to deal with when
installing FEX. Our previous CDN which was hosting our x86 images had a fair number of problems that couldn't be solved. The main issue that affected
users was that it was slow to download the images and, depending on where you were in the world, it could have an unstable connection. This resulted in
gigabyte sized files taking forever to download or never finishing at all!
This month we have switched our CDN to a service that has worldwide data replication across multiple data servers. This improves the speed at which
users can download our prebuilt images. Going from an average of 20MB/s to over 300MB/s is a significant boost. In addition to that, the connection is
significantly more stable to the far corners of the world. Also, something that doesn't affect users at all is that this new CDN is actually
significantly lower cost than what we were previously using. This was unexpected but it's a nice bonus that this CDN is an improvement in every regard,
including cost.
This month's code changes
With that out of the way, onward to this month's changes.
Optimize REP STOS instruction in to inline memset
This is an instruction that x86 offers that behaves similarly to a memory set operation. It behaves slightly differently since this allows you to set
the memory by element size, and also you can choose the direction in which the memory is set. In particular this instruction tends to get used for
zeroing out memory. Latest x86 CPUs have even optimized this instruction in order to be as fast as possible. Previously FEX had decomposed this instruction
in to a complex series of code blocks that was inefficient for our JIT and everything surrounding it. Now we instead convert this to a single IR
operation called MemSet which exposes the semantics of how the instruction works, allowing our IR to be cleaner and the backend to decompose it in
a more optimal fashion. Currently we emit a fairly trivial loop that handles this memory set operation. ARM has recently announced that future CPUs
are going to support a memory set instruction that is very similar to the 8-bit REP STOS which will make this implementation even faster!
As seen by this graph, FEX is nowhere near a native implementation. It's important to note that even without writing "optimal" codegen, this change
has still given FEX up to an 11% performance improvement on its implementation. This was primarily focused around improving the IR; we can now
optimize the code that the JIT emits significantly more easily! Getting closer to native is likely something to come in the
future.
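The REP STOS semantics described above (element size, repeat count, and direction) can be written down as a trivial loop. The sketch below models guest memory as a bytearray and is only meant to show what a MemSet-style operation has to express; the function name and signature are invented for the example.

```python
def rep_stos(memory: bytearray, dest: int, value: int, count: int,
             elem_size: int, direction_flag: bool = False) -> int:
    """Store `count` elements of `elem_size` bytes starting at `dest`.

    Returns the final destination pointer, which moves backwards when the
    direction flag is set, mirroring how the x86 instruction updates RDI.
    """
    mask = (1 << (8 * elem_size)) - 1
    pattern = (value & mask).to_bytes(elem_size, "little")
    step = -elem_size if direction_flag else elem_size
    for _ in range(count):
        memory[dest:dest + elem_size] = pattern
        dest += step
    return dest


memory = bytearray(8)
end = rep_stos(memory, 0, 0xAABB, 3, 2)   # three 16-bit stores, forwards
print(memory.hex(), end)
```

The common zeroing case is simply value 0 with the direction flag clear, which is the case a dedicated memory set instruction maps on to most directly.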
Add config option to hide hypervisor CPUID bit
We encountered the first game that has anti-virtual machine code and refuses to run if it thinks it is running in a VM. While FEX isn't a virtual
machine, we expose this CPUID bit so software that cares can use it as a hint to query FEX specific CPUID information. Now that this game has stumbled
upon this issue, we added a configuration profile to disable this CPUID bit for the game. If any other games also pick up on this issue then we will
need more profiles.
Proton and pressure-vessel startup optimizations
One of this month's efforts has been about improving the time it takes for Proton to start up. pressure-vessel is the project that is used to set up the
Proton execution environment, which takes a while overall. One of the hardest things about Proton is that it executes thousands of programs and does an
absolute ton of filesystem accesses. ARM devices typically don't have the highest performance filesystems, which makes one part of this hard, but also
FEX's filesystem overlay adds overhead to this. Additionally one of FEX's shortcomings currently is that every application execution must JIT fresh
code every time it restarts. Since pressure-vessel starts so many programs, a lot of the time is just spent emitting code to memory. There were a few
optimizations that went towards making this faster this month.
With the couple of optimizations in place we managed to shave a second off of the start-up time, cutting the execution from 9.7 seconds down to 8.7
seconds. Or in the case of running on an Apple M1, execution is now down to 7 seconds. Almost all of this time improvement comes from faster syscall
wrapping and the remaining CPU time is code JIT and execution. It'll only get faster in the future!
Fix a race condition with syscall emulation
While this is a fairly minor change, we fixed a race condition around system calls which would consistently cause crashes when Steam was starting up.
Every piece of work that improves stability just makes the whole emulation experience so much better and needs to be celebrated!
Signal frame improvements!
A significant problem with using FEX is the debugging experience when something breaks. We spent a good amount of time this month improving how FEX
sets up its signal frames when the guest application hits a fault. Since we weren't following traditional signal frame generation, tooling around
backtracing was broken in most cases. We have now reworked this so that libSegFault will now work to give FEX a backtrace of the application's
state when it crashes.
We will be shipping a new rootfs which includes x86 and x86-64 libraries for libSegFault so that if users want to debug a crashing application, they
can try and get a backtrace.
AVX work continues
Another month, another bunch of AVX work that has been implemented.
Instructions implemented
- VPHSUBSW
- VHSUBPD/VHSUBPS
- VPERMILPD/VPERMILPS
- VPERMD/VPERMPS
- VPHADDSW
- VPTEST
- VPMOVSD/VPMOVSS
- VSHUFPD/VSHUFPS
- VPSHUFD/VPSHUFHW/VPSHUFLW
- VPSHUFB
- VPALIGNR
- VEXTRACTF128/VEXTRACTI128
- VPBLENDVB/VBLENDVPD/VBLENDVPS
- VBLENDPD/VPBLENDW
As you can see, a lot of new instructions are now implemented. This now leaves us with about thirty more instructions that need to be implemented
before we can start advertising the features on SVE2-256bit supporting hardware. This is significant as we keep finding more and more games that
require AVX to run.
ARM emitter cleanups
This is another change that isn't user facing, but it is always nice to point out some janitorial tasks that have been done. When we switched over to using our
own code emitter there were some design choices and implementations that weren't quite optimal. This usually manifests as developer pain when using
the emitter but was a necessary evil since we wanted to get rid of VIXL's assembler as fast as possible. @Lioncache
spent some time this month cleaning up a lot of the dirty code in the emitter, in some cases making it slightly faster as well. This is always greatly
appreciated as it reduces maintenance burden when working in the JIT.
They also implemented an absolute ton of new instruction emitter functions which previously didn't exist. While we don't use these yet, we will likely
use them at some point which will make our lives easier in the future.
New development machines for our developers
Just recently a new Snapdragon laptop has gotten working OpenGL and Vulkan drivers up and running! We are gifting each of our developers one of these
great machines in order to ensure we have testing platforms for all the OpenGL 4, DXVK, and VKD3D applications we want to be running! Kudos to all the
developers that worked on bringing this hardware up so quickly!
Raw Changes
- ARMEmitter
- Tidy up some assertion handling (e7069f9)
- Remove predicate implicit conversion operators (41731e2)
- Make second sxtw parameter a WRegister (e71e3ec)
- Remove implicit conversions from Register/XRegister/WRegister (378e069)
- Remove predicate uint32_t conversion operators (e869b2f)
- Remove most implicit conversion operators for vector register types (0f45318)
- Make VRegister constructor explicit (21fbcef)
- Handle sequential registers in lists nicer (ef02083)
- Simplify size handling Advanced SIMD 3 different group (24904f4)
- Simplify advanced SIMD copy (e65b429)
- Centralize handling for unsigned offset load-stores (1832cc8)
- Handle SVE Integer Compare - Scalars group (fe1faf9)
- Finish off SVE Predicate Misc group (165db37)
- Handle SVE partition break categories (4d65521)
- Handle SVE integer compare with wide elements ca...
FEX-2302
Read the blog post at FEX-Emu's Site!
This month certainly passed in the blink of an eye. A lot of good bug fixes this month as usual! Continue reading to find out more.
Fix incorrect operation for cache line clears
In emulating the CLFLUSH instruction, FEX was using the wrong operation for clearing caches. We were accidentally using the CVAU operation instead of CIVAC.
While this is incorrect, it was hard to find anything that was actually affected by the wrong implementation. With Snapdragon's open source Vulkan driver implementing what is required for VKD3D,
it became evident from Vulkan tests that this was incorrectly implemented. Switching the implementation is easy and will let VKD3D run without hacks
when the required feature is finished.
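For readers unfamiliar with the two operations: on AArch64, DC CVAU only cleans a line to the point of unification, while DC CIVAC cleans and invalidates it to the point of coherency, which is the closer match for CLFLUSH's write back and invalidate behaviour. The following is a minimal sketch of a helper with CLFLUSH-like semantics. `FlushCacheLine` is a hypothetical name, this is not FEX's JIT code, and it assumes a Linux-style environment where user space is allowed to issue DC CIVAC.

```cpp
#include <cassert>
#if defined(__x86_64__)
#include <emmintrin.h>
#endif

// Hypothetical helper, not FEX's actual code: flush the cache line that holds
// `addr`, writing it back and invalidating it like x86 CLFLUSH does.
inline void FlushCacheLine(const void* addr) {
  assert(addr != nullptr);
#if defined(__aarch64__)
  // DC CIVAC: clean and invalidate by virtual address to the point of coherency.
  // DC CVAU (the operation used by mistake) only cleans to the point of unification.
  __asm__ volatile("dc civac, %0" : : "r"(addr) : "memory");
  __asm__ volatile("dsb sy" : : : "memory");
#elif defined(__x86_64__)
  _mm_clflush(addr);  // the native instruction being emulated
  _mm_mfence();
#else
  (void)addr;  // other architectures: nothing to demonstrate in this sketch
#endif
}
```

Flushing a line never changes the value stored in memory, so the difference between the two operations is invisible to ordinary code and only shows up when another observer, such as a GPU, reads the same memory.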
Bug fixes to 64-bit x87 emulation
A big thanks to CallumDev for finding and fixing these latest bugs in FEX's less accurate x87 emulation. As a
reminder, x87 on original hardware operates using 80-bit float values. This is a feature that ARM doesn't natively support, so FEX needs to emulate
this using a software floating point library. We have a hack in our configuration that removes this software implementation and operates
using 64-bit double operations instead. This can significantly improve performance in some 32-bit games but can introduce rendering artifacts.
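The precision gap behind those artifacts can be sketched in a few lines. An 80-bit x87 value carries a 64-bit significand, while a 64-bit double carries 53 bits, so small contributions that survive on real hardware get rounded away. The helper names below are hypothetical, and the `long double` half only shows the x87 behaviour on platforms where `long double` has at least 64 significand bits (for example x86-64 Linux).

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Returns true when adding 2^-exponent to 1.0 is rounded away in 64-bit double arithmetic.
inline bool LostInDouble(int exponent) {
  assert(exponent > 0);
  // volatile forces the sum to be stored as a real 64-bit double before comparing.
  volatile double sum = 1.0 + std::ldexp(1.0, -exponent);
  return sum == 1.0;
}

// Same check using long double, which is the 80-bit x87 format on x86-64 Linux.
inline bool LostInLongDouble(int exponent) {
  assert(exponent > 0);
  volatile long double sum = 1.0L + std::ldexp(1.0L, -exponent);
  return sum == 1.0L;
}
```

For instance, `LostInDouble(60)` is true, while `LostInLongDouble(60)` is false wherever `long double` has a 64-bit or wider significand. A game that depends on those extra bits can end up with slightly different geometry or colours when the hack is enabled.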
This month there were many bug fixes:
- ALU operations that consume integers converted to floats are fixed
- Float comparison that also consumes 16-bit integers fixed
- FPREM instruction no longer infinite looping
With these fixes in place, a large number of games now actually render correctly with this hack enabled. It will be interesting to see how well this
improves performance or battery savings in 32-bit games!
More AVX instructions emulated
With one of FEX's developers taking some time away, this was a little less involved than the last couple of months.
There was still a handful of instructions implemented:
- VPBLENDD, VBLENDPS, and VPSRAVD
Additionally, while these aren't AVX instructions, we also implemented the CLWB and CLFLUSHOPT instructions. These match their ARM equivalents so it was
mostly an easy implementation that applications can use if they want.
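As an illustration of one of the AVX instructions listed above, here is a minimal scalar reference model of VPSRAVD, which arithmetically shifts each 32-bit lane right by its own count. The x86 manuals specify that a count above 31 fills the lane with the sign bit, which is the same as shifting by 31. The names are hypothetical and this is a sketch, not FEX's implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Eight 32-bit lanes of a 256-bit register (hypothetical types for this sketch).
using Lanes = std::array<int32_t, 8>;
using Counts = std::array<uint32_t, 8>;

// Scalar reference model of VPSRAVD. Illustrative only, not FEX's code.
inline Lanes VpsravdReference(const Lanes& src, const Counts& counts) {
  Lanes result{};
  for (std::size_t i = 0; i < src.size(); ++i) {
    // Counts above 31 behave like a shift by 31: the lane becomes all sign bits.
    const uint32_t shift = counts[i] > 31 ? 31 : counts[i];
    // Right shift of a negative value is arithmetic on mainstream compilers
    // and is guaranteed to be arithmetic from C++20 onwards.
    result[i] = src[i] >> shift;
  }
  return result;
}
```

The clamp matters because a plain hardware shift by an out-of-range amount does not behave the same way on every architecture.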
Fix copy and paste error in Arm64 JIT
While this is a fairly minor issue, we had a copy and paste error in FEX's register spilling code. This caused Steam to crash in certain situations,
so this fix helps users who want to run it.
A bunch of minor optimizations
This month had a bunch of small optimizations around the entire project. Alone these are all quite minor but added together should result in a couple
of percent of CPU time removed from FEX's JIT.
- Arm64 Dispatcher is slightly faster
- CPUID emulation initialization is faster
- Optimize File loading, improving config loading time
- Frontend instruction decoder is faster
- Makes IR operations 1 byte smaller, improving memory usage
- Inline IR constants optimization to reduce IR memory size
Fixing thunk symbol override fetching
FEX's thunks had an issue where if a library was loaded, we would only ever fetch relevant symbols from that library directly. While this worked for
our use case, it breaks when wanting to use MangoHud in OpenGL applications. Resolving this issue fixes most things that will override symbols with
LD_PRELOAD.
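The general idea can be sketched with plain `dlsym`. Looking a symbol up through `RTLD_DEFAULT` searches the global scope in load order, so a library injected with LD_PRELOAD gets the first chance to provide it, whereas a lookup through one library's handle only ever sees that library and its dependencies. `ResolveSymbol` is a hypothetical helper, not FEX's thunk code, and older glibc versions may need `-ldl` at link time.

```cpp
#include <cassert>
#include <dlfcn.h>

// Hypothetical lookup helper, not FEX's thunk code: prefer whatever the global
// scope provides (including LD_PRELOAD overrides), then fall back to the
// specific library handle.
inline void* ResolveSymbol(void* library_handle, const char* name) {
  assert(name != nullptr);
  // RTLD_DEFAULT walks the global scope in load order, so preloaded libraries win.
  if (void* sym = dlsym(RTLD_DEFAULT, name)) {
    return sym;
  }
  // Nothing in the global scope: ask the library directly, if we have one.
  return library_handle != nullptr ? dlsym(library_handle, name) : nullptr;
}
```

With this ordering, an overlay that exports its own version of a GL entry point is picked up before the real library's copy.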
Update JEMalloc from 5.2.1 to 5.3.0
While this is a fairly minor change, this release of JEMalloc fixes some bugs and improves performance. Small but every performance improvement is
welcome.
Support for execveat with AT_EMPTY_PATH
This is an interesting feature where an application can be executed directly through a file descriptor instead of a filepath on disk. This is a fairly
simple idea but has some edge cases that might be interesting to some people. To see the more technical information about implementing
this, check out the pull request.
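From the application's side, the feature looks roughly like the sketch below. It is Linux only, `RunViaFd` is a hypothetical helper written for this example, and it calls the raw `execveat` system call with an empty path so that the kernel executes the file the descriptor refers to.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: run the program at `path` through a file descriptor and
// return its exit status, or -1 if it could not be started or waited for.
inline int RunViaFd(const char* path) {
  assert(path != nullptr);
  const pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    // Not using O_CLOEXEC here: with it set, executing a script this way fails
    // because the interpreter can no longer open the descriptor.
    const int fd = open(path, O_RDONLY);
    char* const argv[] = {const_cast<char*>(path), nullptr};
    char* const envp[] = {nullptr};
    if (fd >= 0) {
      // Empty path plus AT_EMPTY_PATH means "execute the file behind fd".
      syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
    }
    _exit(127);  // only reached if open or execveat failed
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) {
    return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `RunViaFd("/bin/true")` should return 0 on a typical Linux system. An emulator has no filesystem path to work with in this case, so it has to load the guest binary from the descriptor itself.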
Raw Changes
- ARMEmitter
- Handle integer add/subtract vectors (predicated) instruction class (9d33bba)
- Handle RMIF, SETF8/SETF16 (a899f9f)
- Handle SVE floating-point recursive reduction (1cda029)
- Add a few missing instructions (2c9f99e)
- Support helper for long address generation (f8d56a8)
- Removes some warnings that cropped up (5fd8fdb)
- Arm64
- Merge two loads in to an LDP (a28039f)
- Fixes incorrect operation for CacheLineClear (f8d92aa)
- Use switch statement for op handlers instead of jump table (565ed45)
- Fix SpillRegister C&P error (9c93c6f)
- Fixes large offset spill slots (9acb513)
- VectorOps
- Clamp shift amount to esize-1 for VSShr (9a318ca)
- ArmEmitter
- Adds two more classes of ASIMD instructions (95e544c)
- Adds three more classes of ASIMD instructions (81e0ac7)
- CPUID
- Optimize initialization (f614fc6)
- Config
- Fix relative execve applications. (65971ef)
- ConstProp
- Pool inline constants (1e90ebb)
- Core
- Adjust virtual memory size for 32-bit (7f6a620)
- Dispatcher
- Extract 64-bit signal frame save and restore (65b6b6d)
- Fixes x86-64 SA_SIGINFO generation (8dae785)
- ELFCodeLoader
- Don't use std::random_device for RNG (f5e97f3)
- Emitter
- Remove unused header (90bcb8c)
- External
- Update JEMalloc to disable 16k pages (bbf9198)
- Externals
- Update jemalloc to 5.3.0 (9322e55)
- F64
- Fix integer immediates for add,mul,div,sub (c2325e1)
- FEXCore
- Fixup 32-bit signal handling (fa1193f)
- FEXLoader
- Adds support for execveat with AT_EMPTY_PATH (dcce9ad)
- Build FEXInterpreter and FEXLoader independently (8974509)
- FEXRootFSFetcher
- Support option to auto select first distro (a7aeb4a)
- FEXServer
- Remove POLLREMOVE usage (d2d5282)
- FileLoading
- Optimize FileLoad (28dd946)
- Frontend
- Various optimizations (787b689)
- Github
- Add ARM emitter tests to CI (da88c68)
- IR
- Removes NumArgs member from IR ops (9403c66)
- Remove HasDest member (f8e762f)
- JitSymbols
- Fixes file opening and writing (a486797)
- Fixes a crash that can occur (34e1ba6)
- Linux
- Fixes shebang file execution (477d4b6)
- MContext
- Insert a stack cookie with assertions enabled (7664359)
- OpDispatcher
- Adds support for CLWB and CLFLUSHOPT (7be2e1a)
- Fixes a few missing GPR/XMM helper usages (4aa984a)
- OpcodeDispatcher
- Handle VPBLENDD/VBLENDPS (62e6ada)
- Handle VPSRAVD (fe79f61)
- Scripts
- Update InstallFEX.py rootfs links (df87042)
- Syscalls
- Fix out-of-bounds read when handling single-line shebang files (https:/...