Releases: FEX-Emu/FEX
FEX-2301
Read the blog post at FEX-Emu's Site!
Happy new year! A new month brings a new release of FEX-Emu to ring in the new year.
A large amount of work landed this last month, showing that FEX-Emu isn't slowing down even through the holiday season.
AVX emulation work continues
An absolute ton of work landed this last month towards bringing up AVX emulation. In total there were around 185 new
AVX instructions implemented in FEX-Emu's backend this month. At this point it becomes easier to talk about the number of missing instructions
rather than what is implemented.
According to FEX-Emu's instruction decoder tables, we have around 60 more instructions to implement before we can start advertising the feature. Of course,
as with anything programming related, the last 10% is going to take the longest to implement.
A huge shoutout to @lioncash for smashing out these implementations so quickly. The amount of work going into this is
extensive.
A side note for users looking forward to this feature: the implementation now requires hardware that supports both SVE and SVE2 with a 256-bit register
width. This means that Fujitsu A64FX, Neoverse-V1, and all current consumer-class Cortex chips are incapable of taking advantage of AVX once
it is complete. This is a future-proofing implementation for when hardware becomes available that supports what FEX-Emu needs.
Implement a new AArch64 code emitter
One standout performance bottleneck has been how quickly FEX-Emu can emit AArch64 binary code to memory. The project that
FEX-Emu used for this is ARM/Linaro's vixl. It is a suite of tools including assemblers,
simulators, and disassemblers, and many open source projects use it. It is a very nice project that eases the developer's burden when writing a
JIT that targets ARM devices. Sadly, when profiling our code, it turns out that FEX-Emu spends a decent amount of time inside of vixl code due to how
obtusely large it is. Even with Link-Time-Optimization enabled in our code, we can't reduce the overhead incurred from vixl.
With this in mind, FEX-Emu decided to create its own AArch64 code emitter tailored to what the project needs, which is high performance and low
overhead.
As seen in the chart above, the difference in how long it takes to emit code between vixl and our new emitter is significant: the
Cortex-X1 takes only 68.7% of the time, and the smaller Cortex-A55 takes only 60.2% of the time. The Cortex-A55 seeing more of a win showcases
that vixl runs so much code to emit code that it effectively saturates the icache and
BTB of the poor little CPU core.
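To put those percentages in "times faster" terms, the arithmetic can be sketched as below. This only restates the two figures quoted in the text; the function name is made up for illustration.

```python
def speedup_from_time_fraction(percent_of_baseline_time: float) -> float:
    """Convert "takes P% of the baseline time" into a "times faster" factor."""
    return 100.0 / percent_of_baseline_time


# Percentages quoted in the text for the new emitter relative to vixl.
measurements = {"Cortex-X1": 68.7, "Cortex-A55": 60.2}

for core, percent in measurements.items():
    factor = speedup_from_time_fraction(percent)
    print(f"{core}: {factor:.2f}x faster code emission")
```

In other words, roughly 1.46x on the Cortex-X1 and 1.66x on the Cortex-A55 for code emission alone.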
Code emission performance isn't the only story that matters here though. We need to showcase how much of an improvement this makes when including the
rest of the translation from x86 code.
Although code emission is only a percentage of our total time spent when translating x86 code, this new emitter yields a fairly massive ~8%
reduction in time spent JITing. This will manifest as reduced stutters when users are running games and generally faster execution for
short-lived applications.
We're not stopping there, of course. Look forward to the coming months as we spend more time optimizing our JIT so it runs even faster!
Initial 32-bit thunk support
A tricky aspect of FEX-Emu's emulation is that it translates 32-bit x86 applications to run inside of a 64-bit process space. This
is a hard problem to resolve, which is why we don't currently support thunking of libraries when running 32-bit applications. This is the initial work
required to start supporting this use case.
While not wired up to any library currently, we are quickly working towards getting Vulkan and OpenGL wired up to this interface so we can accelerate
older 32-bit games.
Various JIT optimizations
There have been various JIT optimizations this month which will improve performance a small amount. These aren't benchmarked since the percentage
improvements are so small that they are likely to fall into single-digit noise.
Optimize inline syscall spilling
When FEX handles a syscall inline with our JIT, we were spilling all of our registers to memory. Now with this optimization correctly working we only
spill exactly what is required, making inline syscalls faster.
Optimize generic spilling and filling
When jumping out of the JIT to C code, we need to spill both general purpose registers and vector registers to the stack. With this optimization in place we now
generate roughly half the instructions necessary when doing so.
Optimize SVE register spilling and filling
While not utilized today, this cuts the number of instructions required for spilling SVE registers to a quarter. Should be quite nice for
future hardware.
Zip elements for PHSUB instructions
These horizontal vector instructions behave a little weirdly and our original JIT implementation wasn't quite optimal. Previously we were doing
explicit element inserts to combine the final result. Now we are using the AArch64 Zip instructions, which are significantly more efficient.
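As a rough illustration of why whole-vector permutes beat per-element inserts, here is a small Python model of PHSUBD on four 32-bit lanes. It is only a sketch of the instruction's semantics, not FEX's actual code sequence: the function names are hypothetical, and the permute variant de-interleaves even and odd lanes (what AArch64's UZP1/UZP2 do, from the same permute family as ZIP1/ZIP2), which may differ from the exact instructions FEX emits.

```python
MASK32 = 0xFFFFFFFF


def phsubd_insert_loop(dest, src):
    """Build the result one lane at a time, like a series of element inserts."""
    combined = dest + src  # pairs from dest fill the low lanes, src the high lanes
    result = [0] * 4
    for lane in range(4):
        result[lane] = (combined[2 * lane] - combined[2 * lane + 1]) & MASK32
    return result


def phsubd_permute(dest, src):
    """Same result using two whole-vector permutes and a single subtract."""
    combined = dest + src
    evens = combined[0::2]  # models a UZP1-style "take even lanes" permute
    odds = combined[1::2]   # models a UZP2-style "take odd lanes" permute
    return [(e - o) & MASK32 for e, o in zip(evens, odds)]


dest = [10, 3, 7, 7]
src = [0, 1, 100, 58]
print(phsubd_insert_loop(dest, src))
print(phsubd_permute(dest, src))
```

Both functions return the same lanes; the second needs a fixed, small number of vector operations regardless of how many elements the vector holds.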
Fix global application configurations
This was a bug where we accidentally broke application configurations shipped with the fex-emu package. In particular this caused the steamwebhelper
to break. With this resolved, Steam will work correctly again.
Fix misspelled library names in Thunks Database
While a fairly minor fix, this can have a profound impact on users that are using our thunking infrastructure. Our XCB thunks were incorrectly named,
which meant that if users were enabling XCB thunks independently of Vulkan/GL, then they wouldn't have actually been enabled.
With this typo fixed, this is no longer a concern.
Note that if Vulkan or GL thunks were enabled, then this likely wouldn't have been an issue since X11 would have loaded xcb independently anyway.
Misc
There was a bunch more this month that was smaller and spread out. We don't want to take up too much of your time so if you want to see more, make
sure to check out the detailed change log!
Raw Changes
ARM64
- Moves RA functions to header (048daa4)
Arm64
- Rename GetSrcPair, GetDst, and GetSrc (bf7d0f7)
- Enables debug option for disassembling the JIT code (03a0613)
- Inline Syscall spill optimization (0ebb15c)
- Optimize SVE register spilling and filling (1ab4471)
- Optimizing spilling and filling (9a8852f)
- Reduce dispatcher to 1 page (65e8bf9)
VectorOps
- Simplify FADDP result merging (344ec33)
Config
- Fixes global application configs (dc9737a)
Crypto
- Explicitly clear upper lane with VPCLMULQDQ (4c013c8)
Dispatcher
- Calculate REG_ERR correctly using ARM ESR_EL1 (4f313f5)
Frontend
- Handle 256-bit destination sizes directly (e8aa79b)
IR
- Handle 128-bit VInsElement with SVE (94ae2e3)
LookupCache
- Use a PMR map for our Blocklinks with monotonic allocator (b7358b4)
- Optimize cache clearing and allocation (2b6a020)
OpCodeDispatcher
- Optimize a case of GOT calculation (b42b4e0)
OpcodeDispatcher
- Handle immediate variants of VPERMILPD/VPERMILPS (3904a52)
- Handle VMASKMOVDQU (c6297ed)
- Handle VPHSUBD/VPHSUBW (4786ddc)
- Zip elements instead of for loop insertion in PHSUB (58ec2b2)
- Handle VDPPD/VDPPS (9b8c92e)
- Handle VINSERTPS (6caf764)
- Handle VMOVMSKPD/VMOVMSKPS (faa81f2)
- Handle VPUNPCKHBW/VPUNPCKHWD/VPUNPCKHDQ/VPUNPCKHQDQ (64cd377)
- Handle VUNPCKHPD/VUNPCKHPS (138f1fc)
- Handle VPUNPCKLBW/VPUNPCKLWD/VPUNPCKLDQ/VPUNPCKLQDQ (6bc1c3f)
- Handle VUNPCKLPD/VUNPCKLPS (4560c5b)
- Handle VCVTSS2SI/VCVTTSS2SI/VCVTSD2SI/VCVTTSD2SI (4a884802f86...
FEX-2212
Read the blog post at FEX-Emu's Site!
A lot of good work this month with the highlight being that we have started working on our AVX implementation and started optimizing our IR to be more efficient.
Disable PCLMUL if not supported on host
This carry-less multiplication instruction is only implemented on ARM SoCs that ship the cryptographic extension.
This extension is unsupported on the Raspberry Pi, which was causing applications that use OpenSSL to crash.
Specifically, this fixes Steam running on the Raspberry Pi again.
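For readers unfamiliar with the operation, a carry-less multiply is an ordinary shift-and-add multiply with the adds replaced by XOR, so no carries propagate between bit positions. The following is a minimal Python model of the 64x64 to 128-bit case that PCLMULQDQ computes; it is an illustration only (the function name is hypothetical) and says nothing about how FEX implements it.

```python
def clmul64(a: int, b: int) -> int:
    """Carry-less multiply of two 64-bit values, giving a product of up to 127 bits."""
    a &= (1 << 64) - 1
    b &= (1 << 64) - 1
    product = 0
    for bit in range(64):
        if (b >> bit) & 1:
            product ^= a << bit  # XOR instead of add: carries never propagate
    return product


print(hex(clmul64(0b101, 0b11)))  # partial products 0b101 and 0b1010 XORed together
print(clmul64(3, 3), 3 * 3)       # differs from the integer product
```

Without a hardware instruction for this, the loop above is roughly what software has to do, which is why it makes sense to stop advertising the feature when the host can't accelerate it.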
Adds 256-bit support to the remaining IR vector ops
A lot of work this month for implementing support for 256-bit operations.
With this work in place our JITs now support 256-bit for all of the IR operations.
Work started on AVX emulation
With the previous work completed for having our JITs support 256-bit operations, work could now be started on implementing AVX.
This AVX work is implemented as native SVE 256-bit operations, so the only hardware that can currently execute this partial implementation is Neoverse-V1 CPUs.
The expectation is that as ARM CPUs become more powerful, they will eventually support SVE with 256-bit registers.
It may take a few generations to get hardware that supports this, but if ARM CPUs want to run AVX games then they will need to support the equivalent hardware feature set.
Current instructions implemented:
- VZEROUPPER, VZEROALL
- VMOVAPS, VMOVQ
- VMOVNTDQ, VMOVNTDQA, VMOVNTPD, VMOVNTPS
- VMOVDQA, VMOVDQU
- VMOVAPD, VMOVUPD, VMOVUPS
- VMOVLPD, VMOVLPS
- VMOVSHDUP, VMOVSLDUP
- VMOVHPD, VMOVHPS
- VMOVDDUP
- VORPD, VORPS, VPOR
- VPXOR, VXORPD, VXORPS
- VANDPD, VANDPS, VPAND, VANDNPD, VANDNPS, VPANDN
- VADDPD, VADDPS, VPADDB, VPADDW, VPADDD, VPADDQ
This is just the beginning of our AVX support. Stay tuned as we implement the remaining operations over the next few months.
Generate register access IR operations directly
As an original implementation design detail, FEX implemented GPR and XMM register accesses as generic emulated CPU state accesses. Once we added
static register allocation, we also added an optimization pass to convert these generic accesses into register accesses which directly map to our
static register allocator.
This pass is redundant since we know upfront which registers are being accessed. With this change we generate register access IR operations
directly and have removed the optimization pass. This removes around 12% of JIT compilation time, which improves responsiveness and lets FEX spend less time
compiling code.
Systemd fixes
While this is a niche use case, some people may be interested in running FEXServer under systemd.
FEXServer is meant to be a user-wide server that the FEX clients talk to for rootfs and eventually other management.
Using a systemd user service, a FEXServer can be started early, letting it mount the rootfs image and run in the background.
This can be fairly useful, as FEX error logs can then be printed to journalctl to inspect why a process has crashed.
Add support for steamid based configuration files
As an ongoing effort of documenting which applications can run with FEX's OpenGL and Vulkan thunk libraries, it was determined that some applications
use generic executable names. This means that a configuration file that uses the application name would have erroneously enabled thunks for other
untested applications.
In order to work around this issue, our configuration system now supports an optional steamid based naming convention for games that are launched from
Steam. With this in place, we now have a repository that contains application configurations that users can install at their leisure. This repository
can be found on GitHub.
As part of the documentation process, all of these configurations must be documented on our Wiki with
testing results to ensure they work.
Implement SGDT
This is a quirky instruction that is emulated on a native x86 system these days. This instruction is a system instruction that is used by the OS for
getting the configuration of the global descriptor table. Linux captures this instruction and returns a configuration that says the table is living in
kernel memory space. While this is already true, an application usually doesn't need to care about this data.
Curiously enough Denuvo uses this instruction in some of their implementations for some reason. With us implementing this instruction, Denuvo games
now get slightly further before they horribly crash.
auxv fixes
When FEX executes an application, it needs to set up an emulated auxv state since this isn't a cross-architecture state.
- AT_RANDOM
- This now correctly passes through the host's AT_RANDOM value rather than fixed values
- AT_PLATFORM
- Some tooling uses this to determine if it is running as i686 or x86-64
- AT_HWCAP/HWCAP2
- This just returns some CPUID values, most applications use CPUID directly instead of this
- AT_MINSIGSTKSZ
- The minimum signal stack size is no longer a hardcoded constant size
- Applications are supposed to use this to calculate a signal stack size
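For context, the auxiliary vector is just an array of (type, value) pairs terminated by an AT_NULL entry. A minimal sketch of decoding one is shown below. The AT_* numbers are the standard Linux values; the sample buffer and its values are made up for illustration and do not come from FEX.

```python
import struct

# Standard Linux auxv type numbers for the entries discussed above.
AT_NULL = 0
AT_PLATFORM = 15
AT_HWCAP = 16
AT_RANDOM = 25
AT_HWCAP2 = 26
AT_MINSIGSTKSZ = 51


def parse_auxv(raw: bytes, is_64bit: bool = True) -> dict:
    """Decode a little-endian auxiliary vector into {type: value}, stopping at AT_NULL."""
    entry = struct.Struct("<QQ" if is_64bit else "<II")
    result = {}
    for offset in range(0, len(raw) - entry.size + 1, entry.size):
        key, value = entry.unpack_from(raw, offset)
        if key == AT_NULL:
            break
        result[key] = value
    return result


# Synthetic 64-bit vector; the values are invented for this example.
sample = struct.pack("<QQQQQQ", AT_MINSIGSTKSZ, 2048, AT_RANDOM, 0x7FFC0000, AT_NULL, 0)
print(parse_auxv(sample))
```

Entries such as AT_RANDOM and AT_PLATFORM hold pointers into guest memory rather than the data itself, which is part of why an emulator has to build this table for the guest instead of passing the host's through unchanged.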
Support radeon drm driver in ioctl emulation
Most Radeon GPUs these days use the amdgpu kernel driver, but a user found a hole in our ioctl emulation by using an old Radeon GPU on a Phytium ARM
board.
With this in-place, older Radeon cards that use the radeon kernel driver can now have accelerated OpenGL.
Misc optimizations
This month we have had a random smattering of optimizations that improve startup, shutdown, and execve performance. While they don't individually provide a
lot of benefit, small optimizations like these add up to make FEX better over time.
- Defer cpuinfo file initialization until first access
- Improves startup time
- Use tsl::robin_map for some internal maps
- Improves JIT time, and some minor shutdown performance improvements
- Disable multiblock by default
- This causes excessive JIT overhead which makes the experience worse for the user
- Significantly reduces stutters
- Improve hot path of file existence checking in syscall wrapping
- During our overlayfs handling, this can be hit quite hard during file accesses
- Improves file IO in applications
Raw Changes
Arm64
- Const on unmodified argument (9ca34ca)
- Minor optimization in AESKEYGENASSIST (c1d118c)
- Optimize Break IR op codegen (c7dd6ff)
VectorOps
- Simplify VMov IR op on SVE (70e6ab5)
CMake
- Fix typo in clang thunks option. (0030971)
Config
- Disable multiblock by default (df25d4e)
- Add support for steamid based configurations. (02ca94e)
Core
- Replace a couple maps with tsl robin_map (57c5761)
- Removes log about migrating to shared memory mode (8b6e9e0)
ELFCodeLoader
- Calculate AT_MINSIGSTKSZ (e0fe916)
- Fixes AT_PLATFORM null terminator (d7b0e84)
- Pass through AT_SECURE (8afc3b8)
- Ensure we set AT_SYSINFO for 32-bit (1d32df9)
EmulatedFiles
- Defer cpuinfo file initialization to first access (8e2b0d1)
Externals
- Update vixl submodule (f066abc)
FEXConfig
- Sort named rootfs vector (71f658b)
FEXLoader
- Make IsInterpreterInstalled check less horrible. (1dd5642)
- Disables some AOT shutdown overhead when not enabled (f8b2a0b)
FEXServer
- More Systemd fixes (5e5e5a3)
FEXServerClient
- Disable confusing connection log (cc6306a)
- Add some debug logs for when FEX can't connect to se… (3c8da3e)
IR
- Handle 256-bit VExtr (5a403b7)
- Removes the only uses of VSLI and VSRI (7d9ed4e)
- Remove VLoadMemElement and VStoreMemElement (9cee012)
- Handle 256-bit LoadRegister/StoreRegister (a9c5138)
- Handle 256-bit VAddV (04d4c5e)
IntrusiveIRList
- Add a utility helper for getting an OrderedNodeWrapper (3c88180)
IoctlEmu
- Support radeon (https://github...
FEX-2211
Read the blog post at FEX-Emu's Site!
A lot of good changes this month for our users. Both performance and compatibility improvements to be had!
Segment register index optimization
This optimization has been a long time coming, sitting in pull-request limbo since back in April. It caches segment register
addresses so the JIT can generate more efficient memory accesses. While segment registers are mostly gone with x86-64, 32-bit segment registers are
used fairly commonly, with some instructions using them completely implicitly. Without caching, this adds the overhead of fetching the LDT and GDT entries for something that
typically doesn't change very often.
With this optimization in place, we get an average of 4.3% uplift in 32-bit Bytemark. This performance improvement will be directly felt when
running 32-bit applications.
48-bit Proton Experimental fixes
For a while now FEX has worked with Proton 7.0 and older, but we have had issues running Proton Experimental in some cases.
This was a tricky problem to nail down but we had some good leads. If your ARM device was running its kernel with a 48-bit virtual address space (VA) enabled then Proton
Experimental wouldn't work. On the other hand, if your kernel was compiled using a 36-bit VA then it would run fine. After a few days of debugging, it
turns out that Proton/Wine allocates the lowest 32MB of its stack space, and the kernel by default allocates a 128MB space for the application.
When an application is run natively the stack is allocated at a fixed location in memory. FEX was failing to allocate the stack at the correct
location. By the time Wine's preloader eventually ran, FEX had allocated JIT code at that fixed location, which Wine would then map over, zeroing the
memory and breaking the FEX JIT. The preloader has done this for a long time and it was by pure chance that we weren't breaking older versions of Wine
and Proton.
With this problem fixed in FEX, we are now able to run triple-A games on AArch64, just like in the following images of God of War running on a Snapdragon
888.
Even more IR changes preparing for AVX emulation
Once again this month we have an absolute ton of commits from Lioncash working on making our JIT ready for AVX emulation. Around 25 commits went
towards this, leaving only about four more IR vector operations to handle before AVX can be supported.
Once the JITs support 256-bit operations, we can start working towards emulating the instructions themselves.
Fix thunk crashing due to insufficient stack space
When FEX starts we potentially need to allocate all memory inside of the 48-bit VA space to match how x86-64 only has 47 bits.
This intersects with our stack space allocation, which is supposed to autogrow, but we allocated that space instead. Now we give the full 128MB stack space to
FEX so it won't crash anymore.
Implements support for remaining BCD instructions
Thanks to @wannacu for implementing the remaining handful of 32-bit BCD instructions. DAA, DAS, AAA, AAS, AAM,
AAD were all missing in FEX's implementation. While BCD is fairly uncommonly used these days, they still managed to find an application that uses
these instructions. With these implemented, FEX should have all of the BCD instructions finally implemented.
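As a reminder of what these instructions do, here is a small Python model of two of them, DAA and AAM, following the architectural definitions. It is an illustrative sketch with made-up function names, not FEX's implementation, and it only tracks the AL/AH, CF and AF state relevant to the adjustment.

```python
def daa(al: int, cf: bool = False, af: bool = False):
    """Decimal-adjust AL after adding two packed BCD bytes. Returns (al, cf, af)."""
    old_al, old_cf = al & 0xFF, cf
    al = old_al
    if (al & 0x0F) > 9 or af:
        al = (al + 0x06) & 0xFF  # fix up the low decimal digit
        af = True
    else:
        af = False
    if old_al > 0x99 or old_cf:
        al = (al + 0x60) & 0xFF  # fix up the high decimal digit
        cf = True
    else:
        cf = False
    return al, cf, af


def aam(al: int, base: int = 10):
    """Split a binary value in AL into two unpacked digits. Returns (ah, al)."""
    ah, al = divmod(al & 0xFF, base)
    return ah, al


# 0x19 + 0x28 = 0x41 in binary with a carry out of the low nibble (AF set).
print([hex(v) if isinstance(v, int) and not isinstance(v, bool) else v
       for v in daa(0x41, af=True)])
print(aam(63))
```

After the adjustment, 0x41 becomes 0x47, matching the decimal sum 19 + 28 = 47, and AAM turns 63 into the digit pair (6, 3).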
Implement gpuvis timeline profiler support
While not majorly important for users, this is a very good interface for developers wanting to watch why a game has stuttered and for how long code
took to compile. This lets us take advantage of the same interface that GPU profiling events are using to see why a game missed a vsync.
This isn't enabled by default out of concern for taking too much CPU time, so it needs to be enabled with the ENABLE_FEXCORE_PROFILER cmake
option.
Fix ROR OF flag calculation
This is a fairly minor bug since not many things rely on the OF flag specifically. But in our testing of new Proton games, we found out that Denuvo
Anti-Tamper is relying on this edge case behaviour and we messed it up. While this gets Denuvo running slightly farther, it still doesn't quite work
under FEX.
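For reference, the architectural flag behaviour of ROR can be modelled as below: CF receives the most significant bit of the result, and for a rotate by one OF is the XOR of the result's two most significant bits (for other counts OF is architecturally undefined). This is a sketch of the documented x86 behaviour with a hypothetical function name; which exact edge case FEX had wrong is not stated here, so the example does not try to reproduce the bug.

```python
def ror(value: int, count: int, width: int = 32):
    """Rotate right and return (result, cf, of); None means 'unchanged or undefined'."""
    mask = (1 << width) - 1
    value &= mask
    count &= 0x3F if width == 64 else 0x1F  # x86 masks the rotate count first
    if count == 0:
        return value, None, None  # flags are left untouched
    shift = count % width
    result = ((value >> shift) | (value << (width - shift))) & mask
    cf = (result >> (width - 1)) & 1
    # OF is only defined for single-bit rotates.
    of = cf ^ ((result >> (width - 2)) & 1) if count == 1 else None
    return result, cf, of


print(ror(0x00000001, 1))
print(ror(0x80000001, 1))
print(ror(0x12345678, 8))
```

Code that branches on OF after a one-bit rotate, as an anti-tamper routine might, will take the wrong path if an emulator derives OF from the wrong pair of bits.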
Fixes FPREM1 C2 flag calculation
FPREM1 sets a flag (C2) if the number was too large to calculate in one step, which is usually not the case. Since we calculate the full
remainder, we never need to report a partial remainder. This solves an infinite loop in Mono applications that are using SIN/COS math
operations.
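To see why a wrong C2 value can hang a program, consider the retry loop that callers conventionally wrap around FPREM1. The sketch below models it in Python under some loud assumptions: Python doubles stand in for the 80-bit x87 format, `math.remainder` stands in for a full IEEE remainder, and all function names are hypothetical.

```python
import math


def fprem1_full(st0: float, st1: float):
    """Model of an emulated FPREM1 that always finishes in one step, so C2 is 0."""
    return math.remainder(st0, st1), 0


def reduce_argument(value: float, modulus: float, fprem1=fprem1_full, max_steps: int = 64):
    """Guest-style loop: keep issuing FPREM1 until C2 reports the reduction is complete."""
    for _ in range(max_steps):
        value, c2 = fprem1(value, modulus)
        if c2 == 0:
            return value
    raise RuntimeError("C2 never cleared: the real guest loop would spin forever")


print(reduce_argument(10.0, 3.0))
print(reduce_argument(11.0, 3.0))
```

If the emulated instruction returned C2 set even though the remainder was already complete, the loop above would never exit; the `max_steps` guard exists only so this sketch fails loudly instead of hanging.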
Claim X87 transcendental ops are in range
X87 will set a flag if a program tries to operate on a value that is out of range for transcendental SIN/COS/TAN operations.
FEX-Emu doesn't actually detect these for performance reasons, so we instead claim these are always in range. While not always true, if they are out of
range then we weren't detecting them anyway. This fixes an issue where glibc would do some fixups to try and bring the value in range, resulting in invalid
results.
Add missing thunk library versions
This fixes an issue where FEX thunks would try to dlopen development libraries, which are missing on most users' devices.
Fixes indirect thunks with 8+ arguments
This fixes a quite bad crash with OpenGL and Vulkan thunking where every function with 8 or more arguments would be likely to break.
Fixes thunks for a bunch of games.
Add support for disabling thunks in application configurations
This is useful for narrowing down thunk compatibility issues in certain applications. While it is still not recommended to enable thunks globally,
this allows more flexibility when tinkering with them.
Implements four more auxv values
FEX implements most of these values for applications to pull, but in some cases we didn't have these set up. Specifically, AT_PLATFORM is required
so ldconfig can work correctly. AT_HWCAP/AT_HWCAP2 is used for an application to check for CPU features, and AT_RANDOM is a 128-bit
random number that the kernel provides.
Misc
Quite a few more things that were changed this month, but this report has been going on long enough.
Raw Changes
32bit
- Fixes Debug build of VDSO (cf91ab9)
Allocator
- Expand stack space when stealing virtual address space (000677a)
Arm64
BranchOps
- Remove unused std::vector (74e18f4)
ConversionOps
- Eliminate use of temporary in Vector_FToF (b1d98f4)
MemoryOps
- Merge if statement into switch in ParanoidLoadMemTSO (199649b)
- Remove lingering unnecessary ptrue instances (40d820f)
VectorOps
- Make use of MOVPRFX where applicable (639d6e6)
- Simplify SVE VSXTL/VSXTL2/VUXTL/VUXTL2 implementations (136f1e2)
ELFCodeLoader
- Fixes Proton Experimental on 48-bit VA systems (2fa1a64)
- Implement four more auxv values (fa5322d)
External
- Update vixl submodule (fc6de5f)
FEXCore
- Adds support for a timeline profiler interface (62a24bd)
FEXServer
- Be robust against invalid packets. (64eb87e)
Flags
- Refine _Bfe's shift (abb44d3)
IR
- Handle 256-bit VInsElement (9e7daf6)
- Handle 256-bit LoadContextIndexed/StoreContextIndexed (03f0edc)
- Handle 256-bit StoreMem/StoreMemTSO/ParanoidStoreMemTSO (70a91ee)
- Handle 256-bit LoadMem/LoadMemTSO/ParanoidLoadMemTSO (8a14f87)
- Handle 256-bit LoadContext/StoreContext (d475b0b)
- Handle 256-bit VTBL1 (2332c41)
- Handle 256-bit VInsGPR (b7d9c00)
- Handle 256-bit VExtractToGPR (b3ee5db)
- Handle 256-bit Vector_FToI (7e81023)
- Check for invalid conversion masks in Float_FromGPR_S (7291b10)
- Handle 256-bit Vector_FtoF (b8f7e4c)
- Handle 256-bit Vector_FToZS/Vector_FToS (13003da)
- Handle 256-bit Vector_SToF (cb17ee9)
- Handle 256-bit VDupElement (27b022d)
- Handle 256-bit VUnZip/VUnZip2 (780e3c7)
- Handle 256-bit VZip/VZip2 (ab45d...
FEX-2210
Read the blog post at FEX-Emu's Site!
This month's release was a bit delayed because most of FEX-Emu's developers were meeting up physically at the X.Org Developer's
Conference this year! Before we talk about this month's changes we need to spend a bit of time talking about some cool things.
FEX-Emu XDC talk
This year FEX-Emu had a talk to discuss some of the weird interactions with Mesa in an emulated environment. You can see the full talk in the embedded
video.
XDC Talk
At the end of the video we showed a quick demo of (mostly!) Proton games running under FEX-Emu on a Snapdragon 888 device. You can see this demo
directly embedded below.
XDC Sizzle Reel
Ubuntu 22.04 Rootfs Mesa update
We have had to update the Ubuntu 22.04 rootfs image with a newer version of Mesa today. Unfortunately our last update with Mesa 22.2 had a bug in the
Raspberry Pi Vulkan driver which completely broke Vulkan on ALL devices, not just the Raspberry Pi. We have updated the rootfs today with a Mesa git
version of the library to work around this issue. As a benefit, this version of the FEX rootfs includes the new Venus Vulkan 1.3 driver, which can be
useful for testing.
Pick up the latest rootfs with the FEXRootFSFetcher tool.
New Lenovo ThinkPad X13s Gen 1 laptops
Last month Lenovo launched a new Snapdragon laptop that is one of the best development platforms that FEX-Emu devs could ask for. This platform is
shipping the Snapdragon 8cx Gen 3 SoC which is one of Qualcomm's most powerful chips. The only downside with this platform currently is that the GPU
doesn't yet work under Linux. There is an ongoing community effort to get the GPU up and running but these Snapdragon chips typically take a while
before support is fully in-place.
Once the GPU works then this will be a perfect platform for testing Adreno with the Turnip Vulkan driver and Freedreno. At that point we will be
shipping out these laptops to all of our devs so we have a good Vulkan development platform.
FEX-2210
Although most of our developers were at XDC, there is no shortage of code that was merged this last month.
IR changes preparing for AVX emulation
This last month had at least 32 commits preparing our JITs for emulating AVX. While AVX isn't yet wired up, this is still a required step before it is supported. We still require ARM SVE hardware that ships with 256-bit wide registers. This means the current consumer CPUs and the just-announced Neoverse-V2 won't work for our emulation here! This is future-proofing work since more games are requiring AVX to run, but we'll just need to live with the fact that new CPUs will be needed for the latest AAA games to run under FEX.
Support clang for thunks
We added support for building our thunks with clang this release. In particular the Ubuntu PPA is shipping this already. This might give a very minor perf increase but the main thing is removing a hard dependency on GCC.
Add uninstall cmake target
While it is generally advised not to install directly from a source build, users tend to still do this.
We were asked multiple times for an uninstall target, so we finally added this convenience feature.
32-bit VDSO thunking support
This is FEX-Emu's first 32-bit thunk library! This exercises most of the thunking framework to bring this feature to 32-bit, without some of the harder parts that require data repacking. Now that this is proving that our 32-bit thunking is working, it is likely that we will start working towards getting the rest of the thunks supporting 32-bit as well!
IR cleanups
While this isn't directly user facing, this makes the JIT IR a bit easier to handle, making the devs' lives easier. We've removed redundant operations that aren't necessary.
Add support for vixl simulator in CI
While we are waiting for SVE-256bit hardware to get on the market, we need CI to prove that our implementation is correct. We have once again added the vixl simulator to our source tree.
The vixl simulator supports emulating the SVE instructions at whichever register width you want. While stacking emulators isn't good for performance, it is good for ensuring correct behaviour.
Sadly ARM's simulator doesn't emulate 100% of the operations correctly, so we have had to disable a few of our unit tests. Still, it works well enough that it can pick up major mistakes.
CI functional testing
We have added functional testing of some of our thunks in our CI system. Specifically we are testing our OpenGL and Vulkan thunks to ensure they don't break. Since this is the beginning of functional testing, we currently only run vulkaninfo and glxinfo.
Soon we will be expanding this functional testing to encompass more features which will likely capture even more problems if they come up.
Map ELF files more like the kernel
The kernel has an interesting behaviour around how it maps ELF files in memory. It will always load the dynamic linker at around the highest address
it can. The primary ELF file will be loaded roughly in the middle of the address space with a bit of ASLR bias. We now emulate the same behaviour in
FEX to help with problems when running Wine. While not all the issues are sorted out, this is a good step towards making it more stable.
Fix LLVM ASAN
We had an issue with our ELF loading where LLVM ASAN was breaking due to mixing multiple mmaps in the same space. Simple bug with a simple fix. ASAN
all the things!
SMC deadlock fix
There was a fix to prevent a potential deadlock in our Self-Modifying-Code detection routines. Thanks to the developer that found this!
Lots of misc fixes this month
It would be hard to list all of the misc other fixes that happened this month. Find out more in our raw release notes!
Raw Changes
Arm64
- Fixes SVE VectorImm (ad85268)
- Centralize location for register defines (169cfbb)
VectorOps
- Make use of static predicate registers (0fee355)
CI
- Fixes struct verifier on Ubuntu 20.04 (f97a4af)
- Adds support for flakes (96fecfd)
CMake
- Add toolchain file for 32-bit cross-compiler (1ed3ecb)
- Extend AArch64 check to include arm64 (a583ebe)
Docs
- Update Release docs (c987e1e)
ELFCodeLoader
- Map primary ELF more like the kernel (b44b340)
- Fixes dynamic non-interpreter ELFs (edca528)
- Map interpreter first (71f7ff5)
ELFCodeloader
- Map once and then use MAP_FIXED to overwrite (d68b84b)
FEXConfig
- Ensure APP_CONFIG_NAME isn't stored in json (8d69f53)
FEXLinuxTests
- Adds missing pthread_cancel flake status (c262362)
- Migrate to Catch2 (8f70137)
- Build 32-bit and 64-bit test variants separately (3448c83)
- Use the build system instead of setting up compile flags via source-code annotations (d213869)
FEXServer
- Fix waiting on kernel version older than 5.3 (8f9d799)
FHU
- Convert to a interface target (dee85f1)
IR
- Handle 256-bit VSMul/VUMul (23dd056)
- Handle 256-bit VRev64 (c412d07)
- Handle 256-bit VShlI/VUShlI/VUShrI (c2b6aef)
- Handle 256-bit VSShrS/VUShlS/VUShrS (51214d1)
- Handle 256-bit VFCMPORD/VFCMPUNO (7b4b9a8)
- Handle 256-bit VFCMPLT/VFCMPGT/VFCMPLE (25a8a00)
- Handle 256-bit VFCMPEQ/VFCMPNEQ (a67f742)
- Handle 256-bit VCMPGT/VCMPGTZ/VCMPLTZ (ed8150c)
- Handle 256-bit VCMPEQ/VCMPEQZ (462a163)
- Handle 256-bit VBSL (6374175)
- Handle 256-bit VSMax/VUMax (8d8b029)
- Handle 256-bit VSMin/VUMin (aa6a499)
- Handle 256-bit VNot (64c4fdc)
- Handle 256-bit VFNeg (d715ffb)
- Handle 256-bit VNeg (1799d4c)
- Handle 256-bit VFRSqrt (dacd96c)
- Handle 256-bit VFSqrt (ca4d3bf)
- Handle 256-bit VFRecp (ea38b04)
- Handle 256-bit VFMax (a39746d)
- Handle 256-bit VFMin (2367a8e)
- Handle 256-bit VAddP (cb121d7)
- Handle 256-bit VFDiv (50eba40)
- Handle 256-bit VFMul (4472265)
- Handle 256-bit VFSub (3f8b872)
- Handle 256-bit VFAddP (e573ddc)
- Handle 256-bit VFAdd (eedbde6)
- Handle 256-bit VPopcount (4e441e5)
- Handle 256-bit VAbs (3e287a3)
- Removes Mov IR op (46bde40)
- Removes VExtractElement (01beac4)
- Removes unnecessary VBitcast IR op (fcd981e)
- Removes SplatVector{2,4} (2b9cc96)
- Removes VInsScalarElement (82eba22)
Interpreter
- Handle 256-bit VSShr/VUShl/VUShr (4d6e15d)
- Use constant for AVX register size where applicable (808e1c0)
- Handle 256-bit VMov (412793c)
- Handle 256-bit VAnd/VBic/VOr/VXor (b5cb429)
JITs
- Handle spilling/filling 256-bit vectors (6742e0c)
- Expand max spill slot size to 32 bytes (0d0d116)
SMC
- Fix possible deadlock (8da9ebc)
Scripts
- Updates DefinitionExtract (3977e1f)
StructVerifier
- Fixes CI failure (d4b5bf0)
ThunkLibs
- X11/Xext: Removes two functions that don't exist on 32-bit (cc4c705)
Thunks
- Add support for building with clang (2b1ef97)
- Adds dependency on linker script (eaddf7f)
- Implement the Thunk IR op for 32-bit mode (1ea00f6)
- Adds functional thunk testing to CI (a590977)
Host
- Adds bool operator to fex_guest_function_ptr (3237de3)
gen
- Use fmt for writing formatted output (704afed)
libvulkan
- Fixes print for 32-bit (d8c2a82)
VDSO
- Fix vsyscall (5cf5940)
VectorOps
- Handle 256-bit VURAvg (977d6dd)
- Handle 256-bit VUMinV (0261ed3)
- Extend VSQAdd/VSQSub/VUQAdd/VUQSub (f34f130)
- Extend VAdd/VSub (0ad52b7)
Misc
- Add opencl thunk db (b693112)
- 32-bit VDSO support (6f6f3c9)
- Update vixl external (832a320)
- Move thunk generator logic from A...
FEX-2209
Read the blog post at FEX-Emu's Site!
A lot of miscellaneous work this month that isn't directly user facing. We do still have some interesting topics this month that some people will be
interested in.
Simplify StealMemory functions
A fairly significant change this month is reducing the time it takes FEX to set up its memory upon load. FEX needs to do an initial setup of the memory when an application loads
because between x86-64, x86, and AArch64 the memory layouts are significantly different.
Depending on the architecture of the application, FEX needs to allocate a large amount of memory to emulate the x86/x86-64 memory behaviour.
On 32-bit x86
- We need to allocate all memory above 32-bit memory space
- This is because we emulate 32-bit applications as a 64-bit AArch64 application
On 64-bit x86-64
- We need to allocate all memory in the 48-bit virtual address space
- This is because AArch64 supports the full 48-bit space for the user
- x86-64 userspace only receives 47-bit
- Applications rely on not receiving 48-bit pointers!
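As a rough illustration of the ranges described above, the sketch below computes which part of a 48-bit AArch64 address space would need to be reserved so the guest never receives it. The function name is hypothetical, and reading the 64-bit case as "everything between the 47-bit and 48-bit limits" is an assumption for illustration, not FEX-Emu's actual code.

```python
def reserved_range(guest_is_64bit, host_va_bits=48):
    """Return the [start, end) host address range to keep away from the guest."""
    host_end = 1 << host_va_bits
    # x86-64 userspace receives 47 bits of address space; 32-bit x86 receives 32.
    guest_bits = 47 if guest_is_64bit else 32
    return (1 << guest_bits, host_end)


for is_64bit in (False, True):
    start, end = reserved_range(is_64bit)
    print("64-bit guest" if is_64bit else "32-bit guest", hex(start), hex(end))
```

The 32-bit case covers a far larger range than the 64-bit case, which is why the two paths are measured separately.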
From this graph showing the amount of CPU time spent in each routine, we can see a significant reduction in time to execute.
For 32-bit and 64-bit specific operations this results in a ~70x and ~181x reduction in execution time, respectively!
How well does this improve execution time in practice though?
This graph is showing the total time it takes to run applications fully through. The smallest test applications have shaved off around 75% - 85% of their execution time. The biggest improvement
comes from Proton setting up its execution environment. Proton's underlying execution environment is called pressure-vessel which executes hundreds
of background applications while setting up. This is one of the worst cases for FEX since each independent application execution needs to JIT new code
and handle all of its state setup. This case reduces the execution time from around 21 seconds down to around 17 seconds! This can really
be felt when executing back-to-back Proton instances when testing games!
While this is a significant step in the right direction, FEX still has a ways to go to hit the native execution time of pressure-vessel which can take
as little as one second.
More AVX work
A bunch more work has gone in to supporting AVX emulation. This is still preliminary backend work for now.
-
HostRunner
- Handle upper YMM lanes in sigsegv handler
-
InterpreterOps
- Extend SSAData size to accommodate 256-bit operations
-
VectorOps
- Extend VAnd/VBic/VOr/VXor
- Extend VMov
- Extend VectorImm
- Extend VectorZero
Thunks
X11
Some fairly minor changes here that improve usability of thunks with Proton. We added more Xlibint functions to the thunks which fixes X11 thunking
with DXVK. X11 is required for both Vulkan and OpenGL thunking so having this working is necessary when running those games.
Another necessary change for supporting thunks with Wine/Proton is more aggressively supporting X11 functions which require variadic arguments. There
are quite a few of these functions sprinkled around that require this. While we supported these functions with open-coded support up to 7 arguments,
we need to support at least up to 14 arbitrary arguments in some instances. We now have some assembly code in place which can support an arbitrary
number of arguments by packing these in memory the expected way. While this only works for 64-bit integers, it's all that we need for X11.
With both of these features implemented, OpenGL and Vulkan thunking both work with Proton.
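The idea of packing variadic arguments into memory can be sketched as follows. This is only a model of the concept: the real implementation is assembly, and the function name and the little-endian 8-byte slot layout used here are assumptions for illustration.

```python
import struct


def pack_variadic_args(args):
    """Pack integer arguments into consecutive 8-byte little-endian slots."""
    # Mask each value to 64 bits so negative integers wrap like a C uint64_t cast.
    return b"".join(struct.pack("<Q", a & 0xFFFFFFFFFFFFFFFF) for a in args)


packed = pack_variadic_args(range(14))
print(len(packed), "bytes for 14 arguments")
```

Because every slot has the same size, the number of arguments is not limited by a fixed set of open-coded cases.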
VDSO
While this is implemented as a thunk on the FEX side, it behaves slightly differently than normal thunks. This will always be enabled as long as FEX
can load the VDSO-host.so library installed on the system. Due to the nature of VDSO, all applications always have a VDSO region provided by the
kernel. FEX wants to provide fast emulation of this "library" since applications abuse it heavily for performance. This was noticed when
running Proton games, which abuse clock_gettime very heavily, causing significant CPU overhead. Applications were calling this VDSO
function hundreds or thousands of times a second. The thunk significantly lowers the amount of time spent in the kernel for timing functions.
getdents syscall emulation
AArch64 doesn't support this syscall but in most cases applications don't use it. This is because there is a much more modern syscall called
getdents64 that everything uses now. When running older compiled applications they are likely to use the classic syscall. Since AArch64 doesn't have
the classic version, we now emulate it entirely using getdents64, which fixes running applications from CentOS 7.
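To show what emulating the classic syscall on top of getdents64 involves, here is a minimal sketch that rewrites a buffer of linux_dirent64 records into the classic linux_dirent layout. It assumes the x86-64 layout only (8-byte d_ino and d_off; the 32-bit layout differs), the helper names are hypothetical, and it is not FEX-Emu's actual code. The key difference is that d_type moves from the header to the last byte of each record.

```python
import struct

DIRENT64_HEADER = struct.Struct("<QqHB")  # d_ino, d_off, d_reclen, d_type
DIRENT_HEADER = struct.Struct("<QQH")     # d_ino, d_off, d_reclen (x86-64 layout)


def make_dirent64(ino, off, d_type, name):
    """Build one linux_dirent64 record (test helper for this sketch)."""
    reclen = (DIRENT64_HEADER.size + len(name) + 1 + 7) & ~7
    record = bytearray(reclen)
    DIRENT64_HEADER.pack_into(record, 0, ino, off, reclen, d_type)
    record[DIRENT64_HEADER.size:DIRENT64_HEADER.size + len(name)] = name
    return bytes(record)


def dirent64_to_classic(buf):
    """Convert packed linux_dirent64 records to classic linux_dirent records."""
    out = bytearray()
    pos = 0
    while pos < len(buf):
        d_ino, d_off, d_reclen, d_type = DIRENT64_HEADER.unpack_from(buf, pos)
        raw_name = buf[pos + DIRENT64_HEADER.size:pos + d_reclen]
        name = raw_name.split(b"\0", 1)[0]
        # Classic record: header, name, NUL, padding, then d_type in the last byte.
        reclen = (DIRENT_HEADER.size + len(name) + 2 + 7) & ~7
        record = bytearray(reclen)
        DIRENT_HEADER.pack_into(record, 0, d_ino, d_off & 0xFFFFFFFFFFFFFFFF, reclen)
        record[DIRENT_HEADER.size:DIRENT_HEADER.size + len(name)] = name
        record[-1] = d_type
        out += record
        pos += d_reclen
    return bytes(out)


classic = dirent64_to_classic(make_dirent64(42, 1, 8, b"a.txt"))
print(len(classic), classic[-1])
```

A real implementation also has to respect the size of the guest's buffer, since converted records can differ in length from the originals.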
Misc
- Fix compiling without jemalloc
- Thunks are unsupported without jemalloc but we need to keep it compiling
- Consolidate generated files to one file per platform
- Nice code cleanup for developers
- Minor cleanups for signature-based function pointer thunking
- Support direct thunk config in configuration files
- This improves the user experience with enabling thunks for application configurations
- No need for two files to describe one thing now
Raw Changes
-
64BitAllocator
-
Fixes a significant state tracking perf problem (123b672)
-
Allocator
-
Simplify StealMemory, make it less chatty with kernel space (04678f8)
-
Arm64
-
JIT
-
Rename CanUseSVE to HostSupportsSVE (7d8950d)
-
CI
-
Build Thunks (e544591)
-
FDUtils
-
Don't make unknown get_fdpath fatal (336dedb)
-
FEXRootFSFetcher
-
Fix crash if curl fails to download rootfs definition file (31fefaa)
-
FEXServer
-
Support socket path override (a2f4f49)
-
Github
-
Fix fresh runner rootfs checkout (d619968)
-
HostRunner
-
Handle upper YMM lanes in sigsegv handler (d5c83a2)
-
IRLoader
-
TestHarnessLoader
-
Don't build if not building tests (097184c)
-
InterpreterOps
-
Extend SSAData size to accomodate 256-bit operations (98dbfbe)
-
Linux
-
Emulate classic getdents syscall for x64 and x32 (9de25c2)
-
Syscalls
-
Use underscored shm syscall names (bbcca80)
-
Termux
-
Add android-shmem library (0adbe31)
-
Thunks
-
Consolidate all generated code to one file per library per platform (c17da25)
-
Adds VDSO thunk library (53623ff)
-
Minor cleanups for signature-based function pointer thunking (e6acdcc)
-
Support direct thunk config in configuration files (84a95ad)
-
Fix compile without jemalloc (d5138f5)
-
X11
-
Support Variadic stack packing (fbb008e)
-
Adds missing XLibint functions (998a3d8)
-
VectorOps
-
Extend VAnd/VBic/VOr/VXor (e776f4c)
-
Extend VMov (e7d7dd1)
-
Extend VectorImm (8439cf4)
-
Extend VectorZero (d03b6a9)
-
Misc
-
New domain. (7f9edbf)
-
x86_64/JIT: Resolve lingering fmt deprecation warning (37ccb13)
-
cmake
-
fix incorrect assumption about the value of git's core.abbrev (c03a7fd)
-
unittests
-
Support skipping unit tests based on host feature support (1fe6fc3)
-
ThunkLibs
-
Fix warning about "dangerous" use of tmpnam (12fee91)
FEX-2208
Read the blog post at FEX-Emu's Site!
Some really exciting changes this month. Thunk stabilization out of the gate is a huge boon to tinkerers and a bunch of other things spread
throughout!
Thunk improvements
The amount of work to reach this point can't be overstated. @neobrain has been putting forth a bunch of infrastructure
work over the past few months which hasn't been super visible to the end user. This month it culminated in fixing a bunch of stability problems
with thunks.
One of the biggest problems ends up being when a pointer is passed between the guest and host through thunks. X11's XFree function is used for both
x86 specific pointers and AArch64 pointers. This ends up being a problem since FEX-Emu uses jemalloc for its allocator while a guest application is
highly likely to use glibc's allocator. Freeing a pointer with the allocator that didn't create it will crash. We now distinguish between a "Guest" pointer and a
"Host" pointer in our thunk of XFree, which significantly improves stability of X11 thunks.
We also now have our libGL thunk implicitly load libX11. Due to how FEX's thunks work, we don't pull in all of the library dependencies of a "real"
library like libGL. Long term FEX-Emu will likely want to link to the same libraries that the real x86-64 library would have linked to. For now, libGL
relies on libX11 directly. This fixes an issue where libGL thunks wouldn't work at all for any game launched from Steam.
Then there is the big pull request that does an absolute wackload of infrastructure work to make things work: Implement signature-based thunking of function pointers. This pull request is a bit complex to explain, but it allows function pointers to be marshalled
across the thunk architecture boundary safely, so both x86-64 and AArch64 code can call in to the other side whenever necessary. I
would recommend reading the pull request itself because the information is quite dense there.
Then some very minor changes that fix some edge case behaviour.
- Make glXGetProcAddr of unknown address non-fatal
- Make glXGetProcAddr querying itself work
- Add some glX and GL functions that were missing.
- Nearly everything supported now, just some minor things missing
The take away for this effort is that OpenGL thunking is now significantly more stable. We don't have a lot of games tested yet, but follow along at
our wiki for documenting which games support thunks under FEX-Emu.
When OpenGL or Vulkan thunks are enabled, games are dramatically sped up. It is the recommended way to play a game but we need more testing coverage
to ensure it is stable enough to use.
Fix edge-case instruction faulting behaviour
x86 has six instructions that explicitly fault in different ways. We weren't handling RIP setting correctly on some of these.
This was uncovered by running Elden Ring inside of FEX. With this bug fixed FEX-Emu can now run the game if Denuvo is disabled by renaming the
executables. Sadly it doesn't seem like Snapdragon can run this game yet.
AVX initial implementation details
@lioncash has been working away at making this feature a reality. Newer games coming out are starting to require AVX
to run and FEX needs to support this. This is a necessary feature since we need to claim compatibility for any CPU feature that
games use. So far we have found only a few games that require the feature, but it is going to become more common. The latest generation of game
consoles will drive this feature over the next decade of releases.
This sets up the initial groundwork inside of FEX-Emu to support the feature, with instruction implementations coming in the next months.
Don't expect this to run on any of your ARM devices any time soon though. We are going to require SVE with 256-bit register width to expose it. All
current hardware either doesn't support SVE at all, or only supports a 128-bit register width. Look forward to future hardware that ships with this
feature.
FEXConfig quality of life improvements
This is cleaning up some of the rough edges inside of FEXConfig, making it easier to modify your config. Significantly fewer keypresses to open a
config! Perfect!
Fix SOMA again
Due to how FEX changed some of its threading logic, we broke SOMA abusing the SETXID signal. This is now resolved and the game runs under FEX-Emu
again.
Fix Static Register Allocation in signal handlers
Applications relying on signals would very likely have crashed for a while due to this regression. With this resolved, these applications are stable again.
FEXRootFSFetcher automation options
FEXRootFSFetcher now has command line options for automatically choosing distro image and setting it up in config. This is useful for containers
embedding FEX and wanting a fresh rootfs instead of shipping it. It also does a runtime check for any image tooling programs before executing, fixing
a spurious error message.
Fix hang in Proton from close_range syscall
Proton would sometimes hang due to a close_range syscall trying to close all file descriptors. FEX is now a bit smarter about closing FDs with this
syscall, which fixes the bug.
FEXBash change PS1 description
When running FEXBash it can be confusing for new users how it is behaving. Now we show the current operating path, user, and FEXBash as a prefix.
Hopefully this makes it clearer to new users that FEXBash isn't a VM or a docker-style chroot.
Support Linux 5.19 passthrough
Not much changed from FEX's perspective here. Some minor DRM changes that naturally work themselves out.
Only initialize perf map file if profiling is enabled
Sorry about filling up your /tmp folder with empty files. This has been resolved.
Add pidfd_open syscall wrapper for compiling on older Linux distributions
Thanks to @wannacu for this fix. Our testing of older Linux distributions is quite spotty since FEX-Emu only officially
supports back to glibc 2.31 distros. This adds a wrapper utility for this syscall so older versions of Linux can compile FEX. Your mileage may vary
though, since FEX-Emu doesn't officially support it.
Developer specific improvements
Remove static-pie
Static pie will never work due to glibc limitations around dlopen that FEX needs for thunks. Remove the option entirely to ensure no one tries to use
it.
FMT updated to 9.0.0
Newest release is best release
Bitness of syscall handler improvements
We were setting up both 32-bit and 64-bit syscall handlers upon initialization. Now we only initialize one or the other. This saves a bit of memory and
shaves some microseconds off startup time.
Vulkan-Headers included in External
It's no longer required as an install dependency on the host. We need to handle all Vulkan signatures, not just what is available on the host.
Makes compiling Vulkan thunks slightly less painful.
Misc
- Cortex-X1C supported in CPUID
- Support a globally installed Config file
- Support installing many json files from FEX Data folder
- Disable UnitTestGenerator since it is unused
- Assume optimizing LogManager assertion functions
- Support executable names being picked up for wineserver.
- Useful to see if a Windows game is doing something that our telemetry picks up
- Removing last usage of raw IR arguments in emulation backends
- x86_64: Migrate args over to named IR arguments
- json_ir_generator: Remove Args() functions from IR structs
- These make it a lot easier to see what arguments are being used and why
Raw Changes
-
AppConfig
-
Fix bug with filename (ac23bce)
-
Arm64
-
JIT
-
Remove unnecessary [[maybe_unused]] attributes (89aa590)
-
Arm64Dispatcher
-
Amend memcpy in SpillSRA (f5e18cc)
-
Arm64Emitter
-
Re-add use of stp/ldp with hosts that don't support SVE2 (8589119)
-
CMake
-
Support multiple json files in the root of Data/ (bd296d7)
-
CPUID
-
Detect Cortex-X1C (c9f0ecb)
-
Config
-
Support a global configuration file (3a64ea1)
-
Dispatcher
-
Fix SRA enabled check in signal delegator handlers (0dfe617)
-
Externals
-
Update fmt to 9.0.0 (54f62b6)
-
FEXBash
-
Changes PS1 to hopefully help users (a72ebfd)
-
FEXConfig
-
Some quality of life improvements (164299c)
-
FEXCore
-
Fix-up edge case behaviour on faulting instructions (8037231)
-
Adds assume optimizing LogManager function (3d347ed)
-
Support synchronizing RIP on block entry through config (80909ea)
-
FEXRootFSFetcher
-
Adds runtime checks for image mounting tools (cfd59db)
-
Actually wire up -a -x (3aabe077...
FEX-2207
Read the blog post at FEX-Emu's Site!
This is going to be a very interesting release this month for users. Quite a large number of features landed for this release!
Automatic TSO mode migration
When FEX is running a single threaded application, we can be optimistic and disable heavy TSO-emulation related features. This significantly speeds up some single threaded applications. Once the program creates a thread, FEX will disable this optimization and clear its code cache to be safe.
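The migration described above can be modelled as a small state machine. This is a toy sketch with hypothetical class and method names, not FEX-Emu's implementation: blocks are compiled without TSO emulation until the first thread is created, at which point TSO emulation is switched on and previously compiled blocks are discarded.

```python
class TSOState:
    """Toy model: TSO emulation stays off until a second thread exists."""

    def __init__(self):
        self.tso_enabled = False
        self.code_cache = {}

    def compile_block(self, rip):
        # Record which memory model the block was compiled under.
        self.code_cache[rip] = "tso" if self.tso_enabled else "relaxed"
        return self.code_cache[rip]

    def on_thread_create(self):
        if not self.tso_enabled:
            self.tso_enabled = True
            # Blocks compiled without TSO are no longer safe; drop them all.
            self.code_cache.clear()


state = TSOState()
print(state.compile_block(0x1000))
state.on_thread_create()
print(state.compile_block(0x1000))
```

Clearing the whole cache is the conservative choice: any block compiled under the relaxed model might be run by the new thread.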
EroFS rootfs image support
While FEX has supported SquashFS for a long time, we are now adding support for EroFS as well. The big advantage of EroFS is that it doesn't serialize accesses to a single thread. When you're having dozens of threads accessing the filesystem this is a real bottleneck. Low end devices would end up having a single CPU core maxed out inside of the squashfuse application while multiple threads are trying to request data.
erofsfuse solves this by allowing multi-threaded decompression that scales quite well depending on the number of file requests in flight. We can see how this scales in the following benchmark graphs.
As one can see, erofsfuse scales quite well with multiple threads, while squashfuse stays pretty much flat the entire time. The downside to EroFS is that the compression ratio of its LZ4HC compression isn't quite as good as ZSTD, causing the rootfs to be larger. But the reduction in memory usage and lower read amplification, plus higher bandwidth, is worth it. This seriously improves the performance of using a rootfs over a network mapped share like some people do.
An additional problem is that the erofsfuse application requires erofs-utils version 1.5, which came out on 2022-06-13. This is really bleeding edge currently.
Nevertheless, FEXRootFSFetcher will now allow you to download a FEX Rootfs image with this compression format. Just ensure you have erofsfuse installed.
FEXServer
This is a fairly significant change to how FEX-Emu operates in the background. Similar to how wine has a wineserver, FEX is now requiring a FEXServer
to always be running.
For now the FEXServer is taking over duty for rootfs image mounting and a logging server. In the future this will be expanded to also handle code
caching services and more. FEXServer will automatically start on invocation of FEX and be running in the background until all instances of FEX close.
Pressure-vessel and Proton Fixes
FEX-Emu now officially works inside of pressure-vessel. This is the tool that Steam uses for running Proton games. Thunking doesn't yet work in this case but it is coming.
If you want to test Proton games, make sure to sign up to the latest SteamLinuxRuntime_soldier beta in the settings and give it a go. It's not currently the speediest, but it should work.
Disable FEXServer rootfs when running under pressure-vessel
Pressure-vessel sets up an x86-64 rootfs. FEX shouldn't be using the FEXServer provided rootfs in this case.
We now detect when running inside of pressure-vessel, and disable the FEXServer RootFS.
Enable Hypervisor bit
This change allows pressure-vessel to detect FEX-Emu and do FEX specific setup for games.
Fix open syscall path emulation
The open syscall is fairly rarely used so this has gone unnoticed for a while. We weren't wrapping this syscall in our filesystem emulation, which was breaking applications. With this fixed, the latest Proton Experimental branch from Steam now works!
Support thunks in pressure-vessel
Pressure-vessel uses a bunch of environment variable overriding to replace where libraries are inside of its chroot. Support this inside of FEX. While this is a step to getting Thunks working inside of pressure vessel, it is not yet supported.
Thunks
Lots of improvements to thunks; it's hard to capture them all. There is a heavy amount of infrastructure work going on in here to make thunks more robust and stable, starting with Vulkan and GL.
- Work around lack of generic callback support in VK_EXT_debug_report (4771a34)
- Disable debug report callback (751b66d)
- Allow building thunks on a wider range of platforms (ad6fd5a)
- Add fex:is_lib_loaded (88b94be)
- Support returning host function pointers to the guest (04a1ac9)
Fix clone3 syscall's stack pointer again
In an edge case of how FEX-Emu handles clone3, it wasn't handling the stack pointer size correctly again.
Resolving this edge case once again gets Steam's web helper working with glibc 2.34.
Fix 32-bit memory allocation range scanning
When scanning for free chunks of memory in the 32-bit range, FEX-Emu needs to use a custom allocator to ensure everything returned ends up in the lower 32-bit memory space. This fixes a bug where large allocations would never find an empty space. Fixes X-Plane 11!
Optimize file descriptor to filename mapping
It is a common occurrence that FEX needs to map an open file descriptor back to a file path. This used to take 14 system calls.
Since each system call was querying filesystem metadata these could take some time. With this optimized approach it now takes only one system call instead. Significantly lowering file IO overhead!
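One common way to resolve a file descriptor to a path with a single system call on Linux is to read the `/proc/self/fd/<fd>` symlink. Whether this is exactly what FEX does is an assumption; the sketch below only illustrates the single-call approach, and the function name is hypothetical.

```python
import os
import tempfile


def fd_to_path(fd):
    """Resolve an open file descriptor to a path with one readlink call."""
    return os.readlink(f"/proc/self/fd/{fd}")


with tempfile.NamedTemporaryFile() as f:
    print(fd_to_path(f.fileno()))
```

The kernel already tracks the path behind each descriptor, so no per-component filesystem metadata queries are needed.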
Enable Wine application profiles
When executing, Wine applications typically only showed up as wine or wine-preloader to FEX-Emu.
Now we work around this issue by scanning the arguments to find the executable name, which allows application profiles to function.
Now we can easily support SonicMania.exe.json!
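A minimal sketch of the argument scan, assuming the rule is "if the program is a Wine loader, use the first argument ending in .exe". The function name, the loader set (the wine64 variants in particular), and the exact matching rule are assumptions for illustration rather than FEX-Emu's actual logic.

```python
import ntpath
import posixpath

# wine and wine-preloader come from the text above; the wine64 names are assumed.
WINE_LOADERS = {"wine", "wine-preloader", "wine64", "wine64-preloader"}


def profile_name(argv):
    """Pick the name used to look up an application profile."""
    program = posixpath.basename(argv[0])
    if program not in WINE_LOADERS:
        return program
    for arg in argv[1:]:
        if arg.lower().endswith(".exe"):
            # ntpath handles both backslash and forward-slash separators.
            return ntpath.basename(arg)
    return program


argv = ["/usr/bin/wine-preloader", "/usr/bin/wine", "C:\\Games\\SonicMania.exe"]
print(profile_name(argv) + ".json")
```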
FEXRootFSFetcher fix to file hashing
It was discovered that this tool was hashing files incorrectly. The new version is now hashing correctly and image files have been updated to use the new hash. Nothing to see here.
Fix 32-bit DRM ioctl DRM_IOCTL_WAIT_VBLANK
This ioctl does exactly what it says on the tin. Due to a copy and paste error, this wasn't actually waiting on vblank.
Fix 32-bit ioctl structure copying
A feature of the DRM subsystem allows you to extend ioctl struct definitions safely. The kernel knows the size of the ioctl structure and if it
differs from what the userspace application passes in, then it will only copy the smaller amount of data and zero out the rest.
This allows older userspaces to safely work with newer kernels. FEX wasn't reproducing this with its ioctl emulation in some V3D ioctls, resulting in unsafe execution of ioctls. This has been resolved.
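The copy rule described above can be written down in a few lines. This is a sketch of the rule itself using plain byte strings, with a hypothetical function name; it is not the kernel's or FEX-Emu's code.

```python
def copy_ioctl_struct(user_data, kernel_size):
    """Copy min(user, kernel) bytes and zero-fill the rest of the kernel struct."""
    copy_len = min(len(user_data), kernel_size)
    return user_data[:copy_len] + bytes(kernel_size - copy_len)


# An older userspace passes a 2-byte struct to a kernel expecting 4 bytes.
print(copy_ioctl_struct(b"\x01\x02", 4))
# A newer userspace passes 5 bytes to a kernel that only knows 4.
print(copy_ioctl_struct(b"\x01\x02\x03\x04\x05", 4))
```

Skipping the zero-fill would leave the tail of the structure uninitialized, which is the kind of unsafe behaviour this fix addresses.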
Support CLMUL Extension
This instruction is heavily used to accelerate CRC and other hashing algorithms. It perfectly matches the AArch64 instruction as well, so
implementing this was very straightforward!
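For readers unfamiliar with the operation, a carry-less multiply is an ordinary shift-and-add multiply in which the additions are replaced by XOR. The sketch below is a bit-by-bit reference model of a 64-bit by 64-bit carry-less multiply; the function name is hypothetical, and real hardware does this in a single instruction.

```python
def clmul(a, b):
    """Carry-less multiply of two 64-bit values, giving up to a 127-bit result."""
    a &= (1 << 64) - 1
    b &= (1 << 64) - 1
    result = 0
    while b:
        if b & 1:
            # XOR instead of add: no carries propagate between bit positions.
            result ^= a
        a <<= 1
        b >>= 1
    return result


print(bin(clmul(0b11, 0b11)))
```

With ordinary multiplication 3 * 3 is 9 (0b1001), while the carry-less product is 0b101 because the two middle partial-product bits cancel under XOR.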
Self-modifying-code frontend improvements
Allows FEX to track code pages inside our frontend decoding. This fixes some issues where code can be changed while we are decoding things in the frontend. Now FEX can detect this and throw away what it compiled.
Developer specific improvements
Check for binfmt_misc conflict before installing
To ensure building from source doesn't result in a broken configuration, cmake will now check for conflicting binfmt_misc files before installing.
How to uninstall the conflicting binfmt_misc files is specific to how the user has installed them, so it is left up to them to find out how.
Auto CI fetching
If the CI systems need an updated rootfs, the config can now be updated and they will fetch the latest.
unittests no longer recompile forever
ASM unittests would always reglob on building, which took time. This is now fixed.
Fix ASAN bug in how register allocation data was allocated
This was hard to track, finally this annoying bug that has gone back and forth a bit has been resolved!
ARM64 CPU feature detection for ASM unit tests
Automatically disables some incompatible unit tests on ARM64 devices that don't support some features. No more confusing failures.
GDB integration
This allows a plugin to be loaded in GDB to show more information than we would otherwise have, giving us both backtraces and source inside of GDB
even through the JIT. This should make debugging the JIT that much easier.
Raw Changes
-
AOTIR
-
Fix IRList delete (fb41ba1)
-
Fix RAData free (9242e59)
-
Arm64
-
EncryptionOps
-
Fix register specifiers in PCLMUL movs (63b70ff)
-
JIT
-
Use IR names in opcode implementations (19b0a9c)
-
Backends
-
Unified dispatch, interface rework, cleanups (072690a)
-
CI
-
Auto rootfs fetching (c027ace)
-
CMAKE
-
Create directories during configuration, fixes endless generation of unittests (e62bc24)
-
CMake
-
Check for binfmt_misc conflicts before install (6d2f98a)
-
CPUID
-
Enable the hypervisor bit (da8dbf1)
-
Common
-
Support application profiles for games launched through wine (3913dd6)
-
Config
-
Fixes AppConfig for wine-preloader (ae6a57e)
-
Context
-
Fix CreateThread partial initialization issue (eac579f)
-
Decouple from CodeLoader, introduce generic CustomIREntrypoints (https://git...
FEX-2206
Read the blog post at FEX-Emu's Site!
Quite a large amount of changes this month since we cancelled last month's release.
Steam's webhelper working again
Steam started enabling the Chromium sandbox. Seccomp isn't supported in FEX-Emu so it was crashing early on.
We now forcibly stop it from using the sandbox through an application profile.
This lets the game library be visible again, although it can take a while to appear.
Fix LRCPC and add support for LRCPC2
There was a bug in our CMPXCHG implementation where it accidentally wasn't using ARM's acquire-release semantics.
Fixing this bug allowed us to reenable our TSO emulation using LRCPC.
Additionally we have added support for LRCPC2 which gives us some immediate encoded instructions to further reduce overhead.
On hardware that supports LRCPC these can result in a reasonable performance uplift.
SHA-1 and SHA-256 instructions implemented
These SHA instructions have been implemented and the CPUID bit is now exposed.
This is a GPR based implementation; an implementation using AArch64's equivalent SHA instructions will be implemented at a later time.
Self-modifying code support improvements
Many things have changed with supporting self-modifying code in a more extensive fashion.
FEX-Emu will now track guest allocations of executable memory and when the code has been modified, we will clear the JIT caches.
This happens for both true self-modifying code and also libraries being loaded.
Fault handling is employed to know when code is modified in memory to ensure we can track changes.
This is a new setting in FEXConfig called mtrack. The older syscall only tracking path is deprecated but still available for testing.
Option to emulate x87 with 64-bit float operations
Big shout out to CallumDev for implementing this long awaited feature.
A major performance problem of emulating x86 is that any older game will be compiled to use the x87 extension. This is especially true for 32-bit games.
The problem with this extension is that by default it uses 80-bit floats, which AArch64 doesn't support.
We end up emulating this entire extension using a soft-float implementation, which while being quite accurate, is obscenely slow.
This performance hack is now available to remove a significant amount of the overhead by operating x87 instructions using 64-bit float scalar
operations instead.
This is known to be inaccurate, but most Windows games will actually be configuring the x87 unit to be lower precision than 80-bit.
Additionally most games don't actually need the extra precision that 80-bit provides, so it is usually safe to emulate it more inaccurately.
This may still have some bugs; we know of at least one game that has issues that aren't explained by pure precision problems. The feature can be enabled
in FEXConfig under the Hacks tab; look for "X87 Reduced Precision".
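To make the precision difference concrete: the x87 80-bit format carries a 64-bit significand, while a 64-bit double carries 53 bits. The sketch below checks whether a positive integer fits exactly in a significand of a given width; the helper name is hypothetical and the check deliberately ignores exponent range.

```python
def fits_in_significand(value, bits):
    """True if a positive integer is exactly representable with `bits` significand bits."""
    # Strip trailing zero bits; what remains must fit in the significand.
    while value and value % 2 == 0:
        value //= 2
    return value.bit_length() <= bits


value = 2**53 + 1
print(fits_in_significand(value, 64))  # x87 extended precision significand
print(fits_in_significand(value, 53))  # 64-bit double significand
print(float(value) == value)           # Python floats are 64-bit doubles
```

A value like this survives an 80-bit x87 computation but is rounded when the same operation is done with 64-bit scalar floats, which is the inaccuracy the reduced precision option accepts in exchange for speed.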
Clone3 syscall fixed
With Glibc 2.34 released, this project has started using the clone3 syscall for creating threads.
FEX's implementation was mostly untested which resulted in all applications breaking.
Stack pointer behaviour was broken; with this fixed, glibc 2.34 now works out of the box.
FEXRootFSFetcher don't try to continue download
FEX-Emu's CDN doesn't support continuing file downloads. Disable to not cause issues.
FEXCore: Reclaimable thread pool allocator
FEXCore now uses an intrusive pooling allocator to allow sleeping threads to give back memory to the pool.
This allows multiple threads to share a memory resource, reducing memory usage by a significant amount if an application has a bunch of sleeping
threads.
FEXBash: Set PS1 environment variable to show running under emulation
Once running FEXBash it can be hard to tell if you're running your bash terminal under emulation.
Setting PS1 to FEXBash> makes it easier to tell that the terminal is running under emulation.
FEXBash> uname -a
Linux ryanh-TR2 5.17.5 #FEX-2206 SMP Jun 4 2022 15:11:07 x86_64 x86_64 x86_64 GNU/Linux
OpcodeDispatcher: Fixes PEXTRB
Newer Unreal engine releases were generating a PEXTRB instruction that our frontend decoder was decoding incorrectly.
Typically this would result in a crash.
This fixes both Dirt 4 and Psychonauts 2.
Misc
- CMake
- Add support for mold
- Add flag for defined signed overflow handling
- Arm64: Optimize constant generation with ADRP+ADR
- EmulatedFiles: Fixes temporary file generation flags
- Struct Verifier: Fixes some bugs with DRM headers not getting picked up
- Linux v5.17 and v5.18 support
- JIT: Code relocation support
- OpcodeDispatcher:
- Adds support for non-temporal loadstores
- Implements support for PAUSE instruction
- Syscalls:
- 32-bit mmap syscalls fixes
- Has been broken since the start, most applications use mmap2 instead
- Fixes Kega Fusion
- CompileService: Removed since it is no longer required
- We no longer try to compile in a reentrant safe fashion
- JITSymbols: Cleaner printing of RIP relative to a file
- Standard TODO markers for code searching
- Some 32-bit FS/GS writing fixes
- Not really used so didn't affect anything
Raw Changes
FEX Release FEX-2206
-
AOTIR
-
copy RAData and IRList, make sure data is accessible (da2e44d)
-
AppConfig
-
Inject --no-sandbox in to steamwebhelper (c14c0c2)
-
ArchHelpers
-
Adds relocation struct defines (b5ae9e4)
-
Arm64
-
Fix LDAPUR/STLUR DMB backpatch (27f2e0b)
-
Adds support for RCPC2 extension (f8ba373)
-
Fixes AtomicSwap (70988cc)
-
Arm64Emitter
-
Optimize constants with ADRP and ADR (912dbfe)
-
CMake
-
C/C++ flags for defined singed overflow warping (2e05349)
-
Add option to use the mold linker (5884114)
-
CompileService
-
Removes no longer necessary service thread (b1033ed)
-
Config
-
Adds code cache config option (278ca52)
-
Core
-
Adds Code Object Cache service (13f3c6e)
-
context-wide guest code invalidations (d810988)
-
EmulatedFiles
-
Fixes temporary file flags (4fbc266)
-
F64
-
Implement FCW using host rounding mode (db3854e)
-
Fix FILD and FIST for Size < 8 (89d6752)
-
FEXBash
-
Set PS1 to make it more obvious when running under FEX (ec38d58)
-
FEXCore
-
Adds refcount_shared_mutex class (1e597bf)
-
Reclaimable thread pool allocator (8a7f395)
-
FEXLoader
-
Fix create_directories check for aotir .path file writting (90f338d)
-
FEXLogServer
-
Stop improper use of std::erase_if (d523b7a)
-
FEXRootFSFetcher
-
Don't continue download (fa87c73)
-
JitSymbols
-
Print file+offset if possible (a715627)
-
Linux
-
Fixes 32-bit mmap (3fd136b)
-
MemAllocator32Bit
-
Add missing lock to shmdt, fix error returns (b2b4c2b)
-
OpcodeDispatcher
-
Implement SHA256 instructions (3bbff8a)
-
Handle SHA-1 instructions (8dd9a5b)
-
Implements support for PAUSE (da48020)
-
Fixes pextrb with high registers (fe11bd2)
-
Remove debugging dump statement (c8dc663)
-
Adds support for non-temporal loadstores (ba78dff)
-
ScopedSignalMask
-
Add shared mutex support, move constructors (8e36f53)
-
Syscalls
-
Fixes clone3 stack pointer (0ed9654)
-
Linux
-
Add guest[Mmap/Munmap] (b9d878b)
-
Refactor guest mman tracking (ce0f5db)
-
TestHarnessRunner
-
Use guest mapper for test harness files (b78af2f...
FEX-2204
Changes
CPUID
- Adds 4000_0001h function (977bda9)
- Allows guest applications to check the hypervisor for FEX-Emu
FEXCore
- Delete IR after it is used (a247df5)
- Lowers FEX memory usage
- Fixes #1618 (60c7ea6)
- Could have caused a crash in the signal handler
- Removes unused debug data (8422fc6)
OpcodeDispatcher
- Fixes SIGILL on unsupported host instructions (042cd35)
- Fixes FNINIT (187c641)
- FCW wasn't initialized correctly. Fixes a Visual Novel game engine CPUID initialization code.
JIT
- Emit identification string in the code buffers (23a1c64)
- Adds comment to EmitDetectionString (d39df8d)
- Get long divide out of the hot path (8ad1472)
FileManager
- Fix realpath failure on Debian Buster (91665fd)
Linux
- Fixes MAP_32BIT supported range (fad91bb)
Scripts
- Updates AArch64 fit for Clang 14 (6b3cd3d)
Softfloat
- Fixes FSCALE (5677924)
TestHarnessRunner
- Flush log on asserts (ebd0edb)
Misc
Termux
- Adds a cmake option for forcing a termux build (5de6c86)
- Disables GUI applications in a Termux build (fb27cb4)
GDBServer
- GDBServer improvements: Three's a crowd (7b0265f)
- Gdbstub improvements: The sequel (6a5abd3)
- GDBServer improvements (53ffe5d)
Documentation
FEX-2203
Changes
JIT
- Implements x87 fallback helpers as lookups into state (b65194f)
ARMJIT
JITx86
- Switches over to loading pointers from state (99dcda7)
OpcodeDispatcher
CPUID
- Implements leaf 4000_0000 (4a3cbf1)
- Allows guest applications to see through CPUID that they are running in a FEX "hypervisor"
ELFCodeLoader
- Fixes typo in AT_BASE calculation (43fada7)
FEXCore
- Adds support for RDRAND/RDSEED (bebcab0)
FEXGetConfig
- Fix --current-rootfs option (76f86e5)
FEXLoader
IR
- New IR JSON format (3e8c6d0)
Linux
LinuxAllocator
- Fixes bug with old kernels and hint allocation (08938ec)
Scripts
Telemetry
- Fix missing telemetry names (30803c6)
TestHarnessRunner
- Wire up environment variable option setting (3ba2d6c)
UContext
- Fixes 32-bit siginfo_t copying definition (cb491a8)
Misc
- Adds tsl::robin_map (a7ad7f4)
- Update vixl to fix assert (57a5654)
- Termux fixes (b4e0565)
- Allow classifying syscalls with flags (2933a00)
- Improve compatibility with older uapi kernel headers (fa554d3)
- Updates vixl for new cursor updating methods (5ec6ee5)
- Adds option to disable ccache (0db7205)
- Fixes Host and guest thunks install path (a5fb7e7)
- Removes MAP_GROWSDOWN usage (afa7172)
- Miscellaneous thunk cleanups (4bb3a54)
- Enable proper IDE integration of thunk libraries (5854d4a)
- Updates Readme to fix install script (defd3be)
- Some fixes for older environments (1b99495)
unittests
- Disables Interpreter tests when it's disabled (3453023)
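
The CPUID entries in these notes (leaf 4000_0000 here, function 4000_0001h in FEX-2204) use the range that x86 hypervisors conventionally reserve for identifying themselves. As a rough illustration of how a guest might interpret that leaf, the sketch below decodes the 12-byte vendor signature that hypervisors conventionally place in EBX, ECX and EDX. The function names are hypothetical, and the register values in the usage note are KVM's well-known signature, used only as a stand-in: these notes do not state what FEX-Emu returns, or whether it sets the "hypervisor present" bit.

```python
import struct

# CPUID.01h:ECX bit 31 is the conventional "hypervisor present" flag.
# Assumption: whether FEX-Emu sets this bit is not stated in these notes.
HYPERVISOR_PRESENT_BIT = 1 << 31


def hypervisor_present(leaf1_ecx: int) -> bool:
    """Return True if the hypervisor-present bit is set in CPUID.01h:ECX."""
    return bool(leaf1_ecx & HYPERVISOR_PRESENT_BIT)


def decode_hypervisor_vendor(ebx: int, ecx: int, edx: int) -> str:
    """Decode the vendor signature from CPUID leaf 0x4000_0000.

    The three 32-bit registers are concatenated in EBX, ECX, EDX order,
    each stored little-endian, then trailing NUL padding is stripped.
    """
    raw = struct.pack("<III", ebx, ecx, edx)
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")
```

For example, `decode_hypervisor_vendor(0x4B4D564B, 0x564B4D56, 0x0000004D)` yields `"KVMKVMKVM"`. A guest would compare the decoded string against whatever signature the emulator documents.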