FEX-2212
Read the blog post at FEX-Emu's Site!
A lot of good work landed this month. The highlights are that we have started on our AVX implementation and begun optimizing our IR to be more efficient.
Disable PCLMUL if not supported on host
The ARM equivalent of this carry-less multiplication instruction is only available on SoCs that ship the cryptographic extension.
The Raspberry Pi does not support this extension, which caused applications that use OpenSSL to crash.
Specifically, this gets Steam running on the Raspberry Pi again.
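For readers unfamiliar with the operation, carry-less multiplication is polynomial multiplication over GF(2): the shifted partial products are combined with XOR instead of addition, so no carries propagate. The sketch below is a plain software model of the 64-bit by 64-bit product that PCLMULQDQ computes. The function name `clmul64` is made up for illustration, and this is not FEX's implementation.

```python
def clmul64(a, b):
    """Carry-less multiply of two 64-bit values, giving a product of up to 127 bits."""
    mask64 = (1 << 64) - 1
    a &= mask64
    b &= mask64
    result = 0
    for bit in range(64):
        # For every set bit in b, XOR in a shifted copy of a (no carries).
        if (b >> bit) & 1:
            result ^= a << bit
    return result
```

For example, `clmul64(0b11, 0b11)` is `0b101`, because (x + 1)(x + 1) = x^2 + 1 once the two middle terms cancel under XOR. An ordinary multiply would give 9 instead.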
Adds 256-bit support to the remaining IR vector ops
A lot of work this month went into supporting 256-bit operations.
With this work in place, our JITs now support 256-bit widths for all of the IR vector operations.
Work started on AVX emulation
With our JITs now supporting 256-bit operations, work could begin on implementing AVX.
This AVX work is implemented as native SVE 256-bit operations, so the only hardware that can currently execute this partial implementation is the Neoverse-V1 CPU.
The expectation is that as ARM CPUs become more powerful, they will eventually support SVE with 256-bit registers.
It may take a few generations to get hardware that supports this. If ARM CPUs are going to run AVX games, they will need to support the equivalent hardware feature set.
Current instructions implemented:
- VZEROUPPER, VZEROALL
- VMOVAPS, VMOVQ
- VMOVNTDQ, VMOVNTDQA, VMOVNTPD, VMOVNTPS
- VMOVDQA, VMOVDQU
- VMOVAPD, VMOVUPD, VMOVUPS
- VMOVLPD, VMOVLPS
- VMOVSHDUP, VMOVSLDUP
- VMOVHPD, VMOVHPS
- VMOVDDUP
- VORPD, VORPS, VPOR
- VPXOR, VXORPD, VXORPS
- VANDPD, VANDPS, VPAND, VANDNPD, VANDNPS, VPANDN
- VADDPD, VADDPS, VPADDB, VPADDW, VPADDD, VPADDQ
This is just the beginning of our AVX support. Stay tuned as we implement the remaining operations over the next few months.
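To make the first entry in the list concrete, here is a minimal model of the architectural effect of VZEROUPPER and VZEROALL, with each YMM register represented as a 256-bit Python integer. The register count of 16 assumes 64-bit mode, and the function names are illustrative. It says nothing about how FEX lowers these instructions to SVE.

```python
NUM_YMM = 16                    # assumption: 64-bit mode, YMM0 to YMM15
LOW_128_MASK = (1 << 128) - 1   # selects the XMM (lower 128-bit) half


def vzeroupper(ymm):
    """Clear bits 255:128 of every register, keeping the XMM halves intact."""
    return [reg & LOW_128_MASK for reg in ymm]


def vzeroall(ymm):
    """Clear every register completely."""
    return [0 for _ in ymm]
```

Both take and return a list of register values rather than mutating state, which keeps the model easy to check in isolation.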
Generate register access IR operations directly
As an original design detail, FEX implemented GPR and XMM register accesses as generic accesses to the emulated CPU state. Once we added
static register allocation, we also added an optimization pass to convert these generic accesses into register accesses that map directly to our
static register allocator.
This pass was redundant, since we know upfront which registers are being accessed. With this change we generate register access IR operations
directly and have removed the optimization pass. This cuts JIT compilation time by around 12%, which improves responsiveness and lets FEX spend less time
compiling code.
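The difference between the two approaches can be illustrated with a toy model. Everything here is hypothetical: the op classes, the state layout (16 GPRs of 8 bytes at offset 0), and the function names are invented for the sketch and do not mirror FEX's actual IR or data structures. The point is only that the direct path produces the same ops without walking the IR a second time.

```python
from dataclasses import dataclass

# Hypothetical layout: 16 GPRs of 8 bytes each at the start of the CPU state.
GPR_BASE = 0
GPR_SIZE = 8
NUM_GPRS = 16


@dataclass(frozen=True)
class LoadContext:
    """Generic load from the emulated CPU state at a byte offset."""
    offset: int


@dataclass(frozen=True)
class LoadRegister:
    """Load that names a statically allocated register directly."""
    reg: int


def emit_generic(regs):
    """Old approach, step 1: the frontend emits generic state accesses."""
    return [LoadContext(GPR_BASE + r * GPR_SIZE) for r in regs]


def lower_to_registers(ops):
    """Old approach, step 2: a pass visits every op and rewrites GPR accesses."""
    gpr_end = GPR_BASE + NUM_GPRS * GPR_SIZE
    out = []
    for op in ops:
        if isinstance(op, LoadContext) and GPR_BASE <= op.offset < gpr_end:
            out.append(LoadRegister((op.offset - GPR_BASE) // GPR_SIZE))
        else:
            out.append(op)
    return out


def emit_direct(regs):
    """New approach: the frontend already knows the register, so no pass is needed."""
    return [LoadRegister(r) for r in regs]
```

In this model, `lower_to_registers(emit_generic(regs))` and `emit_direct(regs)` yield identical op lists, but the second form skips an entire traversal of the IR.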
Systemd fixes
While this is a niche use case, some people may be interested in running FEXServer under systemd.
A FEXServer is meant to be a user-wide server that the FEX clients talk to for rootfs management and, eventually, other management tasks.
Using a systemd user service, a FEXServer can be started early, letting it mount the rootfs image and run in the background.
This can be fairly useful because FEX error logs then go to the journal, where journalctl can be used to inspect why a process crashed.
Add support for steamid based configuration files
As part of an ongoing effort to document which applications can run with FEX's OpenGL and Vulkan thunk libraries, it was determined that some applications
use generic executable names. This means that a configuration file keyed on the application name would have erroneously enabled thunks for other,
untested applications.
To work around this issue, our configuration system now supports an optional steamid-based naming convention for games that are launched from
Steam. With this in place, we now have a repository of application configurations that users can install at their leisure. This repository
can be found on GitHub.
As part of the documentation process, all of these configurations must be documented on our Wiki with testing results to ensure they work.
Implement SGDT
This is a quirky instruction that is emulated even on native x86 systems these days. It is a system instruction that the OS uses to
get the configuration of the global descriptor table. Linux traps this instruction and returns a configuration that says the table lives in
kernel memory space. While this is already true, an application usually doesn't need to care about this data.
Curiously enough, Denuvo uses this instruction in some of their implementations. With us implementing this instruction, Denuvo games
now get slightly further before they horribly crash.
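For reference, in 64-bit mode SGDT stores a 10-byte operand to memory: a 16-bit table limit followed by a 64-bit base address, both little-endian. The sketch below packs such an operand. The helper name and the example values are placeholders chosen for illustration (the base is simply an address in the kernel half of the address space) and are not necessarily what Linux or FEX report.

```python
import struct


def pack_sgdt_result(limit, base):
    """Pack the 10-byte SGDT memory operand: 16-bit limit, then 64-bit base."""
    # "<" selects little-endian with no padding, so the result is exactly 10 bytes.
    return struct.pack("<HQ", limit & 0xFFFF, base & 0xFFFFFFFFFFFFFFFF)


# Placeholder values for illustration only.
EXAMPLE_LIMIT = 0
EXAMPLE_BASE = 0xFFFFFFFFFFFE0000  # upper-half (kernel-space) address

operand = pack_sgdt_result(EXAMPLE_LIMIT, EXAMPLE_BASE)
```

An application that reads the operand back only learns that the table sits somewhere in kernel space, which matches the point above that the data is rarely useful to user code.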
auxv fixes
When FEX executes an application, it needs to set up an emulated auxv state, since this state isn't cross-architecture. The following entries were fixed:
- AT_RANDOM
  - This now correctly passes through the host's AT_RANDOM value rather than fixed values
- AT_PLATFORM
  - Some tooling uses this to determine if it is running as i686 or x86-64
- AT_HWCAP/AT_HWCAP2
  - This just returns some CPUID values; most applications use CPUID directly instead of this
- AT_MINSIGSTKSZ
  - The minimum signal stack size is no longer a hardcoded constant
  - Applications are supposed to use this to calculate a signal stack size
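To show what a guest application sees, the sketch below parses a raw auxiliary vector such as the contents of `/proc/self/auxv`. The tag numbers come from the Linux UAPI header `linux/auxvec.h`. The function names are illustrative, and the parser assumes a little-endian layout with 8-byte words by default (pass `word_size=4` for a 32-bit vector).

```python
import struct

# Tag values from linux/auxvec.h.
AT_NULL = 0
AT_PLATFORM = 15      # pointer to a string such as "x86_64" or "i686"
AT_HWCAP = 16
AT_RANDOM = 25        # pointer to 16 random bytes
AT_HWCAP2 = 26
AT_MINSIGSTKSZ = 51   # minimum signal stack size in bytes


def parse_auxv(raw, word_size=8):
    """Parse (tag, value) pairs until AT_NULL; assumes little-endian words."""
    fmt = "<QQ" if word_size == 8 else "<II"
    pair_size = 2 * word_size
    usable = len(raw) - len(raw) % pair_size  # drop any trailing partial pair
    entries = {}
    for tag, value in struct.iter_unpack(fmt, raw[:usable]):
        if tag == AT_NULL:
            break
        entries[tag] = value
    return entries


def read_host_auxv():
    """Read this process's own vector (Linux only, 64-bit little-endian assumed)."""
    with open("/proc/self/auxv", "rb") as f:
        return parse_auxv(f.read())
```

Calling `read_host_auxv().get(AT_MINSIGSTKSZ)` on a recent kernel returns the size an application should use as the floor for its alternate signal stack, or `None` if the kernel does not provide the entry.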
Support radeon drm driver in ioctl emulation
Most Radeon GPUs these days use the amdgpu kernel driver, but a user found a hole in our ioctl emulation by using an old Radeon GPU on a Phytium ARM
board.
With this in place, older Radeon cards that use the radeon kernel driver can now have accelerated OpenGL.
Misc optimizations
This month we have had a random smattering of optimizations that improve startup, shutdown, and execve performance. While they don't individually provide a
lot of benefit, small optimizations like these add up to make FEX better over time:
- Defer cpuinfo file initialization until first access
  - Improves startup time
- Use tsl::robin_map for some internal maps
  - Improves JIT time, with some minor shutdown performance improvements
- Disable multiblock by default
  - Multiblock causes excessive JIT overhead, which makes the experience worse for the user
  - Disabling it significantly reduces stutters
- Improve hot path of file existence checking in syscall wrapping
  - During our overlayfs handling, this can be hit quite hard during file accesses
  - Improves file IO in applications
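The first item in the list above is an instance of lazy initialization: the expensive work is postponed until something actually asks for the result, and the result is then cached. A minimal sketch of the pattern follows. The class name and the generator callback are hypothetical and do not reflect FEX's actual code, and this version is not thread-safe.

```python
class LazyEmulatedFile:
    """Builds file contents on first read instead of at startup."""

    def __init__(self, generate):
        self._generate = generate   # callable that produces the contents
        self._contents = None       # nothing is built yet

    def read(self):
        if self._contents is None:
            # First access: pay the generation cost once and cache the result.
            self._contents = self._generate()
        return self._contents
```

A process that never opens the emulated file never pays for generating it, which is where the startup saving comes from.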
Raw Changes
- Arm64
  - Const on unmodified argument (9ca34ca)
  - Minor optimization in AESKEYGENASSIST (c1d118c)
  - Optimize Break IR op codegen (c7dd6ff)
- VectorOps
  - Simplify VMov IR op on SVE (70e6ab5)
- CMake
  - Fix typo in clang thunks option. (0030971)
- Config
  - Disable multiblock by default (df25d4e)
  - Add support for steamid based configurations. (02ca94e)
- Core
  - Replace a couple maps with tsl robin_map (57c5761)
  - Removes log about migrating to shared memory mode (8b6e9e0)
- ELFCodeLoader
  - Calculate AT_MINSIGSTKSZ (e0fe916)
  - Fixes AT_PLATFORM null terminator (d7b0e84)
  - Pass through AT_SECURE (8afc3b8)
  - Ensure we set AT_SYSINFO for 32-bit (1d32df9)
- EmulatedFiles
  - Defer cpuinfo file initialization to first access (8e2b0d1)
- Externals
  - Update vixl submodule (f066abc)
- FEXConfig
  - Sort named rootfs vector (71f658b)
- FEXLoader
  - Make IsInterpreterInstalled check less horrible. (1dd5642)
  - Disables some AOT shutdown overhead when not enabled (f8b2a0b)
- FEXServer
  - More Systemd fixes (5e5e5a3)
- FEXServerClient
  - Disable confusing connection log (cc6306a)
  - Add some debug logs for when FEX can't connect to se… (3c8da3e)
- IR
  - Handle 256-bit VExtr (5a403b7)
  - Removes the only uses of VSLI and VSRI (7d9ed4e)
  - Remove VLoadMemElement and VStoreMemElement (9cee012)
  - Handle 256-bit LoadRegister/StoreRegister (a9c5138)
  - Handle 256-bit VAddV (04d4c5e)
- IntrusiveIRList
  - Add a utility helper for getting an OrderedNodeWrapper (3c88180)
- IoctlEmu
  - Support radeon (fc28062)
- Linux
  - Improve performance of hot paths in path searching (66e0d46)
- OpcodeDispatcher
  - Handle VADDPD/VADDPS/VPADDB/VPADDW/VPADDD/VPADDQ (8f157e4)
  - Handle VANDPD/VANDPS/VPAND/VANDNPD/VANDNPS/VPANDN (c37fcf1)
  - Handle VORPD/VORPS/VPOR (34e39c9)
  - Handle VPXOR/VXORPD/VXORPS (4de6902)
  - Handle VZEROUPPER/VZEROALL (a374a9a)
  - Handle VMOVQ (b35c6c6)
  - Handle VMOVNTDQ/VMOVNTDQA/VMOVNTPD/VMOVNTPS (0841ff5)
  - Implement SGDT (9912d41)
  - Moves all GPR and XMM accesses to direct register accesses (a4556e9)
  - Handle VMOVDQA/VMOVDQU (df761a9)
  - Handle VMOVDDUP (f8a199a)
  - Handle VMOVSHDUP/VMOVSLDUP (58f35ba)
  - Handle VMOVHPD/VMOVHPS (bc20f1e)
  - Handle VMOVLPD/VMOVLPS (69045db)
  - Handle VMOVAPD/VMOVUPD/VMOVUPS (2271a90)
  - Handle VMOVAPS (56ff09f)
  - Disable PCLMUL if not supported on host (c8293cb)
- Syscalls
  - Minor optimization with initialization of syscall definition vector (4d2c4b4)
- Thunks
  - Fix guest targets not being detected by IDEs (9158877)
- Misc
  - x86_64/VectorOps: Separate 128-bit/256-bit paths (181d315)
  - Systemd fixes (d5f7e61)
  - InstallFEX.py: Adds support for Kinetic (aa837ed)
  - Update release process to include AUR (5336f01)
- unittests
  - Expand VPCLMULQDQ unit test (9fea774)