FEX-2301
Read the blog post at FEX-Emu's Site!
Happy new year! A new month brings a new release of FEX-Emu to ring in the new year.
A large amount of work landed this last month, showing that FEX-Emu isn't slowing down even through the holiday season.
AVX emulation work continues
An absolute ton of work landed this last month towards bringing up AVX emulation. In total, around 185 new
AVX instructions were implemented in FEX-Emu's backend. At this point it is becoming easier to talk about the number of missing instructions
than about what is implemented.
According to FEX-Emu's instruction decoder tables, we have around 60 more instructions to implement before we can start advertising the feature. Of course,
as with anything programming related, the last 10% is going to take the longest to implement.
A huge shoutout to @lioncash for smashing out these implementations so quickly. The amount of work going into this is
extensive.
As a side note for users looking forward to this feature: the implementation currently requires hardware that supports both SVE and SVE2 with a 256-bit
register width. This means that Fujitsu A64FX, Neoverse-V1, and all current consumer-class Cortex chips will be unable to take advantage of AVX once it is
complete. This is a future-proofing implementation for when hardware that supports what FEX-Emu needs becomes available.
Implement a new AArch64 code emitter
One standout performance bottleneck has been how quickly FEX-Emu can emit AArch64 binary code to memory. The project that
FEX-Emu used for this is ARM/Linaro's vixl, a suite of tools including assemblers, simulators, and disassemblers that many
open source projects use. It is a very nice project that eases the developer's burden when writing a
JIT that targets ARM devices. Sadly, when profiling our code, it turns out that FEX-Emu spends a decent amount of time inside of vixl code due to how
large it is. Even with Link-Time-Optimization enabled in our code, we can't reduce the overhead incurred from vixl.
With this in mind, FEX-Emu decided to create its own AArch64 code emitter tailored to what the project needs, which is high performance and low
overhead.
As seen in the chart above, the difference in how long it takes to emit code between vixl and our new emitter is significant. With the new
emitter, the Cortex-X1 takes only 68.7% of the time, and the smaller Cortex-A55 only 60.2% of the time. The larger win on the Cortex-A55
suggests that, because of how much code vixl executes to emit code, it is effectively saturating the icache and
BTB of the poor little CPU core.
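For readers who prefer speedup factors to "percentage of the old time", the conversion is just the reciprocal. A minimal sketch; the helper name `speedupFromTimePercent` is made up for illustration and is not part of FEX-Emu:

```cpp
#include <cassert>

// Converts "new time as a percentage of the old time" into a speedup factor.
// Example: taking 50% of the old time corresponds to a 2x speedup.
double speedupFromTimePercent(double percentOfOldTime) {
  assert(percentOfOldTime > 0.0);
  return 100.0 / percentOfOldTime;
}
```

Plugging in the figures quoted here, 68.7% works out to roughly 1.46x faster code emission on the Cortex-X1, and 60.2% to roughly 1.66x on the Cortex-A55.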
Code emission performance isn't the only story that matters here, though. We also need to show how much of an improvement this makes once the
rest of the translation from x86 code is included.
Although code emission is only a fraction of the total time spent translating x86 code, this new emitter yields a fairly massive ~8%
reduction in time spent JITing. This will manifest as reduced stutters when users are running games and generally faster execution for
short-lived applications.
We're not stopping there, of course. Look forward to the coming months as we spend more time optimizing our JIT so it runs even faster!
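To give a feel for why a purpose-built emitter can be so cheap, here is a minimal sketch of the general idea: AArch64 instructions are fixed-width 32-bit words, so emitting one can be as little as a few shifts, ORs, and a store. The class name `TinyEmitter` and its methods are hypothetical and are not FEX-Emu's actual emitter API; only the two instruction encodings are taken from the architecture.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical minimal emitter: each method encodes one fixed-width
// 32-bit AArch64 instruction and appends it to a buffer.
class TinyEmitter {
public:
  // ADD Xd, Xn, Xm (64-bit, shifted-register form with LSL #0).
  void add(uint32_t rd, uint32_t rn, uint32_t rm) {
    assert(rd < 32 && rn < 32 && rm < 32);
    emit(0x8B000000u | (rm << 16) | (rn << 5) | rd);
  }

  // RET (returns to the address held in X30).
  void ret() { emit(0xD65F03C0u); }

  const std::vector<uint32_t>& code() const { return buffer_; }

private:
  void emit(uint32_t word) { buffer_.push_back(word); }

  std::vector<uint32_t> buffer_;
};
```

Calling `add(0, 1, 2)` followed by `ret()` leaves the two words for `add x0, x1, x2` and `ret` in the buffer. A real JIT would write into executable memory and handle many more instruction forms, but the per-instruction cost can stay this small.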
Initial 32-bit thunk support
One tricky thing about FEX-Emu's emulation is that it translates 32-bit x86 applications to run inside of a 64-bit process space. This
makes thunking a hard problem, which is why we don't currently support thunking of libraries when running 32-bit applications. This is the initial work
required to start supporting this use case.
While not wired up to any library currently, we are quickly working towards getting Vulkan and OpenGL wired up to this interface so we can accelerate
older 32-bit games.
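One part of what makes this hard is data layout: a 32-bit guest stores pointers in 4 bytes, while a 64-bit host library expects 8-byte pointers, so structures passed across the boundary generally have to be repacked. A minimal sketch of that idea follows. The struct names, fields, and the `toHost` helper are invented for illustration and are not FEX-Emu's thunk interface, and the sketch assumes guest addresses are directly valid in the host process, zero-extended.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical struct as a 32-bit guest lays it out: pointers are 4 bytes.
struct GuestBuffer32 {
  uint32_t data;  // guest pointer, stored as a 32-bit address
  uint32_t size;  // guest size_t is 32 bits wide
};

// The same logical struct as a 64-bit host library expects it.
struct HostBuffer {
  void* data;
  std::size_t size;
};

// Repack a guest struct into the host layout by widening each field.
HostBuffer toHost(const GuestBuffer32& guest) {
  HostBuffer host;
  host.data = reinterpret_cast<void*>(static_cast<uintptr_t>(guest.data));
  host.size = guest.size;
  return host;
}
```

Going the other way is the harder direction, since a pointer returned by the host library only fits back into the guest struct if it lies in the low 4 GiB of the address space.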
Various JIT optimizations
There have been various JIT optimizations this month which will improve performance a small amount. These aren't benchmarked since the percentage
improvements are so small that they are likely to fall into single-digit noise.
Optimize inline syscall spilling
When FEX handles a syscall inline in our JIT, we were spilling all of our registers to memory. With this optimization working correctly, we now
spill only what is required, making inline syscalls faster.
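The general idea of "spill only what is required" can be sketched with register bitmasks: a register needs saving only if it holds a live value and the operation is going to overwrite it. This is an illustrative model under that assumption, not FEX-Emu's actual register allocator logic, and both function names are hypothetical.

```cpp
#include <cstdint>

// Registers are modelled as bits in a mask (bit N = register N).
// Only registers that are both live and about to be clobbered need saving.
uint32_t registersToSpill(uint32_t liveMask, uint32_t clobberedMask) {
  return liveMask & clobberedMask;
}

// Counts the set bits, i.e. how many registers a mask contains.
int countRegisters(uint32_t mask) {
  int count = 0;
  while (mask != 0) {
    mask &= mask - 1;  // clear the lowest set bit
    ++count;
  }
  return count;
}
```

For example, with registers 0 through 7 live and an operation that clobbers registers 0, 1, 2, and 8, only three registers need to be spilled instead of all eight.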
Optimize generic spilling and filling
When jumping out of the JIT to C code, we need to spill both general purpose registers and vector registers to the stack. With this optimization in place, we now
generate roughly half as many instructions when doing so.
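The post doesn't say how the instruction count was halved. One common way to get that kind of saving on AArch64 is to store registers in pairs with `STP` rather than one at a time with `STR`, and the sketch below shows that technique for general purpose registers only. It is an assumption for illustration, not a description of FEX-Emu's change, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t SP = 31;  // register number used for the stack pointer base

// One store per register: STR Xt, [SP, #(8 * t)] for t in [0, count).
std::vector<uint32_t> spillSingly(uint32_t count) {
  assert(count <= 30);
  std::vector<uint32_t> code;
  for (uint32_t t = 0; t < count; ++t) {
    code.push_back(0xF9000000u | (t << 10) | (SP << 5) | t);
  }
  return code;
}

// One store per register pair: STP Xt, Xt+1, [SP, #(8 * t)].
std::vector<uint32_t> spillInPairs(uint32_t count) {
  assert(count <= 30 && count % 2 == 0);
  std::vector<uint32_t> code;
  for (uint32_t t = 0; t < count; t += 2) {
    code.push_back(0xA9000000u | (t << 15) | ((t + 1) << 10) | (SP << 5) | t);
  }
  return code;
}
```

Spilling four registers takes four instructions with `spillSingly(4)` and two with `spillInPairs(4)`, writing to the same stack slots in both cases.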
Optimize SVE register spilling and filling
While not utilized today, this cuts the number of instructions required for spilling SVE registers to a quarter. This should be quite nice for
future hardware.
Zip elements for PHSUB instructions
These horizontal vector instructions behave a little weirdly, and our original JIT implementation wasn't quite optimal. Previously we were doing
explicit element inserts to combine the final result. Now we use the AArch64 Zip instructions, which are significantly more efficient.
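To illustrate why a permute-based approach fits these instructions, the scalar model below computes `PHSUBD` two ways: element by element, and by first gathering the even and odd lanes of the two inputs and then doing a single vector subtract. The even/odd gather used here is an assumption chosen to keep the model simple (it mirrors what AArch64 `UZP1`/`UZP2` do); it is not FEX-Emu's actual IR or generated code, and all names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec4 = std::array<uint32_t, 4>;

// Reference: PHSUBD semantics written out one element at a time.
Vec4 phsubdReference(const Vec4& dst, const Vec4& src) {
  return {dst[0] - dst[1], dst[2] - dst[3], src[0] - src[1], src[2] - src[3]};
}

// Even-indexed lanes of a followed by those of b (like AArch64 UZP1).
Vec4 evenLanes(const Vec4& a, const Vec4& b) { return {a[0], a[2], b[0], b[2]}; }

// Odd-indexed lanes of a followed by those of b (like AArch64 UZP2).
Vec4 oddLanes(const Vec4& a, const Vec4& b) { return {a[1], a[3], b[1], b[3]}; }

// Permute-based form: two lane gathers, then one lane-wise subtract.
Vec4 phsubdPermute(const Vec4& dst, const Vec4& src) {
  const Vec4 even = evenLanes(dst, src);
  const Vec4 odd = oddLanes(dst, src);
  Vec4 out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = even[i] - odd[i];  // unsigned arithmetic wraps, as PHSUBD does
  }
  return out;
}
```

Both functions produce the same result for every input. The second form maps onto three vector instructions with no per-element inserts.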
Fix global application configurations
This was a bug where we accidentally broke the application configurations shipped with the fex-emu package. In particular, this caused the steamwebhelper
to break. With this resolved, Steam will work correctly again.
Fix misspelled library names in Thunks Database
While a fairly minor fix, this can have a profound impact on users of our thunking infrastructure. Our XCB thunks were incorrectly named,
which meant that if users enabled XCB thunks independently of Vulkan/GL, they wouldn't actually have been enabled.
With this typo fixed, this is no longer a concern.
Note that if Vulkan or GL thunks were enabled, this likely wouldn't have been an issue, since X11 would have loaded xcb independently anyway.
Misc
There was a bunch more this month that was smaller and spread out. We don't want to take up too much of your time so if you want to see more, make
sure to check out the detailed change log!
Raw Changes
- ARM64
  - Moves RA functions to header (048daa4)
- Arm64
  - Rename GetSrcPair, GetDst, and GetSrc (bf7d0f7)
  - Enables debug option for disassembling the JIT code (03a0613)
  - Inline Syscall spill optimization (0ebb15c)
  - Optimize SVE register spilling and filling (1ab4471)
  - Optimizing spilling and filling (9a8852f)
  - Reduce dispatcher to 1 page (65e8bf9)
- VectorOps
  - Simplify FADDP result merging (344ec33)
- Config
  - Fixes global application configs (dc9737a)
- Crypto
  - Explicitly clear upper lane with VPCLMULQDQ (4c013c8)
- Dispatcher
  - Calculate REG_ERR correctly using ARM ESR_EL1 (4f313f5)
- Frontend
  - Handle 256-bit destination sizes directly (e8aa79b)
- IR
  - Handle 128-bit VInsElement with SVE (94ae2e3)
- LookupCache
  - Use a PMR map for our Blocklinks with monotonic allocator (b7358b4)
  - Optimize cache clearing and allocation (2b6a020)
- OpCodeDispatcher
  - Optimize a case of GOT calculation (b42b4e0)
- OpcodeDispatcher
  - Handle immediate variants of VPERMILPD/VPERMILPS (3904a52)
  - Handle VMASKMOVDQU (c6297ed)
  - Handle VPHSUBD/VPHSUBW (4786ddc)
  - Zip elements instead of for loop insertion in PHSUB (58ec2b2)
  - Handle VDPPD/VDPPS (9b8c92e)
  - Handle VINSERTPS (6caf764)
  - Handle VMOVMSKPD/VMOVMSKPS (faa81f2)
  - Handle VPUNPCKHBW/VPUNPCKHWD/VPUNPCKHDQ/VPUNPCKHQDQ (64cd377)
  - Handle VUNPCKHPD/VUNPCKHPS (138f1fc)
  - Handle VPUNPCKLBW/VPUNPCKLWD/VPUNPCKLDQ/VPUNPCKLQDQ (6bc1c3f)
  - Handle VUNPCKLPD/VUNPCKLPS (4560c5b)
  - Handle VCVTSS2SI/VCVTTSS2SI/VCVTSD2SI/VCVTTSD2SI (4a88480)
  - Handle VCVTPD2DQ/VCVTTPD2DQ/VCVTPS2DQ/VCVTTPS2DQ (f379385)
  - Handle VPMULHRSW (82adc2f)
  - Handle VPMULHW/VPMULHUW (4a3af8d)
  - Handle VPHMINPOSUW (9d58514)
  - Handle VPMULDQ/VPMULUDQ (33e8f21)
  - Handle VCMPSD/VCMPSS (cecda7b)
  - Remove lingering debug log from VPFCMPOp (ce35128)
  - Convert runtime assert to static_assert in SHUFOps (345e9b9)
  - Handle VCMPPD/VCMPPS (0c651dd)
  - Handle VPSRLDQ (1668db0)
  - Remove unnecessary usage of VMov in VPSLLDQOp (515b3e4)
  - Handle VCVTDQ2PD/VCVTDQ2PS (4aed60e)
  - Handle VPSLLDQ (60a2fb1)
  - Simplify SHA1MSG1 implementation (d0cb329)
  - Handle immediate variants of VPSRLD/VPSRLQ/VPSRLW (72a3b18)
  - Handle 128-bit AVX AES instructions (1800451)
  - Handle immediate variants of VPSRAD/VPSRAW (1d92182)
  - Handle VPACKUSDW/VPACKUSWB (6e733bf)
  - Handle VPACKSSDW/VPACKSSWB (01d2284)
  - Handle vector versions of VPSRA{D, W} (78b53bf)
  - Handle remaining PEXTRW opcode (fabf453)
  - Handle VPMULL{D, B} (b26e410)
  - Handle vector variants of VPSRL{D, Q, W} (ad3bf18)
  - Handle VPEXTR{B, D, Q, W}/VEXTRACTPS (c86ba76)
  - Handle immediate variants of VPSLL{D, Q, W} (c1e301a)
  - Handle vector variants of VPSLL{D, Q, W} (58fab72)
  - Handle VPHADDW/VPHADDD (8ce6c08)
  - Handle VPMOVSXB{D, W, Q}/VPMOVSXW{D, Q}/VPMOVSXDQ/VPMOVZXB{D, W, Q}/VPMOVZXW{D, Q}/VPMOVZXDQ (dc2eaf6)
  - Narrow memory access with scalar rounding operations (0e233a9)
  - Move template impl to regular function where applicable (4b891d6)
  - Handle VROUNDS{D, S}/VROUNDP{D, S} (1ca3563)
  - Handle VINSERTF128/VINSERTI128 (4b21647)
  - Handle VPERM2F128/VPERM2I128 (f3d0fa6)
  - Handle VPERMQ/VPERMPD (60a4561)
  - Handle VHADDP{D, S} (ded257c)
  - Handle VPMAXS{B, D, W}/VPMAXU{B, D, W} (9de5840)
  - Handle VPMINS{B, D, W}/VPMINU{B, D, W} (40bab6b)
  - Handle VPADDS{B, W}/VPSUBS{B, W} (98a4541)
  - Handle VPADDUS{B, W}/VPSUBUS{B, W} (757602b)
  - Handle VPSUB{B, D, Q, W} (a90067f)
  - Handle VPSIGN{B, D, W} (1bc013d)
  - Handle VDIVP{D, S}/VDIVS{D, S} (a07a533)
  - Handle VMULP{D, S}/VMULS{D, S} (eefcea4)
  - Handle VMAXP{D, S}/VMAXS{D, S}/VMINP{D, S}/VMINS{D, S} (d6b137e)
  - Handle VSUBP{D, S}/ VSUBS{D, S} (db90390)
  - Handle VRCPPS/VRCPSS (2937344)
  - Handle VLDDQU (03fbb92)
  - Handle VPABS{B, D, W} (a57f3a6)
  - Handle VPCMPEQ{B, D, Q, W}/VPCMPGT{B, D, Q, W} (573896d)
  - Handle VRSQRTSS/VRSQRTPS (ab14375)
  - Handle VPBROADCAST{B, D, Q, W}/VBROADCASTI128 (ace90aa)
  - Handle VBROADCASTSD/VBROADCASTSD/VBROADCASTF128 (2123868)
  - Handle VLDMXCSR/VSTMXCSR (b73aeb8)
  - Handle VSQRTPD/VSQRTPS/VSQRTSD/VSQRTSS (d965ae0)
  - Handle VCOMISD/VCOMISS/VUCOMISD/VUCOMISS (7ac21e7)
  - Handle VPAVGB/VPAVGW (a98920d)
  - Explicitly zero upper lanes (e9aa368)
  - Handle VADDSD/VADDSS (5ac44ba)
  - Merge HADDP/PHADD into VectorALUOp (bf86df7)
  - Merge PAVGOp with VectorALUOp (3322f8b)
  - Merge PADDQOp, PSUBQOp, PADDSOp, PSUBSOp with VectorALUOp (9eaa45f)
  - Merge ANDNOp with VectorALUROp (4b16718)
  - Simplify VANDN (2bf7e09)
- OpcodeHandler
  - Handle VADDSUBP{D, S} (905eb01)
- ThunkDB
  - Clean up database loading (12b866c)
- Thunks
  - Fix IDE integration (16969fc)
- ThunksDB
  - Fix misspelt guest library names (e486833)
- X86Tables
  - Restrict CVTDQ2PD and CVTTSD2SI to 64-bit memory accesses (91c00d2)
- Misc
  - Create a new ARM64 Emitter and move JIT over to it. (ec55ecd)
  - Initial 32-bit host thunk feature support (d5f3a09)