Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CodeGen: Implement support for math.lerp lowering #1609

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

zeux
Copy link
Collaborator

@zeux zeux commented Jan 9, 2025

To implement math.lerp without branches, we add SELECT_NUM which
selects one of the two inputs based on the comparison condition.

For simplicity, we only support C == D for now; this can be extended to
a more generic version with a IrCondition operand E, but that requires
more work on the SSE side (to flip the comparison for some conditions
like Greater, and expose more generic vcmpsd).

Note: On AArch64 this will effectively result in a change in floating point
behavior between native code and non-native code: clang synthesizes
fmadd (because floating point contraction is allowed by default, and the
arch always has the instruction), whereas this change will use fmul+fadd.

I am not sure if this is good or bad, and if this is a problem in C or not.
Specifically, clang's behavior results in different results between X64
and AArch64 when not using codegen, and with this change the behavior
when using codegen is... the same? :)

Fixing this will require either using LERP_NUM instead and hand-coding
lowering, or exposing some sort of "quasi" MADD_NUM (which would
lower to fma on AArch64 and mul+add on X64).

A small benefit to the current approach is lerp(1, 5, t) constant-folds the
subtraction. With LERP_NUM this optimization will need to be implemented
manually as a partial constant-folding for LERP_NUM.

A similar problem exists today for vector.cross & vector.dot. So maybe this
is not something we need to fix, unsure. Thoughts welcome, I'll leave this
in draft for now.

@zeux
Copy link
Collaborator Author

zeux commented Jan 9, 2025

I'll wait until the next upstream sync to rebase this anyway but I'm leaning towards keeping this implementation for now.

Specifically:

  • There is already a set of functions that have similar discrepancies between NCG & interpreted execution
  • The behavior in interpreted execution is different between 32-bit ARM v7 (mul+add) and 64-bit ARM v8 (fma)
  • In typical offline compilation contexts like Roblox, if the bytecode is compiled on a different architecture, constant folding can compute different results
  • The general problem is not quite limited to fma either: a lot of trigonometric functions will return different results based on the libm used, denormal behavior is different based on FTZ flags (that may differ between compiler & runtime) and even if the flags are the same, FTZ behavior is different on ARM vs X86
  • MSVC does not use fma for Windows-on-ARM builds for lerp implementation, so using fma in codegen would create another inconsistency there.

So basically this space just feels too fragmented for any decision to be obviously right. Henceforth the "use mul+add" approach does not seem obviously wrong :)

To implement math.lerp without branches, we add SELECT_NUM which
selects one of the two inputs based on the comparison condition.

For simplicity, we only support C == D for now; this can be extended to
a more generic version with a IrCondition operand E, but that requires
more work on the SSE side (to flip the comparison for some conditions
like Greater, and expose more generic vcmpsd).
@zeux zeux marked this pull request as ready for review January 10, 2025 19:46
@zeux zeux requested a review from vegorov-rbx January 10, 2025 19:46
@vegorov-rbx
Copy link
Collaborator

Yes, there are many small inconsistencies like that, so we provide the best effort solution, but do not place the strictness guarantees on floating-point.
So it's ok to proceed and if we want to use compiler flags to improve the situation in one case or another, that can be done separately.

- A64 can reuse any of a/b/c/d for output register
- X64 can reuse c/d for output register; only b can't be reused because
  it might be clobbered when a is loaded from memory
- Remove unused condition argument from SELECT_NUM as we assume == for now
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants