Fix with _mm_div_ps when SSE2NEON_PRECISE_DIV=1 #631

Merged 1 commit on May 14, 2024


sergeyvfx
Contributor

On 64-bit ARM platforms there is no need to implement _mm_div_ps with rcp plus NR (Newton-Raphson) steps, because there is an exact division instruction. In fact, using rcp and an NR step gives _mm_div_ps a different precision from native SSE and also makes the sse2neon tests fail.

This change makes SSE2NEON_PRECISE_DIV add the extra NR step only when building for 32-bit ARM platforms. On 64-bit platforms _mm_div_ps now gives exactly the result this intrinsic gives on x64 platforms. It also fixes the regression tests when compiled with SSE2NEON_PRECISE_DIV=1.

There is no change in the behavior or precision of _mm_div_ps on 32-bit ARM platforms.
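The difference between the two strategies can be modeled in portable scalar C. This is only a rough sketch, not the sse2neon implementation: `rcp_estimate` is a hypothetical stand-in for a hardware reciprocal estimate (here simply the true reciprocal truncated to about 7 mantissa bits), and the `is_64bit` and `precise_div` parameters, as well as the step counts (one base step plus one extra), are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware reciprocal estimate: the true
 * reciprocal with its low 16 mantissa bits cleared (about 7 bits kept). */
static inline float rcp_estimate(float d)
{
    float r = 1.0f / d;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF0000u;
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson refinement of x ~= 1/d: x' = x * (2 - d * x). */
static inline float nr_step(float d, float x)
{
    return x * (2.0f - d * x);
}

/* Reciprocal-based path: refine the estimate, then multiply by a. */
static inline float div_rcp(float a, float b, int steps)
{
    float x = rcp_estimate(b);
    for (int i = 0; i < steps; i++)
        x = nr_step(b, x);
    return a * x;
}

/* Policy described above: exact division on 64-bit targets; on 32-bit
 * targets the reciprocal path, with one extra step when precise_div is
 * nonzero. The step counts are assumptions for this sketch. */
static inline float div_sketch(float a, float b, int is_64bit, int precise_div)
{
    if (is_64bit)
        return a / b;
    return div_rcp(a, b, precise_div ? 2 : 1);
}

/* Relative error of `got` against a / b evaluated in double precision. */
static inline double rel_err(float got, float a, float b)
{
    double ref = (double)a / (double)b;
    double e = ((double)got - ref) / ref;
    return e < 0 ? -e : e;
}
```

Each NR step roughly squares the relative error of the reciprocal, so the extra step helps the 32-bit path, but the final `a * x` is still a product of two rounded values and is not guaranteed to equal the correctly rounded quotient that a division instruction returns.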

To reproduce the issue this change addresses, run the sse2neon tests:

make clean && CXXFLAGS='-DSSE2NEON_PRECISE_DIV=1' make && ./tests/main

Before this change the tests fail on Apple M2 Ultra; after this change they pass.

To compare the behavior of _mm_div_ps with the native SSE instruction, the following values were used:

_mm_div_ps(_mm_set1_ps(0.00160000019241124391555786132812),
           _mm_set1_ps(0.00579846277832984924316406250000))

The native _mm_div_ps on an i7-6770HQ gives 0.275935232639312744140625. Before this change, sse2neon on Apple M2 Ultra gave exactly the same value with SSE2NEON_PRECISE_DIV=0, but a slightly different value, 0.2759352028369903564453125, with SSE2NEON_PRECISE_DIV=1. With this change the result of _mm_div_ps matches the x64 platform regardless of the value of SSE2NEON_PRECISE_DIV.
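For scale, the two results quoted above differ by 2^-25, which is one unit in the last place (ULP) for single-precision values in [0.25, 0.5). A minimal sketch of how that gap can be measured, assuming IEEE-754 binary32 floats; the helper names are hypothetical and not part of sse2neon:

```c
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of a float, assuming IEEE-754 binary32. */
static inline int32_t float_bits(float f)
{
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* ULP distance between two finite floats of the same sign. */
static inline int32_t ulp_distance(float x, float y)
{
    int32_t d = float_bits(x) - float_bits(y);
    return d < 0 ? -d : d;
}

/* Gap between the two results quoted in the description. */
static inline int32_t reported_gap_in_ulps(void)
{
    const float native_sse = 0.275935232639312744140625f;
    const float precise_div = 0.2759352028369903564453125f;
    return ulp_distance(native_sse, precise_div);
}
```

A one-ULP difference is small, but it is enough to fail tests that compare against the native SSE result bit for bit.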

On the Blender side, the original issue that required the more precise version of the intrinsics is still solved, and all tests pass. So there is no impact on the correctness of the work Brecht did earlier in this area.

@sergeyvfx sergeyvfx requested review from jserv and marktwtn as code owners May 14, 2024 07:50
@jserv jserv requested review from Cuda-Chen and howjmay May 14, 2024 14:37
@jserv jserv merged commit de0538f into DLTcollab:master May 14, 2024
16 checks passed
@jserv
Member

jserv commented May 14, 2024

Thanks @sergeyvfx for contributing!
