Fix with _mm_div_ps when SSE2NEON_PRECISE_DIV=1 #631

sergeyvfx · 2024-05-14T07:50:10Z

On 64-bit ARM platforms there is no need to implement _mm_div_ps with rcp plus Newton-Raphson (NR) steps, because an exact division instruction is available. In fact, using rcp with an NR step gives _mm_div_ps a different precision from native SSE and causes the sse2neon tests to fail.

With this change, SSE2NEON_PRECISE_DIV only adds an extra NR step when building for 32-bit ARM platforms. On 64-bit platforms, _mm_div_ps now produces exactly the same result as this intrinsic does on x64 platforms. This also fixes the regression tests when compiled with SSE2NEON_PRECISE_DIV=1.

There is no change in the behavior or precision of _mm_div_ps on 32-bit ARM platforms.

To reproduce the issue this change addresses, the sse2neon tests can be used:

make clean && CXXFLAGS='-DSSE2NEON_PRECISE_DIV=1' make && ./tests/main

Before this change the tests fail on Apple M2 Ultra; after this change they pass.

To compare the behavior of _mm_div_ps with the native SSE instruction, the following values were used:

_mm_div_ps(_mm_set1_ps(0.00160000019241124391555786132812),
           _mm_set1_ps(0.00579846277832984924316406250000))

The native _mm_div_ps on an i7-6770HQ gives a value of 0.275935232639312744140625. Before this change, sse2neon on Apple M2 Ultra gave exactly the same value when SSE2NEON_PRECISE_DIV=0, but a slightly different value of 0.2759352028369903564453125 when SSE2NEON_PRECISE_DIV=1. With this change, the result of _mm_div_ps matches the x64 platform regardless of the value of SSE2NEON_PRECISE_DIV.

On the Blender side, the original issue that required the more precise version of the intrinsics is still solved, and all tests pass. So there is no impact on the correctness of the work done earlier by Brecht in this area.

jserv · 2024-05-14T22:28:50Z

Thank @sergeyvfx for contributing!

sergeyvfx requested review from jserv and marktwtn as code owners May 14, 2024 07:50

jserv requested review from Cuda-Chen and howjmay May 14, 2024 14:37

howjmay approved these changes May 14, 2024

jserv merged commit de0538f into DLTcollab:master May 14, 2024
16 checks passed

chenrui333 mentioned this pull request Dec 26, 2024

sse2neon 1.8.0 Homebrew/homebrew-core#202465

Merged
