Fix with _mm_div_ps when SSE2NEON_PRECISE_DIV=1 #631
Merged
On 64-bit ARM platforms there is no need to implement _mm_div_ps with a reciprocal estimate (rcp) plus Newton-Raphson (NR) steps, because there is an exact instruction for the division. In fact, using rcp and an NR step gives _mm_div_ps a different precision from native SSE, and also causes the sse2neon tests to fail.
With this change, SSE2NEON_PRECISE_DIV only adds an extra NR step when building for 32-bit ARM platforms. On 64-bit platforms, _mm_div_ps now produces exactly the same result as this intrinsic gives on x64 platforms. It also fixes the regression tests when compiled with SSE2NEON_PRECISE_DIV=1.
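For context, the refinement in question is the standard Newton-Raphson iteration for a reciprocal. The short derivation below is a general property of that iteration, not something taken from the sse2neon sources; $x_n$ denotes the current estimate of $1/b$ and $e_n$ its relative error:

$$
x_{n+1} = x_n\,(2 - b\,x_n), \qquad e_n = 1 - b\,x_n
$$

$$
e_{n+1} = 1 - b\,x_n\,(2 - b\,x_n) = (1 - b\,x_n)^2 = e_n^2
$$

In exact arithmetic the error squares on every step. In single precision, however, each step is rounded, and the final product $a \cdot x_n$ is rounded once more, so the quotient is not guaranteed to be correctly rounded the way a hardware division is. That is consistent with a result that differs from native SSE in the last bit.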
There is no change in behavior or precision of _mm_div_ps on 32bit ARM platforms.
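The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the actual sse2neon code: `div4_sketch` and `SKETCH_PRECISE_DIV` are hypothetical names standing in for `_mm_div_ps` and `SSE2NEON_PRECISE_DIV`, the function works on plain `float[4]` arrays instead of `__m128`, and the scalar fallback exists only so the sketch builds on non-ARM hosts.

```c
#include <stddef.h>

/* Hypothetical stand-in for SSE2NEON_PRECISE_DIV; not the library's macro. */
#ifndef SKETCH_PRECISE_DIV
#define SKETCH_PRECISE_DIV 1
#endif

#if defined(__aarch64__) || defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Divide four floats lane by lane: out[i] = a[i] / b[i]. */
static void div4_sketch(const float a[4], const float b[4], float out[4])
{
#if defined(__aarch64__)
    /* 64-bit ARM: use the vector divide instruction directly, regardless of
     * the precision flag. */
    vst1q_f32(out, vdivq_f32(vld1q_f32(a), vld1q_f32(b)));
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* 32-bit ARM with NEON: reciprocal estimate plus one Newton-Raphson step.
     * vrecpsq_f32(x, b) computes (2 - x * b). */
    float32x4_t vb = vld1q_f32(b);
    float32x4_t recip = vrecpeq_f32(vb);
    recip = vmulq_f32(recip, vrecpsq_f32(recip, vb));
#if SKETCH_PRECISE_DIV
    /* Extra Newton-Raphson step, only taken on this 32-bit path. */
    recip = vmulq_f32(recip, vrecpsq_f32(recip, vb));
#endif
    vst1q_f32(out, vmulq_f32(vld1q_f32(a), recip));
#else
    /* Non-ARM host (assumption for portability): plain scalar division. */
    for (size_t i = 0; i < 4; ++i)
        out[i] = a[i] / b[i];
#endif
}
```

With this shape, the precision flag has no effect on the 64-bit path, which is the property the description relies on when it says the result matches x64 regardless of the flag.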
To reproduce the issue this change addresses, the sse2neon tests can be used:
Before this change the tests fail on Apple M2 Ultra; after this change they pass.
To compare the behavior of _mm_div_ps with the native SSE instruction, the following values were used:
The native _mm_div_ps on an i7-6770HQ gives 0.275935232639312744140625. Before this change, sse2neon on Apple M2 Ultra gave exactly the same value with SSE2NEON_PRECISE_DIV=0, but a slightly different value, 0.2759352028369903564453125, with SSE2NEON_PRECISE_DIV=1. With this change, the result of _mm_div_ps matches the x64 platform regardless of the value of SSE2NEON_PRECISE_DIV.
On the Blender side, the original issue that required more precise versions of the intrinsics is still solved, and all tests pass. So there is no impact on the correctness of the work done earlier by Brecht in this area.