From 85edff96955eb2850c02a62e64fac4563531d3e4 Mon Sep 17 00:00:00 2001
From: Sergey Sharybin <sergey@blender.org>
Date: Fri, 3 May 2024 10:52:00 +0200
Subject: [PATCH] Fix _mm_div_ps precision when SSE2NEON_PRECISE_DIV=1

On 64bit ARM platforms there is no need to use rcp with NR (Newton-Raphson)
steps to implement _mm_div_ps, as there is an exact instruction for the
division. In fact, using rcp and an NR step makes _mm_div_ps have different
precision from native SSE, and also makes the sse2neon tests fail.

This change makes SSE2NEON_PRECISE_DIV only add an extra NR step when
building for 32bit ARM platforms. On 64bit platforms _mm_div_ps now gives
exactly the same result as this intrinsic does on X64 platforms. It also
fixes the regression tests when compiled with SSE2NEON_PRECISE_DIV=1.

There is no change in behavior or precision of _mm_div_ps on 32bit ARM
platforms.

To reproduce the issue this change addresses, the sse2neon tests can be
used:

    make clean && CXXFLAGS='-DSSE2NEON_PRECISE_DIV=1' make && ./tests/main

Before this change the tests fail on Apple M2 Ultra, after this change
they pass.

To compare the behavior of _mm_div_ps with the native SSE instruction the
following values were used:

    _mm_div_ps(_mm_set1_ps(0.00160000019241124391555786132812),
               _mm_set1_ps(0.00579846277832984924316406250000))

The native _mm_div_ps on i7-6770HQ gives a value of 0.275935232639312744140625.
Before this change sse2neon on Apple M2 Ultra gave exactly the same value with
SSE2NEON_PRECISE_DIV=0, but a slightly different value of
0.2759352028369903564453125 with SSE2NEON_PRECISE_DIV=1. With this change the
result of _mm_div_ps matches the X64 platform regardless of the value of
SSE2NEON_PRECISE_DIV.
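
As an illustration of why a reciprocal-based division can differ from an exact
one, here is a minimal portable scalar sketch. It is not the sse2neon
implementation: div_exact, coarse_rcp and div_via_rcp are hypothetical names,
and the 8-bit truncation in coarse_rcp is only an assumed stand-in for a
hardware reciprocal estimate.

```c
#include <stdint.h>
#include <string.h>

/* Correctly rounded division: a single rounding of the exact quotient. */
static float div_exact(float a, float b)
{
    return a / b;
}

/* Hypothetical stand-in for a hardware reciprocal estimate: keep only the
 * top 8 mantissa bits of 1/b. Assumes b is a positive normal float. */
static float coarse_rcp(float b)
{
    float r = 1.0f / b;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF8000u; /* clear the low 15 of the 23 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Division as a * (1/b), refining the estimate with Newton-Raphson steps
 * r' = r * (2 - b * r). Every operation rounds, so the result is close to,
 * but not guaranteed to be equal to, div_exact(a, b). */
static float div_via_rcp(float a, float b, int nr_steps)
{
    float r = coarse_rcp(b);
    for (int i = 0; i < nr_steps; i++)
        r = r * (2.0f - b * r);
    return a * r;
}
```

Calling div_via_rcp(a, b, n) with the two values above and comparing it to
div_exact(a, b) shows whether the two approaches agree bit for bit for a given
number of NR steps.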

On the Blender side the original issue which required a more precise version
of the intrinsics is still solved, and all tests are passing. So there is no
impact on the correctness of the work done by Brecht earlier in this area.
---
 sse2neon.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sse2neon.h b/sse2neon.h
index 48da95fa..fcd848cc 100644
--- a/sse2neon.h
+++ b/sse2neon.h
@@ -62,7 +62,7 @@
 #ifndef SSE2NEON_PRECISE_MINMAX
 #define SSE2NEON_PRECISE_MINMAX (0)
 #endif
-/* _mm_rcp_ps and _mm_div_ps */
+/* _mm_rcp_ps */
 #ifndef SSE2NEON_PRECISE_DIV
 #define SSE2NEON_PRECISE_DIV (0)
 #endif
@@ -1724,7 +1724,7 @@ FORCE_INLINE int64_t _mm_cvttss_si64(__m128 a)
 // https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_div_ps
 FORCE_INLINE __m128 _mm_div_ps(__m128 a, __m128 b)
 {
-#if (defined(__aarch64__) || defined(_M_ARM64)) && !SSE2NEON_PRECISE_DIV
+#if defined(__aarch64__) || defined(_M_ARM64)
     return vreinterpretq_m128_f32(
         vdivq_f32(vreinterpretq_f32_m128(a), vreinterpretq_f32_m128(b)));
 #else