Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New kernel for reciprocal #643

Closed
wants to merge 4 commits into from
Closed

New kernel for reciprocal #643

wants to merge 4 commits into from

Conversation

Ka-zam
Copy link
Contributor

@Ka-zam Ka-zam commented Oct 13, 2023

New kernel that calculates the reciprocal:

volk_profile -R rec -i 10000

RUN_VOLK_TESTS: volk_32f_reciprocal_32f(131071,10000)
generic completed in 119.092 ms
a_avx2_fma completed in 94.7562 ms
a_avx completed in 97.1054 ms
a_avx_div completed in 118.053 ms
u_avx2_fma completed in 97.3939 ms
u_avx completed in 98.4622 ms
u_avx_div completed in 129.018 ms
Best aligned arch: a_avx2_fma
Best unaligned arch: u_avx2_fma

Signed-off-by: Magnus Lundmark <[email protected]>
Signed-off-by: Magnus Lundmark <[email protected]>
Signed-off-by: Magnus Lundmark <[email protected]>
Copy link
Contributor

@jdemel jdemel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like an interesting kernel. I have a few comments.
My biggest worry would be to see this kernel fail randomly with some 1.0f / 1e-xx value.

include/volk/volk_avx2_fma_intrinsics.h Outdated Show resolved Hide resolved
kernels/volk/volk_32f_reciprocal_32f.h Outdated Show resolved Hide resolved
kernels/volk/volk_32f_reciprocal_32f.h Outdated Show resolved Hide resolved
kernels/volk/volk_32f_reciprocal_32f.h Outdated Show resolved Hide resolved
kernels/volk/volk_32f_reciprocal_32f.h Outdated Show resolved Hide resolved
Signed-off-by: Magnus Lundmark <[email protected]>
@Ka-zam
Copy link
Contributor Author

Ka-zam commented Oct 14, 2023

This looks like an interesting kernel. I have a few comments. My biggest worry would be to see this kernel fail randomly with some 1.0f / 1e-xx value.

I ran the code for some special values:

	x	x		generic		div		rcp		avx2_fma		avx
-0.000000e+00	 -0x0p+0	    -inf	    -inf	    -inf	    -nan	    -nan
+0.000000e+00	 +0x0p+0	    +inf	    +inf	    +inf	    -nan	    -nan
+3.673420e-40	+0x1p-131	    +inf	    +inf	    +inf	    -inf	    -inf
+5.877472e-39	+0x1p-127	+1.701412e+38	+1.701412e+38	    +inf	    -inf	    -inf
+1.175494e-38	+0x1.fffffp-127	+8.507063e+37	+8.507063e+37	    +inf	    -inf	    -inf
+1.175494e-38	+0x1p-126	+8.507059e+37	+8.507059e+37	+8.504982e+37	+8.507059e+37	+8.507059e+37
+4.253530e+37	+0x1p+125	+2.350989e-38	+2.350989e-38	+2.350415e-38	+2.350989e-38	+2.350989e-38
    +inf	    +inf	+0.000000e+00	+0.000000e+00	+0.000000e+00	    -nan	    -nan

div is the _mm256_div_ps instruction
rcp is the _mm256_rcp_ps instruction

So, the proposed implementation fails for special values more or less spectactularly. However, it's 20% faster for valid inputs and more accurate than _mm256_rcp_ps.

_mm256_rcp_ps is 10% faster still but not as accurate in the valid domain.

@Ka-zam
Copy link
Contributor Author

Ka-zam commented Oct 17, 2023

I did some more experiments and tried to add checks for nan, 0.0 and inf. Turns out it's not worth it and then it's better to just use div_ps()

So, 30% faster but

  • returns nan instead of inf for 0.0 input
  • returns nan instead of 0.0 for inf input

@jdemel
Copy link
Contributor

jdemel commented Nov 4, 2023

How do we prevent that this kernel is constantly causing random CI failures? Values close to 0.0f are likely in our tests. We'd need to make sure that we don't see flaky kernel issues raised.

Also, the inf vs nan issue needs to be discussed. If I'd use this kernel I'd expect that 1.0f / inf = 0.0f. This is not consistent across all implementations. This would probably cause issues in the future.
We had a similar issue in #622 and eventually had to remove some kernel implementations #658 . I'm worried this kernel will see the same fate. We should find a solution for this issue before we merge it.

@Ka-zam
Copy link
Contributor Author

Ka-zam commented Nov 21, 2023

Values close to 0.f are no problem, it's only a very slight problem for subnormals where there's decreased accuracy.

inf vs nan is more of a problem. Perhaps that can be looked at but the problem is we can't do too much work before the div_ps( 1.f / x ) instruction is faster.

I will investigate.

@Ka-zam Ka-zam closed this Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants