-
Notifications
You must be signed in to change notification settings - Fork 205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
volk_64u_byteswap_neonv8 incorrect results #605
Comments
What version of gcc? Do the volk Qa tests catch this failure?
Philip
…On 10/24/22 10:20, jsallay wrote:
I am using this kernel on an nvidia jetson and noticed that I am getting
incorrect results. To debug, I made a simple program that shows the issue.
|#include <arm_neon.h> #include <cstddef> #include <cstdio> int main() {
uint64_t data[] = {0x0706050403020100, 0x1716151413121110,
0x2726252423222120, 0x3736353433323130}; uint8x16x2_t input; uint8x16_t
idx = { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 }; for
(size_t i = 0; i < 32; i++) printf("%02x ", ((char*)(&data))[i]);
printf("\n"); input = vld2q_u8((uint8_t*)data); input.val[0] =
vqtbl1q_u8(input.val[0], idx); input.val[1] = vqtbl1q_u8(input.val[1],
idx); vst2q_u8((uint8_t*)data, input); for (size_t i = 0; i < 32; i++)
printf("%02x ", ((char*)(&data))[i]); printf("\n"); } |
I've tried this program out on an armv8 nivdia jetson nx and using QEMU
in docker. In both cases, the output does not match what is expected.
Doing some digging, it appears that the load instruction isn't doing
what I would expect:
|vld2q_u8 Load multiple 2-element structures to two registers. This
instruction loads multiple 2-element structures from memory and writes
the result to the two SIMD&FP registers, with de-interleaving. |
The instruction is loading in the 32-bytes, but assuming that they are
interleaved and moving the memory around before we do the shuffle
instruction.
Switching the load and store to:
|// input = vld2q_u8((uint8_t*)data); Doesn't work input =
vld1q_u8_x2((uint8_t*)data); // vst2q_u8((uint8_t*)data, input); Doesn't
work vst1q_u8_x2((uint8_t*)data, input); |
Appears to solve the problem.
—
Reply to this email directly, view it on GitHub
<#605>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAA5HEZXYPLTV3HB3XBK6ELWE2LJ7ANCNFSM6AAAAAARNA2ZWI>.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
I'm using gcc9 in ubuntu 20.04. All of the unit tests pass. |
If we change anything here, we should be sure that it actually improves things. Since the unit test passes, this is a very difficult to catch and test bug. Is this still an issue? |
The code is definitely broken. I was able to verify it very easily by calling the function. Let me do some research and see if I can figure why it passes the unit tests. |
This was fixed by #680. |
I am using this kernel on an nvidia jetson and noticed that I am getting incorrect results. To debug, I made a simple program that shows the issue.
I've tried this program out on an armv8 nivdia jetson nx and using QEMU in docker. In both cases, the output does not match what is expected. Doing some digging, it appears that the load instruction isn't doing what I would expect:
The instruction is loading in the 32-bytes, but assuming that they are interleaved and moving the memory around before we do the shuffle instruction.
Switching the load and store to:
Appears to solve the problem.
The text was updated successfully, but these errors were encountered: