volk_64u_byteswap_neonv8 incorrect results #605

jsallay · 2022-10-24T14:20:05Z

I am using this kernel on an nvidia jetson and noticed that I am getting incorrect results. To debug, I made a simple program that shows the issue.

#include <arm_neon.h>
#include <cstddef>
#include <cstdio>

int main() {
        uint64_t data[] = {0x0706050403020100, 0x1716151413121110, 0x2726252423222120, 0x3736353433323130};
        uint8x16x2_t input;
        uint8x16_t idx = { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 };
        
        for (size_t i = 0; i < 32; i++)
                printf("%02x ", ((char*)(&data))[i]);
        printf("\n");

        input = vld2q_u8((uint8_t*)data);
        input.val[0] = vqtbl1q_u8(input.val[0], idx);
        input.val[1] = vqtbl1q_u8(input.val[1], idx);
        vst2q_u8((uint8_t*)data, input);

        for (size_t i = 0; i < 32; i++)
                printf("%02x ", ((char*)(&data))[i]);
        printf("\n");
}

I've tried this program out on an armv8 nivdia jetson nx and using QEMU in docker. In both cases, the output does not match what is expected. Doing some digging, it appears that the load instruction isn't doing what I would expect:

vld2q_u8
Load multiple 2-element structures to two registers. This instruction loads multiple 2-element structures
from memory and writes the result to the two SIMD&FP registers, with de-interleaving.

The instruction is loading in the 32-bytes, but assuming that they are interleaved and moving the memory around before we do the shuffle instruction.

Switching the load and store to:

//  input = vld2q_u8((uint8_t*)data);  Doesn't work
input = vld1q_u8_x2((uint8_t*)data);
// vst2q_u8((uint8_t*)data, input);  Doesn't work
vst1q_u8_x2((uint8_t*)data, input);

Appears to solve the problem.

The text was updated successfully, but these errors were encountered:

balister · 2022-10-24T14:36:13Z

What version of gcc? Do the volk Qa tests catch this failure? Philip

…

On 10/24/22 10:20, jsallay wrote: I am using this kernel on an nvidia jetson and noticed that I am getting incorrect results. To debug, I made a simple program that shows the issue. |#include <arm_neon.h> #include <cstddef> #include <cstdio> int main() { uint64_t data[] = {0x0706050403020100, 0x1716151413121110, 0x2726252423222120, 0x3736353433323130}; uint8x16x2_t input; uint8x16_t idx = { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 }; for (size_t i = 0; i < 32; i++) printf("%02x ", ((char*)(&data))[i]); printf("\n"); input = vld2q_u8((uint8_t*)data); input.val[0] = vqtbl1q_u8(input.val[0], idx); input.val[1] = vqtbl1q_u8(input.val[1], idx); vst2q_u8((uint8_t*)data, input); for (size_t i = 0; i < 32; i++) printf("%02x ", ((char*)(&data))[i]); printf("\n"); } | I've tried this program out on an armv8 nivdia jetson nx and using QEMU in docker. In both cases, the output does not match what is expected. Doing some digging, it appears that the load instruction isn't doing what I would expect: |vld2q_u8 Load multiple 2-element structures to two registers. This instruction loads multiple 2-element structures from memory and writes the result to the two SIMD&FP registers, with de-interleaving. | The instruction is loading in the 32-bytes, but assuming that they are interleaved and moving the memory around before we do the shuffle instruction. Switching the load and store to: |// input = vld2q_u8((uint8_t*)data); Doesn't work input = vld1q_u8_x2((uint8_t*)data); // vst2q_u8((uint8_t*)data, input); Doesn't work vst1q_u8_x2((uint8_t*)data, input); | Appears to solve the problem. — Reply to this email directly, view it on GitHub <#605>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAA5HEZXYPLTV3HB3XBK6ELWE2LJ7ANCNFSM6AAAAAARNA2ZWI>. You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

jsallay · 2022-10-24T15:23:18Z

I'm using gcc9 in ubuntu 20.04. All of the unit tests pass.

jdemel · 2023-10-22T16:04:29Z

If we change anything here, we should be sure that it actually improves things. Since the unit test passes, this is a very difficult to catch and test bug.

Is this still an issue?

jsallay · 2023-10-26T21:15:17Z

The code is definitely broken. I was able to verify it very easily by calling the function. Let me do some research and see if I can figure why it passes the unit tests.

argilo · 2023-11-04T10:53:39Z

This was fixed by #680.

jsallay mentioned this issue Oct 28, 2023

Poor random numbers in uint tests. #688

Closed

argilo closed this as completed Nov 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

volk_64u_byteswap_neonv8 incorrect results #605

volk_64u_byteswap_neonv8 incorrect results #605

jsallay commented Oct 24, 2022

balister commented Oct 24, 2022 via email

jsallay commented Oct 24, 2022

jdemel commented Oct 22, 2023

jsallay commented Oct 26, 2023

argilo commented Nov 4, 2023

volk_64u_byteswap_neonv8 incorrect results #605

volk_64u_byteswap_neonv8 incorrect results #605

Comments

jsallay commented Oct 24, 2022

balister commented Oct 24, 2022 via email

jsallay commented Oct 24, 2022

jdemel commented Oct 22, 2023

jsallay commented Oct 26, 2023

argilo commented Nov 4, 2023