Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

volk_64u_byteswap_neonv8 incorrect results #605

Closed
jsallay opened this issue Oct 24, 2022 · 5 comments
Closed

volk_64u_byteswap_neonv8 incorrect results #605

jsallay opened this issue Oct 24, 2022 · 5 comments

Comments

@jsallay
Copy link
Contributor

jsallay commented Oct 24, 2022

I am using this kernel on an nvidia jetson and noticed that I am getting incorrect results. To debug, I made a simple program that shows the issue.

#include <arm_neon.h>
#include <cstddef>
#include <cstdio>

int main() {
        uint64_t data[] = {0x0706050403020100, 0x1716151413121110, 0x2726252423222120, 0x3736353433323130};
        uint8x16x2_t input;
        uint8x16_t idx = { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 };
        
        for (size_t i = 0; i < 32; i++)
                printf("%02x ", ((char*)(&data))[i]);
        printf("\n");

        input = vld2q_u8((uint8_t*)data);
        input.val[0] = vqtbl1q_u8(input.val[0], idx);
        input.val[1] = vqtbl1q_u8(input.val[1], idx);
        vst2q_u8((uint8_t*)data, input);

        for (size_t i = 0; i < 32; i++)
                printf("%02x ", ((char*)(&data))[i]);
        printf("\n");
}

I've tried this program out on an armv8 nivdia jetson nx and using QEMU in docker. In both cases, the output does not match what is expected. Doing some digging, it appears that the load instruction isn't doing what I would expect:

vld2q_u8
Load multiple 2-element structures to two registers. This instruction loads multiple 2-element structures
from memory and writes the result to the two SIMD&FP registers, with de-interleaving.

The instruction is loading in the 32-bytes, but assuming that they are interleaved and moving the memory around before we do the shuffle instruction.

Switching the load and store to:

//  input = vld2q_u8((uint8_t*)data);  Doesn't work
input = vld1q_u8_x2((uint8_t*)data);
// vst2q_u8((uint8_t*)data, input);  Doesn't work
vst1q_u8_x2((uint8_t*)data, input);

Appears to solve the problem.

@balister
Copy link
Contributor

balister commented Oct 24, 2022 via email

@jsallay
Copy link
Contributor Author

jsallay commented Oct 24, 2022

I'm using gcc9 in ubuntu 20.04. All of the unit tests pass.

@jdemel
Copy link
Contributor

jdemel commented Oct 22, 2023

If we change anything here, we should be sure that it actually improves things. Since the unit test passes, this is a very difficult to catch and test bug.

Is this still an issue?

@jsallay
Copy link
Contributor Author

jsallay commented Oct 26, 2023

The code is definitely broken. I was able to verify it very easily by calling the function. Let me do some research and see if I can figure why it passes the unit tests.

@argilo
Copy link
Member

argilo commented Nov 4, 2023

This was fixed by #680.

@argilo argilo closed this as completed Nov 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants