
Significantly improve performance by using chunks_exact(SIMD_LEN) and remainder() #10

Open · smu160 opened this issue Jul 19, 2024 · 4 comments

@smu160 (Contributor) commented Jul 19, 2024

The as_simd call is not guaranteed to give us the perfect split of prefix, middle, suffix (where we know middle will have everything we need). You'd probably have to guarantee some sort of alignment in order to make that work well. So we'll always have to check the prefix/suffix, and there's no guarantee that the middle will even have any elements. We can get around this by going back to the slice API and just using chunks_exact(SIMD_LEN), followed by checking chunks_exact(SIMD_LEN).remainder().

For example:

    fn all_equal_simd(&self) -> bool {
        let arr = self.as_slice();
        if arr.is_empty() {
            return true;
        }
        let first = arr[0];
        let simd_needle = Simd::splat(first);
        // Elements left over after the last full SIMD_LEN chunk.
        let remainder = arr.chunks_exact(SIMD_LEN).remainder();

        // Compare each full chunk against the splatted first element.
        for chunk in arr.chunks_exact(SIMD_LEN) {
            let simd_chunk = Simd::<T, SIMD_LEN>::from_slice(chunk);
            let mask = simd_chunk.simd_ne(simd_needle).to_bitmask();
            // Any set bit means some lane differed from `first`.
            if mask != 0 {
                return false;
            }
        }

        // Scalar check for the tail that didn't fill a full chunk.
        remainder.iter().all(|x| *x == first)
    }

You can see the generated asm for yourself on godbolt. You'll have to comment out my version and uncomment yours to see the difference in cycle count. Of course, the proof is in the pudding: using the benchmarks in PR #8, I got great preliminary results!

[Screenshot: preliminary benchmark results, 2024-07-18]

Let me know your thoughts. Thank you!!

@LaihoE (Owner) commented Jul 19, 2024

The reason as_simd() gives the odd splits is to guarantee that the middle elements are aligned to SIMD_LEN, allowing us to do aligned loads with SIMD. This should be a little faster than unaligned loads, especially with bigger N. On small N it is very possible that chunks_exact is faster.
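For anyone following along, here is a minimal sketch of the as_simd() split being described (my own illustration, assuming nightly Rust, u8 elements, and SIMD_LEN = 16; the crate itself is generic over T):

    #![feature(portable_simd)]
    use std::simd::prelude::*;

    const SIMD_LEN: usize = 16;

    fn all_equal_as_simd(arr: &[u8]) -> bool {
        let Some(&first) = arr.first() else { return true };
        let needle = Simd::<u8, SIMD_LEN>::splat(first);
        // as_simd picks the split point so `middle` is aligned for Simd
        // loads; the prefix/suffix lengths depend on the allocation's
        // address, and middle may be empty for short or unlucky slices.
        let (prefix, middle, suffix) = arr.as_simd::<SIMD_LEN>();
        prefix.iter().all(|&x| x == first)
            && middle.iter().all(|&v| v.simd_eq(needle).all())
            && suffix.iter().all(|&x| x == first)
    }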

I was also looking at the performance yesterday (it was much slower than memchr, etc.) and noticed that the if statements complicate unrolling, so we need to do a bit of the loop unrolling ourselves:

https://godbolt.org/z/88nM38hhn

should give something like this:

.LBB0_11:
        add     rdi, 4
        je      .LBB0_12
        movdqa  xmm1, xmmword ptr [r10 + 16]
        pcmpeqb xmm1, xmm0
        movdqa  xmm2, xmmword ptr [r10]
        pcmpeqb xmm2, xmm0
        movdqa  xmm3, xmmword ptr [r10 + 48]
        pcmpeqb xmm3, xmm0
        por     xmm3, xmm1
        movdqa  xmm1, xmmword ptr [r10 + 32]
        pcmpeqb xmm1, xmm0
        por     xmm1, xmm2
        movdqa  xmm2, xmmword ptr [r10 + 64]
        pcmpeqb xmm2, xmm0
        movdqa  xmm4, xmmword ptr [r10 + 80]
        pcmpeqb xmm4, xmm0
        movdqa  xmm5, xmmword ptr [r10 + 112]
        pcmpeqb xmm5, xmm0
        por     xmm5, xmm4
        por     xmm5, xmm3
        movdqa  xmm3, xmmword ptr [r10 + 96]
        pcmpeqb xmm3, xmm0
        por     xmm3, xmm2
        por     xmm3, xmm1
        por     xmm3, xmm5
        pmovmskb        r11d, xmm3
        sub     r10, -128
        test    r11d, r11d
        je      .LBB0_11

Using this trick I was able to get close to memchr() performance on position_simd() with u8's.
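For readers skimming the thread, a rough sketch of this kind of manual unrolling (my own illustration, assuming u8, SIMD_LEN = 16, and an unroll factor of 4; the linked godbolt code may differ in details):

    #![feature(portable_simd)]
    use std::simd::prelude::*;

    const SIMD_LEN: usize = 16;
    const UNROLL: usize = 4;

    fn all_equal_unrolled(arr: &[u8]) -> bool {
        let Some(&first) = arr.first() else { return true };
        let needle = Simd::<u8, SIMD_LEN>::splat(first);
        let mut chunks = arr.chunks_exact(SIMD_LEN * UNROLL);
        for big_chunk in &mut chunks {
            // OR the per-vector mismatch masks together and test once
            // per iteration, so the hot loop has a single branch.
            let mut acc = Mask::<i8, SIMD_LEN>::splat(false);
            for lane in big_chunk.chunks_exact(SIMD_LEN) {
                acc |= Simd::from_slice(lane).simd_ne(needle);
            }
            if acc.any() {
                return false;
            }
        }
        // Leftovers that didn't fill a full unrolled block.
        chunks.remainder().iter().all(|&x| x == first)
    }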

@smu160 (Contributor, Author) commented Jul 19, 2024

Hi @LaihoE,

Thanks for getting back to me! I took the liberty of modifying your version just a bit. You can check it on godbolt.

The performance seems way too good to be true. Would you be able to run it on your end?

@LaihoE (Owner) commented Jul 20, 2024

@smu160 I've also had some sus results with all_equal(); the other functions seem to have more reasonable results.

@smu160 (Contributor, Author) commented Jul 21, 2024

Hi @LaihoE,

I realized I had a bug in the implementation I shared via godbolt. For some reason I used eq instead of ne when checking the chunks. That's probably the cause of the issue. Now I wish I could reproduce the examples that were triggering it so I can add a regression test.

Anyhow, I'll finish the benchmarks to see how it does compared to the other versions we mentioned here.
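In case it helps, a hypothetical regression test for that eq/ne mix-up might look like this (assuming all_equal_simd is callable on a Vec as in the snippet above):

    #[test]
    fn all_equal_eq_ne_regression() {
        // All-equal input longer than one SIMD chunk: with eq swapped in
        // for ne, the non-zero "equal" bitmask wrongly reported a mismatch.
        let v = vec![7u8; 1000];
        assert!(v.all_equal_simd());

        // A single differing element inside a full chunk must be caught.
        let mut w = vec![7u8; 1000];
        w[500] = 8;
        assert!(!w.all_equal_simd());
    }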
