-
Notifications
You must be signed in to change notification settings - Fork 12.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance regression: rustc failed to optimize _mm256_avg_epu8
after 1.75.0
#124216
Comments
Did you confirm that this is the responsible change or are you guessing? |
@Nugine This is definitely more instructions and more bytes on each, so I'm marking it with I-heavy, but it appears this comes with a performance regression. Can you be precise about which of the ~19 benchmarks you appear to run have regressed, and on what architecture? I would rather we not make the 2nd vpavgb instruction come back only for your algorithm to still be dog-slow because some of the other instructions are different. Also, can you be more precise on what architectures and with what target features you're testing on? GitHub is allowed to change the CPU you run benchmarks on, and does, because their fleet is not perfectly uniform, so |
Base64-decode in rust-lang/stdarch#1477 made the change. However, the root cause may be elsewhere, possibly LLVM. To see the asm, you can use the following commands. git clone https://github.com/Nugine/simd.git
cd simd
rustup override set 1.74.1 # or 1.75.0
RUSTFLAGS="--cfg vsimd_dump_symbols" cargo asm -p base64-simd --lib --simplify --target x86_64-unknown-linux-gnu --context 1 -- base64_simd::multiversion::decode::avx2 > base64-decode-avx2.asm
cat base64-decode-avx2.asm Target: x86_64-unknown-linux-gnu I have extracted the decode function and reproduced the regression. https://rust.godbolt.org/z/KG4cT6aPK
|
@Nugine re: the workaround: On current Rust, stable, the |
Seems the early exit somehow makes llvm loose track of the equivalence to #[target_feature(enable = "avx2")]
pub unsafe fn decode(
x: __m256i,
ch: __m256i,
ct: __m256i,
dh: __m256i,
dt: __m256i,
) -> Result<__m256i, __m256i> {
let shr3 = _mm256_srli_epi32::<3>(x);
let h1 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(ch, x));
let h2 = _mm256_avg_epu8(shr3, _mm256_shuffle_epi8(dh, x));
let o1 = _mm256_shuffle_epi8(ct, h1);
let o2 = _mm256_shuffle_epi8(dt, h2);
let c1 = _mm256_adds_epi8(x, o1);
let c2 = _mm256_add_epi8(x, o2);
if _mm256_movemask_epi8(c1) != 0 {
return Err(c2);
}
Ok(c2)
} But I guess this will break down as soon as the function gets inlined if the error value is not otherwise used. |
Cool! I'll try asm wrapper. |
based on jhorstmann's remark, it would be nicest to fix this in LLVM, since LLVM appears to have the information necessary to do this optimization, it just is missing it in the early-return case. I don't think partially reverting a diff is unwarranted, however. |
WG-prioritization assigning priority (Zulip discussion). @rustbot label -I-prioritize +P-medium |
Code
I tried this code:
https://rust.godbolt.org/z/KG4cT6aPK
I expected to see this happen: This code should emit two
vpavgb
instructions.Instead, this happened: One of the
vpavgb
instructions is missing.Nugine/simd#43
Version it worked on
It most recently worked on: 1.74.1
Version with regression
1.75.0 ~ nightly
@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged
The text was updated successfully, but these errors were encountered: