bc7enc: Optimize "find approximate selector" branch chains #24

Open

@abbriggs abbriggs commented Sep 24, 2024

Description

Several BC7 code paths have branch chains which sequentially compare a value against an array of thresholds. These chains are long enough that compilers have trouble converting them to branchless operations.

In all of these code paths, the value produced by the branch chain is a direct dependency of the subsequent code. This often results in a pipeline stall, because the branches can't be easily predicted.

To improve this, convert each branch chain to a branchless loop. Compiler optimizations will inline and unroll the loop, significantly improving codegen and making room for further compiler optimizations (such as auto-vectorization).
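
For illustration, here is a minimal sketch of the kind of transformation described above. It is not the actual bc7enc code: the threshold table `kThresholds` and the function names `select_branchy` and `select_branchless` are hypothetical, and the sketch assumes the thresholds are sorted ascending and the input is not NaN.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical ascending threshold table (midpoints between 8 evenly
// spaced levels). The real tables in bc7enc are different.
static const float kThresholds[7] = {
    1.0f / 14.0f, 3.0f / 14.0f, 5.0f / 14.0f, 7.0f / 14.0f,
    9.0f / 14.0f, 11.0f / 14.0f, 13.0f / 14.0f
};

// Branch chain: each comparison is a separate, hard-to-predict branch,
// and the result feeds directly into whatever code follows.
static inline uint32_t select_branchy(float t)
{
    if (t < kThresholds[0]) return 0;
    else if (t < kThresholds[1]) return 1;
    else if (t < kThresholds[2]) return 2;
    else if (t < kThresholds[3]) return 3;
    else if (t < kThresholds[4]) return 4;
    else if (t < kThresholds[5]) return 5;
    else if (t < kThresholds[6]) return 6;
    return 7;
}

// Branchless loop: count how many thresholds t has reached. With sorted
// thresholds this count equals the selector from the chain above. The
// fixed trip count lets the compiler unroll the loop into compare-and-add
// sequences with no data-dependent branches.
static inline uint32_t select_branchless(float t)
{
    uint32_t selector = 0;
    for (size_t i = 0; i < 7; ++i)
        selector += (t >= kThresholds[i]) ? 1u : 0u;
    return selector;
}
```

The two functions agree for every non-NaN input as long as the table is sorted. For a NaN input they differ (the chain falls through to 7, the loop returns 0), so a real change would need to confirm that NaN cannot reach these code paths.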

Results

I compiled bc7enc.exe using Clang-CL 17 on Windows and ran it on an AMD Ryzen 9 7950X system for this data. I did spot-check MSVC, and it appears to receive similar performance benefits.

Before changes:

Command: ./bc7enc.exe tv_albedo_1024x1024.png
Total encoding time: 0.197000 secs
Total processing time: 0.206000 secs

Command: ./bc7enc.exe camera-mountain-3024x4032.png
Total encoding time: 4.757000 secs
Total processing time: 4.772000 secs

After changes:

Command: ./bc7enc.exe tv_albedo_1024x1024.png
Total encoding time: 0.186000 secs
Total processing time: 0.195000 secs

Command: ./bc7enc.exe camera-mountain-3024x4032.png
Total encoding time: 4.429000 secs
Total processing time: 4.445000 secs

If needed, I can provide some images that show the difference in x86 codegen before and after the changes.
