Various performance improvements #83
On TGL: improves simdsort performance on 16-bit data by up to 1.25x.
        }
    }
    else {
        X86_SIMD_SORT_UNROLL_LOOP(8)
        for (int ii = 0; ii < num_unroll; ++ii) {
            curr_vec[ii] = vtype::loadu(arr + left + ii * vtype::numlanes);
            _mm_prefetch(arr + left + ii * vtype::numlanes
Ugh, the formatting from clang-format is ugly here. We might need to tweak some parameters in the _clang-format file.
24701aa to dfa65db
Perf changes summary: Qsort: On SKX, up to 1.9x speedup for 32-bit data and up to 1.5x speedup for 64-bit data.
LGTM. Thanks for the awesome work @sterrettm2!
This merge request adds a number of performance enhancements:
Makes the small sorting algorithm more efficient (b52e889)
Changes how the partitioning is done in a few small ways (16e35b0)
Changes how the array is shortened to be a multiple of the correct length (d617059)
Changes how pivots are selected for larger arrays (91928b6)
Increases the amount of prefetching done (d4ecb7e)
Makes some smaller changes, such as adjusting a few of the parameters used.
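As background for the item about shortening the array to a multiple of the correct length, the general idea can be sketched as follows: split the input into a body whose length is a multiple of the vector width and a short remainder handled by scalar code. This is only an illustration of that general technique, not the approach taken in d617059. `round_down_to_lanes`, `count_below`, and the lane count of 8 are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: largest multiple of `lanes` that is <= n.
constexpr std::size_t round_down_to_lanes(std::size_t n, std::size_t lanes)
{
    return n - (n % lanes);
}

// Hypothetical example: counts the elements smaller than `pivot`.
// The body loop always sees whole blocks of kLanes elements, so a vector
// implementation would need no masking there; the leftover elements are
// handled by the scalar loop at the end.
std::size_t count_below(const int *arr, std::size_t n, int pivot)
{
    constexpr std::size_t kLanes = 8; // assumed vector width in elements
    const std::size_t body = round_down_to_lanes(n, kLanes);

    std::size_t count = 0;
    for (std::size_t i = 0; i < body; i += kLanes) {
        // Fixed trip count: stands in for one vector compare per block.
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            count += (arr[i + lane] < pivot) ? 1 : 0;
        }
    }
    // Scalar remainder: at most kLanes - 1 elements.
    for (std::size_t i = body; i < n; ++i) {
        count += (arr[i] < pivot) ? 1 : 0;
    }
    return count;
}
```

Keeping the remainder out of the main loop means the hot path has a fixed shape, which is usually what makes unrolling and prefetching straightforward.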
Note that this was tested on a 7900x, so the 16-bit performance results should be ignored.