Improve SYCL backend __parallel_for performance for large input sizes #1870

Open
wants to merge 19 commits into base: main

Conversation

Contributor

@mmichel11 mmichel11 commented Sep 20, 2024

Summary

This PR improves __parallel_for performance for large input sizes by switching to an nd-range kernel that processes multiple inputs per work item, which enables us to use the full hardware memory bandwidth.

Details

On some target architectures, we are currently not hitting roofline memory bandwidth performance in our __parallel_for pattern. The cause is that our SYCL basic kernel implementation only processes a single element per work item, which is insufficient to fully utilize memory bandwidth on these architectures. Processing multiple inputs per work item enables us to perform enough loads / stores to saturate the hardware bandwidth, and explicitly using a coalesced pattern through either a sub-group or work-group stride ensures a good access pattern.

An nd-range kernel has been added for large input sizes that uses a heuristic, based on the smallest-sized type in the set of provided ranges, to determine the number of iterations to process per work item. This drastically improves performance on these architectures for large inputs across nearly all for-based algorithms.

A second kernel has been added, rather than merging both paths into a single kernel, to avoid extra runtime dispatch within the kernel, which hurts performance for small inputs. There is a smaller runtime overhead in selecting the best path from the host and compiling two kernels. For small-to-medium inputs, the SYCL basic kernel performs best.
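For illustration, the large-input path works roughly like this (a minimal sketch, not the PR's actual kernel; the work-group size, the __iters_per_work_item value, and the brick invocation are illustrative stand-ins):

// Minimal sketch: each work item in an nd-range kernel processes
// __iters_per_work_item elements with a work-group-sized stride, so
// neighboring work items touch neighboring elements and loads / stores
// stay coalesced.
constexpr std::size_t __iters_per_work_item = 8;  // chosen by the heuristic
const std::size_t __work_group_size = 256;        // example value
const std::size_t __elems_per_group = __work_group_size * __iters_per_work_item;
const std::size_t __num_groups = (__count + __elems_per_group - 1) / __elems_per_group;

__cgh.parallel_for(
    sycl::nd_range<1>{__num_groups * __work_group_size, __work_group_size},
    [=](sycl::nd_item<1> __ndi) {
        const std::size_t __start =
            __ndi.get_group(0) * __elems_per_group + __ndi.get_local_id(0);
        for (std::size_t __i = 0; __i < __iters_per_work_item; ++__i)
        {
            const std::size_t __idx = __start + __i * __work_group_size;
            if (__idx < __count)
                __brick(__idx, __rngs...); // apply the algorithm's brick
        }
    });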

Signed-off-by: Matthew Michel <[email protected]>
Inspecting the assembly shows that 128-byte memory operations are performed
instead of 512-byte ones. Processing 512 bytes per sub-group still seems to be
the best value after experimentation.

Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
…ute work for small inputs

Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
This reverts commit e4cbceb. Small
sizes are slightly slower, and no "real" benefit is observed for horizontal
vectorization.
Small but measurable overheads can be observed for small inputs when
runtime dispatch is present in the kernel to check for the correct path
to take. Letting the compiler handle the small input case in the
original kernel shows the best performance.

Signed-off-by: Matthew Michel <[email protected]>
We now flatten the user-provided ranges and find the minimum sized type
to estimate the best __iters_per_work_item. This benefits performance in
calls that wrap multiple buffers in a single input / output through a
zip_iterator (e.g. dpct::scatter_if in SYCLomatic compatibility headers).

Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
// vectorize our code. Vectorization may also improve performance of for-algorithms over small data
// types.
auto [__idx, __group_start_idx, __stride, __is_full] =
__stride_recommender(__ndi, __count, __iters_per_work_item, __work_group_size);
Contributor

Maybe it would be better to specify __stride_recommender as a template parameter type, with the ability to change it from the caller side if required?
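Something along these lines, perhaps (a rough sketch of the suggestion with hypothetical names, not the PR's actual interface):

// Hypothetical sketch: make the stride policy a customization point. The
// submitter would take _StrideRecommender as a template parameter defaulting
// to the current behavior, so a caller could swap in a different policy.
struct __work_group_stride_recommender
{
    std::tuple<std::size_t, std::size_t, std::size_t, bool>
    operator()(const sycl::nd_item<1>& __ndi, std::size_t __count,
               std::size_t __iters_per_work_item, std::size_t __work_group_size) const
    {
        const std::size_t __group_start_idx =
            __ndi.get_group(0) * __work_group_size * __iters_per_work_item;
        const std::size_t __idx = __group_start_idx + __ndi.get_local_id(0);
        const bool __is_full =
            __group_start_idx + __work_group_size * __iters_per_work_item <= __count;
        return {__idx, __group_start_idx, __work_group_size, __is_full};
    }
};

// Inside the kernel body, with _StrideRecommender as a template parameter of the submitter:
// auto [__idx, __group_start_idx, __stride, __is_full] =
//     _StrideRecommender{}(__ndi, __count, __iters_per_work_item, __work_group_size);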

Comment on lines 322 to 338
else
{
// Recompute iters per item and manually unroll last loop iteration to remove most branching.
if (__group_start_idx >= __count)
return;
const std::uint8_t __adjusted_iters_per_work_item =
oneapi::dpl::__internal::__dpl_ceiling_div(__count - __group_start_idx, __stride);
for (std::uint8_t __i = 0; __i < __adjusted_iters_per_work_item - 1; ++__i)
{
__brick(__idx, __rngs...);
__idx += __stride;
}
if (__idx < __count)
{
__brick(__idx, __rngs...);
}
}
Contributor

@SergeyKopienko SergeyKopienko Sep 24, 2024

Suggested change
else
{
// Recompute iters per item and manually unroll last loop iteration to remove most branching.
if (__group_start_idx >= __count)
return;
const std::uint8_t __adjusted_iters_per_work_item =
oneapi::dpl::__internal::__dpl_ceiling_div(__count - __group_start_idx, __stride);
for (std::uint8_t __i = 0; __i < __adjusted_iters_per_work_item - 1; ++__i)
{
__brick(__idx, __rngs...);
__idx += __stride;
}
if (__idx < __count)
{
__brick(__idx, __rngs...);
}
}
// Recompute iters per item and manually unroll last loop iteration to remove most branching.
if (__group_start_idx < __count)
{
const std::uint8_t __adjusted_iters_per_work_item =
oneapi::dpl::__internal::__dpl_ceiling_div(__count - __group_start_idx, __stride);
for (std::uint8_t __i = 0; __i < __adjusted_iters_per_work_item - 1; ++__i)
{
__brick(__idx, __rngs...);
__idx += __stride;
}
if (__idx < __count)
{
__brick(__idx, __rngs...);
}
}

Contributor Author

A similar change has been committed.

using __large_submitter = __parallel_for_large_submitter<_ForKernelLarge, _Ranges...>;
// Compile two kernels: one for small-to-medium inputs and a second for large. This avoids runtime checks within a single
// kernel that worsen performance for small cases.
if (__count < __large_submitter::__estimate_best_start_size(__exec))
Contributor

I am not sure that __large_submitter is able to calculate the best size in all cases; it doesn't depend on the kernel code and its logic at all. So maybe we should have the ability to customize this condition check too?

Contributor

It's an interesting point. I'm trying to think about how different workloads / algorithms would impact this number...

I think the decision largely depends on overhead vs. memory bandwidth optimization. It's possible that more computation would make this less important because we would be less reliant on memory bandwidth. However, the dependence on user-provided callables in the library makes it very difficult to make good decisions, unless there are some APIs that use parallel_for internally and are known to always be computationally heavy (I don't know of any).

Another aspect to consider: the larger the minimum type size in the input ranges, the fewer iterations the large submitter would run. At the limit, I imagine there is no advantage to the large submitter when we would run only a single iteration. This is knowable at compile time; perhaps we should detect this case and always choose the small submitter when a single iteration would be used.
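For the compile-time detection idea, a minimal sketch (the submitter names and the constexpr iteration count are placeholders for illustration):

#include <cstddef>
#include <iostream>

// Sketch of the idea only: when the heuristic yields one iteration per work
// item at compile time, always take the small (basic-kernel) path and skip
// the large-input runtime check entirely.
template <std::size_t __iters_per_work_item>
const char*
__select_path(std::size_t __count, std::size_t __large_threshold)
{
    if constexpr (__iters_per_work_item == 1)
        return "small submitter"; // decided at compile time, no runtime branch
    else
        return __count < __large_threshold ? "small submitter" : "large submitter";
}

int
main()
{
    std::cout << __select_path<1>(1 << 20, 1 << 16) << '\n'; // small, regardless of size
    std::cout << __select_path<4>(1 << 20, 1 << 16) << '\n'; // large (count >= threshold)
}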

return;
const std::uint8_t __adjusted_iters_per_work_item =
oneapi::dpl::__internal::__dpl_ceiling_div(__count - __group_start_idx, __stride);
for (std::uint8_t __i = 0; __i < __adjusted_iters_per_work_item - 1; ++__i)
Contributor

The same.

Contributor Author

Answered in the other comment regarding uint8_t

}

//------------------------------------------------------------------------
// parallel_transform_scan - async pattern
//------------------------------------------------------------------------

// Please see the comment for __parallel_for_submitter for optional kernel name explanation
Contributor

There are a few more instances of this same comment which need renaming (or, instead, adding the explanation in place):
esimd_radix_sort_submitters.h
parallel_backend_sycl_fpga.h
parallel_backend_sycl_merge_sort.h
parallel_backend_sycl_merge.h
parallel_backend_sycl_reduce.h

Contributor Author

Thanks, I have updated these.

if (__group_start_idx >= __count)
return;
const std::uint8_t __adjusted_iters_per_work_item =
oneapi::dpl::__internal::__dpl_ceiling_div(__count - __group_start_idx, __stride);
Contributor

@danhoeflinger danhoeflinger Sep 24, 2024

Is this better than just using
__adjusted_iters_per_work_item = oneapi::dpl::__internal::__dpl_ceiling_div(__count - __idx, __stride);
directly and not having the extra separated last iteration?

It seems like the compiler may end up with the same thing (and same amount of work-item divergence). I'm probably missing something though.

I think we may even be able to remove __group_start_idx and the early exit above, merely relying upon the loop bounds to short-circuit work items beyond the end of the list.

Of course, if we get a benefit from the way it is written, let's keep it, but I'd like to understand it better.

Contributor Author

This is a good point. __adjusted_iters_per_work_item is a runtime variable, so we can't do any unrolling or anything else that would give the current version an advantage. It should perform similarly to your suggestion.

I added this change. The else is now an else if with a check that __idx < __count to prevent a wrap-around when subtracting the unsigned types. I guess this is unnecessary if we were to cast and switch to signed types for __adjusted_iters_per_work_item, but I think this looks cleaner.
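For reference, the resulting shape is roughly this (a sketch of the change described above, not a verbatim excerpt from the PR):

// Full work-groups run the fixed __iters_per_work_item loop; partial
// work-groups recompute their iteration count, and the __idx < __count check
// guards the unsigned subtraction against wrap-around.
if (__is_full)
{
    for (std::uint8_t __i = 0; __i < __iters_per_work_item; ++__i)
    {
        __brick(__idx, __rngs...);
        __idx += __stride;
    }
}
else if (__idx < __count)
{
    const std::uint8_t __adjusted_iters_per_work_item =
        oneapi::dpl::__internal::__dpl_ceiling_div(__count - __idx, __stride);
    for (std::uint8_t __i = 0; __i < __adjusted_iters_per_work_item; ++__i)
    {
        __brick(__idx, __rngs...);
        __idx += __stride;
    }
}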

Comment on lines 786 to 805
// Utility that returns the smallest type in tuple.
template <typename _Tuple>
class __min_tuple_type;

template <typename _T>
class __min_tuple_type<std::tuple<_T>>
{
public:
using type = _T;
};

template <typename _T, typename... _Ts>
class __min_tuple_type<std::tuple<_T, _Ts...>>
{
using __min_type_ts = typename __min_tuple_type<std::tuple<_Ts...>>::type;

public:
using type = std::conditional_t<(sizeof(_T) < sizeof(__min_type_ts)), _T, __min_type_ts>;
};

Contributor

@danhoeflinger danhoeflinger Sep 24, 2024

Do you think we will ever want the type specifics from a utility like this, or are we always only interested in the sizeof(type)?

If it's likely to be the latter, we could just make this directly expose a public value equal to the size of the minimum type rather than a public type (...and even provide it as a __min_tuple_type_size_v).

This would simplify the code which uses it.

Contributor Author

We are probably never going to use this just to get the type, so I have made this change to instead provide a value that is the size of the smallest type. I also added a variable template.
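A minimal sketch of that value-based form (the exact names in the PR may differ):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <tuple>

// Sketch: expose the size of the smallest tuple element directly as a value,
// plus a convenience variable template.
template <typename _Tuple>
struct __min_tuple_type_size;

template <typename _T>
struct __min_tuple_type_size<std::tuple<_T>>
{
    static constexpr std::size_t value = sizeof(_T);
};

template <typename _T, typename... _Ts>
struct __min_tuple_type_size<std::tuple<_T, _Ts...>>
{
    static constexpr std::size_t value =
        std::min(sizeof(_T), __min_tuple_type_size<std::tuple<_Ts...>>::value);
};

template <typename _Tuple>
inline constexpr std::size_t __min_tuple_type_size_v = __min_tuple_type_size<_Tuple>::value;

static_assert(__min_tuple_type_size_v<std::tuple<double, std::uint16_t, int>> == 2);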

// bandwidth. This heuristic errs on the side of launching more work per item than what is needed to
// achieve full bandwidth utilization. 16 bytes per range per work item has been found as a good
// value across the different for-based algorithms.
static constexpr std::uint8_t __bytes_per_work_item = 16;
Contributor

@dmitriy-sobolev dmitriy-sobolev Sep 24, 2024

Will it also benefit from having SPIR-V and non-SPIR-V paths? I vaguely remember that there was a differentiation for these targets in reduce, where SPIR-V targets performed better with sequential processing of 4 integers, and non-SPIR-V targets were better at processing them in parallel.

Contributor Author

Since the reduce discussions, I believe there have been some improvements to coalesced-style accesses in the compiler. For some cases with parallel_for, there is a benefit to adding a vertically vectorized path (sequential processing within a work item), but getting the compiler to generate the correct instructions is tricky to do for all cases.

For 32-bit and 64-bit types, I have seen similar performance between std::copy and a handcrafted version that generates vectorized loads / stores. Interestingly, there seem to be small benefits for std::for_each with these types. For smaller types such as 16-bit, vectorization brings significant improvement, as the compiler is able to combine memory transactions. I believe the bulk of further performance improvements would come with these small types.

The reason I have not added it now is that it has been non-trivial to implement in a generic fashion within our __parallel_for pattern to accommodate the different bricks, and I have seen spotty performance thus far when trying to do so. When doing just a vector load, as in reduce, the compiler seems to easily generate the correct instructions. For a load / store, as in std::copy, it has been difficult to enable vectorization with our bricks, and if the compiler doesn't generate the correct vector instructions, the performance degrades significantly. I think it is possible to implement a version that does this, but it requires additional work that should come in a subsequent round of improvements building on top of these. For the most common cases, this PR closes most of the gap and puts us near the memory bandwidth roof.
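For context, the vertically vectorized shape being described is roughly the following (illustrative only; __in, __out, __op, and the tail handling are stand-ins, not code from this PR):

// Each work item handles __vec_size contiguous small elements so the compiler
// can merge them into wider loads / stores (the main win for 16-bit types).
constexpr int __vec_size = 4;
const std::size_t __base = __ndi.get_global_linear_id() * __vec_size;
if (__base + __vec_size <= __count)
{
    std::uint16_t __tmp[__vec_size];
    for (int __i = 0; __i < __vec_size; ++__i) // contiguous load
        __tmp[__i] = __in[__base + __i];
    for (int __i = 0; __i < __vec_size; ++__i) // compute and contiguous store
        __out[__base + __i] = __op(__tmp[__i]);
}
// A separate scalar path would handle the tail when __count is not a multiple
// of __vec_size.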

//
// SPIR-V compilation targets show best performance with a stride of the sub-group size.
// Other compilation targets perform best with a work-group size stride.
template <typename _NdItem>
Contributor

@dmitriy-sobolev dmitriy-sobolev Sep 24, 2024

Minor thing. I assume there is no need to pass nd_item as a template parameter, since only nd_item<1> is expected.
