Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize merge algorithm for largest data sizes #1933

Merged
merged 83 commits into from
Dec 20, 2024

Conversation

SergeyKopienko
Copy link
Contributor

@SergeyKopienko SergeyKopienko commented Nov 6, 2024

In this PR we optimize merge algorithm for data sizes equal or greater then 8M items (16M items for int type).

The main idea - we doing two submits:

  1. in the first submit we find split point in some "base" diagonal's subset.
  2. in the second submit we find split points in all other diagonal and run serial merge for each diagonal (as before).
    But when we find split point on the current diagonal, we setup some indexes limits for rng1 and rng2: by this way we decrease the amount of reading data from source data ranges. For these limits we load split point's data from previous and next "base" diagonals, calculated on the step (1).

Applying this approach we have good perf profit for biggest data sizes with float and int data types.

Details:

Probably I should explain why I have create

struct __result_and_scratch_storage_base
{
    virtual ~__result_and_scratch_storage_base() = default;
};

and why I inherit __result_and_scratch_storage from __result_and_scratch_storage_base :

template <typename _ExecutionPolicy, typename _T>
struct __result_and_scratch_storage : __result_and_scratch_storage_base
{
    // ...
};

Let me explain the reason.
In the __parallel_merge(oneapi::dpl::__internal::__device_backend_tag, ...) function we checks in run-time some conditions to make decision which submitter we will call:

  1. __parallel_merge_submitter<std::uint32_t, ...>
  2. __parallel_merge_submitter_large<std::uint32_t, ...>
  3. __parallel_merge_submitter_large<std::uint64_t, ...>

In the cases (2) and (3) we should extend the life-time of different __result_and_scratch_storage instances which were created inside __result_and_scratch_storage<std::uint32_t>::operator() or __result_and_scratch_storage<std::uint64_t>::operator():

  • in the case (1) we will not have any __result_and_scratch_storage;
  • in the case (2) we will have __result_and_scratch_storage<std::uint32_t>;
  • in the case (3) we will have __result_and_scratch_storage<std::uint64_t>.

So the std::shared_ptr<__result_and_scratch_storage_base> is some common interface for the both of them which can extend their life-time.

About virtual ~__result_and_scratch_storage_base() = default - it's needed to properly destroy instances of inherited classes, which are really owned by shared-pointer instance in the cases (2) and (3). Without virtual here we will have memory-leaks.

@SergeyKopienko SergeyKopienko added this to the 2022.8.0 milestone Nov 6, 2024
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/optimize_merge_to_main branch 5 times, most recently from a6164fd to d4721ca Compare November 7, 2024 12:24
…introduce new function __find_start_point_in

Signed-off-by: Sergey Kopienko <[email protected]>
…introduce __parallel_merge_submitter_large for merge of biggest data sizes

Signed-off-by: Sergey Kopienko <[email protected]>
…using __parallel_merge_submitter_large for merge data equal or greater then 4M items

Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
…rename template parameter names in __parallel_merge_submitter

Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/optimize_merge_to_main branch from eebf508 to 400f695 Compare November 28, 2024 13:43
…introduce __starting_size_limit_for_large_submitter into __parallel_merge

Signed-off-by: Sergey Kopienko <[email protected]>
…fix warning: warning C4804: '<': unsafe use of type 'bool' in operation

Signed-off-by: Sergey Kopienko <[email protected]>
…remove extra comments before __find_start_point_in function

Signed-off-by: Sergey Kopienko <[email protected]>
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/optimize_merge_to_main branch 2 times, most recently from 8465a4b to 2f9a357 Compare December 20, 2024 09:28
@SergeyKopienko
Copy link
Contributor Author

SergeyKopienko commented Dec 20, 2024

Other than the comment about __find_start_point, I think this is in good shape and I have no other comments. Hopefully we can find resolution there, and without loss of performance.

I have fixed two moments in the __find_start_point_in function and delete __find_start_point and it's usages at all.
My checks shown me that there is no performance degradation at this step. only profit.

So I propose to save __find_start_point_in only like in mt last commit.

May be make sense only one question here: do we prefer to rename __find_start_point_in to __find_start_point or save as is?

@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/optimize_merge_to_main branch from 2f9a357 to 6dd8e51 Compare December 20, 2024 10:47
…fix self-review comment: we should describe lambda here as mutable

1) for compatibility with previous implementation
2) because at https://en.cppreference.com/w/cpp/algorithm/merge (for example) we see that bool cmp(const Type1& a, const Type2& b); isn't const

Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
@SergeyKopienko SergeyKopienko force-pushed the dev/skopienko/optimize_merge_to_main branch from 1bae441 to e55ee66 Compare December 20, 2024 12:39
Signed-off-by: Sergey Kopienko <[email protected]>
Signed-off-by: Sergey Kopienko <[email protected]>
…fix review comment: constexpr bool kValue = false; has been removed

Signed-off-by: Sergey Kopienko <[email protected]>
Copy link
Contributor

@mmichel11 mmichel11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Great optimization work! Please wait for the final reviews from others as well.

Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
I think it worked out nicely.
As mentioned by @mmichel11, I still think there are couple places we have room for more optimization with more effort but this is a big step in the right direction.

Thanks @SergeyKopienko !

: (__i_elem < __n ? __find_start_point(__rng1, _IdType{0}, __n1, __rng2,
_IdType{0}, __n2, __i_elem, __comp)
: _split_point_t<_IdType>{__n1, __n2});
});
Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? : operator is used instead of if.. else in my proposal. It is by performance reasons? I've just interesting in...
(if.. else was more readable, IMHO)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation of __serial_merge on ternary operator gave us good profit. So I hope some small profit we may have in this place too. But I am not ready to prepare any precision data about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

Copy link
Contributor

@MikeDvorskiy MikeDvorskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SergeyKopienko SergeyKopienko merged commit f6d9ea3 into main Dec 20, 2024
22 checks passed
@SergeyKopienko SergeyKopienko deleted the dev/skopienko/optimize_merge_to_main branch December 20, 2024 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants