New Split optimizations #506

MichaelS239 · 2024-12-08T22:36:12Z

Add new optimizations for Split algorithm

github-actions

clang-tidy made some suggestions

src/core/algorithms/dd/split/split.cpp

ol-imorozko · 2024-12-13T14:23:32Z

Since your previous PR #439 has been merged, please rebase this one onto the latest main.

ol-imorozko · 2024-12-16T15:23:08Z

src/core/algorithms/dd/split/model/distance_position_list_index.cpp

+namespace algos::dd {
+void DistancePositionListIndex::AddValue(std::string&& value) {
+    auto [it, is_value_new] = value_mapping_.try_emplace(value, next_cluster_index_);
+    if (is_value_new) {
+        clusters_.emplace_back(cur_tuple_index_, 0);
+        ++next_cluster_index_;
+    }
+    ++clusters_[it->second].size;
+    inverted_index_.emplace_back(it->second);
+    ++cur_tuple_index_;
+}


Why are we taking only an rvalue reference? Is there any particular reason why AddValue shouldn't accept lvalues? If so, document it in a comment; if not, use a forwarding (universal) reference with std::forward.
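A minimal sketch of the forwarding-reference variant, assuming a simplified stand-in class (`ToyIndex` is hypothetical and only mirrors the `value_mapping_` / `inverted_index_` members visible in the diff; it is not the actual `DistancePositionListIndex`):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the index class in the diff.
class ToyIndex {
public:
    // T&& on a deduced template parameter is a forwarding reference:
    // lvalue arguments are copied into the map, rvalue arguments are moved.
    template <typename T>
    void AddValue(T&& value) {
        auto [it, is_value_new] =
                value_mapping_.try_emplace(std::forward<T>(value), next_cluster_index_);
        if (is_value_new) ++next_cluster_index_;
        inverted_index_.push_back(it->second);
    }

    std::size_t NumClusters() const { return value_mapping_.size(); }
    std::vector<std::size_t> const& InvertedIndex() const { return inverted_index_; }

private:
    std::unordered_map<std::string, std::size_t> value_mapping_;
    std::vector<std::size_t> inverted_index_;
    std::size_t next_cluster_index_ = 0;
};
```

With this signature a caller can pass either a named `std::string` (which stays intact) or a temporary. The trade-off is that the member becomes a template and has to be defined where callers can see it.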

ol-imorozko · 2024-12-16T15:26:51Z

src/core/algorithms/dd/split/model/distance_position_list_index.cpp

+
+namespace algos::dd {
+void DistancePositionListIndex::AddValue(std::string&& value) {
+    auto [it, is_value_new] = value_mapping_.try_emplace(value, next_cluster_index_);


This does not achieve forwarding. Even though the variable value has type rvalue reference, in the expression
value_mapping_.try_emplace(value, next_cluster_index_) it has the lvalue value category, since it is a named variable.
So here you should use std::move(value) to cast it to an rvalue.

Also, shouldn't it be auto&& [it, is_value_new]?
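The value-category point can be observed with a pair of overloads. This is an illustrative sketch; `Sink`, `PassNamed` and `PassMoved` are hypothetical names, not part of the PR:

```cpp
#include <string>
#include <utility>

// Hypothetical overload pair that reports which value category it received.
inline std::string Sink(std::string const&) { return "lvalue"; }
inline std::string Sink(std::string&&) { return "rvalue"; }

inline std::string PassNamed(std::string&& value) {
    // `value` is a named variable, so the expression `value` is an lvalue
    // and the const& overload is selected (a callee would copy).
    return Sink(value);
}

inline std::string PassMoved(std::string&& value) {
    // std::move casts back to an rvalue, so the && overload is selected.
    return Sink(std::move(value));
}
```

A temporary returned by a function binds to a `std::string&&` parameter directly at the call site. The cast is needed only inside the function, once the parameter has a name.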

ol-imorozko · 2024-12-16T15:29:29Z

src/core/algorithms/dd/split/model/distance_position_list_index.cpp

+        ++next_cluster_index_;
+    }
+    ++clusters_[it->second].size;
+    inverted_index_.emplace_back(it->second);


Suggested change

-inverted_index_.emplace_back(it->second);
+inverted_index_.push_back(it->second);

it->second is already of type ClusterIndex, as far as I can see

ol-imorozko · 2024-12-16T15:31:08Z

src/core/algorithms/dd/split/model/distance_position_list_index.cpp

+                                                     model::TupleIndex num_rows) {
+    if (num_rows == 0) num_rows = column.GetNumRows();
+    for (model::TupleIndex index = 0; index != num_rows; ++index) {
+        AddValue(column.GetDataAsString(index));


Again, GetDataAsString returns std::string. If you don't want an extra copy, you should use std::move.

ol-imorozko · 2024-12-16T15:31:58Z

src/core/algorithms/dd/split/model/distance_position_list_index.cpp

+    if (num_rows == 0) num_rows = column.GetNumRows();
+    for (model::TupleIndex index = 0; index != num_rows; ++index) {


I think we should do

clusters_.reserve(num_rows); inverted_index_.reserve(num_rows);

to avoid reallocations
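A small sketch of the effect of `reserve`, using a hypothetical helper (`CountReallocations` is not from the PR) that counts capacity changes while appending `n` elements:

```cpp
#include <cstddef>
#include <vector>

// Counts how often the vector's capacity changes while appending n elements.
inline std::size_t CountReallocations(std::size_t n, bool reserve_first) {
    std::vector<std::size_t> v;
    if (reserve_first) v.reserve(n);  // one up-front allocation of at least n slots
    std::size_t reallocations = 0;
    for (std::size_t i = 0; i != n; ++i) {
        std::size_t const old_capacity = v.capacity();
        v.push_back(i);
        if (v.capacity() != old_capacity) ++reallocations;
    }
    return reallocations;
}
```

One caveat that follows from the AddValue code in the diff: `clusters_` grows only when a value is new, so `num_rows` is an upper bound for it rather than its exact size, and reserving that much may over-allocate on columns with few distinct values.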

ol-imorozko · 2024-12-16T16:12:38Z

src/core/algorithms/dd/split/split.cpp

+    for (auto const& pair : tuple_pairs_) {
+        if (CheckDF(d, pair)) return true;
    }
    return false;


Suggested change

-for (auto const& pair : tuple_pairs_) {
-    if (CheckDF(d, pair)) return true;
-}
-return false;
+return std::ranges::any_of(tuple_pairs_, [this, &d](const auto& pair) { return CheckDF(d, pair); });

Maybe other methods could also be simplified this way, by using a standard algorithm with a readable name instead of a raw loop.
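To illustrate that the raw loop and the named algorithm are equivalent, here is a sketch with stand-in types. `Pair` and `Check` are hypothetical placeholders for the tuple pair type and `CheckDF`. The sketch uses the iterator-based `std::any_of` so that it also compiles before C++20; `std::ranges::any_of` from the suggestion takes the range directly and behaves the same way.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;  // hypothetical stand-in for a tuple pair

// Hypothetical predicate standing in for CheckDF.
inline bool Check(int threshold, Pair const& p) {
    return p.second - p.first <= threshold;
}

// Raw-loop form: return as soon as one element satisfies the predicate.
inline bool AnyWithLoop(std::vector<Pair> const& pairs, int threshold) {
    for (auto const& pair : pairs) {
        if (Check(threshold, pair)) return true;
    }
    return false;
}

// Algorithm form: same short-circuiting behaviour, intent stated by name.
inline bool AnyWithAlgorithm(std::vector<Pair> const& pairs, int threshold) {
    return std::any_of(pairs.begin(), pairs.end(),
                       [threshold](Pair const& pair) { return Check(threshold, pair); });
}
```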

ol-imorozko · 2024-12-16T16:16:32Z

src/core/algorithms/dd/split/split.cpp

-        std::vector<DF> const& search, DF const& rhs, unsigned& cnt) {
+std::list<DD> Split::InstanceExclusionReduce(std::vector<std::size_t> const& tuple_pair_indices,
+                                             std::vector<DF> const& search, DF const& rhs,
+                                             unsigned& cnt) {
    if (!search.size()) return {};


Suggested change

-if (!search.size()) return {};
+if (search.empty()) return {};

ol-imorozko · 2024-12-16T16:17:26Z

src/core/algorithms/dd/split/split.cpp

+        if (!CheckDF(rhs, tuple_pairs_[index])) {
+            if (CheckDF(first_df, tuple_pairs_[index])) {
+                remaining_tuple_pair_indices.push_back(index);
                no_pairs_left = false;
            }
-            if (last_dd_holds && CheckDF(last_df, pair)) last_dd_holds = false;
+            if (last_dd_holds && CheckDF(last_df, tuple_pairs_[index])) last_dd_holds = false;
            if (!no_pairs_left && !last_dd_holds) break;
        }
    }


Add const auto& pair = tuple_pairs_[index]; at the top of the loop body and reuse it in the CheckDF calls.

ol-imorozko · 2024-12-16T16:18:13Z

src/core/algorithms/dd/split/split.cpp

            if (!no_pairs_left && !last_dd_holds) break;
        }
    }

    if (no_pairs_left) {
        if (IsFeasible(first_df)) dds.emplace_back(first_df, rhs);
        std::vector<DF> remainder = DoPositivePruning(search, first_df);
-        std::list<DD> remaining_dds = InstanceExclusionReduce(tuple_pairs, remainder, rhs, cnt);
+        std::list<DD> remaining_dds =
+                InstanceExclusionReduce(tuple_pair_indices, remainder, rhs, cnt);
        dds.splice(dds.end(), remaining_dds);


maybe

Suggested change

-dds.splice(dds.end(), remaining_dds);
+dds.splice(dds.end(), std::move(remaining_dds));

?

We don't need remaining_dds anymore
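For context, `std::list::splice` has overloads for both lvalue and rvalue source lists. Either way the nodes are relinked rather than copied and the source list ends up empty, so `std::move` here mainly documents that the source is no longer needed. A sketch with hypothetical helper names:

```cpp
#include <cstddef>
#include <list>
#include <utility>

struct SpliceResult {
    std::size_t dst_size;
    std::size_t src_size;
};

// Splice from an lvalue list: nodes are relinked, src is left empty.
inline SpliceResult SpliceLvalue() {
    std::list<int> dst{1, 2};
    std::list<int> src{3, 4, 5};
    dst.splice(dst.end(), src);
    return {dst.size(), src.size()};
}

// Splice from an rvalue list: same effect, the cast only signals intent.
inline SpliceResult SpliceRvalue() {
    std::list<int> dst{1, 2};
    std::list<int> src{3, 4, 5};
    dst.splice(dst.end(), std::move(src));
    return {dst.size(), src.size()};  // splice guarantees src is empty afterwards
}
```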

ol-imorozko · 2024-12-16T16:19:23Z

src/core/algorithms/dd/split/split.cpp

    std::list<DD> const pruning_dds =
-            InstanceExclusionReduce(remaining_tuple_pairs, prune, rhs, cnt);
+            InstanceExclusionReduce(remaining_tuple_pair_indices, prune, rhs, cnt);

    std::list<DD> merged_dds = MergeReducedResults(dds, pruning_dds);
    dds.splice(dds.end(), merged_dds);


again, I think this should be done:

Suggested change

-dds.splice(dds.end(), merged_dds);
+dds.splice(dds.end(), std::move(merged_dds));

github-actions bot reviewed Dec 8, 2024

MichaelS239 force-pushed the new-dd-optimizations2 branch from 60e998d to 4b97ba6 Compare December 8, 2024 22:55

MichaelS239 added 5 commits December 14, 2024 17:49

Add custom PLI

d7cf272

Move type check

a1a86fe

Optimize tuple pair construction

38b7c52

Reduce distances_ size

cb0cbf9

Add indices for tuple pairs

24a7846

MichaelS239 force-pushed the new-dd-optimizations2 branch from 4b97ba6 to 24a7846 Compare December 14, 2024 14:50

MichaelS239 marked this pull request as ready for review December 14, 2024 15:42

ol-imorozko requested changes Dec 16, 2024
