Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Split optimizations #506

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

MichaelS239
Copy link
Collaborator

Add new optimizations for Split algorithm

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/core/algorithms/dd/split/split.cpp Outdated Show resolved Hide resolved
@MichaelS239 MichaelS239 force-pushed the new-dd-optimizations2 branch from 60e998d to 4b97ba6 Compare December 8, 2024 22:55
@ol-imorozko
Copy link
Collaborator

Since your previous PR #439 has been merged, please rebase this one onto the latest main.

@MichaelS239 MichaelS239 force-pushed the new-dd-optimizations2 branch from 4b97ba6 to 24a7846 Compare December 14, 2024 14:50
@MichaelS239 MichaelS239 marked this pull request as ready for review December 14, 2024 15:42
Comment on lines +8 to +18
namespace algos::dd {
void DistancePositionListIndex::AddValue(std::string&& value) {
auto [it, is_value_new] = value_mapping_.try_emplace(value, next_cluster_index_);
if (is_value_new) {
clusters_.emplace_back(cur_tuple_index_, 0);
++next_cluster_index_;
}
++clusters_[it->second].size;
inverted_index_.emplace_back(it->second);
++cur_tuple_index_;
}
Copy link
Collaborator

@ol-imorozko ol-imorozko Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we taking only rvalue reference? Is there any particular reasons why we shouldn't be able to use lvalues with AddValue? If so, document it in a commentary, if not, use universal references with std::forward


namespace algos::dd {
void DistancePositionListIndex::AddValue(std::string&& value) {
auto [it, is_value_new] = value_mapping_.try_emplace(value, next_cluster_index_);
Copy link
Collaborator

@ol-imorozko ol-imorozko Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not achieve forwarding. Even though variable value has type rvalue reference, in expression
value_mapping_.try_emplace(value, next_cluster_index_) that variable has lvalue value category, since it's a named variable.
So here you should use std::move(value) to cast it to rvalue reference.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't it be auto&& [it, is_value_new]?

++next_cluster_index_;
}
++clusters_[it->second].size;
inverted_index_.emplace_back(it->second);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
inverted_index_.emplace_back(it->second);
inverted_index_.push_back(it->second);

it->second is already of ClusterIndex, as far as I can see

model::TupleIndex num_rows) {
if (num_rows == 0) num_rows = column.GetNumRows();
for (model::TupleIndex index = 0; index != num_rows; ++index) {
AddValue(column.GetDataAsString(index));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, GetDataAsString returns std::string. If you don't want an exta copy, you should use std::move

Comment on lines +22 to +23
if (num_rows == 0) num_rows = column.GetNumRows();
for (model::TupleIndex index = 0; index != num_rows; ++index) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do

   clusters_.reserve(num_rows);
   inverted_index_.reserve(num_rows);

to avoid reallocations

Comment on lines +427 to 430
for (auto const& pair : tuple_pairs_) {
if (CheckDF(d, pair)) return true;
}
return false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (auto const& pair : tuple_pairs_) {
if (CheckDF(d, pair)) return true;
}
return false;
return std::ranges::any_of(tuple_pairs_, [this, &d](const auto& pair) { return CheckDF(d, pair); });

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe other methods could also be simplified in this way by using exact algorithm with readable name instead of a raw loop

std::vector<DF> const& search, DF const& rhs, unsigned& cnt) {
std::list<DD> Split::InstanceExclusionReduce(std::vector<std::size_t> const& tuple_pair_indices,
std::vector<DF> const& search, DF const& rhs,
unsigned& cnt) {
if (!search.size()) return {};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!search.size()) return {};
if (search.empty()) return {};

Comment on lines +685 to 693
if (!CheckDF(rhs, tuple_pairs_[index])) {
if (CheckDF(first_df, tuple_pairs_[index])) {
remaining_tuple_pair_indices.push_back(index);
no_pairs_left = false;
}
if (last_dd_holds && CheckDF(last_df, pair)) last_dd_holds = false;
if (last_dd_holds && CheckDF(last_df, tuple_pairs_[index])) last_dd_holds = false;
if (!no_pairs_left && !last_dd_holds) break;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add const auto& pair = tuple_pairs_[index];

if (!no_pairs_left && !last_dd_holds) break;
}
}

if (no_pairs_left) {
if (IsFeasible(first_df)) dds.emplace_back(first_df, rhs);
std::vector<DF> remainder = DoPositivePruning(search, first_df);
std::list<DD> remaining_dds = InstanceExclusionReduce(tuple_pairs, remainder, rhs, cnt);
std::list<DD> remaining_dds =
InstanceExclusionReduce(tuple_pair_indices, remainder, rhs, cnt);
dds.splice(dds.end(), remaining_dds);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe

Suggested change
dds.splice(dds.end(), remaining_dds);
dds.splice(dds.end(), std::move(remaining_dds));

?

We don't need remaining_dds anymore

std::list<DD> const pruning_dds =
InstanceExclusionReduce(remaining_tuple_pairs, prune, rhs, cnt);
InstanceExclusionReduce(remaining_tuple_pair_indices, prune, rhs, cnt);

std::list<DD> merged_dds = MergeReducedResults(dds, pruning_dds);
dds.splice(dds.end(), merged_dds);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, I think this should be done:

Suggested change
dds.splice(dds.end(), merged_dds);
dds.splice(dds.end(), std::move(merged_dds));

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants