-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New Split optimizations #506
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
60e998d
to
4b97ba6
Compare
Since your previous PR #439 has been merged, please rebase this one onto the latest |
4b97ba6
to
24a7846
Compare
namespace algos::dd { | ||
void DistancePositionListIndex::AddValue(std::string&& value) { | ||
auto [it, is_value_new] = value_mapping_.try_emplace(value, next_cluster_index_); | ||
if (is_value_new) { | ||
clusters_.emplace_back(cur_tuple_index_, 0); | ||
++next_cluster_index_; | ||
} | ||
++clusters_[it->second].size; | ||
inverted_index_.emplace_back(it->second); | ||
++cur_tuple_index_; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we taking only rvalue reference? Is there any particular reasons why we shouldn't be able to use lvalues with AddValue
? If so, document it in a commentary, if not, use universal references with std::forward
|
||
namespace algos::dd { | ||
void DistancePositionListIndex::AddValue(std::string&& value) { | ||
auto [it, is_value_new] = value_mapping_.try_emplace(value, next_cluster_index_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not achieve forwarding. Even though variable value
has type rvalue reference, in expression
value_mapping_.try_emplace(value, next_cluster_index_)
that variable has lvalue value category, since it's a named variable.
So here you should use std::move(value)
to cast it to rvalue reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't it be auto&& [it, is_value_new]
?
++next_cluster_index_; | ||
} | ||
++clusters_[it->second].size; | ||
inverted_index_.emplace_back(it->second); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inverted_index_.emplace_back(it->second); | |
inverted_index_.push_back(it->second); |
it->second
is already of ClusterIndex
, as far as I can see
model::TupleIndex num_rows) { | ||
if (num_rows == 0) num_rows = column.GetNumRows(); | ||
for (model::TupleIndex index = 0; index != num_rows; ++index) { | ||
AddValue(column.GetDataAsString(index)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, GetDataAsString
returns std::string
. If you don't want an exta copy, you should use std::move
if (num_rows == 0) num_rows = column.GetNumRows(); | ||
for (model::TupleIndex index = 0; index != num_rows; ++index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do
clusters_.reserve(num_rows);
inverted_index_.reserve(num_rows);
to avoid reallocations
for (auto const& pair : tuple_pairs_) { | ||
if (CheckDF(d, pair)) return true; | ||
} | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (auto const& pair : tuple_pairs_) { | |
if (CheckDF(d, pair)) return true; | |
} | |
return false; | |
return std::ranges::any_of(tuple_pairs_, [this, &d](const auto& pair) { return CheckDF(d, pair); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe other methods could also be simplified in this way by using exact algorithm with readable name instead of a raw loop
std::vector<DF> const& search, DF const& rhs, unsigned& cnt) { | ||
std::list<DD> Split::InstanceExclusionReduce(std::vector<std::size_t> const& tuple_pair_indices, | ||
std::vector<DF> const& search, DF const& rhs, | ||
unsigned& cnt) { | ||
if (!search.size()) return {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!search.size()) return {}; | |
if (search.empty()) return {}; |
if (!CheckDF(rhs, tuple_pairs_[index])) { | ||
if (CheckDF(first_df, tuple_pairs_[index])) { | ||
remaining_tuple_pair_indices.push_back(index); | ||
no_pairs_left = false; | ||
} | ||
if (last_dd_holds && CheckDF(last_df, pair)) last_dd_holds = false; | ||
if (last_dd_holds && CheckDF(last_df, tuple_pairs_[index])) last_dd_holds = false; | ||
if (!no_pairs_left && !last_dd_holds) break; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add const auto& pair = tuple_pairs_[index];
if (!no_pairs_left && !last_dd_holds) break; | ||
} | ||
} | ||
|
||
if (no_pairs_left) { | ||
if (IsFeasible(first_df)) dds.emplace_back(first_df, rhs); | ||
std::vector<DF> remainder = DoPositivePruning(search, first_df); | ||
std::list<DD> remaining_dds = InstanceExclusionReduce(tuple_pairs, remainder, rhs, cnt); | ||
std::list<DD> remaining_dds = | ||
InstanceExclusionReduce(tuple_pair_indices, remainder, rhs, cnt); | ||
dds.splice(dds.end(), remaining_dds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe
dds.splice(dds.end(), remaining_dds); | |
dds.splice(dds.end(), std::move(remaining_dds)); |
?
We don't need remaining_dds
anymore
std::list<DD> const pruning_dds = | ||
InstanceExclusionReduce(remaining_tuple_pairs, prune, rhs, cnt); | ||
InstanceExclusionReduce(remaining_tuple_pair_indices, prune, rhs, cnt); | ||
|
||
std::list<DD> merged_dds = MergeReducedResults(dds, pruning_dds); | ||
dds.splice(dds.end(), merged_dds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, I think this should be done:
dds.splice(dds.end(), merged_dds); | |
dds.splice(dds.end(), std::move(merged_dds)); |
Add new optimizations for Split algorithm