Skip to content
This repository has been archived by the owner on Sep 27, 2019. It is now read-only.

[15721] Hash Join Cost Update #1344

Open
wants to merge 47 commits into
base: master
Choose a base branch
from

Conversation

GustavoAngulo
Copy link
Member

@GustavoAngulo GustavoAngulo commented May 5, 2018

This PR focuses on improving the cost model estimate for Hash Joins. At first we wanted to tackle join reordering, however, we realized the cost model itself needed work to provide more accurate estimates before we tackled join reordering.

Along the way we encountered bugs for the statistics and costs, which unfortunately took time to fix. The primary addition is the improvement for the cost estimate in cost_calculator.cpp.

It also took time to research the best approach for estimating the cost, as there is surprisingly little literature on the topic. In the end, we decided to try what Postgres does for their hash join cost estimate.

First PR is #1285 for reference. The first PR largely concerned the testing infrastructure, who's code is also included in this PR.

GustavoAngulo and others added 30 commits March 23, 2018 16:23
Copy link

@gangsterveggies gangsterveggies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the code looks good. I feel like there are some extra things that could be tried on the hash join cost model, like using statistics more to find correlations between columns (not sure if you tried some things that ended up not working).

I have a question that might be silly, but the way the bucket size estimation is working, from what I gathered, is looking at all the join keys and picking the optimal, but this cost is only accurate if the optimizer is using the same values to make that decision.

I look forward to see how this new model performed on the perf tests you wrote for the first PR.

Also, I looked at the fixes from the first PR and they look good as well.

@@ -18,209 +18,168 @@
// CONNECTIONS
//===----------------------------------------------------------------------===//
// Peloton port
SETTING_int(port,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look accidental, no need to change the indentation of this file.

output_cost_ = table_stats->num_rows * DEFAULT_TUPLE_COST;
}
void PostgresCostCalculator::Visit(
UNUSED_ATTRIBUTE const PhysicalIndexScan *op) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The attribute op is being used, so no need for the UNUSED_ATTRIBUTE flag.

StatsStorage::GetInstance()->GetTableStats(
op->table_->GetDatabaseOid(), op->table_->GetTableOid(), txn_));
if (table_stats->GetColumnCount() == 0) {
output_cost_ = 1.f;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this cost 1 if the column count is 0? We can consider that even though this is case is a no op since there are no columns this operation has some constant cost, but in that case the DummyScan and the empty IndexScan should have the same cost (currently it's 0). (I know this came from the original cost model, but we can still change it on the new code).

if (stats == nullptr) continue;

// TODO: A new hash join PR uses 256 as default so we're using this for now and hardcoding it here
auto num_buckets = 256.0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though hardcoding this constant here seems fine, since this is tied to another variable used in the hash join execution/codegen the variables should be tied (for example, by setting some constant on the hash join implementation file or on the include/common) such that if someone in the future changes one, both change.

Copy link

@qwkai-zz qwkai-zz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good overall. The modifications based on last pr comments have also been addressed.

output_cost_ = 0.f;
}

void PostgresCostCalculator::Visit(const PhysicalOrderBy *) { SortCost(); }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter should be (UNUSED_ATTRIBUTE const PhysicalOrderBy *op) for consistency?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants