Implement lazy join #1524
A first round of reviews. Let's discuss the complexity of this, which somewhat surprises me
(I already think my original code is too complex).
A first round of reviews of everything but the tests.
There are quite a few opportunities for cleanup, but we can address them soon.
auto localVocab = std::move(rowAdder.localVocab());
return Result::IdTableVocabPair{std::move(rowAdder).resultTable(),
                                std::move(localVocab)};
});
Also here, I don't like the nesting of the lambda.
src/engine/Join.cpp
Outdated
bool idTableHasUndef =
    !idTable.empty() && idTable.at(0, joinColTable).isUndefined();
std::optional<std::shared_ptr<const Result>> indexScanResult =
    std::nullopt;
using FirstColView = ad_utility::IdTableAndFirstCol<IdTable>;
using GenWithDetails =
    cppcoro::generator<FirstColView,
                       CompressedRelationReader::LazyScanMetadata>;
auto rightBlocks = [&scan, idTableHasUndef, &permutationIdTable,
                    &indexScanResult]()
    -> std::variant<cppcoro::generator<FirstColView>, GenWithDetails> {
  if (idTableHasUndef) {
    indexScanResult =
        scan->getResult(false, ComputationMode::LAZY_IF_SUPPORTED);
    AD_CORRECTNESS_CHECK(
        !indexScanResult.value()->isFullyMaterialized());
    return convertGenerator(
        std::move(indexScanResult.value()->idTables()));
  } else {
    auto rightBlocksInternal =
        scan->lazyScanForJoinOfColumnWithScan(permutationIdTable.col());
    return convertGenerator(std::move(rightBlocksInternal));
  }
}();
This can and should be a separate function (`getGeneratorForIndexScanThatIsJoinedWithTable` or something).
It's not that simple, unfortunately, because the lifetime of `indexScanResult` needs to be extended past the scope.
I'm sure there's a nicer solution for this, but this function needs to be changed anyway when implementing a block prefiltering for lazy results.
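One possible shape for such a helper, as a minimal sketch only: return an object that owns the `shared_ptr` together with the view of the blocks, so the result stays alive for as long as the blocks are being consumed. `FakeResult`, `OwningBlockRange`, and `blocksOf` are hypothetical names for illustration and are not part of the code base; the real code works with `cppcoro::generator`s rather than a plain range.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins: a block of rows and a result that owns its blocks.
using Block = std::vector<int>;
struct FakeResult {
  std::vector<Block> blocks;
};

// A range over the blocks of a result that also keeps the result alive.
// The owning pointer lives inside the range, so no variable in the caller's
// scope is needed to extend the lifetime.
class OwningBlockRange {
 public:
  explicit OwningBlockRange(std::shared_ptr<const FakeResult> owner)
      : owner_{std::move(owner)} {
    assert(owner_ != nullptr);
  }
  auto begin() const { return owner_->blocks.begin(); }
  auto end() const { return owner_->blocks.end(); }

 private:
  std::shared_ptr<const FakeResult> owner_;
};

// Hypothetical factory: it can be a free function because the returned
// object carries the ownership with it.
inline OwningBlockRange blocksOf(std::shared_ptr<const FakeResult> result) {
  return OwningBlockRange{std::move(result)};
}
```

With a coroutine-based generator, taking the `shared_ptr` by value as a coroutine parameter should have a similar effect, since parameters are copied into the coroutine frame.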
@@ -320,7 +330,7 @@ class AddCombinedRowToIdTable {
     indexBuffer_.clear();
     optionalIndexBuffer_.clear();
     nextIndex_ = 0;
-    std::invoke(blockwiseCallback_, result);
+    std::invoke(blockwiseCallback_, result, mergedVocab_);
We definitely have to talk about the `LocalVocab` stuff, as this really aggressively merges the local vocabs over and over again in the following cases:
- The undef blocks are set over and over again
- We have lazy inputs, but a fully materialized result [The most common case]
- We have many cartesian blocks (but there we typically have many more input rows than input vocabularies).
I see two potential angles here:
1. Deduplicate inside the `LocalVocab` class (the most aggressive way: use a HashMap instead of a vector for the `otherWordSets`).
2. Fiddle with the internals of the join here (hard to do, because we have to further hack the ZipperJoinWithBlocksAndUndef to be aware of the result being fully materialized etc.). That's why I am a little bit against it.
I will ask @hannahbast what she thinks about 1. It will only have a performance impact for inputs with very many nonempty local vocabs, and these are typically slower anyway, so it seems feasible to me.
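To illustrate option 1, here is a minimal sketch of the difference between keeping the other word sets in a vector and keeping them in a hash set keyed on pointer identity. All names (`WordSet`, `VectorBackedVocab`, `SetBackedVocab`, `mergeWith`) are hypothetical and only mimic the idea; this is not the actual `LocalVocab` interface.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the words owned by one local vocabulary.
using WordSet = std::unordered_set<std::string>;
using WordSetPtr = std::shared_ptr<const WordSet>;

// Variant A: a vector keeps every pointer it is handed, duplicates included.
class VectorBackedVocab {
 public:
  void mergeWith(const std::vector<WordSetPtr>& others) {
    otherSets_.insert(otherSets_.end(), others.begin(), others.end());
  }
  std::size_t numOtherSets() const { return otherSets_.size(); }

 private:
  std::vector<WordSetPtr> otherSets_;
};

// Variant B: a hash set keyed on the pointer value. Merging the same set
// again is a no-op, so repeated merges cannot grow the container.
class SetBackedVocab {
 public:
  void mergeWith(const std::vector<WordSetPtr>& others) {
    for (const auto& set : others) {
      assert(set != nullptr);
      otherSets_.insert(set);
    }
  }
  std::size_t numOtherSets() const { return otherSets_.size(); }

 private:
  std::unordered_set<WordSetPtr> otherSets_;
};
```

Merging the same two word sets a hundred times leaves variant A with 200 entries and variant B with 2. The price is one hash lookup per merged pointer, which only becomes noticeable for inputs with very many nonempty local vocabs.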
Another review on the Code.
Will look at the tests next.
Some suggestions for the tests.
This reverts commit 62d9295.
Several small, and one important thing remaining.
A tiny change which is necessary, because I think you didn't read the comment carefully enough. Otherwise this now looks fine:)
Since #1524, lazy joins can produce `LocalVocab`s with many duplicate "other sets". These duplicates cannot be easily detected because they are stored in a `std::vector`. They are now stored in an `absl::flat_hash_set` instead. This may come with a small performance penalty when the size of this set becomes large. To make this detectable, the size of the set is added to the details of the runtime information. Based on the suggestion from #1524 (comment)
This PR lazily computes single-column JOINs. Both inputs to the JOIN still have to be sorted, because the only supported algorithm so far is a sort-merge join, but star joins between (large) Index Scans can benefit heavily from this optimization.
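The blockwise idea can be sketched as follows. This is a heavily simplified illustration and not the implementation in this PR: only the left input arrives in blocks, rows consist of the join column alone, and UNDEF values and local vocabs are ignored. `BlockwiseMergeJoin` and `joinBlock` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A block of join-column values, sorted in ascending order.
using Block = std::vector<int>;

// Sort-merge join of a blockwise left input with a sorted right input.
// The cursor into the right input is kept between calls, so each left block
// can be joined and handed on before the next one has been computed.
class BlockwiseMergeJoin {
 public:
  explicit BlockwiseMergeJoin(std::vector<int> sortedRight)
      : right_{std::move(sortedRight)} {
    assert(std::is_sorted(right_.begin(), right_.end()));
  }

  // Join one left block. Blocks must be passed in globally sorted order.
  // Returns the join value once per matching (left, right) pair.
  Block joinBlock(const Block& leftBlock) {
    Block out;
    for (int value : leftBlock) {
      // Skip right entries that are too small to match this or any later
      // left value.
      while (pos_ < right_.size() && right_[pos_] < value) {
        ++pos_;
      }
      // Emit all right entries equal to `value`. The cursor is not moved
      // past them, because the next left value may be a duplicate.
      for (std::size_t j = pos_; j < right_.size() && right_[j] == value;
           ++j) {
        out.push_back(value);
      }
    }
    return out;
  }

 private:
  std::vector<int> right_;
  std::size_t pos_ = 0;
};
```

Because the cursor only ever moves forward, the sketch depends on both inputs being sorted, which matches the restriction stated above.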