Implement lazy join #1524

Merged
merged 34 commits into from
Nov 25, 2024
Conversation

RobinTF
Collaborator

@RobinTF RobinTF commented Sep 30, 2024

This PR computes single-column JOINs lazily. Both inputs to the JOIN still have to be sorted, because the only algorithm supported so far is a sort-merge join, but star joins between (large) Index Scans can benefit heavily from this optimization.
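
To illustrate the idea, here is a minimal, self-contained sketch of a lazy single-column sort-merge join. It is not the code from this PR: `Row`, `Block`, `BlockSource`, `Cursor`, `fromBlocks` and `lazyMergeJoin` are hypothetical names, plain `std::function` callbacks stand in for the generators used in the engine, and UNDEF values and local vocabularies are ignored. Both inputs are pulled block by block, and only the run of right-hand rows for the current join key is buffered.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not QLever's IdTable or generators.
struct Row {
  int key;
  int payload;
};
using Block = std::vector<Row>;
// Returns the next block (sorted by key), or std::nullopt when exhausted.
using BlockSource = std::function<std::optional<Block>()>;

struct JoinedRow {
  int key;
  int leftPayload;
  int rightPayload;
};

// Wraps a fixed list of blocks as a pull-based source (helper for examples).
BlockSource fromBlocks(std::vector<Block> blocks) {
  return [blocks = std::move(blocks),
          next = std::size_t{0}]() mutable -> std::optional<Block> {
    if (next >= blocks.size()) return std::nullopt;
    return std::move(blocks[next++]);
  };
}

// Row-level view of a block source; a block is fetched only when needed.
class Cursor {
 public:
  explicit Cursor(BlockSource source) : source_(std::move(source)) {}

  // Current row, or nullptr once the input is exhausted.
  const Row* peek() {
    while (!done_ && pos_ >= block_.size()) {
      std::optional<Block> next = source_();
      if (!next.has_value()) {
        done_ = true;
      } else {
        block_ = std::move(*next);
        pos_ = 0;
      }
    }
    if (done_) return nullptr;
    // The merge is only correct if the input is sorted by key.
    assert(!lastKey_.has_value() || *lastKey_ <= block_[pos_].key);
    return &block_[pos_];
  }

  // Precondition: peek() returned a non-null row.
  void advance() {
    lastKey_ = block_[pos_].key;
    ++pos_;
  }

 private:
  BlockSource source_;
  Block block_;
  std::size_t pos_ = 0;
  bool done_ = false;
  std::optional<int> lastKey_;
};

// Emits one JoinedRow per matching pair; the result is never materialized.
void lazyMergeJoin(BlockSource left, BlockSource right,
                   const std::function<void(const JoinedRow&)>& emit) {
  Cursor l{std::move(left)};
  Cursor r{std::move(right)};
  while (true) {
    const Row* lRow = l.peek();
    if (lRow == nullptr) return;
    const Row* rRow = r.peek();
    if (rRow == nullptr) return;
    if (lRow->key < rRow->key) {
      l.advance();
    } else if (rRow->key < lRow->key) {
      r.advance();
    } else {
      const int key = lRow->key;
      // Buffer only the right-hand rows of this one key.
      std::vector<int> rightRun;
      for (const Row* p = r.peek(); p != nullptr && p->key == key;
           p = r.peek()) {
        rightRun.push_back(p->payload);
        r.advance();
      }
      for (const Row* p = l.peek(); p != nullptr && p->key == key;
           p = l.peek()) {
        for (int rightPayload : rightRun) {
          emit(JoinedRow{key, p->payload, rightPayload});
        }
        l.advance();
      }
    }
  }
}
```

A caller passes two sources and a callback that consumes each joined row as it is produced. Because the sketch only ever holds one block per input plus the right-hand run of the current key, its memory use does not grow with the size of the inputs, which is the property that makes lazy evaluation attractive for large scans.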

codecov bot commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 97.75641% with 7 lines in your changes missing coverage. Please review.

Project coverage is 89.29%. Comparing base (5f28e83) to head (66cb836).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
src/engine/Join.cpp 97.20% 0 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1524      +/-   ##
==========================================
+ Coverage   89.24%   89.29%   +0.04%     
==========================================
  Files         374      374              
  Lines       35683    35856     +173     
  Branches     4027     4039      +12     
==========================================
+ Hits        31845    32016     +171     
+ Misses       2538     2522      -16     
- Partials     1300     1318      +18     


@RobinTF RobinTF marked this pull request as ready for review October 3, 2024 22:50
Member

@joka921 joka921 left a comment

A first round of reviews. Let's discuss the complexity of this, which somewhat surprises me (I already think my original code is too complex).


Quality Gate failed

Failed conditions
C Reliability Rating on New Code (required ≥ A)

Member

@joka921 joka921 left a comment

A first round of reviews of everything but the tests.
There are quite a few opportunities for cleanup, but we can fix them soon.

src/engine/Join.cpp
      auto localVocab = std::move(rowAdder.localVocab());
      return Result::IdTableVocabPair{std::move(rowAdder).resultTable(),
                                      std::move(localVocab)};
    });
Member

Also here, I don't like the nesting of the lambda.

Comment on lines 818 to 841
bool idTableHasUndef =
    !idTable.empty() && idTable.at(0, joinColTable).isUndefined();
std::optional<std::shared_ptr<const Result>> indexScanResult =
    std::nullopt;
using FirstColView = ad_utility::IdTableAndFirstCol<IdTable>;
using GenWithDetails =
    cppcoro::generator<FirstColView,
                       CompressedRelationReader::LazyScanMetadata>;
auto rightBlocks = [&scan, idTableHasUndef, &permutationIdTable,
                    &indexScanResult]()
    -> std::variant<cppcoro::generator<FirstColView>, GenWithDetails> {
  if (idTableHasUndef) {
    indexScanResult =
        scan->getResult(false, ComputationMode::LAZY_IF_SUPPORTED);
    AD_CORRECTNESS_CHECK(
        !indexScanResult.value()->isFullyMaterialized());
    return convertGenerator(
        std::move(indexScanResult.value()->idTables()));
  } else {
    auto rightBlocksInternal =
        scan->lazyScanForJoinOfColumnWithScan(permutationIdTable.col());
    return convertGenerator(std::move(rightBlocksInternal));
  }
}();
Member

This can and should be a separate function (`getGeneratorForIndexScanThatIsJoinedWithTable` or something similar).

Collaborator Author

It's not that simple, unfortunately, because the lifetime of `indexScanResult` needs to be extended past the scope of such a function.
I'm sure there's a nicer solution for this, but this code needs to be changed anyway when implementing block prefiltering for lazy results.
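
For illustration only, one possible way to address the lifetime concern is to let the lazily evaluated source itself own a `std::shared_ptr` to the result it reads from, so that the result stays alive for as long as the source does. The sketch below is an assumption, not what this PR does: `ScanResult`, `BlockSource` and `makeBlockSource` are hypothetical names, and a `std::function` stands in for the generator.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a result that owns its blocks.
struct ScanResult {
  std::vector<std::vector<int>> blocks;
};

using BlockSource = std::function<std::optional<std::vector<int>>()>;

// The returned source captures the shared_ptr by value, so the ScanResult
// lives at least as long as the source, even after the caller's own
// shared_ptr has gone out of scope.
BlockSource makeBlockSource(std::shared_ptr<const ScanResult> result) {
  assert(result != nullptr);
  return [owner = std::move(result), next = std::size_t{0}]() mutable
             -> std::optional<std::vector<int>> {
    if (next >= owner->blocks.size()) return std::nullopt;
    return owner->blocks[next++];
  };
}
```

With this shape, a helper function can create the result, wrap it, and return only the source. Whether that fits the real code depends on how the generator and the block prefiltering are eventually structured.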

@@ -320,7 +330,7 @@ class AddCombinedRowToIdTable {
     indexBuffer_.clear();
     optionalIndexBuffer_.clear();
     nextIndex_ = 0;
-    std::invoke(blockwiseCallback_, result);
+    std::invoke(blockwiseCallback_, result, mergedVocab_);
Member

We definitely have to talk about the LocalVocab handling, as this merges the local vocabs very aggressively, over and over again, in the following cases:

  • The undef blocks are set over and over again.
  • We have lazy inputs, but a fully materialized result (the most common case).
  • We have many cartesian blocks (but there we typically have many more input rows than input vocabularies).

I see two potential angles here:

  1. Deduplicate inside the LocalVocab class (the most aggressive way: use a HashMap instead of a vector for the otherWordSets).
  2. Fiddle with the internals of the join here. This is hard to do, because we would have to further hack the ZipperJoinWithBlocksAndUndef to be aware of the result being fully materialized etc., which is why I am a little bit against it.

I will ask @hannahbast what she thinks about 1. It will only have a performance impact for inputs with very many nonempty local vocabs, and these are typically slower anyway, so it seems feasible to me.
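
To make option 1 concrete, here is a rough sketch of deduplicating the "other" word sets by pointer identity with a hash set. It is not the actual `LocalVocab` class: `WordSet`, `LocalVocabSketch`, `addWord`, `mergeWith` and `numOtherSets` are hypothetical names, and `std::unordered_set` is used only to keep the example dependency-free.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

using WordSet = std::vector<std::string>;  // hypothetical word storage

class LocalVocabSketch {
 public:
  LocalVocabSketch() : primary_(std::make_shared<WordSet>()) {}

  void addWord(std::string word) { primary_->push_back(std::move(word)); }

  // Keeps the word sets of `other` alive. Sets are identified by pointer,
  // so merging the same vocabulary repeatedly adds no further entries.
  void mergeWith(const LocalVocabSketch& other) {
    assert(&other != this);  // self-merge is not handled in this sketch
    if (!other.primary_->empty()) otherSets_.emplace(other.primary_);
    otherSets_.insert(other.otherSets_.begin(), other.otherSets_.end());
  }

  std::size_t numOtherSets() const { return otherSets_.size(); }

 private:
  std::shared_ptr<WordSet> primary_;
  // std::hash and operator== for shared_ptr work on the stored pointer.
  std::unordered_set<std::shared_ptr<const WordSet>> otherSets_;
};
```

With a `std::vector` in place of the hash set, every call to `mergeWith` would append the same pointers again, so the container would grow with the number of merges instead of the number of distinct vocabularies. The price of the hash set is a lookup per inserted set.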

Member

@joka921 joka921 left a comment

Another review of the code.
Will look at the tests next.

Member

@joka921 joka921 left a comment

Some suggestions for the tests.

Member

@joka921 joka921 left a comment

Several small things and one important thing remain.

Member

@joka921 joka921 left a comment

A tiny change that is necessary, because I think you didn't read the comment carefully enough. Otherwise this now looks fine :)


@joka921 joka921 merged commit a9b9862 into ad-freiburg:master Nov 25, 2024
22 checks passed
@RobinTF RobinTF deleted the lazy-join branch November 25, 2024 15:19
joka921 pushed a commit that referenced this pull request Dec 4, 2024
Since #1524, lazy joins can produce `LocalVocab`s with many duplicate "other sets". These duplicates cannot be easily detected because they are stored in a `std::vector`. They are now stored in an `absl::flat_hash_set` instead. This may come with a small performance penalty when the size of this set becomes large. To be able to detect this, add the size of the set to the details of the runtime information.

Based on the suggestion from #1524 (comment)