Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BlankNode support for SERVICE #1504

Merged
merged 18 commits into from
Oct 18, 2024

Conversation

UNEXENU
Copy link
Contributor

@UNEXENU UNEXENU commented Sep 19, 2024

With this commit, QLever supports to add new blank nodes during the evaluation of a query. This function is used to support blank nodes in the result of a SERVICE request. These blank nodes are distinct from all blank nodes in the index, and also from all blank nodes from other SERVICE request, eve if they came from the same server. This behavior is correct accordin to the SPARQL 1.1 federated query standard.

Copy link

codecov bot commented Sep 19, 2024

Codecov Report

Attention: Patch coverage is 98.50746% with 1 line in your changes missing coverage. Please review.

Project coverage is 88.94%. Comparing base (2b4e6b3) to head (9a01841).

Files with missing lines Patch % Lines
src/util/BlankNodeManager.cpp 96.87% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1504      +/-   ##
==========================================
+ Coverage   88.93%   88.94%   +0.01%     
==========================================
  Files         364      366       +2     
  Lines       33127    33194      +67     
  Branches     3715     3722       +7     
==========================================
+ Hits        29462    29526      +64     
- Misses       2433     2439       +6     
+ Partials     1232     1229       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have looked at everything but the tests. This already looks nice.
You have forgotten some switch statements that use the Datatype of an ID
(at least in ExportQueryExecutionTrees.cpp. But search for all occurences of the BlankNodeIndex, in every occasion you should also do at least something for NewBlankNodeIndex.

@@ -116,6 +118,17 @@ class QueryExecutionContext {
updateCallback_(nlohmann::ordered_json(runtimeInformation).dump());
}

// Getter for the global `NEXT_NEWBLANKNODEINDEX` counter.
[[nodiscard]] static int getNextNewBlankNodeIndex() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't make this function static, even though it can currently be. The staticness should in this case be an implementation detail.

Comment on lines 123 to 130
if (NEXT_NEWBLANKNODEINDEX == std::numeric_limits<uint64_t>::max())
[[unlikely]] {
throw std::runtime_error(
"Counter for blank nodes overflowed. Tell the developers to reset "
"it.");
}
return NEXT_NEWBLANKNODEINDEX++;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this has a race condition (there is one blank node left, two threads first run the check in line 123,
(which for both is okay), then they both increment by one , upps overflow.
You either have to work with a compare_exchange_weak loop (I am sure that cppreference has an example)
or with a std::mutex.
Also you could here directly also implement the function getMultipleNewBlankNodes that reserves several nodes (e.g. 100) at once for performance reasons.

@@ -245,3 +246,6 @@ constexpr inline size_t NUM_SORT_THREADS = 4;
constexpr inline std::string_view EMPH_ON = "\033[1m";
/// ANSI escape sequence to print "normal" text again in the console.
constexpr inline std::string_view EMPH_OFF = "\033[22m";

// Counter for internal Blank Node Ids.
inline std::atomic_uint64_t NEXT_NEWBLANKNODEINDEX{0};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better call them (here and everywhere local blank nodes (not new). Then it is consistent with the local vocab, which is a similar concept.

@@ -245,3 +246,6 @@ constexpr inline size_t NUM_SORT_THREADS = 4;
constexpr inline std::string_view EMPH_ON = "\033[1m";
/// ANSI escape sequence to print "normal" text again in the console.
constexpr inline std::string_view EMPH_OFF = "\033[22m";

// Counter for internal Blank Node Ids.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Counter for internal Blank Node Ids.
// Counter for internal Blank Node Ids that are not part of the index, but were locally added (by SERVICE clauses or expressions).

Comment on lines 365 to 370
if (!blankNodeMap.contains(value)) {
// create a new blankNodeIndex
blankNodeMap[value] =
Id::makeFromNewBlankNodeIndex(NewBlankNodeIndex::make(
getExecutionContext()->getNextNewBlankNodeIndex()));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to compute the hash of value twice or three times in all cases.
Use try_emplace. You would have to first insert a dummy Id as the value, but that should not be a problem here.
(see cppreference, ask me if there's trouble).

src/engine/Service.cpp Show resolved Hide resolved
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several small, and 1.5 big requests:
In particular,

  1. Store the highest BlankNodeIndex s.t. you don't need the LocalBlankNodeIndex
  2. (necessary if you implement 1.) Don't use a global variable for the blank node manager, but always access it via the corresponding Index.

@@ -217,6 +217,9 @@ ExportQueryExecutionTrees::idToStringAndTypeForEncodedValue(Id id) {
case BlankNodeIndex:
return std::pair{absl::StrCat("_:bn", id.getBlankNodeIndex().get()),
nullptr};
case LocalBlankNodeIndex:
return std::pair{absl::StrCat("_:bn", id.getLocalBlankNodeIndex().get()),
nullptr};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should the index of the highest blank node in the index during index building
(It is created in the VocabularyMerger, should be returned from there and can be added during the index building via IndexImpl::writeConfiguration (to the metaData.json file). It should be read from there during the reading of the index builder from file. The LocalBlankNodeManager can then be a member of the Index class, and we don't need the additional datatype LocalBlankNodeIndex. (Datatypes are valuable).

"For now, consider filtering them out using the ISBLANK function "
"or "
"converting them via the STR function.");
auto optEmplace = blankNodeMap.try_emplace(value, Id());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto optEmplace = blankNodeMap.try_emplace(value, Id());
auto [it, wasNew] = blankNodeMap.try_emplace(value, Id());

public:
static const uint blockSize_ = 1000;
static constexpr uint64_t totalAvailableBlocks_ =
(ValueId::maxIndex + 1) / blockSize_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the +1 correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and is it necessary? one block less for better readability shouldn't be much of an issue.

namespace ad_utility {
/*
* Manager class for LocalBlankNodeIndex.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please elaborate a little bit, e.g. what are local blank node indices, what does this class do/manage, etc.

*/
class BlankNodeManager {
public:
static const uint blockSize_ = 1000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can also be constexpr.

uint64_t nextIdx_;
};

// Manages the LocalBlankNodes used within a LocalVocab.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also document the lifetime implications (how long does what have to be kept alive to savely work with the blank node indices.


} // namespace ad_utility

inline ad_utility::BlankNodeManager globalBlankNodeManager;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my proposed change from above, you have to get rid of the global blank node manager, so to create a blank node manager, you need access to something. (e.g. a queryExecutionContext).


// Lock this part, as it might be accessed concurrently by
// `LocalBlankNodeManager` instances.
mtx_.lock();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a std::lock_guard object instead of manually (un)-locking the object.

Comment on lines 33 to 36
// _____________________________________________________________________________
void BlankNodeManager::freeBlock(uint64_t blockIndex) {
usedBlocksSet_.erase(blockIndex);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erasing the blocks must also lock a mutex.
Best use the ad_utility::Synchronized class for access to the usedBlockSet (see the header + tests for documentation).

// BlankNode exists already, known Id will be used.
Id a2 =
bTTC({{"type", "bnode"}, {"value", "A"}}).toValueIdIfNotString().value();
EXPECT_EQ(a, a2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also check the datatype of the blank node index.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thorough review on everything but the tests.
There's not that much to do, this looks very nice.

src/engine/LocalVocab.cpp Outdated Show resolved Hide resolved
src/index/Index.cpp Outdated Show resolved Hide resolved
@@ -62,7 +62,7 @@ struct VocabularyMetaData {
Id begin() const { return begin_; }
Id end() const { return end_; }

// Return true iff the `id` belongs to this range.
// Return true if the `id` belongs to this range.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Return true if the `id` belongs to this range.
// Return true iff the `id` belongs to this range.

That is not really a typo.
In mathematics and therefore also in computer since you write iff with two ffs for if and only if to be more precise.

// The Random-Generation Algorithm's performance is reduced once the number of
// used blocks exceeds a limit.
AD_CORRECTNESS_CHECK(usedBlocksSet_.rlock()->size() <
totalAvailableBlocks_ / 256);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a message to this check (the second argument to the CORRECTNESS_CHECK can be a string or a lambda that returns a string., s.t. we know, what has gone wrong here.

Comment on lines 49 to 51
for (auto block : blocks_) {
blankNodeManager_->freeBlock(block.blockIdx_);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This takes one lock + unlock per block, which is a waste of resources.
You can either create a templated function freeBlocks s.t. you here can write manager_->freeBlocks(blocks_ | std::views::transform(...)).
or you make the LocalBlankNodeManager a friend of the BlankNodeManager and handle the locking here yourself.

Comment on lines 57 to 59
blocks_.back().nextIdx_ ==
(blankNodeManager_->minIndex_ + blocks_.back().blockIdx_ + 1) *
blockSize_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can explicitly store the lastIndex or the endIndex in the Block , that way you don't have to fiddle with the internals here.

test/LocalVocabTest.cpp Show resolved Hide resolved
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also,
This will be an index-breaking change (because you have to know the number of blank nodes), so you have to change something in the IndexFormatVersion.h

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, almost finished, only minor suggestions remaining.
Make sure to also fix that one trivial sonarcloud thing (public member var in indexImpl, can definitely be private.

// Initialize the `localBlankNodeManager_` if it doesn't exist yet.
if (!localBlankNodeManager_) [[unlikely]] {
localBlankNodeManager_ =
std::make_unique<ad_utility::BlankNodeManager::LocalBlankNodeManager>(
blankNodeManager);
ad_utility::BlankNodeManager::LocalBlankNodeManager(blankNodeManager);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for optional you can write
localBlankNodeManager_.emplace(blankNodeManager) , emplace also works for explicit constructors.

Comment on lines 20 to 22
usedBlocksSet_.rlock()->size() < totalAvailableBlocks_ / 256,
absl::StrCat("Critical high number of blank node blocks in use: ",
usedBlocksSet_.rlock()->size(), " blocks"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should in priniple (because of rlock) be deadlock-free, but stil pecompute
auto numBlocks = ... and then use that variable in the check to reduce the code duplication.

Comment on lines 25 to 29
auto usedBlocksSetPtr = usedBlocksSet_.wlock();
while (usedBlocksSetPtr->contains(newBlockIndex)) {
newBlockIndex = randBlockIndex_();
}
usedBlocksSetPtr->insert(newBlockIndex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two lines withrandBlockIndex() can be refactored, what about

while(true) {
auto idx = randIndx();
  if not contained(idx) {
     insert + return;
}

LocalBlankNodeManager& operator=(const LocalBlankNodeManager&) = delete;

LocalBlankNodeManager(LocalBlankNodeManager&&) = default;
LocalBlankNodeManager& operator=(LocalBlankNodeManager&&) = default;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point of codecov:
Add a unit test, that moving of the local blank node managers does the freeing of the blocks still only once and at the right place. That makes us robust on ou assuptions on the moving of std::vector

@@ -2,4 +2,4 @@ add_subdirectory(ConfigManager)
add_subdirectory(MemorySize)
add_subdirectory(http)
add_library(util GeoSparqlHelpers.cpp antlr/ANTLRErrorHandling.cpp ParseException.cpp Conversions.cpp Date.cpp DateYearDuration.cpp Duration.cpp antlr/GenerateAntlrExceptionMetadata.cpp CancellationHandle.cpp StringUtils.cpp LazyJsonParser.cpp BlankNodeManager.cpp)
qlever_target_link_libraries(util re2::re2)
qlever_target_link_libraries(util re2::re2 s2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right... thanks.


const Index& index2 = getQec("")->getIndex();
EXPECT_NO_THROW(index2.getBlankNodeManager());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add a test for the correct reading and writing of the minimal local blank node index
just use getQec(...) with a kg that has some blank nodes, and then access the blank node manager of the returned index to check that the number of blank nodes was propagated correctly.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much

@joka921 joka921 merged commit ef057ac into ad-freiburg:master Oct 18, 2024
18 checks passed
@sparql-conformance
Copy link

Copy link

joka921 pushed a commit that referenced this pull request Nov 21, 2024
#1617)

This resolves a bug that was introduced in #1504, where the `LocalBlankNodeManager` of a `LocalVocab` was not accounted for in the merge functions of the `LocalVocab`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants