Merge ssh://github.com/ad-freiburg/qlever into aggregate-stdev
ullingerc committed Nov 11, 2024
2 parents c71df41 + 1bcfeeb commit 321dc54
Showing 60 changed files with 2,752 additions and 1,104 deletions.
11 changes: 8 additions & 3 deletions Dockerfile
Original file line number Diff line number Diff line change
@@ -8,16 +8,21 @@ RUN apt-get update && apt-get install -y software-properties-common wget && add-
RUN wget https://apt.kitware.com/kitware-archive.sh && chmod +x kitware-archive.sh &&./kitware-archive.sh

FROM base as builder

Check warning on line 10 in Dockerfile (GitHub Actions / docker): FromAsCasing: 'as' and 'FROM' keywords' casing do not match. More info: https://docs.docker.com/go/dockerfile/rule/from-as-casing/
ARG TARGETPLATFORM
RUN apt-get update && apt-get install -y build-essential cmake libicu-dev tzdata pkg-config uuid-runtime uuid-dev git libjemalloc-dev ninja-build libzstd-dev libssl-dev libboost1.81-dev libboost-program-options1.81-dev libboost-iostreams1.81-dev libboost-url1.81-dev

COPY . /app/

WORKDIR /app/
ENV DEBIAN_FRONTEND=noninteractive

WORKDIR /app/build/
RUN cmake -DCMAKE_BUILD_TYPE=Release -DLOGLEVEL=INFO -DUSE_PARALLEL=true -D_NO_TIMING_TESTS=ON -GNinja .. && ninja
RUN ctest --rerun-failed --output-on-failure
RUN cmake -DCMAKE_BUILD_TYPE=Release -DLOGLEVEL=INFO -DUSE_PARALLEL=true -D_NO_TIMING_TESTS=ON -GNinja ..
# When cross-compiling the container for ARM64, then compiling and running all tests runs into a timeout on GitHub actions,
# so we disable tests for this platform.
# TODO(joka921) re-enable these tests as soon as we can use a native ARM64 platform to compile the docker container.
RUN if [ $TARGETPLATFORM = "linux/arm64" ] ; then echo "target is ARM64, don't build tests to avoid timeout"; fi
RUN if [ $TARGETPLATFORM = "linux/arm64" ] ; then cmake --build . --target IndexBuilderMain ServerMain; else cmake --build . ; fi
RUN if [ $TARGETPLATFORM = "linux/arm64" ] ; then echo "Skipping tests for ARM64" ; else ctest --rerun-failed --output-on-failure ; fi

FROM base as runtime

Check warning on line 27 in Dockerfile (GitHub Actions / docker): FromAsCasing: 'as' and 'FROM' keywords' casing do not match. More info: https://docs.docker.com/go/dockerfile/rule/from-as-casing/
WORKDIR /app
39 changes: 38 additions & 1 deletion docs/path_search.md
@@ -48,6 +48,10 @@ SELECT ?start ?end ?path ?edge WHERE {
**one target**. Sources and targets are paired based on their index (i.e. the paths
from the first source to the first target are searched, then the second source and
target, and so on).
- **pathSearch:numPathsPerTarget** (optional): The path search will only search and store paths,
if the number of found paths is lower or equal to the value of the parameter. Expects an integer.
Example: if the value is 5, then the search will enumerate all paths until 5 paths have been found.
Other paths will be ignored.


### Example 1: Single Source and Target
@@ -170,7 +174,7 @@ SELECT ?start ?end ?path ?edge WHERE {
}
```

This is esecially useful for [N-ary relations](https://www.w3.org/TR/swbp-n-aryRelations/).
This is especially useful for [N-ary relations](https://www.w3.org/TR/swbp-n-aryRelations/).
Considering the example above, it is possible to query additional relations of `?middle`:

```sparql
@@ -255,6 +259,39 @@ SELECT ?start ?end ?path ?edge WHERE {
}
```

### Example 5: Limit Number of Paths per Target

It is possible to limit how many paths per target are returned. This is especially useful if
the query uses a lot of memory. In that case, it is possible to query a limited number of
paths to debug where the problem is.

The following query for example will only return one path per source and target pair.
I.e. one path for `(<source1>, <target1>)`, one path for `(<source1>, <target2>)` and so on.

```sparql
PREFIX pathSearch: <https://qlever.cs.uni-freiburg.de/pathSearch/>
SELECT ?start ?end ?path ?edge WHERE {
SERVICE pathSearch: {
_:path pathSearch:algorithm pathSearch:allPaths ;
pathSearch:source <source1> ;
pathSearch:source <source2> ;
pathSearch:target <target1> ;
pathSearch:target <target2> ;
pathSearch:pathColumn ?path ;
pathSearch:edgeColumn ?edge ;
pathSearch:start ?start ;
pathSearch:end ?end ;
pathSearch:numPathsPerTarget 1;
{
SELECT * WHERE {
?start <predicate> ?end.
}
}
}
}
```
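The effect of `pathSearch:numPathsPerTarget` can be pictured with a small stand-alone sketch. This is not QLever's implementation: the `Graph`, `Path`, `collectPaths`, and `pathsWithLimit` names are hypothetical, and the sketch assumes a plain depth-first enumeration of simple paths that stops as soon as the per-pair limit is reached.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <optional>
#include <vector>

// Hypothetical types for illustration only (not QLever's data structures).
using Node = int;
using Graph = std::map<Node, std::vector<Node>>;
using Path = std::vector<Node>;

// Depth-first enumeration of simple paths from `current` to `target`.
// Stops collecting once `limit` paths were found (no limit if `nullopt`).
void collectPaths(const Graph& graph, Node current, Node target,
                  std::optional<std::size_t> limit, Path& prefix,
                  std::vector<Path>& found) {
  if (limit.has_value() && found.size() >= *limit) {
    return;  // Limit reached: all remaining paths are ignored.
  }
  prefix.push_back(current);
  if (current == target) {
    found.push_back(prefix);
  } else if (auto it = graph.find(current); it != graph.end()) {
    for (Node next : it->second) {
      // Skip nodes that are already on the current path to avoid cycles.
      if (std::find(prefix.begin(), prefix.end(), next) == prefix.end()) {
        collectPaths(graph, next, target, limit, prefix, found);
      }
    }
  }
  prefix.pop_back();
}

// Return at most `limit` paths for one (source, target) pair.
std::vector<Path> pathsWithLimit(const Graph& graph, Node source, Node target,
                                 std::optional<std::size_t> limit) {
  std::vector<Path> found;
  Path prefix;
  collectPaths(graph, source, target, limit, prefix, found);
  return found;
}
```

Calling `pathsWithLimit` once per source and target pair with a limit of 1 mirrors the query above: each pair contributes at most one path, no matter how many paths exist.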

## Error Handling

The Path Search feature will throw errors in the following scenarios:
2 changes: 1 addition & 1 deletion src/engine/CMakeLists.txt
@@ -13,5 +13,5 @@ add_library(engine
VariableToColumnMap.cpp ExportQueryExecutionTrees.cpp
CartesianProductJoin.cpp TextIndexScanForWord.cpp TextIndexScanForEntity.cpp
TextLimit.cpp LazyGroupBy.cpp GroupByHashMapOptimization.cpp SpatialJoin.cpp
CountConnectedSubgraphs.cpp SpatialJoinAlgorithms.cpp PathSearch.cpp)
CountConnectedSubgraphs.cpp SpatialJoinAlgorithms.cpp PathSearch.cpp ExecuteUpdate.cpp)
qlever_target_link_libraries(engine util index parser sparqlExpressions http SortPerformanceEstimator Boost::iostreams s2)
174 changes: 87 additions & 87 deletions src/engine/CartesianProductJoin.cpp
@@ -53,22 +53,21 @@ string CartesianProductJoin::getCacheKeyImpl() const {
// ____________________________________________________________________________
size_t CartesianProductJoin::getResultWidth() const {
auto view = childView() | std::views::transform(&Operation::getResultWidth);
return std::accumulate(view.begin(), view.end(), 0UL, std::plus{});
return std::reduce(view.begin(), view.end(), 0UL, std::plus{});
}

// ____________________________________________________________________________
size_t CartesianProductJoin::getCostEstimate() {
auto childSizes =
childView() | std::views::transform(&Operation::getCostEstimate);
return getSizeEstimate() + std::accumulate(childSizes.begin(),
childSizes.end(), 0UL,
std::plus{});
return getSizeEstimate() +
std::reduce(childSizes.begin(), childSizes.end(), 0UL, std::plus{});
}

// ____________________________________________________________________________
uint64_t CartesianProductJoin::getSizeEstimateBeforeLimit() {
auto view = childView() | std::views::transform(&Operation::getSizeEstimate);
return std::accumulate(view.begin(), view.end(), 1UL, std::multiplies{});
return std::reduce(view.begin(), view.end(), 1UL, std::multiplies{});
}

// ____________________________________________________________________________
@@ -85,13 +84,10 @@ bool CartesianProductJoin::knownEmptyResult() {
}

// ____________________________________________________________________________
template <size_t StaticGroupSize>
void CartesianProductJoin::writeResultColumn(std::span<Id> targetColumn,
std::span<const Id> inputColumn,
size_t groupSize, size_t offset) {
if (StaticGroupSize != 0) {
AD_CORRECTNESS_CHECK(StaticGroupSize == groupSize);
}
size_t groupSize,
size_t offset) const {
// Copy each element from the `inputColumn` `groupSize` times to
// the `targetColumn`, repeat until the `targetColumn` is completely filled.
size_t numRowsWritten = 0;
@@ -104,20 +100,13 @@ void CartesianProductJoin::writeResultColumn(std::span<Id> targetColumn,
size_t groupStartIdx = offset % groupSize;
while (true) {
for (size_t i = firstInputElementIdx; i < inputSize; ++i) {
auto writeGroup = [&](size_t actualGroupSize) {
for (size_t u = groupStartIdx; u < actualGroupSize; ++u) {
if (numRowsWritten == targetSize) {
return;
}
targetColumn[numRowsWritten] = inputColumn[i];
++numRowsWritten;
checkCancellation();
for (size_t u = groupStartIdx; u < groupSize; ++u) {
if (numRowsWritten == targetSize) {
return;
}
};
if constexpr (StaticGroupSize == 0) {
writeGroup(groupSize);
} else {
writeGroup(StaticGroupSize);
targetColumn[numRowsWritten] = inputColumn[i];
++numRowsWritten;
checkCancellation();
}
if (numRowsWritten == targetSize) {
return;
@@ -131,61 +120,52 @@ void CartesianProductJoin::writeResultColumn(std::span<Id> targetColumn,
firstInputElementIdx = 0;
}
}

// ____________________________________________________________________________
ProtoResult CartesianProductJoin::computeResult(
[[maybe_unused]] bool requestLaziness) {
IdTable result{getExecutionContext()->getAllocator()};
result.setNumColumns(getResultWidth());
std::vector<std::shared_ptr<const Result>> subResults;
std::vector<std::shared_ptr<const Result>> subResults = calculateSubResults();

// We don't need to fully materialize the child results if we have a LIMIT
// specified and an OFFSET of 0.
// TODO<joka921> We could in theory also apply this optimization if a
// non-zero OFFSET is specified, but this would make the algorithm more
// complicated.
std::optional<LimitOffsetClause> limitIfPresent = getLimit();
if (!getLimit()._limit.has_value() || getLimit()._offset != 0) {
limitIfPresent = std::nullopt;
}

// Get all child results (possibly with limit, see above).
for (auto& child : childView()) {
if (limitIfPresent.has_value() && child.supportsLimit()) {
child.setLimit(limitIfPresent.value());
}
subResults.push_back(child.getResult());
IdTable result = writeAllColumns(subResults);

const auto& table = subResults.back()->idTable();
// Early stopping: If one of the results is empty, we can stop early.
if (table.empty()) {
break;
}
// Dereference all the subresult pointers because `getSharedLocalVocabFrom...`
// requires a range of references, not pointers.
auto subResultsDeref = std::views::transform(
subResults, [](auto& x) -> decltype(auto) { return *x; });
return {std::move(result), resultSortedOn(),
Result::getMergedLocalVocab(subResultsDeref)};
}

// If one of the children is the neutral element (because of a triple with
// zero variables), we can simply ignore it here.
if (table.numRows() == 1 && table.numColumns() == 0) {
subResults.pop_back();
continue;
}
// Example for the following calculation: If we have a LIMIT of 1000 and
// the first child already has a result of size 100, then the second child
// needs to evaluate only its first 10 results. The +1 is because integer
// divisions are rounded down by default.
if (limitIfPresent.has_value()) {
limitIfPresent.value()._limit = limitIfPresent.value()._limit.value() /
subResults.back()->idTable().size() +
1;
// ____________________________________________________________________________
VariableToColumnMap CartesianProductJoin::computeVariableToColumnMap() const {
VariableToColumnMap result;
// It is crucial that we also count the columns in the inputs to which no
// variable was assigned. This is managed by the `offset` variable.
size_t offset = 0;
for (const auto& child : childView()) {
for (auto varCol : child.getExternallyVisibleVariableColumns()) {
varCol.second.columnIndex_ += offset;
result.insert(std::move(varCol));
}
// `getResultWidth` contains all the columns, not only the ones to which a
// variable is assigned.
offset += child.getResultWidth();
}
return result;
}

// _____________________________________________________________________________
IdTable CartesianProductJoin::writeAllColumns(
const std::vector<std::shared_ptr<const Result>>& subResults) const {
IdTable result{getResultWidth(), getExecutionContext()->getAllocator()};
// TODO<joka921> Find a solution to cheaply handle the case, that only a
// single result is left. This can probably be done by using the
// `ProtoResult`.

auto sizesView = std::views::transform(
subResults, [](const auto& child) { return child->idTable().size(); });
auto totalResultSize = std::accumulate(sizesView.begin(), sizesView.end(),
1UL, std::multiplies{});
auto totalResultSize =
std::reduce(sizesView.begin(), sizesView.end(), 1UL, std::multiplies{});

size_t totalSizeIncludingLimit = getLimit().actualSize(totalResultSize);
size_t offset = getLimit().actualOffset(totalResultSize);
@@ -211,37 +191,57 @@ ProtoResult CartesianProductJoin::computeResult(
const auto& input = subResultPtr->idTable();
for (const auto& inputCol : input.getColumns()) {
decltype(auto) resultCol = result.getColumn(resultColIdx);
ad_utility::callFixedSize(groupSize, [&]<size_t I>() {
writeResultColumn<I>(resultCol, inputCol, groupSize, offset);
});
writeResultColumn(resultCol, inputCol, groupSize, offset);
++resultColIdx;
}
groupSize *= input.numRows();
}
}

// Dereference all the subresult pointers because `getSharedLocalVocabFrom...`
// requires a range of references, not pointers.
auto subResultsDeref = std::views::transform(
subResults, [](auto& x) -> decltype(auto) { return *x; });
return {std::move(result), resultSortedOn(),
Result::getMergedLocalVocab(subResultsDeref)};
return result;
}
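The switch from `std::accumulate` to `std::reduce` in this file can be illustrated in isolation. The following is a minimal sketch over plain vectors instead of the child views used in the diff; the function names `productOfSizes` and `sumOfWidths` are made up for this example. `std::reduce` may apply the operation in any order, which is safe here because addition and multiplication of unsigned integers are associative and commutative.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Product of all sub-result sizes. The initial value 1 is the neutral
// element, so an empty input yields 1.
std::size_t productOfSizes(const std::vector<std::size_t>& sizes) {
  return std::reduce(sizes.begin(), sizes.end(), std::size_t{1},
                     std::multiplies<>{});
}

// Sum of all child widths. The initial value 0 is the neutral element.
std::size_t sumOfWidths(const std::vector<std::size_t>& widths) {
  return std::reduce(widths.begin(), widths.end(), std::size_t{0},
                     std::plus<>{});
}
```

As with `std::accumulate`, the result type is the type of the initial value, so spelling it as an unsigned type (`0UL` and `1UL` in the diff) keeps the arithmetic unsigned.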

// ____________________________________________________________________________
VariableToColumnMap CartesianProductJoin::computeVariableToColumnMap() const {
VariableToColumnMap result;
// It is crucial that we also count the columns in the inputs to which no
// variable was assigned. This is managed by the `offset` variable.
size_t offset = 0;
for (const auto& child : childView()) {
for (auto varCol : child.getExternallyVisibleVariableColumns()) {
varCol.second.columnIndex_ += offset;
result.insert(std::move(varCol));
// _____________________________________________________________________________
std::vector<std::shared_ptr<const Result>>
CartesianProductJoin::calculateSubResults() {
std::vector<std::shared_ptr<const Result>> subResults;
// We don't need to fully materialize the child results if we have a LIMIT
// specified and an OFFSET of 0.
// TODO<joka921> We could in theory also apply this optimization if a
// non-zero OFFSET is specified, but this would make the algorithm more
// complicated.
std::optional<LimitOffsetClause> limitIfPresent = getLimit();
if (!getLimit()._limit.has_value() || getLimit()._offset != 0) {
limitIfPresent = std::nullopt;
}

// Get all child results (possibly with limit, see above).
for (auto& child : childView()) {
if (limitIfPresent.has_value() && child.supportsLimit()) {
child.setLimit(limitIfPresent.value());
}
subResults.push_back(child.getResult());

const auto& table = subResults.back()->idTable();
// Early stopping: If one of the results is empty, we can stop early.
if (table.empty()) {
break;
}

// If one of the children is the neutral element (because of a triple with
// zero variables), we can simply ignore it here.
if (table.numRows() == 1 && table.numColumns() == 0) {
subResults.pop_back();
continue;
}
// Example for the following calculation: If we have a LIMIT of 1000 and
// the first child already has a result of size 100, then the second child
// needs to evaluate only its first 10 results. The +1 is because integer
// divisions are rounded down by default.
if (limitIfPresent.has_value()) {
limitIfPresent.value()._limit = limitIfPresent.value()._limit.value() /
subResults.back()->idTable().size() +
1;
}
// `getResultWidth` contains all the columns, not only the ones to which a
// variable is assigned.
offset += child.getResultWidth();
}
return result;
return subResults;
}
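The limit arithmetic described in the comments of `calculateSubResults` can be checked with a tiny worked example. This is only a sketch with a hypothetical helper name (`limitForNextChild`), not code from the commit. It divides the current limit by the size of the child that was just computed and adds 1, because integer division rounds down.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: the limit that is handed to the next child after a
// child with `childSize` rows has been computed.
std::size_t limitForNextChild(std::size_t limit, std::size_t childSize) {
  // In the diff, an empty child result ends the loop before this point,
  // so the divisor is never zero there.
  assert(childSize > 0);
  return limit / childSize + 1;
}
```

For a LIMIT of 1000 and a first child with 100 rows this yields 11, a slightly conservative bound for the 10 rows that are strictly needed. For a child with 300 rows it yields 4, and 300 * 4 = 1200 rows are enough to cover the LIMIT, whereas 300 * 3 = 900 would not be.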
14 changes: 9 additions & 5 deletions src/engine/CartesianProductJoin.h
@@ -82,11 +82,15 @@ class CartesianProductJoin : public Operation {
// Copy each element from the `inputColumn` `groupSize` times to the
// `targetColumn`. Repeat until the `targetColumn` is completely filled. Skip
// the first `offset` write operations to the `targetColumn`. Call
// `checkCancellation` after each write. If `StaticGroupSize != 0`, then the
// group size is known at compile time which allows for more efficient loop
// processing for very small group sizes.
template <size_t StaticGroupSize = 0>
// `checkCancellation` after each write.
void writeResultColumn(std::span<Id> targetColumn,
std::span<const Id> inputColumn, size_t groupSize,
size_t offset);
size_t offset) const;

// Write all columns of the subresults into an `IdTable` and return it.
IdTable writeAllColumns(
const std::vector<std::shared_ptr<const Result>>& subResults) const;

// Calculate the subresults of the children and store them into a vector.
std::vector<std::shared_ptr<const Result>> calculateSubResults();
};
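The contract documented for `writeResultColumn` (repeat each input element `groupSize` times, cycle through the input until the target is full, skip the first `offset` writes) can be expressed as a closed-form index computation. The sketch below is an assumption-laden illustration, not the code from the commit: it uses `int` instead of `Id`, returns a new vector instead of writing into a span, omits `checkCancellation`, and the name `repeatColumn` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fill a column of `targetSize` entries: logical row `r` of the full
// (unlimited) result holds input[(r / groupSize) % input.size()], and the
// first `offset` logical rows are skipped.
std::vector<int> repeatColumn(const std::vector<int>& input,
                              std::size_t groupSize, std::size_t offset,
                              std::size_t targetSize) {
  assert(!input.empty() && groupSize > 0);
  std::vector<int> target(targetSize);
  for (std::size_t k = 0; k < targetSize; ++k) {
    target[k] = input[((k + offset) / groupSize) % input.size()];
  }
  return target;
}
```

For a Cartesian product of the columns `{1, 2}` and `{7, 8, 9}`, the first column is written with `groupSize` 1 and the second with `groupSize` 2 (the number of rows of the first input). With a target size of 6 this gives `1, 2, 1, 2, 1, 2` and `7, 7, 8, 8, 9, 9`, which together enumerate all six pairs.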
7 changes: 4 additions & 3 deletions src/engine/CountAvailablePredicates.cpp
@@ -165,9 +165,10 @@ void CountAvailablePredicates::computePatternTrickAllEntities(
TripleComponent::Iri::fromIriref(HAS_PATTERN_PREDICATE), std::nullopt,
std::nullopt}
.toScanSpecification(index);
auto fullHasPattern = index.getPermutation(Permutation::Enum::PSO)
.lazyScan(scanSpec, std::nullopt, {},
cancellationHandle_, deltaTriples());
auto fullHasPattern =
index.getPermutation(Permutation::Enum::PSO)
.lazyScan(scanSpec, std::nullopt, {}, cancellationHandle_,
locatedTriplesSnapshot());
for (const auto& idTable : fullHasPattern) {
for (const auto& patternId : idTable.getColumn(1)) {
AD_CORRECTNESS_CHECK(patternId.getDatatype() == Datatype::Int);