Lazy import for SERVICE #1491

Merged
merged 8 commits into from
Sep 18, 2024

Contributor

@UNEXENU UNEXENU commented Sep 9, 2024

Integrate the LazyJsonParser introduced in #1412 into the SERVICE Operation, which will help to reduce RAM usage for the import of large results. In particular, the (possibly large) JSON result of a SERVICE will not be fully materialized, but converted to a (possibly much smaller) IdTable on the fly. This is a preparation for making the SERVICE operation completely lazy.

@UNEXENU UNEXENU closed this Sep 9, 2024
@UNEXENU UNEXENU reopened this Sep 9, 2024

codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 99.22481% with 1 line in your changes missing coverage. Please review.

Project coverage is 94.15%. Comparing base (dc71166) to head (9fa1043).
Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
src/util/LazyJsonParser.cpp 95.23% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1491      +/-   ##
==========================================
- Coverage   94.15%   94.15%   -0.01%     
==========================================
  Files         347      348       +1     
  Lines       25627    25698      +71     
  Branches     3445     3453       +8     
==========================================
+ Hits        24130    24196      +66     
- Misses       1455     1460       +5     
  Partials       42       42              

Member

@joka921 joka921 left a comment

In general this already looks very nice.
There are some small improvements that I have suggested.

src/engine/Service.cpp: 6 outdated review threads, resolved
src/util/LazyJsonParser.cpp: 1 outdated review thread, resolved
@@ -202,6 +202,7 @@ TEST_F(ServiceTest, computeResult) {
[&](const std::string& result, std::string_view errorMsg,
boost::beast::http::status status = boost::beast::http::status::ok,
std::string contentType = "application/sparql-results+json") {
LOG(INFO) << "MSG: " << errorMsg << '\n';

That seems like some debugging stuff that can be removed?

Member

@joka921 joka921 left a comment

I really like the direction where this is headed, only a few small things are missing.

@@ -20,15 +20,19 @@ namespace ad_utility {
*/
class LazyJsonParser {
public:
// Generator detail, the first 100 input characters for better error context.
struct Details {
std::string first100_;

I think we also want the last 100 characters in addition.
It can also happen that everything starts all right but something goes wrong in the middle, and we need context for that as well.

for (const auto& bytes : partialJson) {
- co_yield reinterpret_cast<const char*>(bytes.data());
+ co_yield std::string(reinterpret_cast<const char*>(bytes.data()),

Suggested change
- co_yield std::string(reinterpret_cast<const char*>(bytes.data()),
+ co_yield std::string_view(reinterpret_cast<const char*>(bytes.data()),
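
The point of the suggestion can be illustrated with a minimal sketch. The helper name asStringView and the use of std::vector<std::byte> as the buffer type are assumptions for illustration only and are not taken from the PR. A std::string_view built from a pointer and a size neither copies the bytes nor relies on a null terminator:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not from the PR): view a byte buffer as text without
// copying it. Because the size is passed explicitly, the buffer does not
// need to be null-terminated.
inline std::string_view asStringView(const std::vector<std::byte>& bytes) {
  return {reinterpret_cast<const char*>(bytes.data()), bytes.size()};
}
```

The returned view is only valid while the underlying buffer is alive and unmodified, so a consumer that needs the data beyond that point still has to copy it.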

// ____________________________________________________________________________
void Service::verifyVariables(
const nlohmann::json& j,
const ad_utility::LazyJsonParser::Generator& gen) const {

It suffices to pass const Details& as the second argument. I was confused about what you need the generator for here and how this should work :)

Comment on lines 371 to 372
throwErrorWithContext("JSON result does not have the expected structure",
gen.details().first100_);

You could also include the complete "head" clause in this error message, because that is what causes the problem here. So something like:
... does not have the expected structure, as its "head" section is not according to the SPARQL standard. The "head" section is: ... (dump the sub-JSON here).


if (responseVars != expectedVars) {
throwErrorWithContext(
absl::StrCat("Header row of JSON result for SERVICE query is \"",

In this message (as well as the one before) also add a "Probable cause: the remote endpoint sent a JSON response that is not according to the SPARQL standard".
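
As a rough sketch of what such a message could look like: the function names joinVars and headerMismatchMessage, and the exact wording after the opening phrase, are hypothetical, and plain std::string concatenation stands in for absl::StrCat to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: join variable names with single spaces.
inline std::string joinVars(const std::vector<std::string>& vars) {
  std::string out;
  for (std::size_t i = 0; i < vars.size(); ++i) {
    if (i > 0) out += ' ';
    out += vars[i];
  }
  return out;
}

// Hypothetical message builder: states what was received, what was
// expected, and the probable cause requested in the review comment.
inline std::string headerMismatchMessage(
    const std::vector<std::string>& expected,
    const std::vector<std::string>& received) {
  return "Header row of JSON result for SERVICE query is \"" +
         joinVars(received) + "\", but expected \"" + joinVars(expected) +
         "\". Probable cause: the remote endpoint sent a JSON response that "
         "is not according to the SPARQL standard.";
}
```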

- ">: ", sv, ". First 100 bytes of the response: ",
- ctx.substr(0, std::min(100, (int)ctx.size()))));
+ this->throwErrorWithContext(sv,
+ ctx.substr(0, std::min(100, (int)ctx.size())));

Suggested change
- ctx.substr(0, std::min(100, (int)ctx.size())));
+ ctx.substr(0, 100));

substr automatically handles the out-of-bounds case.
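
A small sketch of that behaviour, assuming ctx is a std::string_view (the helper name firstN is made up for illustration): the count argument of substr is clamped to the remaining length, and an exception is only thrown when the start position lies beyond the end, which cannot happen for position 0.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: return at most the first `n` characters of `ctx`.
// substr(pos, count) clamps `count` to the available length, so no manual
// std::min is needed. It throws std::out_of_range only if pos > size().
inline std::string_view firstN(std::string_view ctx, std::size_t n = 100) {
  return ctx.substr(0, n);
}
```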

};

// Verify status and content-type of the response.
if (response.status_ != boost::beast::http::status::ok) {
LOG(INFO) << serviceUrl.asString() << '\n';

Should that remain here, or was that just for debugging?

Comment on lines 247 to 249
if (!resultExists || !varsChecked) {
throwErrorWithContext("JSON result does not have the expected structure",
body.details().first100_);

You can make two distinct checks with distinct error messages
(head section was missing, or results section was missing).
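
A minimal sketch of the split, under the assumption that varsChecked means the "head" section was seen and verified and resultExists means a "results" section was seen. The function name and the message wording are hypothetical, and std::runtime_error stands in for the project's own error helper:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the combined check: each missing section gets
// its own error message instead of one generic "unexpected structure" text.
inline void checkSectionsPresent(bool varsChecked, bool resultExists) {
  if (!varsChecked) {
    throw std::runtime_error(
        "JSON result is missing the \"head\" section.");
  }
  if (!resultExists) {
    throw std::runtime_error(
        "JSON result is missing the \"results\" section.");
  }
}
```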

expectThrowOrSilence(
"{\"head\": {\"vars\": 1},"
"\"results\": {\"bindings\": {}}}",
"JSON result does not have the expected structure.");

I think test cases for a missing results section and for a results section with the wrong structure are missing.
In particular, we probably want to catch the internal exceptions of the LazyJsonParser and make something user-friendly out of them. (People will complain to us even if it's the fault of the other endpoint, so we should help our future selves as much as possible.)
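
One way to picture the "catch and rephrase" idea is the sketch below. The wrapper name parseWithFriendlyErrors, the std::function parameter, and the message format are all assumptions for illustration; the real code would call the parser directly and use its own error type.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: run one parsing step and convert any low-level
// exception into a message that names the endpoint the data came from.
inline void parseWithFriendlyErrors(const std::string& endpointUrl,
                                    const std::function<void()>& parseStep) {
  try {
    parseStep();
  } catch (const std::exception& e) {
    throw std::runtime_error("Error while parsing the SERVICE result from <" +
                             endpointUrl + ">: " + e.what());
  }
}
```

A test for the wrong-structure case could then assert on the rephrased message rather than on the parser's internal wording.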

Member

@joka921 joka921 left a comment

Some small things left.

Comment on lines 31 to 32
details.last100_ = chunk.substr(
std::max(0, static_cast<int>(chunk.size() - 100)), 100);

I think we want to have the last 100 always.
So overwrite them directly after setting the first100, but outside the if of course.
That way we also get the context if the parsing of the chunk failed, which is probably the most useful.
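
The requested ordering can be sketched as follows. The field names first100_ and last100_ appear in the diff; the function updateDetails and its isFirstChunk flag are hypothetical. Plain std::size_t arithmetic is used here in place of the cast to int shown in the hunk:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

struct Details {
  std::string first100_;
  std::string last100_;
};

// Hypothetical update step, meant to run for every chunk before it is parsed.
inline void updateDetails(Details& details, std::string_view chunk,
                          bool isFirstChunk) {
  if (isFirstChunk) {
    details.first100_ = chunk.substr(0, 100);
  }
  // Always overwrite the tail, outside the `if`, so the context is already
  // up to date if parsing this chunk fails.
  const std::size_t start = chunk.size() > 100 ? chunk.size() - 100 : 0;
  details.last100_ = chunk.substr(start);
}
```

As in the reviewed code, a chunk shorter than 100 characters yields a shorter tail; the sketch does not stitch the tail together across chunk boundaries.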

@@ -250,7 +252,7 @@ std::optional<nlohmann::json> LazyJsonParser::constructResultFromParsedChunk(
size_t nextChunkStart =
materializeEnd == 0 ? 0 : std::min(materializeEnd + 1, input_.size());
if (input_.size() - nextChunkStart >= 1'000'000) {
- throw std::runtime_error("Ill formed Json.");
+ throw Error("Ill formed JSON.");

This message should be something like
QLever currently doesn't support SERVICE results where a single result row is larger than 1 MB.
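
A sketch of the check with the more descriptive message, assuming pendingBytes is the amount of buffered input that could not yet be turned into complete result rows. The names kMaxPendingBytes and checkPendingSize are made up, and std::runtime_error stands in for the parser's Error type:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Same threshold as in the hunk above; the name is hypothetical.
constexpr std::size_t kMaxPendingBytes = 1'000'000;

// Hypothetical check: refuse to keep buffering once the unparsed remainder
// reaches the limit, and tell the user what the actual limitation is.
inline void checkPendingSize(std::size_t pendingBytes) {
  if (pendingBytes >= kMaxPendingBytes) {
    throw std::runtime_error(
        "QLever currently doesn't support SERVICE results where a single "
        "result row is larger than 1 MB.");
  }
}
```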


@joka921 joka921 merged commit b70df93 into ad-freiburg:master Sep 18, 2024
20 checks passed