Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A lazy parser for the application/sparql-results+json format #1412

Merged
merged 16 commits into from
Aug 29, 2024

Conversation

UNEXENU
Copy link
Contributor

@UNEXENU UNEXENU commented Jul 23, 2024

This parser can be used to lazily parse JSON input where most of the data resides in a single JSON array, the path to which is known in advance with the additional assumption that the single entries of that array are small, but there might be many of them. This assumption holds for the JSON format of SPARQL query results, which consist of some (small) metadata like the contained variables and a large array that contains the result rows, each of which are typically small (at most one entry per variable).
In the future this parser will be used to implement a lazy SERVICE operation.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first comments.

Comment on lines 80 to 93
if (inArrayPath_) {
continue;
}
inString_ = !inString_;

if (expectKey_) {
if (inString_) {
strStart = it + 1;
} else {
absl::StrAppend(&tempKey_, std::string(strStart, it));
setKeysValueType_ = true;
expectKey_ = false;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (inArrayPath_) {
continue;
}
inString_ = !inString_;
if (expectKey_) {
if (inString_) {
strStart = it + 1;
} else {
absl::StrAppend(&tempKey_, std::string(strStart, it));
setKeysValueType_ = true;
expectKey_ = false;
}
}
std::optional<string_view> s = parseString(it, end);
// Do something with the key.

Copy link

codecov bot commented Jul 29, 2024

Codecov Report

Attention: Patch coverage is 98.77301% with 2 lines in your changes missing coverage. Please review.

Project coverage is 89.47%. Comparing base (be816a9) to head (9b760ea).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
src/util/LazyJsonParser.cpp 98.77% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1412      +/-   ##
==========================================
+ Coverage   89.41%   89.47%   +0.06%     
==========================================
  Files         346      347       +1     
  Lines       25350    25513     +163     
  Branches     3381     3433      +52     
==========================================
+ Hits        22667    22829     +162     
  Misses       1494     1494              
- Partials     1189     1190       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first thorough round of reviews.

Comment on lines 8 to 9

LazyJsonParser::LazyJsonParser(std::vector<std::string> arrayPath)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LazyJsonParser::LazyJsonParser(std::vector<std::string> arrayPath)
// ____________________________________________________________________________________
LazyJsonParser::LazyJsonParser(std::vector<std::string> arrayPath)

(Everywhere in .cpp files)

@@ -0,0 +1,159 @@
#include "LazyJsonParser.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

global include paths.

namespace ad_utility {

LazyJsonParser::LazyJsonParser(std::vector<std::string> arrayPath)
: arrayPath_(arrayPath),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
: arrayPath_(arrayPath),
: arrayPath_(std::move(arrayPath)),

Comment on lines 24 to 28
auto tryAddKeyToPath = [this]() {
if (strEnd_ != 0) {
curPath_.emplace_back(input_.substr(strStart_, strEnd_ - strStart_));
}
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a documented private member function.

src/util/LazyJsonParser.cpp Show resolved Hide resolved
Comment on lines 28 to 29
// Parses a chunk of JSON data and returns it with reconstructed structure.
std::string parseChunk(const std::string& inStr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Parses a chunk of JSON data and returns it with reconstructed structure.
std::string parseChunk(const std::string& inStr);
// Parses a chunk of JSON data and returns it with reconstructed structure.
std::string parseChunk(std::string_view inStr);

Comment on lines 21 to 22
// TODO<unexenu> this should take a generator as argument instead
static std::string parse(const std::string& s,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can take an auto&& for the first part.
We can then constrain this template to be some kind of range that produces some kinds of strings,
but that is maybe for another time.

Comment on lines 41 to 57
std::string input_;
bool isEscaped_{false};
bool inString_{false};
bool inArrayPath_{false};
int openBracesInArrayPath_{0};
int openBrackets_{0};
unsigned int yieldCount_{0};

size_t strStart_{0};
size_t strEnd_{0};

std::vector<std::string> curPath_;
const std::vector<std::string> arrayPath_;

const std::string prefixInArray_;
const std::string suffixInArray_;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These need comments (but first clean up the structure as suggested.

const std::vector<std::string>& arrayPath) {
LazyJsonParser p(arrayPath);
return p.parseChunk(s);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need some safety measure to make sure that we cannot haven an denial of service attack with ill-formed JSON
(basically some thresholds like "after 1 Megabyte we must be in the array path, each entry in the arrayPath must be at most 1 MB and the same goes for everything after the array path.

This should be relatively easy to track.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a simple check for that in the latest commit.

While it's possible to compute this precisely, the current solution compares the first/last element in the arrayPath together with the part before/after the arrayPath against the threshold.
All other elements are checked one by one.

This leads to the first/last element of arrayPath being more restricted in size than the other elements.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mechanism now looks good to me.

test/LazyJsonParserTest.cpp Show resolved Hide resolved
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much much better,
I can now read and understand this code:)

I have some suggestions, but we are getting there!

Comment on lines 22 to 29
static cppcoro::generator<std::string> parse(
cppcoro::generator<std::string> partJson,
std::vector<std::string> arrayPath) {
LazyJsonParser p(arrayPath);
for (const auto& chunk : partJson) {
co_yield p.parseChunk(chunk);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this generator yield JSON objects, to make it more typesafe?
Just add a the json::parse call into this function (same for the parse overload below).

Comment on lines 31 to 36
// As above just on a single string.
static std::string parse(const std::string& s,
const std::vector<std::string>& arrayPath) {
LazyJsonParser p(arrayPath);
return p.parseChunk(s);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of this function? Do we still need it? Otherwise it is best removed.

Comment on lines 44 to 45
switch (state_.index()) {
case 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a cascade of if (holds_alternative) here, nobody knows what index 1 means:)

Comment on lines 50 to 53
if (arrayPath_.empty() &&
std::holds_alternative<AfterArrayPath>(state_)) {
materializeEnd = idx;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like something that should be inside parseInArrayPath,
because here we don't have the necessary context to understand it.

if (idx >= 1'000'000) {
throw std::runtime_error(
"Ill-formed JSON: Header size exceeds 1 Megabyte.");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mechanism only catches the case
if the single chunks are very big.
That however is not the big problem, because that is the user's responsibility.
The problem is rather, if we parse many chunks without seeing the end of the header, or the end of an array.

The solution to this is basically also simple:
After each chunk (can be done in the very outer loop) Assert that the size of the remaining input (that you have to carry on to the next chunk as a prefix) is smaller than the given threshold if that makes sense.
So no need to add something within the single chunks, if someone wants to pass in megabytes here, this is not our problem:)

break;
case ',':
if (state.openBracketsAndBraces_ == 0) {
materializeEnd = idx;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this materializeEnd is not read, but only written, it should be a return value of this function
(together with a comment, what this function returns).

Comment on lines 199 to 202
if (idx - state.arrayEndIdx_ > 1'000'000) {
throw std::runtime_error(
"Ill-formed JSON: Element size exceeds 1 Megabyte.");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

state.remainingBraces_ -= 1;
if (state.remainingBraces_ == 0) {
// End reached.
materializeEnd = idx + 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, should be a return value.

Comment on lines 239 to 240
absl::StrAppend(&res, input_.substr(0, materializeEnd));
input_ = input_.substr(std::min(materializeEnd + 1, input_.size()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implicitly skips one character.
It is probably correct, but it looks really fishy.
This either needs documentation (why is this correct in all cases), or has to be fixed.
Is it correct , that this only happens either at a , inside the ArrayPath, or completely at the end after the trailer when the input is done.
So you can assert something like (inArrayPath && skippedCharIsComm) || ( afterArrayPath + materializeEnd is really at the end).

Comment on lines 251 to 254
if (litLength_ > 0) {
curPath_.emplace_back(input.substr(litStart_, litLength_));
// The marked literal got consumed/added as key -> reset litLength_.
litLength_ = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually: Are the "" part of the literal or not?
Otherwise your case litLength_ == 0 is ambigous (no literal vs. empty literal).
I think the much cleaner interface is to store the
litStart_ and litLenght_ in a struct and wrap this in an std::optional<> s.t. you really have a clear "there just was no literal" case.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some very minor suggestions,
This is really looking good now.

cppcoro::generator<std::string> partJson,
std::vector<std::string> arrayPath) {
LazyJsonParser p(arrayPath);
for (const auto& chunk : partJson) {
co_yield p.parseChunk(chunk);
if (auto res = p.parseChunk(chunk); res.has_value()) {
co_yield res;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, thsi box contained nonsense because I didn't understand the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • This is not a template and not super performance critical,
    You can move the definition to the .cpp file.

@@ -67,16 +63,13 @@ void LazyJsonParser::parseLiteral(size_t& idx) {
if (input_[idx] == '"' && !inLiteral_) {
++idx;
if (std::holds_alternative<BeforeArrayPath>(state_)) {
std::get<BeforeArrayPath>(state_).litStart_ = idx;
std::get<BeforeArrayPath>(state_).optLiteral_ =
BeforeArrayPath::LiteralView{.start_ = idx};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An explicit .length_ = 0 would also be nice, then one doesn't have to look that up and it doesn't hurt.

Comment on lines 81 to 82
std::get<BeforeArrayPath>(state_).optLiteral_->length_ =
idx - std::get<BeforeArrayPath>(state_).optLiteral_->start_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use optLiteral.value().lent_ , some for start .
These unsafe accesses give me goosebumps.

if (yieldCount_ > 0) {
res = prefixInArray_;
resStr = prefixInArray_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use a ternary instead of three lines: std::string resStr = yeldCount_ >0 ? prefixInArray_ : "";

const std::vector<std::string>& arrayPath) {
LazyJsonParser p(arrayPath);
return p.parseChunk(s);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mechanism now looks good to me.

EXPECT_NO_THROW(expectYields("[1,2,3]", {}, arrayPath));
EXPECT_NO_THROW(expectYields("{}", {}, arrayPath));

EXPECT_ANY_THROW(expectYields(std::string(1'000'000, '0'), {}, arrayPath));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The yellow branches inside the Ad_CORRECTNESS_CHECKs seem to indicate that you haven't tested the case, that
e.g. there is a literal that is between a chunk.
But then again this looks odd to me, because you yield single characters.
Maybe we can both have a look at this.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!
Only two tiny suggestions (best practices)
and then this is ready to go!

cppcoro::generator<nlohmann::json> LazyJsonParser::parse(
cppcoro::generator<std::string> partJson,
std::vector<std::string> arrayPath) {
LazyJsonParser p(arrayPath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LazyJsonParser p(arrayPath);
LazyJsonParser p{std::move(arrayPath});

for (const auto& chunk : partJson) {
if (auto res = p.parseChunk(chunk); res.has_value()) {
co_yield res;
if (p.endReached_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is a (albeit static) member function of LazyJsonParser, endReached_ can (as all the other member variabls) be private, which then also gets rid of the Sonarcloud message.

std::vector<std::string> arrayPath);

// Indicates whether the end of the object has been reached.
bool endReached_{false};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As stated above, I am pretty sure that this can be private.

Copy link

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! This is a milestone towards more RAM (and possibly time-efficient) SERVICE handling.

@joka921 joka921 changed the title Implement LazyJsonParser A lazy parser for the application/sparql-results+json format Aug 29, 2024
@joka921 joka921 merged commit 7dcfea2 into ad-freiburg:master Aug 29, 2024
20 checks passed
@UNEXENU UNEXENU mentioned this pull request Sep 9, 2024
joka921 pushed a commit that referenced this pull request Sep 18, 2024
Integrate the `LazyJsonParser` introduced in #1412 into the `SERVICE` Operation, which will help to reduce RAM usage for the import of large results. In particular, the (possibly large) JSON result of a SERVICE will not be fully materialized, but converted to a (possibly much smaller) `IdTable` on the fly. This is a preparation for making the SERVICE operation completely lazy.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants