-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A lazy parser for the application/sparql-results+json
format
#1412
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first comments.
src/util/LazyJsonParser.cpp
Outdated
if (inArrayPath_) { | ||
continue; | ||
} | ||
inString_ = !inString_; | ||
|
||
if (expectKey_) { | ||
if (inString_) { | ||
strStart = it + 1; | ||
} else { | ||
absl::StrAppend(&tempKey_, std::string(strStart, it)); | ||
setKeysValueType_ = true; | ||
expectKey_ = false; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (inArrayPath_) { | |
continue; | |
} | |
inString_ = !inString_; | |
if (expectKey_) { | |
if (inString_) { | |
strStart = it + 1; | |
} else { | |
absl::StrAppend(&tempKey_, std::string(strStart, it)); | |
setKeysValueType_ = true; | |
expectKey_ = false; | |
} | |
} | |
std::optional<string_view> s = parseString(it, end); | |
// Do something with the key. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1412 +/- ##
==========================================
+ Coverage 89.41% 89.47% +0.06%
==========================================
Files 346 347 +1
Lines 25350 25513 +163
Branches 3381 3433 +52
==========================================
+ Hits 22667 22829 +162
Misses 1494 1494
- Partials 1189 1190 +1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A first thorough round of reviews.
src/util/LazyJsonParser.cpp
Outdated
|
||
LazyJsonParser::LazyJsonParser(std::vector<std::string> arrayPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LazyJsonParser::LazyJsonParser(std::vector<std::string> arrayPath) | |
// ____________________________________________________________________________________ | |
LazyJsonParser::LazyJsonParser(std::vector<std::string> arrayPath) |
(Everywhere in .cpp
files)
src/util/LazyJsonParser.cpp
Outdated
@@ -0,0 +1,159 @@ | |||
#include "LazyJsonParser.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
global include paths.
src/util/LazyJsonParser.cpp
Outdated
namespace ad_utility { | ||
|
||
LazyJsonParser::LazyJsonParser(std::vector<std::string> arrayPath) | ||
: arrayPath_(arrayPath), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
: arrayPath_(arrayPath), | |
: arrayPath_(std::move(arrayPath)), |
src/util/LazyJsonParser.cpp
Outdated
auto tryAddKeyToPath = [this]() { | ||
if (strEnd_ != 0) { | ||
curPath_.emplace_back(input_.substr(strStart_, strEnd_ - strStart_)); | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a documented private member function.
src/util/LazyJsonParser.h
Outdated
// Parses a chunk of JSON data and returns it with reconstructed structure. | ||
std::string parseChunk(const std::string& inStr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Parses a chunk of JSON data and returns it with reconstructed structure. | |
std::string parseChunk(const std::string& inStr); | |
// Parses a chunk of JSON data and returns it with reconstructed structure. | |
std::string parseChunk(std::string_view inStr); |
src/util/LazyJsonParser.h
Outdated
// TODO<unexenu> this should take a generator as argument instead | ||
static std::string parse(const std::string& s, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can take an auto&&
for the first part.
We can then constrain this template to be some kind of range that produces some kinds of strings,
but that is maybe for another time.
src/util/LazyJsonParser.h
Outdated
std::string input_; | ||
bool isEscaped_{false}; | ||
bool inString_{false}; | ||
bool inArrayPath_{false}; | ||
int openBracesInArrayPath_{0}; | ||
int openBrackets_{0}; | ||
unsigned int yieldCount_{0}; | ||
|
||
size_t strStart_{0}; | ||
size_t strEnd_{0}; | ||
|
||
std::vector<std::string> curPath_; | ||
const std::vector<std::string> arrayPath_; | ||
|
||
const std::string prefixInArray_; | ||
const std::string suffixInArray_; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These need comments (but first clean up the structure as suggested.
src/util/LazyJsonParser.h
Outdated
const std::vector<std::string>& arrayPath) { | ||
LazyJsonParser p(arrayPath); | ||
return p.parseChunk(s); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need some safety measure to make sure that we cannot haven an denial of service attack with ill-formed JSON
(basically some thresholds like "after 1 Megabyte we must be in the array path, each entry in the arrayPath must be at most 1 MB and the same goes for everything after the array path.
This should be relatively easy to track.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a simple check for that in the latest commit.
While it's possible to compute this precisely, the current solution compares the first/last element in the arrayPath together with the part before/after the arrayPath against the threshold.
All other elements are checked one by one.
This leads to the first/last element of arrayPath being more restricted in size than the other elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mechanism now looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much much better,
I can now read and understand this code:)
I have some suggestions, but we are getting there!
src/util/LazyJsonParser.h
Outdated
static cppcoro::generator<std::string> parse( | ||
cppcoro::generator<std::string> partJson, | ||
std::vector<std::string> arrayPath) { | ||
LazyJsonParser p(arrayPath); | ||
for (const auto& chunk : partJson) { | ||
co_yield p.parseChunk(chunk); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this generator yield JSON
objects, to make it more typesafe?
Just add a the json::parse call into this function (same for the parse
overload below).
src/util/LazyJsonParser.h
Outdated
// As above just on a single string. | ||
static std::string parse(const std::string& s, | ||
const std::vector<std::string>& arrayPath) { | ||
LazyJsonParser p(arrayPath); | ||
return p.parseChunk(s); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the point of this function? Do we still need it? Otherwise it is best removed.
src/util/LazyJsonParser.cpp
Outdated
switch (state_.index()) { | ||
case 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a cascade of if (holds_alternative)
here, nobody knows what index 1
means:)
src/util/LazyJsonParser.cpp
Outdated
if (arrayPath_.empty() && | ||
std::holds_alternative<AfterArrayPath>(state_)) { | ||
materializeEnd = idx; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like something that should be inside parseInArrayPath
,
because here we don't have the necessary context to understand it.
src/util/LazyJsonParser.cpp
Outdated
if (idx >= 1'000'000) { | ||
throw std::runtime_error( | ||
"Ill-formed JSON: Header size exceeds 1 Megabyte."); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mechanism only catches the case
if the single chunks are very big.
That however is not the big problem, because that is the user's responsibility.
The problem is rather, if we parse many chunks without seeing the end of the header, or the end of an array.
The solution to this is basically also simple:
After each chunk (can be done in the very outer loop) Assert that the size of the remaining input (that you have to carry on to the next chunk as a prefix) is smaller than the given threshold if that makes sense.
So no need to add something within the single chunks, if someone wants to pass in megabytes here, this is not our problem:)
break; | ||
case ',': | ||
if (state.openBracketsAndBraces_ == 0) { | ||
materializeEnd = idx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this materializeEnd
is not read, but only written, it should be a return value of this function
(together with a comment, what this function returns).
src/util/LazyJsonParser.cpp
Outdated
if (idx - state.arrayEndIdx_ > 1'000'000) { | ||
throw std::runtime_error( | ||
"Ill-formed JSON: Element size exceeds 1 Megabyte."); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
src/util/LazyJsonParser.cpp
Outdated
state.remainingBraces_ -= 1; | ||
if (state.remainingBraces_ == 0) { | ||
// End reached. | ||
materializeEnd = idx + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, should be a return value.
src/util/LazyJsonParser.cpp
Outdated
absl::StrAppend(&res, input_.substr(0, materializeEnd)); | ||
input_ = input_.substr(std::min(materializeEnd + 1, input_.size())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implicitly skips one character.
It is probably correct, but it looks really fishy.
This either needs documentation (why is this correct in all cases), or has to be fixed.
Is it correct , that this only happens either at a ,
inside the ArrayPath, or completely at the end after the trailer when the input is done.
So you can assert something like (inArrayPath && skippedCharIsComm) || ( afterArrayPath + materializeEnd is really at the end).
src/util/LazyJsonParser.cpp
Outdated
if (litLength_ > 0) { | ||
curPath_.emplace_back(input.substr(litStart_, litLength_)); | ||
// The marked literal got consumed/added as key -> reset litLength_. | ||
litLength_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually: Are the ""
part of the literal or not?
Otherwise your case litLength_ == 0
is ambigous (no literal vs. empty literal).
I think the much cleaner interface is to store the
litStart_ and litLenght_
in a struct and wrap this in an std::optional<>
s.t. you really have a clear "there just was no literal" case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some very minor suggestions,
This is really looking good now.
src/util/LazyJsonParser.h
Outdated
cppcoro::generator<std::string> partJson, | ||
std::vector<std::string> arrayPath) { | ||
LazyJsonParser p(arrayPath); | ||
for (const auto& chunk : partJson) { | ||
co_yield p.parseChunk(chunk); | ||
if (auto res = p.parseChunk(chunk); res.has_value()) { | ||
co_yield res; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, thsi box contained nonsense because I didn't understand the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This is not a template and not super performance critical,
You can move the definition to the.cpp
file.
src/util/LazyJsonParser.cpp
Outdated
@@ -67,16 +63,13 @@ void LazyJsonParser::parseLiteral(size_t& idx) { | |||
if (input_[idx] == '"' && !inLiteral_) { | |||
++idx; | |||
if (std::holds_alternative<BeforeArrayPath>(state_)) { | |||
std::get<BeforeArrayPath>(state_).litStart_ = idx; | |||
std::get<BeforeArrayPath>(state_).optLiteral_ = | |||
BeforeArrayPath::LiteralView{.start_ = idx}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An explicit .length_ = 0
would also be nice, then one doesn't have to look that up and it doesn't hurt.
src/util/LazyJsonParser.cpp
Outdated
std::get<BeforeArrayPath>(state_).optLiteral_->length_ = | ||
idx - std::get<BeforeArrayPath>(state_).optLiteral_->start_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use optLiteral.value().lent_
, some for start
.
These unsafe accesses give me goosebumps.
src/util/LazyJsonParser.cpp
Outdated
if (yieldCount_ > 0) { | ||
res = prefixInArray_; | ||
resStr = prefixInArray_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use a ternary instead of three lines: std::string resStr = yeldCount_ >0 ? prefixInArray_ : "";
src/util/LazyJsonParser.h
Outdated
const std::vector<std::string>& arrayPath) { | ||
LazyJsonParser p(arrayPath); | ||
return p.parseChunk(s); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mechanism now looks good to me.
EXPECT_NO_THROW(expectYields("[1,2,3]", {}, arrayPath)); | ||
EXPECT_NO_THROW(expectYields("{}", {}, arrayPath)); | ||
|
||
EXPECT_ANY_THROW(expectYields(std::string(1'000'000, '0'), {}, arrayPath)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The yellow branches inside the Ad_CORRECTNESS_CHECKs
seem to indicate that you haven't tested the case, that
e.g. there is a literal that is between a chunk.
But then again this looks odd to me, because you yield single characters.
Maybe we can both have a look at this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice!
Only two tiny suggestions (best practices)
and then this is ready to go!
src/util/LazyJsonParser.cpp
Outdated
cppcoro::generator<nlohmann::json> LazyJsonParser::parse( | ||
cppcoro::generator<std::string> partJson, | ||
std::vector<std::string> arrayPath) { | ||
LazyJsonParser p(arrayPath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LazyJsonParser p(arrayPath); | |
LazyJsonParser p{std::move(arrayPath}); |
for (const auto& chunk : partJson) { | ||
if (auto res = p.parseChunk(chunk); res.has_value()) { | ||
co_yield res; | ||
if (p.endReached_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is a (albeit static
) member function of LazyJsonParser
, endReached_
can (as all the other member variabls) be private, which then also gets rid of the Sonarcloud message.
src/util/LazyJsonParser.h
Outdated
std::vector<std::string> arrayPath); | ||
|
||
// Indicates whether the end of the object has been reached. | ||
bool endReached_{false}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As stated above, I am pretty sure that this can be private
.
Quality Gate passedIssues Measures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! This is a milestone towards more RAM (and possibly time-efficient) SERVICE handling.
application/sparql-results+json
format
Integrate the `LazyJsonParser` introduced in #1412 into the `SERVICE` Operation, which will help to reduce RAM usage for the import of large results. In particular, the (possibly large) JSON result of a SERVICE will not be fully materialized, but converted to a (possibly much smaller) `IdTable` on the fly. This is a preparation for making the SERVICE operation completely lazy.
This parser can be used to lazily parse JSON input where most of the data resides in a single JSON array, the path to which is known in advance with the additional assumption that the single entries of that array are small, but there might be many of them. This assumption holds for the JSON format of SPARQL query results, which consist of some (small) metadata like the contained variables and a large array that contains the result rows, each of which are typically small (at most one entry per variable).
In the future this parser will be used to implement a lazy
SERVICE
operation.