Parsing a string column containing JSON values into a typed array #6522
Labels
- enhancement: Any new improvement worthy of an entry in the changelog
- good first issue: Good for newcomers
- help wanted
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I have a nullable `StringArray` column that contains JSON object literals. I need to JSON-parse the column into a `StructArray` of values following some schema, and NULL input values should become NULL output values.

This can almost be implemented using `arrow_json::reader::ReaderBuilder::build_decoder` and then feeding in the bytes of each string. But the decoder has no concept of record separators in the input stream, so invalid inputs such as blank strings (`""`), truncated records (`"{\"a\":1"`), or multiple objects (`"{\"a\": 1} {\"a\": 2}"`) will confuse the decoding process. If we're lucky, it produces the wrong number of records, but an adversarial input could easily appear to produce the correct number of records even though no single input string was a valid JSON object. So if I want that safety, I'm forced to parse each string as its own `RecordBatch` (which can then be validated independently) and then concatenate them all. Ugly, error-prone, and inefficient (example code below, with panics instead of full error handling):
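A minimal sketch of that per-row workaround, assuming the target schema has only nullable fields; the helper name `parse_json_per_row` and the use of `{}` as a stand-in for NULL inputs are illustrative, not part of arrow-json:

```rust
use std::sync::Arc;

use arrow_array::{Array, RecordBatch, StringArray, StructArray};
use arrow_json::reader::ReaderBuilder;
use arrow_schema::Schema;
use arrow_select::concat::concat_batches;

/// Parse each element of `col` as one JSON object by giving it its own
/// decoder, validate the resulting single-row batch, then concatenate.
fn parse_json_per_row(col: &StringArray, schema: Arc<Schema>) -> StructArray {
    let batches: Vec<RecordBatch> = (0..col.len())
        .map(|i| {
            // NULL inputs decode an empty object as a placeholder; the real
            // null mask is re-applied to the concatenated result below.
            let json = if col.is_valid(i) { col.value(i) } else { "{}" };
            let mut decoder = ReaderBuilder::new(schema.clone())
                .build_decoder()
                .expect("valid schema");
            decoder.decode(json.as_bytes()).expect("malformed JSON");
            let batch = decoder
                .flush()
                .expect("truncated JSON")
                .expect("empty input string");
            // Each input string must have held exactly one JSON object.
            assert_eq!(batch.num_rows(), 1, "expected exactly one object");
            batch
        })
        .collect();

    // Concatenate the single-row batches and restore the input null mask.
    let combined = concat_batches(&schema, &batches).expect("concat failed");
    let data = StructArray::from(combined)
        .into_data()
        .into_builder()
        .nulls(col.nulls().cloned())
        .build()
        .expect("valid array data");
    StructArray::from(data)
}
```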
Describe the solution you'd like
Ideally, the JSON `Decoder` could define public methods that report how many rows the decoder currently has buffered, and whether the decoder is currently at a record boundary. This is essentially a side-effect-free version of the same check that `Tape::finish` already performs when `Decoder::flush` is called (see the sketch below).
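A hypothetical shape for those additions inside arrow-json; the method names and bodies here are illustrative only and do not exist in the crate today:

```rust
// Sketch of possible additions to arrow_json::reader::Decoder (illustrative
// names; implementation bodies elided).
impl Decoder {
    /// Number of complete top-level JSON values currently buffered.
    pub fn num_buffered_rows(&self) -> usize {
        todo!("expose the row count the tape decoder already tracks")
    }

    /// True if the bytes fed so far end exactly on a value boundary, i.e.
    /// flushing now would not cut a record in half. This is the side-effect-free
    /// variant of the check `Tape::finish` performs during `Decoder::flush`.
    pub fn is_at_record_boundary(&self) -> bool {
        todo!("check the tape decoder's parse state without consuming it")
    }
}
```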
That way, the above implementation becomes a bit simpler and a lot more efficient:
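A sketch of how the `parse_json` workaround could then be written, assuming the hypothetical `num_buffered_rows`/`is_at_record_boundary` methods above: one decoder, one flush, and no per-row concatenation. Panics stand in for error handling, as before:

```rust
use std::sync::Arc;

use arrow_array::{Array, StringArray, StructArray};
use arrow_json::reader::ReaderBuilder;
use arrow_schema::Schema;

fn parse_json(col: &StringArray, schema: Arc<Schema>) -> StructArray {
    // One decoder for the whole column; the batch size must cover every row so
    // `decode` never stops early because its buffer is full.
    let mut decoder = ReaderBuilder::new(schema)
        .with_batch_size(col.len().max(1))
        .build_decoder()
        .expect("valid schema");

    for i in 0..col.len() {
        // Decode "{}" as a placeholder for NULL inputs; the input null mask is
        // re-applied to the result below.
        let json = if col.is_valid(i) { col.value(i) } else { "{}" };
        decoder.decode(json.as_bytes()).expect("malformed JSON");

        // With the proposed methods, each input string is validated right away:
        // it must end on a record boundary and add exactly one buffered row.
        assert!(decoder.is_at_record_boundary(), "truncated JSON object");
        assert_eq!(decoder.num_buffered_rows(), i + 1, "expected one object");
    }

    // A single flush, then restore the input null mask on the struct column.
    let batch = decoder.flush().expect("flush failed").expect("empty column");
    let data = StructArray::from(batch)
        .into_data()
        .into_builder()
        .nulls(col.nulls().cloned())
        .build()
        .expect("valid array data");
    StructArray::from(data)
}
```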
It would be even nicer if the `parse_json` method could just become part of either arrow-json or arrow-compute, if parsing strings as JSON is deemed a general enough operation to deserve its own API call.

Describe alternatives you've considered
Tried shoving each string manually into a single `Decoder` to produce a single `RecordBatch`, but the above-mentioned safety issues made it very brittle (wrong row counts, incorrect values, etc.). Currently using the ugly/slow solution mentioned earlier, which creates and validates one `RecordBatch` per row before concatenating them all into a single `RecordBatch`.