Allow bad data #2

Merged
merged 3 commits into from
Apr 2, 2024

@mwylde mwylde commented Mar 30, 2024

This PR adds support for ignoring (or optionally returning, as strings) JSON records that do not conform to the Arrow schema.

The most obvious approach would be to do this in a single pass: when we encounter a field that doesn't conform to the schema, we skip deserializing that row and move on to the next one. However, this does not appear to be possible. arrow-json decodes JSON into arrays column by column rather than row by row, so by the time we detect a problem in the second field, we have already built the entire array for the first field.

Instead, we take a two-pass approach. First, we validate each row (via a new validate_row method on each array decoder) and determine which ones will be deserializable. Then, we deserialize only those rows (if the allow_bad_data option is set to true on the Decoder).
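The two-pass idea can be sketched in isolation as follows. This is a simplified illustration, not the code in this PR: `Value`, `FieldType`, and `decode_columns` are hypothetical stand-ins rather than arrow-json types, and this `validate_row` is a free function that only borrows its name from the per-decoder method described above.

```rust
// Hypothetical, simplified stand-ins; these are not the arrow-json types.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
    Null,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum FieldType {
    Int,
    Str,
}

/// Pass 1: a row is valid only if every field matches its column type.
/// (Assumption for this sketch: nulls are accepted in any column.)
fn validate_row(schema: &[FieldType], row: &[Value]) -> bool {
    row.len() == schema.len()
        && schema.iter().zip(row).all(|(ty, v)| match (ty, v) {
            (_, Value::Null) => true,
            (FieldType::Int, Value::Int(_)) => true,
            (FieldType::Str, Value::Str(_)) => true,
            _ => false,
        })
}

/// Pass 2: build the output one column at a time, visiting only the rows
/// that passed validation, so no column ever contains a rejected row.
fn decode_columns(schema: &[FieldType], rows: &[Vec<Value>]) -> Vec<Vec<Value>> {
    let valid: Vec<&Vec<Value>> = rows
        .iter()
        .filter(|r| validate_row(schema, r))
        .collect();
    (0..schema.len())
        .map(|col| valid.iter().map(|r| r[col].clone()).collect())
        .collect()
}
```

Because validation finishes before any column is built, a type mismatch in a later field can no longer leave earlier columns holding values from a row that ends up rejected.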

There is also a new flush_with_bad_data method on the Decoder, which partitions the good and bad rows, deserializes the good rows into the RecordBatch, and returns the bad rows as strings so that they can be handled separately. It also returns a mask indicating which rows of the original row set were valid, which is helpful for excluding rows in companion arrays (like our timestamp array).
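To make the role of the mask concrete, here is a rough sketch of the partitioning step. It does not reproduce the signature or return type of flush_with_bad_data: `Partitioned`, `partition_rows`, and `filter_by_mask` are hypothetical names, and the validity check is supplied by the caller as a closure.

```rust
/// Hypothetical result shape; not the actual return type of flush_with_bad_data.
struct Partitioned {
    /// Rows that passed validation (these would go on to be deserialized).
    good: Vec<String>,
    /// Rows that failed validation, kept as raw strings for separate handling.
    bad: Vec<String>,
    /// One entry per input row: true if that row was valid.
    mask: Vec<bool>,
}

/// Split rows into good and bad using a caller-supplied validity check,
/// and record which input positions were valid.
fn partition_rows<F>(rows: &[String], is_valid: F) -> Partitioned
where
    F: Fn(&str) -> bool,
{
    let mask: Vec<bool> = rows.iter().map(|r| is_valid(r.as_str())).collect();
    let mut good = Vec::new();
    let mut bad = Vec::new();
    for (row, &ok) in rows.iter().zip(&mask) {
        if ok {
            good.push(row.clone());
        } else {
            bad.push(row.clone());
        }
    }
    Partitioned { good, bad, mask }
}

/// Apply the same mask to a companion array (for example, per-row timestamps)
/// so that it stays aligned with the rows that were kept.
fn filter_by_mask<T: Clone>(values: &[T], mask: &[bool]) -> Vec<T> {
    values
        .iter()
        .zip(mask)
        .filter_map(|(v, &keep)| if keep { Some(v.clone()) } else { None })
        .collect()
}
```

A caller holding one timestamp per input row could pass that array and the returned mask to `filter_by_mask`, so that the timestamps line up with the rows that made it into the batch.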

@github-actions github-actions bot added the arrow label Mar 30, 2024
@mwylde mwylde merged commit b6d7669 into 50.0.0/json Apr 2, 2024
12 of 22 checks passed