Raw JSON writer (~10x faster) (#5314) #5318

tustvold · 2024-01-20T13:03:28Z

Which issue does this PR close?

Rationale for this change

bench_primitive         time:   [2.9493 ms 2.9504 ms 2.9515 ms]
                        change: [-86.148% -86.054% -85.962%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

bench_mixed             time:   [6.0997 ms 6.1016 ms 6.1038 ms]
                        change: [-86.966% -86.905% -86.856%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  1 (1.00%) high severe

bench_struct            time:   [7.4146 ms 7.4169 ms 7.4193 ms]
                        change: [-89.586% -89.536% -89.485%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

bench_nullable_struct   time:   [2.3035 ms 2.3052 ms 2.3070 ms]
                        change: [-91.153% -91.131% -91.109%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

bench_list              time:   [2.4907 ms 2.4960 ms 2.5011 ms]
                        change: [-91.988% -91.965% -91.942%] (p = 0.00 < 0.05)
                        Performance has improved.

bench_nullable_list     time:   [982.97 µs 983.21 µs 983.46 µs]
                        change: [-85.818% -85.807% -85.795%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

Benchmarking bench_struct_list: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.9s, enable flat sampling, or reduce sample count to 40.
bench_struct_list       time:   [1.9682 ms 1.9691 ms 1.9699 ms]
                        change: [-89.756% -89.726% -89.700%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

What changes are included in this PR?

Are there any user-facing changes?

tustvold · 2024-01-20T13:11:24Z

arrow-json/src/writer.rs

@@ -20,28 +20,6 @@
 //! This JSON writer converts Arrow [`RecordBatch`]es into arrays of
 //! JSON objects or JSON formatted byte streams.
 //!
-//! ## Writing JSON Objects


This functionality isn't removed yet, but it is deprecated as I can't think of any reasonable use-cases for it. If you want to embed arrow data in another JSON document, serde_json's raw value mechanism is an objectively better way to do so.

but it is deprecated as I can't think of any reasonable use-cases for this.

Looks like @houqp added it in d868cff many 🌔 's ago - perhaps he has some additional context.

I agree I can't really think of why this would be useful - it seems like it may be similar to wanting to convert RecordBatches into actual Rust structs via serde but I can't remember how far we got with that

Given I am not familiar with serde_json's raw value mechanism I suspect others may not be either

Perhaps you can add a note here about writing JSON objects using serde and leave a link for readers to follow

Hey @tustvold, could you clarify on what the serde_json's raw value mechanism is you're thinking of?

https://docs.rs/serde_json/latest/serde_json/value/struct.RawValue.html

Yeah, this isn't clear to me either as I mentioned in the original review -- I made a PR to add an example showing how to use this: #5364

tustvold · 2024-01-20T13:12:21Z

arrow-json/src/writer.rs

@@ -1564,9 +1575,9 @@ mod tests {
            r#"{"a":{"list":[1,2]},"b":{"list":[1,2]}}
 {"a":{"list":[null]},"b":{"list":[null]}}
 {"a":{"list":[]},"b":{"list":[]}}
-{"a":null,"b":{"list":[3,null]}}
+{"b":{"list":[3,null]}}


The prior behaviour feels like a bug to me, without explicit nulls set I would expect consistent use of implicit nulls. The fact that null objects happen to be treated differently to null primitives seems at best confusing.

@Jefffrey I remember you working on something related in #5133 and wonder if you have any thoughts about this

This does seem like it was a bug previously, I'm just racking my brain to remember if I was aware of this before or not, if there was a reason for this 🤔

I think when I worked on #5133 I just forgot to consider my previous work for writing explicit nulls in #5065.

This fix makes sense; I believe the only case where we should write nulls when explicit_nulls is set to false (i.e. the default) is for list values, and nothing else. This falls in line with that 👍
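The rule agreed on in this thread can be sketched in plain Rust. This is a minimal illustration only, not the arrow-json implementation: `Field`, `encode_row`, and the use of `Option` in place of Arrow validity bitmaps are all assumptions made for the example, and field names are assumed to need no escaping.

```rust
/// Hypothetical column value for one row: an optional integer, or an
/// optional list of optional integers.
enum Field {
    Int(Option<i64>),
    List(Option<Vec<Option<i64>>>),
}

/// Encode one row as a JSON object.
///
/// With `explicit_nulls == false`, keys whose value is null are omitted
/// entirely, whatever their type. Nulls *inside* a list are always written,
/// because dropping them would shift the positions of the other elements.
fn encode_row(fields: &[(&str, Field)], explicit_nulls: bool) -> String {
    let mut out = String::from("{");
    let mut first = true;
    for (name, field) in fields {
        let is_null = matches!(field, Field::Int(None) | Field::List(None));
        if is_null && !explicit_nulls {
            continue; // implicit null: skip the key
        }
        if !first {
            out.push(',');
        }
        first = false;
        out.push('"');
        out.push_str(name);
        out.push_str("\":");
        match field {
            Field::Int(Some(v)) => out.push_str(&v.to_string()),
            Field::List(Some(items)) => {
                out.push('[');
                for (i, item) in items.iter().enumerate() {
                    if i > 0 {
                        out.push(',');
                    }
                    match item {
                        Some(v) => out.push_str(&v.to_string()),
                        None => out.push_str("null"),
                    }
                }
                out.push(']');
            }
            // Only reached for a null field when explicit_nulls is true
            _ => out.push_str("null"),
        }
    }
    out.push('}');
    out
}
```

Calling `encode_row` on a row with a null `a` and a list `b` containing a null drops `a` by default but keeps the null inside `b`, which mirrors the updated test expectation in the diff above.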

tustvold · 2024-01-23T19:31:19Z

arrow-json/test/data/basic.json

@@ -1,5 +1,5 @@
-{"a":1, "b":2.0, "c":false, "d":"4", "e":"1970-1-2", "f": "1.02", "g": "2012-04-23T18:25:43.511", "h": 1.1}
-{"a":-10, "b":-3.5, "c":true, "d":"4", "e": "1969-12-31", "f": "-0.3", "g": "2016-04-23T18:25:43.511", "h": 3.141}
+{"a":1, "b":2.0, "c":false, "d":"4", "e":"1970-1-2", "f": "1.02", "g": "2012-04-23T18:25:43.511", "h": 1.2802734375}


The previous writer had some questionable logic to truncate the precision of its output. We no longer do this, and so we need to use a float that can be exactly represented in an f16 in order for it to round-trip precisely.
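The choice of `1.2802734375` can be checked by hand. An f16 carries 11 significant bits, so a value in [1, 2) is exactly representable only if it is a multiple of 2^-10; 1.2802734375 equals 1311/1024, whereas 1.1 is not a multiple of 1/1024. The helper below is a hypothetical sketch (the name `is_exact_in_f16` is not from the PR) that performs this check using only `f64` arithmetic.

```rust
/// Returns true if `x` can be represented in IEEE 754 binary16 (f16)
/// without rounding. Hypothetical helper for illustration.
fn is_exact_in_f16(x: f64) -> bool {
    if x == 0.0 {
        return true;
    }
    if !x.is_finite() {
        return false;
    }
    let a = x.abs();
    // 65504 is the largest finite f16 value
    if a > 65504.0 {
        return false;
    }
    // Unbiased binary exponent of the f64 value
    let exp = ((a.to_bits() >> 52) & 0x7ff) as i32 - 1023;
    // f16 has 10 explicit mantissa bits, so the spacing between neighbouring
    // values is 2^(exp - 10). Below the smallest normal exponent (-14) the
    // spacing stays fixed at 2^-24 (subnormal range).
    let ulp_exp = exp.max(-14) - 10;
    // Dividing by a power of two is exact in f64, so any fractional part
    // means `x` falls between two representable f16 values.
    let scaled = a / 2f64.powi(ulp_exp);
    scaled.fract() == 0.0
}
```

Under this check the new test value passes while the old values `1.1` and `3.141` do not, which is consistent with why the test data had to change once output truncation was removed.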

tustvold · 2024-01-23T19:38:34Z

I'm going to label this as an API change, as whilst it technically isn't a breaking change, there is a high risk of there being subtle behaviour changes, especially around the encoding of nulls

tustvold · 2024-01-23T19:44:30Z

I will re-run the benchmarks tomorrow

Jefffrey · 2024-01-23T20:16:28Z

arrow-json/src/writer.rs

@@ -703,7 +682,7 @@ where
    format: F,

    /// Whether keys with null values should be written or skipped


Suggested change

-    /// Whether keys with null values should be written or skipped
+    /// Controls how JSON should be encoded, e.g. whether to write explicit nulls or skip them

Jefffrey · 2024-01-23T20:35:23Z

arrow-json/src/writer/encoder.rs

+    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
+        out.push(b'"');
+        // Should be infallible
+        // Note: We are making an assumption that the formatter does not produce characters that require escaping


Could you expand on this a little? I'm not sure I follow 🤔

Updated. Basically, if users could provide format specifications containing `"`, we would need to escape them when serializing to JSON

I saw some comments to this effect elsewhere. I wonder if it is possible to add a test that would fail if the invariant was broken in the future. I suspect the answer is no given it is not possible to specify format specifiers now 🤔

Yes, it isn't currently possible to hit this, I am just documenting it here for future readers who may not realise this detail
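To make the invariant concrete, here is a minimal sketch of the difference between the fast path, which copies formatted text between quotes unchanged, and a path that escapes as JSON requires. Both function names are hypothetical and neither is taken from arrow-json; the sketch only shows why a formatter that could emit `"` would break the unescaped path.

```rust
/// Fast path: valid only if `s` contains no characters that JSON requires
/// to be escaped (the assumption documented in the code comment above).
fn write_unescaped(s: &str, out: &mut Vec<u8>) {
    out.push(b'"');
    out.extend_from_slice(s.as_bytes());
    out.push(b'"');
}

/// General path: escapes quotes, backslashes and control characters.
fn write_json_string(s: &str, out: &mut Vec<u8>) {
    out.push(b'"');
    for c in s.chars() {
        match c {
            '"' => out.extend_from_slice(b"\\\""),
            '\\' => out.extend_from_slice(b"\\\\"),
            '\n' => out.extend_from_slice(b"\\n"),
            '\r' => out.extend_from_slice(b"\\r"),
            '\t' => out.extend_from_slice(b"\\t"),
            // Remaining control characters use the \u00XX form
            c if (c as u32) < 0x20 => {
                out.extend_from_slice(format!("\\u{:04x}", c as u32).as_bytes())
            }
            c => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
        }
    }
    out.push(b'"');
}
```

For output such as a formatted timestamp the two functions produce identical bytes, so the fast path is safe; they diverge only once the input contains a character like `"`, which is the case the comment warns future readers about.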

Jefffrey · 2024-01-23T20:40:12Z

arrow-json/src/writer.rs

@@ -1564,9 +1575,9 @@ mod tests {
            r#"{"a":{"list":[1,2]},"b":{"list":[1,2]}}
 {"a":{"list":[null]},"b":{"list":[null]}}
 {"a":{"list":[]},"b":{"list":[]}}
-{"a":null,"b":{"list":[3,null]}}
+{"b":{"list":[3,null]}}


This does seem like it was a bug previously, I'm just racking my brain to remember if I was aware of this before or not, if there was a reason for this 🤔

tustvold · 2024-01-24T12:51:09Z

Most recent numbers

bench_integer           time:   [6.0469 ms 6.0590 ms 6.0711 ms]
                        change: [-87.862% -87.823% -87.783%] (p = 0.00 < 0.05)
                        Performance has improved.

bench_float             time:   [6.6686 ms 6.6789 ms 6.6894 ms]
                        change: [-84.425% -84.385% -84.346%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

bench_dict_array        time:   [5.9732 ms 5.9888 ms 6.0038 ms]
                        change: [-90.356% -90.288% -90.219%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

bench_mixed             time:   [12.924 ms 12.948 ms 12.972 ms]
                        change: [-88.190% -88.149% -88.104%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

bench_string    time:   [8.0122 ms 8.0304 ms 8.0484 ms]
                        change: [-88.919% -88.868% -88.817%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

bench_struct            time:   [17.143 ms 17.171 ms 17.199 ms]
                        change: [-88.296% -88.222% -88.149%] (p = 0.00 < 0.05)
                        Performance has improved.

bench_nullable_struct   time:   [5.1811 ms 5.1919 ms 5.2030 ms]
                        change: [-91.645% -91.608% -91.574%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

bench_list              time:   [6.1378 ms 6.1479 ms 6.1583 ms]
                        change: [-89.880% -89.856% -89.831%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Benchmarking bench_nullable_list: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.9s, enable flat sampling, or reduce sample count to 50.
bench_nullable_list     time:   [1.7440 ms 1.7451 ms 1.7464 ms]
                        change: [-87.630% -87.580% -87.532%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

bench_struct_list       time:   [4.5331 ms 4.5946 ms 4.6583 ms]
                        change: [-88.224% -88.055% -87.899%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

alamb

Looks really nice to me -- thank you @tustvold

I had some comment quibbles but nothing that is required from my perspective.

Basically I would summarize this PR as "converting to JSON and then writing Values to bytes is very slow"
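The two shapes being contrasted in that summary can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: `Value` is a hypothetical stand-in for a generic JSON value tree, the column names `a` and `b` are made up, and the sketch makes no attempt to reproduce the benchmark numbers above.

```rust
use std::collections::BTreeMap;
use std::io::Write;

/// Hypothetical stand-in for a generic JSON value tree.
enum Value {
    Int(i64),
    Object(BTreeMap<String, Value>),
}

/// Serialize a value tree to bytes.
fn write_value(v: &Value, out: &mut Vec<u8>) {
    match v {
        Value::Int(i) => write!(out, "{i}").unwrap(),
        Value::Object(map) => {
            out.push(b'{');
            for (n, (k, v)) in map.iter().enumerate() {
                if n > 0 {
                    out.push(b',');
                }
                write!(out, "\"{k}\":").unwrap();
                write_value(v, out);
            }
            out.push(b'}');
        }
    }
}

/// Two-step approach: build a map and owned keys for every row, then
/// serialize that intermediate tree.
fn via_values(a: &[i64], b: &[i64]) -> Vec<u8> {
    let mut out = Vec::new();
    for (x, y) in a.iter().zip(b) {
        let mut row = BTreeMap::new();
        row.insert("a".to_string(), Value::Int(*x));
        row.insert("b".to_string(), Value::Int(*y));
        write_value(&Value::Object(row), &mut out);
        out.push(b'\n');
    }
    out
}

/// Direct approach: write each row straight into the output buffer with
/// no intermediate tree.
fn direct(a: &[i64], b: &[i64]) -> Vec<u8> {
    let mut out = Vec::new();
    for (x, y) in a.iter().zip(b) {
        write!(out, "{{\"a\":{x},\"b\":{y}}}").unwrap();
        out.push(b'\n');
    }
    out
}
```

Both functions emit the same newline-delimited JSON; the difference is only in how much per-row allocation and indirection happens on the way there.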

alamb · 2024-01-24T17:12:33Z

arrow-json/src/writer.rs

@@ -20,28 +20,6 @@
 //! This JSON writer converts Arrow [`RecordBatch`]es into arrays of
 //! JSON objects or JSON formatted byte streams.
 //!
-//! ## Writing JSON Objects


but it is deprecated as I can't think of any reasonable use-cases for this.

Looks like @houqp added it in d868cff many 🌔 's ago - perhaps he has some additional context.

I agree I can't really think of why this would be useful - it seems like it may be similar to wanting to convert RecordBatches into actual Rust structs via serde but I can't remember how far we got with that

Given I am not familiar with serde_json's raw value mechanism I suspect others may not be either

Perhaps you can add a note here about writing JSON objects using serde and leave a link for readers to follow

alamb · 2024-01-24T17:22:35Z

arrow-json/src/writer.rs

@@ -481,6 +463,7 @@ fn set_column_for_json_rows(

 /// Converts an arrow [`RecordBatch`] into a `Vec` of Serde JSON
 /// [`JsonMap`]s (objects)
+#[deprecated(note = "Use Writer")]


I can't figure out if the deprecation is needed for the new json writer, or did you just include it in the same PR for convenience?

I lumped the deprecation into this PR because moving the writer over so that it mostly no longer uses this functionality means a reduction in our test coverage of it

alamb · 2024-01-24T17:43:04Z

arrow-json/src/writer/encoder.rs

+float_encode!(f32, f64);
+
+impl PrimitiveEncode for f16 {
+    type Buffer = <f64 as PrimitiveEncode>::Buffer;


Why can't we just use the PrimitiveEncode directly for f16? I doubt the performance of f16 encoding is particularly critical but I am curious

Because the formulation of PrimitiveEncoder expects fixed-size buffers... Having peeked at f16's Display impl, it converts to f32 in order to print and to parse, so I will update this to do likewise
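The "fixed-size buffer" constraint is easier to see with a concrete implementation. The trait below reuses the method signatures quoted in this review (`init_buffer` and `encode`), but the `i32` implementation is an assumption written for illustration and is not how arrow-json formats integers.

```rust
/// Trait shape as quoted in the review: each primitive type declares a
/// scratch buffer type that is large enough for any value of that type.
trait PrimitiveEncode: Copy {
    type Buffer;

    fn init_buffer() -> Self::Buffer;

    /// Encode the value as bytes, returning a slice of `buf`.
    fn encode(self, buf: &mut Self::Buffer) -> &[u8];
}

// Illustrative implementation (an assumption, not the library's code).
impl PrimitiveEncode for i32 {
    // "-2147483648" is the longest possible output: 11 bytes
    type Buffer = [u8; 11];

    fn init_buffer() -> Self::Buffer {
        [0; 11]
    }

    fn encode(self, buf: &mut Self::Buffer) -> &[u8] {
        let mut pos = buf.len();
        let negative = self < 0;
        // unsigned_abs avoids overflow for i32::MIN
        let mut n = self.unsigned_abs();
        // Fill digits from the end of the buffer towards the front
        loop {
            pos -= 1;
            buf[pos] = b'0' + (n % 10) as u8;
            n /= 10;
            if n == 0 {
                break;
            }
        }
        if negative {
            pos -= 1;
            buf[pos] = b'-';
        }
        &buf[pos..]
    }
}
```

Because the buffer size is part of the type, the caller can allocate it once on the stack and reuse it for every value in a column. A type such as f16 then needs either its own bound on output length or, as discussed above, a conversion to a wider float type whose buffer is already known.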

alamb · 2024-01-24T17:45:01Z

arrow-json/src/writer/encoder.rs

+    // Workaround https://github.com/rust-lang/rust/issues/61415
+    fn init_buffer() -> Self::Buffer;
+
+    fn encode(self, buf: &mut Self::Buffer) -> &[u8];


I think it would help to document what encode does here

Suggested change

-    fn encode(self, buf: &mut Self::Buffer) -> &[u8];
+    /// Encode the primitive value as bytes, returning a reference to that slice.
+    /// `buf` is temporary space that may be used
+    fn encode(self, buf: &mut Self::Buffer) -> &[u8];

alamb · 2024-01-24T17:47:33Z

arrow-json/src/writer/encoder.rs

+    options: &EncoderOptions,
+) -> Result<Box<dyn Encoder + 'a>, ArrowError> {
+    let (encoder, nulls) = make_encoder_impl(array, options)?;
+    assert!(nulls.is_none(), "root cannot be nullable");


I don't understand this -- isn't it possible to try to encode a BooleanArray as the root with null values?

The root is called with a StructArray derived from a RecordBatch, and therefore cannot be nullable

alamb · 2024-01-24T17:49:36Z

arrow-json/src/writer/encoder.rs

+    pub explicit_nulls: bool,
+}
+
+pub trait Encoder {


Could you please document the expectations on nullability here? Specifically, it seems like this code assumes that this is invoked with idx for non-null entries, which was not clear to me on my first read of this code

alamb · 2024-01-24T17:50:12Z

arrow-json/src/writer/encoder.rs

+
+impl Encoder for BooleanEncoder {
+    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
+        match self.0.value(idx) {


I was pretty confused at first trying to figure out why this doesn't check for null, but then I saw the null check is handled in the outer loop
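The division of responsibility described here, where the per-type encoder assumes a non-null index and the caller owns the null check, can be sketched as below. The `Encoder` signature matches the one quoted in the diff; `BooleanEncoder` wrapping a `Vec<bool>`, the `validity` slice, and `encode_list` are simplifications assumed for the example rather than the real Arrow types.

```rust
trait Encoder {
    /// Encode the non-null value at index `idx` to `out`.
    /// Callers must not invoke this for null slots.
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
}

/// Simplified stand-in for an encoder over a boolean column.
struct BooleanEncoder(Vec<bool>);

impl Encoder for BooleanEncoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        // No null check here: the value slot of a null entry holds an
        // arbitrary placeholder, which is why the caller must filter first.
        match self.0[idx] {
            true => out.extend_from_slice(b"true"),
            false => out.extend_from_slice(b"false"),
        }
    }
}

/// Outer loop: consults the validity mask and only delegates to the
/// encoder for valid entries. `validity[i] == false` marks a null.
fn encode_list(encoder: &mut dyn Encoder, validity: &[bool], out: &mut Vec<u8>) {
    out.push(b'[');
    for (idx, is_valid) in validity.iter().enumerate() {
        if idx > 0 {
            out.push(b',');
        }
        if *is_valid {
            encoder.encode(idx, out);
        } else {
            out.extend_from_slice(b"null");
        }
    }
    out.push(b']');
}
```

Keeping the null check in one place means each encoder stays a tight value-only routine, at the cost of the implicit contract that prompted the request for documentation.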

alamb · 2024-01-24T17:53:01Z

arrow-json/src/writer/encoder.rs

+    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
+        out.push(b'"');
+        // Should be infallible
+        // Note: We are making an assumption that the formatter does not produce characters that require escaping


I saw some comments to this effect elsewhere. I wonder if it is possible to add a test that would fail if the invariant was broken in the future. I suspect the answer is no given it is not possible to specify format specifiers now 🤔

alamb · 2024-01-24T22:20:00Z

arrow-json/src/writer/encoder.rs

 pub trait Encoder {
+    /// Encode the non-null value at index `idx` to `out`
+    ///
+    /// The behaviour is unspecified if `idx` corresponds to a null index


github-actions bot added the arrow Changes to the arrow crate label Jan 20, 2024

tustvold mentioned this pull request Jan 20, 2024

arrow_json: support decimal 128 and 256 types in json writer #5197

Closed

tustvold commented Jan 20, 2024

tustvold force-pushed the raw-json-writer branch from 8898d2e to 552828e Compare January 23, 2024 19:27

tustvold commented Jan 23, 2024

tustvold force-pushed the raw-json-writer branch 2 times, most recently from 259163b to 49a0357 Compare January 23, 2024 19:36

Raw JSON writer (apache#5314)

e8f914c

tustvold force-pushed the raw-json-writer branch from 49a0357 to e8f914c Compare January 23, 2024 19:37

tustvold added the api-change Changes to the arrow API label Jan 23, 2024

tustvold marked this pull request as ready for review January 23, 2024 19:44

alamb mentioned this pull request Jan 23, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 22, 2024 apache/datafusion#8933

Closed

9 tasks

Jefffrey reviewed Jan 23, 2024

Fix bench name

9a8e46c

Review feedback

afaf192

alamb approved these changes Jan 24, 2024

Review feedback

008b086

tustvold merged commit 5146419 into apache:master Jan 24, 2024
22 checks passed

alamb reviewed Jan 24, 2024

alamb mentioned this pull request Feb 4, 2024

Add example of converting RecordBatches to JSON objects #5364

Merged

This was referenced Mar 14, 2024

Deprecate array_to_json_array #5515

Merged

Update Arrow/Parquet to 51.0.0, tonic to 0.11 apache/datafusion#9613

Merged

This was referenced Apr 15, 2024

feat: JSON encoding of FixedSizeList #5646

Merged

Remove deprecated JSON writer #5651

Merged

kylebarron mentioned this pull request Jun 1, 2024

Add stac-arrow stac-utils/stac-rs#256

Closed

8 tasks

samuelcolvin mentioned this pull request Oct 2, 2024

fix arrow-json encoding with dictionary including nulls #6503

Merged

This was referenced Oct 7, 2024

Any plan to support JSON or JSONB? apache/datafusion#7845

Open

JSON parser fails on maps with bool keys #6525

Open
