Add `ParquetMetadataWriter` allow ad-hoc encoding of `ParquetMetadata` #6000

adriangb · 2024-07-03T18:23:46Z

A step towards #5988, #6002

alamb

Thanks @adriangb -- this PR looks good to me and I think we could proceed with this design.

I did file #6002 to track a potentially more flexible API that I think is worth considering. However, adding this API to mirror decode_metadata I think would also be fine (and we could make a more complex API later)

alamb · 2024-07-04T10:00:04Z

parquet/src/file/footer.rs

+
+        let encoded = encode_metadata(&metadata).unwrap();
+        let decoded = decode_metadata(&encoded).unwrap();
+        assert_eq!(


Can you simply just assert that encoded == decoded?

alamb · 2024-07-04T10:01:11Z

parquet/src/file/footer.rs

+        {
+            assert_eq!(a, b);
+        }
+        // TODO: add encoding and decoding of column and offset indexes (aka page indexes)


I agree that encoding/decoding of these structures doesn't have to be present in the initial PR, however given they are stored out of line / slightly differently than the other structures I think it would be good to ensure we could encode them using this same API

alamb · 2024-07-04T10:03:55Z

parquet/src/file/footer.rs

+/// specified by the [Parquet Spec].
+///
+/// [Parquet Spec]: https://github.com/apache/parquet-format#metadata
+pub fn encode_metadata(metadata: &ParquetMetaData) -> Result<Vec<u8>> {


Is it possible to switch the existing writers to use this API as well? Not only would that avoid code duplication, it would ensure the API is general enough

For example, I wonder if it would make sense for this function signature to be more like

/// write the metadata to the target `std::io::Write`, returning the number of bytes written
pub fn encode_metadata<W: Write>(metadata: &ParquetMetaData) -> Result<usize> {
  ...
}

That would allow writing into a Vec but also allow writing into various other targets and perhaps avoid buffering
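
A minimal sketch of the writer-based shape suggested above, not the PR's actual code. `FakeMetadata`, `CountingWriter`, the explicit `sink` parameter, and the length-suffix framing are all assumptions made to keep the example self-contained; a real implementation would serialize `ParquetMetaData` through thrift.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for `ParquetMetaData`, used only in this sketch.
struct FakeMetadata {
    payload: Vec<u8>,
}

/// Wraps any writer and counts the bytes that pass through it.
struct CountingWriter<W: Write> {
    inner: W,
    written: usize,
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        self.written += n;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Writes the fake payload followed by its length as 4 little-endian bytes
/// (an illustrative framing) and returns the total number of bytes written.
fn encode_metadata<W: Write>(metadata: &FakeMetadata, sink: W) -> io::Result<usize> {
    let mut counter = CountingWriter { inner: sink, written: 0 };
    counter.write_all(&metadata.payload)?;
    counter.write_all(&(metadata.payload.len() as u32).to_le_bytes())?;
    counter.flush()?;
    Ok(counter.written)
}
```

Passing `&mut Vec<u8>` as the sink recovers the buffered behaviour, while a file or socket would receive the bytes directly without an intermediate buffer.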

adriangb · 2024-07-04T11:23:20Z

@alamb I pushed a fluentish API version of this.

I got bogged down implementing the page index writing because there doesn't seem to be a clean path to go from a ParquetMetadata's PageLocation and Index to the thrift OffsetIndex and ColumnIndex. I think the thing is that the current writers never materialize a ParquetMetadata and thus forcing them to do so might introduce unnecessary overhead. Maybe the path to go from a ParquetMetadata to bytes shouldn't be merged with writers? But also maybe I just couldn't come up with a good implementation and with more trial or with your help we can get there.

I do think the readers could be merged.

For this encoder to make sense I think it should have an option to handle page indexes and have it enabled and working by default (like the writers do).

adriangb · 2024-07-06T22:43:55Z

One thing I can do to avoid blocking on my lack of knowledge of encoding the page index stuff is to design the API first and implement it later. E.g. we can add .with_page_index(bool) and error if you set it to true or don't set it at all so that you're forced to acknowledge that the future default will be true.
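
The "design the API first" idea could be sketched roughly as follows. `EncoderOptions`, `validate`, and the error strings are hypothetical names chosen for illustration; only `with_page_index(bool)` comes from the comment above.

```rust
/// Hypothetical options struct; not part of the parquet crate.
#[derive(Default)]
struct EncoderOptions {
    /// `None` means the caller never made a choice.
    page_index: Option<bool>,
}

impl EncoderOptions {
    fn with_page_index(mut self, enabled: bool) -> Self {
        self.page_index = Some(enabled);
        self
    }

    /// Checks the options before any bytes are produced.
    fn validate(&self) -> Result<(), String> {
        match self.page_index {
            // Explicit opt-out is the only supported mode for now.
            Some(false) => Ok(()),
            // Reserved for later: page index encoding is not implemented.
            Some(true) => Err("page index encoding is not implemented yet".to_string()),
            // Forces callers to acknowledge that the default will become `true`.
            None => Err("set with_page_index(false) explicitly".to_string()),
        }
    }
}
```

With this shape, code that compiles today keeps its meaning once page index support lands, because no caller relies on an implicit default.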

alamb · 2024-07-08T10:04:35Z

Thanks @adriangb -- I will try and review this PR today

alamb · 2024-07-11T00:09:11Z

Working through the list of PRs in arrow-rs is on my list of things to do tomorrow

alamb

Thanks @adriangb -- this is looking like a good start

I think we should try and structure the code so the existing writer uses this new MetadataEncoder which would keep metadata writing consistent as well as enable usecases like encoding bloom filters, etc.

Let me know what you think.

cc @sunchao @tustvold @Jefffrey @liukun4515 @nevi-me for any thoughts you might have on this API / approach

alamb · 2024-07-11T10:03:19Z

parquet/src/file/metadata/mod.rs

@@ -86,7 +86,7 @@ pub type ParquetOffsetIndex = Vec<Vec<Vec<PageLocation>>>;
 ///
 /// [`parquet.thrift`]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
 /// [`parse_metadata`]: crate::file::footer::parse_metadata
-#[derive(Debug, Clone)]
+#[derive(Debug, Clone, PartialEq)]


alamb · 2024-07-11T10:16:40Z

parquet/src/file/footer.rs

+        let column_orders = encode_column_orders(metadata.file_metadata().column_orders());
+        let schema = types::to_thrift(&metadata.file_metadata().schema().clone())?;
+
+        let t_file_metadata = TFileMetaData {


I noticed that this is not quite the same code as used in the actual writer (specifically, the column order is handled differently), so I worry it would be inconsistent or drift over time from the actual writer

arrow-rs/parquet/src/file/writer.rs

Lines 352 to 375 in 22e0b44

// We only include ColumnOrder for leaf nodes.
// Currently only supported ColumnOrder is TypeDefinedOrder so we set this
// for all leaf nodes.
// Even if the column has an undefined sort order, such as INTERVAL, this
// is still technically the defined TYPEORDER so it should still be set.
let column_orders = (0..self.schema_descr().num_columns())
    .map(|_| parquet::ColumnOrder::TYPEORDER(parquet::TypeDefinedOrder {}))
    .collect();
// This field is optional, perhaps in cases where no min/max fields are set
// in any Statistics or ColumnIndex object in the whole file.
// But for simplicity we always set this field.
let column_orders = Some(column_orders);
let file_metadata = parquet::FileMetaData {
    num_rows,
    row_groups,
    key_value_metadata,
    version: self.props.writer_version().as_num(),
    schema: types::to_thrift(self.schema.as_ref())?,
    created_by: Some(self.props.created_by().to_owned()),
    column_orders,
    encryption_algorithm: None,
    footer_signing_key_metadata: None,
};

Thus what I suggest we do here is change writer.rs to use the ParquetMetadataEncoder and refactor the code from there into this function. That would be a bit more involved but I think would set us up nicely so that metadata encoding remains consistent.
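
One way to keep the two code paths from drifting is a single shared helper that both call. The sketch below mirrors the writer's "one `TYPEORDER` per leaf column, always `Some`" rule; the `ColumnOrder` enum and `default_column_orders` are hypothetical stand-ins, not the thrift-generated types.

```rust
/// Hypothetical stand-in for the thrift `ColumnOrder` union.
#[derive(Debug, Clone, PartialEq)]
enum ColumnOrder {
    TypeDefinedOrder,
}

/// Returns one `TypeDefinedOrder` entry per leaf column, always wrapped in
/// `Some`, following the "for simplicity we always set this field" comment.
fn default_column_orders(num_leaf_columns: usize) -> Option<Vec<ColumnOrder>> {
    Some(vec![ColumnOrder::TypeDefinedOrder; num_leaf_columns])
}
```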

adriangb · 2024-07-11T15:41:55Z

> I think we should try and structure the code so the existing writer uses this new MetadataEncoder which would keep metadata writing consistent as well as enable usecases like encoding bloom filters, etc.

I completely agree. That's just a much bigger chunk to bite off, I can give it a shot but I may need support to get there.

adriangb · 2024-07-12T04:07:59Z

I've made some progress. I made a (very rough) metadata writer that is used internally by SerializedFileWriter and can encode from a ParquetMetadata. My plan of attack from here:

- Implement reading of metadata without needing to have the entire file available. There's already MetadataLoader as pointed out in API for encoding/decoding ParquetMetadata with more control #6002 (comment) but it wants to read metadata from an entire file and I think needs to be refactored to be able to load metadata when that's all you have.
- Get feedback here on the APIs (they really aren't pretty).
- Add roundtrip tests.

alamb

I think this is looking quite nice @adriangb and I think we should try and proceed with this approach.

I think it would be easier to make progress if we can work on the approach incrementally as multiple smaller PRs rather than one large one (it will be easier for me to give you timely feedback)

Also, it is probably good to know of #5486 from @etseidl which could conflict as we change the metadata.

Also #5933 from @progval

Given we are now being careful about breaking changes (see https://github.com/apache/arrow-rs/blob/master/CONTRIBUTING.md#breaking-changes) I am worried that these PRs will interact / cause conflicts with each other

What do you think of this idea: #6050 ?

alamb · 2024-07-13T11:03:35Z

parquet/src/file/writer.rs

+            Some(self.props.created_by().to_string()),
+            self.props.writer_version().as_num(),
+        );
+        encoder.finish()


alamb · 2024-07-13T11:04:37Z

parquet/src/file/writer.rs


        let mut row_groups = self
            .row_groups
-            .as_slice()
            .iter()
            .map(|v| v.to_thrift())
            .collect::<Vec<_>>();

        self.write_bloom_filters(&mut row_groups)?;


FWIW #5933 also contains changes for bloom filter writing

alamb · 2024-07-13T11:23:25Z

parquet/src/file/writer.rs

@@ -791,23 +710,274 @@ impl<'a, W: Write + Send> PageWriter for SerializedPageWriter<'a, W> {
    }
 }

+struct ThriftMetadataWriter<'a, W: Write> {


I always get confused when reading the parquet code about which structures are the generated Thrift ones and which are the structures in https://docs.rs/parquet/latest/parquet/file/metadata/index.html

I like how you have split out writing of the thrift structures here from the writing of the parquet::file structures

alamb · 2024-07-13T11:26:45Z

parquet/src/file/writer.rs

+        Ok(())
+    }
+
+    fn convert_column_indexes(&self) -> Vec<Vec<Option<ColumnIndex>>> {


I was looking around for another copy of this code and I now see that this is the first time we are going from Index --> ColumnIndex

Makes sense to me. I think this type of structure could really help clean up some of the tests too (but I am getting ahead of myself)

alamb · 2024-07-13T11:30:56Z

parquet/src/file/writer.rs

-
-        let file_metadata = parquet::FileMetaData {
-            num_rows,
+        let encoder = ThriftMetadataWriter::new(


This might read nicer like this:

let encoder = ThriftMetadataWriter::new()
    .with_schema(&self.schema)
    .with_descr(&self.descr)
    .with_row_groups(row_groups)
    ...;
// encode the data to buf
encoder.encode(&mut buf)

Though I realize many of these fields are required

Maybe something like

let encoder = ThriftMetadataWriter::new(
    &self.schema,
    &self.descr,
    ...
)
.with_column_indexes(&self.column_indexes)
.with_offset_indexes(&self.offset_indexes);
encoder.encode(&mut buf)
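
A self-contained sketch of that second shape, with required inputs in `new` and optional page indexes supplied through `with_*` methods. `MetadataWriterSketch`, its simplified field types, and the byte layout are assumptions for illustration only; they are not the `ThriftMetadataWriter` from this PR.

```rust
/// Hypothetical writer with placeholder field types.
struct MetadataWriterSketch<'a> {
    schema: &'a str,
    row_group_count: u8,
    column_indexes: Option<&'a [u8]>,
    offset_indexes: Option<&'a [u8]>,
}

impl<'a> MetadataWriterSketch<'a> {
    /// Required inputs go in the constructor so they cannot be forgotten.
    fn new(schema: &'a str, row_group_count: u8) -> Self {
        Self {
            schema,
            row_group_count,
            column_indexes: None,
            offset_indexes: None,
        }
    }

    /// Optional inputs are attached with chained `with_*` calls.
    fn with_column_indexes(mut self, bytes: &'a [u8]) -> Self {
        self.column_indexes = Some(bytes);
        self
    }

    fn with_offset_indexes(mut self, bytes: &'a [u8]) -> Self {
        self.offset_indexes = Some(bytes);
        self
    }

    /// Appends everything to `buf` and returns the number of bytes added.
    fn encode(&self, buf: &mut Vec<u8>) -> usize {
        let start = buf.len();
        buf.extend_from_slice(self.schema.as_bytes());
        buf.push(self.row_group_count);
        if let Some(ci) = self.column_indexes {
            buf.extend_from_slice(ci);
        }
        if let Some(oi) = self.offset_indexes {
            buf.extend_from_slice(oi);
        }
        buf.len() - start
    }
}
```

Putting the required fields in `new` moves "forgot the schema" from a runtime error to a compile error, which a fully fluent `new().with_schema(..)` chain cannot do.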

etseidl · 2024-07-14T00:14:54Z

parquet/src/file/writer.rs

+        if let Some(row_group_offset_indexes) = self.metadata.offset_index() {
+            (0..self.metadata.row_groups().len())
+                .map(|rg_idx| {
+                    let column_indexes = &row_group_offset_indexes[rg_idx];


Minor nit: could this be named offset_indexes?

etseidl · 2024-07-15T17:31:25Z

parquet/src/file/page_index/index.rs

+        let null_counts = self
+            .indexes
+            .iter()
+            .map(|x| x.null_count())
+            .collect::<Option<Vec<_>>>()
+            .unwrap_or_else(|| vec![0; self.indexes.len()]);


While merging with #5486, I noticed this. IIUC, if on read the optional thrift ColumnIndex::null_counts is not present, then the PageIndex::null_count will be None. When converting back to a thrift ColumnIndex, it appears that this will convert the missing null_counts into a vector of num_pages zeros. I don't know if this is the correct behavior, mostly because the spec is (AFAICT) silent on the interpretation of a non-present null_counts. Is it absent as an optimization when there are no nulls, or absent due to a lack of information (say a V1 encoder doesn't keep null counts since the V1 page header doesn't require them)? Due to that ambiguity I think null_counts here should be None if any of the PageIndex::null_count fields is None. Perhaps stop after the collect() and pass null_counts directly below.
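
The "stop after the `collect()`" suggestion relies on the standard behaviour of collecting an iterator of `Option<T>` into `Option<Vec<T>>`: the result is `None` as soon as any element is `None`. A minimal sketch, with `PageIndexSketch` as a hypothetical stand-in for the real `PageIndex` type:

```rust
/// Hypothetical per-page entry; only the null count matters here.
struct PageIndexSketch {
    null_count: Option<i64>,
}

/// Returns `Some(counts)` only when every page reports a null count;
/// a single missing value makes the whole result `None` instead of
/// silently substituting zeros.
fn null_counts(pages: &[PageIndexSketch]) -> Option<Vec<i64>> {
    pages.iter().map(|p| p.null_count).collect()
}
```

This keeps "unknown" distinct from "zero nulls" when the metadata is written back out.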

alamb · 2024-07-16T15:46:54Z

Update here is I plan to make a 53 dev branch today so we can start getting this code merged and iterate on the API

alamb · 2024-07-16T22:57:31Z

Hi @adriangb -- I changed this PR to point at the 53.0.0-dev branch. I plan to give it a careful review tomorrow and then I am thinking we can merge it and iterate over the course of a few PRs

Again, I am really sorry for the delay in reviewing. I think this is a really important feature but I have been overwhelmed with reviews for the last week or two

adriangb · 2024-07-17T02:25:02Z

Thank you @alamb! No need to apologize; you have such a diverse and impactful contribution to open source, your time management is really quite inspiring. If anything I need to apologize for lagging on applying feedback. I will go over this PR and incorporate feedback (hopefully before your review tomorrow).

alamb

Here is how I suggest we proceed with this PR:

- Let's create an example with the usecase described in API for encoding/decoding ParquetMetadata with more control #6002 (comment) (I will try to do this later today). I think this will motivate what the API looks like.
- In parallel we could pull out some of the simple usability changes (like adding PartialEq and pub use thrift stuff) into their own PR so we can merge that.

alamb · 2024-07-17T21:53:39Z

I started on a basic example here: #6081 -- tomorrow I'll try and find time to try and rebase it on this PR and see if I can do what is needed

Prep for apache#6000

etseidl · 2024-07-24T16:37:57Z

I'm not sure why the test is failing (it was before, I don't think it's from a merge). Need to investigate.

I think you'll need to merge 53.0.0-dev again to pick up the latest changes to the offset index, and then reformat (some new names are longer and changed how the linter wants lines wrapped).

adriangb · 2024-07-24T19:35:21Z

I've updated the branch and cleaned up, test is still failing. It seems the reading part is trying to access byte 0 of the file, which doesn't make sense and makes me think there's a bug somewhere (could be in the test since there's a lot of shim in there): https://github.com/apache/arrow-rs/actions/runs/10082832978/job/27878006690?pr=6000#step:6:761

etseidl · 2024-07-24T22:35:57Z

parquet/src/file/writer.rs

+
+        let data = buf.into_inner().freeze();
+
+        let decoded_metadata = load_metadata_from_bytes(metadata.file_size, data).await;


Suggested change

-        let decoded_metadata = load_metadata_from_bytes(metadata.file_size, data).await;
+        let decoded_metadata = load_metadata_from_bytes(data.len(), data).await;

This will load the page indexes, but then the assert below fails because the offset_index_offset and column_index_offset fields of the column chunk are different. Might have to write an equals that accounts for that.

yep thank you.

I could write a custom eq but... that is going to be a pain.

I also need to think about the implications of these things not matching up. If I understand correctly those are the offsets to the page index data from the start of the file (or from where?) and because we loaded the metadata from only a portion of the file they got re-calculated differently?

> I could write a custom eq but... that is going to be a pain.

True. Perhaps just compare parts of the metadata rather than the whole thing. Let me see if I can whip something up quick and submit a PR to your branch.

> I also need to think about the implications of these things not matching up. If I understand correctly those are the offsets to the page index data from the start of the file (or from where?) and because we loaded the metadata from only a portion of the file they got re-calculated differently?

Well, you've written the page indexes to a new file (well, buffer), so those offsets point to their location in the new file as opposed to the old. Looking at the test output, the column index is at offset 0 with length 23, and the offset index is at 23 with length 10. Assuming there's no 'PAR1' header on the new file, those offsets seem correct.
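
The offset arithmetic described above can be sketched as follows. `write_sections` is a hypothetical helper, and the all-zero byte vectors merely stand in for an encoded column index (23 bytes) and offset index (10 bytes); the point is only that offsets are measured from the start of whatever buffer the sections are written into.

```rust
/// Appends each section to `buf` and records `(offset, length)` for it,
/// with offsets measured from the start of `buf`.
fn write_sections(buf: &mut Vec<u8>, sections: &[&[u8]]) -> Vec<(usize, usize)> {
    sections
        .iter()
        .map(|section| {
            let offset = buf.len();
            buf.extend_from_slice(section);
            (offset, section.len())
        })
        .collect()
}
```

Writing the same two sections into an empty buffer gives offsets 0 and 23, while a buffer that already starts with a 4-byte `PAR1` header would shift them to 4 and 27, which is why offsets recorded for the original file cannot be compared directly with those of the re-encoded buffer.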

alamb · 2024-07-26T10:15:57Z

I merged the 53 dev branch ~~and that seems to have closed this PR~~ -- any chance you can retarget main?

Update: I restored the branch

alamb · 2024-07-26T13:02:19Z

I wrote up some thoughts that were floating in my head in #6129

I am hoping to spend some more time today looking at this PR in detail

Thank you again for your patience

alamb · 2024-07-29T20:22:07Z

I know I keep leading you along with comments on this PR. I really believe this is an important PR / API to work on, but at the moment I don't have enough bandwidth to help drive this forward / break it down into smaller pieces / make sure they all fit together. Maybe @etseidl can figure out how to make it happen.

In any event I hope to have more time to devote here by the end of August

Co-authored-by: Ed Seidl <[email protected]>

etseidl · 2024-07-31T16:31:03Z

@adriangb I submitted a PR to fix the failing parquet and clippy tests.

The "check compilation" test failure is beyond me. The "--no-default-features" flag causes some dependencies to not be loaded. Getting that one fixed requires some cargo mojo that I lack :(

Add test for metadata equivalence

etseidl · 2024-08-01T00:55:04Z

@adriangb just wanted to warn you not to try merging with master just yet. A PR that merged to 53.0.0-dev after 53.0.0-dev was merged to master (but before 53.0.0-dev was merged to this branch) is present here. If you merge with master now that PR will be lost. Hopefully this will be corrected soon 😅.

Ok, master was fixed, but then 53.0.0-dev was closed again, which closed this PR. I've submitted a new PR to your branch that merges with master and fixes the conflicts. If you approve, then this PR can be reopened with master as its base branch. Sorry for all the confusion!

adriangb · 2024-08-05T15:54:40Z

@etseidl looks like this was closed because the target branch was deleted. My first thought is that I should retarget this at master/main and rebase it on that but I guess that's what you're warning me not to do above? Not sure what the next steps are here then.

etseidl · 2024-08-05T16:20:49Z

> My first thought is that I should retarget this at master/main and rebase it on that but I guess that's what you're warning me not to do above? Not sure what the next steps are here then.

If you retarget master, then there will be merge conflicts around the new histogram stuff I added. Those are pretty easy to clear (you can look at what I did in adriangb#3). Then I think this will be ready for a final review! Thanks!

* Preallocate for `FixedSizeList` in `concat` (#5862) * Add specific fixed size list concat test * Add fixed size list concat benchmark * Improve `FixedSizeList` concat performance for large list * `cargo fmt` * Increase size of `FixedSizeList` benchmark data * Get capacity recursively for `FixedSizeList` * Reuse `Capacities::List` to avoid breaking change * Use correct default capacities * Avoid a `Box::new()` when not needed * format --------- Co-authored-by: Will Jones <[email protected]> * Add eq benchmark for StringArray/StringViewArray (#5924) * add neq/eq benchmark for String/ViewArray * move bench to comparsion kernel * clean unnecessary dep * make clippy happy * Add the ability for Maps to cast to another case where the field names are different (#5703) * Add the ability for Maps to cast to another case where the field names are different. Arrow Maps have field names for the elements of the fields, the field names are allowed to be any value and do not affect the type of the data. This allows a Map where the field names are key_value, key, value to be mapped to a entries, keys, values. This can be helpful in merging record batches that may have come from different sources. This also makes maps behave similar to lists which also have a field to distinguish their elements. * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * Feedback from code review - simplify map casting logic to reuse the entries - Added unit tests for negative cases - Use MapBuilder to make the intended type clearer. * fix formatting * Lint and format * correctly set the null fields --------- Co-authored-by: Andrew Lamb <[email protected]> * fix(ipc): set correct row count when reading struct arrays with zero fields (#5918) * Update zstd-sys requirement from >=2.0.0, <2.0.10 to >=2.0.0, <2.0.12 (#5913) Updates the requirements on [zstd-sys](https://github.com/gyscos/zstd-rs) to permit the latest version. 
- [Release notes](https://github.com/gyscos/zstd-rs/releases) - [Commits](https://github.com/gyscos/zstd-rs/compare/zstd-sys-2.0.7...zstd-sys-2.0.11) --- updated-dependencies: - dependency-name: zstd-sys dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Add `MultipartUpload` blanket implementation for `Box<W>` (#5919) * add impl for box * update * another update * small fix * Fix typo in benchmarks (#5935) * row format benches for bool & nullable int (#5943) * Implement arrow-row encoding/decoding for view types (#5922) * implement arrow-row encoding/decoding for view types * add doc comments, better error msg, more test coverage * ensure no performance regression * update perf * fix bug * make fmt happy * Update arrow-array/src/array/byte_view_array.rs Co-authored-by: Raphael Taylor-Davies <[email protected]> * update * update comments * move cmp around * move things around and remove inline hint * Update arrow-array/src/array/byte_view_array.rs Co-authored-by: Andrew Lamb <[email protected]> * Update arrow-ord/src/cmp.rs Co-authored-by: Andrew Lamb <[email protected]> * return error instead of panic * remove unnecessary func --------- Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Raphael Taylor-Davies <[email protected]> * Better document support for nested comparison (#5942) * Update quick-xml requirement from 0.32.0 to 0.33.0 in /object_store (#5946) Updates the requirements on [quick-xml](https://github.com/tafia/quick-xml) to permit the latest version. - [Release notes](https://github.com/tafia/quick-xml/releases) - [Changelog](https://github.com/tafia/quick-xml/blob/master/Changelog.md) - [Commits](https://github.com/tafia/quick-xml/compare/v0.32.0...v0.33.0) --- updated-dependencies: - dependency-name: quick-xml dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Implement like/ilike etc for StringViewArray (#5931) * like for string view array * fix bug * update doc * update tests * test: Add unit test for extending slice of list array (#5948) * test: Add unit test for extending slice of list array * For review * Update quick-xml requirement from 0.33.0 to 0.34.0 in /object_store (#5954) Updates the requirements on [quick-xml](https://github.com/tafia/quick-xml) to permit the latest version. - [Release notes](https://github.com/tafia/quick-xml/releases) - [Changelog](https://github.com/tafia/quick-xml/blob/master/Changelog.md) - [Commits](https://github.com/tafia/quick-xml/compare/v0.33.0...v0.34.0) --- updated-dependencies: - dependency-name: quick-xml dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Minor: fixup contribution guide (#5952) * chore(5797): change default data_page_row_limit to 20k (#5957) * Improve error message for unsupported nested comparison (#5961) * Improve error message for unsupported nested comparison * Update arrow-ord/src/cmp.rs Co-authored-by: Jay Zhan <[email protected]> --------- Co-authored-by: Jay Zhan <[email protected]> * feat: add max_bytes and min_bytes on PageIndex (#5950) * Faster primitive arrays encoding into row format (#5858) * skip iterator removed from primitive encoding * special cases for not-null primitives encoding * faster iterators for nullable columns * Document process for PRs with breaking changes (#5953) * Document process for PRs with breaking changes * ticket reference * Update CONTRIBUTING.md Co-authored-by: Xuanwo <[email protected]> --------- Co-authored-by: Xuanwo <[email protected]> * `like` benchmark for StringView (#5936) * Expose `IntervalMonthDayNano` and `IntervalDayTime` and update docs (#5928) * 
Expose IntervalMonthDayNano and IntervalDayMonth and update docs * fix doc test * implement sort for view types (#5963) * Fix FFI array offset handling (#5964) * Add benchmark for reading binary/binary view from parquet (#5968) * implement sort for view types * add bench for binary/binary view * Add view buffer for parquet reader (#5970) * implement sort for view types * add bench for binary/binary view * add view buffer, prepare for byte_view_array reader * make clippy happy * reuse make_view_unchecked * Update parquet/src/arrow/buffer/view_buffer.rs Co-authored-by: Andrew Lamb <[email protected]> * update * rename and inline --------- Co-authored-by: Andrew Lamb <[email protected]> * Handle flight dictionary ID assignment automatically (#5971) * failing test * Handle dict ID assignment during flight encoding/decoding * remove println * One more println * Make auto-assign optional * Update docs * Remove breaking change * Update arrow-ipc/src/writer.rs Co-authored-by: Andrew Lamb <[email protected]> * Remove breaking change to DictionaryTracker ctor --------- Co-authored-by: Andrew Lamb <[email protected]> * Make ObjectStoreScheme public (#5912) * Make ObjectStoreScheme public * Fix clippy, add docs and examples --------- Co-authored-by: Andrew Lamb <[email protected]> * Add operation in ArrowNativeTypeOp::neg_check error message (#5944) (#5980) * feat: support reading OPTIONAL column in parquet_derive (#5717) * support def_level=1 but non-null column in reader * update comment, adapt ut to the uuid change --------- Co-authored-by: Ye Yuan <[email protected]> * Update quick-xml requirement from 0.34.0 to 0.35.0 in /object_store (#5983) Updates the requirements on [quick-xml](https://github.com/tafia/quick-xml) to permit the latest version. 
- [Release notes](https://github.com/tafia/quick-xml/releases) - [Changelog](https://github.com/tafia/quick-xml/blob/master/Changelog.md) - [Commits](https://github.com/tafia/quick-xml/compare/v0.34.0...v0.35.0) --- updated-dependencies: - dependency-name: quick-xml dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Reduce repo size by removing accumulative commits in CI job (#5982) * Use force_orphan in the CI job Use force_orphan in the CI job * Update .github/workflows/docs.yml --------- Co-authored-by: Andrew Lamb <[email protected]> * Minor: fix clippy complaint in parquet_derive (#5984) * Add user defined metadata (#5915) * Add metadata attribute * Add user-defined metadata for AWS/GCP/ABS `with_attributes` * Reads and writes both implemented * Add tests for GetClient * Fix an indentation * Placate clippy * Use `strip_prefix` and mutable attributes * Use static Cow for attribute metadata * Add error for value decode failure * Remove unnecessary into * Provide Arrow Schema Hint to Parquet Reader - Alternative 2 (#5939) * Adds option for providing a schema to the Arrow Parquet Reader. * Adds more complete tests. Adds a more detailed error message for incompatible columns. Adds nested fields to test_with_schema. Adds test for incompatible nested field. Updates documentation. * Add an example using showing how to use the with_schema option. 
--------- Co-authored-by: Eric Fredine <[email protected]> * WriteMultipart Abort on MultipartUpload::complete Error (#5974) * update * another one * more update * another update * debug * debug * some updates * debug * debug * cleanup * cleanup * simplify * address some comments * cleanup on failure * restore abort method * docs * Implement directly build byte view array on top of parquet buffer (#5972) * implement sort for view types * add bench for binary/binary view * add view buffer, prepare for byte_view_array reader * make clippy happy * add byte view array reader * fix doc link * reuse make_view_unchecked * Update parquet/src/arrow/buffer/view_buffer.rs Co-authored-by: Andrew Lamb <[email protected]> * update * rename and inline * Update parquet/src/arrow/array_reader/byte_view_array.rs Co-authored-by: Andrew Lamb <[email protected]> * use unused * Revert "use unused" This reverts commit 5e6887095251066cfa9998cb16a9eea788f9e175. --------- Co-authored-by: Andrew Lamb <[email protected]> * fix: error in case of invalid interval expression (#5987) This PR addresses an error that occurs when interval expressions contains invalid amount of components. The error type was previously unclear and confusing: `NotYetImplemented`. That doesn't seem correct, because such values are not going to be supported. 
Let's take a look at such example: ```sql INTERVAL '1 MONTH DAY' ``` This is an obvious typo/mistake which leads to such error, but in fact it's just invalid value (missing number before `DAY`) * Add ParquetMetadata::memory_size size estimation (#5965) * Add ParquetMetadata::memory_size size estimation * Require HeapSize for ParquetValueType * feat(5851): ArrowWriter memory usage (#5967) * refactor(5851): delineate the different memory estimates APIs for the ArrowWriter and column writers * feat(5851): add memory size estimates to the ColumnValueEncoder implementations and the DictEncoder * test(5851): add memory_size() to in-progress test * chore(5851): update docs to make it more explicit what is the difference btwn memory_size vs get_estimated_total_byte * feat(5851): clarify the ColumnValueEncoder::estimated_memory_size interface, and update impls to account for bloom filter size * feat(5851): account for stats array size in the ByteArrayEncoder * Refine documentation * More accurate memory estimation * Improve tests * Update accounting for non dict encoded data * Include more memory size calculations * clean up async writer * clippy * tweak --------- Co-authored-by: Andrew Lamb <[email protected]> * Prepare arrow `52.1.0` (#5992) * Update version to 52.1.0 * Prepare arrow `52.1.0` * Update CHANGELOG * Implement dictionary support for reading ByteView from parquet (#5973) * implement dictionary encoding support * update comments * implement `DataType::try_form(&str)` (#5994) * implement "DataType::try_form(&str)" * add missing file * add FromStr as well as TryFrom<&str> * fmt * Add additional documentation and examples to DataType (#5997) * Automatically cleanup empty dirs in LocalFileSystem (#5978) * automatically cleanup empty dirs * automatic cleanup toggle * configurable cleanup * test for automatic dir deletion * clippy * more comments * Add FlightSqlServiceClient::new_from_inner (#6003) * fix doc ci in latest rust nightly version (#6012) * allow 
rustdoc::unportable_markdown in arrow-flight. * fix doc in sql_info.rs. * reduce scope of lint disable --------- Co-authored-by: Andrew Lamb <[email protected]> * Deduplicate strings/binarys when building view types (#6005) * implement string view deduplication in builder * make clippy happy * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * better coding style --------- Co-authored-by: Andrew Lamb <[email protected]> * Fast utf8 validation when loading string view from parquet (#6009) * fast utf8 validation * better documentation * Update parquet/src/arrow/array_reader/byte_view_array.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]> * Rename `Schema::all_fields` to `flattened_fields` (#6001) * Rename Schema::all_fields to flattened_fields * Add doc example for Schema::flattened_fields * fmt doc example * Update arrow-schema/src/schema.rs --------- Co-authored-by: Andrew Lamb <[email protected]> * Complete `StringViewArray` and `BinaryViewArray` parquet decoder: implement delta byte array and delta length byte array encoding (#6004) * implement all encodings * address comments * fix bug * Update parquet/src/arrow/array_reader/byte_view_array.rs Co-authored-by: Andrew Lamb <[email protected]> * fix test * update comments * update test * Only copy strings one --------- Co-authored-by: Andrew Lamb <[email protected]> * Update zstd-sys requirement from >=2.0.0, <2.0.12 to >=2.0.0, <2.0.13 (#6019) Updates the requirements on [zstd-sys](https://github.com/gyscos/zstd-rs) to permit the latest version. - [Release notes](https://github.com/gyscos/zstd-rs/releases) - [Commits](https://github.com/gyscos/zstd-rs/compare/zstd-sys-2.0.7...zstd-sys-2.0.12) --- updated-dependencies: - dependency-name: zstd-sys dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update clap test (#6028) * Unsafe improvements: core `parquet` crate. (#6024) * Unsafe improvements: core `parquet` crate. * Make FromBytes an unsafe trait. * Improve performance reading `ByteViewArray` from parquet by removing an implicit copy (#6031) * update byte view array to not implicit copy * Add small comments * Update quick-xml requirement from 0.35.0 to 0.36.0 in /object_store (#6032) Updates the requirements on [quick-xml](https://github.com/tafia/quick-xml) to permit the latest version. - [Release notes](https://github.com/tafia/quick-xml/releases) - [Changelog](https://github.com/tafia/quick-xml/blob/master/Changelog.md) - [Commits](https://github.com/tafia/quick-xml/compare/v0.35.0...v0.36.0) --- updated-dependencies: - dependency-name: quick-xml dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix `hashbrown` version in `arrow-array`, remove from `arrow-row` (#6035) * Additional tests for parquet reader utf8 validation (#6023) * Clean up unused code for view types in offset buffer (#6040) * clean up unused view types in offset buffer * make tests happy * Move avoid using copy-based buffer creation (#6039) * Fix 5592: Colon (:) in in object_store::path::{Path} is not handled on Windows (#5830) * Fix issue #5800: Handle missing files in list_with_delimiter * draft * cargo fmt * Handle leading colon * Add windows CI * Fix CI job * Only run local tests and set target family for failing tests * Run all tests without my changes and removed target os * Restore changes again * Add back newline (removed by mistake) * Fix test after merge with master * Minor API adjustments for StringViewBuilder (#6047) * minor update * add memory accounting * Update arrow-buffer/src/builder/null.rs Co-authored-by: 
Andrew Lamb <[email protected]> * Update arrow-array/src/builder/generic_bytes_view_builder.rs Co-authored-by: Andrew Lamb <[email protected]> * update comments --------- Co-authored-by: Andrew Lamb <[email protected]> * Fix typo in GenericByteViewArray documentation (#6054) * Directly decode String/BinaryView types from arrow-row format (#6044) * add string view bench * check in new impl * add utf8 * quick utf8 validation * Update arrow-row/src/variable.rs Co-authored-by: Andrew Lamb <[email protected]> * address comments * update * Revert "address comments" This reverts commit e2656c94dd5ff4fb2f486278feb346d44a7f5436. * addr comments --------- Co-authored-by: Andrew Lamb <[email protected]> * Add begin/end_transaction methods in FlightSqlServiceClient (#6026) * Add begin/end_transaction methods in FlightSqlServiceClient * Add test * Remove unused imports * Implement min max support for string/binary view types (#6053) * add * implement min max support for string/binary view * update tests * Add parquet `StatisticsConverter` for arrow reader (#6046) * Adds arrow statistics converter for parquet statistics. * Adds integration tests for arrow statistics converter. * Fix linting, remove todo, re-use arrow code. * Remove commented out debug::log statements.
* Move parquet_column to lib.rs * doc tweaks * Add benchmark * Add parquet_column_index and arrow_field accessors + test * Copy edit docs obsessively * clippy --------- Co-authored-by: Eric Fredine <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * StringView support in arrow-csv (#6062) * StringView support in arrow-csv * review and micro-benches * Minor: clarify the relationship between `file::metadata` and `format` (#6049) * Do not write `ColumnIndex` for null columns when not writing page statistics (#6011) * disable column_index_builder if no page stats are collected * add test * no need to clone descr --------- Co-authored-by: Andrew Lamb <[email protected]> * Reorganize arrow-flight test code (#6065) * Reorganize test code * asf header * reuse TestFixture * .await * Create flight_sql_client.rs * remove code * remove unused import * Fix clippy lints * Sanitize error message for sensitive requests (#6074) * Sanitize error message for sensitive requests * Clippy * use GCE metadata server env var overrides (#6015) * use GCE metadata env var overrides * update docs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]> * Correct timeout in comment from 5s to 30s (#6073) * Prepare for object_store `0.10.2` release (#6079) * Prepare for `object_store 10.2.0` release * Add CHANGELOG * Historical changelog * Minor: Improve parquet PageIndex documentation (#6042) * Minor: Improve parquet PageIndex documentation * More improvements * Add reasons for data page being without null * Apply suggestions from code review Co-authored-by: Val Lorentz <[email protected]> * Update parquet/src/file/page_index/index.rs --------- Co-authored-by: Val Lorentz <[email protected]> * Enable casting from Utf8View (#6077) * Enable casting from Utf8View -> string or temporal types * save * implement casting utf8view -> timestamp/interval types, with tests * fix clippy * fmt --------- Co-authored-by: Andrew Lamb <[email 
protected]> * Add PartialEq to ParquetMetaData and FileMetadata (#6082) Prep for #6000 * fix panic in `ParquetMetadata::memory_size`: check has_min_max_set before invoking min()/max() (#6092) * fix: check has_min_max_set before invoking min()/max() * chore: add unit test for statistics heap size * Fixup test --------- Co-authored-by: Andrew Lamb <[email protected]> * Optimize `max_boolean` by operating on u64 chunks (#6098) * Optimize `max_boolean` Operate on bit chunks instead of individual booleans, which can lead to massive speedups while not regressing the short-circuiting behavior of the existing implementation. `cargo bench --bench aggregate_kernels -- "bool/max"` shows throughput improvements between 50% to 23390% on my machine. * add tests exercising u64 chunk code * add benchmark to track performance (#6101) * Make bool_or an alias for max_boolean (#6100) Improves `cargo bench --bench aggregate_kernels -- "bool/or"` throughput by 68%-22366% on my machine * Faster `GenericByteView` construction (#6102) * add benchmark to track performance * fast byte view construction * make doc happy * fix clippy * update comments * Implement specialized min/max for `GenericBinaryView` (`StringView` and `BinaryView`) (#6089) * implement better min/max for string view * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * address review comments --------- Co-authored-by: Andrew Lamb <[email protected]> * Prepare `52.2.0` release (#6110) * Update version to 52.2.0 * Update CHANGELOG for 52.2.0 * touchups * manual tweaks * manual tweaks * added a flush method to IPC writers (#6108) While the writers expose `get_ref` and `get_mut` to access the underlying `io::Write` writer, there is an internal layer of a `BufWriter` that is not accessible. Because of that, there is no way to ensure that all messages written thus far to the `StreamWriter` or `FileWriter` have actually been passed to the underlying writer. 
Here we expose a `flush` method that flushes the internal buffer and the underlying writer. See #6099 for the discussion. * Fix Clippy for the Rust 1.80 release (#6116) * Fix clippy lints in arrow-data * Fix clippy errors in arrow-array * fix clippy in concat * Clippy in arrow-string * remove unnecessary feature in arrow-array * fix clippy in arrow-cast * Fix clippy in parquet crate * Fix clippy in arrow-flight * Fix clippy in object_store crate (#6120) * Fix clippy in object_store crate * clippy ignore * Merge `53.0.0-dev` dev branch to main (#6126) * bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041) * bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` Signed-off-by: Bugen Zhao <[email protected]> * fix example tests Signed-off-by: Bugen Zhao <[email protected]> --------- Signed-off-by: Bugen Zhao <[email protected]> * Remove `impl<T: AsRef<[u8]>> From<T> for Buffer` that easily accidentally copies data (#6043) * deprecate auto copy, ask explicit reference * update comments * make cargo doc happy * Make display of interval types more pretty (#6006) * improve display for interval. * update test in pretty, and fix display problem. * tmp * fix tests in arrow-cast. * fix tests in pretty. * fix style. * Update snafu (#5930) * Update Parquet thrift generated structures (#6045) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * Revert "Revert "Write Bloom filters between row groups instead of the end (#…" (#5933) This reverts commit 22e0b4432c9838f2536284015271d3de9a165135. * Revert "Update snafu (#5930)" (#6069) This reverts commit 756b1fb26d1702f36f446faf9bb40a4869c3e840. * Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075) * Update pyo3 requirement from 0.21.1 to 0.22.1 Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md) - [Commits](https://github.com/pyo3/pyo3/compare/v0.21.1...v0.22.1) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * refactor: remove deprecated `FromPyArrow::from_pyarrow` "GIL Refs" are being phased out. * chore: update `pyo3` in integration tests --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * remove repeated codes to make the codes more concise. (#6080) * Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * add support for unencoded_byte_array_data_bytes * add comments * change sig of ColumnMetrics::update_variable_length_bytes() * rename ParquetOffsetIndex to OffsetSizeIndex * rename some functions * suggestion from review Co-authored-by: Andrew Lamb <[email protected]> * add Default trait to ColumnMetrics as suggested in review * rename OffsetSizeIndex to OffsetIndexMetaData --------- Co-authored-by: Andrew Lamb <[email protected]> * Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085) Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md) - [Commits](https://github.com/pyo3/pyo3/compare/v0.21.1...v0.22.2) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095) * deprecate read_page_locations * add to_thrift() to OffsetIndexMetaData * Update parquet/src/column/writer/mod.rs Co-authored-by: Ed Seidl <[email protected]> --------- Signed-off-by: Bugen Zhao <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Bugen Zhao <[email protected]> Co-authored-by: Xiangpeng Hao <[email protected]> Co-authored-by: kamille <[email protected]> Co-authored-by: Jesse <[email protected]> Co-authored-by: Ed Seidl <[email protected]> Co-authored-by: Marco Neumann <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add support for level histograms added in PARQUET-2261 to `ParquetMetaData` (#6105) * deprecate read_page_locations * add level histograms to metadata * add to_thrift() to OffsetIndexMetaData * move valid test into ColumnIndexBuilder::append_histograms * move update_histogram() inside ColumnMetrics * Update parquet/src/column/writer/mod.rs Co-authored-by: Ed Seidl <[email protected]> * Implement LevelHistograms as a struct * formatting * fix error in docs --------- Signed-off-by: Bugen Zhao <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Bugen Zhao <[email protected]> Co-authored-by: Xiangpeng Hao <[email protected]> Co-authored-by: kamille <[email protected]> Co-authored-by: Jesse <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Marco Neumann <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add ArrowError::ArithmeticError (#6130) * Implement date_part for intervals (#6071) Signed-off-by: Nick Cameron <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Remove `SchemaBuilder` dependency from `StructArray` constructors (#6139) * Remove automatic buffering in `ipc::reader::FileReader` for consistent buffering (#6132) * change ipc::reader and writer APIs for consistent buffering Current writer API automatically wraps the supplied `std::io::Write` impl into a BufWriter. It is cleaner and more idiomatic to have the default be using the supplied impl directly, as the user might already have a BufWriter or an impl that doesn't actually benefit from buffering at all. StreamReader does a similar thing, but it also exposes a `try_new_unbuffered` that bypasses the internal wrap.
Here we propose a consistent and non-buffered by default API: - `try_new` does not wrap the passed reader/writer, - `try_new_buffered` is a convenience function that does wrap the reader/writer into a BufReader/BufWriter, - all four publicly exposed IPC reader/writers follow the above consistently, i.e. `StreamReader`, `FileReader`, `StreamWriter`, `FileWriter`. Those are breaking changes. An additional tweak: removed the generic type bounds from struct definitions on the four types, as that is the idiomatic Rust approach (see e.g. stdlib's HashMap that has no bounds on the struct definition, only the impl requires Hash + Eq). See #6099 for the discussion. * improvements to docs in `arrow::ipc::reader` and `writer` Applied a few suggestions, made `Error` sections more consistent. * Use `LevelHistogram` in `PageIndex` (#6135) * use LevelHistogram in PageIndex and ColumnIndexBuilder * revert changes to OffsetIndexBuilder * Fix comparison kernel benchmarks (#6147) * fix comparison kernel benchmarks * add comment as suggested by @alamb * Implement exponential block size growing strategy for `StringViewBuilder` (#6136) * new block size growing strategy * Update arrow-array/src/builder/generic_bytes_view_builder.rs Co-authored-by: Andrew Lamb <[email protected]> * update function name, deprecate old function * update comments --------- Co-authored-by: Andrew Lamb <[email protected]> * improve LIKE regex (#6145) * Improve `LIKE` performance for "contains" style queries (#6128) * improve "contains" performance * add tests * cargo fmt :disappointed: --------- Co-authored-by: Andrew Lamb <[email protected]> * improvements to `(i)starts_with` and `(i)ends_with` performance (#6118) * improvements to "starts_with" and "ends_with" * add tests and refactor slightly * add comments * Add `BooleanArray::new_from_packed` and `BooleanArray::new_from_u8` (#6127) * Support construct BooleanArray from &[u8] * fix doc * add new_from_packed and new_from_u8; delete impl From<&[u8]> for 
BooleanArray and BooleanBuffer * Update object store MSRV to `1.64` (#6123) * Update MSRV to 1.64 * Revert "clippy ignore" This reverts commit 7a4b760bfb2a63c7778b20a4710c2828224f9565. * Upgrade protobuf definitions to flightsql 17.0 (#6133) (#6169) * Update FlightSql.proto to version 17.0 Adds new message CommandStatementIngest and removes `experimental` from other messages. * Regenerate flight sql protocol This upgrades the file to version 17.0 of the protobuf definition. Co-authored-by: Douglas Anderson <[email protected]> * Add additional documentation and examples to ArrayAccessor (#6141) * Minor: Update release schedule in README (#6125) * Minor: Update release schedule in README * prettier * fixp * Optimize `take` kernel for `BinaryViewArray` and `StringViewArray` (#6168) * improve speed of view take kernel * ArrayData -> new_unchecked * Update arrow-select/src/take.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]> * Minor: improve comments in temporal.rs tests (#6140) * Support `StringView` and `BinaryView` in CDataInterface (#6171) * fix round-trip for view schema in CFFI * add * Make object_store errors non-exhaustive (#6165) * Update snafu (#5930) (#6070) Co-authored-by: Jesse <[email protected]> * Update sysinfo requirement from 0.30.12 to 0.31.2 (#6182) * Update sysinfo requirement from 0.30.12 to 0.31.2 Updates the requirements on [sysinfo](https://github.com/GuillaumeGomez/sysinfo) to permit the latest version. - [Changelog](https://github.com/GuillaumeGomez/sysinfo/blob/master/CHANGELOG.md) - [Commits](https://github.com/GuillaumeGomez/sysinfo/compare/v0.30.13...v0.31.2) --- updated-dependencies: - dependency-name: sysinfo dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> * Update example for new sysinfo API --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <[email protected]>
* No longer write Parquet column metadata after column chunks *and* in the footer (#6117) * no longer write inline column metadata * Update parquet/src/column/writer/mod.rs Co-authored-by: Ed Seidl <[email protected]> * suggestion from review Co-authored-by: Andrew Lamb <[email protected]> * add some more documentation * remove write_metadata from PageWriter --------- Signed-off-by: Bugen Zhao <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Bugen Zhao <[email protected]> Co-authored-by: Xiangpeng Hao <[email protected]> Co-authored-by: kamille <[email protected]> Co-authored-by: Jesse <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Marco Neumann <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* add filter benchmark for fsb (#6186) * Add support for `StringView` and `BinaryView` statistics in `StatisticsConverter` (#6181) * Add StringView and BinaryView support for the macro `get_statistics` * Add StringView and BinaryView support for the macro `get_data_page_statistics` * add tests to cover the support for StringView and BinaryView in the macro get_data_page_statistics * found potential bugs and ignore the tests * fake alarm!
no bugs, fix the code by initiating all batches to have 5 rows * make the get_stat StringView and BinaryView tests cover bytes greater than 12 * Benchmarks for `bool_and` (#6189) * Fix typo in documentation of Float64Array (#6188) * feat(parquet): Implement AsyncFileWriter for `object_store::buffered::BufWriter` (#6013) * feat(parquet): Implement AsyncFileWriter for object_store::BufWriter Signed-off-by: Xuanwo <[email protected]> * Fix build Signed-off-by: Xuanwo <[email protected]> * Bump object_store Signed-off-by: Xuanwo <[email protected]> * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * Address comments Signed-off-by: Xuanwo <[email protected]> * Add comments Signed-off-by: Xuanwo <[email protected]> * Make it better to read Signed-off-by: Xuanwo <[email protected]> * Fix docs Signed-off-by: Xuanwo <[email protected]> --------- Signed-off-by: Xuanwo <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Support Parquet `BYTE_STREAM_SPLIT` for INT32, INT64, and FIXED_LEN_BYTE_ARRAY primitive types (#6159) * add todos to help trace flow * add support for byte_stream_split encoding for INT32 and INT64 data * byte_stream_split encoding for fixed_len_byte_array * revert changes to Decoder and add VariableWidthByteStreamSplitDecoder * remove set_type_width as it is now unused * begin implementing roundtrip test * move test * clean up some documentation * add test of byte_stream_split with flba * add check for and test of mismatched sizes * remove type_length from Encoder and add VariableWidthByteStreamSplitEncoder * fix clippy error * change type of argument to new() * formatting * add another test * add variable to split/join streams for FLBA * more informative error message * avoid buffer copies in decoder per suggestion from review * add roundtrip test * optimized version...but clippy complains * clippy was right...replace loop with copy_from_slice * fix test * optimize split_streams_variable for long type
widths * Reduce bounds check in `RowIter`, add `unsafe Rows::row_unchecked` (#6142) * update * update comment * update row-iter bench * make clippy happy * Update zstd-sys requirement from >=2.0.0, <2.0.13 to >=2.0.0, <2.0.14 (#6196) Updates the requirements on [zstd-sys](https://github.com/gyscos/zstd-rs) to permit the latest version. - [Release notes](https://github.com/gyscos/zstd-rs/releases) - [Commits](https://github.com/gyscos/zstd-rs/commits) --- updated-dependencies: - dependency-name: zstd-sys dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add `ThriftMetadataWriter` for writing Parquet metadata (#6197) * Add `ParquetMetadataWriter` allow ad-hoc encoding of `ParquetMetadata` * fix loading in test by etseidl Co-authored-by: Ed Seidl <[email protected]> * add rough equivalence test * one more check * make clippy happy * separate tests that require arrow into a separate module * add histograms to to_thrift() --------- Signed-off-by: Bugen Zhao <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Bugen Zhao <[email protected]> Co-authored-by: Xiangpeng Hao <[email protected]> Co-authored-by: kamille <[email protected]> Co-authored-by: Jesse <[email protected]> Co-authored-by: Ed Seidl <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Marco Neumann <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Douglas Anderson <[email protected]> Co-authored-by: Ed Seidl <[email protected]>
* Add (more) Parquet Metadata Documentation (#6184) * Minor: Add (more) Parquet Metadata Documentation * fix clippy * fix parquet type is_optional comment (#6192) Co-authored-by: jp0317 <[email protected]> * Remove duplicated statistics tests in parquet (#6190) * move all tests to parquet/tests/arrow_reader/statistics.rs, and leave a comment in original file * remove duplicated tests and adjust the empty
tests * data file tests brought folders changes * fix lint * add comments Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]> * fix: interleave docs suggests itself, not take (#6210) * fix: Correctly handle take on dense union of a single selected type (#6209) * fix: use filter instead of filter_primitive * fix: remove pub(crate) from filter_primitive * fix: run cargo fmt * fix: clippy * Make it clear that StatisticsConverter can not panic (#6187) * Optimize `min_boolean` and `bool_and` (#6144) * Optimize `min_boolean` and `bool_and` Closes #https://github.com/apache/arrow-rs/issues/6103 * use any * Add benchmarks for `BYTE_STREAM_SPLIT` encoded Parquet `FIXED_LEN_BYTE_ARRAY` data (#6204) * save type_width for fixed_len_byte_array * add decimal128 and float16 byte_stream_split benches * add f16 * add decimal128 flba(16) bench * fix(arrow): restrict the range of temporal values produced via `data_gen` (#6205) * fix: random timestamp array * fix: restrict range of randomly generated temporal values * fix: exclusive range used * Support casting between BinaryView <--> Utf8 and LargeUtf8 (#6180) * support cast between binaryview and string * update impl. 
and add bench mark * Add ut for views * Apply coments * feat(object_store): add `PermissionDenied` variant to top-level error (#6194) * feat(object_store): add `PermissionDenied` variant to top-level error * Update object_store/src/lib.rs Co-authored-by: Raphael Taylor-Davies <[email protected]> * refactor: add additional error variant for unauthenticated ops * fix: include path in unauthenticated error --------- Co-authored-by: Raphael Taylor-Davies <[email protected]> * update BYTE_STREAM_SPLIT documentation (#6212) * Add time dictionary coercions (#6208) * Add time dictionary coercions * format * Pass through primitive values * use spaces not tabs everywhere (#6217) * Implement specialized filter kernel for `FixedSizeByteArray` (#6178) * refactor filter for FixedSizeByteArray * fix expect * remove benchmark code * fix * remove from_trusted_len_iter_slice_u8 * fmt --------- Co-authored-by: Andrew Lamb <[email protected]> * fix: lexsort_to_indices should not fallback to non-lexical sort if the datatype is not supported (#6225) * fix: lexsort_to_indices should not fallback to non-lexical sort if the datatype is not supported * fix clippy * Check error message * Prepare for object_store `0.11.0` release (#6227) * Update version to 0.11.0 * Changelog for 0.11.0 * Remove irrelevant content from changelog * Improve interval parsing (#6211) * improve interval parsing * rename * cleanup * fix formatting * make IntervalParseConfig public * add debug to IntervalParseConfig * fmt * Add LICENSE and NOTICE files to object_store (#6234) * Add LICENSE and NOTICE files to object_store * Update object_store/NOTICE.txt Co-authored-by: Xuanwo <[email protected]> * Update object_store/LICENSE.txt --------- Co-authored-by: Xuanwo <[email protected]> * Update changelog for object_store 0.11.0 release (#6238) * Minor: Remove non standard footer from LICENSE.txt (#6237) * Minor: Improve Type documentation (#6224) * Minor: Improve XXXType documentation * Update arrow-array/src/types.rs 
Co-authored-by: Marco Neumann <[email protected]> --------- Co-authored-by: Marco Neumann <[email protected]> * Add "take" workflow for self-assigning tickets, add "how to find issues" to contributor guide (#6059) * Add "take" workflow for contributors to assign themselves to tickets * Copy datafusion Finding and Creating Issues to work on * Move `ParquetMetadataWriter` to its own module, update documentation (#6202) * Move `ThriftMetadataWriter` and `ParquetMetadataWriter` to a new module * Improve documentation, make pub(crate) * Apply suggestions from code review Co-authored-by: Ed Seidl <[email protected]> * Add comment side effect of writing column and offset indexes * Document how to write bloom filters * Update parquet/src/file/metadata/writer.rs Co-authored-by: Ed Seidl <[email protected]> --------- Co-authored-by: Ed Seidl <[email protected]> * Modest improvement to FixedLenByteArray BYTE_STREAM_SPLIT arrow decoder (#6222) * replace reserve/push with resize/direct access * remove import * make a bit faster * Improve performance of `FixedLengthBinary` decoding (#6220) * add set_from_bytes to ParquetValueType * change naming of FLBA types so critcmp will work * minor enhance doc for ParquetField (#6239) * Remove unnecessary null buffer construction when converting arrays to a different type (#6244) * create primitive array from iter and nulls * clippy * speed up some more decimals * add optimizations for byte_stream_split * decimal256 * Revert "add optimizations for byte_stream_split" This reverts commit 5d4ae0dc09f95ee9079b46b117fb554f63157564. * add comments * Add examples to `StringViewBuilder` and `BinaryViewBuilder` (#6240) * Add examples to `StringViewBuilder` and `BinaryViewBuilder` * add doc link * Implement PartialEq for GenericBinaryArray (#6241) * parquet Statistics - deprecate `has_*` APIs and add `_opt` functions that return `Option<T>` (#6216) * update public api Statistics::min to return an option. 
I first re-named the existing method to `min_unchecked` and made it internal to the crate. I then added a `pub min(&self) -> Opiton<&T>` method. I figure we can first change the public API before deciding what to do about internal usage. Ref: https://github.com/apache/arrow-rs/issues/6093 * update public api Statistics::max to return an option. I first re-named the existing method to `max_unchecked` and made it internal to the crate. I then added a `pub max(&self) -> Opiton<&T>` method. I figure we can first change the public API before deciding what to do about internal usage. Ref: https://github.com/apache/arrow-rs/issues/6093 * cargo fmt * remove Statistics::has_min_max_set from the public api Ref: https://github.com/apache/arrow-rs/issues/6093 * update impl HeapSize for ValueStatistics to use new min and max api * migrate all tests to new Statistics min and max api * make Statistics::null_count return Option<u64> This removes ambiguity around whether the between all values are non-null or just that the null count stat is missing Ref: https://github.com/apache/arrow-rs/issues/6215 * update expected metadata memory size tests Changing null_count from u64 to Option<u64> increases the memory size and layout of the metadata. I included these tests as a separate commit to call extra attention to it. * add TODO question on is_min_max_backwards_compatible * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * update ValueStatistics::max docs * rename new optional ValueStatistics::max to max_opt Per PR review, we will deprecate the old API instead of introducing a brekaing change. Ref: https://github.com/apache/arrow-rs/pull/6216#pullrequestreview-2236537291 * rename new optional ValueStatistics::min to min_opt * add Statistics:{min,max}_bytes_opt This adds the API and migrates all of the test usage. The old APIs will be deprecated next. 
* update make_stats_iterator macro to use *_opt methods * deprecate non *_opt Statistics and ValueStatistics methods * remove stale TODO comments * remove has_min_max_set check from make_decimal_stats_iterator The check is unnecessary now that the stats funcs return Option<T> when unset. * deprecate has_min_max_set An internal version was also created because it is used so extensively in testing. * switch to null_count_opt and reintroduce deprecated null_count and has_nulls * remove redundant test assertions of stats._internal_has_min_max_set This removes the assertion from any test that subsequently unwraps both min_opt and max_opt. * replace negated test assertions of stats._internal_has_mix_max_set with assertions on min_opt and max_opt This removes all use of Statistics::_internal_has_min_max_set from the code base, and so it is also removed. * Revert changes to parquet writing, update comments --------- Co-authored-by: Andrew Lamb <[email protected]> * Minor: Update DateType::Date64 docs (#6223) * feat(object_store): add support for server-side encryption with customer-provided keys (SSE-C) (#6230) * Add support for server-side encryption with customer-provided keys (SSE-C). * Add SSE-C test using MinIO. * Visibility change * add nocapture to verify the test indeed runs * cargo fmt * Update object_store/src/aws/mod.rs use environment variables Co-authored-by: Will Jones <[email protected]> * Update object_store/CONTRIBUTING.md use environment variables Co-authored-by: Will Jones <[email protected]> * Fix api --------- Co-authored-by: Will Jones <[email protected]> * Expose bulk ingest in flight sql client and server (#6201) * Expose CommandStatementIngest as pub in sql module * Add do_put_statement_ingest to FlightSqlService Dispatch this handler for the new CommandStatementIngest command. 
* Sort list * Implement stub do_put_statement_ingest in example * Refactor helper functions into tests/common/utils * Implement execute_ingest for flight sql client I referenced the C++ implementation here: https://github.com/apache/arrow/commit/0d1ea5db1f9312412fe2cc28363e8c9deb2521ba * Add integration test for sql client execute_ingest * Fix lint clippy::new_without_default * Allow streaming ingest for FlightClient::execute_ingest * Properly return client errors --------- Co-authored-by: Andrew Lamb <[email protected]> * docs: Add parquet_opendal in related projects (#6236) * docs: Add parquet_opendal in related projects * Fix spaces * Avoid infinite loop in bad parquet by checking the number of rep levels (#6232) * check the number of rep levels read from page * minor fix on typo Co-authored-by: Andrew Lamb <[email protected]> * add check on record_read as well --------- Co-authored-by: jp0317 <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Make the bearer token visible in FlightSqlServiceClient (#6254) * Make the bearer token visible in FlightSqlServiceClient * Update client.rs Co-authored-by: Andrew Lamb <[email protected]> --------- Co-authored-by: Andrew Lamb <[email protected]> * Add tests for bad parquet files (#6262) * Add tests for bad parquet files * Reenable test * Add test for very subltley different file * Update parquet object_store dependency to 0.11.0 (#6264) * Implement date_part for durations (#6246) Signed-off-by: Nick Cameron <[email protected]> * feat: further TLS options on ClientOptions: #5034 (#6148) * feat: further TLS options on ClientOptions: #5034 * Rename to Certificate and with_root_certificate, add docs --------- Co-authored-by: Andrew Lamb <[email protected]> * Improve documentation for MutableArrayData (#6272) * Do not print compression level in schema printer (#6271) The compression level is only used during compression, not decompression, and isn't actually stored in the metadata. Printing it is misleading. 
* Add `Statistics::distinct_count_opt` and deprecate `Statistics::distinct_count` (#6259) * Fix accessing name from ffi schema (#6273) * Fix accessing name from ffi schema * Add test * ci: use octokit to add assignee (#6267) * Only add encryption headers for for SSE-C in get. (#6260) * Minor: move `FallibleRequestStream` and `FallibleTonicResponseStream` to a module (#6258) * Minor: move FallibleRequestStream and FallibleTonicResponseStream to their own modules * Improve documentation and add links * Minor: `pub use ByteView` in arrow and improve documentation (#6275) * Minor: `pub use ByteView` in arrow and improve documentation * clarify docs more * ci: simplify octokit add assignee (#6280) * Update tower requirement from 0.4.13 to 0.5.0 (#6250) * Update tower requirement from 0.4.13 to 0.5.0 Updates the requirements on [tower](https://github.com/tower-rs/tower) to permit the latest version. - [Release notes](https://github.com/tower-rs/tower/releases) - [Commits](https://github.com/tower-rs/tower/compare/tower-0.4.13...tower-0.5.0) --- updated-dependencies: - dependency-name: tower dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Add tower version --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <[email protected]> * Fix panic in comparison_kernel benchmarks (#6284) * Fix panic in comparison_kernel benchmarks * Add other special case equality kernels * Add other benchmarks * fix reference in doctest to size_of which is not imported by default (#6286) This corrects an issue with this doctest noticed on FreeBSD/amd64 with rustc 1.77.0 * Use `unary()` for array conversion in Parquet array readers, speed up `Decimal128`, `Decimal256` and `Float16` (#6252) * add unary to FixedSizeBinaryArray; use unary for…

github-actions bot added the parquet Changes to the parquet crate label Jul 3, 2024

adriangb changed the title from "Add function to mirror" to "Add encode_metadata function to mirror decode_metadata and allow ad-hoc encoding of ParquetMetadata" Jul 3, 2024

alamb mentioned this pull request Jul 4, 2024

API for encoding/decoding ParquetMetadata with more control #6002

Closed

alamb reviewed Jul 4, 2024


alamb mentioned this pull request Jul 7, 2024

DataFusion weekly project plan (Andrew Lamb) - July 1, 2024 apache/datafusion#11190

Closed

10 tasks

alamb mentioned this pull request Jul 8, 2024

DataFusion weekly project plan (Andrew Lamb) - July 8, 2024 apache/datafusion#11334

Closed

9 tasks

alamb reviewed Jul 11, 2024


adriangb force-pushed the add-encode_metadata branch from afa975d to d7a4156 July 12, 2024 04:01

This was referenced Jul 13, 2024

Minor: clarify the relationship between file::metadata and format in docs #6049

Merged

Proposal: parquet 53.0.0 feature branch #6050

Closed

alamb reviewed Jul 13, 2024


adriangb mentioned this pull request Jul 13, 2024

Reintroduce: Write Bloom filters between row groups instead of the end #5933

Merged

etseidl reviewed Jul 14, 2024


alamb mentioned this pull request Jul 15, 2024

DataFusion weekly project plan (Andrew Lamb) - July 15, 2024 apache/datafusion#11474

Closed

7 tasks

etseidl reviewed Jul 15, 2024


alamb changed the base branch from master to 53.0.0-dev July 16, 2024 22:56

alamb reviewed Jul 17, 2024


adriangb added a commit to adriangb/arrow-rs that referenced this pull request Jul 18, 2024

Add PartialEq to ParquetMetaData and FileMetadata

c43107a

Prep for apache#6000

adriangb mentioned this pull request Jul 18, 2024

Add PartialEq to ParquetMetaData and FileMetadata #6082

Merged

adriangb changed the title from "Add encode_metadata function to mirror decode_metadata and allow ad-hoc encoding of ParquetMetadata" to "Add ParquetMetadataWriter allow ad-hoc encoding of ParquetMetadata" Jul 18, 2024

adriangb force-pushed the add-encode_metadata branch 2 times, most recently from b38ccf7 to b41173f July 24, 2024 19:28

etseidl reviewed Jul 24, 2024


alamb deleted the branch apache:53.0.0-dev July 26, 2024 10:11

alamb closed this Jul 26, 2024

alamb reopened this Jul 26, 2024

alamb mentioned this pull request Jul 26, 2024

[DISCUSSION] Parquet Metadata Improvements #6129

Open

etseidl added a commit to etseidl/arrow-rs that referenced this pull request Jul 26, 2024

add to_thrift to NativeIndex in prep for apache#6000

e8a0b7f

alamb mentioned this pull request Jul 29, 2024

DataFusion weekly project plan (Andrew Lamb) - July 29, 2024 apache/datafusion#11710

Closed

8 tasks

adriangb and others added 2 commits July 31, 2024 09:29

Add ParquetMetadataWriter allow ad-hoc encoding of ParquetMetadata

b07d057

fix loading in test by etseidl

e2be8d3

Co-authored-by: Ed Seidl <[email protected]>

adriangb force-pushed the add-encode_metadata branch from b2651b4 to e2be8d3 July 31, 2024 14:33

etseidl added 3 commits July 31, 2024 09:09

add rough equivalence test

0175d53

one more check

f188bf8

make clippy happy

57b85d7

Merge pull request #1 from etseidl/pr_6000_ets

1f3eb0b

Add test for metadata equivalence

etseidl mentioned this pull request Jul 31, 2024

Upgrade protobuf definitions to flightsql 17.0 #6133

Merged

alamb deleted the branch apache:53.0.0-dev August 1, 2024 10:57

alamb closed this Aug 1, 2024

adriangb mentioned this pull request Aug 5, 2024

Add ThriftMetadataWriter for writing Parquet metadata #6197

Merged

        // We only include ColumnOrder for leaf nodes.
        // Currently only supported ColumnOrder is TypeDefinedOrder so we set this
        // for all leaf nodes.
        // Even if the column has an undefined sort order, such as INTERVAL, this
        // is still technically the defined TYPEORDER so it should still be set.
        let column_orders = (0..self.schema_descr().num_columns())
            .map(|_| parquet::ColumnOrder::TYPEORDER(parquet::TypeDefinedOrder {}))
            .collect();
        // This field is optional, perhaps in cases where no min/max fields are set
        // in any Statistics or ColumnIndex object in the whole file.
        // But for simplicity we always set this field.
        let column_orders = Some(column_orders);

        let file_metadata = parquet::FileMetaData {
            num_rows,
            row_groups,
            key_value_metadata,
            version: self.props.writer_version().as_num(),
            schema: types::to_thrift(self.schema.as_ref())?,
            created_by: Some(self.props.created_by().to_owned()),
            column_orders,
            encryption_algorithm: None,
            footer_signing_key_metadata: None,
        };
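
The column-order logic in the snippet above can be tried in isolation. The following is a minimal, std-only sketch, not the crate's code: `TypeDefinedOrder`, `ColumnOrder`, and `column_orders_for` are hypothetical stand-ins for the thrift-generated types and the writer method, assumed here only for illustration. It shows the two choices the comments describe: one type-defined order per leaf column, and a field that is always populated rather than left as `None`.

```rust
/// Hypothetical stand-in for the thrift-generated `TypeDefinedOrder` struct.
#[derive(Debug, Clone, PartialEq)]
struct TypeDefinedOrder;

/// Hypothetical stand-in for the thrift-generated `ColumnOrder` union.
#[derive(Debug, Clone, PartialEq)]
enum ColumnOrder {
    TypeOrder(TypeDefinedOrder),
}

/// Build one `ColumnOrder` per leaf column.
/// The result is always `Some`, even for zero columns, mirroring the
/// "for simplicity we always set this field" choice in the snippet.
fn column_orders_for(num_leaf_columns: usize) -> Option<Vec<ColumnOrder>> {
    let orders: Vec<ColumnOrder> = (0..num_leaf_columns)
        .map(|_| ColumnOrder::TypeOrder(TypeDefinedOrder))
        .collect();
    Some(orders)
}

fn main() {
    let orders = column_orders_for(3).expect("field is always populated");
    assert_eq!(orders.len(), 3);
    assert!(orders
        .iter()
        .all(|o| *o == ColumnOrder::TypeOrder(TypeDefinedOrder)));
    println!("{} column orders", orders.len());
}
```

Always emitting `Some(...)` keeps the writer free of a branch on whether any statistics were written, at the cost of a few extra bytes in the footer.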


		let data = buf.into_inner().freeze();

		let decoded_metadata = load_metadata_from_bytes(metadata.file_size, data).await;


adriangb commented Jul 3, 2024 (edited by alamb)

alamb left a comment


adriangb commented Jul 4, 2024 (edited)

adriangb commented Jul 6, 2024

alamb commented Jul 8, 2024

alamb commented Jul 11, 2024

alamb left a comment


adriangb commented Jul 11, 2024

adriangb commented Jul 12, 2024

alamb left a comment


alamb commented Jul 16, 2024

alamb commented Jul 16, 2024

adriangb commented Jul 17, 2024

alamb left a comment


alamb commented Jul 17, 2024

etseidl commented Jul 24, 2024

adriangb commented Jul 24, 2024


alamb commented Jul 26, 2024 (edited)

alamb commented Jul 26, 2024

alamb commented Jul 29, 2024

etseidl commented Jul 31, 2024

etseidl commented Aug 1, 2024 (edited)

adriangb commented Aug 5, 2024

etseidl commented Aug 5, 2024
