PARQUET-2474: Add FIXED_SIZE_LIST logical type #241

rok · 2024-05-15T17:47:28Z

As proposed in apache/arrow#34510 and on ML, PARQUET-2474.

Arrow recently introduced FixedShapeTensor and VariableShapeTensor canonical extension types that use FixedSizeList and StructArray(List, FixedSizeList) as storage, respectively. These are targeted at machine learning and scientific applications that deal with large datasets and would benefit from using Parquet as on-disk storage.

However, FixedSizeList is currently stored as List in Parquet, which adds significant conversion overhead when reading and writing, as discussed here. It would therefore be beneficial to introduce a FIXED_SIZE_LIST logical type to Parquet.

rok · 2024-05-15T17:48:58Z

cc @wgtmac @tustvold @alippai @mapleFU @AlenkaF

LogicalTypes.md

etseidl

Interesting way to get lists without repetition.

etseidl · 2024-05-15T22:02:02Z

LogicalTypes.md

+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.


"binary" means either fixed or variable length, right? I always get confused 😅.

Could you please provide a concrete example of how the list is structured? What about the definition and repetition levels? Intuitively, I would not limit it to the binary type. For example, it would be possible to support something like int[N] or double[N], and even multi-dimensional lists like int[M][N].

Perhaps use byte_array in this PR (see #251).

Will do, thanks!

> Could you please provide a concrete example of how the list is structured? What about the definition and repetition levels? Intuitively, I would not limit it to the binary type. For example, it would be possible to support something like int[N] or double[N], and even multi-dimensional lists like int[M][N].

I would represent the fixed-size list as a non-nested FIXED_LEN_BYTE_ARRAY + type + num_values. Multidimensional lists/arrays bring much more complexity, and I'm not sure it makes sense to store them as a logical type (see FixedShapeTensor in Arrow). Also see #241 (comment).

> Perhaps use byte_array in this PR (see #251).

Done.

LogicalTypes.md

tustvold · 2024-05-15T23:49:38Z

One thing to perhaps give thought to is how this might represent nested lists, say you wanted to encode a m by n matrix, would you just encode this as a m * n list or do we want to support this as a first-class concept?

I had perhaps been anticipating that fixed size list would be a variant of "REPEATED" as opposed to a physical type, that is just able to avoid incrementing the max_def_level and max_rep_level. This would make it significantly more flexible I think, although I concede it will make it harder to implement.

wgtmac · 2024-05-16T02:37:41Z

cc @JFinis

LogicalTypes.md

JFinis · 2024-05-16T06:15:17Z

src/main/thrift/parquet.thrift

+struct EnumType {}          // allowed for BINARY, must be encoded with UTF-8
+struct DateType {}          // allowed for INT32
+struct Float16Type {}       // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct FixedSizeListType {} // see LogicalTypes.md


Something is missing here. Shouldn't this type contain the element type? And the length of the list? The length of the list could be deduced from the size of the underlying fixed_len_byte_array, but at least the element type would be necessary then.

Changed to:

struct FixedSizeListType {    // allowed for FIXED_LEN_BYTE_ARRAY[num_values * width of type],
  1: required Type type;      // see LogicalTypes.md
  2: required i32 num_values;
}
struct VariableSizeListType { // allowed for BYTE_ARRAY, see LogicalTypes.md
  1: required Type type;
}
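To make the `FIXED_LEN_BYTE_ARRAY[num_values * width of type]` constraint concrete, here is a minimal Python sketch of the length check a reader or writer might perform. The names `ELEMENT_WIDTH`, `fixed_size_list_type_length` and `check_annotation` are hypothetical, and restricting the element types to whole-byte fixed-width physical types (and rejecting `num_values <= 0`) is an assumption, not something the PR states at this point.

```python
# Byte widths of fixed-width Parquet physical types under PLAIN encoding.
# Assumption: BOOLEAN, BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY elements are
# out of scope for this sketch.
ELEMENT_WIDTH = {"INT32": 4, "INT64": 8, "INT96": 12, "FLOAT": 4, "DOUBLE": 8}


def fixed_size_list_type_length(element_type: str, num_values: int) -> int:
    """Return the FIXED_LEN_BYTE_ARRAY length implied by (type, num_values)."""
    if element_type not in ELEMENT_WIDTH:
        raise ValueError(f"unsupported element type: {element_type}")
    if num_values <= 0:
        # Assumption: an empty fixed-size list is treated as invalid here.
        raise ValueError("num_values must be positive")
    return num_values * ELEMENT_WIDTH[element_type]


def check_annotation(type_length: int, element_type: str, num_values: int) -> bool:
    """Check that a column's declared type_length matches its annotation."""
    return type_length == fixed_size_list_type_length(element_type, num_values)
```

For example, a list of three `FLOAT` values would need a 12-byte `FIXED_LEN_BYTE_ARRAY`.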

JFinis · 2024-05-16T06:37:16Z

LogicalTypes.md

@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.

 The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.

+### FIXED_SIZE_LIST


Interesting choice to annotate a binary primitive field instead of a repeated group field. I see pros and cons with this design:

PROs:

Guarantees zero-copy, as the layout is defined to be just bytes. In contrast, if this annotated a group, a writer could decide to use a fancy per-value encoding (e.g., dictionary) and thus create a list that first has to be "decoded" before it can be used.

Guarantees that a list is always contained on one page instead of being split over multiple pages. Again, this helps in keeping decoders easy and guaranteeing zero copy.

This solves the problem of redundant R-Levels. Since it's just a primitive column, no r-level considerations have to be taken into account.

CONs:

Cannot create fixed size lists of nested types (e.g., list of structs). I see that this isn't necessary for tensors or embedding vectors, but shouldn't the feature be extensible for other scenarios as well? This limits the composability of the feature. I can now create a struct of fixed size lists, but not a fixed size list of structs.

Cannot have null elements in fixed size lists. This might not be desired for all lists, but there can be use cases where having null values in them is preferable.

Parquet has a concept for (non-fixed size) lists. It is conceptually weird that fixed size lists are totally different from (non-fixed size) lists.

I think the PROs outweigh the CONs here, so this is fine with me. I just want everyone to be aware of the ramifications.

cc @tustvold, as you also brought up this point. I agree that having a new property of a repeated group would be more flexible, but it also comes at some cost, as outlined above. Also, it couldn't be just a logical type in this case, as a logical type cannot change the handling of R-Levels.
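The "redundant R-Levels" point can be illustrated with a small sketch. It assumes a required list of required elements with no empty lists, so only repetition levels matter; the function names are hypothetical. When every list has the same length, the levels carry no information beyond that length.

```python
def repetition_levels(rows):
    """Repetition levels for a required list<required element> column.

    Level 0 starts a new row; level 1 continues the current list.
    Assumption: every row is a non-empty list.
    """
    levels = []
    for row in rows:
        for i, _ in enumerate(row):
            levels.append(0 if i == 0 else 1)
    return levels


def levels_for_fixed_size(num_rows, num_values):
    """The same levels, derived from the list size alone."""
    return ([0] + [1] * (num_values - 1)) * num_rows


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
levels = repetition_levels(rows)
# Both rows have three values, so the levels are fully predictable.
assert levels == levels_for_fixed_size(len(rows), 3)
```

With the list size known from the schema, a reader would not need to store or decode these levels at all, which is what the primitive-column design relies on.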

I'm now feeling that maybe wrapping a Vector[PrimitiveType, Size] is also OK, but representing this is currently a bit weird in the model. May I ask whether a Vector could hold data like the following?

1. [1, 1, 1], [null, 1, 1] <-- data with null
2. null, [1, 1, 1] <-- null vector

And could a vector contain a "nested" vector?

> This solves the problem of redundant R-Levels. Since it's just a primitive column, no r-level considerations have to be taken into account.

This is the main reason I'd like to propose this type, see apache/arrow#34510.

> Cannot create fixed size lists of nested types (e.g., list of structs). I see that this isn't necessary for tensors or embedding vectors, but shouldn't the feature be extensible for other scenarios as well? This limits the composability of the feature. I can now create a struct of fixed size lists, but not a fixed size list of structs.

Lack of composability is a downside, but I think it's still worth the compromise. I've not seen a need for fixed_size_list(struct) in tensor computing, but that's probably just because it's not available.

> Cannot have null elements in fixed size lists. This might not be desired for all lists, but there can be use cases where having null values in them is preferable.

In tensor computation this is usually addressed with bitmasks, which can be stored as a fixed_size_list(binary, num_values).

> Parquet has a concept for (non-fixed size) lists. It is conceptually weird that fixed size lists are totally different from (non-fixed size) lists.

Perhaps we should call this type FixedSizeArray to disambiguate?

> I'm now feeling that maybe wrapping a Vector[PrimitiveType, Size] is also OK, but representing this is currently a bit weird in the model. May I ask whether a Vector could hold data like the following?
>
> 1. [1, 1, 1], [null, 1, 1] <-- data with null
> 2. null, [1, 1, 1] <-- null vector
>
> And could a vector contain a "nested" vector?

I think case 2. is ok, but case 1. should be expressed with a separate null bitmask that's not part of the type.
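As an illustration of case 1 handled with a separate null bitmask, here is a hedged Python sketch. The helper names, the choice of `int32` elements, the fill value for null slots and the LSB-first bit order are all assumptions made for the example; the PR does not define a bitmask layout.

```python
import struct


def pack_with_mask(values, fill=0):
    """Pack int32 values little-endian and build a validity bitmask.

    Null slots are written as `fill`; bit i of the mask (LSB-first within
    each byte) is set when element i is valid.
    """
    mask = bytearray((len(values) + 7) // 8)
    dense = []
    for i, v in enumerate(values):
        if v is None:
            dense.append(fill)
        else:
            dense.append(v)
            mask[i // 8] |= 1 << (i % 8)
    return struct.pack(f"<{len(dense)}i", *dense), bytes(mask)


def unpack_with_mask(payload, mask, num_values):
    """Inverse of pack_with_mask: restore None where the mask bit is clear."""
    dense = struct.unpack(f"<{num_values}i", payload)
    return [
        v if (mask[i // 8] >> (i % 8)) & 1 else None
        for i, v in enumerate(dense)
    ]


payload, mask = pack_with_mask([None, 1, 1])
assert unpack_with_mask(payload, mask, 3) == [None, 1, 1]
```

The payload stays a dense, fixed-width buffer, and the mask would live in a second column rather than in the type itself.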

rok · 2024-06-05T02:39:13Z

Apologies for taking a while to reply.

I've split this into two cases: FixedSizeListType (length is constant) and VariableSizeListType (length differs per row) for the sake of discussion. I would move VariableSizeListType into a separate PR if we even decide it is needed next to ListType.

> One thing to perhaps give thought to is how this might represent nested lists, say you wanted to encode a m by n matrix, would you just encode this as a m * n list or do we want to support this as a first-class concept?

We could start with a more general multidimensional array definition and have list be a 1 dimensional array. Additional metadata required would not be that bad. I'm just a bit scared of validation and striding logic bleeding into parquet implementations. Do we have any other inputs / opinions?
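For reference, the "m * n list" option and the striding logic it pushes onto readers can be sketched as follows. This is a generic row-major example with hypothetical helper names, not a layout the PR specifies.

```python
def flatten_row_major(matrix):
    """Flatten an m x n matrix into one list of m * n values plus its shape."""
    m = len(matrix)
    n = len(matrix[0])
    if any(len(row) != n for row in matrix):
        raise ValueError("all rows must have the same length")
    return [x for row in matrix for x in row], (m, n)


def element_at(flat, shape, i, j):
    """Look up element (i, j) using a row-major stride of n."""
    m, n = shape
    if not (0 <= i < m and 0 <= j < n):
        raise IndexError("index out of range")
    return flat[i * n + j]


flat, shape = flatten_row_major([[1, 2, 3], [4, 5, 6]])
assert element_at(flat, shape, 1, 2) == 6
```

The shape has to be stored somewhere outside the flat list, which is the extra metadata and validation burden mentioned above.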

> I had perhaps been anticipating that fixed size list would be a variant of "REPEATED" as opposed to a physical type, that is just able to avoid incrementing the max_def_level and max_rep_level. This would make it significantly more flexible I think, although I concede it will make it harder to implement.

That's interesting. What would you expect performance wise with this approach?


etseidl

Looking good to me. Just a few questions/comments. Thanks!

etseidl · 2024-06-20T19:56:54Z

LogicalTypes.md

+The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of
+elements of the same primitive data type.


Should the encoding be defined as well, for instance the elements of the array are encoded in the same manner as PLAIN encoding?

Yes, that seems like a thing to specify. Changed to:

The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of elements of the same primitive data type encoded with plain encoding.
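A minimal sketch of what "encoded with plain encoding" would mean for one value of the column, assuming `FLOAT` elements (PLAIN stores them as 4-byte little-endian IEEE 754). The function names and the `fmt` parameter are hypothetical and use Python `struct` format characters.

```python
import struct


def encode_fixed_size_list(values, num_values, fmt="f"):
    """Pack num_values elements little-endian into one fixed-length value."""
    if len(values) != num_values:
        raise ValueError(f"expected {num_values} values, got {len(values)}")
    return struct.pack(f"<{num_values}{fmt}", *values)


def decode_fixed_size_list(data, num_values, fmt="f"):
    """Unpack one fixed-length value back into a list of elements."""
    expected = num_values * struct.calcsize(f"<{fmt}")
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    return list(struct.unpack(f"<{num_values}{fmt}", data))


encoded = encode_fixed_size_list([1.0, 2.0, 3.0], 3)
assert decode_fixed_size_list(encoded, 3) == [1.0, 2.0, 3.0]
```

Because the bytes are already in the in-memory layout of a little-endian float array, a reader could in principle hand the buffer over without per-element decoding.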

etseidl · 2024-06-20T20:03:11Z

LogicalTypes.md

+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type.


As written, the elements can themselves be arrays. Is this intended? Or should it be "non-array primitive data type"?

I didn't really consider the possibility of elements being arrays and I think non-array limitation makes sense. Changed to:

The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements of a non-array primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type.

LogicalTypes.md

rok · 2024-06-24T17:20:50Z

Thanks for the review @etseidl ! I've updated this with your suggestions.

alippai · 2024-06-24T19:24:37Z

@ritchie46 would this be useful for your new polars Array type?

alippai · 2024-10-18T22:06:39Z

@rok is there anything I can help with?

@mapleFU I saw your questions above. Are you satisfied with the answers?

@coastalwhite I see you are familiar with Parquet and Array in Polars. Do you think this proposal is useful for your project?

coastalwhite · 2024-10-21T07:53:57Z

I like the general idea of moving FixedSizeList partially away from List and towards FixedSizeBinary, but I doubt it would lead to serious speedups or simplification possibilities.

The List based deserializer most of the time already batches decoding similarly to what this would allow, although it would allow skipping many checks that happen before the actual deserialization takes place. We would also still need to support the old path for a long time, since a lot of people write parquet files using old versions of the parquet specification and generally use old parquet files.

The one potentially large upside I can imagine of this is getting dictionary encoding for arrays, but I am not sure how common that will be in real-world scenarios.

In general, I would say I am in favor. Although, I am not 100% convinced yet that the added complexity will result in significant performance, file size or other benefits.

alippai · 2024-10-21T12:49:03Z

@coastalwhite there is a 10x penalty in Polars 1.9.0 parquet reading as well using this snippet: apache/arrow#34510 (comment)

rok · 2024-10-21T13:09:00Z

> @rok is there anything I can help with?

@alippai thanks for pinging. I was advised on the parquet sync call to re-open a ML discussion on this, but I need a couple of weeks to get to it. If you'd like you can start it now, here's the existing thread: https://lists.apache.org/thread/xot5f3ghhtc82n1bf0wdl9zqwlrzqks3
I suppose it'd be useful to report on the pros and cons discussed here and propose we move forward.

coastalwhite · 2024-10-21T13:21:23Z

> @coastalwhite there is a 10x penalty in Polars 1.9.0 parquet reading as well using this snippet: apache/arrow#34510 (comment)

Thank you for bringing that to my attention. Still, I feel like that is more of a bug than an inherent performance problem in the Parquet file format. However, it is probably easier to optimize for what is proposed in this PR.

alippai · 2024-10-21T17:30:58Z

@rok based on the ML discussion we should add the fast path in the cases of polars, arrow and arrow-rs where we know the fixed size already (from the schema stored in the metadata or if it's provided by the consumer). This is more fragile and less universal, but maybe a good first step forward.
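One way to picture such a fast path, and why it is fragile, is the hypothetical sketch below. It assumes a required list of required elements (no nulls), so values and repetition levels have equal length; `split_fixed` is an invented name and does not correspond to any polars, arrow or arrow-rs API. Unlike a real fast path, it still inspects the levels, purely to show when the fixed-size assumption would break.

```python
def split_fixed(values, rep_levels, num_values):
    """Chunk flat values into lists of num_values if the levels allow it.

    Returns None when the levels do not describe fixed-size lists, in which
    case a caller would fall back to the general list decoder.
    """
    if num_values <= 0:
        raise ValueError("num_values must be positive")
    if len(values) != len(rep_levels) or len(values) % num_values != 0:
        return None
    pattern = [0] + [1] * (num_values - 1)
    for start in range(0, len(values), num_values):
        if rep_levels[start:start + num_values] != pattern:
            return None
    return [values[i:i + num_values] for i in range(0, len(values), num_values)]
```

A file whose lists happen to vary in length would silently violate the hint, which is why the size has to be verified or guaranteed by the format itself.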

rok · 2024-10-22T14:00:10Z

> @rok based on the ML discussion we should add the fast path in the cases of polars, arrow and arrow-rs where we know the fixed size already (from the schema stored in the metadata or if it's provided by the consumer). This is more fragile and less universal, but maybe a good first step forward.

@alippai are you sure we have a strong enough consensus yet to start implementing fast paths? I would really like to have some more discussion before committing.

alippai · 2024-10-22T16:04:38Z

@rok Sorry, wrong phrasing. I meant that was the recommendation to explore on the ML and by @coastalwhite.

I didn’t see objections to adding this feature to the parquet format, or commitments for adding the fast path to any of the libraries (arrow cpp actually noted it’s a non-trivial part of the codebase).

rok · 2024-10-22T16:15:36Z

Sorry for my abundance of caution @alippai. I'll try to summarize this thread to the ML and ask for some more input ASAP. It would be nice to actually start some work on this.

tustvold · 2024-10-22T16:27:50Z

Some points in no particular order:

- The parquet schema is authoritative, with any other schema information merely a hint, this makes the notion of using the arrow schema, or something else to drive decode a little dubious
- The record shredding logic for lists is the single most complex, confusing and subtle aspect of any parquet reader, which:
  - Limits the pool of people who can implement / review such changes
  - Sets a very high bar for including such changes
- Even some optimal record shredding setup will never perform better than an implementation that can simply skip it entirely
- Both arrow-rs and polars exploit that the hybrid RLE is effectively a bitmask if the max definition level is only 1, this allows for very efficient decode. This isn't possible when there are repetition levels
- Performant record skipping, e.g. for predicate/index pushdown or late materialization, is not really possible against data with repetition levels ^1.
- Many readers have quirky support for repetition levels and lists in general, especially w.r.t areas where the specification has been ambiguous in the past (and some where it still is), finding ways for people to avoid these pain points seems valuable

That's all to say providing a way to encode fixed size lists seems like a very useful capability. That being said, it does seem to be a bit of a hack to make this a logical type, and will potentially limit the options for encodings, statistics, sort orders, etc... In particular the lack of dictionary encoding I could see being a non-trivial sacrifice.

1. In fact I think arrow-rs may be one of the few readers that actually implements it
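The remark that definition levels act as a bitmask when the max definition level is 1 can be sketched like this. The example works on already-decoded levels held in a Python list, which is a simplification: real readers operate on the RLE/bit-packed hybrid data directly. The function names and the LSB-first bit order are assumptions.

```python
def levels_to_bitmask(def_levels):
    """Turn definition levels (each 0 or 1) into an LSB-first validity bitmask."""
    mask = bytearray((len(def_levels) + 7) // 8)
    for i, level in enumerate(def_levels):
        if level not in (0, 1):
            raise ValueError("max definition level must be 1")
        if level == 1:
            mask[i // 8] |= 1 << (i % 8)
    return bytes(mask)


def scatter(values, def_levels):
    """Place the non-null values at the positions whose level is 1."""
    it = iter(values)
    return [next(it) if level == 1 else None for level in def_levels]
```

Once repetition levels are involved, one level no longer maps to one row slot, so this direct mapping is lost.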

tustvold reviewed May 15, 2024


etseidl reviewed May 15, 2024


JFinis reviewed May 16, 2024


wgtmac mentioned this pull request May 24, 2024

Thoughts about a first-class GEOMETRY data type in Parquet? opengeospatial/geoparquet#222

Open

rok requested review from tustvold, mapleFU and wgtmac June 12, 2024 15:43

rok and others added 4 commits June 20, 2024 01:14

Add FIXED_SIZE_LIST

41fca3f

Review feedback

4f12dd3

Update LogicalTypes.md

cb93b27

Co-authored-by: Ed Seidl <[email protected]>

Review feedback, split into FixedSizeListType and VariableSizeListType

83481f6

rok force-pushed the PARQUET-2474 branch from 2865642 to 83481f6 Compare June 19, 2024 23:15

rok marked this pull request as ready for review June 19, 2024 23:15

rok requested review from etseidl and JFinis June 19, 2024 23:16

etseidl reviewed Jun 20, 2024


asfimport mentioned this pull request Jun 5, 2024

[Format] Specify FIXED_SIZE_LIST Logical type #430

Open

rok mentioned this pull request Jun 24, 2024

GH-437: [Format] Specify VARIABLE_SIZE_LIST Logical type #438

Draft

3 tasks

rok added 2 commits June 24, 2024 11:07

Removing VariableSizeListType

471efc3

Review feedback

77651fd

rok requested a review from etseidl June 24, 2024 17:19
