Support comparison operators on nested data types (Struct, List, ..) #10856

Blizzara · 2024-06-10T15:06:20Z

Is your feature request related to a problem or challenge?

We're working on running some used-to-be-Spark pipelines through DataFusion. One case we've noticed where DataFusion doesn't support something is comparing lists. (Spark allows)[https://github.com/apache/spark/blame/d9394eee5ebbeb695baaec6122da2ed970842dfd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L1025] comparing (==, !=, <, >, <=, >=, ..) columns of structs and lists, while in DataFusion those seem to throw:

For structs, from our internal testing:

ArrowError(InvalidArgumentError("Invalid comparison operation: Struct([Field { name: \"a\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"b\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]) <= Struct([Field { name: \"a\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"b\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }])"), None)

For lists, this is shown in DataFusion's tests:

datafusion/datafusion/sqllogictest/test_files/array_query.slt

Line 44 in 3773fb7

    
           query error DataFusion error: Arrow error: Invalid argument error: Invalid comparison operation: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) == List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\)

Maybe this would need to be improved on Arrow directly, seeing that the error is coming from https://github.com/apache/arrow-rs/blob/087f34b70e97ee85e1a54b3c45c5ed814f500b0a/arrow-ord/src/cmp.rs#L219?

Describe the solution you'd like

Binary predicates to be allowed for structs and lists, preferably following same semantics as in Spark (mostly I think it's a DFS over all the fields https://github.com/apache/spark/blob/d9394eee5ebbeb695baaec6122da2ed970842dfd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala#L285)

Describe alternatives you've considered

No response

Additional context

Related to #2326

The text was updated successfully, but these errors were encountered:

Blizzara · 2024-06-10T15:50:42Z

Ah, #9254 is related, as is apache/arrow-rs#5411 and apache/arrow-rs#5407

jayzhan211 · 2024-06-24T01:27:03Z

Note that the comparison for nested type is not supported in arrow-rs apache/arrow-rs#5942, so we should implement them in datafusion.
First attempt #11091

Ideally we should make it configurable so we can support both Spark and Postgres like behaviour for nulls

Blizzara added the enhancement New feature or request label Jun 10, 2024

y-f-u mentioned this issue Jun 20, 2024

rewrite refresh append update dedup logic using arrow comparators spiceai/spiceai#1743

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support comparison operators on nested data types (Struct, List, ..) #10856

Support comparison operators on nested data types (Struct, List, ..) #10856

Blizzara commented Jun 10, 2024

Blizzara commented Jun 10, 2024

jayzhan211 commented Jun 24, 2024 •

edited

Loading

Support comparison operators on nested data types (Struct, List, ..) #10856

Support comparison operators on nested data types (Struct, List, ..) #10856

Comments

Blizzara commented Jun 10, 2024

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Blizzara commented Jun 10, 2024

jayzhan211 commented Jun 24, 2024 • edited Loading

jayzhan211 commented Jun 24, 2024 •

edited

Loading