Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support comparison operators on nested data types (Struct, List, ..) #10856

Open
Blizzara opened this issue Jun 10, 2024 · 2 comments
Open

Support comparison operators on nested data types (Struct, List, ..) #10856

Blizzara opened this issue Jun 10, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@Blizzara
Copy link
Contributor

Is your feature request related to a problem or challenge?

We're working on running some used-to-be-Spark pipelines through DataFusion. One case we've noticed where DataFusion doesn't support something is comparing lists. (Spark allows)[https://github.com/apache/spark/blame/d9394eee5ebbeb695baaec6122da2ed970842dfd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L1025] comparing (==, !=, <, >, <=, >=, ..) columns of structs and lists, while in DataFusion those seem to throw:

For structs, from our internal testing:

ArrowError(InvalidArgumentError("Invalid comparison operation: Struct([Field { name: \"a\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"b\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]) <= Struct([Field { name: \"a\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"b\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }])"), None)

For lists, this is shown in DataFusion's tests:

query error DataFusion error: Arrow error: Invalid argument error: Invalid comparison operation: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) == List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\)

Maybe this would need to be improved on Arrow directly, seeing that the error is coming from https://github.com/apache/arrow-rs/blob/087f34b70e97ee85e1a54b3c45c5ed814f500b0a/arrow-ord/src/cmp.rs#L219?

Describe the solution you'd like

Binary predicates to be allowed for structs and lists, preferably following same semantics as in Spark (mostly I think it's a DFS over all the fields https://github.com/apache/spark/blob/d9394eee5ebbeb695baaec6122da2ed970842dfd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala#L285)

Describe alternatives you've considered

No response

Additional context

Related to #2326

@Blizzara Blizzara added the enhancement New feature or request label Jun 10, 2024
@Blizzara
Copy link
Contributor Author

Ah, #9254 is related, as is apache/arrow-rs#5411 and apache/arrow-rs#5407

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jun 24, 2024

Note that the comparison for nested type is not supported in arrow-rs apache/arrow-rs#5942, so we should implement them in datafusion.
First attempt #11091

Ideally we should make it configurable so we can support both Spark and Postgres like behaviour for nulls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants