Improve comparison of struct fields #1839
arrow/src/datatypes/datatype.rs
Outdated
&& a.iter().zip(b).all(|(a, b)| {
a.is_nullable() == b.is_nullable()
&& a.data_type().equals_datatype(b.data_type())
a.iter().all(|a| {
I think this is not correct. equals_datatype compares data types for exact equality (except for metadata and nested field names). Technically speaking, two structs are equal only if the nested field data types are equal.
In apache/datafusion#2326, it seems you want to overcome some errors in column projection in DataFusion. Column projection only requires two structs to be "compatible", so nested field order can be ignored (because the fields will be "mapped" later). That is a looser notion of data type equality than the exact equality of equals_datatype.
So I don't think we should change equals_datatype. For the compatibility check, maybe we should have another method.
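To make the distinction concrete, here is a minimal sketch of the exact, position-sensitive comparison described above. `ToyType`, `ToyField`, `field`, and `equals_exact` are hypothetical stand-ins written for this illustration; they are not the arrow-rs types, and the real `equals_datatype` covers many more variants.

```rust
// Toy stand-ins for illustration only; these are NOT the arrow-rs types.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int32,
    Utf8,
    Struct(Vec<ToyField>),
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: String,
    data_type: ToyType,
    nullable: bool,
}

// Hypothetical convenience constructor.
fn field(name: &str, data_type: ToyType, nullable: bool) -> ToyField {
    ToyField { name: name.to_string(), data_type, nullable }
}

/// Exact comparison: struct children are paired by index, and both
/// nullability and child types must match. Nested field names are ignored.
fn equals_exact(a: &ToyType, b: &ToyType) -> bool {
    match (a, b) {
        (ToyType::Struct(a), ToyType::Struct(b)) => {
            a.len() == b.len()
                && a.iter().zip(b).all(|(a, b)| {
                    a.nullable == b.nullable
                        && equals_exact(&a.data_type, &b.data_type)
                })
        }
        // Non-struct types fall back to plain equality.
        _ => a == b,
    }
}
```

Under this model, a struct with fields (a: Int32, b: Utf8) equals one with fields (x: Int32, y: Utf8), because names are ignored, but does not equal (b: Utf8, a: Int32), because the children are paired by position.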
I agree that having another method would be good
Agreed, I've added a method compatible_datatype and reverted my changes to equals_datatype.
The new method checks whether two data types are compatible with each other, ignoring field order and length, and fails if a field is missing and not nullable. Array data type comparison will use equals_datatype instead of comparing the enums themselves.
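One possible reading of those rules, sketched on toy types rather than on the actual PR code: each expected field is looked up by name, a match must itself be compatible, and a missing field is tolerated only when it is nullable. The names `ToyType`, `ToyField`, `field`, and `compatible` are hypothetical, and the choice not to compare nullability of matched fields is an assumption of this sketch.

```rust
// Toy stand-ins for illustration only; these are NOT the arrow-rs types.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int32,
    Utf8,
    Struct(Vec<ToyField>),
}

#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: &'static str,
    data_type: ToyType,
    nullable: bool,
}

// Hypothetical convenience constructor.
fn field(name: &'static str, data_type: ToyType, nullable: bool) -> ToyField {
    ToyField { name, data_type, nullable }
}

/// Assumed semantics: every field of `expected` is looked up by name in
/// `actual`. A match must have a compatible type; a missing field is
/// accepted only if the expected field is nullable. Field order and extra
/// fields in `actual` are ignored.
fn compatible(expected: &ToyType, actual: &ToyType) -> bool {
    match (expected, actual) {
        (ToyType::Struct(expected), ToyType::Struct(actual)) => {
            expected.iter().all(|want| {
                match actual.iter().find(|have| have.name == want.name) {
                    Some(have) => compatible(&want.data_type, &have.data_type),
                    None => want.nullable,
                }
            })
        }
        // Non-struct types fall back to plain equality.
        _ => expected == actual,
    }
}
```

With an expected struct (a: Int32 non-nullable, b: Utf8 nullable), the reordered struct (b, a) and the shorter struct (a) are both accepted, while a struct containing only b is rejected because the non-nullable a is missing.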
48f025c to 24872b2
arrow/src/compute/kernels/concat.rs
Outdated
@@ -62,7 +62,7 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {

if arrays
.iter()
.any(|array| array.data_type() != arrays[0].data_type())
.any(|array| !array.data_type().equals_datatype(arrays[0].data_type()))
Does this need to be reverted too?
I think this should be changed to compatible_datatype (at least for my own use case).
Can arrays be concatenated as long as they are compatible?
I think our concat kernel doesn't currently map out-of-order nested fields. That is why it checks the data type with exact equality.
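A toy illustration of why a kernel that pairs children by position needs the exact check. `ToyStruct` and `concat_by_position` are hypothetical and only mimic the idea of positional concatenation; this is not the arrow-rs concat kernel.

```rust
// Toy model: a "struct array" is a list of named child columns.
// Illustration only; NOT the arrow-rs concat kernel.
type ToyStruct = Vec<(&'static str, Vec<i64>)>;

/// Concatenates children strictly by position and never consults names.
/// Returns None when the number of children differs.
fn concat_by_position(first: &ToyStruct, second: &ToyStruct) -> Option<ToyStruct> {
    if first.len() != second.len() {
        return None;
    }
    Some(
        first
            .iter()
            .zip(second)
            .map(|((name, left), (_, right))| {
                // The output keeps the first input's name for this slot.
                let mut values = left.clone();
                values.extend_from_slice(right);
                (*name, values)
            })
            .collect(),
    )
}
```

If the first input has children (a, b) and the second has (b, a), the values of the second input's b are appended to the output's a column. A looser, order-insensitive check in front of such a kernel would let that mix-up through silently.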
It is better to focus on the compatible datatype method in this PR. You can definitely propose a compatible concat kernel later.
Ok, reverted the concat change.
Codecov Report
@@ Coverage Diff @@
## master #1839 +/- ##
==========================================
- Coverage 83.48% 83.45% -0.04%
==========================================
Files 201 201
Lines 56838 56931 +93
==========================================
+ Hits 47452 47510 +58
- Misses 9386 9421 +35
I'm not sure about this, reordering fields in a StructArray is not compatible? To me this feels like a limitation of the SchemaAdapter logic in DataFusion which needs to:
Columns and fields are identified by index and not column name, and so they cannot be arbitrarily reordered? It also possibly relates to apache/datafusion#2581
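For illustration, resolving names to indices once and then reordering could look roughly like the sketch below. `field_mapping` and `reorder` are hypothetical helpers written for this example; they are not part of DataFusion's SchemaAdapter or of arrow-rs.

```rust
/// Hypothetical helper: for each target field name, find its position in
/// the source. Returns None if any target name is absent from the source.
fn field_mapping(target: &[&str], source: &[&str]) -> Option<Vec<usize>> {
    target
        .iter()
        .map(|name| source.iter().position(|candidate| candidate == name))
        .collect()
}

/// Reorders source columns into target order using a precomputed mapping.
/// Panics if the mapping holds an index outside `columns`.
fn reorder<T: Clone>(columns: &[T], mapping: &[usize]) -> Vec<T> {
    mapping.iter().map(|&i| columns[i].clone()).collect()
}
```

Once the columns are reordered this way, positional consumers see them in the order they expect, so no looser type comparison is needed at that point.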
This PR has been inactive for a while so closing to clear the backlog, please feel free to reopen if you come back to this
Currently, if the order of fields in a struct differs, the comparison fails.
This fix looks up fields, ignoring field order and length, and
fails if a missing field is not nullable.
Array data type comparison will use equals_datatype instead of
comparing enums.
This fix works towards closing apache/datafusion#2326.
@alamb