-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arrow: Don't copy the list/map when not needed #252
Conversation
Wrapping the list seems to introduce an odd behavior where `null` values are converted to an empty list `[]`. Resolves apache#251
.cast(..)
instead of wrapping a map/list
.cast(..)
instead of wrapping a map/listThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM! Thanks for the great work @Fokko.
I noticed one edge case that could still lead to the issue. I've found one workaround but not sure if it works for other edge cases
pyiceberg/io/pyarrow.py
Outdated
if isinstance(list_array, pa.ListArray) and value_array is not None: | ||
arrow_field = pa.list_(self._construct_field(list_type.element_field, value_array.type)) | ||
if isinstance(value_array, pa.StructArray): | ||
# Arrow does not allow reordering of fields, therefore we have to copy the array :( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm my understanding, another reason that we have to copy the array: Arrow also does not allow field-mismatch, which happens when we have an optional schema field and no values for that field in the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this limitation will make the issue remaining in this edge case (col_list array<struct<test:int>>
):
spark.sql(
f"""
CREATE TABLE {catalog_name}.default.test_table_empty_list_and_map (
col_list array<struct<test:int>>,
col_map map<int, int>
)
USING iceberg
TBLPROPERTIES (
'format-version'='1'
);
"""
)
spark.sql(
f"""
INSERT INTO {catalog_name}.default.test_table_empty_list_and_map
VALUES (null, null)
"""
)
def test_null_list_and_map(catalog: Catalog) -> None:
table_test_empty_list_and_map = catalog.load_table("default.test_table_empty_list_and_map")
arrow_table = table_test_empty_list_and_map.scan().to_arrow()
> assert arrow_table["col_list"].to_pylist() == [None]
E assert [[]] == [None]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems adding the following before line 1167
if list_array.is_null():
return None
can let me pass the above test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test-case, let me add that one 👍 I don't think the last fix is going to work, since not all values might be null 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed a fix, but I don't think we can solve this yet without a fix in Arrow upstream 😢 apache/arrow#38809
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this issue! @Fokko. I think we can merge this first and leave that issue open until the upstream fix release.
@HonahX I agree, thanks for merging. This PR is already an improvement over the previous situation, so I think it is good to have it in. |
Wrapping the list seems to introduce an odd behavior where
null
values are converted to an empty list[]
.Because the arrays in Arrow have a counter for the null values. When we create a new list by copying, the counter is not taken into account, and just empty lists/maps are injected instead.
Resolves #251