Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow: Don't copy the list/map when not needed #252

Merged
merged 10 commits into from
Jan 26, 2024
Merged

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Jan 5, 2024

Wrapping the list seems to introduce an odd behavior where null values are converted to an empty list [].

Because the arrays in Arrow have a counter for the null values. When we create a new list by copying, the counter is not taken into account, and just empty lists/maps are injected instead.

Resolves #251

Wrapping the list seems to introduce an odd behavior where
`null` values are converted to an empty list `[]`.

Resolves apache#251
@Fokko Fokko changed the title Arrow: Use case instead of wrapping a map/list Arrow: Use .cast(..) instead of wrapping a map/list Jan 9, 2024
@Fokko Fokko mentioned this pull request Jan 12, 2024
3 tasks
@Fokko Fokko changed the title Arrow: Use .cast(..) instead of wrapping a map/list Arrow: Don't copy the list when not needed Jan 23, 2024
@Fokko Fokko added this to the PyIceberg 0.6.0 release milestone Jan 23, 2024
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! Thanks for the great work @Fokko.

I noticed one edge case that could still lead to the issue. I've found one workaround but not sure if it works for other edge cases

if isinstance(list_array, pa.ListArray) and value_array is not None:
arrow_field = pa.list_(self._construct_field(list_type.element_field, value_array.type))
if isinstance(value_array, pa.StructArray):
# Arrow does not allow reordering of fields, therefore we have to copy the array :(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm my understanding, another reason that we have to copy the array: Arrow also does not allow field-mismatch, which happens when we have an optional schema field and no values for that field in the file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this limitation will make the issue remaining in this edge case (col_list array<struct<test:int>>):

spark.sql(
        f"""
    CREATE TABLE {catalog_name}.default.test_table_empty_list_and_map (
        col_list array<struct<test:int>>,
        col_map map<int, int>
    )
    USING iceberg
    TBLPROPERTIES (
        'format-version'='1'
    );
    """
    )

    spark.sql(
        f"""
    INSERT INTO {catalog_name}.default.test_table_empty_list_and_map
    VALUES (null, null)
    """
    )

 def test_null_list_and_map(catalog: Catalog) -> None:
        table_test_empty_list_and_map = catalog.load_table("default.test_table_empty_list_and_map")
        arrow_table = table_test_empty_list_and_map.scan().to_arrow()
>       assert arrow_table["col_list"].to_pylist() == [None]
E       assert [[]] == [None]

Copy link
Contributor

@HonahX HonahX Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems adding the following before line 1167

            if list_array.is_null():
                return None

can let me pass the above test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the test-case, let me add that one 👍 I don't think the last fix is going to work, since not all values might be null 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a fix, but I don't think we can solve this yet without a fix in Arrow upstream 😢 apache/arrow#38809

@Fokko Fokko requested a review from HonahX January 25, 2024 23:18
@Fokko Fokko changed the title Arrow: Don't copy the list when not needed Arrow: Don't copy the list/map when not needed Jan 25, 2024
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this issue! @Fokko. I think we can merge this first and leave that issue open until the upstream fix release.

@HonahX HonahX merged commit 2a27f2b into apache:main Jan 26, 2024
6 checks passed
@Fokko Fokko deleted the fd-bugsss branch January 26, 2024 13:46
@Fokko
Copy link
Contributor Author

Fokko commented Jan 26, 2024

@HonahX I agree, thanks for merging. This PR is already an improvement over the previous situation, so I think it is good to have it in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet column array<struct<>> with null value is read in as empty list
2 participants