
[Python] [C++] Offset Overflow when calling combine_chunks on Large Struct Arrays #44164

Open
Aryik opened this issue Sep 18, 2024 · 3 comments

Comments

@Aryik

Aryik commented Sep 18, 2024

Describe the bug, including details regarding any error messages, version, and platform.

I was getting the following error when trying to build a large polars DataFrame from a pyarrow table:

  File "/app/decision_engine/loader.py", line 897, in dataframe_from_dicts
    pl.from_arrow(arrow_chunks, schema=schema),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/decision-engine-9TtSrW0h-py3.12/lib/python3.12/site-packages/polars/convert/general.py", line 462, in from_arrow
    arrow_to_pydf(
  File "/root/.cache/pypoetry/virtualenvs/decision-engine-9TtSrW0h-py3.12/lib/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 1195, in arrow_to_pydf
    ps = plc.arrow_to_pyseries(name, column, rechunk=rechunk)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/decision-engine-9TtSrW0h-py3.12/lib/python3.12/site-packages/polars/_utils/construction/series.py", line 421, in arrow_to_pyseries
    pys = PySeries.from_arrow(name, array.combine_chunks())
                                    ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 754, in pyarrow.lib.ChunkedArray.combine_chunks
  File "pyarrow/array.pxi", line 4579, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

After some digging, I discovered that this error only occurs when calling combine_chunks on an array that is a struct of strings and whose total size exceeds some threshold. In my testing, the error appeared at around 3,668,663,880 bytes.

Previous bug reports about this offset overflow mostly involve very large individual strings. In this case, no individual string is larger than 2GB; instead, the error occurs once the total size crosses a threshold. Here is a minimal reproduction:

from typing import Any, Dict, List, Tuple

import pyarrow as pa


def get_arrow_table_and_chunks(
    dicts: List[Dict[str, Any]],
    schema_keys: List[str],
) -> Tuple[pa.Table, List[pa.RecordBatch]]:
    # Fill missing keys with None so every row has the same shape
    normalized_data = [
        {key: row.get(key, None) for key in schema_keys} for row in dicts
    ]

    # Convert to Arrow Table
    arrow_table = pa.Table.from_pydict(
        {key: [row[key] for row in normalized_data] for key in schema_keys}
    )

    print(f"Arrow Table schema: {arrow_table.schema}")

    # Provide chunks of the Arrow Table to polars. If we have too much data in a single chunk,
    # we get these strange errors:
    # pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
    arrow_chunks = arrow_table.to_batches(max_chunksize=10000)

    return arrow_table, arrow_chunks

# Per row, the struct column stores 10 string bytes + a 4-byte offset = 14 bytes
n_bytes_per_row = 14
desired_bytes = 3_668_663_888
n_rows = desired_bytes // n_bytes_per_row
placeholder_string = "1" * 10
# Careful - this uses ~150GB of memory and takes a long time
data = [ { "i": placeholder_string, "nested": { "nested_string": placeholder_string } } for _ in range(n_rows) ]
table, chunks = get_arrow_table_and_chunks(data, ["i", "nested"])
# Output:
# Arrow Table schema: i: string
# nested: struct<nested_string: string>
#   child 0, nested_string: string
table["nested"].nbytes
# Output: 3,668,663,880
table["nested"].combine_chunks()
# ---------------------------------------------------------------------------
# ArrowInvalid                              Traceback (most recent call last)
# Cell In[12], line 1
# ----> 1 table["nested"].combine_chunks()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/table.pxi:754, in pyarrow.lib.ChunkedArray.combine_chunks()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/array.pxi:4579, in pyarrow.lib.concat_arrays()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
# ArrowInvalid: offset overflow while concatenating arrays

When I called combine_chunks on the same amount of data without the nesting inside a struct, I did not get the error. I was able to reproduce this on macOS Sonoma 14.6.1 as well as in the python:3.12.3-slim Docker image, which is based on Debian 12.

Python version: 3.12.2
PyArrow version: 17.0.0
Polars version: 1.7.1

The error appears to be thrown from PutOffsets, called via Concatenate in concatenate.cc: https://github.com/apache/arrow/blob/apache-arrow-17.0.0/cpp/src/arrow/array/concatenate.cc#L166

Component(s)

C++, Python

@pitrou
Member

pitrou commented Sep 19, 2024

Previous bug reports with the offset overflow are mostly around very large strings. In this case, we don't have any individual string that is larger than 2GB. Instead, we get the error when we are above a certain total size.

This is expected anyway. The binary and string types in Arrow use 32-bit offsets into the character data, so a single string array whose total data exceeds 2 GiB is not possible.

You should either keep the chunks separate (i.e. don't call combine_chunks) or first convert your string columns to large_string (which uses 64-bit offsets).

@Aryik
Author

Aryik commented Sep 19, 2024

The call to combine_chunks works when they are top-level string arrays and only fails when they're nested inside a struct. Is that expected?

@pitrou
Member

pitrou commented Sep 19, 2024

Can you show the result of combining the top-level string arrays?
