Describe the bug, including details regarding any error messages, version, and platform.
I was getting the following error when trying to build a large polars DataFrame from a pyarrow table:
File "/app/decision_engine/loader.py", line 897, in dataframe_from_dicts
pl.from_arrow(arrow_chunks, schema=schema),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/pypoetry/virtualenvs/decision-engine-9TtSrW0h-py3.12/lib/python3.12/site-packages/polars/convert/general.py", line 462, in from_arrow
arrow_to_pydf(
File "/root/.cache/pypoetry/virtualenvs/decision-engine-9TtSrW0h-py3.12/lib/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 1195, in arrow_to_pydf
ps = plc.arrow_to_pyseries(name, column, rechunk=rechunk)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/pypoetry/virtualenvs/decision-engine-9TtSrW0h-py3.12/lib/python3.12/site-packages/polars/_utils/construction/series.py", line 421, in arrow_to_pyseries
pys = PySeries.from_arrow(name, array.combine_chunks())
^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 754, in pyarrow.lib.ChunkedArray.combine_chunks
File "pyarrow/array.pxi", line 4579, in pyarrow.lib.concat_arrays
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
After some digging, I discovered that this error only occurs when you call combine_chunks on an array that is a struct of strings, when the total size of the array is above some number of bytes. In my testing, I saw the error occur somewhere around 3,668,663,880 bytes.
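A rough back-of-the-envelope check (my own arithmetic, using the 14-bytes-per-row layout from the reproduction below, not anything printed in the traceback) suggests this threshold is where the nested string child's character data crosses the 2 GiB that 32-bit offsets can address:

    # 10-character string + 4-byte offset per row in the nested string child
    rows = 3_668_663_880 // 14      # = 262,047,420 rows at the observed threshold
    char_data = rows * 10           # = 2,620,474,200 bytes of actual string data
    print(char_data > 2**31 - 1)    # True - more than 32-bit offsets can address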
Previous bug reports with the offset overflow are mostly around very large strings. In this case, we don't have any individual string that is larger than 2GB. Instead, we get the error when we are above a certain total size. Here is a minimal reproduction:
from typing import Any, Dict, List, Tuple

import polars as pl
import pyarrow as pa


def get_arrow_table_and_chunks(
    dicts: List[Dict[str, Any]],
    schema_keys: List[str],
) -> Tuple[pa.Table, List[pa.RecordBatch]]:
    normalized_data = [
        {key: row.get(key, None) for key in schema_keys} for row in dicts
    ]
    # Convert to Arrow Table
    arrow_table = pa.Table.from_pydict(
        {key: [row[key] for row in normalized_data] for key in schema_keys}
    )
    print(f"Arrow Table schema: {arrow_table.schema}")
    # Provide chunks of the Arrow Table to polars. If we have too much data in a single chunk,
    # we get these strange errors:
    # pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
    arrow_chunks = arrow_table.to_batches(max_chunksize=10000)
    return arrow_table, arrow_chunks


n_bytes_per_row = 14
desired_bytes = 3_668_663_888
n_rows = desired_bytes // n_bytes_per_row
placeholder_string = "1" * 10

# Careful - this uses ~150GB of memory and takes a long time
data = [
    {"i": placeholder_string, "nested": {"nested_string": placeholder_string}}
    for _ in range(n_rows)
]

table, chunks = get_arrow_table_and_chunks(data, ["i", "nested"])
# Output:
# Arrow Table schema: i: string
# nested: struct<nested_string: string>
#   child 0, nested_string: string

table["nested"].nbytes
# Output: 3,668,663,880

table["nested"].combine_chunks()
# ---------------------------------------------------------------------------
# ArrowInvalid                              Traceback (most recent call last)
# Cell In[12], line 1
# ----> 1 table["nested"].combine_chunks()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/table.pxi:754, in pyarrow.lib.ChunkedArray.combine_chunks()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/array.pxi:4579, in pyarrow.lib.concat_arrays()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
# File ~/Library/Caches/pypoetry/virtualenvs/decision-engine-LFRtXiTt-py3.12/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
# ArrowInvalid: offset overflow while concatenating arrays
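For reference, the same limit can be hit with far less memory by building the struct chunks directly in pyarrow. This is a sketch with sizes chosen for illustration (two chunks of roughly 1.2 GB of string data each), not part of the original failure:

    import pyarrow as pa

    # Two chunks of struct<nested_string: string>, each holding ~1.2 GB of string data
    s = pa.array(["x" * 1024] * 1_200_000)
    chunk = pa.StructArray.from_arrays([s], names=["nested_string"])
    chunked = pa.chunked_array([chunk, chunk])

    # Concatenating the child string arrays would need ~2.4 GB of contiguous
    # character data, which no longer fits in 32-bit offsets:
    chunked.combine_chunks()
    # pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays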
When I called combine_chunks on the same amount of data without the struct nesting, I did not get the error. I was able to reproduce this on macOS Sonoma 14.6.1 as well as on the python:3.12.3-slim Docker image, which is based on Debian 12.
Python version: 3.12.2
PyArrow version: 17.0.0
Polars version: 1.7.1

The error appears to be thrown from PutOffsets via Concatenate in concatenate.cc:
https://github.com/apache/arrow/blob/apache-arrow-17.0.0/cpp/src/arrow/array/concatenate.cc#L166

Component(s)
C++, Python

Previous bug reports with the offset overflow are mostly around very large strings. In this case, we don't have any individual string that is larger than 2GB. Instead, we get the error when we are above a certain total size.

This is expected. The binary and string types in Arrow store their offsets as 32-bit integers, so a single string array holding more than 2 GiB of character data is not possible.

You should either keep the chunks separate (i.e. don't call combine_chunks), or first convert your string column to large_string (which uses 64-bit offsets).
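For anyone hitting this from the polars path above, a sketch of the large_string route, assuming the table from the reproduction (the field names are taken from that example; whether a struct-to-struct cast like this is available depends on the pyarrow version, so treat it as illustrative rather than a confirmed fix):

    import pyarrow as pa
    import polars as pl

    # Widen the nested string child to 64-bit offsets before anything rechunks it
    large_nested = pa.struct([pa.field("nested_string", pa.large_string())])
    safe_table = table.set_column(
        table.schema.get_field_index("nested"),
        "nested",
        table["nested"].cast(large_nested),
    )
    # The flat "i" column could be cast to pa.large_string() the same way
    # if its total character data also grows past 2 GiB.

    df = pl.from_arrow(safe_table)  # combining chunks no longer overflows 32-bit offsets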