Add string as fallback index type when writing data (#904)
`pa.from_numpy_dtype` fails when passing an object dtype, which Dask
uses when the dtype is not explicitly known. I propose to fall back on
`string` if this is the case. This will probably be correct for the
index column in most cases when the `dtype` is unknown, and non-breaking
in most other cases.

---------

Co-authored-by: Matthias Richter <[email protected]>
RobbeSneyders and mrchtr authored Mar 12, 2024
1 parent a336917 commit acb6d0a
Showing 1 changed file with 11 additions and 1 deletion.
12 changes: 11 additions & 1 deletion src/fondant/component/data_io.py
@@ -214,9 +214,19 @@ def _write_dataframe(self, dataframe: dd.DataFrame) -> None:

         # The id needs to be added explicitly since we will convert this to a PyArrow schema
         # later and use it in the `pandas.to_parquet` method.
+        try:
+            index_type = pa.from_numpy_dtype(dataframe.index.dtype)
+        except pa.lib.ArrowNotImplementedError:
+            # The dtype of the index is `np._object`. Fall back on string instead.
+            logging.warning(
+                "Failed to infer dtype of index column, falling back to `string`. "
+                "Specify the dtype explicitly to prevent this.",
+            )
+            index_type = pa.string()
+
         schema.update(
             {
-                "id": pa.from_numpy_dtype(dataframe.index.dtype),
+                "id": index_type,
             },
         )
