-
-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add support for writing more ExtensionArray/PyArrow types #257
base: main
Are you sure you want to change the base?
Conversation
d682e3d
to
7769bfd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution!
I think it might be worth checking the performance of passing down the ArrowExtensionArray as is to the cython code, versus passing down the equivalent numpy array (at least for simple dtypes like int and float). Given that we repeatedly index into that array while iterating over the records, it might be that indexing into a numpy array is faster and it could be worth it to actually do the conversion.
One difficulty is actually getting that numpy array. Pandas has a pyarrow_array_to_numpy_and_mask
utility function. Maybe that could be used through __from_arrow__
, so one option could be something like:
In [33]: s = pd.Series([1, 2, None], dtype="int64[pyarrow]")
In [34]: pa_arr = s.array._pa_array
In [35]: dtype_mapping = {
...: pa.int8(): pd.Int8Dtype(),
...: pa.int16(): pd.Int16Dtype(),
...: pa.int32(): pd.Int32Dtype(),
...: pa.int64(): pd.Int64Dtype(),
...: pa.uint8(): pd.UInt8Dtype(),
...: pa.uint16(): pd.UInt16Dtype(),
...: pa.uint32(): pd.UInt32Dtype(),
...: pa.uint64(): pd.UInt64Dtype(),
...: pa.bool_(): pd.BooleanDtype(),
...: pa.string(): pd.StringDtype(),
...: pa.float32(): pd.Float32Dtype(),
...: pa.float64(): pd.Float64Dtype(),
...: }
In [36]: masked_arr = dtype_mapping[pa_arr.type].__from_arrow__(pa_arr)
In [37]: masked_arr
Out[37]:
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
In [38]: masked_arr._data
Out[38]: array([1, 2, 0])
In [39]: masked_arr._mask
Out[39]: array([False, False, True])
(I am not sure there is an official public method to get the pyarrow Array from the ArrowExtensionArray)
Also for the StringArray (which is currently already working), I expect a single conversion to numpy object array (i.e. converting to Python strings in one go) will be more efficient than iteratively getting a Python string one at a time.
Probably just |
Thank you for telling me about Now that I though more about how to implement this efficiently, I wonder if just making the Cython code directly support a So I think the following top-level logic would be needed:
The Cython code can do one of 3 things to access the values:
If this sounds reasonable I can try to implement this alternative instead. |
This PR improves the ExtensionArray writing support from #232 to handle other array types (including PyArrow ones).
Before it, such fields were often silently and unexpectedly casted to
object
dtype and then toOFTString
.Instead of casting the data array to a basic numpy dtype and then using the data array dtype to infer the OGR type in the Cython code, the high-level code now infers the equivalent numpy dtype and passes it into Cython, leaving the ExtensionArray with data as-is. As a side benefit, field data arrays are now never copied during writing.
Addition of
DTYPE_OGR_FIELD_TYPES["string"]
allows theinfer_field_types
function to handle the non-ambiguous string data case, where no additional type inference is needed.