This is not a feature request per se; rather, it tracks the future possibility of ingesting / exporting NumPy 2 varlen strings.
I took a brief glance at this again today (it's amazing how quickly this stuff fades once you're not doing it every day), and it's clear that right now we have some work ahead of us if we want to ingest these strings into Awkward.
NumPy's choice to have each string be its own arena-allocated object means that there's no trivial way to ask for a single flat buffer of UTF-8 code units. I only spent a few minutes looking at this, and so far it seems we can probably use the NumPy C API to produce a flat buffer without first converting the strings to UTF-32. This conversion would need to iterate over every string object and fill a buffer.
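As a rough illustration of the "iterate over every string and fill a buffer" step, here is a minimal pure-Python sketch. It is not the NumPy C API and not Awkward's implementation: `pack_utf8` is a hypothetical helper name, and it takes any iterable of Python `str` as a stand-in for the per-element loop that a C-level version would run over the array.

```python
def pack_utf8(strings):
    """Pack an iterable of str into (offsets, data).

    Hypothetical helper: `data` is one flat buffer of UTF-8 code units and
    `offsets[i]:offsets[i + 1]` delimits the i-th string, which is the
    offsets-plus-content layout that Arrow and Awkward use for strings.
    """
    offsets = [0]
    data = bytearray()
    for s in strings:
        # Encode straight to UTF-8; no fixed-width UTF-32 intermediate.
        data += s.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)


offsets, data = pack_utf8(["héllo", "", "wörld"])
```

Every element is visited once and its bytes are copied, so the result owns its data and does not alias anything NumPy manages.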
In the return direction, I don't think we can lean on the simple slice-based view that we have internally. The C API for NumPy varlen strings is opaque with respect to the allocators, so we would need to exactly reverse the ingest method (i.e. write each substring using the C API).
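The reverse direction can be sketched the same way. Again this is only an illustration in pure Python: `unpack_utf8` is a hypothetical name, and appending to a list stands in for writing each substring into the NumPy array through the C API.

```python
def unpack_utf8(offsets, data):
    """Rebuild individual strings from a flat UTF-8 buffer and its offsets.

    Hypothetical helper: each substring is sliced out and handed over one at
    a time, because the destination cannot simply adopt the flat buffer.
    """
    out = []
    for start, stop in zip(offsets[:-1], offsets[1:]):
        # One slice, one decode, one write per element.
        out.append(data[start:stop].decode("utf-8"))
    return out


strings = unpack_utf8([0, 6, 6, 12], "héllowörld".encode("utf-8"))
```

The per-element loop is the point: a zero-copy slice view is not available when the destination's allocator is opaque, so each string has to be written individually.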
NumPy's decision to allocate strings within an arena makes sense: it allows strings to be replaced in-place, as NumPy users are accustomed to modifying arrays. It's different enough from the Arrow/Awkward way of representing strings that there will be cases in which it's favorable to use one, rather than the other.
Since the arenas are managed internally by NumPy and it can change the data at any time, converting between NumPy strings and Awkward strings would almost certainly be a copy. [1] You're right that we'd want to avoid an additional conversion from UTF-8 to UTF-32 and back to UTF-8, and we'd also want to prevent NumPy from padding the intermediate strings.
This might be possible through the C API, but we'd want to avoid compiling against NumPy or depending on a particular version of NumPy. Socially, it might make more sense to get either NumPy or Arrow to build in conversion routines between NumPy and Arrow, and then we can view the Arrow data as Awkward. Both the NumPy and Arrow projects are primarily extension modules (a larger part of the codebase is C or C++ than Python), whereas awkward-cpp is a small part of Awkward.
Footnotes
1. If the NumPy string data are contiguous UTF-8, it's in principle possible to wrap them as a ListArray content (with new offsets), but it would probably be dangerous to do so, unless NumPy makes strong guarantees about not changing its own arenas.