Support Numpy 2 varlen strings #3170

agoose77 · 2024-06-27T11:00:15Z

Description of new feature

This is not a feature request per se; rather, it's tracking the future possibility of ingesting / exporting NumPy 2 varlen strings.

I took a brief glance at this again today (it's amazing how quickly this stuff fades once you're not doing it every day), and it's clear that right now we have some work ahead of us if we want to ingest these strings into Awkward.

NumPy's choice to have each string be its own arena-allocated object means that there's no trivial way to ask for a single flat buffer of UTF-8 code units. I only spent a few minutes looking at this, but so far it seems we can probably use the NumPy C API to produce a flat buffer without first converting the strings to UTF-32. Building that flat buffer would still need to iterate over every string object and copy its bytes into the buffer.
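To make the "iterate over every string and fill a buffer" step concrete, here is a minimal Python-level sketch. It is not the C API route discussed above: it goes through Python `str` objects, so it only illustrates the target layout (int64 offsets plus one flat UTF-8 byte buffer). The function name `flatten_utf8` is hypothetical, and the input is assumed to be any iterable of `str`, which is what iterating a NumPy 2 `StringDType` array yields.

```python
from array import array


def flatten_utf8(strings):
    """Pack an iterable of str into (offsets, data).

    offsets: array of signed 64-bit ints, length n + 1
    data:    bytes holding all strings' UTF-8 code units back to back
    (hypothetical helper, for illustration only)
    """
    offsets = array("q", [0])
    data = bytearray()
    for s in strings:
        data += s.encode("utf-8")   # copy this string's bytes into the flat buffer
        offsets.append(len(data))   # end of this string == start of the next
    return offsets, bytes(data)


offsets, data = flatten_utf8(["hello", "", "naïve", "日本"])
print(list(offsets))  # [0, 5, 5, 11, 17]
```

The offsets count bytes, not characters, which is why "naïve" spans six bytes. On the Awkward side, a pair like this is the shape that a string-typed `ListOffsetArray` over a `NumpyArray` of `uint8` expects, assuming the buffers are first wrapped as NumPy arrays.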

In the return direction, I don't think we can lean on the simple slice-based view that we have internally. The C API for NumPy varlen strings is opaque with respect to the allocators, so we would need to exactly reverse the ingest method (i.e. write each substring using the C API).
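The reverse direction can be sketched the same way, again at the Python level rather than through the C API. The helper name `unflatten_utf8` is hypothetical; it assumes an offsets sequence of length n + 1 and a flat UTF-8 byte buffer, and it decodes each substring individually, which mirrors the "write each substring" step.

```python
def unflatten_utf8(offsets, data):
    """Slice a flat UTF-8 buffer back into a list of str.

    offsets: sequence of n + 1 byte positions
    data:    bytes-like object with the concatenated UTF-8 code units
    (hypothetical helper, for illustration only)
    """
    return [
        bytes(data[start:stop]).decode("utf-8")
        for start, stop in zip(offsets[:-1], offsets[1:])
    ]


flat = "hello".encode("utf-8") + "naïve".encode("utf-8") + "日本".encode("utf-8")
strings = unflatten_utf8([0, 5, 5, 11, 17], flat)
print(strings)  # ['hello', '', 'naïve', '日本']
```

With NumPy 2 installed, the resulting list could then be handed to `np.array(strings, dtype=np.dtypes.StringDType())`, which lets NumPy place each string in its own arena. Every string is copied, as expected from the discussion above.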

jpivarski · 2024-06-27T14:19:27Z

NumPy's decision to allocate strings within an arena makes sense: it allows strings to be replaced in-place, as NumPy users are accustomed to modifying arrays. It's different enough from the Arrow/Awkward way of representing strings that there will be cases in which it's favorable to use one, rather than the other.

Since the arenas are managed internally by NumPy and it can change the data at any time, converting between NumPy strings and Awkward strings would almost certainly be a copy.¹ You're right that we'd want to avoid an additional conversion from UTF-8 to UTF-32 and back to UTF-8, and we'd also want to prevent NumPy from padding the intermediate strings to a fixed width.

This might be possible through the C API, but we'd want to avoid compiling against NumPy or depending on a particular version of NumPy. Socially, it might make more sense to get either NumPy or Arrow to build in conversion routines between NumPy and Arrow, and then we can view the Arrow data as Awkward. Both the NumPy and Arrow projects are primarily extension modules (a larger part of the codebase is C or C++ than Python), whereas awkward-cpp is a small part of Awkward.

¹ If the NumPy string data are contiguous UTF-8, it's in principle possible to wrap it as a ListArray content (with new offsets), but it would probably be dangerous to do so, unless NumPy makes strong guarantees about not changing its own arenas.

agoose77 · 2024-06-27T16:23:09Z

> Socially, it might make more sense to get either NumPy or Arrow to build in conversion routines between NumPy and Arrow

Yes! I had the same thought.

agoose77 added the feature New feature or request label Jun 27, 2024
