Support Numpy 2 varlen strings #3170

Open
agoose77 opened this issue Jun 27, 2024 · 2 comments
Labels
feature New feature or request

Comments

@agoose77
Collaborator

Description of new feature

This is not a feature request per se; rather, it tracks the future possibility of ingesting and exporting NumPy 2 varlen strings.

I took a brief glance at this again today (it's amazing how quickly this stuff fades once you're not doing it every day), and it's clear that right now we have some work ahead of us if we want to ingest these strings into Awkward.

NumPy's choice to have each string be its own arena-allocated object means that there's no trivial way to ask for a single flat buffer of UTF-8 code units. I only spent a few minutes looking at this, but so far it seems we can probably use the NumPy C API to produce a flat buffer without first converting the strings to UTF-32. This conversion would need to iterate over every string object and fill a buffer.
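To make the "iterate over every string and fill a buffer" step concrete, here is a minimal pure-Python sketch of the target layout: one flat UTF-8 buffer plus an offsets array. The function name `strings_to_flat_utf8` is hypothetical, and the input is any iterable of Python `str`. This only illustrates the layout; it is not the C API route discussed above.

```python
from array import array


def strings_to_flat_utf8(strings):
    """Pack an iterable of str into (offsets, data).

    offsets has one more entry than there are strings;
    string i occupies data[offsets[i]:offsets[i + 1]].
    (Hypothetical helper, for illustration only.)
    """
    offsets = array("q", [0])  # signed 64-bit offsets, starting at 0
    data = bytearray()         # flat buffer of UTF-8 code units
    for s in strings:
        data += s.encode("utf-8")   # append this string's bytes
        offsets.append(len(data))   # record where the string ends
    return offsets, bytes(data)


offsets, data = strings_to_flat_utf8(["héllo", "", "wörld!"])
print(list(offsets))  # [0, 6, 6, 13]
```

Note that the offsets count bytes, not characters: `"héllo"` has five characters but six UTF-8 code units.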

In the return direction, I don't think we can lean on the simple slice-based view that we have internally. The C API for NumPy varlen strings is opaque with respect to the allocators, so we would need to exactly reverse the ingest method (i.e. write each substring using the C API).
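The reverse direction can be sketched the same way: walk the offsets, slice each substring out of the flat buffer, and hand it over one at a time. In this pure-Python sketch the hypothetical `flat_utf8_to_strings` simply collects decoded `str` objects; in a real export, each slice would instead be written into NumPy's storage through its C API, which is not shown here.

```python
def flat_utf8_to_strings(offsets, data):
    """Rebuild individual strings from (offsets, data).

    Each substring is copied out and decoded separately, mirroring
    the per-string write that an opaque allocator would require.
    (Hypothetical helper, for illustration only.)
    """
    view = memoryview(data)  # avoids copying the whole buffer up front
    out = []
    for start, stop in zip(offsets[:-1], offsets[1:]):
        out.append(bytes(view[start:stop]).decode("utf-8"))
    return out


data = "héllowörld!".encode("utf-8")
strings = flat_utf8_to_strings([0, 6, 6, 13], data)
print(strings)  # ['héllo', '', 'wörld!']
```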

@agoose77 agoose77 added the feature New feature or request label Jun 27, 2024
@jpivarski
Member

NumPy's decision to allocate strings within an arena makes sense: it allows strings to be replaced in-place, as NumPy users are accustomed to modifying arrays. It's different enough from the Arrow/Awkward way of representing strings that there will be cases in which it's favorable to use one, rather than the other.

Since the arenas are managed internally by NumPy and it can change the data at any time, converting between NumPy strings and Awkward strings would almost certainly be a copy.[1] You're right that we'd want to avoid an additional conversion from UTF-8 to UTF-32 and back to UTF-8, and we also want to prevent NumPy from padding the intermediate strings.
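A rough back-of-the-envelope sketch of why the padded UTF-32 intermediate is worth avoiding: a fixed-width array stores four bytes per code point and pads every entry to the longest string, whereas a flat UTF-8 buffer stores only the bytes each string needs. The helper names below are hypothetical and the sizes ignore offsets and any per-array overhead.

```python
def utf32_padded_nbytes(strings):
    """Bytes for a fixed-width UTF-32 layout padded to the longest string."""
    width = max((len(s) for s in strings), default=0)
    return 4 * width * len(strings)


def utf8_flat_nbytes(strings):
    """Bytes for a single flat UTF-8 buffer (offsets not counted)."""
    return sum(len(s.encode("utf-8")) for s in strings)


words = ["a", "bb", "a much longer string"]
print(utf32_padded_nbytes(words))  # 240
print(utf8_flat_nbytes(words))     # 23
```

One long string among many short ones is enough to inflate the padded intermediate by an order of magnitude.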

This might be possible through the C API, but we'd want to avoid compiling against NumPy or depending on a particular version of NumPy. Socially, it might make more sense to get either NumPy or Arrow to build in conversion routines between NumPy and Arrow, and then we can view the Arrow data as Awkward. Both the NumPy and Arrow projects are primarily extension modules (a larger part of the codebase is C or C++ than Python), whereas awkward-cpp is a small part of Awkward.

Footnotes

  1. If the NumPy string data are contiguous UTF-8, it's in principle possible to wrap it as a ListArray content (with new offsets), but it would probably be dangerous to do so, unless NumPy makes strong guarantees about not changing its own arenas.
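The danger of wrapping someone else's buffer without copying can be illustrated with a pure-Python stand-in. Here a `bytearray` named `arena` plays the role of memory owned by another library (this is not NumPy's actual arena, just an analogy), and `decode` is a hypothetical helper. A zero-copy view silently changes when the owner modifies its data in place; a copy does not.

```python
def decode(buffer, offsets):
    """Decode each offsets-delimited slice of buffer as UTF-8."""
    return [
        bytes(buffer[a:b]).decode("utf-8")
        for a, b in zip(offsets[:-1], offsets[1:])
    ]


arena = bytearray(b"catdog")  # stand-in for memory owned by someone else
offsets = [0, 3, 6]

view = memoryview(arena)      # zero-copy wrap: shares the owner's memory
copied = bytes(arena)         # independent copy taken up front

arena[0:3] = b"cow"           # the owner replaces a string in place

print(decode(view, offsets))    # ['cow', 'dog']
print(decode(copied, offsets))  # ['cat', 'dog']
```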

@agoose77
Collaborator Author

> Socially, it might make more sense to get either NumPy or Arrow to build in conversion routines between NumPy and Arrow

Yes! I had the same thought.
