Implement iterator for StructArray #593

Open
WillAyd opened this issue Aug 20, 2024 · 4 comments

@WillAyd
Contributor

WillAyd commented Aug 20, 2024

Right now, if you use the nanoarrow C++ library to read a stream of two-dimensional objects, you would:

  1. Use the ViewArrayStream class to iterate the stream
  2. Manually loop over each schema / array in the StructArray
  3. Use the ViewArrayAs class to iterate the individual array views

Not sure if this falls in the scope of nanoarrow or if it's something for sparrow, but I think it would make sense to add an iterator for step 2.

@paleolimbot
Member

I am curious exactly what type of struct you're iterating over and what you're trying to produce? There are a number of strategies for doing this depending on what you need (and if you know at compile time what your types are).

@WillAyd
Contributor Author

WillAyd commented Aug 21, 2024

Here's some code that I specifically crafted in my last video on nanoarrow:

https://github.com/WillAyd/bearly/blob/23f4a84095811047d8a319101c9b437fa11ba349/src/bearly/bearly_ext.cc#L32

Essentially, I go through the stream of a pa.Table, go column by column, and then iterate the values within each column. The stream and array value iteration already have C++ iterators, but the column-by-column iteration is a classic loop:

  for (const auto &chunk : array_stream) {
    for (decltype(schema->n_children) i = 0; i < schema->n_children; ++i) {
      nanoarrow::UniqueArrayView array_view;
      // check the init result too, not only the SetArray call below
      NANOARROW_THROW_NOT_OK(ArrowArrayViewInitFromSchema(
          array_view.get(), schema->children[i], &error));
      NANOARROW_THROW_NOT_OK(
          ArrowArrayViewSetArray(array_view.get(), chunk.children[i], &error));

      for (const auto value :
           nanoarrow::ViewArrayAs<int64_t>(array_view.get())) {
        // do something with the values of each array here
      }
    }
  }

@paleolimbot
Member

I am probably the wrong person to ask here since I don't mind classic loops and the iteration that I usually have to do is to convert between row-oriented systems and Arrow (e.g., database drivers). There is definitely appetite to interact with Arrow from C++ and I'm not sure I have the answers about the scope of that or if nanoarrow is the right place!

Tiny nit: you can reuse the same nanoarrow::UniqueArrayView for every array in the stream (e.g., initialize it before the array stream loop and just call ArrowArrayViewSetArray for each one). This is probably only meaningful for large numbers of very small arrays (or if there are a lot of columns).

@WillAyd
Contributor Author

WillAyd commented Aug 21, 2024

Yeah, if you do row-oriented iteration I think there is less value. Maybe there should be a way to differentiate how you want to iterate?

For column iteration, I think something of the form:

  for (const auto &chunk : array_stream) {
    for (const auto& [schema_view, array_view] : chunk.Columns()) {
      // maybe do something with the schema here, like init an ArrowDecimal from precision / scale
      for (const auto value :
           nanoarrow::ViewArrayAs<int64_t>(array_view.get())) {
        // do something with the values of each array here
      }
    }
  }

would make for an idiomatic C++ solution.
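
To make the shape of that proposal concrete, here is a minimal sketch of what a column range could look like. Nothing in it is nanoarrow API: `Columns`, `Column`, `DemoSchema`, `DemoArray`, and `TotalChildLength` are hypothetical names, and the two structs are trimmed stand-ins that only borrow the `length`, `n_children`, and `children` members from the Arrow C data interface structs so the example compiles on its own. It yields plain schema/array pointer pairs rather than initialized array views, so it covers only the iteration part of the idea.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <utility>

// Trimmed stand-ins (hypothetical) for ArrowSchema / ArrowArray; only the
// members this sketch touches are kept.
struct DemoSchema {
  const char* name;
  int64_t n_children;
  DemoSchema** children;
};

struct DemoArray {
  int64_t length;
  int64_t n_children;
  DemoArray** children;
};

// One column: the child schema paired with the matching child array.
using Column = std::pair<const DemoSchema*, const DemoArray*>;

// Hypothetical range over the children of a struct schema/array pair.
// Assumes schema and array describe the same struct, so both have the
// same number of children.
class Columns {
 public:
  class iterator {
   public:
    using iterator_category = std::input_iterator_tag;
    using value_type = Column;
    using difference_type = std::ptrdiff_t;
    using pointer = const Column*;
    using reference = Column;

    iterator(const DemoSchema* schema, const DemoArray* array, int64_t i)
        : schema_(schema), array_(array), i_(i) {}

    // Returns the pair by value; the pointers stay owned by the parent.
    Column operator*() const {
      return {schema_->children[i_], array_->children[i_]};
    }
    iterator& operator++() {
      ++i_;
      return *this;
    }
    bool operator==(const iterator& other) const { return i_ == other.i_; }
    bool operator!=(const iterator& other) const { return i_ != other.i_; }

   private:
    const DemoSchema* schema_;
    const DemoArray* array_;
    int64_t i_;
  };

  Columns(const DemoSchema* schema, const DemoArray* array)
      : schema_(schema), array_(array) {}

  iterator begin() const { return iterator(schema_, array_, 0); }
  iterator end() const {
    return iterator(schema_, array_, schema_->n_children);
  }

 private:
  const DemoSchema* schema_;
  const DemoArray* array_;
};

// Example use: sum the lengths of every column in one chunk.
inline int64_t TotalChildLength(const DemoSchema* schema,
                                const DemoArray* chunk) {
  int64_t total = 0;
  for (const auto& [child_schema, child_array] : Columns(schema, chunk)) {
    (void)child_schema;  // a real caller might read type metadata here
    total += child_array->length;
  }
  return total;
}
```

A version built on the real structs would presumably also set up an array view per child inside `operator*` or hand back something that does so lazily; whether that belongs in nanoarrow is the open question of this issue.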
