Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: String/Binary View Support #596

Merged
merged 32 commits into from
Sep 24, 2024
Merged

Conversation

WillAyd
Copy link
Contributor

@WillAyd WillAyd commented Aug 23, 2024

closes #583

@WillAyd
Copy link
Contributor Author

WillAyd commented Aug 23, 2024

Hmm looks like the Python process whereby we generate the pxd file does not play nice with the union containing nested struct declarations

@paleolimbot
Copy link
Member

I'm on vacation at the moment but I believe you just have to give the anonymous structures names (i.e. declare them as top level types). This is how you would have to do it in cython anyway i believe!

@WillAyd
Copy link
Contributor Author

WillAyd commented Aug 25, 2024

Sounds good. Enjoy the time away!

@WillAyd WillAyd force-pushed the string-view branch 4 times, most recently from 0b11479 to 00fbf17 Compare September 4, 2024 23:28
@WillAyd WillAyd marked this pull request as ready for review September 4, 2024 23:36
@WillAyd WillAyd marked this pull request as draft September 5, 2024 01:50
@WillAyd
Copy link
Contributor Author

WillAyd commented Sep 5, 2024

@paleolimbot started to look at this in more earnest to get a full round trip, instead of just a read. The current read implemenation also only assumes one variadic buffer.

I see the Arrow specification says that binary view types have an extra member which is a buffer containing the lengths of the variadic buffers:

https://arrow.apache.org/docs/format/CDataInterface.html#binary-view-arrays

Do you think its worth adding that to the ArrowArray struct, or should we create a new struct, potentially with a layout like:

struct ArrowBinaryViewArray {
  struct ArrowArray array_data;
  // Variadic buffer lengths for Binary View arrays
  int64_t* variadic_buffer_lengths;
};

So that it can be "safely" cast to/from ArrowArray pointers?

@WillAyd
Copy link
Contributor Author

WillAyd commented Sep 5, 2024

With that approach the one thing I still wonder is where we should store the length of variadic_buffer_lengths. I didn't see anything for that in the specification, so maybe using a NULL sentinel to denote the end of the array is the requirement?

@paleolimbot
Copy link
Member

started to look at this in more earnest to get a full round trip, instead of just a read

Awesome!

Do you think its worth adding that to the ArrowArray struct, or should we create a new struct, potentially with a layout like:

I think the ArrowArrayView is the right place for this! From Joris' PR:

int64_t n_varidic_buffers;
int64_t* variadic_buffer_sizes;
const void** variadic_buffers;

I see the Arrow specification says that binary view types have an extra member

I think this refers to an extra buffer (not an extra struct member):

https://github.com/apache/arrow-nanoarrow/pull/367/files#r1459338465

@WillAyd
Copy link
Contributor Author

WillAyd commented Sep 5, 2024

Ah OK interesting. Do you know if this is confirmed to be working upstream in Arrow? The reason I ask is that when running the test suite for what's already in this PR, when I hit the C++ Export method:

https://github.com/apache/arrow/blob/c2123b8b90ab952f854912459bb33ebaf0d99611/cpp/src/arrow/c/bridge.cc#L661

The n_buffers being assigned is just 3. I was (likely mistakenly) expecting that to be 4 to account for the validity buffer, inline/prefix buffer, variadic buffer(s), and a buffer that holds the sizes to all variadic buffers

The Arrow docs also read a little vague to me:

int64_t ArrowArray.n_buffers

Mandatory. The number of physical buffers backing this array. The number of buffers is a function of the data type, as described in the Columnar format specification, except for the the binary or utf-8 view type, which has one additional buffer compared to the Columnar format specification (see Binary view arrays).

Buffers of children arrays are not included.

I'm not sure if that is saying that the n_buffers argument should be 4 for the view types, or that it should be 3 and the consumer is expected to know that when working with view types, you actually have n_buffers + 1 buffers attached to the ArrowArray

@WillAyd
Copy link
Contributor Author

WillAyd commented Sep 5, 2024

Bah ignore what I just said - bad debugging. Digging deeper!

src/nanoarrow/common/array.c Outdated Show resolved Hide resolved
src/nanoarrow/common/array.c Outdated Show resolved Hide resolved
src/nanoarrow/common/inline_array.h Show resolved Hide resolved
src/nanoarrow/common/inline_types.h Show resolved Hide resolved
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 88.04348% with 11 lines in your changes missing coverage. Please review.

Project coverage is 86.06%. Comparing base (2c42268) to head (8c1e825).
Report is 22 commits behind head on main.

Files with missing lines Patch % Lines
src/nanoarrow/testing/testing.cc 14.28% 6 Missing ⚠️
src/nanoarrow/common/inline_types.h 0.00% 4 Missing ⚠️
src/nanoarrow/common/inline_array.h 98.64% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #596      +/-   ##
==========================================
+ Coverage   85.02%   86.06%   +1.04%     
==========================================
  Files          92       94       +2     
  Lines       11750    13697    +1947     
==========================================
+ Hits         9990    11788    +1798     
- Misses       1760     1909     +149     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more notes! I am gearing up for a release and it would be very cool to include this if we can...what do you see as blockers for getting this PR across the finish line and can I help with any of them?

Comment on lines 110 to 111
# TODO: run_end_encoded
assert na.binary_view().type == na.Type.BINARY_VIEW
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# TODO: run_end_encoded
assert na.binary_view().type == na.Type.BINARY_VIEW
assert na.binary_view().type == na.Type.BINARY_VIEW

(we have an issue for this one, feel free to punt! #618)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a hard time deciphering the code suggestion - do you just want me to remove these?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's just that you can remove the TODO

src/nanoarrow/common/array.c Outdated Show resolved Hide resolved
@WillAyd WillAyd marked this pull request as ready for review September 19, 2024 02:38
@WillAyd
Copy link
Contributor Author

WillAyd commented Sep 19, 2024

A few more notes! I am gearing up for a release and it would be very cool to include this if we can...what do you see as blockers for getting this PR across the finish line and can I help with any of them?

Great! I should have time to see this through.

There are no major blockers from my end. Let me know any feedback you have and I will take a look

@paleolimbot
Copy link
Member

Let me know any feedback you have and I will take a look

Great! I'll give this a thorough checkout + run through + review tomorrow 🙂

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a few changes are required to get this PR merged! This also needs a look from @bkietz to ensure that the string view details are correct!

src/nanoarrow/common/array.c Outdated Show resolved Hide resolved
src/nanoarrow/common/array.c Show resolved Hide resolved
src/nanoarrow/common/array.c Outdated Show resolved Hide resolved
Comment on lines 968 to 969
TEST(ArrayTest, ArrayTestAppendToBinaryViewArray) {
struct ArrowArray array;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something, these two tests can share a common void TestAppendToDataViewArray() to avoid duplicating this entire test? (I know we haven't done a great job of this in the test suite so far 😳)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you thinking that function should be added in the test suite or nanoarrow itself?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a test pattern (like you added below for consolidating the builder tests):

void TestAppendToDataViewArray(enum ArrowType view_type) {
  // ...
}

TEST(ArrayTest, ArrayTestAppendToBinaryViewArray) {
  ASSERT_NO_FATAL_FAILURE(TestAppendToDataViewArray(NANOARROW_TYPE_BINARY_VIEW));
}

TEST(ArrayTest, ArrayTestAppendToStringViewArray) {
  ASSERT_NO_FATAL_FAILURE(TestAppendToDataViewArray(NANOARROW_TYPE_STRING_VIEW));
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter that the string test uses ArrowArrayAppendString whereas the binary one uses ArrowArrayAppendBytes for the append func? Should I just use ArrowArrayAppendBytes in the shared test implementation? Or parametrize it to accept a std::function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for missing this comment! Probably we should test both functions to make sure they both work (I believe an if statement would work to get that coverage but we've also used std::function (for example, in the integration test json tests) and I think either works!

src/nanoarrow/common/inline_types.h Outdated Show resolved Hide resolved
src/nanoarrow/testing/testing.cc Show resolved Hide resolved
src/nanoarrow/common/schema.c Show resolved Hide resolved
src/nanoarrow/common/utils.c Show resolved Hide resolved
src/nanoarrow/common/inline_array.h Outdated Show resolved Hide resolved
Comment on lines 509 to 513
} else {
int32_t n_vbufs = private_data->n_variadic_buffers;
if (n_vbufs == 0 ||
private_data->variadic_buffers[n_vbufs - 1].size_bytes + value.size_bytes >
NANOARROW_BINARY_VIEW_BLOCK_SIZE) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the management of the array of buffers should be abstracted at least one level away to make it easier to spot potential problems and that here it would be more readable:

int64_t buffer_index;
NANOARROW_RETURN_NOT_OK(ArrowArrayAddVariadicBuffers(array, 1, &buffer_index));

(Exposing the ability to reserve variadic buffers also unlocks the ability to build by buffer and is probably a good idea anyway).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the buffer_index in this case supposed to point to the index of the first buffer that has been added? If so, any preference for the name of that parameter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly just to tell you which one just got added (so that you can put a reference to it in the item you're about to append to the data buffer). Perhaps there's a better parameter name (or a way to retrieve the number of currently used variadic buffers using a function?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea...struggling with what to name this, especially if we allow this function to add multiple buffers at once. Maybe the second function to get the number of buffers allocated is a better pattern?

So

int64_t nbuf_capacity = ArrowArrayVariadicBufferSize(struct ArrowArray *array);
int64_t additional_buffers = 3;
NANOARROW_RETURN_NOT_OK(ArrowArrayAddVariadicBuffers(array, additional_buffers));
// nbuf_capacity + additional_buffers == ArrowArrayVariadicBufferSize(struct ArrowArray *array);

versus

int64_t nbuf_capacity = private_data->n_variadic_buffers;
int64_t next_buffer_avail;
int64_t additional_buffers = 3;
NANOARROW_RETURN_NOT_OK(ArrowArrayAddVariadicBuffers(array, additional_buffers, &next_buffer_avail));
// next_buffer_avail + additional_buffers == private_data->n_variadic_buffers;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the second function to get the number of buffers allocated is a better pattern?

Yes! Maybe ArrowArrayVariadicBufferCount() though? (Invoking Size suggests to me that there's only one of them and that the function will tell you how many bytes there are).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think I've got this right. FWIW, maybe we would also add a ArrowArrayGetBuffer(struct ArrowArray* array, int32_t i) function to abstract many of the calls that have to reach into private_data as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have this one!

static inline struct ArrowBuffer* ArrowArrayBuffer(struct ArrowArray* array, int64_t i) {
struct ArrowArrayPrivateData* private_data =
(struct ArrowArrayPrivateData*)array->private_data;
switch (i) {
case 0:
return &private_data->bitmap.buffer;
default:
return private_data->buffers + i - 1;
}
}

We do use it a number of internals (but could possibly use it in others!)

@@ -467,52 +468,136 @@ static inline ArrowErrorCode ArrowArrayAppendDouble(struct ArrowArray* array,
return NANOARROW_OK;
}

#define NANOARROW_BINARY_VIEW_FIXED_BUFFERS 2
#define NANOARROW_BINARY_VIEW_INLINE_SIZE 12
#define NANOARROW_BINARY_VIEW_PREVIEW_SIZE 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: other implementations call this the prefix rather than the preview

Suggested change
#define NANOARROW_BINARY_VIEW_PREVIEW_SIZE 4
#define NANOARROW_BINARY_VIEW_PREFIX_SIZE 4

Comment on lines 479 to 491
struct ArrowBinaryViewTypeInlinedData {
int32_t size;
uint8_t data[NANOARROW_BINARY_VIEW_INLINE_SIZE];
};

struct ArrowBinaryViewTypeRefData {
int32_t size;
uint8_t data[NANOARROW_BINARY_VIEW_PREVIEW_SIZE];
int32_t buffer_index;
int32_t offset;
};

union ArrowBinaryViewType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These types should probably avoid including Type in their name since they are values from an array of that type and don't encode type-level information. See also ArrowDecimal, ArrowStringView, ArrowInterval

Suggested change
struct ArrowBinaryViewTypeInlinedData {
int32_t size;
uint8_t data[NANOARROW_BINARY_VIEW_INLINE_SIZE];
};
struct ArrowBinaryViewTypeRefData {
int32_t size;
uint8_t data[NANOARROW_BINARY_VIEW_PREVIEW_SIZE];
int32_t buffer_index;
int32_t offset;
};
union ArrowBinaryViewType {
struct ArrowBinaryViewInlined {
int32_t size;
uint8_t data[NANOARROW_BINARY_VIEW_INLINE_SIZE];
};
struct ArrowBinaryViewRef {
int32_t size;
uint8_t data[NANOARROW_BINARY_VIEW_PREVIEW_SIZE];
int32_t buffer_index;
int32_t offset;
};
union ArrowBinaryView {

src/nanoarrow/common/inline_array.h Show resolved Hide resolved
if (n_vbufs == 0 ||
private_data->variadic_buffers[n_vbufs - 1].size_bytes + value.size_bytes >
NANOARROW_BINARY_VIEW_BLOCK_SIZE) {
NANOARROW_RETURN_NOT_OK(ArrowArrayAddVariadicBuffers(array, 1, &buf_index));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't handle the case where value.size_bytes > NANOARROW_BINARY_VIEW_BLOCK_SIZE, in that case you'll need an extra large variadic buffer to accommodate it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK interesting. So right now whenever a variadic buffer's capacity may be exceeded I am starting afresh in a new buffer - should I be packing the existing buffer as full as possible before-hand and potentially splitting the value array across two variadic buffers? Or is that only a consideration when value.size_bytes > NANOARROW_BINARY_VIEW_BLOCK_SIZE?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I wrote this before observing that you aren't presizing data buffers #596 (comment)

Since you aren't presizing this isn't a bug


struct ArrowBinaryViewTypeRefData {
int32_t size;
uint8_t data[NANOARROW_BINARY_VIEW_PREVIEW_SIZE];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint8_t data[NANOARROW_BINARY_VIEW_PREVIEW_SIZE];
uint8_t prefix[NANOARROW_BINARY_VIEW_PREVIEW_SIZE];

}

for (int32_t i = 0; i < nbuffers; i++) {
ArrowBufferInit(&private_data->variadic_buffers[n_current_bufs + i]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ArrowBufferInit(&private_data->variadic_buffers[n_current_bufs + i]);
ArrowBufferInit(&private_data->variadic_buffers[n_current_bufs + i]);
private_data->variadic_buffer_sizes[i] = 0;

Comment on lines 552 to +553
NANOARROW_RETURN_NOT_OK(
ArrowBufferAppend(data_buffer, value.data.data, value.size_bytes));
break;
ArrowBufferAppend(variadic_buf, value.data.as_char, value.size_bytes));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now, but the more usual pattern for building view arrays is to preallocate the data buffers (which reduces the frequency of reallocation)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we want to leave that to a follow up PR at this point, but is the idea that we would just pre-allocate the entire NANOARROW_BINARY_VIEW_BLOCK_SIZE bytes? Or is there a more refined algorithm?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be misunderstanding this, but would:

if (value.size_bytes < NANOARROW_BINARY_VIEW_BLOCK_SIZE) {
  NANOARROW_RETURN_NOT_OK(ArrowBufferReserve(variadic_buf, NANOARROW_BINARY_VIEW_BLOCK_SIZE));
}

...just after adding a new variadic buffer solve this? (Feel free to ignore if this is off base...I agree that there's no bug here!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we wanted to reserve the size of the buffer we should do it after adding it with the call to ArrowArrayAddVariadicBuffers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blocking logic is rather specific to our appending, and we should make sure that others can implement their own logic if they want, which might have nothing to do with our block size. (Also, value.size_bytes is greater than the block size, that would result the initial reserve being immediately discarded).

Comment on lines 1106 to 1121
const union ArrowBufferViewData value_view = array_view->buffer_views[1].data;
union ArrowBinaryViewType bvt;
const size_t idx = sizeof(union ArrowBinaryViewType) * i;
memcpy(&bvt, value_view.as_uint8 + idx, sizeof(union ArrowBinaryViewType));
const int32_t inline_size = bvt.inlined.size;
view.size_bytes = inline_size;
if (inline_size <= NANOARROW_BINARY_VIEW_INLINE_SIZE) {
view.data.as_uint8 = value_view.as_uint8 + idx +
sizeof(((union ArrowBinaryViewType*)0)->inlined.size);
} else {
const int32_t buf_index = bvt.ref.buffer_index;
const int32_t nfixed_buf = NANOARROW_BINARY_VIEW_FIXED_BUFFERS;
const struct ArrowBufferView variadic_vw =
ArrowArrayViewBufferView(array_view, buf_index + nfixed_buf);
view.data.as_uint8 = variadic_vw.data.as_uint8 + bvt.ref.offset;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd probably be best to add const struct ArrowBinaryView* as_binary_view; to union ArrowBufferViewData, and extract a function

static struct ArrowBufferView ArrowArrayViewGetBytesFromViewArrayUnsafe(
    const struct ArrowArrayView* array_view, int64_t i) {
  const struct ArrowBinaryView* bv = &array_view->buffer_views[1].data.as_binary_view[i];
  struct ArrowBufferView out{.size_bytes = bv->inlined.size};
  if (bv->inlined.size <= NANOARROW_BINARY_VIEW_INLINE_SIZE) {
    out.data.as_uint8 = bv->inlined.data;
    return out;
  }

  const int32_t buf_index = bv->ref.buf_index + NANOARROW_BINARY_VIEW_FIXED_BUFFERS;
  out.data.data = array_view->array->buffers[buf_index];
  out.data.as_uint8 += bv->ref.offset;
  return out;
}

(replacing ArrowArrayViewBufferView).

Then we can rewrite this as:

Suggested change
const union ArrowBufferViewData value_view = array_view->buffer_views[1].data;
union ArrowBinaryViewType bvt;
const size_t idx = sizeof(union ArrowBinaryViewType) * i;
memcpy(&bvt, value_view.as_uint8 + idx, sizeof(union ArrowBinaryViewType));
const int32_t inline_size = bvt.inlined.size;
view.size_bytes = inline_size;
if (inline_size <= NANOARROW_BINARY_VIEW_INLINE_SIZE) {
view.data.as_uint8 = value_view.as_uint8 + idx +
sizeof(((union ArrowBinaryViewType*)0)->inlined.size);
} else {
const int32_t buf_index = bvt.ref.buffer_index;
const int32_t nfixed_buf = NANOARROW_BINARY_VIEW_FIXED_BUFFERS;
const struct ArrowBufferView variadic_vw =
ArrowArrayViewBufferView(array_view, buf_index + nfixed_buf);
view.data.as_uint8 = variadic_vw.data.as_uint8 + bvt.ref.offset;
}
view = ArrowArrayViewGetBytesFromViewArrayUnsafe(string_array, i);

and ArrowArrayViewGetStringUnsafe's case as

    case NANOARROW_TYPE_STRING_VIEW:
    case NANOARROW_TYPE_BINARY_VIEW: {
      struct ArrowBufferView buf_view = ArrowArrayViewGetBytesFromViewArrayUnsafe(string_array, i);
      view.data = buf_view.data.as_char;
      view.size_bytes = buf_view.size_bytes;
      break;
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome idea - thanks!

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more nits, looking good

src/nanoarrow/common/array.c Outdated Show resolved Hide resolved
src/nanoarrow/common/array_test.cc Outdated Show resolved Hide resolved
@WillAyd WillAyd changed the title feat: Implement read support for String/Binary View types feat: String/Binary View Support Sep 23, 2024
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is huge! Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add string view support
4 participants