Use Arc<[Buffer]> instead of raw Vec<Buffer> in GenericByteViewArray for faster slice #6427
Unfortunately, the use of impl is still a breaking change, as it could impact type inference, e.g. if collecting an iterator into the argument.
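To illustrate the inference concern, here is a minimal sketch. The `Buffer` struct and the functions `takes_vec` and `takes_into` are hypothetical stand-ins, not the arrow-rs API; the sketch only assumes that the new parameter has a shape like `impl Into<Arc<[Buffer]>>`.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Buffer`, used only to keep the sketch self-contained.
#[derive(Clone, Debug, PartialEq)]
pub struct Buffer(pub Vec<u8>);

// Old-style signature: the concrete parameter type drives type inference.
pub fn takes_vec(buffers: Vec<Buffer>) -> usize {
    buffers.len()
}

// Assumed new-style signature: accepts anything convertible into `Arc<[Buffer]>`.
pub fn takes_into(buffers: impl Into<Arc<[Buffer]>>) -> usize {
    let buffers: Arc<[Buffer]> = buffers.into();
    buffers.len()
}

pub fn demo() -> (usize, usize) {
    let raw = vec![vec![1u8, 2], vec![3u8]];

    // Compiles: `collect()` infers `Vec<Buffer>` from the parameter type.
    let a = takes_vec(raw.iter().cloned().map(Buffer).collect());

    // With `impl Into<...>` the same call shape is rejected, because the
    // compiler can no longer pick a collection type for `collect()`:
    // let b = takes_into(raw.iter().cloned().map(Buffer).collect());

    // The caller now has to name the collection type explicitly.
    let b = takes_into(raw.into_iter().map(Buffer).collect::<Vec<_>>());

    (a, b)
}
```

Callers that pass an already-typed `Vec<Buffer>` are unaffected; only call sites that relied on the parameter type to drive inference would need an annotation.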
@@ -234,7 +234,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
     }

     /// Deconstruct this array into its constituent parts
-    pub fn into_parts(self) -> (ScalarBuffer<u128>, Vec<Buffer>, Option<NullBuffer>) {
+    pub fn into_parts(self) -> (ScalarBuffer<u128>, Arc<[Buffer]>, Option<NullBuffer>) {
This is also a breaking change
FWIW the reason "breaking change" is important is that it restricts when we can merge this PR: https://github.com/apache/arrow-rs/blob/master/CONTRIBUTING.md#breaking-changes
@@ -114,7 +114,7 @@ use super::ByteArrayType;
 pub struct GenericByteViewArray<T: ByteViewType + ?Sized> {
     data_type: DataType,
     views: ScalarBuffer<u128>,
-    buffers: Vec<Buffer>,
+    buffers: Arc<[Buffer]>,
What's the rationale for Arc<[Buffer]> vs Vec<Arc>?
Cloning an Arc is relatively cheap (no allocation); cloning a Vec isn't.
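A small sketch of that difference, using only the standard library. The `Buffer` type here is a hypothetical stand-in that wraps reference-counted bytes, loosely mirroring how arrow's `Buffer` shares its data; it is not the real type.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Buffer`: cheap to clone, bytes are shared.
#[derive(Clone, Debug)]
pub struct Buffer(pub Arc<[u8]>);

/// Returns (does the Vec clone share its allocation, does the Arc clone share
/// its allocation, strong count of the Arc after cloning).
pub fn demo() -> (bool, bool, usize) {
    let owned: Vec<Buffer> = vec![
        Buffer(Arc::from(&b"hello"[..])),
        Buffer(Arc::from(&b"world"[..])),
    ];

    // Cloning a Vec allocates a new backing array and clones every element.
    let owned_copy = owned.clone();
    let vec_shares = std::ptr::eq(owned.as_ptr(), owned_copy.as_ptr());

    // Cloning an Arc<[Buffer]> only bumps a reference count.
    let shared: Arc<[Buffer]> = owned.into();
    let shared_copy = Arc::clone(&shared);
    let arc_shares = Arc::ptr_eq(&shared, &shared_copy);

    (vec_shares, arc_shares, Arc::strong_count(&shared))
}
```

Note that the one-time conversion from `Vec<Buffer>` into `Arc<[Buffer]>` is not free; the saving applies to every clone made afterwards.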
I get it. However, if I understand correctly, Arc<[Buffer]> means the buffers can be passed around and shared only when they are within a single slice, which can be limiting. For example, can I merge two arrays, combining their Arc<Buffer>s without moving or cloning the buffers?
> Can i merge two arrays, combining their Arc<Buffer>s without moving or cloning the buffers?
No -- you would have to create a new Vec<Buffer> (or some other way to get Arc<[Buffer]>).
So while there are some cases where new allocations are required, slicing / cloning is faster.
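One way to picture the merge case is the sketch below. `Buffer` and `merge` are hypothetical names, and the stand-in `Buffer` assumes reference-counted bytes. Under that assumption, merging allocates a new buffer list, while the underlying bytes remain shared with the source arrays.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Buffer`: cloning shares the bytes.
#[derive(Clone, Debug)]
pub struct Buffer(pub Arc<[u8]>);

// Build a new buffer list from two existing ones. This allocates one new
// `Arc<[Buffer]>`, but each element clone is only a reference-count bump.
pub fn merge(a: &Arc<[Buffer]>, b: &Arc<[Buffer]>) -> Arc<[Buffer]> {
    a.iter().chain(b.iter()).cloned().collect()
}

/// Returns (merged length, bytes shared with `a`, list allocation shared with `a`).
pub fn demo() -> (usize, bool, bool) {
    let a: Arc<[Buffer]> = vec![Buffer(Arc::from(&b"hello"[..]))].into();
    let b: Arc<[Buffer]> = vec![Buffer(Arc::from(&b"world"[..]))].into();

    let merged = merge(&a, &b);

    // The byte data behind the first buffer is the same allocation.
    let bytes_shared = Arc::ptr_eq(&a[0].0, &merged[0].0);
    // The list of buffers itself is a fresh allocation.
    let list_shared = Arc::ptr_eq(&a, &merged);

    (merged.len(), bytes_shared, list_shared)
}
```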
> you would have to create a new Vec<Buffer>
but that would prevent buffer sharing between two arrays, right?
> slicing / cloning is faster
cloning yes
slicing -- i didn't see it
I think the observation is that during StringViewArray::slice, the slice actually happens on the views -- the buffers (that the views can point at) must be copied.
Here is the clone of buffers: https://docs.rs/arrow-array/53.1.0/src/arrow_array/array/byte_view_array.rs.html#385
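A simplified sketch of why slicing benefits: only the view window changes, and the buffer list is carried over unchanged. `ViewArraySketch` and its fields are hypothetical and much simpler than the real `GenericByteViewArray`; the point is only that carrying the buffers over becomes a reference-count bump once they live in an `Arc<[Buffer]>`.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Buffer`.
#[derive(Clone, Debug)]
pub struct Buffer(pub Arc<[u8]>);

// Hypothetical, simplified view array: a window over `views` plus shared buffers.
#[derive(Clone, Debug)]
pub struct ViewArraySketch {
    views: Arc<[u128]>,
    offset: usize,
    len: usize,
    buffers: Arc<[Buffer]>,
}

impl ViewArraySketch {
    pub fn new(views: Vec<u128>, buffers: Vec<Buffer>) -> Self {
        let len = views.len();
        Self { views: views.into(), offset: 0, len, buffers: buffers.into() }
    }

    // Slicing narrows the view window; the buffer list is shared, not copied.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            views: Arc::clone(&self.views),
            offset: self.offset + offset,
            len,
            buffers: Arc::clone(&self.buffers), // no heap allocation here
        }
    }

    pub fn views(&self) -> &[u128] {
        &self.views[self.offset..self.offset + self.len]
    }

    pub fn buffers(&self) -> &Arc<[Buffer]> {
        &self.buffers
    }
}
```

With a `Vec<Buffer>` field, the marked line would instead be a `Vec` clone, which allocates on every call to `slice`.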
Which issue does this PR close?
Close #6408

Rationale for this change
In the GenericByteViewArray, the buffers field is a raw vector, leading to heap allocation when some methods are called, e.g. clone, slice. Using Arc<[Buffer]> instead of the raw Vec<Buffer> can avoid such heap allocation.
And the newly added benchmark cases for slice show the improvement:

What changes are included in this PR?
Use Arc<[Buffer]> instead of the raw Vec<Buffer> as the type of the buffers field of GenericByteViewArray.

Are there any user-facing changes?
The signature of the method GenericByteViewArray::new_unchecked is changed from:
to
However, any usage of this method before this PR should still work without any modification.
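The exact signatures are not reproduced above. As a rough illustration of how a constructor can change its parameter type while still accepting existing `Vec<Buffer>` arguments, here is a sketch that assumes a parameter of the form `impl Into<Arc<[Buffer]>>`. `ArraySketch` and `Buffer` are hypothetical and not the arrow-rs types.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Buffer`.
#[derive(Clone, Debug)]
pub struct Buffer(pub Arc<[u8]>);

// Hypothetical array type holding only the buffer list.
pub struct ArraySketch {
    buffers: Arc<[Buffer]>,
}

impl ArraySketch {
    // Assumed shape of the new parameter: anything convertible into `Arc<[Buffer]>`.
    pub fn new(buffers: impl Into<Arc<[Buffer]>>) -> Self {
        Self { buffers: buffers.into() }
    }

    pub fn buffer_count(&self) -> usize {
        self.buffers.len()
    }
}

pub fn demo() -> (usize, usize) {
    let buf = Buffer(Arc::from(&b"abc"[..]));

    // Existing call sites that pass a `Vec<Buffer>` keep compiling.
    let from_vec = ArraySketch::new(vec![buf.clone(), buf.clone()]);

    // New call sites can hand over an already shared list without copying it.
    let shared: Arc<[Buffer]> = Arc::from(vec![buf]);
    let from_arc = ArraySketch::new(Arc::clone(&shared));

    (from_vec.buffer_count(), from_arc.buffer_count())
}
```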