deduplicate variadic buffers in MutableArrayData::extend for ByteView arrays #6808

Open
wants to merge 4 commits into
base: main

onursatici
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

MutableArrayData concatenates the variadic buffers of all input arrays, so a buffer shared by several inputs can appear more than once in the output array.

What changes are included in this PR?

extend now checks whether the same buffer has already been added from another input array, and rewrites the views being appended so that they point to the deduplicated buffer indices.
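The idea described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: `Buffer`, `DedupOutput`, `make_view`, `buffer_index` and `remap_view` are hypothetical names, buffers are modelled as `Arc<Vec<u8>>` and identified by allocation address (an assumption about how "the same buffer" is detected), and the 4-byte prefix field of the view is left as zero for brevity. It relies only on the view layout from the Arrow format: length in bits 0..32, and for views longer than 12 bytes, buffer index in bits 64..96 and offset in bits 96..128.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for an Arrow data buffer (assumption: identity = allocation address).
type Buffer = Arc<Vec<u8>>;

/// Views of at most 12 bytes are stored inline and reference no buffer.
const MAX_INLINE_LEN: u32 = 12;

/// Build an out-of-line view; the prefix (bits 32..64) is left as zero here.
fn make_view(len: u32, buffer_idx: u32, offset: u32) -> u128 {
    (len as u128) | ((buffer_idx as u128) << 64) | ((offset as u128) << 96)
}

/// Read the buffer index field (bits 64..96) of a view.
fn buffer_index(view: u128) -> u32 {
    (view >> 64) as u32
}

/// Rewrite the buffer index of a view using `remap[old_idx] = new_idx`.
fn remap_view(view: u128, remap: &[u32]) -> u128 {
    let len = view as u32; // low 32 bits hold the length
    if len <= MAX_INLINE_LEN {
        return view; // inlined data: nothing to rewrite
    }
    let new_idx = remap[buffer_index(view) as usize];
    (view & !(0xFFFF_FFFFu128 << 64)) | ((new_idx as u128) << 64)
}

/// Hypothetical output accumulator that keeps each distinct buffer once.
#[derive(Default)]
struct DedupOutput {
    buffers: Vec<Buffer>,
    buffer_to_idx: HashMap<*const Vec<u8>, u32>,
    views: Vec<u128>,
}

impl DedupOutput {
    /// Append `views`, whose buffer indices refer to positions in `buffers`.
    fn extend(&mut self, views: &[u128], buffers: &[Buffer]) {
        // Map each input-local buffer index to its index in the output.
        let mut remap = Vec::with_capacity(buffers.len());
        for buffer in buffers {
            let next = self.buffers.len() as u32;
            let idx = *self.buffer_to_idx.entry(Arc::as_ptr(buffer)).or_insert(next);
            if idx == next {
                // First time this buffer is seen: keep one copy of the handle.
                self.buffers.push(Arc::clone(buffer));
            }
            remap.push(idx);
        }
        self.views.extend(views.iter().map(|v| remap_view(*v, &remap)));
    }
}
```

With this sketch, extending from two inputs that share a buffer leaves a single copy of that buffer in `buffers`, and every appended view that referenced it ends up with the same output index regardless of the position it had in its input.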

Are there any user-facing changes?

github-actions bot added the arrow (Changes to the arrow crate) label on Nov 27, 2024
@onursatici
Contributor Author

@tustvold happy to add tests if this approach is more in line with what you had in mind in #6779

_ => vec![],
let (variadic_data_buffers, buffer_to_idx) = match &data_type {
DataType::BinaryView | DataType::Utf8View => {
let mut buffer_to_idx = HashMap::new();
Contributor

I wonder if building a hashmap / vec would be overly expensive (though we would need to run benchmarks to be sure)

cc @XiangpengHao

Contributor Author

Happy to run benchmarks. Do you have any particular ones in mind, or should I create one with criterion specific to this?

Contributor

I think the ones in cast are probably a good place to start

@onursatici
Contributor Author

@alamb @tustvold I added a string view case to the interleave benchmark and ran it on main, this PR (interleave-deduplicated), and #6779 (interleave-specific-impl):

❯ critcmp interleave-main interleave-deduplicated interleave-specific-impl
group                                                                                interleave-deduplicated                interleave-main                        interleave-specific-impl
-----                                                                                -----------------------                ---------------                        ------------------------
interleave string_view(0.5, 50, true) 100 [0..100, 100..230, 450..1000]              2.33      2.3±0.02µs        ? ?/sec    1.82  1767.6±25.30ns        ? ?/sec    1.00    968.7±4.97ns        ? ?/sec
interleave string_view(0.5, 50, true) 1024 [0..100, 100..230, 450..1000, 0..1000]    1.80     13.9±0.17µs        ? ?/sec    1.35     10.4±0.11µs        ? ?/sec    1.00      7.7±0.11µs        ? ?/sec
interleave string_view(0.5, 50, true) 1024 [0..100, 100..230, 450..1000]             1.80     13.3±0.13µs        ? ?/sec    1.39     10.3±0.10µs        ? ?/sec    1.00      7.4±0.09µs        ? ?/sec
interleave string_view(0.5, 50, true) 400 [0..100, 100..230, 450..1000]              1.93      5.8±0.05µs        ? ?/sec    1.49      4.5±0.05µs        ? ?/sec    1.00      3.0±0.02µs        ? ?/sec

I believe the penalty introduced by this PR would be mitigated in interleave's case if we also merge #6779. For other cases, it feels like the read / transfer-over-the-wire improvements might outweigh the cost. Happy to hear your thoughts.

@alamb
Contributor

alamb commented Dec 6, 2024

Thank you @onursatici -- I hope to find time to review this PR this weekend or early next week

Contributor

@alamb alamb left a comment

I (again) apologize for the delay in reviewing this PR. We are stretched quite thin as always

In general, I think this PR needs some tests to show it is working as well as ensure we don't break this functionality with some future PR.

Thank you for running the benchmarks. They seem promising and I will give them a more careful look if we proceed with this PR

@onursatici
Contributor Author

@alamb no worries, and thank you for having a look. I have now added some tests checking the deduplication and remapping behaviour. Let me know whenever you have time if this looks good. Happy holidays!
