-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge SortMergeJoin filtered batches into larger batches #14160
Conversation
Can we use the |
Thanks @ozankabak I'll check it out |
Tbh, I was not able to find Appreciate if you can point me how it is being used in the joins to have join code base consistent. |
On the other hand, |
Thanks @berkaysynnada for your feedback, if I got you right, you prefer to call the I checked some tests in
perhaps I'm missing something? |
CoalesceBatches' in your example exist because of hash repartitions (CoalesceBatches rule adds a CoalesceBatchesExec on top of FilterExec, HashJoinExec, and hash-repartition). I've thought about this, and I believe the most optimal solution is to make all join operators capable of performing both coalescing and splitting in a built-in manner. This is because the output of a join can either be smaller or larger than the target batch size. Ideally, there should be no need (or only minimal need) for CoalesceBatchesExec. To achieve this built-in coalescing and splitting, we can leverage existing tools like BatchSplitter and BatchCoalescer (although there are no current examples of BatchCoalescer being used in joins). My suggestion is to generalize these tools so they can be utilized by any operator and applied wherever this mechanism is needed. As this pattern becomes more common, it will be easier to expand its usage and simplify its application. |
Thanks @berkaysynnada. Builtin options probably can be implemented with the sending a WDYT if we merge this PR to fix the bug for now and I start a discussion to unify coalesce/split approaches for the joins? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @berkaysynnada. Builtin options probably can be implemented with the > sending a BatchCoalescer into the join instead of writing the custom code like in this > implementation.
WDYT if we merge this PR to fix the bug for now and I start a discussion to unify > coalesce/split approaches for the joins?
Opening an issue for this sounds good. I've taken a look to the changes and LGTM. Are you planning to add some tests to avoid someone breaking this coalescing behavior?
Filed #14238 |
Filed #14239 for tests, thanks @berkaysynnada for the review |
Which issue does this PR close?
Closes #14050.
Rationale for this change
Filtered SortMergeJoin outputs the data after left row shift which is not performant, merging batches into bigger chunks close to
batch_size
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?