Improve performance of high cardinality grouping by reusing hash values #11680
The trick with this ticket will be to structure the code in a way that is general and works across plans. It might first be worth a POC / hack to see how much performance there is to be had here (I suspect it is like 5-10% at most)
The experiment I did in #11708 shows that
@alamb If the benchmark code looks good to you, I think we could reuse the hash. I haven't cleaned up the code for production yet, so the impact on other queries is unknown. An alternative idea for improvement: if we can combine partial group + repartition + final group into one operation, we could probably avoid converting to rows once again in the final group.
Thank you @jayzhan211 -- those are some interesting results. I think it makes sense that reusing the hash values is helpful mostly for high cardinality aggregates, as in that case the number of rows that need to be repartitioned / rehashed is high.
I think this is the approach taken by systems like DuckDB as I understand it, and I think it is quite intriguing to consider. The challenge of the approach would be the software engineering required to manage the complexity of the combined multi-stage operator. I am not sure the functionality would be easy to combine without some more refactoring 🤔
@Dandandan has a good point on #11708 (comment) that in some cases (like a network shuffle) passing the hash values might be more expensive than just recomputing them
I got a performance boost for ClickBench Q17 just by enforcing single mode for multi-column group by, interesting. Comparing main and single-multi-groupby:
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ single-multi-groupby ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 2009.51ms │ 1435.19ms │ +1.40x faster │
│ QQuery 1 │ 7270.86ms │ 4033.52ms │ +1.80x faster │
└──────────────┴───────────┴──────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main) │ 9280.37ms │
│ Total Time (single-multi-groupby) │ 5468.70ms │
│ Average Time (main) │ 4640.19ms │
│ Average Time (single-multi-groupby) │ 2734.35ms │
│ Queries Faster │ 2 │
│ Queries Slower │ 0 │
│ Queries with No Change │ 0 │
└─────────────────────────────────────┴───────────┘
I think what I need to do is find a query that is currently slower in single mode, and find a way to optimize it, like the partial/final way, within a single execution node? 🤔 Does the result show that what we really need is storing values in one large hash table? Does anyone know what kind of query partial/final group by is good at? Upd: The specialized all-distinct benchmark has even crazier numbers (reuse_hash.rs)
// single-multi-groupby
Gnuplot not found, using plotters backend
benchmark time: [82.970 ms 98.723 ms 111.32 ms]
change: [-99.328% -98.932% -97.777%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
2 (20.00%) high mild
// main (1ce546168)
Gnuplot not found, using plotters backend
Benchmarking benchmark: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 23.3s.
benchmark time: [660.82 ms 1.3354 s 2.1247 s]
change: [+675.04% +1377.8% +2344.1%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
What exactly does
I think they are good at being able to use multiple cores to do the work in parallel. They are especially good at low cardinality aggregates (some of the TPC-H ones, for example, where there are 4 distinct groups) as the hash tables are small and the final shuffle is very small.
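The low-cardinality effect can be sketched in plain Rust (illustrative only; the function name `partial_sum` is hypothetical, not DataFusion's API): with only a handful of distinct keys, a partial pre-aggregation collapses each input partition down to a few rows before the shuffle.

```rust
use std::collections::HashMap;

/// Partially aggregate (key, value) rows into per-key sums, as a Partial
/// aggregate would before the shuffle. Hypothetical helper, not DataFusion code.
fn partial_sum(rows: &[(u32, i64)]) -> HashMap<u32, i64> {
    let mut groups = HashMap::new();
    for &(key, value) in rows {
        *groups.entry(key).or_insert(0) += value;
    }
    groups
}

fn main() {
    // 1_000 input rows but only 4 distinct keys, like the low-cardinality
    // TPC-H aggregates: the shuffle after the partial aggregate carries
    // only 4 rows per input partition instead of 1_000.
    let rows: Vec<(u32, i64)> = (0..1000u32).map(|i| (i % 4, 1)).collect();
    let partial = partial_sum(&rows);
    assert_eq!(partial.len(), 4);
    println!("rows to shuffle: {} (was {})", partial.len(), rows.len());
}
```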
Yes, partial/final often does the hashing / arrow::Row conversion twice. In a single group by node, it happens only once.
Difference between #11762 and main: #11762 runs Repartition -> SingleMode group by. For high cardinality (2M rows, unique values) #11762 has fewer partitions, and thus speeds up a lot. For low cardinality (2M rows with only 4 values) I can see #11762 is slightly slower than main. Next, I want to move repartition within the single mode group by; I guess we can then see a comparable result for the low cardinality case. Upd: #11777 removes the pre-repartition overall; it beats the low cardinality case slightly, and is 30x faster. #11777 optimizes the plan from
to
while the plan in the main branch is
I think this data is very interesting and we should look more deeply into why the single group mode is faster than doing a repartition / aggregate. It seems like the only differences are:
I would expect doing the final aggregate in parallel on distinct subsets to be about as fast. So one reasonable conclusion is that the overhead of
This is the idea behind exploring #11647 -- I think we could avoid a copy at the output of CoalesceBatchesExec, which would help to reduce the overhead
It seems the cpu cost about
One possibility is that it may not be a problem of CPU cost, but of scheduling?
I am not sure -- is there any chance you can attach the
Ok, svg here: |
That is quite cool -- thank you @Rachelint. I see evidence of why you proposed apache/arrow-rs#6146 / apache/arrow-rs#6155 🏃
@alamb Yes... the eager computation of null count is actually not as cheap as we expect...
I have some thoughts about why. They all skipped the partial agg, making no benefit in the
The single aggregate performance benefits @jayzhan211 experimented with and showed in #11762 and #11777 are on ClickBench Q17/Q18 rather than Q32. As of today, I see that Q32 performance is comparable to DuckDB's on an M3 Mac.
But for Q17, we are still behind:
We would probably need to consolidate Aggregate (Partial and Final) and Repartition into a single place in order to be able to adaptively choose the aggregate mode/algorithm based on runtime statistics.
I see the improvement for Q32 in the later PR #11792, and I guess the reason performance improved may be similar to the partial skipping? Maybe Q17/Q18 are improved for a different reason than Q32? I agree maybe we should have a similar mechanism for selecting the merging mode dynamically, like
Yes, that is why I experimented with single mode, forcing the partial and repartition stages to be avoided for all queries; sadly, this doesn't work well for the low cardinality case
I agree, similar to my idea before.
However, the refactor is quite challenging
I tried an alternative to merging partial + final: continue with single mode but find a way to optimize the low cardinality case. I found that the reason TPC-H Q1 slows down is that we repartition the batch, so there are many more batches of smaller size to compute than necessary. I tried removing the repartition + coalesce batch and found that performance improves!
#12340 is a hack for a specific query only and needs to be extended to general queries. The next thing is to find out when and how I should avoid
Print out of the column sizes
Does anyone know the rationale for having repartition + coalesce, and what kind of query benefits from it? From my experiment, I can see both the high cardinality case and the low cardinality case improve a lot without partial + repartition + coalesce. Does that mean we could remove them altogether 🤔? My understanding of repartition + coalesce is that they rebalance the batches: split (repartition) and merge (coalesce). Does it benefit cases with weird incoming batch sizes? Or maybe it is nice for queries other than group by cases?
For aggregation, it may be used to perform the parallel merging in the final aggregate from the partial aggregate.
The primary reason is scalability. Efficient aggregation requires multi-core CPUs to process data in parallel. To facilitate this, and to prevent contention from multiple threads altering a single hash table simultaneously (often managed with locks), a repartition phase is introduced. This repartition allows each thread to perform aggregation independently. Furthermore, pre-aggregation is employed to conduct preliminary calculations to streamline repartitioning, significantly reducing the volume of data that needs repartitioning (in cases where either the cardinality is low or there are several hot keys).
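The repartition-for-parallelism idea can be sketched in plain Rust (hypothetical helper names, not DataFusion code): hashing each group key to a partition guarantees a given key lands in exactly one partition, so each thread can aggregate its partition with a private hash table and no locks.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn hash_key(key: &str) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

/// Split rows across `n` partitions by key hash. Because the partition is a
/// pure function of the key, every occurrence of a key lands in the same
/// partition, so each partition can be aggregated independently.
fn hash_partition<'a>(rows: &[(&'a str, i64)], n: usize) -> Vec<Vec<(&'a str, i64)>> {
    let mut parts = vec![Vec::new(); n];
    for &(key, v) in rows {
        parts[(hash_key(key) % n as u64) as usize].push((key, v));
    }
    parts
}

fn main() {
    let rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)];
    let parts = hash_partition(&rows, 4);
    // Verify each distinct key appears in exactly one partition.
    let mut seen: HashMap<&str, usize> = HashMap::new();
    for (i, part) in parts.iter().enumerate() {
        for &(key, _) in part {
            assert_eq!(*seen.entry(key).or_insert(i), i);
        }
    }
    println!("key -> partition: {:?}", seen);
}
```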
This when-and-how problem is difficult for query engines because it requires foreknowledge of the data characteristics of the grouping keys. It's even harder for DataFusion since we have very limited table metadata that could help us with this decision. In an adaptive approach, the aggregate operator could internally start with one method and move to another without interacting with other components (the physical optimizer and other physical operators), making it more feasible.
Maybe reducing the cost of repartition is an alternative? I think the reason why
That is an interesting idea (perhaps have the partial group by operator produce partitioned output somehow, as it already knows the hash values of each group 🤔)
It is exciting that this idea seems promising!
I will check it more carefully again, and if it actually works, I will submit a formal PR.
I am sure that the performance improved now. But I think we should push this forward after:
I agree -- thank you for all this POC work @Rachelint -- very cool. I personally plan to
Then I will have bandwidth to consider the intermediate blocked management
Yes, I think this is the biggest challenge at the moment. I actually view the fuzz testing as a critical piece of this, so that as we rearrange the code we have confidence we aren't introducing regressions. Thank you again for all your help so far
They are actually cool optimizations; I am learning and trying to help too.
Yes, I agree testing is extremely important before introducing further big changes to the aggregation code. I am also looking at similar tests in other projects and thinking about how to refine our existing fuzz tests.
@Rachelint I suggest we work on fusing Partial + Repartition first; I'm quite confident in the direction of this improvement. To be honest I don't quite follow #11943, and it seems that we need better test coverage before #11943 is merged. I think we could work on other optimizations first (fusing + simplifying code #12335) plus fuzz test improvements, and then review #11943 again after that.
Thanks for the suggestion! But after considering again, I guess it is possible that fuzz testing and code refactoring should be finished first, before continuing to introduce more non-trivial changes? The aggregation code is actually too complex now, and to be honest I also don't have enough confidence to make big changes currently (actually I found possible bugs and regressions in main, and I am checking them again more carefully). I plan to spend more time on them first when I have more bandwidth after finishing the string-related optimizations.
I think you should also not discount what you find personally exciting :) I realize that testing is not always to everyone's liking. One thing to contemplate that I have found works well is to
I have found people seem very excited to help out, contribute to the project, and learn Rust, and they do so when there is a clear description of what is needed and an example to follow.
I agree that fusing Partial + Repartition sounds like a nice win with fewer code changes needed (and it is covered well by existing tests). I just merged #12269 and I will file some small follow-ons related to that tomorrow. So exciting. It is great to be on the team with you @Rachelint and @jayzhan211 ❤️
Is your feature request related to a problem or challenge?
As described on #11679, we can do better for high cardinality aggregates
One thing that consumes significant time in such queries is hashing, and I think we can reduce that significantly.
Specifically, for the multi-phase repartition plan, the number of hashed rows is something like
For low cardinality aggregates (e.g. when the intermediate group cardinality is 1000) the second term is small (a few thousand extra hashes isn't a big deal)
However, for high cardinality aggregates (e.g. when the intermediate cardinality is like 1,000,000 and there are 16 partitions) the second term is substantial
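A rough back-of-the-envelope makes the two-term point concrete. The exact formula is elided above; this sketch assumes (hypothetically) that the first term is the input row count, hashed once in the Partial aggregate, and the second term is the intermediate groups produced per partition, rehashed when they are repartitioned to the Final aggregate.

```rust
/// Hypothetical second term of the hash-count estimate: intermediate groups
/// emitted by each partition, rehashed once during repartitioning. This is an
/// illustrative assumption, not the formula from the issue text.
fn second_term(groups_per_partition: u64, partitions: u64) -> u64 {
    groups_per_partition * partitions
}

fn main() {
    let input_rows: u64 = 100_000_000; // first term: every input row hashed once

    // Low cardinality (~1000 intermediate groups): the extra hashing is
    // negligible next to the input-row hashing.
    assert!(second_term(1_000, 16) < input_rows / 1_000);

    // High cardinality (~1,000,000 intermediate groups, 16 partitions): the
    // extra hashing becomes a sizeable fraction of the input-row hashing.
    assert!(second_term(1_000_000, 16) > input_rows / 10);
}
```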
In pictures, this looks like
This effect can be seen in profiling for ClickBench Q17:
Here is the profiling from Instruments:
Describe the solution you'd like
The basic idea is to avoid recomputing the hash values in RepartitionExec and AggregateMode::Final by reusing the values from AggregateMode::Partial (which has already computed a hash value for each input group). Something like this
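A minimal sketch of the reuse idea in plain Rust (all names here — HashedBatch, partial_hash, repartition — are hypothetical; DataFusion's actual RecordBatch/RepartitionExec types are not used): the partial stage hashes each group key once and carries the hash alongside the data, so the repartition stage can pick a target partition without rehashing.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A batch carrying its group keys together with the hash value computed for
/// each one. Hypothetical type for illustration only.
struct HashedBatch {
    keys: Vec<String>,
    hashes: Vec<u64>,
}

fn hash_one(key: &str) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

/// The "Partial" step: hash each group key once and remember the values.
fn partial_hash(keys: Vec<String>) -> HashedBatch {
    let hashes = keys.iter().map(|k| hash_one(k)).collect();
    HashedBatch { keys, hashes }
}

/// The "Repartition" step: pick a target partition from the stored hash
/// instead of rehashing the key.
fn repartition(batch: &HashedBatch, n_partitions: u64) -> Vec<usize> {
    batch.hashes.iter().map(|h| (h % n_partitions) as usize).collect()
}

fn main() {
    let batch = partial_hash(vec!["a".into(), "b".into(), "a".into()]);
    let parts = repartition(&batch, 8);
    // Identical keys keep identical hashes, so they land in the same partition
    // without being hashed a second time.
    assert_eq!(parts[0], parts[2]);
}
```

The Final aggregate could similarly seed its hash table lookups from the stored hashes rather than recomputing them from the key bytes.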
Describe alternatives you've considered
Maybe we could pass the data as an explicit new column somehow, or maybe as a field in a struct array 🤔
Additional context
No response