[FEA] Explore how to best deal with large numbers of aggregations in the short term #8141
Comments
For an initial test of this, I created an SQL query of the
For example, a query with
I then ran this locally on an RTX4000 at scale 100 and on an 8-node A100 cluster at scale 3000. To compare the sort fallback merge vs the default hashed merge, I ran it with the original (current code) and with the call to
Here are the results for runs on the 8-node A100 cluster:
@revans2, what do you recommend for next steps?
Should I start working on a sub-partitioning algorithm, or are there additional tests or other optimizations worth trying first?
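For reference, a minimal sketch of how a query with a large number of aggregation functions can be generated for this kind of test. This is not the exact query used in the runs above; the table, column names, aggregate mix, and the `NUM_FUNCS` value are assumptions.

```scala
// Hypothetical sketch: build a GROUP BY query with NUM_FUNCS aggregate
// expressions so the number of aggregations can be scaled up for testing.
val NUM_FUNCS = 300

val aggExprs = (0 until NUM_FUNCS).map { i =>
  // Cycle through a few aggregate functions over a numeric column so the
  // query exercises more than one kind of aggregation.
  i % 4 match {
    case 0 => s"SUM(ss_sales_price + $i) AS agg_$i"
    case 1 => s"AVG(ss_sales_price + $i) AS agg_$i"
    case 2 => s"MIN(ss_sales_price + $i) AS agg_$i"
    case _ => s"MAX(ss_sales_price + $i) AS agg_$i"
  }
}.mkString(",\n  ")

val query =
  s"""SELECT ss_customer_sk,
     |  $aggExprs
     |FROM store_sales
     |GROUP BY ss_customer_sk""".stripMargin

// Assumes a spark-shell style session named `spark` with the table registered.
spark.sql(query).write.mode("overwrite").parquet("/tmp/agg_stress_test")
```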
A few things pop out to me. The first is that you had to set the batch size in bytes so small to even make this work. I am guessing this has to do with the project that happens before we do the aggregation. We need small input data because the output data is going to be some number of times larger than the input data. The second thing is the key being used for the aggregations. The cardinality of

I have a couple of ideas in my head about how we might be able to process the data to reduce memory usage while doing these really big aggregates. I would love it if we could come up with some experiments that might point us in the right direction before having to write 10 different changes and see which one is best, but I am not 100% sure how to do that. The current code does the following.
I don't think I have described the steps perfectly, but hopefully it is good enough.

First off, the pre-process step is likely to make the output batch size much larger than the input batch size. This is because things like average require a SUM and a COUNT, decimal SUMs need multiple output columns to detect overflow, and I might be doing lots of aggregates on a single column, i.e. SUM(a), AVG(a), COUNT(a), MIN(a), MAX(a)..., in addition to a regular project that might increase the number of columns. To deal with this we need to either be more efficient with memory or partition the data. Being more efficient with memory is good, but it cannot solve the problem 100%, so we have to look at splitting the data in some way.

Our data is two dimensional (columns and rows), so we can split it two ways. We can keep all of the columns and split the number of rows in a batch. This is a good generic solution for project everywhere, so I think it is something we should look at first. Really it comes down to estimating the output size of the project based off of the output schema (see the sketch below). If all of the data is fixed width types then we are done; if some are variable width we can estimate things there. The problem with a row-wise split is that at some point we can have so many columns that we are processing a row or only a few rows at a time, which is going to be horrible for performance. We could also split the data by aggregations, i.e. do half of the aggregations in one pass and the other half in another pass. The problem with this is twofold. First, we might have to keep all of the input columns around while we process half of the aggregations, and second, we have to have a way to line up the rows after the aggregations. This is hard with a hash aggregate because it can decide to reorder rows, so to make this work we would have to sort the data ourselves and then tell CUDF to use a sort based aggregation instead.

The second part of the memory issue is the aggregations themselves (pre and merge). We know that sort is slower than hash partitioning, so falling back to a hash re-partition instead of a sort would be interesting. But it is potentially a lot of work, and if we ever want to do a column-wise split it is not going to help, because we would have to sort the data anyway.

So the main question(s) I have are really about how many aggregations we have to run before we start to run out of memory and might have to start thinking about splitting the data by aggregations instead of by rows. First, with a batch size of 512 MiB, at what point does the current code start to run out of memory (on a 16-GiB GPU and/or a larger GPU)? Second, for the 300 aggregations case, how large are the output batches when we tune the input batches to not crash (number of rows, columns, and MiB)? That should hopefully tell us whether we have to support splitting the aggregations for all likely cases or not.
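A rough sketch of the row-wise split estimate mentioned above: guess the per-row size of the pre-processed (projected) output from its schema, then cap the rows per batch so the projected output stays under the target batch size. The byte sizes (especially the variable-width guess) and helper names are assumptions, not the plugin's actual accounting.

```scala
import org.apache.spark.sql.types._

// Estimate the per-row output size of a projected schema.
def estimatedRowSize(schema: StructType): Long =
  schema.fields.map { f =>
    f.dataType match {
      case dt: DecimalType if dt.precision > 18  => 16L // decimal128
      case _: DecimalType                        => 8L
      case LongType | DoubleType | TimestampType => 8L
      case IntegerType | FloatType | DateType    => 4L
      case ShortType                             => 2L
      case BooleanType | ByteType                => 1L
      case _                                     => 32L // rough guess for strings, etc.
    }
  }.sum

// How many input rows can go into one batch so the projected output stays
// under the target batch size.
def maxRowsPerBatch(projectedSchema: StructType, targetBatchBytes: Long): Long =
  math.max(1L, targetBatchBytes / math.max(1L, estimatedRowSize(projectedSchema)))
```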
Thanks @revans! This helps a lot.
Yes. I was initially only running with >300 functions, so I tuned it down to avoid the OOMs, but I was concerned that these batch sizes might be too small for a meaningful test.
I was actually thinking of this as two separate tests, one with
I will work on answering these questions. Thanks!
I did some testing on the 8-node A100 cluster to determine when we start seeing OOMs with a fixed configuration.
I was varying NUM_FUNCS, which defines how many aggregation functions are included (see the script). I started hitting OOM errors with NUM_FUNCS=103. Output from the previous run shows:
On my desktop, with an RTX4000 with 16 GB, I started hitting OOMs at 52 funcs. The next step is to add some debug logging to capture more information about the input/output batch sizes. Here is the full script I was using for these tests:
103 aggregations at 40 GiB feels really low. I decided to just get an idea of how large the output is compared to the input batch, but I messed up when setting the batch size and instead set

With a partition size of 512 MiB I was able to get 160 aggregations to succeed on a 48 GiB A6000 with a concurrency of 4 and a batch size of 1 GiB, but the output of the preprocess step was up to 17.2 GiB. Technically that is larger than what our estimates indicate would work on a 48 GiB GPU; we would need about 70 GiB for the aggregation step to pass with our 4x input size estimate. But it also let me get a good estimate of what maxPartitionBytes would need to be for us to get 1 GiB of output after the preprocess stage. When I set it to 30 MiB I was able to successfully test 512 aggregations, and I estimate that I could have done 1600 aggregations before we ran out of memory again. That would have equated to over 512 aggregations on a 16 GiB GPU. That ended up splitting the input data into 1,000,000-row batches, because that is what the GPU uses as the default row group size when writing out Parquet. I am sure we could have gone even smaller to get even more aggregations.

I think this shows that we have a good first step with just splitting by rows and implementing a hash based fallback for aggregations instead of a sort based fallback. That said, we should also file a follow-on issue to really dig into what the fastest way to do these aggregations would be. Is it ever faster to sort the data and do the aggregations a few at a time? If so, how many should we do at a time? How many aggregations makes it so we should take the sorting hit, etc.?

On the happy side, I did test the performance against my desktop CPU, and even with spilling, etc., we are 17.2x faster at 512 aggregations vs a 12 core CPU with 80 GiB of heap. At 10 aggregations we were about 17-18x faster too, so it is good to see that we at least scale similarly to the CPU.
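For context, a hedged sketch of the kind of settings discussed above. The exact keys and values used in the original runs are assumptions reconstructed from the numbers in this thread, not a verified configuration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("many-aggregations-test")
  // Enable the RAPIDS Accelerator plugin.
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Shrink the input read per task so the post-project batches stay near the
  // 1 GiB target even with hundreds of aggregations.
  .config("spark.sql.files.maxPartitionBytes", "30m")
  // Target GPU batch size for the RAPIDS Accelerator (1 GiB).
  .config("spark.rapids.sql.batchSizeBytes", (1L << 30).toString)
  // Number of tasks allowed to use the GPU concurrently.
  .config("spark.rapids.sql.concurrentGpuTasks", "4")
  .getOrCreate()
```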
@revans2 to file follow-on experiment issues.
I filed #8382 as one follow-on issue to this. There will be more, but it might take a while to get to filing them.
The final issue I filed is #8398, to explore sorting and splitting the data by aggregations instead of by rows.
Is your feature request related to a problem? Please describe.
The current aggregation code can run into a lot of performance issues when there are large numbers of aggregations to do.
Currently the algorithm is to read in each input batch and aggregate it to an intermediate result. After that we try to merge those batches together into a single output batch. If we are unable to merge those batches into one that is smaller than the target batch size we fall back to doing a sort based aggregation (which is really a sort based merge aggregation). We sort all of the intermediate batches by the grouping keys. Then we split the batches on key boundaries and finally do a final merge pass of each of those batches and output them.
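As a toy illustration of that flow (not the plugin's actual code; a "batch" here is just a map from grouping key to a partial SUM, and the target batch size is measured in result rows):

```scala
type Key = Int
type Batch = Map[Key, Long]

val targetRowsPerBatch = 1000

// Aggregate a batch of (key, value) rows into partial SUMs per key.
def preAggregate(rows: Seq[(Key, Long)]): Batch =
  rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// Merge several intermediate batches into one.
def merge(batches: Seq[Batch]): Batch =
  preAggregate(batches.flatMap(_.toSeq))

def aggregate(inputBatches: Seq[Seq[(Key, Long)]]): Seq[Batch] = {
  // 1. Aggregate each input batch down to an intermediate result.
  val intermediates = inputBatches.map(preAggregate)
  // 2. Try to merge the intermediates into one output batch.
  val merged = merge(intermediates)
  if (merged.size <= targetRowsPerBatch) {
    Seq(merged)
  } else {
    // 3. Sort fallback: order the keys, split them on key boundaries, and
    //    run a final merge pass per slice so no single output batch has to
    //    hold every key.
    val sortedKeys = intermediates.flatMap(_.keys).distinct.sorted
    sortedKeys.grouped(targetRowsPerBatch).map { slice =>
      val keep = slice.toSet
      merge(intermediates.map(_.filter { case (k, _) => keep(k) }))
    }.toSeq
  }
}
```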
Technically this might mess with some aggregations like first and last where we need to maintain the order because the sort is not stable.
Within each aggregation step there is a pre-process step, the actual aggregation, and a post-process step. The pre-process step can take a lot of memory because, in some cases like decimal128, we have to expand the output into something that is larger than the input to have enough space to detect overflow conditions.
When we have hundreds of aggregations to do, what is the fastest way to do them? A hash aggregation is likely very fast, but building a hash table with hundreds of intermediate results in it is going to have memory pressure problems and potentially performance problems as well.
We should explore whether a sort based aggregation ever wins when there are large numbers of aggregations being done. If not, then we should look at removing the sort and instead doing a sub-partitioning algorithm similar to what we have done for joins recently. If it does, then we should look at finding a good way to do a stable sort for first/last, and at detecting when the input is already sorted, so we can tell CUDF this and avoid needing to cache the data before processing it.
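A hypothetical sketch of what such a hash sub-partitioning fallback could look like (types and names are illustrative, not the plugin's): instead of sorting the intermediate batches by the grouping keys, hash each key into one of N buckets and merge each bucket on its own, so no single merge pass has to hold everything.

```scala
def subPartitionMerge(
    intermediates: Seq[Seq[(Int, Long)]], // (grouping key, partial SUM) rows
    numBuckets: Int): Seq[Map[Int, Long]] = {
  (0 until numBuckets).map { bucket =>
    intermediates
      .flatten
      // Every row with the same key hashes to the same bucket, so each
      // bucket can be merged independently of the others.
      .filter { case (k, _) => java.lang.Math.floorMod(k.hashCode, numBuckets) == bucket }
      .groupBy(_._1)
      .map { case (k, rows) => k -> rows.map(_._2).sum }
  }
}
```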
In the long term we might need to work with CUDF to find better ways of optimizing for this use case.