
Grouped Aggregate in row format #2375

Merged: 9 commits merged into apache:master on May 7, 2022

Conversation

yjshen (Member) commented Apr 29, 2022:

Which issue does this PR close?

Closes #2452.
Partly fixes #2455.

Rationale for this change

Using the row format in grouped aggregation has several benefits over the current Vec<ScalarValue> approach (a minimal sketch of benefit 1 follows the list):

  1. group keys can be compared directly as Vec<u8>
  2. memory is saved by storing state without per-value datatype information
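
A minimal, self-contained sketch of benefit 1 (illustrative only, not this PR's code): with a row format, each group key is a plain byte buffer, so the hash map compares raw bytes instead of walking a Vec<ScalarValue>.

    use std::collections::HashMap;

    fn main() {
        // Map from encoded group key (raw row bytes) to the group's state index.
        let mut groups: HashMap<Vec<u8>, usize> = HashMap::new();

        // Pretend each key is a (u64, utf8) pair encoded into one contiguous row.
        let key_a: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 1, b'x'];
        let key_b: Vec<u8> = vec![0, 0, 0, 0, 0, 0, 0, 2, b'y'];

        for key in [key_a.clone(), key_b, key_a] {
            let next = groups.len();
            // Hashing and equality are plain byte comparisons, no ScalarValue walk.
            let idx = *groups.entry(key).or_insert(next);
            println!("row maps to group {idx}");
        }
    }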

What changes are included in this PR?

  1. A new accumulator trait that manipulates data in row format, with the five most basic accumulators: Max, Min, Sum, Count, and Avg (a rough sketch of the trait shape follows this list).
  2. RowAccessor for fast, in-place updates of Vec<u8> row fields.
  3. Branching in AggregateExec to use row-based group aggregation when applicable.
  4. Making the datafusion-row crate a default dependency of datafusion-core and datafusion-physical-expr.
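
A rough, hypothetical sketch of what such a row-based accumulator trait looks like, pieced together from the diff excerpts quoted later in this conversation; the trait name and the placeholder types stand in for the real datafusion/arrow types and are not the exact API added here.

    use std::sync::Arc;

    // Placeholders standing in for arrow's ArrayRef and this PR's RowAccessor.
    type ArrayRef = Arc<dyn std::any::Any>;
    struct RowAccessor;
    type Result<T> = std::result::Result<T, String>;

    trait RowAccumulatorSketch {
        /// Offset of this accumulator's first state field within the state row.
        fn state_index(&self) -> usize;
        /// Fold a batch of input values into the row-format state, in place.
        fn update_batch(&mut self, values: &[ArrayRef], accessor: &mut RowAccessor) -> Result<()>;
    }

    fn main() {} // the sketch only needs to compile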

Are there any user-facing changes?

No.

The github-actions bot added the ballista and datafusion (Changes in the datafusion crate) labels on Apr 29, 2022.
andygrove (Member) commented:

> The current PR seems scary in size, maybe I should move the physical_plan folder re-org as a separate PR first.

I think that would help.

Are we replacing HashAggregate completely with the new row-based aggregate, or do we want to support both? Does hash aggregate still have advantages for some use cases? Maybe we can have a config setting for which one to use?

yjshen (Member, Author) commented Apr 30, 2022:

Sorry for mixing two things into one PR. I will divide this into separate PRs, one for each of these ideas:

  1. Promote physical-plan/hash_aggregates.rs to a directory and rename it to aggregates. We already have a hash-based implementation, GroupedHashAggregateStream, for aggregates with grouping keys, and a non-hash implementation for aggregates without grouping keys (it keeps a single record of state but is named HashAggregateStream, although it is not related to hashing at all).
  2. Use the row format to store grouping keys and accumulator states when all accumulator states are fixed-size. Use Vec<ScalarValue> for all other cases (when there is at least one variable-length accumulator state, or any of the AggregateExprs doesn't support a row-based accumulator yet).

> Maybe we can have a config setting for which one to use

I think the choice between row-based accumulator states and Vec<ScalarValue>-based accumulator states should depend on row-based accumulator capability at query execution time: we only use row-based aggregate states when all of the query's accumulators support them (a small sketch of this check follows). This assumes we are sure, based on benchmark results, that the row-based version always outperforms the Vec<ScalarValue> version whenever it is applicable.
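
A hypothetical sketch of that decision; the names are illustrative, not this PR's API. The row-based path is taken only when every aggregate expression supports a row accumulator and all of its state fields are fixed-size.

    // Summary of per-aggregate capabilities (hypothetical type).
    struct AggExprInfo {
        has_row_accumulator: bool,
        all_states_fixed_size: bool,
    }

    fn use_row_based_states(aggregates: &[AggExprInfo]) -> bool {
        aggregates
            .iter()
            .all(|a| a.has_row_accumulator && a.all_states_fixed_size)
    }

    fn main() {
        let aggs = vec![
            AggExprInfo { has_row_accumulator: true, all_states_fixed_size: true },
            AggExprInfo { has_row_accumulator: false, all_states_fixed_size: true },
        ];
        // Falls back to Vec<ScalarValue> states because one accumulator lacks support.
        assert!(!use_row_based_states(&aggs));
    }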

@yjshen yjshen self-assigned this May 1, 2022
@yjshen yjshen marked this pull request as ready for review May 5, 2022 11:19
@yjshen yjshen changed the title WIP: Use row format for aggregate Grouped Aggregate in row format May 5, 2022
alamb (Contributor) commented May 5, 2022:

I am starting to check this out -- I'll try to finish today but I may run out of time.

alamb (Contributor) commented May 5, 2022:

@yjshen do you have any benchmark numbers you can share?

alamb (Contributor) left a review:

I have some superficial comments on the API design -- I hope to dig more into the implementation later today.

This is looking very cool @yjshen

Review comments were left on:

  • datafusion/core/Cargo.toml
  • datafusion/physical-expr/src/aggregate/accumulator_v2.rs
  • datafusion/core/src/physical_plan/hash_utils.rs
  • datafusion/physical-expr/src/aggregate/mod.rs
alamb (Contributor) left a review:

I reviewed this code, and all in all I think it is great. Nice work @yjshen 🏆

I would like to see the following two things prior to approving this PR:

  • See some sort of performance benchmarks showing this is faster than master
  • Try this change against the IOx test suite

I will plan to try against IOx tomorrow.

Concern about Code Duplication

I am somewhat concerned that this PR ends up with a parallel implementation of GroupByHash as well as some of the aggregates.

This approach is fairly nice because it is backwards compatible, and thus this PR allows us to make incremental progress 👍

However, I worry that we now have two slightly different implementations which will diverge over time, and we will be stuck with these two forms forever if we don't have the discipline to complete the transition. This would make the codebase more complicated and harder to work with over time.

Perhaps I can make this less concerning by enumerating what work remains to entirely switch to RowAggregate (and remove AggregateStream entirely).

    None => {
        this.finished = true;
        let timer = this.baseline_metrics.elapsed_compute().timer();
        let result = create_batch_from_map(
alamb (Contributor) commented on the excerpt above:

This code effectively makes one massive output record batch -- I think that is also what GroupedHashAggregateStream does, but in my opinion it would be better to stream the output (i.e., respect the batch_size configuration). Maybe we can file a ticket to do so.
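
A minimal sketch of that streaming idea (illustrative only; plain vectors stand in for RecordBatches and batch_size is passed explicitly):

    // Split the grouped output into chunks of `batch_size` rows instead of
    // emitting one huge batch.
    fn emit_in_batches<T: Clone>(rows: &[T], batch_size: usize) -> Vec<Vec<T>> {
        rows.chunks(batch_size.max(1))
            .map(|chunk| chunk.to_vec())
            .collect()
    }

    fn main() {
        let rows: Vec<u32> = (0..10).collect();
        let batches = emit_in_batches(&rows, 4);
        assert_eq!(batches.len(), 3); // 4 + 4 + 2 rows
    }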

yjshen (Member, Author) replied:

Yes, I plan to do this in #1570 as my next step.

    @@ -338,6 +455,42 @@ impl Accumulator for SumAccumulator {
        }
    }

    #[derive(Debug)]
    struct SumAccumulatorV2 {
        index: usize,
alamb (Contributor) commented on the excerpt above:

I think it would help to document what this is an index into -- i.e., document what the parameters are.

yjshen (Member, Author) replied:

Added a new method state_index(&self) -> usize and explained its meaning in the RowAccumulator doc.

        s: &ScalarValue,
    ) -> Result<()> {
        match (dt, s) {
            // float64 coerces everything to f64
alamb (Contributor) commented on the excerpt above:

🤔 I almost wonder how valuable supporting all these types is -- I wonder if we could use u64 or i64 accumulators for all integer types and f64 for floats, and reduce the code. I don't think this PR makes things any better or worse, but these type match statements are so common and repetitive.

yjshen (Member, Author) replied:

Agreed. We should clean this up, probably by checking type coercions.

        accessor: &mut RowAccessor,
    ) -> Result<()> {
        let values = &values[0];
        add_to_row(&self.datatype, self.index, accessor, &sum_batch(values)?)?;
alamb (Contributor) commented on the excerpt above:

I wonder if it is necessary to go through sum_batch here (which turns the sum into a ScalarValue) -- perhaps we could call the appropriate sum kernel followed by a direct update.

yjshen (Member, Author) replied:

I initially used sum_batch here mainly to reduce code duplication with SumAccumulator; besides, there's a decimal sum_batch that isn't included in the compute kernels yet.

alamb (Contributor) added:

Also possibly related: #2447.

Further review comments were left on:

  • datafusion/core/src/physical_plan/aggregates/row_hash.rs
yjshen (Member, Author) commented May 6, 2022:

Thanks @alamb, for the detailed review ❤️. I'll try to answer or fix them today.

Micro benchmark: aggregate_query_sql

Existing aggregate_query_sql with a newly added case:

    c.bench_function("aggregate_query_group_by_u64_multiple_keys", |b| {
        b.iter(|| {
            query(
                ctx.clone(),
                "SELECT u64_wide, utf8, MIN(f64), AVG(f64), COUNT(f64) \
                 FROM t GROUP BY u64_wide, utf8",
            )
        })
    });
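
For reference, a criterion bench like this is typically run with something along the lines of cargo bench --bench aggregate_query_sql from the crate that defines it (the exact invocation is assumed here, not stated in the thread).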

I'm using a compound group by key with many distinct values in the newly added case.

The results are:

The master branch

aggregate_query_group_by                        
                        time:   [2.0366 ms 2.0448 ms 2.0531 ms]

aggregate_query_group_by_with_filter                        
                        time:   [1.4311 ms 1.4338 ms 1.4366 ms]

aggregate_query_group_by_u64 15 12                        
                        time:   [2.0208 ms 2.0283 ms 2.0358 ms]

aggregate_query_group_by_with_filter_u64 15 12                        
                        time:   [1.4242 ms 1.4269 ms 1.4296 ms]

aggregate_query_group_by_with_filter_multiple_keys                        
                        time:   [1.4709 ms 1.4732 ms 1.4756 ms]

aggregate_query_group_by_u64_multiple_keys                        
                        time:   [14.239 ms 14.408 ms 14.589 ms]

This PR

aggregate_query_group_by                        
                        time:   [2.7145 ms 2.7273 ms 2.7400 ms]

aggregate_query_group_by_with_filter                        
                        time:   [1.4331 ms 1.4364 ms 1.4397 ms]

aggregate_query_group_by_u64 15 12                        
                        time:   [2.7358 ms 2.7493 ms 2.7631 ms]

aggregate_query_group_by_with_filter_u64 15 12                        
                        time:   [1.4269 ms 1.4298 ms 1.4328 ms]

aggregate_query_group_by_with_filter_multiple_keys                        
                        time:   [1.4761 ms 1.4788 ms 1.4818 ms]

aggregate_query_group_by_u64_multiple_keys                        
                        time:   [12.580 ms 12.787 ms 12.999 ms]

Improved: the newly introduced case with many distinct groups.

Regressed: group-by with fewer groups.
I tried to check the regression on aggregate_query_group_by and aggregate_query_group_by_u64 15 12 by flamegraphing them, but found only 12 samples in GroupedHashAggregateStream::next, which account for 1.49% of all samples.

Edit: after moving physical plan creation out of the bench timing, there is still no big difference in the flamegraph; AggregateStreamV1/V2::next shows fewer than 10 samples, about 1% of all samples.

TPC-H query 1 (aggregate with four distinct states)

cargo run --release --features "mimalloc" --bin tpch -- benchmark datafusion --iterations 3 --path /home/yijie/sort_test/tpch-parquet --format parquet --query 1 --batch-size 4096

The master branch

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 192.6 ms and returned 4 rows
Query 1 iteration 1 took 189.0 ms and returned 4 rows
Query 1 iteration 2 took 196.0 ms and returned 4 rows
Query 1 avg time: 192.55 ms

This PR

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 189.8 ms and returned 4 rows
Query 1 iteration 1 took 187.9 ms and returned 4 rows
Query 1 iteration 2 took 186.3 ms and returned 4 rows
Query 1 avg time: 188.00 ms

No difference in performance is observed, which is expected since there are few groups and the work is mainly in-cache state calculation.

TPC-H q1 modified, with more groups:

select
    l_orderkey,
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
        l_shipdate <= date '1998-09-02'
group by
    l_orderkey,
    l_returnflag,
    l_linestatus
order by
    l_orderkey,
    l_returnflag,
    l_linestatus;

The master branch

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 3956.7 ms and returned 2084634 rows
Query 1 iteration 1 took 3885.2 ms and returned 2084634 rows
Query 1 iteration 2 took 3928.9 ms and returned 2084634 rows
Query 1 avg time: 3923.62 ms

This PR:

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 3219.5 ms and returned 2084634 rows
Query 1 iteration 1 took 3130.5 ms and returned 2084634 rows
Query 1 iteration 2 took 3107.4 ms and returned 2084634 rows
Query 1 avg time: 3152.48 ms

There are noticeable performance improvements as the number of groups grows.

@@ -144,7 +173,8 @@ fn sum_decimal_batch(
 }
 }

 // sums the array and returns a ScalarValue of its corresponding type.
-pub(crate) fn sum_batch(values: &ArrayRef) -> Result<ScalarValue> {
+pub(crate) fn sum_batch(values: &ArrayRef, sum_type: &DataType) -> Result<ScalarValue> {
+    let values = &cast(values, sum_type)?;
yjshen (Member, Author) commented on this change:

This is the partial fix for #2455. We should cast the input array to the sum result datatype first to reduce the possibility of overflow. Further, we should have a wrapping sum kernel as well as a try_sum kernel to produce wrapped results or nulls in the case of overflow.
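
A minimal sketch of that casting approach, written directly against the arrow crate rather than this PR's code (error handling simplified): the i32 input is cast to the Int64 sum type before summing, so accumulation happens in the wider type.

    use std::sync::Arc;

    use arrow::array::{ArrayRef, Int32Array, Int64Array};
    use arrow::compute::{cast, sum};
    use arrow::datatypes::DataType;

    fn main() -> Result<(), arrow::error::ArrowError> {
        let input: ArrayRef = Arc::new(Int32Array::from(vec![i32::MAX, 1, 2]));

        // Cast to the sum result type first, then sum in the wider type.
        let widened = cast(&input, &DataType::Int64)?;
        let widened = widened
            .as_any()
            .downcast_ref::<Int64Array>()
            .expect("cast produced an Int64 array");
        let total: Option<i64> = sum(widened);

        println!("{total:?}"); // Some(2147483650), no i32 overflow
        Ok(())
    }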

alamb (Contributor) replied:

I also wonder if we could internally consider summing smaller integers using u128 and then detecting overflow at the end. 🤔
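
A tiny sketch of that idea in plain Rust (illustrative only; i128 is used here since the inputs are signed): accumulate in a 128-bit integer and check the result type once, at the end.

    fn checked_sum_i64(values: &[i32]) -> Option<i64> {
        // Widen every element and sum without intermediate overflow, ...
        let total: i128 = values.iter().map(|v| *v as i128).sum();
        // ... then detect overflow of the target type only once.
        i64::try_from(total).ok()
    }

    fn main() {
        assert_eq!(checked_sum_i64(&[i32::MAX, 1, 2]), Some(2_147_483_650));
        assert_eq!(checked_sum_i64(&[]), Some(0));
    }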

alamb (Contributor) left a review:

Those are very nice benchmark improvements @yjshen 👍

I ran the IOx test suite against this branch (https://github.com/influxdata/influxdb_iox/pull/4531) and it seems to have worked great

@andygrove andygrove merged commit 6786203 into apache:master May 7, 2022
yjshen (Member, Author) commented May 8, 2022:

> Perhaps I can make this less concerning by enumerating what work remains to entirely switch to RowAggregate (and remove AggregateStream entirely).

@alamb @andygrove I revisited our current row implementation and listed all the TODO items I could think of in #1861; in the process, I think we can eliminate these code duplications and keep improving performance.

Successfully merging this pull request may close these issues.

  • Sum should not panicked when overflow
  • GroupedHashAggregate in row format