
Commit

Add new example, fix references
alamb committed Sep 29, 2024
1 parent d7dd6b7 commit 2d81a87
Showing 3 changed files with 40 additions and 6 deletions.
11 changes: 9 additions & 2 deletions datafusion/physical-plan/src/aggregates/group_values/column.rs
@@ -37,6 +37,8 @@ use datafusion_physical_expr::binary_map::OutputType;
use hashbrown::raw::RawTable;

/// A [`GroupValues`] that stores multiple columns of group values.
+///
+///
pub struct GroupValuesColumn {
/// The output schema
schema: SchemaRef,
@@ -55,8 +57,13 @@ pub struct GroupValuesColumn {
map_size: usize,

/// The actual group by values, stored column-wise. Compare from
-/// the left to right, each column is stored as [`ArrayRowEq`].
-/// This is shown faster than the row format
+/// the left to right, each column is stored as [`GroupColumn`].
+///
+/// Performance tests showed that this design is faster than using the
+/// more general purpose [`GroupValuesRows`]. See the ticket for details:
+/// <https://github.com/apache/datafusion/pull/12269>
+///
+/// [`GroupValuesRows`]: crate::aggregates::group_values::row::GroupValuesRows
group_values: Vec<Box<dyn GroupColumn>>,

/// reused buffer to store hashes
Original file line number Diff line number Diff line change
@@ -61,7 +61,7 @@ pub trait GroupColumn: Send + Sync {
fn take_n(&mut self, n: usize) -> ArrayRef;
}

-/// An implementation of [`ArrayRowEq`] for primitive types.
+/// An implementation of [`GroupColumn`] for primitive types.
pub struct PrimitiveGroupValueBuilder<T: ArrowPrimitiveType> {
group_values: Vec<T::Native>,
nulls: Vec<bool>,
@@ -157,7 +157,7 @@ impl<T: ArrowPrimitiveType> GroupColumn for PrimitiveGroupValueBuilder<T> {
}
}

-/// An implementation of [`ArrayRowEq`] for binary and utf8 types.
+/// An implementation of [`GroupColumn`] for binary and utf8 types.
pub struct ByteGroupValueBuilder<O>
where
O: OffsetSizeTrait,
31 changes: 29 additions & 2 deletions datafusion/physical-plan/src/aggregates/group_values/mod.rs
@@ -41,9 +41,36 @@ mod group_column;

/// Stores the group values during hash aggregation.
///
+/// # Background
+///
+/// In a query such as `SELECT a, b, count(*) FROM t GROUP BY a, b`, the group values
+/// identify each group, and correspond to all the distinct values of `(a,b)`.
+///
+/// ```sql
+/// -- Input has 4 rows with 3 distinct combinations of (a,b) ("groups")
+/// create table t(a int, b varchar)
+/// as values (1, 'a'), (2, 'b'), (1, 'a'), (3, 'c');
+///
+/// select a, b, count(*) from t group by a, b;
+/// ----
+/// 1 a 2
+/// 2 b 1
+/// 3 c 1
+/// ```
+///
+/// # Design
+///
+/// Managing group values is a performance critical operation in hash
+/// aggregation. The major operations are:
+///
+/// 1. Intern: Quickly finding existing and adding new group values
+/// 2. Emit: Returning the group values as an array
+///
/// There are multiple specialized implementations of this trait optimized for
-/// different data types and number of columns, instantiated by
-/// [`new_group_values`].
+/// different data types and number of columns, optimized for these operations.
+/// See [`new_group_values`] for details.
+///
+/// # Group Ids
///
/// Each distinct group in a hash aggregation is identified by a unique group id
/// (usize) which is assigned by instances of this trait. Group ids are
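
Note on the documented design: the Intern and Emit operations and the group id assignment described in the new doc comment can be sketched with the standard library alone. The sketch below is an illustration under stated assumptions, not DataFusion's implementation: `SimpleGroupValues`, its fields, and its method signatures are hypothetical, the key type is fixed to `(i32, String)` to match the `(a, b)` example, and it uses a plain `HashMap` rather than the column-wise storage this commit documents.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified group value store (NOT DataFusion's `GroupValues` trait).
/// Keys are fixed to `(i32, String)`, mirroring the `(a, b)` columns in the SQL example.
#[derive(Default)]
struct SimpleGroupValues {
    /// Maps each distinct group value to its group id.
    map: HashMap<(i32, String), usize>,
    /// Group values in group id order: `values[id]` is the value of group `id`.
    values: Vec<(i32, String)>,
}

impl SimpleGroupValues {
    /// Intern: find the group id of each input row, adding a new group
    /// (with the next unused id) when the value has not been seen before.
    fn intern(&mut self, rows: &[(i32, &str)]) -> Vec<usize> {
        let mut ids = Vec::with_capacity(rows.len());
        for &(a, b) in rows {
            let key = (a, b.to_string());
            // Ids are assigned densely, so the next new id is the current group count.
            let next_id = self.values.len();
            let id = *self.map.entry(key.clone()).or_insert(next_id);
            if id == next_id {
                // The entry was just inserted: remember the value for `emit`.
                self.values.push(key);
            }
            ids.push(id);
        }
        ids
    }

    /// Emit: return all group values column-wise, in group id order,
    /// and reset the store.
    fn emit(&mut self) -> (Vec<i32>, Vec<String>) {
        self.map.clear();
        std::mem::take(&mut self.values).into_iter().unzip()
    }
}
```

With the four input rows from the SQL example, `intern` returns the group ids `[0, 1, 0, 2]`, and `emit` then returns the columns `[1, 2, 3]` and `["a", "b", "c"]`, one entry per distinct group.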
