Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of Avg aggregate: implement convert_to_state #11816

Closed
alamb opened this issue Aug 5, 2024 · 1 comment · Fixed by #11734
Closed

Improve performance of Avg aggregate: implement convert_to_state #11816

alamb opened this issue Aug 5, 2024 · 1 comment · Fixed by #11734
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Aug 5, 2024

Is your feature request related to a problem or challenge?

@korowa added "skip partial aggregation mode" in #11627 which helps with high cardinality aggregates by doing minimal work for the first phase of the aggregation. This mode is triggered dynamically based on how effective the first aggregation phase is working.

In order to use this new mode, the corresponding GroupsAccumulator needs to implement the convert_to_state method

/// Converts an input batch directly the intermediate aggregate state.
///
/// This is the equivalent of treating each input row as its own group. It
/// is invoked when the Partial phase of a multi-phase aggregation is not
/// reducing the cardinality enough to warrant spending more effort on
/// pre-aggregation (see `Background` section below), and switches to
/// passing intermediate state directly on to the next aggregation phase.
///
/// Examples:
/// * `COUNT`: an array of 1s for each row in the input batch.
/// * `SUM/MIN/MAX`: the input values themselves.
///
/// # Arguments
/// * `values`: the input arguments to the accumulator
/// * `opt_filter`: if present, any row where `opt_filter[i]` is false should be ignored
///
/// # Background
///
/// In a multi-phase aggregation (see [`Accumulator::state`]), the initial
/// Partial phase reduces the cardinality of the input data as soon as
/// possible in the plan.
///
/// This strategy is very effective for queries with a small number of
/// groups, as most of the data is aggregated immediately and only a small
/// amount of data must be repartitioned (see [`Accumulator::state`] for
/// background)
///
/// However, for queries with a large number of groups, the Partial phase
/// often does not reduce the cardinality enough to warrant the memory and
/// CPU cost of actually performing the aggregation. For such cases, the
/// HashAggregate operator will dynamically switch to passing intermediate
/// state directly to the next aggregation phase with minimal processing
/// using this method.
///
/// [`Accumulator::state`]: crate::Accumulator::state
fn convert_to_state(
&self,
_values: &[ArrayRef],
_opt_filter: Option<&BooleanArray>,
) -> Result<Vec<ArrayRef>> {
not_impl_err!("Input batch conversion to state not implemented")
}
/// Returns `true` if [`Self::convert_to_state`] is implemented to support
/// intermediate aggregate state conversion.
fn supports_convert_to_state(&self) -> bool {
false
}

Describe the solution you'd like

Implement covert_to_state for

https://github.com/apache/datafusion/blob/66a85706f6c5dc5eabcc09b0990d84c6f8879b81/datafusion/functions-aggregate/src/average.rs#L430-L429

Add tests in

# The main goal of these tests is to verify correctness of transforming
# input values to state by accumulators, supporting `convert_to_state`.

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Aug 5, 2024
@alamb alamb changed the title Improve performance of GeometricMean aggregage: implement convert_to_state Improve performance of Avg aggregage: implement convert_to_state Aug 5, 2024
@alamb alamb self-assigned this Aug 5, 2024
@alamb
Copy link
Contributor Author

alamb commented Aug 5, 2024

I have a draft PR for this here: #11734

@alamb alamb changed the title Improve performance of Avg aggregage: implement convert_to_state Improve performance of Avg aggregate: implement convert_to_state Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant