Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Aggregate functions to take builder parameters #859

Merged
merged 35 commits into from
Sep 9, 2024

Conversation

timsaucer
Copy link
Contributor

@timsaucer timsaucer commented Sep 6, 2024

Which issue does this PR close?

Closes #780

Rationale for this change

This PR follows the same pattern as the recently closed #808 but does the same for aggregate functions. This is a usability enhancement.

What changes are included in this PR?

This PR updates the signatures for the aggregate functions to take optional parameters for the following:

  • order_by
  • distinct
  • null_treatment
  • filter

We add these parameters to the aggregate functions based on which ones are used internally. The other options are not in the function call. Users can always add these using the builder function approach if any of the internals get updated at a later time.

Are there any user-facing changes?

@Michael-J-Ward
Copy link
Contributor

@timsaucer - If there's a way for me to help with this lift without stepping on your toes, please let me know.

@timsaucer
Copy link
Contributor Author

If you wanted to divide and conquer we can, but actually I think another thing that would be very helpful would be to have a more ergonomic way to use aggregates as window functions. I could see two ways

  1. Update functions.window to take an aggregate function instead of a registered function name.
  2. Add something like Expr.over() so you could do something like first_value(col("a")).over(window) where window is a window frame definition.

What do you think? That could be done entirely independent of this PR

@timsaucer timsaucer marked this pull request as ready for review September 8, 2024 15:00
src/functions.rs Outdated
.map(|x| x.into_iter().map(|x| x.expr).collect::<Vec<_>>())
.unwrap_or_default();
let mut builder = agg_fn.order_by(order_by);
// let order_by = order_by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this commented code be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Removed

@@ -1059,7 +870,7 @@ pub(crate) fn init_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
m.add_wrapped(wrap_pyfunction!(floor))?;
m.add_wrapped(wrap_pyfunction!(from_unixtime))?;
m.add_wrapped(wrap_pyfunction!(gcd))?;
m.add_wrapped(wrap_pyfunction!(grouping))?;
// m.add_wrapped(wrap_pyfunction!(grouping))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if this comment applies here?

// Code is commented out since grouping is not yet implemented
// https://github.com/apache/datafusion-python/issues/861
// aggregate_function!(grouping);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I added that intentionally as a place holder until grouping gets implemented.

@Michael-J-Ward
Copy link
Contributor

Michael-J-Ward commented Sep 9, 2024

Really excellent.

I really appreciate how cleaner functions.rs is with those macros.


If you wanted to divide and conquer we can, but actually I think another thing that would be very helpful would be to have a more ergonomic way to use aggregates as window functions.

Upstream datafusion is about to release DF42, so we should probably ship df-python 41 after this.

I'll create an issue and add it to our next release cycle.

@timsaucer
Copy link
Contributor Author

Since this is a larger PR, I'm hoping to get at least one more approval before merging it in

@timsaucer timsaucer merged commit fd8224e into apache:main Sep 9, 2024
15 checks passed
@timsaucer timsaucer deleted the feature/aggregate-builder-fns branch September 9, 2024 18:13
emgeee pushed a commit to probably-nothing-labs/datafusion-python that referenced this pull request Sep 9, 2024
* Add NullTreatment enum wrapper and add filter option to approx_distinct

* Small usability on aggregate

* Adding documentation and additional unit test for approx_median

* Update approx_percentil_cont with builder parameters it uses, which is filter but not distinct

* Update approx_percentil_cont_with_weight with builder parameters it uses, which is filter but not distinct

* Update array_agg to use aggregate options

* Update builder options for avg aggregate function

* move bit_and bit_or to use macro to generaty python fn

* Update builder arguments for bitwise operators

* Use macro for bool_and and bool_or

* Update python wrapper for arguments appropriate to bool operators

* Set corr to use macro for pyfunction

* Update unit test to make it easier to debug

* Update corr python wrapper to expose only builder parameters used

* Update count and count_star to use macro for exposing

* Update count and count_star with approprate aggregation options

* Move covar_pop and covar_samp to use macro for aggregates

* Updateing covar_pop and covar_samp with builder option

* Use macro for last_value and move first_value to be near it

* Update first_value and last_value with the builder parameters that are relevant

* Remove grouping since it is not actually implemented upstream

* Move median to use macro

* Expose builder options for median

* Expose nth value

* Updating linear regression functions to use filter and macro

* Update stddev and stddev_pop to use filter and macro

* Expose string_agg

* Add string_agg to python wrappers and add unit test

* Switch sum to use macro in rust side and expose correct options in python wrapper

* Use macro for exposing var_pop and var_samp

* Add unit tests for filtering on var_pop and var_samp

* Move approximation functions to use macro when possible

* Update user documentation to explain in detail the options for aggregate functions

* Update unit test to handle Python 3.10

* Clean up commented code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorporate unified Aggregate / Window function builder
3 participants