feat(api): support `rollup`, `cube` and `grouping_sets` APIs #9945

cpcloud · 2024-08-27T16:55:54Z

Description of changes

WIP PR implementing support for rollup, cube and grouping sets.

For background on what these are used for and how they work, see
https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUPING-SETS.

In discussing the implementation, I'm going to refer to grouping sets only,
but everything applies to rollups and cubes as well (rollups and cubes are
shorthand for a longer-form GROUPING SETS clause).
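To make the "shorthand" relationship concrete, here is a minimal pure-Python sketch of how ROLLUP and CUBE expand into explicit grouping sets, following the PostgreSQL documentation linked above. The helper names `rollup_sets` and `cube_sets` are hypothetical and are not part of Ibis or of this PR.

```python
from itertools import combinations


def rollup_sets(*keys):
    """Expand ROLLUP(k1, ..., kn) into its n + 1 grouping sets (key prefixes)."""
    return [tuple(keys[:i]) for i in range(len(keys), -1, -1)]


def cube_sets(*keys):
    """Expand CUBE(k1, ..., kn) into its 2**n grouping sets (all subsets)."""
    return [
        subset
        for size in range(len(keys), -1, -1)
        for subset in combinations(keys, size)
    ]


print(rollup_sets("a", "b"))  # [('a', 'b'), ('a',), ()]
print(cube_sets("a", "b"))    # [('a', 'b'), ('a',), ('b',), ()]
```

The empty tuple stands for the empty grouping set, i.e. the grand-total row.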

These concepts represent a new kind of "thing" for Ibis. They are neither
column nor scalar expressions that can be evaluated, nor are they tables
really.

The approach I took here was to add a non-Value operation for each construct.
These are desugared into tuples of tuples of Value expressions (because
multiple grouping sets specifications are allowed) wherever grouping sets are
supported.

One tricky thing is that I had to partition the keys specifically requested in
group by, as distinct from the unique value expressions in grouping sets, to
support constructs like this:

SELECT a, b
FROM t
GROUP BY ROLLUP (a, b)

which is not equivalent to

SELECT a, b
FROM t
GROUP BY a, b, ROLLUP (a, b)
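The difference between the two queries can be sketched in plain Python, assuming the PostgreSQL rule that multiple grouping items in one GROUP BY combine as a cross product, with each plain key acting as a single one-key grouping set. The helpers `rollup_sets` and `cross` are hypothetical names used only for this illustration.

```python
from itertools import product


def rollup_sets(*keys):
    """Expand ROLLUP(k1, ..., kn) into its n + 1 grouping sets (key prefixes)."""
    return [tuple(keys[:i]) for i in range(len(keys), -1, -1)]


def cross(*items):
    """Cross product of grouping items, de-duplicating keys within each set."""
    result = []
    for choice in product(*items):
        merged = []
        for grouping_set in choice:
            for key in grouping_set:
                if key not in merged:
                    merged.append(key)
        result.append(tuple(merged))
    return result


# GROUP BY ROLLUP (a, b)
only_rollup = cross(rollup_sets("a", "b"))
# GROUP BY a, b, ROLLUP (a, b): plain keys are single one-key grouping sets
keys_and_rollup = cross([("a",)], [("b",)], rollup_sets("a", "b"))

print(only_rollup)      # [('a', 'b'), ('a',), ()]
print(keys_and_rollup)  # [('a', 'b'), ('a', 'b'), ('a', 'b')]
```

In the second form every grouping set contains both a and b, so the subtotal and grand-total sets never appear, which is why the plain keys have to be tracked separately from the keys that occur inside the rollup.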

Right now, I'm looking for feedback on the approach before fleshing out the
test suite.

Issues closed

gforsyth · 2024-08-27T21:36:27Z

This seems like a reasonable approach to me.
Small feedback that can always be addressed later:

I think it would be nice to allow passing in bare strings, values, or deferreds to grouping_sets, along with tuples of them, e.g.

ibis.grouping_sets(("a", "b"), "a", "b")

this would better mimic the SQL syntax (god, i can't believe I'm suggesting that...) and also avoids the awkward spelling of

ibis.grouping_sets(("a", "b"), ("a",), ("b",))
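A minimal sketch of the normalization this suggestion implies, assuming bare keys are simply wrapped in one-element tuples. `normalize_grouping_sets` is a hypothetical helper, not the function used in the PR, and it does no special handling of deferreds or expressions beyond treating them as bare keys.

```python
def normalize_grouping_sets(*specs):
    """Wrap bare keys in 1-tuples; pass tuples and lists through as tuples."""
    return tuple(
        tuple(spec) if isinstance(spec, (tuple, list)) else (spec,)
        for spec in specs
    )


short = normalize_grouping_sets(("a", "b"), "a", "b")
verbose = normalize_grouping_sets(("a", "b"), ("a",), ("b",))

print(short)             # (('a', 'b'), ('a',), ('b',))
print(short == verbose)  # True
```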

And on the naming front, I think we should take a pointer from DuckDB and rename grouping to grouping_id -- I think grouping is going to cause confusion.

but overall seems solid!

cpcloud · 2024-08-28T01:17:55Z

All good points, definitely changing grouping -> grouping_id. I'm half tempted to call it group_id, those extra three characters are killing me.

cpcloud · 2024-08-28T15:24:45Z

For a full-featured and correct implementation, grouping sets will depend on the earliest version of sqlglot that contains tobymao/sqlglot#3985.

cpcloud · 2024-08-28T20:00:26Z

Ok, now that I think about this some more, I think having separate ibis.rollup/grouping_sets/cube objects adds unnecessary complication.

What do people think about making these keyword arguments to group_by? E.g.,

t.group_by(
    ...,
    rollup=(_.a, "b"),
    grouping_sets=("a", "b", ("a", "b")),
    cube=(_.c, _.d)
)

The separate objects would make sense if there were some way to use this functionality outside of a group by, but AFAICT they are strictly only valid in that case.

This would also eliminate having to do the partitioning, since the API requires the user spell it out.

cpcloud · 2024-08-28T20:00:54Z

This would break users who have columns named rollup, cube, or grouping_sets but that seems okay?

jcrist · 2024-08-28T20:21:46Z

What do people think about making these keyword arguments to group_by?

This would break users who have columns named rollup, cube, or grouping_sets but that seems okay?

I slightly dislike that. Not because it might break existing users (seems unlikely), but because it's harder to visually distinguish keyword-arguments serving as aliases in group_by from those specific kwargs.

t.group_by(one=t.a, cube=(t.b,), three=t.c)  # intentionally terrible formatting

Another potential downside is it doesn't let users as easily use multiple of these constructs in the same query. Following this line from the docs:

If multiple grouping items are specified in a single GROUP BY clause, then the final list of grouping sets is the cross product of the individual items.

With separate objects, I'd be able to pass in multiple of the same kind of object, while kwargs would force the user to handle the composition of them themselves (confession - I'm new to this whole concept, so apologies if this is a dumb example):

t.group_by(ibis.cube("a", "b"), ibis.cube("c", "d"))
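For a sense of what composing two cubes means under the cross-product rule quoted above, here is a small pure-Python sketch. `cube_sets` and `combine` are hypothetical helpers for illustration and say nothing about how Ibis or any backend actually performs the expansion.

```python
from itertools import combinations, product


def cube_sets(*keys):
    """Expand CUBE(k1, ..., kn) into its 2**n grouping sets (all subsets)."""
    return [
        subset
        for size in range(len(keys), -1, -1)
        for subset in combinations(keys, size)
    ]


def combine(*items):
    """Cross product of grouping items: concatenate one set from each item."""
    return [
        tuple(key for grouping_set in choice for key in grouping_set)
        for choice in product(*items)
    ]


combined = combine(cube_sets("a", "b"), cube_sets("c", "d"))

print(len(combined))  # 16
print(combined[0])    # ('a', 'b', 'c', 'd')
print(combined[-1])   # ()
```

With a single kwarg per construct, the user would have to spell out these 16 sets (or an equivalent form) by hand.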

Individual objects also may be easier for programmatic usage - a system would only need a single list of things to group by, rather than splitting out the various grouping set kind of things.

None of these are blockers of course, and all can be worked around. Just noting a few potential downsides.

cpcloud · 2024-08-28T20:59:03Z

Hm, all good points. I think I'll stick with the objects for now.

cpcloud · 2024-09-05T11:47:04Z

ibis/backends/tests/tpc/queries/duckdb/ds/05.sql

-SELECT channel ,
-       id ,
+SELECT nullif(channel, '') AS channel ,
+       nullif(id, '') AS id ,


This change, and the same change in 80.sql, is because of ClickHouse's obsession with never using null values. Apparently this includes the grouping keys for rollup and friends.

cpcloud · 2024-09-05T11:47:48Z

ibis/expr/types/groupby.py

    orderings: VarTuple[ops.SortKey] = ()
    havings: VarTuple[ops.Value[dt.Boolean]] = ()

+    grouping_sets: VarTuple[VarTuple[VarTuple[ir.Value]]] = ()


This makes me sad, and I'll add a comment.


cpcloud · 2024-09-05T11:49:23Z

ibis/expr/types/relations.py

-        return GroupedTable(self, groups)
+        groups, grouping_sets, rollups, cubes = partition_groups(*by, **key_exprs)
+        if not (groups or grouping_sets or rollups or cubes):
+            raise com.IbisInputError("No grouping keys provided")


Previously we were using an Annotated type hint to enforce that groups was non-empty, but we can't do that anymore because it can be empty if the user provides at least one grouping set, rollup, or cube and we don't have a way to spell that kind of relationship in the type system (nor do I think we need to build that right now).
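The runtime check that replaces the type-level constraint can be sketched as follows. This is a simplified stand-in under stated assumptions, not the PR's `partition_groups`: the `GroupingSets`, `Rollup` and `Cube` classes and `partition_groups_sketch` are hypothetical, keys are plain strings rather than Ibis expressions, and `ValueError` stands in for `com.IbisInputError`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupingSets:
    sets: tuple


@dataclass(frozen=True)
class Rollup:
    keys: tuple


@dataclass(frozen=True)
class Cube:
    keys: tuple


def partition_groups_sketch(*by):
    """Split grouping arguments into plain keys, grouping sets, rollups, cubes."""
    groups, grouping_sets, rollups, cubes = [], [], [], []
    for item in by:
        if isinstance(item, GroupingSets):
            grouping_sets.append(item.sets)
        elif isinstance(item, Rollup):
            rollups.append(item.keys)
        elif isinstance(item, Cube):
            cubes.append(item.keys)
        else:
            groups.append(item)
    # Plain keys may be empty, but only if some other construct was given.
    if not (groups or grouping_sets or rollups or cubes):
        raise ValueError("No grouping keys provided")
    return groups, grouping_sets, rollups, cubes


print(partition_groups_sketch("a", Cube(("b", "c"))))
# (['a'], [], [], [('b', 'c')])
```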

cpcloud · 2024-09-05T11:53:26Z

ibis/expr/types/relations.py

@@ -1040,6 +1069,9 @@ def aggregate(
        metrics: Sequence[ir.Scalar] | None = (),
        by: Sequence[ir.Value] | None = (),
        having: Sequence[ir.BooleanValue] | None = (),
+        grouping_sets=(),
+        rollups=(),
+        cubes=(),


I don't like exposing these arguments in the aggregate. Here's why:

I have to bind again, similar to what I am doing in the group_by method.

Putting the grouping set arguments alongside by means there's potential for confusion and mixing: e.g., someone can pass ibis.cube into by, and/or they can pass it into cubes. If we ultimately go this route then we'll have to handle that.

I suppose neither of these things is the actual worst, but I'm open to suggestions on how to avoid exposing these additional arguments.

It doesn't get away from having to bind again (and actually adds a bit more work), but what if we just have all grouping expressions passed in to by?

agg(_.some_col.sum(), by=[_.a, _.b, ibis.cube(_.a, _.b)])

?

I'll poke at it a bit. It might be easier if there were bound and unbound versions of the rollup/cube/grouping sets objects. Bound would be used internally only, while the user facing APIs are all unbound. The current ones are effectively Unbound.

cpcloud · 2024-09-05T11:54:17Z

ibis/expr/types/relations.py

@@ -1128,8 +1166,29 @@ def aggregate(
                else:
                    metrics[metric.name] = metric

-        # construct the aggregate node
-        agg = ops.Aggregate(node, groups, metrics).to_expr()
+        keys = frozendict(


This bit is far too optimistic and very proof-of-concept. More unit tests are required to throw some chaos at this code.

cpcloud · 2024-09-05T11:54:49Z

ibis/tests/expr/test_aggregation.py

+    gs = ibis.cube("a", "b")
+    expr = t.group_by(gs).agg(n=_.count())
+    result = ibis.to_sql(expr, dialect="duckdb")
+    assert len(result)


These tests are woefully naive and optimistic, need to make them grumpier.

cpcloud · 2024-09-05T12:53:00Z

Keeping this as draft until I can at least add some less sophomoric tests.


gforsyth · 2024-09-20T18:44:03Z

I wish I hadn't learned this while working through reviewing this, but I guess it's valid to pass a ROLLUP or a CUBE as an element of a GROUPING SET? I don't think that works right now, and maybe we leave that for a follow-up, but I thought we should note it down for ourselves.

cpcloud · 2024-09-20T18:49:07Z

Which of course means you can nest grouping sets arbitrarily.

D create table t (a int, b int);
D select 1 from t group by grouping sets (rollup (a, b), a, grouping sets (cube (a, b)));
┌───────┐
│   1   │
│ int32 │
├───────┤
│     1 │
│     1 │
└───────┘
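One way to reason about the nested query above is to flatten it into a single list of grouping sets. The sketch below assumes a made-up tagged-tuple encoding (`("sets", ...)`, `("rollup", ...)`, `("cube", ...)`, with bare strings as single keys); `expand` is a hypothetical helper, and composite elements such as `(a, b)` inside GROUPING SETS are not modeled.

```python
from itertools import combinations


def expand(spec):
    """Return the flat list of grouping sets denoted by a possibly nested spec."""
    if isinstance(spec, str):  # bare key: one single-key grouping set
        return [(spec,)]
    kind, *args = spec
    if kind == "rollup":  # prefixes of the key list, longest first
        return [tuple(args[:i]) for i in range(len(args), -1, -1)]
    if kind == "cube":  # every subset of the key list
        return [s for n in range(len(args), -1, -1) for s in combinations(args, n)]
    if kind == "sets":  # GROUPING SETS: concatenate each member's expansion
        return [s for arg in args for s in expand(arg)]
    raise ValueError(f"unknown grouping construct: {kind!r}")


# grouping sets (rollup (a, b), a, grouping sets (cube (a, b)))
spec = ("sets", ("rollup", "a", "b"), "a", ("sets", ("cube", "a", "b")))
flat = expand(spec)

print(len(flat))                        # 8
print(sum(1 for s in flat if s == ()))  # 2
```

The flattened list contains two empty grouping sets. Assuming each empty grouping set yields one row even over an empty table, that would be consistent with the two rows DuckDB returns above.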

cpcloud · 2024-09-20T18:49:21Z

This is a special level of hell.

cpcloud added feature Features or general enhancements ux User experience related issues labels Aug 27, 2024

cpcloud marked this pull request as draft August 27, 2024 16:56

cpcloud force-pushed the rollup-cube-grouping-sets branch 4 times, most recently from 1719532 to 176adc8 Compare September 5, 2024 11:34


cpcloud requested review from jcrist and gforsyth September 5, 2024 12:52

cpcloud force-pushed the rollup-cube-grouping-sets branch from 86d21fa to 441ec7a Compare September 5, 2024 15:03

cpcloud linked an issue Sep 11, 2024 that may be closed by this pull request

feat: grouping sets functionality #550

Open

cpcloud added 4 commits September 19, 2024 05:21

feat(api): grouping sets/rollup/cube

cd26da4

tests(tpcds): query 05; adjust to deal with obsessively anti-null cli…

aded68b

…ckhouse behavior

test(tpcds): query 14

c25c9d3

test(tpcds): query 18 and hold clickhouses hand

5333e5e

cpcloud added 11 commits September 19, 2024 05:21

test(tpcds): query 80

6b1e6e2

chore: remove setting

e6fad67

chore: fix decompiler

7b4e87e

chore: fix expression tests

1c4dea2

chore: split up aggregate compilation to avoid copypasta

b6a8b79

chore(datafusion): regen sql

4f80e36

chore: move group_id to groupby.py

22768e7

chore: support deferreds in grouping sets

e45b119

chore: add docstrings to cube, rollup and grouping_sets

f5a3a78

chore: add docstring to group_id function

633bb68

chore: add callout about cube cost

1c0553c

cpcloud force-pushed the rollup-cube-grouping-sets branch from 441ec7a to 1c0553c Compare September 19, 2024 09:23

chore: remove use of deprecated .c selector

f2aaaf6

cpcloud added this to the 10.0 milestone Sep 19, 2024

cpcloud marked this pull request as ready for review September 19, 2024 10:32
