From d42158e6286c5b7ee13201f8d8663dd82302b550 Mon Sep 17 00:00:00 2001 From: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com> Date: Sat, 13 Jan 2024 15:41:37 +0100 Subject: [PATCH 1/4] Some improvements to the "Reference" homepage (#695) --- altdoc/reference_home.Rmd | 66 +++++++++++++++++++++------------------ 1 file changed, 36 insertions(+), 30 deletions(-) diff --git a/altdoc/reference_home.Rmd b/altdoc/reference_home.Rmd index cd0c7134e..e0d58279d 100644 --- a/altdoc/reference_home.Rmd +++ b/altdoc/reference_home.Rmd @@ -26,10 +26,12 @@ to choose between eager and lazy evaluation, that require respectively a for grouped data). We can apply functions directly on a `DataFrame` or `LazyFrame`, such as `rename()` -or `drop()`. Most (but not all!) functions that can be applied to `DataFrame`s -can also be used on `LazyFrame`s. Calling `$lazy()` yields -a `LazyFrame`. While calling `$collect()` starts a computation and -yields a `DataFrame` as result. +or `drop()`. Most functions that can be applied to `DataFrame`s can also be used +on `LazyFrame`s, but some are specific to one or the other. For example: + +* `$equals()` exists for `DataFrame` but not for `LazyFrame`; +* `$collect()` executes a lazy query, which means it can only be applied on + a `LazyFrame`. Another common data structure is the `Series`, which can be considered as the equivalent of R vectors in `polars`' world. Therefore, a `DataFrame` is a list of @@ -48,10 +50,10 @@ of contexts: * filter rows with `filter()`; * group and aggregate rows with `group_by()` and `agg()` -Inside each context, you can use various **expressions** (aka. `Expr`). Some expressions cannot -be used in some contexts. For example, in `with_columns()`, you can only apply -expressions that return either the same number of values or a single value that -will be duplicated on all rows: +Inside each context, you can use various **expressions** (aka. `Expr`). Some +expressions cannot be used in some contexts. 
For example, in `with_columns()`, +you can only apply expressions that return either the same number of values or a +single value that will be duplicated on all rows: ```{r} test = pl$DataFrame(mtcars) @@ -71,7 +73,7 @@ test$with_columns( ) ``` -By contrast, in an `agg` context, any number of return values are possible, as +By contrast, in an `agg()` context, any number of return values are possible, as they are returned in a list, and only the new columns or the grouping columns are returned. @@ -86,33 +88,38 @@ test$group_by(pl$col("cyl"))$agg( ## Expressions -`polars` is quite verbose and requires you to be very explicit on the operations -you want to perform. This can be seen in the way expressions work. All polars -public functions (excluding methods) are accessed via the namespace handle `pl`. +Expressions are the building blocks that give all the flexibility we need to +modify or create new columns. Two important expressions starters are `pl$col()` (names a column in the context) and `pl$lit()` (wraps a literal value or vector/series in an Expr). Most other expression starters are syntactic sugar derived from thereof, e.g. `pl$sum(_)` is actually `pl$col(_)$sum()`. -Expressions can be chained with about 170 expression methods such as `$sum()` +Expressions can be chained with more than 170 expression methods such as `$sum()` which aggregates e.g. the column with summing. ```{r} # two examples of starting, chaining and combining expressions pl$DataFrame(a = 1:4)$with_columns( - # take col mpg, slice it, sum it, then cast it - pl$col("a")$slice(0, 2)$sum()$cast(pl$Float32)$alias("a_slice_sum_cast"), - # take 1:3, name it, then sum, then multiply with two - pl$lit(1:3)$alias("lit_sum_add_two")$sum() * 2L, - # similar to above, but with `mul()`-method instead of `*`. 
- pl$lit(1:3)$sum()$mul(pl$col("a"))$alias("lit_sum_add_mpg") + # compute the cosine of column "a" + a_cos = pl$col("a")$cos(), + # standardize the values of column "a" + a_stand = (pl$col("a") - pl$col("a")$mean()) / pl$col("a")$std(), + # take 1:3, sum it, then multiply by two + lit_sum_times_two = pl$lit(1:3)$sum() * 2L ) ``` -Moreover there are subnamespaces with special methods only applicable for a -specific type `dt`(datetime), `arr`(list), `str`(strings), `struct`(structs), -`cat`(categoricals) and `bin`(binary). As a sidenote, there is also an exotic +Some methods share a common name but their behavior might be very different +depending on the input type. For example, `$decode()` doesn't do the same thing +when it is applied to binary data or to string data. + +To be able to distinguish those usages and to check the validity of a query, +`polars` stores methods in subnamespaces. For each datatype other than numeric +(floats and integers), there is a subnamespace containing the available methods: +`dt` (datetime), `list` (list), `str` (strings), `struct` (structs), `cat` +(categoricals) and `bin` (binary). As a sidenote, there is also an exotic subnamespace called `meta` which is rarely used to manipulate the expressions themselves. Each subsection in the "Expressions" section lists all operations available for a specific subnamespace. 
@@ -133,21 +140,20 @@ df = pl$DataFrame( df ``` -The function `year()` only makes sense for date-time data, so the type of input -that can receive this function is `dt` (for **d**ate-**t**ime): +The function `year()` only makes sense for date-time data, so we look for functions +in the `dt` subnamespace (for **d**ate-**t**ime): ```{r} df$with_columns( - pl$col("date")$dt$year()$alias("year") + year = pl$col("date")$dt$year() ) ``` -Similarly, if we have text data that we want to convert text to uppercase, we -use the `str` prefix before using `to_uppercase()`: +Similarly, to convert a string column to uppercase, we use the `str` prefix +before using `to_uppercase()`: ```{r} # Create the DataFrame -df = pl$DataFrame(foo = c("jake", "mary", "john peter")) - -df$select(pl$col("foo")$str$to_uppercase()) +pl$DataFrame(foo = c("jake", "mary", "john peter"))$ + with_columns(upper = pl$col("foo")$str$to_uppercase()) ``` From 3545ee2a8caba75fb01c0c6bb48c5ddb2f6dda13 Mon Sep 17 00:00:00 2001 From: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com> Date: Sat, 13 Jan 2024 16:35:34 +0100 Subject: [PATCH 2/4] Create a separate class `RollingGroupBy` (#694) Co-authored-by: eitsupi <50911393+eitsupi@users.noreply.github.com> --- DESCRIPTION | 1 + NAMESPACE | 5 ++ NEWS.md | 3 +- R/as_polars.R | 3 + R/dataframe__frame.R | 8 +- R/group_by.R | 67 ++++++--------- R/group_by_rolling.R | 133 +++++++++++++++++++++++++++++ R/zzz.R | 3 + altdoc/altdoc_preprocessing.R | 2 +- man/RollingGroupBy_agg.Rd | 33 +++++++ man/RollingGroupBy_class.Rd | 11 +++ man/RollingGroupBy_ungroup.Rd | 25 ++++++ man/as_polars_df.Rd | 3 + tests/testthat/_snaps/dataframe.md | 24 ++++++ tests/testthat/_snaps/groupby.md | 44 +++++----- tests/testthat/test-as_polars.R | 2 + tests/testthat/test-dataframe.R | 9 ++ 17 files changed, 309 insertions(+), 67 deletions(-) create mode 100644 R/group_by_rolling.R create mode 100644 man/RollingGroupBy_agg.Rd create mode 100644 man/RollingGroupBy_class.Rd create 
mode 100644 man/RollingGroupBy_ungroup.Rd diff --git a/DESCRIPTION b/DESCRIPTION index 30d385c82..d22cba48b 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -91,6 +91,7 @@ Collate: 'functions__lazy.R' 'functions__whenthen.R' 'group_by.R' + 'group_by_rolling.R' 'info.R' 'ipc.R' 'is_polars.R' diff --git a/NAMESPACE b/NAMESPACE index 600ef8800..20145cd38 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -26,6 +26,7 @@ S3method("$",RPolarsProtoExprArray) S3method("$",RPolarsRField) S3method("$",RPolarsRNullValues) S3method("$",RPolarsRThreadHandle) +S3method("$",RPolarsRollingGroupBy) S3method("$",RPolarsSQLContext) S3method("$",RPolarsSeries) S3method("$",RPolarsStringCacheHolder) @@ -77,6 +78,7 @@ S3method("[[",RPolarsProtoExprArray) S3method("[[",RPolarsRField) S3method("[[",RPolarsRNullValues) S3method("[[",RPolarsRThreadHandle) +S3method("[[",RPolarsRollingGroupBy) S3method("[[",RPolarsSQLContext) S3method("[[",RPolarsSeries) S3method("[[",RPolarsStringCacheHolder) @@ -94,6 +96,7 @@ S3method(.DollarNames,RPolarsGroupBy) S3method(.DollarNames,RPolarsLazyFrame) S3method(.DollarNames,RPolarsRField) S3method(.DollarNames,RPolarsRThreadHandle) +S3method(.DollarNames,RPolarsRollingGroupBy) S3method(.DollarNames,RPolarsSQLContext) S3method(.DollarNames,RPolarsSeries) S3method(.DollarNames,RPolarsThen) @@ -115,6 +118,7 @@ S3method(as_polars_df,RPolarsDataFrame) S3method(as_polars_df,RPolarsGroupBy) S3method(as_polars_df,RPolarsLazyFrame) S3method(as_polars_df,RPolarsLazyGroupBy) +S3method(as_polars_df,RPolarsRollingGroupBy) S3method(as_polars_df,RPolarsSeries) S3method(as_polars_df,data.frame) S3method(as_polars_df,default) @@ -165,6 +169,7 @@ S3method(print,RPolarsLazyFrame) S3method(print,RPolarsLazyGroupBy) S3method(print,RPolarsRField) S3method(print,RPolarsRThreadHandle) +S3method(print,RPolarsRollingGroupBy) S3method(print,RPolarsSQLContext) S3method(print,RPolarsSeries) S3method(print,RPolarsThen) diff --git a/NEWS.md b/NEWS.md index 3430ce1a5..af637b1ac 100644 --- a/NEWS.md 
+++ b/NEWS.md @@ -4,7 +4,8 @@ ### What's changed -- New method `$rolling()` for `DataFrame` and `LazyFrame` (#682). +- New method `$rolling()` for `DataFrame` and `LazyFrame`. When this is + applied, it creates an object of class `RPolarsRollingGroupBy` (#682, #694). - New method `$sink_ndjson()` for LazyFrame (#681). - New function `pl$duration()` to create a duration by components (week, day, hour, etc.), and use them with date(time) variables (#692). diff --git a/R/as_polars.R b/R/as_polars.R index dc871ff1a..7d127279e 100644 --- a/R/as_polars.R +++ b/R/as_polars.R @@ -102,6 +102,9 @@ as_polars_df.RPolarsGroupBy = function(x, ...) { x$ungroup() } +#' @rdname as_polars_df +#' @export +as_polars_df.RPolarsRollingGroupBy = as_polars_df.RPolarsGroupBy #' @rdname as_polars_df #' @export diff --git a/R/dataframe__frame.R b/R/dataframe__frame.R index 76e5fbba4..5a1269975 100644 --- a/R/dataframe__frame.R +++ b/R/dataframe__frame.R @@ -1828,8 +1828,8 @@ DataFrame_write_ndjson = function(file) { #' pl$max("a")$alias("max_a") #' ) DataFrame_rolling = function(index_column, period, offset = NULL, closed = "right", by = NULL, check_sorted = TRUE) { - out = self$lazy()$rolling(index_column, period, offset, closed, by, check_sorted) - attr(out, "is_rolling_group_by") = TRUE - class(out) = "RPolarsGroupBy" - out + if (is.null(offset)) { + offset = paste0("-", period) + } + construct_rolling_group_by(self, index_column, period, offset, closed, by, check_sorted) } diff --git a/R/group_by.R b/R/group_by.R index 33e4803df..8d3ded52f 100644 --- a/R/group_by.R +++ b/R/group_by.R @@ -12,7 +12,6 @@ NULL - RPolarsGroupBy = new.env(parent = emptyenv()) #' @export @@ -25,28 +24,30 @@ RPolarsGroupBy = new.env(parent = emptyenv()) #' @export `[[.RPolarsGroupBy` = `$.RPolarsGroupBy` -#' @title auto complete $-access into a polars object -#' @description called by the interactive R session internally -#' @param x GroupBy -#' @param pattern code-stump as string to auto-complete -#' @return 
char vec #' @export -#' @inherit .DollarNames.RPolarsDataFrame return #' @noRd .DollarNames.RPolarsGroupBy = function(x, pattern = "") { paste0(ls(RPolarsGroupBy, pattern = pattern), "()") } - #' The internal GroupBy constructor #' @return The input as grouped DataFrame #' @noRd construct_group_by = function(df, groupby_input, maintain_order) { - if (!inherits(df, "RPolarsDataFrame")) stop("internal error: construct_group called not on DataFrame") - df = df$clone() - attr(df, "private") = list(groupby_input = unlist(groupby_input), maintain_order = maintain_order) - class(df) = "RPolarsGroupBy" - df + if (!inherits(df, "RPolarsDataFrame")) { + stop("internal error: construct_group called not on DataFrame") + } + # Make an empty object. Store everything (including data) in attributes, so + # that we can keep the RPolarsDataFrame class on the data but still return + # a RPolarsGroupBy object here. + out = c(" ") + attr(out, "private") = list( + dat = df$clone(), + groupby_input = unlist(groupby_input), + maintain_order = maintain_order + ) + class(out) = "RPolarsGroupBy" + out } @@ -58,13 +59,13 @@ construct_group_by = function(df, groupby_input, maintain_order) { #' @return self #' @export #' -#' @examples pl$DataFrame(iris)$group_by("Species") +#' @examples +#' pl$DataFrame(iris)$group_by("Species") print.RPolarsGroupBy = function(x, ...) { - .pr$DataFrame$print(x) - cat("groups: ") prv = attr(x, "private") - cat(toString(prv$groupby_input)) - cat("\nmaintain order: ", prv$maintain_order) + .pr$DataFrame$print(prv$dat) + cat("groups:", toString(prv$groupby_input)) + cat("\nmaintain order:", prv$maintain_order) invisible(x) } @@ -86,18 +87,13 @@ print.RPolarsGroupBy = function(x, ...) { #' pl$col("bar")$mean()$alias("bar_tail_sum") #' ) GroupBy_agg = function(...) 
{ - if (isTRUE(attributes(self)[["is_rolling_group_by"]])) { - class(self) = "RPolarsLazyGroupBy" - self$agg(unpack_list(..., .context = "in $agg():"))$collect(no_optimization = TRUE) - } else { - class(self) = "RPolarsDataFrame" - self$lazy()$group_by( - attr(self, "private")$groupby_input, - maintain_order = attr(self, "private")$maintain_order - )$ - agg(...)$ - collect(no_optimization = TRUE) - } + prv = attr(self, "private") + prv$dat$lazy()$group_by( + prv$groupby_input, + maintain_order = prv$maintain_order + )$ + agg(...)$ + collect(no_optimization = TRUE) } @@ -300,13 +296,6 @@ GroupBy_null_count = function() { #' #' gb$ungroup() GroupBy_ungroup = function() { - if (isTRUE(attributes(self)[["is_rolling_group_by"]])) { - class(self) = "RPolarsLazyGroupBy" - self = self$ungroup()$collect(no_optimization = TRUE) - } else { - self = .pr$DataFrame$clone_in_rust(self) - class(self) = "RPolarsDataFrame" - attr(self, "private") = NULL - } - self + prv = attr(self, "private") + prv$dat } diff --git a/R/group_by_rolling.R b/R/group_by_rolling.R new file mode 100644 index 000000000..29d2d1738 --- /dev/null +++ b/R/group_by_rolling.R @@ -0,0 +1,133 @@ +#' Operations on Polars DataFrame grouped by rolling windows +#' +#' @return not applicable +#' @name RollingGroupBy_class +NULL + +RPolarsRollingGroupBy = new.env(parent = emptyenv()) + +#' @export +`$.RPolarsRollingGroupBy` = function(self, name) { + func = RPolarsRollingGroupBy[[name]] + environment(func) = environment() + func +} + +#' @export +`[[.RPolarsRollingGroupBy` = `$.RPolarsRollingGroupBy` + +#' @export +#' @noRd +.DollarNames.RPolarsRollingGroupBy = function(x, pattern = "") { + paste0(ls(RPolarsRollingGroupBy, pattern = pattern), "()") +} + +#' The internal RollingGroupBy constructor +#' @return The input as grouped DataFrame +#' @noRd +construct_rolling_group_by = function(df, index_column, period, offset, closed, by, check_sorted) { + if (!inherits(df, "RPolarsDataFrame")) { + stop("internal error: 
construct_rolling_group_by not called on a DataFrame") + } + # Make an empty object. Store everything (including data) in attributes, so + # that we can keep the RPolarsDataFrame class on the data but still return + # a RPolarsRollingGroupBy object here. + out = c(" ") + attr(out, "private") = list( + dat = df$clone(), + index_column = index_column, + period = period, + offset = offset, + closed = closed, + by = by, + check_sorted = check_sorted + ) + class(out) = "RPolarsRollingGroupBy" + out +} + +#' print RollingGroupBy +#' +#' @param x RollingGroupBy +#' @param ... not used +#' @noRd +#' @return self +#' @export +#' +#' @examples +#' df = pl$DataFrame( +#' dt = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-08"), +#' a = c(3, 7, 5, 9, 2, 1) +#' )$with_columns( +#' pl$col("dt")$str$strptime(pl$Date, format = NULL)$set_sorted() +#' ) +#' +#' df$rolling(index_column = "dt", period = "2d") +print.RPolarsRollingGroupBy = function(x, ...) { + prv = attr(x, "private") + .pr$DataFrame$print(prv$dat) + cat(paste("index column:", prv$index_column)) + cat(paste("\nother groups:", toString(prv$by))) + cat(paste("\nperiod:", prv$period)) + cat(paste("\noffset:", prv$offset)) + cat(paste("\nclosed:", prv$closed)) +} + + +#' Aggregate over a RollingGroupBy +#' +#' Aggregate a DataFrame over a rolling window created with `$rolling()`. +#' +#' @param ... Exprs to aggregate over. Those can also be passed wrapped in a +#' list, e.g `$agg(list(e1,e2,e3))`. 
+#' +#' @return An aggregated [DataFrame][DataFrame_class] +#' @examples +#' df = pl$DataFrame( +#' dt = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-08"), +#' a = c(3, 7, 5, 9, 2, 1) +#' )$with_columns( +#' pl$col("dt")$str$strptime(pl$Date, format = NULL)$set_sorted() +#' ) +#' +#' df$rolling(index_column = "dt", period = "2d")$agg( +#' pl$col("a"), +#' pl$sum("a")$alias("sum_a"), +#' pl$min("a")$alias("min_a"), +#' pl$max("a")$alias("max_a") +#' ) +RollingGroupBy_agg = function(...) { + prv = attr(self, "private") + prv$dat$ + lazy()$ + rolling( + index_column = prv$index_column, + period = prv$period, + offset = prv$offset, + closed = prv$closed, + by = prv$by, + check_sorted = prv$check_sorted + )$ + agg(unpack_list(..., .context = "in $agg():"))$ + collect(no_optimization = TRUE) +} + +#' Ungroup a RollingGroupBy object +#' +#' Revert the `$rolling()` operation. Doing `$rolling(...)$ungroup()` +#' returns the original `DataFrame`. +#' +#' @return [DataFrame][DataFrame_class] +#' @examples +#' df = pl$DataFrame( +#' dt = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-08"), +#' a = c(3, 7, 5, 9, 2, 1) +#' )$with_columns( +#' pl$col("dt")$str$strptime(pl$Date, format = NULL)$set_sorted() +#' ) +#' +#' df$rolling(index_column = "dt", period = "2d")$ungroup() +RollingGroupBy_ungroup = function() { + prv = attr(self, "private") + prv$dat +} diff --git a/R/zzz.R b/R/zzz.R index e81a8ad04..4413b3279 100644 --- a/R/zzz.R +++ b/R/zzz.R @@ -22,6 +22,9 @@ replace_private_with_pub_methods(RPolarsLazyFrame, "^LazyFrame_") # LazyGroupBy replace_private_with_pub_methods(RPolarsLazyGroupBy, "^LazyGroupBy_") +# RollingGroupBy +replace_private_with_pub_methods(RPolarsRollingGroupBy, "^RollingGroupBy_") + # Expr replace_private_with_pub_methods(RPolarsExpr, "^Expr_") diff --git a/altdoc/altdoc_preprocessing.R b/altdoc/altdoc_preprocessing.R index 7f99d3caf..55a73f3b3 100644 --- a/altdoc/altdoc_preprocessing.R +++ 
b/altdoc/altdoc_preprocessing.R @@ -36,7 +36,7 @@ out = list() # order determines order in sidebar classes = c( "pl", "Series", "DataFrame", "LazyFrame", "GroupBy", - "LazyGroupBy", "ExprList", "ExprBin", "ExprCat", "ExprDT", + "LazyGroupBy", "RollingGroupBy", "ExprList", "ExprBin", "ExprCat", "ExprDT", "ExprMeta", "ExprName", "ExprStr", "ExprStruct", "Expr", "IO", "RField", "RThreadHandle", "SQLContext", "S3" ) diff --git a/man/RollingGroupBy_agg.Rd b/man/RollingGroupBy_agg.Rd new file mode 100644 index 000000000..41273eb64 --- /dev/null +++ b/man/RollingGroupBy_agg.Rd @@ -0,0 +1,33 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/group_by_rolling.R +\name{RollingGroupBy_agg} +\alias{RollingGroupBy_agg} +\title{Aggregate over a RollingGroupBy} +\usage{ +RollingGroupBy_agg(...) +} +\arguments{ +\item{...}{Exprs to aggregate over. Those can also be passed wrapped in a +list, e.g \verb{$agg(list(e1,e2,e3))}.} +} +\value{ +An aggregated \link[=DataFrame_class]{DataFrame} +} +\description{ +Aggregate a DataFrame over a rolling window created with \verb{$rolling()}. 
+} +\examples{ +df = pl$DataFrame( + dt = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-08"), + a = c(3, 7, 5, 9, 2, 1) +)$with_columns( + pl$col("dt")$str$strptime(pl$Date, format = NULL)$set_sorted() +) + +df$rolling(index_column = "dt", period = "2d")$agg( + pl$col("a"), + pl$sum("a")$alias("sum_a"), + pl$min("a")$alias("min_a"), + pl$max("a")$alias("max_a") +) +} diff --git a/man/RollingGroupBy_class.Rd b/man/RollingGroupBy_class.Rd new file mode 100644 index 000000000..4ec6fd354 --- /dev/null +++ b/man/RollingGroupBy_class.Rd @@ -0,0 +1,11 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/group_by_rolling.R +\name{RollingGroupBy_class} +\alias{RollingGroupBy_class} +\title{Operations on Polars DataFrame grouped by rolling windows} +\value{ +not applicable +} +\description{ +Operations on Polars DataFrame grouped by rolling windows +} diff --git a/man/RollingGroupBy_ungroup.Rd b/man/RollingGroupBy_ungroup.Rd new file mode 100644 index 000000000..188d462d9 --- /dev/null +++ b/man/RollingGroupBy_ungroup.Rd @@ -0,0 +1,25 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/group_by_rolling.R +\name{RollingGroupBy_ungroup} +\alias{RollingGroupBy_ungroup} +\title{Ungroup a RollingGroupBy object} +\usage{ +RollingGroupBy_ungroup() +} +\value{ +\link[=DataFrame_class]{DataFrame} +} +\description{ +Revert the \verb{$rolling()} operation. Doing \verb{$rolling(...)$ungroup()} +returns the original \code{DataFrame}. 
+} +\examples{ +df = pl$DataFrame( + dt = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-08"), + a = c(3, 7, 5, 9, 2, 1) +)$with_columns( + pl$col("dt")$str$strptime(pl$Date, format = NULL)$set_sorted() +) + +df$rolling(index_column = "dt", period = "2d")$ungroup() +} diff --git a/man/as_polars_df.Rd b/man/as_polars_df.Rd index 4e092d150..3333e91b4 100644 --- a/man/as_polars_df.Rd +++ b/man/as_polars_df.Rd @@ -6,6 +6,7 @@ \alias{as_polars_df.data.frame} \alias{as_polars_df.RPolarsDataFrame} \alias{as_polars_df.RPolarsGroupBy} +\alias{as_polars_df.RPolarsRollingGroupBy} \alias{as_polars_df.RPolarsSeries} \alias{as_polars_df.RPolarsLazyFrame} \alias{as_polars_df.RPolarsLazyGroupBy} @@ -22,6 +23,8 @@ as_polars_df(x, ...) \method{as_polars_df}{RPolarsGroupBy}(x, ...) +\method{as_polars_df}{RPolarsRollingGroupBy}(x, ...) + \method{as_polars_df}{RPolarsSeries}(x, ...) \method{as_polars_df}{RPolarsLazyFrame}( diff --git a/tests/testthat/_snaps/dataframe.md b/tests/testthat/_snaps/dataframe.md index c67b2d0d5..74740ff95 100644 --- a/tests/testthat/_snaps/dataframe.md +++ b/tests/testthat/_snaps/dataframe.md @@ -38,3 +38,27 @@ & carb 4, 4, 1, 1, 2, 1, 4, 2, 2, 4 & literal 42, 42, 42, 42, 42, 42, 42, 42, 42, 42 +# rolling for DataFrame: prints all info + + Code + df$rolling(index_column = "dt", period = "2i") + Output + shape: (6, 2) + ┌───────┬─────┐ + │ index ┆ a │ + │ --- ┆ --- │ + │ f64 ┆ f64 │ + ╞═══════╪═════╡ + │ 1.0 ┆ 3.0 │ + │ 2.0 ┆ 7.0 │ + │ 3.0 ┆ 5.0 │ + │ 4.0 ┆ 9.0 │ + │ 5.0 ┆ 2.0 │ + │ 6.0 ┆ 1.0 │ + └───────┴─────┘ + index column: dt + other groups: + period: 2i + offset: -2i + closed: right + diff --git a/tests/testthat/_snaps/groupby.md b/tests/testthat/_snaps/groupby.md index 6e0a08eac..bcdc0e6df 100644 --- a/tests/testthat/_snaps/groupby.md +++ b/tests/testthat/_snaps/groupby.md @@ -16,7 +16,7 @@ │ two ┆ 1.0 │ └─────┴─────┘ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print 
.name=POLARS_FMT_TABLE_CELL_ALIGNMENT, .value=RIGHT @@ -36,7 +36,7 @@ │ two ┆ 1.0 │ └─────┴─────┘ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_DATAFRAME_SHAPE_BELOW, .value=1 @@ -56,7 +56,7 @@ └─────┴─────┘ shape: (5, 2) groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_FORMATTING, .value=ASCII_FULL @@ -80,7 +80,7 @@ | two | 1.0 | +-----+-----+ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_FORMATTING, .value=ASCII_FULL_CONDENSED @@ -100,7 +100,7 @@ | two | 1.0 | +-----+-----+ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_FORMATTING, .value=ASCII_NO_BORDERS @@ -122,7 +122,7 @@ -----+----- two | 1.0 groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_FORMATTING, .value=ASCII_BORDERS_ONLY @@ -146,7 +146,7 @@ | two 1.0 | +-----------+ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_FORMATTING, .value=ASCII_BORDERS_ONLY_CONDENSED @@ -166,7 +166,7 @@ | two 1.0 | +-----------+ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_FORMATTING, .value=ASCII_HORIZONTAL_ONLY @@ -190,7 +190,7 @@ two 1.0 ----------- groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_FORMATTING, .value=ASCII_MARKDOWN @@ -208,7 +208,7 @@ | one | 4.0 | | two | 1.0 | groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_FORMATTING, .value=UTF8_FULL @@ -232,7 +232,7 @@ │ two ┆ 1.0 │ └─────┴─────┘ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_FORMATTING, .value=UTF8_FULL_CONDENSED @@ -252,7 +252,7 @@ │ two ┆ 1.0 │ └─────┴─────┘ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print 
.name=POLARS_FMT_TABLE_FORMATTING, .value=UTF8_NO_BORDERS @@ -274,7 +274,7 @@ ╌╌╌╌╌┼╌╌╌╌╌ two ┆ 1.0 groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_FORMATTING, .value=UTF8_BORDERS_ONLY @@ -294,7 +294,7 @@ │ two 1.0 │ └───────────┘ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_FORMATTING, .value=UTF8_HORIZONTAL_ONLY @@ -318,7 +318,7 @@ two 1.0 ─────────── groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_FORMATTING, .value=NOTHING @@ -335,7 +335,7 @@ one 4.0 two 1.0 groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_HIDE_COLUMN_DATA_TYPES, .value=1 @@ -353,7 +353,7 @@ │ two ┆ 1.0 │ └─────┴─────┘ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_HIDE_COLUMN_NAMES, .value=1 @@ -371,7 +371,7 @@ │ two ┆ 1.0 │ └─────┴─────┘ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_HIDE_COLUMN_SEPARATOR, .value=1 @@ -390,7 +390,7 @@ │ two ┆ 1.0 │ └─────┴─────┘ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_TABLE_HIDE_DATAFRAME_SHAPE_INFORMATION, .value=1 @@ -409,7 +409,7 @@ │ two ┆ 1.0 │ └─────┴─────┘ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print .name=POLARS_FMT_MAX_ROWS, .value=2 @@ -427,7 +427,7 @@ │ two ┆ 1.0 │ └─────┴─────┘ groups: foo - maintain order: TRUE + maintain order: TRUE # groupby print when several groups @@ -445,5 +445,5 @@ │ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0 │ └──────┴─────┴───────┴───────┘ groups: mpg, cyl, disp - maintain order: TRUE + maintain order: TRUE diff --git a/tests/testthat/test-as_polars.R b/tests/testthat/test-as_polars.R index 8ee8d0129..a1eb5e67d 100644 --- a/tests/testthat/test-as_polars.R +++ b/tests/testthat/test-as_polars.R @@ -12,6 +12,8 @@ make_as_polars_df_cases = function() { "polars_lf", 
pl$LazyFrame(test_df), "polars_group_by", pl$DataFrame(test_df)$group_by("col_int"), "polars_lazy_group_by", pl$LazyFrame(test_df)$group_by("col_int"), + "polars_rolling_group_by", pl$DataFrame(test_df)$rolling("col_int", period = "1i"), + "polars_lazy_rolling_group_by", pl$LazyFrame(test_df)$rolling("col_int", period = "1i"), "arrow Table", arrow::as_arrow_table(test_df) ) } diff --git a/tests/testthat/test-dataframe.R b/tests/testthat/test-dataframe.R index aa063dd57..dd060a2b2 100644 --- a/tests/testthat/test-dataframe.R +++ b/tests/testthat/test-dataframe.R @@ -1267,6 +1267,15 @@ test_that("rolling for DataFrame: basic example", { ) }) +test_that("rolling for DataFrame: prints all info", { + df = pl$DataFrame( + index = c(1:5, 6.0), + a = c(3, 7, 5, 9, 2, 1) + )$with_columns(pl$col("index")$set_sorted()) + + expect_snapshot(df$rolling(index_column = "dt", period = "2i")) +}) + test_that("rolling for DataFrame: can be ungrouped", { df = pl$DataFrame( index = c(1:5, 6.0), From b0a70e959a139053c546aca4b60612f4143a2dd2 Mon Sep 17 00:00:00 2001 From: Etienne Bacher <52219252+etiennebacher@users.noreply.github.com> Date: Sun, 14 Jan 2024 08:53:11 +0100 Subject: [PATCH 3/4] Implement `group_by_dynamic()` for `DataFrame` and `LazyFrame` (#691) Co-authored-by: eitsupi --- DESCRIPTION | 1 + NAMESPACE | 5 + NEWS.md | 2 + R/as_polars.R | 4 + R/dataframe__frame.R | 88 ++++++ R/expr__expr.R | 8 +- R/extendr-wrappers.R | 2 + R/group_by_dynamic.R | 133 ++++++++++ R/group_by_rolling.R | 8 +- R/lazyframe__lazy.R | 115 +++++++- R/zzz.R | 5 +- man/DataFrame_group_by_dynamic.Rd | 180 +++++++++++++ man/DynamicGroupBy_agg.Rd | 82 ++++++ man/DynamicGroupBy_class.Rd | 11 + man/DynamicGroupBy_ungroup.Rd | 29 ++ man/Expr_rolling.Rd | 3 + man/LazyFrame_group_by_dynamic.Rd | 184 +++++++++++++ man/LazyFrame_rolling.Rd | 3 + man/as_polars_df.Rd | 3 + src/rust/src/lazy/dataframe.rs | 38 +++ src/rust/src/rdatatype.rs | 30 +++ src/rust/src/utils/mod.rs | 6 + 
tests/testthat/_snaps/after-wrappers.md | 110 ++++---- tests/testthat/_snaps/dataframe.md | 24 -- tests/testthat/test-as_polars.R | 2 + tests/testthat/test-dataframe.R | 9 - tests/testthat/test-groupby.R | 339 +++++++++++++++++++++++- 27 files changed, 1313 insertions(+), 111 deletions(-) create mode 100644 R/group_by_dynamic.R create mode 100644 man/DataFrame_group_by_dynamic.Rd create mode 100644 man/DynamicGroupBy_agg.Rd create mode 100644 man/DynamicGroupBy_class.Rd create mode 100644 man/DynamicGroupBy_ungroup.Rd create mode 100644 man/LazyFrame_group_by_dynamic.Rd diff --git a/DESCRIPTION b/DESCRIPTION index d22cba48b..40f5face0 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -91,6 +91,7 @@ Collate: 'functions__lazy.R' 'functions__whenthen.R' 'group_by.R' + 'group_by_dynamic.R' 'group_by_rolling.R' 'info.R' 'ipc.R' diff --git a/NAMESPACE b/NAMESPACE index 20145cd38..11b611a3a 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -9,6 +9,7 @@ S3method("$",RPolarsChainedWhen) S3method("$",RPolarsDataFrame) S3method("$",RPolarsDataType) S3method("$",RPolarsDataTypeVector) +S3method("$",RPolarsDynamicGroupBy) S3method("$",RPolarsErr) S3method("$",RPolarsExpr) S3method("$",RPolarsExprBinNameSpace) @@ -69,6 +70,7 @@ S3method("[[",RPolarsChainedWhen) S3method("[[",RPolarsDataFrame) S3method("[[",RPolarsDataType) S3method("[[",RPolarsDataTypeVector) +S3method("[[",RPolarsDynamicGroupBy) S3method("[[",RPolarsErr) S3method("[[",RPolarsExpr) S3method("[[",RPolarsGroupBy) @@ -90,6 +92,7 @@ S3method("|",RPolarsExpr) S3method(.DollarNames,RPolarsChainedThen) S3method(.DollarNames,RPolarsChainedWhen) S3method(.DollarNames,RPolarsDataFrame) +S3method(.DollarNames,RPolarsDynamicGroupBy) S3method(.DollarNames,RPolarsErr) S3method(.DollarNames,RPolarsExpr) S3method(.DollarNames,RPolarsGroupBy) @@ -115,6 +118,7 @@ S3method(as.matrix,RPolarsLazyFrame) S3method(as.vector,RPolarsSeries) S3method(as_polars_df,ArrowTabular) S3method(as_polars_df,RPolarsDataFrame) 
+S3method(as_polars_df,RPolarsDynamicGroupBy) S3method(as_polars_df,RPolarsGroupBy) S3method(as_polars_df,RPolarsLazyFrame) S3method(as_polars_df,RPolarsLazyGroupBy) @@ -162,6 +166,7 @@ S3method(print,RPolarsChainedThen) S3method(print,RPolarsChainedWhen) S3method(print,RPolarsDataFrame) S3method(print,RPolarsDataType) +S3method(print,RPolarsDynamicGroupBy) S3method(print,RPolarsErr) S3method(print,RPolarsExpr) S3method(print,RPolarsGroupBy) diff --git a/NEWS.md b/NEWS.md index af637b1ac..7917d8c82 100644 --- a/NEWS.md +++ b/NEWS.md @@ -6,6 +6,8 @@ - New method `$rolling()` for `DataFrame` and `LazyFrame`. When this is applied, it creates an object of class `RPolarsRollingGroupBy` (#682, #694). +- New method `$group_by_dynamic()` for `DataFrame` and `LazyFrame`. When this + is applied, it creates an object of class `RPolarsDynamicGroupBy` (#691). - New method `$sink_ndjson()` for LazyFrame (#681). - New function `pl$duration()` to create a duration by components (week, day, hour, etc.), and use them with date(time) variables (#692). diff --git a/R/as_polars.R b/R/as_polars.R index 7d127279e..9fbcedecc 100644 --- a/R/as_polars.R +++ b/R/as_polars.R @@ -106,6 +106,10 @@ as_polars_df.RPolarsGroupBy = function(x, ...) { #' @export as_polars_df.RPolarsRollingGroupBy = as_polars_df.RPolarsGroupBy +#' @rdname as_polars_df +#' @export +as_polars_df.RPolarsDynamicGroupBy = as_polars_df.RPolarsGroupBy + #' @rdname as_polars_df #' @export as_polars_df.RPolarsSeries = function(x, ...) 
{ diff --git a/R/dataframe__frame.R b/R/dataframe__frame.R index 5a1269975..7d99c0a65 100644 --- a/R/dataframe__frame.R +++ b/R/dataframe__frame.R @@ -1833,3 +1833,91 @@ DataFrame_rolling = function(index_column, period, offset = NULL, closed = "righ } construct_rolling_group_by(self, index_column, period, offset, closed, by, check_sorted) } + +#' @inherit LazyFrame_group_by_dynamic title description details params +#' @return A [GroupBy][GroupBy_class] object +#' +#' @examples +#' df = pl$DataFrame( +#' time = pl$date_range( +#' start = strptime("2021-12-16 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"), +#' end = strptime("2021-12-16 03:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"), +#' interval = "30m", +#' eager = TRUE, +#' ), +#' n = 0:6 +#' ) +#' +#' # get the sum in the following hour relative to the "time" column +#' df$group_by_dynamic("time", every = "1h")$agg( +#' vals = pl$col("n"), +#' sum = pl$col("n")$sum() +#' ) +#' +#' # using "include_boundaries = TRUE" is helpful to see the period considered +#' df$group_by_dynamic("time", every = "1h", include_boundaries = TRUE)$agg( +#' vals = pl$col("n") +#' ) +#' +#' # in the example above, the values didn't include the one *exactly* 1h after +#' # the start because "closed = 'left'" by default. +#' # Changing it to "right" includes values that are exactly 1h after. Note that +#' # the value at 00:00:00 now becomes included in the interval [23:00:00 - 00:00:00], +#' # even if this interval wasn't there originally +#' df$group_by_dynamic("time", every = "1h", closed = "right")$agg( +#' vals = pl$col("n") +#' ) +#' # To keep both boundaries, we use "closed = 'both'". 
Some values now belong to +#' # several groups: +#' df$group_by_dynamic("time", every = "1h", closed = "both")$agg( +#' vals = pl$col("n") +#' ) +#' +#' # Dynamic group bys can also be combined with grouping on normal keys +#' df = df$with_columns(groups = pl$Series(c("a", "a", "a", "b", "b", "a", "a"))) +#' df +#' +#' df$group_by_dynamic( +#' "time", +#' every = "1h", +#' closed = "both", +#' by = "groups", +#' include_boundaries = TRUE +#' )$agg(pl$col("n")) +#' +#' # We can also create a dynamic group by based on an index column +#' df = pl$LazyFrame( +#' idx = 0:5, +#' A = c("A", "A", "B", "B", "B", "C") +#' )$with_columns(pl$col("idx")$set_sorted()) +#' df +#' +#' df$group_by_dynamic( +#' "idx", +#' every = "2i", +#' period = "3i", +#' include_boundaries = TRUE, +#' closed = "right" +#' )$agg(A_agg_list = pl$col("A")) +DataFrame_group_by_dynamic = function( + index_column, + every, + period = NULL, + offset = NULL, + include_boundaries = FALSE, + closed = "left", + label = "left", + by = NULL, + start_by = "window", + check_sorted = TRUE) { + if (is.null(offset)) { + offset = paste0("-", every) + } + if (is.null(period)) { + period = every + } + construct_group_by_dynamic( + self, index_column, every, period, offset, include_boundaries, closed, label, + by, start_by, check_sorted + ) +} diff --git a/R/expr__expr.R b/R/expr__expr.R index 36f7f07d3..ed2918a3b 100644 --- a/R/expr__expr.R +++ b/R/expr__expr.R @@ -3507,6 +3507,7 @@ Expr_peak_max = function() { #' column represents an index, it has to be either Int32 or Int64. Note that #' Int32 gets temporarily cast to Int64, so if performance matters use an Int64 #' column. +#' @param ... Ignored. #' @param period Length of the window, must be non-negative. #' @param offset Offset of the window. Default is `-period`. 
#' @param closed Define which sides of the temporal interval are closed @@ -3569,8 +3570,11 @@ Expr_peak_max = function() { #' df$with_columns( #' sum_a_offset1 = pl$sum("a")$rolling(index_column = "dt", period = "2d", offset = "1d") #' ) -Expr_rolling = function(index_column, period, offset = NULL, - closed = "right", check_sorted = TRUE) { +Expr_rolling = function( + index_column, + ..., + period, offset = NULL, + closed = "right", check_sorted = TRUE) { if (is.null(offset)) { offset = paste0("-", period) } diff --git a/R/extendr-wrappers.R b/R/extendr-wrappers.R index c36dfd39c..7da6a937e 100644 --- a/R/extendr-wrappers.R +++ b/R/extendr-wrappers.R @@ -1121,6 +1121,8 @@ RPolarsLazyFrame$with_context <- function(contexts) .Call(wrap__RPolarsLazyFrame RPolarsLazyFrame$rolling <- function(index_column, period, offset, closed, by, check_sorted) .Call(wrap__RPolarsLazyFrame__rolling, self, index_column, period, offset, closed, by, check_sorted) +RPolarsLazyFrame$group_by_dynamic <- function(index_column, every, period, offset, label, include_boundaries, closed, by, start_by, check_sorted) .Call(wrap__RPolarsLazyFrame__group_by_dynamic, self, index_column, every, period, offset, label, include_boundaries, closed, by, start_by, check_sorted) + #' @export `$.RPolarsLazyFrame` <- function (self, name) { func <- RPolarsLazyFrame[[name]]; environment(func) <- environment(); func } diff --git a/R/group_by_dynamic.R b/R/group_by_dynamic.R new file mode 100644 index 000000000..104b1ff63 --- /dev/null +++ b/R/group_by_dynamic.R @@ -0,0 +1,133 @@ +#' Operations on Polars DataFrame grouped on time or integer values +#' +#' @return not applicable +#' @name DynamicGroupBy_class +NULL + +RPolarsDynamicGroupBy = new.env(parent = emptyenv()) + +#' @export +`$.RPolarsDynamicGroupBy` = function(self, name) { + func = RPolarsDynamicGroupBy[[name]] + environment(func) = environment() + func +} + +#' @export +`[[.RPolarsDynamicGroupBy` = `$.RPolarsDynamicGroupBy` + +#' @export +#' @noRd 
+.DollarNames.RPolarsDynamicGroupBy = function(x, pattern = "") { + paste0(ls(RPolarsDynamicGroupBy, pattern = pattern), "()") +} + +#' The internal DynamicGroupBy constructor +#' @return The input as grouped DataFrame +#' @noRd +construct_group_by_dynamic = function( + df, index_column, every, period, offset, include_boundaries, closed, label, + by, start_by, check_sorted) { + if (!inherits(df, "RPolarsDataFrame")) { + stop("internal error: construct_group called not on DataFrame") + } + # Make an empty object. Store everything (including data) in attributes, so + # that we can keep the RPolarsDataFrame class on the data but still return + # a RPolarsDynamicGroupBy object here. + out = c(" ") + attr(out, "private") = list( + dat = df$clone(), + index_column = index_column, + every = every, + period = period, + offset = offset, + include_boundaries = include_boundaries, + closed = closed, + label = label, + by = by, + start_by = start_by, + check_sorted = check_sorted + ) + class(out) = "RPolarsDynamicGroupBy" + out +} + +#' print DynamicGroupBy +#' +#' @param x DataFrame +#' @param ... not used +#' @noRd +#' @return self +#' @export +#' +#' @examples +#' df = pl$DataFrame( +#' time = pl$date_range( +#' start = strptime("2021-12-16 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"), +#' end = strptime("2021-12-16 03:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"), +#' interval = "30m", +#' eager = TRUE, +#' ), +#' n = 0:6 +#' ) +#' +#' # get the sum in the following hour relative to the "time" column +#' df$group_by_dynamic("time", every = "1h") +print.RPolarsDynamicGroupBy = function(x, ...) { + .pr$DataFrame$print(attr(x, "private")$dat) +} + + +#' Aggregate over a DynamicGroupBy +#' +#' Aggregate a DataFrame over a time or integer window created with +#' `$group_by_dynamic()`. +#' +#' @param ... Exprs to aggregate over. Those can also be passed wrapped in a +#' list, e.g `$agg(list(e1,e2,e3))`. 
+#' +#' @return An aggregated [DataFrame][DataFrame_class] +#' @inherit DataFrame_group_by_dynamic examples +DynamicGroupBy_agg = function(...) { + prv = attr(self, "private") + prv$dat$ + lazy()$ + group_by_dynamic( + index_column = prv$index_column, + every = prv$every, + period = prv$period, + offset = prv$offset, + include_boundaries = prv$include_boundaries, + closed = prv$closed, + label = prv$label, + by = prv$by, + start_by = prv$start_by, + check_sorted = prv$check_sorted + )$ + agg(unpack_list(..., .context = "in $agg():"))$ + collect(no_optimization = TRUE) +} + +#' Ungroup a DynamicGroupBy object +#' +#' Revert the `$group_by_dynamic()` operation. Doing +#' `$group_by_dynamic(...)$ungroup()` returns the original `DataFrame`. +#' +#' @return [DataFrame][DataFrame_class] +#' @examples +#' df = pl$DataFrame( +#' time = pl$date_range( +#' start = strptime("2021-12-16 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"), +#' end = strptime("2021-12-16 03:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"), +#' interval = "30m", +#' eager = TRUE, +#' ), +#' n = 0:6 +#' ) +#' df +#' +#' df$group_by_dynamic("time", every = "1h")$ungroup() +DynamicGroupBy_ungroup = function() { + prv = attr(self, "private") + prv$dat +} diff --git a/R/group_by_rolling.R b/R/group_by_rolling.R index 29d2d1738..ef81d2f37 100644 --- a/R/group_by_rolling.R +++ b/R/group_by_rolling.R @@ -64,13 +64,7 @@ construct_rolling_group_by = function(df, index_column, period, offset, closed, #' #' df$rolling(index_column = "dt", period = "2d") print.RPolarsRollingGroupBy = function(x, ...) 
{ - prv = attr(x, "private") - .pr$DataFrame$print(prv$dat) - cat(paste("index column:", prv$index)) - cat(paste("\nother groups:", toString(prv$by))) - cat(paste("\nperiod:", prv$period)) - cat(paste("\noffset:", prv$offset)) - cat(paste("\nclosed:", prv$closed)) + .pr$DataFrame$print(attr(x, "private")$dat) } diff --git a/R/lazyframe__lazy.R b/R/lazyframe__lazy.R index 9a2bd0a09..c7569ceaf 100644 --- a/R/lazyframe__lazy.R +++ b/R/lazyframe__lazy.R @@ -1710,7 +1710,8 @@ LazyFrame_with_context = function(other) { #' pl$min("a")$alias("min_a"), #' pl$max("a")$alias("max_a") #' )$collect() -LazyFrame_rolling = function(index_column, period, offset = NULL, closed = "right", by = NULL, check_sorted = TRUE) { +LazyFrame_rolling = function( + index_column, ..., period, offset = NULL, closed = "right", by = NULL, check_sorted = TRUE) { if (is.null(offset)) { offset = paste0("-", period) } @@ -1720,3 +1721,115 @@ LazyFrame_rolling = function(index_column, period, offset = NULL, closed = "righ ) |> unwrap("in $rolling():") } + + +#' Group based on a date/time or integer column +#' +#' @inherit LazyFrame_rolling description details params +#' +#' @param every Interval of the window. +#' @param include_boundaries Add two columns `"_lower_boundary"` and +#' `"_upper_boundary"` columns that show the boundaries of the window. This will +#' impact performance because it’s harder to parallelize. +#' @param label Define which label to use for the window: +#' * `"left"`: lower boundary of the window +#' * `"right"`: upper boundary of the window +#' * `"datapoint"`: the first value of the index column in the given window. If +#' you don’t need the label to be at one of the boundaries, choose this option +#' for maximum performance. +#' @param start_by The strategy to determine the start of the first window by: +#' * `"window"`: start by taking the earliest timestamp, truncating it with `every`, +#' and then adding `offset`. Note that weekly windows start on Monday. 
+#' * `"datapoint"`: start from the first encountered data point. +#' * a day of the week (only takes effect if `every` contains `"w"`): `"monday"` +#' starts the window on the Monday before the first data point, etc. +#' +#' @return A [LazyGroupBy][LazyGroupBy_class] object +#' +#' @examples +#' lf = pl$LazyFrame( +#' time = pl$date_range( +#' start = strptime("2021-12-16 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"), +#' end = strptime("2021-12-16 03:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC"), +#' interval = "30m", +#' eager = TRUE, +#' ), +#' n = 0:6 +#' ) +#' lf$collect() +#' +#' # get the sum in the following hour relative to the "time" column +#' lf$group_by_dynamic("time", every = "1h")$agg( +#' vals = pl$col("n"), +#' sum = pl$col("n")$sum() +#' )$collect() +#' +#' # using "include_boundaries = TRUE" is helpful to see the period considered +#' lf$group_by_dynamic("time", every = "1h", include_boundaries = TRUE)$agg( +#' vals = pl$col("n") +#' )$collect() +#' +#' # in the example above, the values didn't include the one *exactly* 1h after +#' # the start because "closed = 'left'" by default. +#' # Changing it to "right" includes values that are exactly 1h after. Note that +#' # the value at 00:00:00 now becomes included in the interval [23:00:00 - 00:00:00], +#' # even if this interval wasn't there originally +#' lf$group_by_dynamic("time", every = "1h", closed = "right")$agg( +#' vals = pl$col("n") +#' )$collect() +#' # To keep both boundaries, we use "closed = 'both'". 
Some values now belong to +#' # several groups: +#' lf$group_by_dynamic("time", every = "1h", closed = "both")$agg( +#' vals = pl$col("n") +#' )$collect() +#' +#' # Dynamic group bys can also be combined with grouping on normal keys +#' lf = lf$with_columns(groups = pl$Series(c("a", "a", "a", "b", "b", "a", "a"))) +#' lf$collect() +#' +#' lf$group_by_dynamic( +#' "time", +#' every = "1h", +#' closed = "both", +#' by = "groups", +#' include_boundaries = TRUE +#' )$agg(pl$col("n"))$collect() +#' +#' # We can also create a dynamic group by based on an index column +#' lf = pl$LazyFrame( +#' idx = 0:5, +#' A = c("A", "A", "B", "B", "B", "C") +#' )$with_columns(pl$col("idx")$set_sorted()) +#' lf$collect() +#' +#' lf$group_by_dynamic( +#' "idx", +#' every = "2i", +#' period = "3i", +#' include_boundaries = TRUE, +#' closed = "right" +#' )$agg(A_agg_list = pl$col("A"))$collect() +LazyFrame_group_by_dynamic = function( + index_column, + ..., + every, + period = NULL, + offset = NULL, + include_boundaries = FALSE, + closed = "left", + label = "left", + by = NULL, + start_by = "window", + check_sorted = TRUE) { + if (is.null(offset)) { + offset = paste0("-", every) + } + if (is.null(period)) { + period = every + } + .pr$LazyFrame$group_by_dynamic( + self, index_column, every, period, offset, label, include_boundaries, closed, + wrap_elist_result(by, str_to_lit = FALSE), start_by, check_sorted + ) |> + unwrap("in $group_by_dynamic():") +} diff --git a/R/zzz.R b/R/zzz.R index 4413b3279..c07ea05a9 100644 --- a/R/zzz.R +++ b/R/zzz.R @@ -22,9 +22,12 @@ replace_private_with_pub_methods(RPolarsLazyFrame, "^LazyFrame_") # LazyGroupBy replace_private_with_pub_methods(RPolarsLazyGroupBy, "^LazyGroupBy_") -# LazyGroupBy +# RollingGroupBy replace_private_with_pub_methods(RPolarsRollingGroupBy, "^RollingGroupBy_") +# DynamicGroupBy +replace_private_with_pub_methods(RPolarsDynamicGroupBy, "^DynamicGroupBy_") + # Expr replace_private_with_pub_methods(RPolarsExpr, "^Expr_") diff --git 
a/man/DataFrame_group_by_dynamic.Rd b/man/DataFrame_group_by_dynamic.Rd new file mode 100644 index 000000000..ac240d21b --- /dev/null +++ b/man/DataFrame_group_by_dynamic.Rd @@ -0,0 +1,180 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/dataframe__frame.R +\name{DataFrame_group_by_dynamic} +\alias{DataFrame_group_by_dynamic} +\title{Group based on a date/time or integer column} +\usage{ +DataFrame_group_by_dynamic( + index_column, + every, + period = NULL, + offset = NULL, + include_boundaries = FALSE, + closed = "left", + label = "left", + by = NULL, + start_by = "window", + check_sorted = TRUE +) +} +\arguments{ +\item{index_column}{Column used to group based on the time window. Often of +type Date/Datetime. This column must be sorted in ascending order (or, if \code{by} +is specified, then it must be sorted in ascending order within each group). In +case of a rolling group by on indices, dtype needs to be either Int32 or Int64. +Note that Int32 gets temporarily cast to Int64, so if performance matters use +an Int64 column.} + +\item{every}{Interval of the window.} + +\item{period}{Length of the window, must be non-negative.} + +\item{offset}{Offset of the window. Default is \code{-period}.} + +\item{include_boundaries}{Add two columns \code{"_lower_boundary"} and +\code{"_upper_boundary"} columns that show the boundaries of the window. This will +impact performance because it’s harder to parallelize.} + +\item{closed}{Define which sides of the temporal interval are closed +(inclusive). This can be either \code{"left"}, \code{"right"}, \code{"both"} or \code{"none"}.} + +\item{label}{Define which label to use for the window: +\itemize{ +\item \code{"left"}: lower boundary of the window +\item \code{"right"}: upper boundary of the window +\item \code{"datapoint"}: the first value of the index column in the given window. If +you don’t need the label to be at one of the boundaries, choose this option +for maximum performance. 
+}} + +\item{by}{Also group by this column/these columns.} + +\item{start_by}{The strategy to determine the start of the first window by: +\itemize{ +\item \code{"window"}: start by taking the earliest timestamp, truncating it with \code{every}, +and then adding \code{offset}. Note that weekly windows start on Monday. +\item \code{"datapoint"}: start from the first encountered data point. +\item a day of the week (only takes effect if \code{every} contains \code{"w"}): \code{"monday"} +starts the window on the Monday before the first data point, etc. +}} + +\item{check_sorted}{Check whether data is actually sorted. Checking it is +expensive so if you are sure the data within the \code{index_column} is sorted, you +can set this to \code{FALSE} but note that if the data actually is unsorted, it +will lead to incorrect output.} +} +\value{ +A \link[=GroupBy_class]{GroupBy} object +} +\description{ +If you have a time series \verb{<t_0, t_1, ..., t_n>}, then by default the windows +created will be: +\itemize{ +\item (t_0 - period, t_0] +\item (t_1 - period, t_1] +\item … +\item (t_n - period, t_n] +} + +whereas if you pass a non-default offset, then the windows will be: +\itemize{ +\item (t_0 + offset, t_0 + offset + period] +\item (t_1 + offset, t_1 + offset + period] +\item … +\item (t_n + offset, t_n + offset + period] +} +} +\details{ +The period and offset arguments are created either from a timedelta, or by +using the following string language: +\itemize{ +\item 1ns (1 nanosecond) +\item 1us (1 microsecond) +\item 1ms (1 millisecond) +\item 1s (1 second) +\item 1m (1 minute) +\item 1h (1 hour) +\item 1d (1 calendar day) +\item 1w (1 calendar week) +\item 1mo (1 calendar month) +\item 1q (1 calendar quarter) +\item 1y (1 calendar year) +\item 1i (1 index count) +} + +Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds + +By "calendar day", we mean the corresponding time on the next day (which may +not be 24 hours, due to daylight savings). 
Similarly for "calendar week", +"calendar month", "calendar quarter", and "calendar year". + +In case of a rolling operation on an integer column, the windows are defined +by: +\itemize{ +\item "1i" # length 1 +\item "10i" # length 10 +} +} +\examples{ +df = pl$DataFrame( + time = pl$date_range( + start = strptime("2021-12-16 00:00:00", format = "\%Y-\%m-\%d \%H:\%M:\%S", tz = "UTC"), + end = strptime("2021-12-16 03:00:00", format = "\%Y-\%m-\%d \%H:\%M:\%S", tz = "UTC"), + interval = "30m", + eager = TRUE, + ), + n = 0:6 +) + +# get the sum in the following hour relative to the "time" column +df$group_by_dynamic("time", every = "1h")$agg( + vals = pl$col("n"), + sum = pl$col("n")$sum() +) + +# using "include_boundaries = TRUE" is helpful to see the period considered +df$group_by_dynamic("time", every = "1h", include_boundaries = TRUE)$agg( + vals = pl$col("n") +) + +# in the example above, the values didn't include the one *exactly* 1h after +# the start because "closed = 'left'" by default. +# Changing it to "right" includes values that are exactly 1h after. Note that +# the value at 00:00:00 now becomes included in the interval [23:00:00 - 00:00:00], +# even if this interval wasn't there originally +df$group_by_dynamic("time", every = "1h", closed = "right")$agg( + vals = pl$col("n") +) +# To keep both boundaries, we use "closed = 'both'". 
Some values now belong to +# several groups: +df$group_by_dynamic("time", every = "1h", closed = "both")$agg( + vals = pl$col("n") +) + +# Dynamic group bys can also be combined with grouping on normal keys +df = df$with_columns(groups = pl$Series(c("a", "a", "a", "b", "b", "a", "a"))) +df + +df$group_by_dynamic( + "time", + every = "1h", + closed = "both", + by = "groups", + include_boundaries = TRUE +)$agg(pl$col("n")) + +# We can also create a dynamic group by based on an index column +df = pl$LazyFrame( + idx = 0:5, + A = c("A", "A", "B", "B", "B", "C") +)$with_columns(pl$col("idx")$set_sorted()) +df + +df$group_by_dynamic( + "idx", + every = "2i", + period = "3i", + include_boundaries = TRUE, + closed = "right" +)$agg(A_agg_list = pl$col("A")) +} diff --git a/man/DynamicGroupBy_agg.Rd b/man/DynamicGroupBy_agg.Rd new file mode 100644 index 000000000..3d6a9cc47 --- /dev/null +++ b/man/DynamicGroupBy_agg.Rd @@ -0,0 +1,82 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/group_by_dynamic.R +\name{DynamicGroupBy_agg} +\alias{DynamicGroupBy_agg} +\title{Aggregate over a DynamicGroupBy} +\usage{ +DynamicGroupBy_agg(...) +} +\arguments{ +\item{...}{Exprs to aggregate over. Those can also be passed wrapped in a +list, e.g \verb{$agg(list(e1,e2,e3))}.} +} +\value{ +An aggregated \link[=DataFrame_class]{DataFrame} +} +\description{ +Aggregate a DataFrame over a time or integer window created with +\verb{$group_by_dynamic()}. 
+} +\examples{ +df = pl$DataFrame( + time = pl$date_range( + start = strptime("2021-12-16 00:00:00", format = "\%Y-\%m-\%d \%H:\%M:\%S", tz = "UTC"), + end = strptime("2021-12-16 03:00:00", format = "\%Y-\%m-\%d \%H:\%M:\%S", tz = "UTC"), + interval = "30m", + eager = TRUE, + ), + n = 0:6 +) + +# get the sum in the following hour relative to the "time" column +df$group_by_dynamic("time", every = "1h")$agg( + vals = pl$col("n"), + sum = pl$col("n")$sum() +) + +# using "include_boundaries = TRUE" is helpful to see the period considered +df$group_by_dynamic("time", every = "1h", include_boundaries = TRUE)$agg( + vals = pl$col("n") +) + +# in the example above, the values didn't include the one *exactly* 1h after +# the start because "closed = 'left'" by default. +# Changing it to "right" includes values that are exactly 1h after. Note that +# the value at 00:00:00 now becomes included in the interval [23:00:00 - 00:00:00], +# even if this interval wasn't there originally +df$group_by_dynamic("time", every = "1h", closed = "right")$agg( + vals = pl$col("n") +) +# To keep both boundaries, we use "closed = 'both'". 
Some values now belong to +# several groups: +df$group_by_dynamic("time", every = "1h", closed = "both")$agg( + vals = pl$col("n") +) + +# Dynamic group bys can also be combined with grouping on normal keys +df = df$with_columns(groups = pl$Series(c("a", "a", "a", "b", "b", "a", "a"))) +df + +df$group_by_dynamic( + "time", + every = "1h", + closed = "both", + by = "groups", + include_boundaries = TRUE +)$agg(pl$col("n")) + +# We can also create a dynamic group by based on an index column +df = pl$LazyFrame( + idx = 0:5, + A = c("A", "A", "B", "B", "B", "C") +)$with_columns(pl$col("idx")$set_sorted()) +df + +df$group_by_dynamic( + "idx", + every = "2i", + period = "3i", + include_boundaries = TRUE, + closed = "right" +)$agg(A_agg_list = pl$col("A")) +} diff --git a/man/DynamicGroupBy_class.Rd b/man/DynamicGroupBy_class.Rd new file mode 100644 index 000000000..52442ea1b --- /dev/null +++ b/man/DynamicGroupBy_class.Rd @@ -0,0 +1,11 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/group_by_dynamic.R +\name{DynamicGroupBy_class} +\alias{DynamicGroupBy_class} +\title{Operations on Polars DataFrame grouped on time or integer values} +\value{ +not applicable +} +\description{ +Operations on Polars DataFrame grouped on time or integer values +} diff --git a/man/DynamicGroupBy_ungroup.Rd b/man/DynamicGroupBy_ungroup.Rd new file mode 100644 index 000000000..ccb7e605a --- /dev/null +++ b/man/DynamicGroupBy_ungroup.Rd @@ -0,0 +1,29 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/group_by_dynamic.R +\name{DynamicGroupBy_ungroup} +\alias{DynamicGroupBy_ungroup} +\title{Ungroup a DynamicGroupBy object} +\usage{ +DynamicGroupBy_ungroup() +} +\value{ +\link[=DataFrame_class]{DataFrame} +} +\description{ +Revert the \verb{$group_by_dynamic()} operation. Doing +\verb{$group_by_dynamic(...)$ungroup()} returns the original \code{DataFrame}. 
+} +\examples{ +df = pl$DataFrame( + time = pl$date_range( + start = strptime("2021-12-16 00:00:00", format = "\%Y-\%m-\%d \%H:\%M:\%S", tz = "UTC"), + end = strptime("2021-12-16 03:00:00", format = "\%Y-\%m-\%d \%H:\%M:\%S", tz = "UTC"), + interval = "30m", + eager = TRUE, + ), + n = 0:6 +) +df + +df$group_by_dynamic("time", every = "1h")$ungroup() +} diff --git a/man/Expr_rolling.Rd b/man/Expr_rolling.Rd index d5307a82a..0f106ab95 100644 --- a/man/Expr_rolling.Rd +++ b/man/Expr_rolling.Rd @@ -6,6 +6,7 @@ \usage{ Expr_rolling( index_column, + ..., period, offset = NULL, closed = "right", @@ -19,6 +20,8 @@ column represents an index, it has to be either Int32 or Int64. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.} +\item{...}{Ignored.} + \item{period}{Length of the window, must be non-negative.} \item{offset}{Offset of the window. Default is \code{-period}.} diff --git a/man/LazyFrame_group_by_dynamic.Rd b/man/LazyFrame_group_by_dynamic.Rd new file mode 100644 index 000000000..c24614559 --- /dev/null +++ b/man/LazyFrame_group_by_dynamic.Rd @@ -0,0 +1,184 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/lazyframe__lazy.R +\name{LazyFrame_group_by_dynamic} +\alias{LazyFrame_group_by_dynamic} +\title{Group based on a date/time or integer column} +\usage{ +LazyFrame_group_by_dynamic( + index_column, + ..., + every, + period = NULL, + offset = NULL, + include_boundaries = FALSE, + closed = "left", + label = "left", + by = NULL, + start_by = "window", + check_sorted = TRUE +) +} +\arguments{ +\item{index_column}{Column used to group based on the time window. Often of +type Date/Datetime. This column must be sorted in ascending order (or, if \code{by} +is specified, then it must be sorted in ascending order within each group). In +case of a rolling group by on indices, dtype needs to be either Int32 or Int64. 
+Note that Int32 gets temporarily cast to Int64, so if performance matters use +an Int64 column.} + +\item{...}{Ignored.} + +\item{every}{Interval of the window.} + +\item{period}{Length of the window, must be non-negative.} + +\item{offset}{Offset of the window. Default is \code{-period}.} + +\item{include_boundaries}{Add two columns \code{"_lower_boundary"} and +\code{"_upper_boundary"} columns that show the boundaries of the window. This will +impact performance because it’s harder to parallelize.} + +\item{closed}{Define which sides of the temporal interval are closed +(inclusive). This can be either \code{"left"}, \code{"right"}, \code{"both"} or \code{"none"}.} + +\item{label}{Define which label to use for the window: +\itemize{ +\item \code{"left"}: lower boundary of the window +\item \code{"right"}: upper boundary of the window +\item \code{"datapoint"}: the first value of the index column in the given window. If +you don’t need the label to be at one of the boundaries, choose this option +for maximum performance. +}} + +\item{by}{Also group by this column/these columns.} + +\item{start_by}{The strategy to determine the start of the first window by: +\itemize{ +\item \code{"window"}: start by taking the earliest timestamp, truncating it with \code{every}, +and then adding \code{offset}. Note that weekly windows start on Monday. +\item \code{"datapoint"}: start from the first encountered data point. +\item a day of the week (only takes effect if \code{every} contains \code{"w"}): \code{"monday"} +starts the window on the Monday before the first data point, etc. +}} + +\item{check_sorted}{Check whether data is actually sorted. 
Checking it is +expensive so if you are sure the data within the \code{index_column} is sorted, you +can set this to \code{FALSE} but note that if the data actually is unsorted, it +will lead to incorrect output.} +} +\value{ +A \link[=LazyGroupBy_class]{LazyGroupBy} object +} +\description{ +If you have a time series \verb{<t_0, t_1, ..., t_n>}, then by default the windows +created will be: +\itemize{ +\item (t_0 - period, t_0] +\item (t_1 - period, t_1] +\item … +\item (t_n - period, t_n] +} + +whereas if you pass a non-default offset, then the windows will be: +\itemize{ +\item (t_0 + offset, t_0 + offset + period] +\item (t_1 + offset, t_1 + offset + period] +\item … +\item (t_n + offset, t_n + offset + period] +} +} +\details{ +The period and offset arguments are created either from a timedelta, or by +using the following string language: +\itemize{ +\item 1ns (1 nanosecond) +\item 1us (1 microsecond) +\item 1ms (1 millisecond) +\item 1s (1 second) +\item 1m (1 minute) +\item 1h (1 hour) +\item 1d (1 calendar day) +\item 1w (1 calendar week) +\item 1mo (1 calendar month) +\item 1q (1 calendar quarter) +\item 1y (1 calendar year) +\item 1i (1 index count) +} + +Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds + +By "calendar day", we mean the corresponding time on the next day (which may +not be 24 hours, due to daylight savings). Similarly for "calendar week", +"calendar month", "calendar quarter", and "calendar year". 
+ +In case of a rolling operation on an integer column, the windows are defined +by: +\itemize{ +\item "1i" # length 1 +\item "10i" # length 10 +} +} +\examples{ +lf = pl$LazyFrame( + time = pl$date_range( + start = strptime("2021-12-16 00:00:00", format = "\%Y-\%m-\%d \%H:\%M:\%S", tz = "UTC"), + end = strptime("2021-12-16 03:00:00", format = "\%Y-\%m-\%d \%H:\%M:\%S", tz = "UTC"), + interval = "30m", + eager = TRUE, + ), + n = 0:6 +) +lf$collect() + +# get the sum in the following hour relative to the "time" column +lf$group_by_dynamic("time", every = "1h")$agg( + vals = pl$col("n"), + sum = pl$col("n")$sum() +)$collect() + +# using "include_boundaries = TRUE" is helpful to see the period considered +lf$group_by_dynamic("time", every = "1h", include_boundaries = TRUE)$agg( + vals = pl$col("n") +)$collect() + +# in the example above, the values didn't include the one *exactly* 1h after +# the start because "closed = 'left'" by default. +# Changing it to "right" includes values that are exactly 1h after. Note that +# the value at 00:00:00 now becomes included in the interval [23:00:00 - 00:00:00], +# even if this interval wasn't there originally +lf$group_by_dynamic("time", every = "1h", closed = "right")$agg( + vals = pl$col("n") +)$collect() +# To keep both boundaries, we use "closed = 'both'". 
Some values now belong to +# several groups: +lf$group_by_dynamic("time", every = "1h", closed = "both")$agg( + vals = pl$col("n") +)$collect() + +# Dynamic group bys can also be combined with grouping on normal keys +lf = lf$with_columns(groups = pl$Series(c("a", "a", "a", "b", "b", "a", "a"))) +lf$collect() + +lf$group_by_dynamic( + "time", + every = "1h", + closed = "both", + by = "groups", + include_boundaries = TRUE +)$agg(pl$col("n"))$collect() + +# We can also create a dynamic group by based on an index column +lf = pl$LazyFrame( + idx = 0:5, + A = c("A", "A", "B", "B", "B", "C") +)$with_columns(pl$col("idx")$set_sorted()) +lf$collect() + +lf$group_by_dynamic( + "idx", + every = "2i", + period = "3i", + include_boundaries = TRUE, + closed = "right" +)$agg(A_agg_list = pl$col("A"))$collect() +} diff --git a/man/LazyFrame_rolling.Rd b/man/LazyFrame_rolling.Rd index 657d0a997..f4ab90dcb 100644 --- a/man/LazyFrame_rolling.Rd +++ b/man/LazyFrame_rolling.Rd @@ -6,6 +6,7 @@ \usage{ LazyFrame_rolling( index_column, + ..., period, offset = NULL, closed = "right", @@ -21,6 +22,8 @@ case of a rolling group by on indices, dtype needs to be either Int32 or Int64. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.} +\item{...}{Ignored.} + \item{period}{Length of the window, must be non-negative.} \item{offset}{Offset of the window. Default is \code{-period}.} diff --git a/man/as_polars_df.Rd b/man/as_polars_df.Rd index 3333e91b4..250e998ce 100644 --- a/man/as_polars_df.Rd +++ b/man/as_polars_df.Rd @@ -7,6 +7,7 @@ \alias{as_polars_df.RPolarsDataFrame} \alias{as_polars_df.RPolarsGroupBy} \alias{as_polars_df.RPolarsRollingGroupBy} +\alias{as_polars_df.RPolarsDynamicGroupBy} \alias{as_polars_df.RPolarsSeries} \alias{as_polars_df.RPolarsLazyFrame} \alias{as_polars_df.RPolarsLazyGroupBy} @@ -25,6 +26,8 @@ as_polars_df(x, ...) \method{as_polars_df}{RPolarsRollingGroupBy}(x, ...) +\method{as_polars_df}{RPolarsDynamicGroupBy}(x, ...) 
+ \method{as_polars_df}{RPolarsSeries}(x, ...) \method{as_polars_df}{RPolarsLazyFrame}( diff --git a/src/rust/src/lazy/dataframe.rs b/src/rust/src/lazy/dataframe.rs index 8a7920b11..4a81c8ac2 100644 --- a/src/rust/src/lazy/dataframe.rs +++ b/src/rust/src/lazy/dataframe.rs @@ -650,6 +650,44 @@ impl RPolarsLazyFrame { opt_state: self.0.get_current_optimizations(), }) } + + pub fn group_by_dynamic( + &self, + index_column: Robj, + every: Robj, + period: Robj, + offset: Robj, + label: Robj, + include_boundaries: Robj, + closed: Robj, + by: Robj, + start_by: Robj, + check_sorted: Robj, + ) -> RResult { + let closed_window = robj_to!(ClosedWindow, closed)?; + let by = robj_to!(VecPLExprCol, by)?; + let ldf = self.0.clone(); + let lazy_gb = ldf.group_by_dynamic( + robj_to!(PLExprCol, index_column)?, + by, + pl::DynamicGroupOptions { + every: robj_to!(pl_duration, every)?, + period: robj_to!(pl_duration, period)?, + offset: robj_to!(pl_duration, offset)?, + label: robj_to!(Label, label)?, + include_boundaries: robj_to!(bool, include_boundaries)?, + closed_window, + start_by: robj_to!(StartBy, start_by)?, + check_sorted: robj_to!(bool, check_sorted)?, + ..Default::default() + }, + ); + + Ok(RPolarsLazyGroupBy { + lgb: lazy_gb, + opt_state: self.0.get_current_optimizations(), + }) + } } #[derive(Clone)] diff --git a/src/rust/src/rdatatype.rs b/src/rust/src/rdatatype.rs index d6784f8f6..86012d008 100644 --- a/src/rust/src/rdatatype.rs +++ b/src/rust/src/rdatatype.rs @@ -499,6 +499,36 @@ pub fn robj_to_closed_window(robj: Robj) -> RResult { } } +pub fn robj_to_label(robj: Robj) -> RResult { + use pl::Label; + match robj_to_rchoice(robj)?.as_str() { + "left" => Ok(Label::Left), + "right" => Ok(Label::Right), + "datapoint" => Ok(Label::DataPoint), + s => rerr().bad_val(format!( + "Label choice ['{s}'] should be one of 'left', 'right', 'datapoint'" + )), + } +} + +pub fn robj_to_start_by(robj: Robj) -> RResult { + use pl::StartBy as SB; + match robj_to_rchoice(robj)?.as_str() { + 
"window" => Ok(SB::WindowBound), + "datapoint" => Ok(SB::DataPoint), + "monday" => Ok(SB::Monday), + "tuesday" => Ok(SB::Tuesday), + "wednesday" => Ok(SB::Wednesday), + "thursday" => Ok(SB::Thursday), + "friday" => Ok(SB::Friday), + "saturday" => Ok(SB::Saturday), + "sunday" => Ok(SB::Sunday), + s => rerr().bad_val(format!( + "StartBy choice ['{s}'] should be one of 'window', 'datapoint', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'" + )), + } +} + pub fn robj_to_parallel_strategy(robj: extendr_api::Robj) -> RResult { use pl::ParallelStrategy as PS; match robj_to_rchoice(robj)?.to_lowercase().as_str() { diff --git a/src/rust/src/utils/mod.rs b/src/rust/src/utils/mod.rs index 8be415f40..bc354e73e 100644 --- a/src/rust/src/utils/mod.rs +++ b/src/rust/src/utils/mod.rs @@ -943,6 +943,12 @@ macro_rules! robj_to_inner { (ClosedWindow, $a:ident) => { $crate::rdatatype::robj_to_closed_window($a) }; + (Label, $a:ident) => { + $crate::rdatatype::robj_to_label($a) + }; + (StartBy, $a:ident) => { + $crate::rdatatype::robj_to_start_by($a) + }; (new_quantile_interpolation_option, $a:ident) => { $crate::rdatatype::new_quantile_interpolation_option($a) }; diff --git a/tests/testthat/_snaps/after-wrappers.md b/tests/testthat/_snaps/after-wrappers.md index f570398d7..e6b7c00c1 100644 --- a/tests/testthat/_snaps/after-wrappers.md +++ b/tests/testthat/_snaps/after-wrappers.md @@ -72,23 +72,23 @@ Code ls(.pr$env[[class_name]]) Output - [1] "clone" "columns" "describe" "drop" - [5] "drop_in_place" "drop_nulls" "dtype_strings" "dtypes" - [9] "equals" "estimated_size" "explode" "fill_nan" - [13] "fill_null" "filter" "first" "get_column" - [17] "get_columns" "glimpse" "group_by" "head" - [21] "height" "join" "join_asof" "last" - [25] "lazy" "limit" "max" "mean" - [29] "median" "melt" "min" "n_chunks" - [33] "null_count" "pivot" "print" "quantile" - [37] "rechunk" "rename" "reverse" "rolling" - [41] "sample" "schema" "select" "shape" - [45] "shift" 
"shift_and_fill" "slice" "sort" - [49] "std" "sum" "tail" "to_data_frame" - [53] "to_list" "to_series" "to_struct" "transpose" - [57] "unique" "unnest" "var" "width" - [61] "with_columns" "with_row_count" "write_csv" "write_json" - [65] "write_ndjson" + [1] "clone" "columns" "describe" "drop" + [5] "drop_in_place" "drop_nulls" "dtype_strings" "dtypes" + [9] "equals" "estimated_size" "explode" "fill_nan" + [13] "fill_null" "filter" "first" "get_column" + [17] "get_columns" "glimpse" "group_by" "group_by_dynamic" + [21] "head" "height" "join" "join_asof" + [25] "last" "lazy" "limit" "max" + [29] "mean" "median" "melt" "min" + [33] "n_chunks" "null_count" "pivot" "print" + [37] "quantile" "rechunk" "rename" "reverse" + [41] "rolling" "sample" "schema" "select" + [45] "shape" "shift" "shift_and_fill" "slice" + [49] "sort" "std" "sum" "tail" + [53] "to_data_frame" "to_list" "to_series" "to_struct" + [57] "transpose" "unique" "unnest" "var" + [61] "width" "with_columns" "with_row_count" "write_csv" + [65] "write_json" "write_ndjson" --- @@ -140,25 +140,26 @@ [11] "fetch" "fill_nan" [13] "fill_null" "filter" [15] "first" "get_optimization_toggle" - [17] "group_by" "head" - [19] "join" "join_asof" - [21] "last" "limit" - [23] "max" "mean" - [25] "median" "melt" - [27] "min" "print" - [29] "profile" "quantile" - [31] "rename" "reverse" - [33] "rolling" "schema" - [35] "select" "set_optimization_toggle" - [37] "shift" "shift_and_fill" - [39] "sink_csv" "sink_ipc" - [41] "sink_ndjson" "sink_parquet" - [43] "slice" "sort" - [45] "std" "sum" - [47] "tail" "unique" - [49] "unnest" "var" - [51] "width" "with_columns" - [53] "with_context" "with_row_count" + [17] "group_by" "group_by_dynamic" + [19] "head" "join" + [21] "join_asof" "last" + [23] "limit" "max" + [25] "mean" "median" + [27] "melt" "min" + [29] "print" "profile" + [31] "quantile" "rename" + [33] "reverse" "rolling" + [35] "schema" "select" + [37] "set_optimization_toggle" "shift" + [39] "shift_and_fill" "sink_csv" + 
[41] "sink_ipc" "sink_ndjson" + [43] "sink_parquet" "slice" + [45] "sort" "std" + [47] "sum" "tail" + [49] "unique" "unnest" + [51] "var" "width" + [53] "with_columns" "with_context" + [55] "with_row_count" --- @@ -173,24 +174,25 @@ [11] "fill_nan" "fill_null" [13] "filter" "first" [15] "get_optimization_toggle" "group_by" - [17] "join" "join_asof" - [19] "last" "limit" - [21] "max" "mean" - [23] "median" "melt" - [25] "min" "print" - [27] "profile" "quantile" - [29] "rename" "reverse" - [31] "rolling" "schema" - [33] "select" "select_str_as_lit" - [35] "set_optimization_toggle" "shift" - [37] "shift_and_fill" "sink_csv" - [39] "sink_ipc" "sink_json" - [41] "sink_parquet" "slice" - [43] "sort_by_exprs" "std" - [45] "sum" "tail" - [47] "unique" "unnest" - [49] "var" "with_columns" - [51] "with_context" "with_row_count" + [17] "group_by_dynamic" "join" + [19] "join_asof" "last" + [21] "limit" "max" + [23] "mean" "median" + [25] "melt" "min" + [27] "print" "profile" + [29] "quantile" "rename" + [31] "reverse" "rolling" + [33] "schema" "select" + [35] "select_str_as_lit" "set_optimization_toggle" + [37] "shift" "shift_and_fill" + [39] "sink_csv" "sink_ipc" + [41] "sink_json" "sink_parquet" + [43] "slice" "sort_by_exprs" + [45] "std" "sum" + [47] "tail" "unique" + [49] "unnest" "var" + [51] "with_columns" "with_context" + [53] "with_row_count" # public and private methods of each class Expr diff --git a/tests/testthat/_snaps/dataframe.md b/tests/testthat/_snaps/dataframe.md index 74740ff95..c67b2d0d5 100644 --- a/tests/testthat/_snaps/dataframe.md +++ b/tests/testthat/_snaps/dataframe.md @@ -38,27 +38,3 @@ & carb 4, 4, 1, 1, 2, 1, 4, 2, 2, 4 & literal 42, 42, 42, 42, 42, 42, 42, 42, 42, 42 -# rolling for DataFrame: prints all info - - Code - df$rolling(index_column = "dt", period = "2i") - Output - shape: (6, 2) - ┌───────┬─────┐ - │ index ┆ a │ - │ --- ┆ --- │ - │ f64 ┆ f64 │ - ╞═══════╪═════╡ - │ 1.0 ┆ 3.0 │ - │ 2.0 ┆ 7.0 │ - │ 3.0 ┆ 5.0 │ - │ 4.0 ┆ 9.0 │ - │ 5.0 ┆ 
2.0 │ - │ 6.0 ┆ 1.0 │ - └───────┴─────┘ - index column: dt - other groups: - period: 2i - offset: -2i - closed: right - diff --git a/tests/testthat/test-as_polars.R b/tests/testthat/test-as_polars.R index a1eb5e67d..2d0bc4923 100644 --- a/tests/testthat/test-as_polars.R +++ b/tests/testthat/test-as_polars.R @@ -14,6 +14,8 @@ make_as_polars_df_cases = function() { "polars_lazy_group_by", pl$LazyFrame(test_df)$group_by("col_int"), "polars_rolling_group_by", pl$DataFrame(test_df)$rolling("col_int", period = "1i"), "polars_lazy_rolling_group_by", pl$LazyFrame(test_df)$rolling("col_int", period = "1i"), + "polars_group_by_dynamic", pl$DataFrame(test_df)$group_by_dynamic("col_int", every = "1i"), + "polars_lazy_group_by_dynamic", pl$LazyFrame(test_df)$group_by_dynamic("col_int", every = "1i"), "arrow Table", arrow::as_arrow_table(test_df) ) } diff --git a/tests/testthat/test-dataframe.R b/tests/testthat/test-dataframe.R index dd060a2b2..aa063dd57 100644 --- a/tests/testthat/test-dataframe.R +++ b/tests/testthat/test-dataframe.R @@ -1267,15 +1267,6 @@ test_that("rolling for DataFrame: basic example", { ) }) -test_that("rolling for DataFrame: prints all info", { - df = pl$DataFrame( - index = c(1:5, 6.0), - a = c(3, 7, 5, 9, 2, 1) - )$with_columns(pl$col("index")$set_sorted()) - - expect_snapshot(df$rolling(index_column = "dt", period = "2i")) -}) - test_that("rolling for DataFrame: can be ungrouped", { df = pl$DataFrame( index = c(1:5, 6.0), diff --git a/tests/testthat/test-groupby.R b/tests/testthat/test-groupby.R index e99bf55cc..e992d4485 100644 --- a/tests/testthat/test-groupby.R +++ b/tests/testthat/test-groupby.R @@ -1,3 +1,5 @@ +### group_by ------------------------------------------------ + df = pl$DataFrame( list( foo = c("one", "two", "two", "one", "two"), @@ -7,6 +9,19 @@ df = pl$DataFrame( gb = df$group_by("foo", maintain_order = TRUE) +test_that("groupby", { + df2 = gb$agg( + pl$col("bar")$sum()$alias("bar_sum"), + pl$col("bar")$mean()$alias("bar_tail_sum") + 
)$to_data_frame() + + expect_equal( + df2, + data.frame(foo = c("one", "two"), bar_sum = c(9, 6), bar_tail_sum = c(4.5, 2)) + ) +}) + + patrick::with_parameters_test_that("groupby print", { .env_var = .value @@ -21,19 +36,6 @@ test_that("groupby print when several groups", { expect_snapshot(df) }) -test_that("groupby", { - df2 = gb$agg( - pl$col("bar")$sum()$alias("bar_sum"), - pl$col("bar")$mean()$alias("bar_tail_sum") - )$to_data_frame() - - expect_equal( - df2, - data.frame(foo = c("one", "two"), bar_sum = c(9, 6), bar_tail_sum = c(4.5, 2)) - ) -}) - - make_cases = function() { tibble::tribble( ~.test_name, ~pola, ~base, @@ -162,3 +164,314 @@ test_that("LazyGroupBy clone", { expect_true(mem_address(lgb) != mem_address(lgb_clone)) expect_true(mem_address(lgb) == mem_address(lgb_copy)) }) + + + + + + +### group_by_dynamic ------------------------------------------------ + +test_that("group_by_dynamic for DataFrame calls the LazyFrame method", { + df = pl$DataFrame( + dt = as.Date(as.Date("2021-12-16"):as.Date("2021-12-22"), origin = "1970-01-01"), + n = 0:6 + )$with_columns( + pl$col("dt")$set_sorted() + ) + + actual = df$group_by_dynamic(index_column = "dt", every = "2d")$agg( + pl$col("n")$mean() + )$to_data_frame() + + expect_equal( + actual[, "n"], + c(0, 1.5, 3.5, 5.5) + ) +}) + +test_that("group_by_dynamic for LazyFrame: date variable", { + df = pl$LazyFrame( + dt = as.Date(as.Date("2021-12-16"):as.Date("2021-12-22"), origin = "1970-01-01"), + n = 0:6 + )$with_columns( + pl$col("dt")$set_sorted() + ) + + actual = df$group_by_dynamic(index_column = "dt", every = "2d")$agg( + pl$col("n")$mean() + )$collect()$to_data_frame() + + expect_equal( + actual[, "n"], + c(0, 1.5, 3.5, 5.5) + ) +}) + +test_that("group_by_dynamic for LazyFrame: datetime variable", { + df = pl$LazyFrame( + dt = c( + "2021-12-16 00:00:00", "2021-12-16 00:30:00", "2021-12-16 01:00:00", + "2021-12-16 01:30:00", "2021-12-16 02:00:00", "2021-12-16 02:30:00", + "2021-12-16 03:00:00" + ), + n = 
0:6 + )$with_columns( + pl$col("dt")$str$strptime(pl$Datetime("ms"), format = NULL)$set_sorted() + ) + + actual = df$group_by_dynamic(index_column = "dt", every = "1h")$agg( + pl$col("n")$mean() + )$collect()$to_data_frame() + + expect_equal( + actual[, "n"], + c(0.5, 2.5, 4.5, 6) + ) +}) + +test_that("group_by_dynamic for LazyFrame: integer variable", { + df = pl$LazyFrame( + idx = 0:5, + n = 0:5 + )$with_columns(pl$col("idx")$set_sorted()) + + actual = df$group_by_dynamic( + "idx", + every = "2i" + )$agg(pl$col("n")$mean())$collect()$to_data_frame() + + expect_equal( + actual[, "n"], + c(0.5, 2.5, 4.5) + ) +}) + +test_that("group_by_dynamic for LazyFrame: error if not explicitly sorted", { + df = pl$LazyFrame( + index = c(1L, 2L, 3L, 4L, 8L, 9L), + a = c(3, 7, 5, 9, 2, 1) + ) + expect_error( + df$group_by_dynamic(index_column = "index", every = "2i")$agg(pl$col("a"))$collect(), + "not explicitly sorted" + ) +}) + +test_that("group_by_dynamic for LazyFrame: arg 'closed' works", { + df = pl$LazyFrame( + dt = c( + "2021-12-16 00:00:00", "2021-12-16 00:30:00", "2021-12-16 01:00:00", + "2021-12-16 01:30:00", "2021-12-16 02:00:00", "2021-12-16 02:30:00", + "2021-12-16 03:00:00" + ), + n = 0:6 + )$with_columns( + pl$col("dt")$str$strptime(pl$Datetime("ms"), format = NULL)$set_sorted() + ) + + actual = df$group_by_dynamic(index_column = "dt", closed = "right", every = "1h")$agg( + pl$col("n")$mean() + )$collect()$to_data_frame() + + expect_equal( + actual[, "n"], + c(0, 1.5, 3.5, 5.5) + ) + + expect_error( + df$group_by_dynamic(index_column = "dt", closed = "foobar", every = "1h")$agg( + pl$col("n")$mean() + )$collect(), + "should be one of" + ) +}) + +test_that("group_by_dynamic for LazyFrame: arg 'label' works", { + df = pl$LazyFrame( + dt = c( + "2021-12-16 00:00:00", "2021-12-16 00:30:00", "2021-12-16 01:00:00", + "2021-12-16 01:30:00", "2021-12-16 02:00:00", "2021-12-16 02:30:00", + "2021-12-16 03:00:00" + ), + n = 0:6 + )$with_columns( + 
pl$col("dt")$str$strptime(pl$Datetime("ms"), format = NULL)$set_sorted()$dt$replace_time_zone("UTC") + ) + + actual = df$group_by_dynamic(index_column = "dt", label = "right", every = "1h")$agg( + pl$col("n")$mean() + )$collect()$to_data_frame() + + expect_equal( + actual[, "dt"], + as.POSIXct( + c("2021-12-16 01:00:00", "2021-12-16 02:00:00", "2021-12-16 03:00:00", "2021-12-16 04:00:00"), + tz = "UTC" + ) + ) + + expect_error( + df$group_by_dynamic(index_column = "dt", label = "foobar", every = "1h")$agg( + pl$col("n")$mean() + )$collect(), + "should be one of" + ) +}) + +test_that("group_by_dynamic for LazyFrame: arg 'start_by' works", { + df = pl$LazyFrame( + dt = c( + "2021-12-16 00:00:00", "2021-12-16 00:30:00", "2021-12-16 01:00:00", + "2021-12-16 01:30:00", "2021-12-16 02:00:00", "2021-12-16 02:30:00", + "2021-12-16 03:00:00" + ), + n = 0:6 + )$with_columns( + pl$col("dt")$str$strptime(pl$Datetime("ms", tz = "UTC"), format = NULL)$set_sorted() + ) + + # TODO: any weekday should return the same since it is ignored when there's no + # "w" in "every". 
+ # https://github.com/pola-rs/polars/issues/13648 + actual = df$group_by_dynamic(index_column = "dt", start_by = "monday", every = "1h")$agg( + pl$col("n")$mean() + )$collect()$to_data_frame() + + expect_equal( + actual[, "dt"], + as.POSIXct( + c("2021-12-16 00:00:00 UTC", "2021-12-16 01:00:00 UTC", "2021-12-16 02:00:00 UTC", "2021-12-16 03:00:00 UTC"), + tz = "UTC" + ) + ) + + expect_error( + df$group_by_dynamic(index_column = "dt", start_by = "foobar", every = "1h")$agg( + pl$col("n")$mean() + )$collect(), + "should be one of" + ) +}) + +test_that("group_by_dynamic for LazyFrame: argument 'by' works", { + df = pl$LazyFrame( + dt = c( + "2021-12-16 00:00:00", "2021-12-16 00:30:00", "2021-12-16 01:00:00", + "2021-12-16 01:30:00", "2021-12-16 02:00:00", "2021-12-16 02:30:00", + "2021-12-16 03:00:00" + ), + n = 0:6, + grp = c("a", "a", "a", "b", "b", "a", "a") + )$with_columns( + pl$col("dt")$str$strptime(pl$Datetime("ms"), format = NULL)$set_sorted() + ) + + actual = df$group_by_dynamic(index_column = "dt", every = "2h", by = pl$col("grp"))$agg( + pl$col("n")$mean() + )$collect()$to_data_frame() + + expect_equal( + actual[, "n"], + c(1, 5.5, 3, 4) + ) + + # string is parsed as column name in "by" + expect_equal( + df$group_by_dynamic(index_column = "dt", every = "2h", by = pl$col("grp"))$agg( + pl$col("n")$mean() + )$collect()$to_data_frame(), + df$group_by_dynamic(index_column = "dt", every = "2h", by = "grp")$agg( + pl$col("n")$mean() + )$collect()$to_data_frame() + ) +}) + +test_that("group_by_dynamic for LazyFrame: argument 'check_sorted' works", { + df = pl$LazyFrame( + index = c(2L, 1L, 3L, 4L, 9L, 8L), # unsorted index + grp = c("a", "a", rep("b", 4)), + a = c(3, 7, 5, 9, 2, 1) + ) + expect_error( + df$group_by_dynamic(index_column = "index", every = "2i", by = "grp")$agg( + pl$sum("a")$alias("sum_a") + )$collect(), + "not sorted" + ) + expect_no_error( + df$group_by_dynamic(index_column = "index", every = "2i", by = "grp", check_sorted = FALSE)$agg( + 
pl$sum("a")$alias("sum_a") + )$collect() + ) +}) + +test_that("group_by_dynamic for LazyFrame: error if index not int or date/time", { + df = pl$LazyFrame( + index = c(1:5, 6.0), + a = c(3, 7, 5, 9, 2, 1) + )$with_columns(pl$col("index")$set_sorted()) + + expect_error( + df$group_by_dynamic(index_column = "index", every = "2i")$agg( + pl$sum("a")$alias("sum_a") + )$collect() + ) +}) + +test_that("group_by_dynamic for LazyFrame: arg 'offset' works", { + df = pl$LazyFrame( + dt = c( + "2020-01-01", "2020-01-01", "2020-01-01", + "2020-01-02", "2020-01-03", "2020-01-08" + ), + n = c(3, 7, 5, 9, 2, 1) + )$with_columns( + pl$col("dt")$str$strptime(pl$Date, format = NULL)$set_sorted() + ) + + # checked with python-polars but unclear on how "offset" works + actual = df$group_by_dynamic(index_column = "dt", every = "2d", offset = "1d")$agg( + pl$col("n")$mean() + )$collect()$to_data_frame() + + expect_equal( + actual[, "n"], + c(5.5, 1) + ) +}) + +test_that("group_by_dynamic for LazyFrame: arg 'include_boundaries' works", { + df = pl$LazyFrame( + dt = c( + "2020-01-01", "2020-01-01", "2020-01-01", + "2020-01-02", "2020-01-03", "2020-01-08" + ), + n = c(3, 7, 5, 9, 2, 1) + )$with_columns( + pl$col("dt")$str$strptime(pl$Date, format = NULL)$set_sorted() + ) + + actual = df$group_by_dynamic( + index_column = "dt", every = "2d", offset = "1d", + include_boundaries = TRUE + )$ + agg( + pl$col("n") + ) + + expect_named(actual, c("_lower_boundary", "_upper_boundary", "dt", "n")) +}) + +test_that("group_by_dynamic for LazyFrame: can be ungrouped", { + df = pl$LazyFrame( + index = c(1:5, 6.0), + a = c(3, 7, 5, 9, 2, 1) + )$with_columns(pl$col("index")$set_sorted()) + + actual = df$group_by_dynamic(index_column = "dt", every = "2i")$ + ungroup()$ + collect()$ + to_data_frame() + expect_equal(actual, df$collect()$to_data_frame()) +}) From 1f2a4d0494a0e90e49ccb72cc92e10bd82a0a42d Mon Sep 17 00:00:00 2001 From: eitsupi <50911393+eitsupi@users.noreply.github.com> Date: Sun, 14 Jan 2024 
16:55:36 +0900 Subject: [PATCH 4/4] fix: rename pl$options' Rd file and enforce named arguments (#697) --- R/options.R | 39 +++++++++++------------- man/{polars_options.Rd => pl_options.Rd} | 5 +-- 2 files changed, 21 insertions(+), 23 deletions(-) rename man/{polars_options.Rd => pl_options.Rd} (98%) diff --git a/R/options.R b/R/options.R index fc42d66d1..e9992b190 100644 --- a/R/options.R +++ b/R/options.R @@ -34,25 +34,6 @@ polars_optreq$rpool_cap = list() # rust-side options already check args #' Get and set polars options. See sections "Value" and "Examples" below for #' more details. #' -#' @param strictly_immutable Keep polars strictly immutable. Polars/arrow is in -#' general pro "immutable objects". Immutability is also classic in R. To mimic -#' the Python-polars API, set this to `FALSE.` -#' @param maintain_order Default for all `maintain_order` options (present in -#' `$group_by()` or `$unique()` for example). -#' @param do_not_repeat_call Do not print the call causing the error in error -#' messages. The default (`FALSE`) is to show them. -#' @param debug_polars Print additional information to debug Polars. -#' @param no_messages Hide messages. -#' @param rpool_cap The maximum number of R sessions that can be used to process -#' R code in the background. See Details. -#' -#' @rdname polars_options -#' -#' @docType NULL -#' -#' @details -#' All args must be explicitly and fully named. -#' #' `pl$options$rpool_active` indicates the number of R sessions already #' spawned in pool. `pl$options$rpool_cap` indicates the maximum number of new R #' sessions that can be spawned. Anytime a polars thread worker needs a background @@ -69,6 +50,22 @@ polars_optreq$rpool_cap = list() # rust-side options already check args #' will likely only give a speed-up in a `low io - high cpu` scenario. Native #' polars query syntax runs in threads and have no overhead. #' +#' @param ... Ignored. +#' @param strictly_immutable Keep polars strictly immutable. 
Polars/arrow is in +#' general pro "immutable objects". Immutability is also classic in R. To mimic +#' the Python-polars API, set this to `FALSE.` +#' @param maintain_order Default for all `maintain_order` options (present in +#' `$group_by()` or `$unique()` for example). +#' @param do_not_repeat_call Do not print the call causing the error in error +#' messages. The default (`FALSE`) is to show them. +#' @param debug_polars Print additional information to debug Polars. +#' @param no_messages Hide messages. +#' @param rpool_cap The maximum number of R sessions that can be used to process +#' R code in the background. See Details. +#' +#' @rdname pl_options +#' @docType NULL +#' #' @return #' `pl$options` returns a named list with the value (`TRUE` or `FALSE`) of #' each option. @@ -90,6 +87,7 @@ polars_optreq$rpool_cap = list() # rust-side options already check args #' # reset options to their default value #' pl$reset_options() pl_set_options = function( + ..., strictly_immutable = TRUE, maintain_order = FALSE, do_not_repeat_call = FALSE, @@ -146,8 +144,7 @@ pl_set_options = function( } } -#' @rdname polars_options - +#' @rdname pl_options pl_reset_options = function() { assign("strictly_immutable", TRUE, envir = polars_optenv) assign("maintain_order", FALSE, envir = polars_optenv) diff --git a/man/polars_options.Rd b/man/pl_options.Rd similarity index 98% rename from man/polars_options.Rd rename to man/pl_options.Rd index 6249fecc2..d16e91ba3 100644 --- a/man/polars_options.Rd +++ b/man/pl_options.Rd @@ -6,6 +6,7 @@ \title{Set polars options} \usage{ pl_set_options( + ..., strictly_immutable = TRUE, maintain_order = FALSE, do_not_repeat_call = FALSE, @@ -17,6 +18,8 @@ pl_set_options( pl_reset_options() } \arguments{ +\item{...}{Ignored.} + \item{strictly_immutable}{Keep polars strictly immutable. Polars/arrow is in general pro "immutable objects". Immutability is also classic in R. 
To mimic the Python-polars API, set this to \code{FALSE.}} @@ -47,8 +50,6 @@ Get and set polars options. See sections "Value" and "Examples" below for more details. } \details{ -All args must be explicitly and fully named. - \code{pl$options$rpool_active} indicates the number of R sessions already spawned in pool. \code{pl$options$rpool_cap} indicates the maximum number of new R sessions that can be spawned. Anytime a polars thread worker needs a background