-
Notifications
You must be signed in to change notification settings - Fork 12
/
README.Rmd
214 lines (138 loc) · 9.97 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
---
output: github_document
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# slider
<!-- badges: start -->
[![Codecov test coverage](https://codecov.io/gh/r-lib/slider/graph/badge.svg)](https://app.codecov.io/gh/r-lib/slider)
[![R-CMD-check](https://github.com/r-lib/slider/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/r-lib/slider/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
slider provides a family of general purpose "sliding window" functions. The API is purposefully _very_ similar to purrr. The goal of these functions is usually to compute rolling averages, cumulative sums, rolling regressions, or other "window" based computations.
There are 3 core functions in slider:
- `slide()` iterates over your data like [`purrr::map()`](https://purrr.tidyverse.org/reference/map.html), but uses a sliding window to do so. It is type-stable, and always returns a result with the same size as its input.
- `slide_index()` computes a rolling calculation _relative to an index_. If you have ever wanted to compute something like a "3 month rolling average" where the number of days in each month is irregular, you might like this function.
- `slide_period()` is similar to `slide_index()` in that it slides relative to an index, but it first breaks the index up into "time blocks", like 2 month blocks of time, and then it slides over `.x` using indices defined by those blocks.
Each of these core functions have the same variants as `purrr::map()`. For example, `slide()` has `slide_dbl()`, `slide2()`, and `pslide()`, along with the other combinations of these variants that you might expect from having previously used purrr.
To learn more about these three functions, read the [introduction vignette](https://slider.r-lib.org/articles/slider.html).
There are also a set of extremely fast specialized variants of `slide_dbl()` for the most common use cases. These include `slide_sum()` for rolling sums and `slide_mean()` for rolling averages. There are index variants of each of these as well, like `slide_index_sum()`.
## Installation
Install the released version from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("slider")
```
Install the development version from [GitHub](https://github.com/) with:
``` r
pak::pak("r-lib/slider")
```
## Examples
The [help page for `slide()`](https://slider.r-lib.org/reference/slide.html) has many examples, but here are a few:
```{r}
library(slider)
```
The classic example would be to do a moving average. `slide()` handles this with a combination of the `.before` and `.after` arguments, which control the width of the window and the alignment.
```{r}
# Moving average (Aligned right)
# "The current element + 2 elements before"
slide_dbl(1:5, ~mean(.x), .before = 2)
# Align left
# "The current element + 2 elements after"
slide_dbl(1:5, ~mean(.x), .after = 2)
# Center aligned
# "The current element + 1 element before + 1 element after"
slide_dbl(1:5, ~mean(.x), .before = 1, .after = 1)
```
With `Inf`, you can do a "cumulative slide" to compute cumulative expressions. I think of this as saying "give me everything before the current element."
```{r}
slide(1:4, ~.x, .before = Inf)
```
With `.complete`, you can decide whether or not `.f` should be evaluated on incomplete windows. In the following example, the requested window size is 3, but the first two results are computed on windows of size 1 and 2 because partial results are allowed by default. When `.complete` is set to `TRUE`, the first two results are not computed.
```{r}
slide(1:4, ~.x, .before = 2)
slide(1:4, ~.x, .before = 2, .complete = TRUE)
```
## Data frames
Unlike `purrr::map()`, `slide()` iterates over data frames in a row wise fashion. Interestingly this means the default of `slide()` becomes a generic row wise iterator, with nice syntax for accessing data frame columns.
There is a [vignette specifically about this](https://slider.r-lib.org/articles/rowwise.html).
```{r}
mini_cars <- cars[1:4,]
slide(mini_cars, ~.x)
slide_dbl(mini_cars, ~.x$speed + .x$dist)
```
This makes rolling regressions trivial!
```{r, message=FALSE}
library(tibble)
set.seed(123)
df <- tibble(
y = rnorm(100),
x = rnorm(100)
)
# Window size of 20 rows
# The current row + 19 before
# (see slide_index() for how to do this relative to a date vector!)
df$regressions <- slide(df, ~lm(y ~ x, data = .x), .before = 19, .complete = TRUE)
df[15:25,]
```
## Index sliding
In many business settings, the value you want to compute is tied to some _index_, like a date vector. In these cases, you'll probably want to compute sliding windows relative to the index, and not using the fixed window that `slide()` provides. You can use `slide_index()` to pass in both `.x` and an index, `.i`, and the window will be calculated relative to that index.
Here, when computing a "2 day window", you probably don't want `"2019-08-16"` and `"2019-08-18"` to be grouped together. `slide()` has no concept of an index, so when you specify a window size of 2, it will group these two together. `slide_index()`, on the other hand, will do the right thing.
```{r}
x <- 1:3
i <- as.Date(c("2019-08-15", "2019-08-16", "2019-08-18"))
# slide() has no concept of an "index"
slide(x, ~.x, .before = 1)
# "index aware"
slide_index(x, i, ~.x, .before = 1)
```
Essentially what happens is that when we get to `"2019-08-18"`, it "looks backwards" 1 day to set a window boundary at `"2019-08-17"`. Since the date at position 2, `"2019-08-16"`, is before `"2019-08-17"`, it is not included.
Powerfully, you can pass through any object to `.before` that computes a value from `.i - .before`. This means that you could also have used a lubridate period object (which gets even more interesting when you use `weeks()` or `months()`):
```{r}
slide_index(x, i, ~.x, .before = lubridate::days(1))
```
## Period sliding
`slide_period()` is different from `slide_index()` in that it first breaks the index into "time blocks" and then slides over `.x` relative to those blocks. For example, in the monthly period slide below, `i` is broken up into 4 time blocks of "the current block of monthly data, plus one block before this one". The locations of those blocks are the locations that are used to slice `.x` with.
```{r}
i <- as.Date(c(
"2019-01-29",
"2019-01-30",
"2019-02-05",
"2019-04-01",
"2019-05-10"
))
slide_period(i, i, "month", ~.x, .before = 1)
```
One neat thing to notice is that `slide_period()` is aware of the distance between elements of `.i` in the period you specify. The practical implication of this is that in the above example, group 3 with `2019-04-01` did _not_ include `2019-02-05` in it, because it is more than 1 month group away.
## Inspiration
This package is inspired heavily by SQL's window functions. The API is similar, but more general because you can iterate over any kind of R object.
There have been multiple attempts at creating sliding window functions (I personally created `rollify()`, and worked a little bit on `tsibble::slide()` with [Earo Wang](https://github.com/earowang)).
- `zoo::rollapply()`
- `tibbletime::rollify()`
- `tsibble::slide()`
I believe that slider is the next iteration of these. There are a few reasons for this:
- To me, the API is more intuitive, and is more flexible because `.before` and `.after` let you completely control the entry point (as opposed to fixed entry points like `"center"`, `"left"`, etc.
- It is objectively faster because it is written purely in C.
- With `slide_vec()` you can return any kind of object, and are not limited to the suffixed versions: `_dbl`, `_int`, etc.
- It iterates rowwise over data frames, consistent with the vctrs framework.
- I believe it is overall more consistent, backed by a theory that can always justify the sliding window generated by any combination of the parameters.
Earo and I have spoken, and we have mutually agreed that it would be best to deprecate `tsibble::slide()` in favor of `slider::slide()`.
Additionally, [data.table](https://github.com/Rdatatable/data.table)'s non-equi joins have been pretty much the only solution to the problem that `slide_index()` tries to solve. Their solution is robust and quite fast, and has been a nice benchmark for slider. slider is trying to solve a much narrower problem, so the API here is more focused.
## Performance
Like `purrr::map()`, the core functions of slider, such as `slide()` and `slide_index()`, are optimized in C to be as fast as possible, but there is overhead involved in calling `.f` repeatedly. These functions are meant to be as _general purpose_ as possible, at the cost of some performance. This means that slider can be used for more abstract computations, like rolling regressions, or any other custom function that you want to use in a rolling fashion.
slider also provides _specialized_ functions for some of the most common use cases, such as `slide_mean()`, or `slide_index_sum()`. These compute their corresponding metric at the C level, using a specialized algorithm, and are often much faster than their `slide_dbl(x, fn)` equivalent.
## References
I've found the following references very useful to understand more about window functions:
- [Postgres SQL documentation](https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS)
- [dbplyr window function vignette](https://dbplyr.tidyverse.org/articles/translation-function.html#window-functions)
- [SQLite documentation - with a flowchart](https://www.sqlite.org/windowfunctions.html)
- [Vertica Rows vs Range discussion](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/window_frame_clause.htm?origin_team=T02V9CHFH#ROWSversusRANGE)
## Code of Conduct
Please note that the slider project is released with a [Contributor Code of Conduct](https://slider.r-lib.org/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.