From 7b9320f76104adf8720331b648225302cc36beeb Mon Sep 17 00:00:00 2001
From: yjunechoe
Date: Sun, 22 Sep 2024 19:51:17 -0400
Subject: [PATCH] kronecker() aside

---
 .../row-relational-operations.Rmd | 29 +
 .../row-relational-operations.html | 131 +-
 .../figure-html5/final-gapminder-plot-1.png | Bin 18806 -> 19953 bytes
 .../header-attrs-2.27/header-attrs.js | 12 +
 .../panelset-0.3.0/panelset.css | 291 +
 .../panelset-0.3.0/panelset.js | 826 ++
 docs/blog.html | 7224 ++++++++---------
 docs/blog.xml | 1851 ++++-
 .../index.html | 100 +-
 .../figure-html5/final-gapminder-plot-1.png | Bin 18806 -> 19953 bytes
 .../header-attrs-2.21/header-attrs.js | 12 +
 .../panelset-0.2.6/panelset.css | 227 +
 .../panelset-0.2.6/panelset.js | 325 +
 docs/posts/posts.json | 44 +-
 docs/site_libs/panelset-0.3.0/panelset.css | 291 +
 docs/site_libs/panelset-0.3.0/panelset.js | 826 ++
 docs/sitemap.xml | 8 +-
 17 files changed, 8503 insertions(+), 3694 deletions(-)
 create mode 100644 _posts/2023-06-11-row-relational-operations/row-relational-operations_files/header-attrs-2.27/header-attrs.js
 create mode 100644 _posts/2023-06-11-row-relational-operations/row-relational-operations_files/panelset-0.3.0/panelset.css
 create mode 100644 _posts/2023-06-11-row-relational-operations/row-relational-operations_files/panelset-0.3.0/panelset.js
 create mode 100644 docs/posts/2023-06-11-row-relational-operations/row-relational-operations_files/header-attrs-2.21/header-attrs.js
 create mode 100644 docs/posts/2023-06-11-row-relational-operations/row-relational-operations_files/panelset-0.2.6/panelset.css
 create mode 100644 docs/posts/2023-06-11-row-relational-operations/row-relational-operations_files/panelset-0.2.6/panelset.js
 create mode 100644 docs/site_libs/panelset-0.3.0/panelset.css
 create mode 100644 docs/site_libs/panelset-0.3.0/panelset.js

diff --git a/_posts/2023-06-11-row-relational-operations/row-relational-operations.Rmd b/_posts/2023-06-11-row-relational-operations/row-relational-operations.Rmd
index 2d889af5..429d8831 100644
--- a/_posts/2023-06-11-row-relational-operations/row-relational-operations.Rmd
+++ b/_posts/2023-06-11-row-relational-operations/row-relational-operations.Rmd
@@ -637,6 +637,35 @@ We find that the lessons of working with row indices from `slice()` translated t
+### Aside: `kronecker()` as `as.vector(outer())`
+
+Following from the `slice()` + `outer()` strategy demoed above, imagine that we want to filter for `"Luke Skywalker"` and the 4 other characters neighboring him in `height` and `mass`.
+
+```{r}
+dplyr::starwars[, 1:3]
+```
+
+In row-relational terms, "filtering neighboring values" just means "filtering rows after arranging by the values we care about". We can express this using `slice()` and `outer()` as:
+
+```{r}
+starwars %>%
+  select(name, mass, height) %>%
+  arrange(mass, height) %>%
+  slice( as.vector(outer(-2:2, which(name == "Luke Skywalker"), `+`)) )
+```
+
+I raised this example on an unrelated thread on the [R4DS/DSLC slack](https://fosstodon.org/@DSLC), where Anthony Durrant pointed me to `kronecker()` as a version of `outer()` that flattens its output before returning it.
+
+So in examples that use `outer()` to generate row indices in `slice()`, we can use `kronecker()` instead to save a call to a flattening function like `as.vector()`:
+
+```{r}
+starwars %>%
+  select(name, mass, height) %>%
+  arrange(mass, height) %>%
+  slice( kronecker(-2:2, which(name == "Luke Skywalker"), `+`) )
+```
+
+
 ### Windowed min/max/median (etc.)
Let's say we have this small time series data, and we want to calculate a **lagged 3-window moving minimum** for the `val` column: diff --git a/_posts/2023-06-11-row-relational-operations/row-relational-operations.html b/_posts/2023-06-11-row-relational-operations/row-relational-operations.html index 7ba40fb3..3981ff9d 100644 --- a/_posts/2023-06-11-row-relational-operations/row-relational-operations.html +++ b/_posts/2023-06-11-row-relational-operations/row-relational-operations.html @@ -21,7 +21,7 @@ - - - - - - - - - - June Choe: Blog Posts - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-

Blog Posts

- - - -
- -
-
-

Read files on the web into R

-
-
tutorial
-
-

For the download-button-averse of us

-
-
- - - -
- -
-
-

Naming patterns for boolean enums

-
-
design
-
-

Some thoughts on the principle of enumerating possible options, even for booleans

-
-
- - - -
- -
-
-

`ave()` for the average {dplyr} user

-
-
dplyr
-
-

tidyverse 🤝 base R

-
-
- - - -
- -
-
-

args(args(args)(args))

-
-
args
-
metaprogramming
-
-

The unexpected sequel to "R is a language optimized for meme-ing"

-
-
- - - -
- -
-
-

HelloWorld("print")

-
-
metaprogramming
-
-

R is a language optimized for meme-ing

-
-
- - - -
- -
-
-

2023 Year in Review

-
-
reflections
-
-

Reflections and updates on what I've been up to in 2023

-
-
- - - -
- -
-
-

The many ways to (un)tidy-select

-
-
data wrangling
-
dplyr
-
tidyselect
-
-

Deconstructing {tidyselect} and building it back up

-
-
- - - -
- -
-
-

Fumbling my way through an XY problem

-
-
reflections
-
-

Some lessons learned from a (personal) case study

-
-
- - - -
- -
-
-

Row relational operations with slice()

-
-
data wrangling
-
dplyr
-
-

A love letter to dplyr::slice() and a gallery of usecases

-
-
- - - -
- -
-
-

First impressions of DataFrames.jl and accessories

-
-
julia
-
data wrangling
-
DataFrames.jl
-
dplyr
-
data.table
-
-

Perspectives from a {dplyr} and {data.table} useR

-
-
- - - -
- -
-
-

Reflections on useR! 2022

-
-
conference
-
ggtrace
-
-

Notes from attending and speaking at my first R conference

-
-
- - - -
- -
-
-

Demystifying delayed aesthetic evaluation: Part 2

-
-
data visualization
-
ggplot2
-
tutorial
-
-

Exposing the `Stat` ggproto in functional programming terms

-
-
- - - -
- -
-
-

Demystifying delayed aesthetic evaluation: Part 1

-
-
data visualization
-
ggplot2
-
ggplot internals
-
tutorial
-
-

Exploring the logic of `after_stat()` to peek inside ggplot internals

-
-
- - - -
- -
-
-

Setting up and debugging custom fonts

-
-
data visualization
-
ggplot2
-
typography
-
tutorial
-
-

A practical introduction to all (new) things font in R

-
-
- - - -
- -
-
-

Random Sampling: A table animation

-
-
data visualization
-
data wrangling
-
-

Plus a convenient way of rendering LaTeX expressions as images

-
-
- - - -
- -
-
-

Collapse repetitive piping with reduce()

-
-
data wrangling
-
tutorial
-
-

Featuring accumulate()

-
-
- - - -
- -
-
-

Plot Makeover #2

-
-
plot makeover
-
data visualization
-
ggplot2
-
-

Making a dodged-stacked hybrid bar plot in {ggplot2}

-
-
- - - -
- -
-
-

TidyTuesday 2020 week 45

-
-
ggplot2
-
data visualization
-
tidytuesday
-
-

Waffle chart of IKEA furniture in stock

-
-
- - - -
- -
-
-

TidyTuesday 2020 week 44

-
-
ggplot2
-
gganimate
-
spatial
-
data visualization
-
tidytuesday
-
-

Patched animation of the location and cumulative capacity of wind turbines in Canada

-
-
- - - -
- -
-
-

Analysis of @everycolorbot's tweets

-
-
data visualization
-
ggplot2
-
rtweet
-
colors
-
-

And why you should avoid neon colors

-
-
- - - -
- -
-
-

Designing guiding aesthetics

-
-
data visualization
-
ggplot2
-
tidytuesday
-
-

The fine line between creativity and noise

-
-
- - - -
- -
-
-

Demystifying stat_ layers in {ggplot2}

-
-
data visualization
-
ggplot2
-
tutorial
-
-

The motivation behind stat, the distinction between stat and geom, and a case study of stat_summary()

-
-
- - - -
- -
-
-

TidyTuesday 2020 week 39

-
-
ggplot2
-
data visualization
-
tidytuesday
-
-

Stacked area plot of the heights of Himalayan peaks attempted over the last century

-
-
- - - -
- -
-
-

Plot Makeover #1

-
-
plot makeover
-
data visualization
-
ggplot2
-
-

Flattening a faceted grid for strictly horizontal comparisons

-
-
- - - -
- -
-
-

TidyTuesday 2020 week 38

-
-
tables
-
data visualization
-
tidytuesday
-
-

Visualizing two decades of primary and secondary education spending with {gt}

-
-
- - - -
- -
-
-

Embedding videos in {reactable} tables

-
-
tables
-
data visualization
-
-

Pushing the limits of expandable row details

-
-
- - - -
- -
-
-

Fonts for graphs

-
-
data visualization
-
typography
-
-

A small collection of my favorite fonts for data visualization

-
-
- - - -
- -
-
-

TidyTuesday 2020 Week 33

-
-
tidytuesday
-
gganimate
-
ggplot2
-
-

An animation of the main characters in Avatar

-
-
- - - -
- -
-
-

Saving a line of piping

-
-
data wrangling
-
dplyr
-
tutorial
-
-

Some notes on lesser known functions/functionalities that combine common chain of {dplyr} verbs.

-
-
- - - -
- -
-
-

TidyTuesday 2020 Week 32

-
-
tidytuesday
-
data visualization
-
ggplot2
-
-

A dumbbell chart visualization of energy production trends among European countries

-
-
- - - -
- -
-
-

Six years of my Spotify playlists

-
-
ggplot2
-
gganimate
-
spotifyr
-
data wrangling
-
data visualization
-
-

An analysis of acoustic features with {spotifyr}

-
-
- - - -
- -
-
-

Shiny tips - the first set

-
-
shiny
-
-

%||%, imap() + {shinybusy}, and user inputs in modalDialog()

-
-
- - - -
- -
-
-

geom_paired_raincloud()

-
-
data visualization
-
ggplot2
-
-

A {ggplot2} geom for visualizing change in distribution between two conditions.

-
-
- - - -
- -
-
-

Plotting treemaps with {treemap} and {ggplot2}

-
-
data visualization
-
treemap
-
ggplot2
-
tutorial
-
-

Using underlying plot data for maximum customization

-
-
- - - -
- -
-
-

Indexing tip for {spacyr}

-
-
data wrangling
-
NLP
-
spacyr
-
-

Speeding up the analysis of dependency relations.

-
-
- - - -
- -
-
-

The Correlation Parameter in Mixed Effects Models

-
-
statistics
-
mixed-effects models
-
tutorial
-
-

Notes on the Corr term in {lme4} output

-
-
-
-
- -
- -
- - -
-

Blog Posts

- - - - -
- - -
- -
- - -
- -
-
- - - - - -
- - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + June Choe: Blog Posts + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+

Blog Posts

+ + + +
+ +
+
+

Read files on the web into R

+
+
tutorial
+
+

For the download-button-averse of us

+
+
+ + + +
+ +
+
+

Row relational operations with slice()

+
+
data wrangling
+
dplyr
+
+

A love letter to dplyr::slice() and a gallery of usecases

+
+
+ + + +
+ +
+
+

Naming patterns for boolean enums

+
+
design
+
+

Some thoughts on the principle of enumerating possible options, even for booleans

+
+
+ + + +
+ +
+
+

`ave()` for the average {dplyr} user

+
+
dplyr
+
+

tidyverse 🤝 base R

+
+
+ + + +
+ +
+
+

args(args(args)(args))

+
+
args
+
metaprogramming
+
+

The unexpected sequel to "R is a language optimized for meme-ing"

+
+
+ + + +
+ +
+
+

HelloWorld("print")

+
+
metaprogramming
+
+

R is a language optimized for meme-ing

+
+
+ + + +
+ +
+
+

2023 Year in Review

+
+
reflections
+
+

Reflections and updates on what I've been up to in 2023

+
+
+ + + +
+ +
+
+

The many ways to (un)tidy-select

+
+
data wrangling
+
dplyr
+
tidyselect
+
+

Deconstructing {tidyselect} and building it back up

+
+
+ + + +
+ +
+
+

Fumbling my way through an XY problem

+
+
reflections
+
+

Some lessons learned from a (personal) case study

+
+
+ + + +
+ +
+
+

First impressions of DataFrames.jl and accessories

+
+
julia
+
data wrangling
+
DataFrames.jl
+
dplyr
+
data.table
+
+

Perspectives from a {dplyr} and {data.table} useR

+
+
+ + + +
+ +
+
+

Reflections on useR! 2022

+
+
conference
+
ggtrace
+
+

Notes from attending and speaking at my first R conference

+
+
+ + + +
+ +
+
+

Demystifying delayed aesthetic evaluation: Part 2

+
+
data visualization
+
ggplot2
+
tutorial
+
+

Exposing the `Stat` ggproto in functional programming terms

+
+
+ + + +
+ +
+
+

Demystifying delayed aesthetic evaluation: Part 1

+
+
data visualization
+
ggplot2
+
ggplot internals
+
tutorial
+
+

Exploring the logic of `after_stat()` to peek inside ggplot internals

+
+
+ + + +
+ +
+
+

Setting up and debugging custom fonts

+
+
data visualization
+
ggplot2
+
typography
+
tutorial
+
+

A practical introduction to all (new) things font in R

+
+
+ + + +
+ +
+
+

Random Sampling: A table animation

+
+
data visualization
+
data wrangling
+
+

Plus a convenient way of rendering LaTeX expressions as images

+
+
+ + + +
+ +
+
+

Collapse repetitive piping with reduce()

+
+
data wrangling
+
tutorial
+
+

Featuring accumulate()

+
+
+ + + +
+ +
+
+

Plot Makeover #2

+
+
plot makeover
+
data visualization
+
ggplot2
+
+

Making a dodged-stacked hybrid bar plot in {ggplot2}

+
+
+ + + +
+ +
+
+

TidyTuesday 2020 week 45

+
+
ggplot2
+
data visualization
+
tidytuesday
+
+

Waffle chart of IKEA furniture in stock

+
+
+ + + +
+ +
+
+

TidyTuesday 2020 week 44

+
+
ggplot2
+
gganimate
+
spatial
+
data visualization
+
tidytuesday
+
+

Patched animation of the location and cumulative capacity of wind turbines in Canada

+
+
+ + + +
+ +
+
+

Analysis of @everycolorbot's tweets

+
+
data visualization
+
ggplot2
+
rtweet
+
colors
+
+

And why you should avoid neon colors

+
+
+ + + +
+ +
+
+

Designing guiding aesthetics

+
+
data visualization
+
ggplot2
+
tidytuesday
+
+

The fine line between creativity and noise

+
+
+ + + +
+ +
+
+

Demystifying stat_ layers in {ggplot2}

+
+
data visualization
+
ggplot2
+
tutorial
+
+

The motivation behind stat, the distinction between stat and geom, and a case study of stat_summary()

+
+
+ + + +
+ +
+
+

TidyTuesday 2020 week 39

+
+
ggplot2
+
data visualization
+
tidytuesday
+
+

Stacked area plot of the heights of Himalayan peaks attempted over the last century

+
+
+ + + +
+ +
+
+

Plot Makeover #1

+
+
plot makeover
+
data visualization
+
ggplot2
+
+

Flattening a faceted grid for strictly horizontal comparisons

+
+
+ + + +
+ +
+
+

TidyTuesday 2020 week 38

+
+
tables
+
data visualization
+
tidytuesday
+
+

Visualizing two decades of primary and secondary education spending with {gt}

+
+
+ + + +
+ +
+
+

Embedding videos in {reactable} tables

+
+
tables
+
data visualization
+
+

Pushing the limits of expandable row details

+
+
+ + + +
+ +
+
+

Fonts for graphs

+
+
data visualization
+
typography
+
+

A small collection of my favorite fonts for data visualization

+
+
+ + + +
+ +
+
+

TidyTuesday 2020 Week 33

+
+
tidytuesday
+
gganimate
+
ggplot2
+
+

An animation of the main characters in Avatar

+
+
+ + + +
+ +
+
+

Saving a line of piping

+
+
data wrangling
+
dplyr
+
tutorial
+
+

Some notes on lesser known functions/functionalities that combine common chain of {dplyr} verbs.

+
+
+ + + +
+ +
+
+

TidyTuesday 2020 Week 32

+
+
tidytuesday
+
data visualization
+
ggplot2
+
+

A dumbbell chart visualization of energy production trends among European countries

+
+
+ + + +
+ +
+
+

Six years of my Spotify playlists

+
+
ggplot2
+
gganimate
+
spotifyr
+
data wrangling
+
data visualization
+
+

An analysis of acoustic features with {spotifyr}

+
+
+ + + +
+ +
+
+

Shiny tips - the first set

+
+
shiny
+
+

%||%, imap() + {shinybusy}, and user inputs in modalDialog()

+
+
+ + + +
+ +
+
+

geom_paired_raincloud()

+
+
data visualization
+
ggplot2
+
+

A {ggplot2} geom for visualizing change in distribution between two conditions.

+
+
+ + + +
+ +
+
+

Plotting treemaps with {treemap} and {ggplot2}

+
+
data visualization
+
treemap
+
ggplot2
+
tutorial
+
+

Using underlying plot data for maximum customization

+
+
+ + + +
+ +
+
+

Indexing tip for {spacyr}

+
+
data wrangling
+
NLP
+
spacyr
+
+

Speeding up the analysis of dependency relations.

+
+
+ + + +
+ +
+
+

The Correlation Parameter in Mixed Effects Models

+
+
statistics
+
mixed-effects models
+
tutorial
+
+

Notes on the Corr term in {lme4} output

+
+
+
+
+ +
+ +
+ + +
+

Blog Posts

+ + + + +
+ + +
+ +
+ + +
+ +
+
+ + + + + +
+ + + + + + + + + diff --git a/docs/blog.xml b/docs/blog.xml index c5ad6754..d0f3f727 100644 --- a/docs/blog.xml +++ b/docs/blog.xml @@ -23,6 +23,1846 @@ Sun, 22 Sep 2024 00:00:00 +0000 + + Row relational operations with slice() + June Choe + https://yjunechoe.github.io/posts/2023-06-11-row-relational-operations + + + +<h2 id="intro">Intro</h2> +<p>In data wrangling, there are a handful of <strong>classes</strong> of +operations on data frames that we think of as theoretically well-defined +and tackling distinct problems. To name a few, these include subsetting, +joins, split-apply-combine, pairwise operations, nested-column +workflows, and so on.</p> +<p>Against this rich backdrop, there’s one aspect of data wrangling that +doesn’t receive as much attention: <strong>ordering of rows</strong>. +This isn’t necessarily surprising - we often think of row order as an +auxiliary attribute of data frames since they don’t speak to the content +of the data, <em>per se</em>. I think we all share the intuition that +two dataframe that differ only in row order are practically the same for +most analysis purposes.</p> +<p><em>Except when they aren’t.</em></p> +<p>In this blog post I want to talk about a few, somewhat esoteric cases +of what I like to call <strong>row-relational operations</strong>. My +goal is to try to motivate row-relational operations as a full-blown +class of data wrangling operation that includes not only row ordering, +but also sampling, shuffling, repeating, interweaving, and so on (I’ll +go over all of these later).</p> +<p>Without spoiling too much, I believe that <code>dplyr::slice()</code> +offers a powerful context for operations over row indices, even those +that at first seem to lack a “tidy” solution. 
You may already know +<code>slice()</code> as an indexing function, but my hope is to convince +you that it can do so much more.</p> +<p>Let’s start by first talking about some special properties of +<code>dplyr::slice()</code>, and then see how we can use it for various +row-relational operations.</p> +<h2 id="special-properties-of-dplyrslice">Special properties of +<code>dplyr::slice()</code></h2> +<h3 id="basic-usage">Basic usage</h3> +<p>For the following demonstration, I’ll use a small subset of the +<code>dplyr::starwars</code> dataset:</p> +<pre class="r"><code>starwars_sm &lt;- dplyr::starwars[1:10, 1:3] +starwars_sm</code></pre> +<pre><code> # A tibble: 10 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Luke Skywalker 172 77 + 2 C-3PO 167 75 + 3 R2-D2 96 32 + 4 Darth Vader 202 136 + 5 Leia Organa 150 49 + 6 Owen Lars 178 120 + 7 Beru Whitesun Lars 165 75 + 8 R5-D4 97 32 + 9 Biggs Darklighter 183 84 + 10 Obi-Wan Kenobi 182 77</code></pre> +<h4 id="row-selection">1) Row selection</h4> +<p><code>slice()</code> is a row indexing verb - if you pass it a vector +of integers, it subsets data frame rows:</p> +<pre class="r"><code>starwars_sm |&gt; + slice(1:6) # First six rows</code></pre> +<pre><code> # A tibble: 6 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Luke Skywalker 172 77 + 2 C-3PO 167 75 + 3 R2-D2 96 32 + 4 Darth Vader 202 136 + 5 Leia Organa 150 49 + 6 Owen Lars 178 120</code></pre> +<p>Like other dplyr verbs with mutate-semantics, you can use <a +href="https://dplyr.tidyverse.org/reference/context.html">context-dependent +expressions</a> inside <code>slice()</code>. 
For example, you can use +<code>n()</code> to grab the last row (or last couple of rows):</p> +<pre class="r"><code>starwars_sm |&gt; + slice( n() ) # Last row</code></pre> +<pre><code> # A tibble: 1 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Obi-Wan Kenobi 182 77</code></pre> +<pre class="r"><code>starwars_sm |&gt; + slice( n() - 2:0 ) # Last three rows</code></pre> +<pre><code> # A tibble: 3 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 R5-D4 97 32 + 2 Biggs Darklighter 183 84 + 3 Obi-Wan Kenobi 182 77</code></pre> +<p>Another context-dependent expression that comes in handy is +<code>row_number()</code>, which returns all row indices. Using it +inside <code>slice()</code> essentially performs an identity +transformation:</p> +<pre class="r"><code>identical( + starwars_sm, + starwars_sm |&gt; slice( row_number() ) +)</code></pre> +<pre><code> [1] TRUE</code></pre> +<p>Lastly, similar to in <code>select()</code>, you can use +<code>-</code> for negative indexing (to remove rows):</p> +<pre class="r"><code>identical( + starwars_sm |&gt; slice(1:3), # First three rows + starwars_sm |&gt; slice(-(4:n())) # All rows except fourth row to last row +)</code></pre> +<pre><code> [1] TRUE</code></pre> +<h4 id="dynamic-dots">2) Dynamic dots</h4> +<p><code>slice()</code> supports <a +href="https://rlang.r-lib.org/reference/dyn-dots.html">dynamic dots</a>. 
+If you pass row indices into multiple argument positions, +<code>slice()</code> will concatenate them for you:</p> +<pre class="r"><code>identical( + starwars_sm |&gt; slice(1:6), + starwars_sm |&gt; slice(1, 2:4, 5, 6) +)</code></pre> +<pre><code> [1] TRUE</code></pre> +<p>If you have a <code>list()</code> of row indices, you can use the <a +href="https://rlang.r-lib.org/reference/splice-operator.html">splice +operator</a> <code>!!!</code> to spread them out:</p> +<pre class="r"><code>starwars_sm |&gt; + slice( !!!list(1, 2:4, 5, 6) )</code></pre> +<pre><code> # A tibble: 6 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Luke Skywalker 172 77 + 2 C-3PO 167 75 + 3 R2-D2 96 32 + 4 Darth Vader 202 136 + 5 Leia Organa 150 49 + 6 Owen Lars 178 120</code></pre> +<p>The above call to <code>slice()</code> evaluates to the following +after splicing:</p> +<pre class="r"><code>rlang::expr( slice(!!!list(1, 2:4, 5, 6)) )</code></pre> +<pre><code> slice(1, 2:4, 5, 6)</code></pre> +<h4 id="row-ordering">3) Row ordering</h4> +<p><code>slice()</code> respects the order in which you supplied the row +indices:</p> +<pre class="r"><code>starwars_sm |&gt; + slice(3, 1, 2, 5)</code></pre> +<pre><code> # A tibble: 4 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 R2-D2 96 32 + 2 Luke Skywalker 172 77 + 3 C-3PO 167 75 + 4 Leia Organa 150 49</code></pre> +<p>This means you can do stuff like random sampling with +<code>sample()</code>:</p> +<pre class="r"><code>starwars_sm |&gt; + slice( sample(n()) )</code></pre> +<pre><code> # A tibble: 10 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Obi-Wan Kenobi 182 77 + 2 Owen Lars 178 120 + 3 Leia Organa 150 49 + 4 Darth Vader 202 136 + 5 Luke Skywalker 172 77 + 6 R5-D4 97 32 + 7 C-3PO 167 75 + 8 Beru Whitesun Lars 165 75 + 9 Biggs Darklighter 183 84 + 10 R2-D2 96 32</code></pre> +<p>You can also shuffle a subset of rows (ex: just the first five):</p> +<pre class="r"><code>starwars_sm |&gt; + 
slice( sample(5), 6:n() )</code></pre> +<pre><code> # A tibble: 10 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 C-3PO 167 75 + 2 Leia Organa 150 49 + 3 R2-D2 96 32 + 4 Darth Vader 202 136 + 5 Luke Skywalker 172 77 + 6 Owen Lars 178 120 + 7 Beru Whitesun Lars 165 75 + 8 R5-D4 97 32 + 9 Biggs Darklighter 183 84 + 10 Obi-Wan Kenobi 182 77</code></pre> +<p>Or reorder all rows by their indices (ex: in reverse):</p> +<pre class="r"><code>starwars_sm |&gt; + slice( rev(row_number()) )</code></pre> +<pre><code> # A tibble: 10 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Obi-Wan Kenobi 182 77 + 2 Biggs Darklighter 183 84 + 3 R5-D4 97 32 + 4 Beru Whitesun Lars 165 75 + 5 Owen Lars 178 120 + 6 Leia Organa 150 49 + 7 Darth Vader 202 136 + 8 R2-D2 96 32 + 9 C-3PO 167 75 + 10 Luke Skywalker 172 77</code></pre> +<h4 id="out-of-bounds-handling">4) Out-of-bounds handling</h4> +<p>If you pass a row index that’s out of bounds, <code>slice()</code> +returns a 0-row data frame:</p> +<pre class="r"><code>starwars_sm |&gt; + slice( n() + 1 ) # Select the row after the last row</code></pre> +<pre><code> # A tibble: 0 × 3 + # ℹ 3 variables: name &lt;chr&gt;, height &lt;int&gt;, mass &lt;dbl&gt;</code></pre> +<p>When mixed with valid row indices, out-of-bounds indices are simply +ignored (much 💜 for this behavior):</p> +<pre class="r"><code>starwars_sm |&gt; + slice( + 0, # 0th row - ignored + 1:3, # first three rows + n() + 1 # 1 after last row - ignored + )</code></pre> +<pre><code> # A tibble: 3 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Luke Skywalker 172 77 + 2 C-3PO 167 75 + 3 R2-D2 96 32</code></pre> +<p>This lets you do funky stuff like select all even numbered rows by +passing <code>slice()</code> all row indices times 2:</p> +<pre class="r"><code>starwars_sm |&gt; + slice( row_number() * 2 ) # Add `- 1` at the end for *odd* rows!</code></pre> +<pre><code> # A tibble: 5 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; 
&lt;dbl&gt; + 1 C-3PO 167 75 + 2 Darth Vader 202 136 + 3 Owen Lars 178 120 + 4 R5-D4 97 32 + 5 Obi-Wan Kenobi 182 77</code></pre> +<h3 id="re-imagining-slice-with-data-masking">Re-imagining +<code>slice()</code> with data-masking</h3> +<p><code>slice()</code> is already pretty neat as it is, but that’s just +the tip of the iceberg.</p> +<p>The really cool, under-rated feature of <code>slice()</code> is that +it’s <a +href="https://dplyr.tidyverse.org/reference/dplyr_data_masking.html"><strong>data-masked</strong></a>, +meaning that you can reference column vectors as if they’re variables. +Another way of describing this property of <code>slice()</code> is to +say that it has <a +href="https://rlang.r-lib.org/reference/topic-data-mask-programming.html"><strong>mutate-semantics</strong></a>.</p> +<p>At a very basic level, this means that <code>slice()</code> can +straightforwardly replicate the behavior of some dplyr verbs like +<code>arrange()</code> and <code>filter()</code>!</p> +<h4 id="slice-as-arrange"><code>slice()</code> as +<code>arrange()</code></h4> +<p>From our <code>starwars_sm</code> data, if we want to sort by +<code>height</code> we can use <code>arrange()</code>:</p> +<pre class="r"><code>starwars_sm |&gt; + arrange(height)</code></pre> +<pre><code> # A tibble: 10 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 R2-D2 96 32 + 2 R5-D4 97 32 + 3 Leia Organa 150 49 + 4 Beru Whitesun Lars 165 75 + 5 C-3PO 167 75 + 6 Luke Skywalker 172 77 + 7 Owen Lars 178 120 + 8 Obi-Wan Kenobi 182 77 + 9 Biggs Darklighter 183 84 + 10 Darth Vader 202 136</code></pre> +<p>But we can also do this with <code>slice()</code> to the same effect, +using <code>order()</code>:</p> +<pre class="r"><code>starwars_sm |&gt; + slice( order(height) )</code></pre> +<pre><code> # A tibble: 10 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 R2-D2 96 32 + 2 R5-D4 97 32 + 3 Leia Organa 150 49 + 4 Beru Whitesun Lars 165 75 + 5 C-3PO 167 75 + 6 Luke Skywalker 
172 77 + 7 Owen Lars 178 120 + 8 Obi-Wan Kenobi 182 77 + 9 Biggs Darklighter 183 84 + 10 Darth Vader 202 136</code></pre> +<p>This is conceptually equivalent to combining the following 2-step +process:</p> +<ol style="list-style-type: decimal"> +<li><pre class="r"><code>ordered_val_ind &lt;- order(starwars_sm$height) +ordered_val_ind</code></pre> +<pre><code> [1] 3 8 5 7 2 1 6 10 9 4</code></pre></li> +<li><pre class="r"><code>starwars_sm |&gt; + slice( ordered_val_ind )</code></pre> +<pre><code> # A tibble: 10 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 R2-D2 96 32 + 2 R5-D4 97 32 + 3 Leia Organa 150 49 + 4 Beru Whitesun Lars 165 75 + 5 C-3PO 167 75 + 6 Luke Skywalker 172 77 + 7 Owen Lars 178 120 + 8 Obi-Wan Kenobi 182 77 + 9 Biggs Darklighter 183 84 + 10 Darth Vader 202 136</code></pre></li> +</ol> +<h4 id="slice-as-filter"><code>slice()</code> as +<code>filter()</code></h4> +<p>We can also use <code>slice()</code> to <code>filter()</code>, using +<code>which()</code>:</p> +<pre class="r"><code>identical( + starwars_sm |&gt; filter( height &gt; 150 ), + starwars_sm |&gt; slice( which(height &gt; 150) ) +)</code></pre> +<pre><code> [1] TRUE</code></pre> +<p>Thus, we can think of <code>filter()</code> and <code>slice()</code> +as two sides of the same coin:</p> +<ul> +<li><p><code>filter()</code> takes a logical vector that’s the same +length as the number of rows in the data frame</p></li> +<li><p><code>slice()</code> takes an integer vector that’s a (sub)set of +a data frame’s row indices.</p></li> +</ul> +<p>To put it more concretely, this logical vector was being passed to +the above <code>filter()</code> call:</p> +<pre class="r"><code>starwars_sm$height &gt; 150</code></pre> +<pre><code> [1] TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE</code></pre> +<p>While this integer vector was being passed to the above +<code>slice()</code> call, where <code>which()</code> returns the +position of <code>TRUE</code> values, given a logical 
vector:</p> +<pre class="r"><code>which( starwars_sm$height &gt; 150 )</code></pre> +<pre><code> [1] 1 2 4 6 7 9 10</code></pre> +<h3 id="special-properties-of-slice">Special properties of +<code>slice()</code></h3> +<p>This re-imagined <code>slice()</code> that heavily exploits +data-masking gives us two interesting properties:</p> +<ol style="list-style-type: decimal"> +<li><p>We can work with <strong>sets</strong> of row indices that need +not be the same length as the data frame +(vs. <code>filter()</code>).</p></li> +<li><p>We can work with row indices as <strong>integers</strong>, which +are legible to arithmetic operations (ex: <code>+</code> and +<code>*</code>)</p></li> +</ol> +<p>To grok the significance of working with rows as <strong>integer +sets</strong>, let’s work through some examples where +<code>slice()</code> comes in very handy.</p> +<h2 id="a-gallery-of-row-operations-with-slice">A gallery of row +operations with <code>slice()</code></h2> +<h3 id="repeat-rows-in-place">Repeat rows (in place)</h3> +<p>In <code>{tidyr}</code>, there’s a function called +<code>uncount()</code> which does the opposite of +<code>dplyr::count()</code>:</p> +<pre class="r"><code>library(tidyr) +# Example from `tidyr::uncount()` docs +uncount_df &lt;- tibble(x = c(&quot;a&quot;, &quot;b&quot;), n = c(1, 2)) +uncount_df</code></pre> +<pre><code> # A tibble: 2 × 2 + x n + &lt;chr&gt; &lt;dbl&gt; + 1 a 1 + 2 b 2</code></pre> +<pre class="r"><code>uncount_df |&gt; + uncount(n)</code></pre> +<pre><code> # A tibble: 3 × 1 + x + &lt;chr&gt; + 1 a + 2 b + 3 b</code></pre> +<p>We can mimic this behavior with <code>slice()</code>, using +<code>rep(times = ...)</code>:</p> +<pre class="r"><code>rep(1:nrow(uncount_df), times = uncount_df$n)</code></pre> +<pre><code> [1] 1 2 2</code></pre> +<pre class="r"><code>uncount_df |&gt; + slice( rep(row_number(), times = n) ) |&gt; + select( -n )</code></pre> +<pre><code> # A tibble: 3 × 1 + x + &lt;chr&gt; + 1 a + 2 b + 3 
b</code></pre> +<p>What if instead of a whole column storing that information, we only +have information about row position?</p> +<p>Let’s say we want to duplicate the rows of <code>starwars_sm</code> +at the <code>repeat_at</code> positions:</p> +<pre class="r"><code>repeat_at &lt;- sample(5, 2) +repeat_at</code></pre> +<pre><code> [1] 4 5</code></pre> +<p>In <code>slice()</code>, you’d just select all rows plus those +additional rows, then sort the integer row indices:</p> +<pre class="r"><code>starwars_sm |&gt; + slice( sort(c(row_number(), repeat_at)) )</code></pre> +<pre><code> # A tibble: 12 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Luke Skywalker 172 77 + 2 C-3PO 167 75 + 3 R2-D2 96 32 + 4 Darth Vader 202 136 + 5 Darth Vader 202 136 + 6 Leia Organa 150 49 + 7 Leia Organa 150 49 + 8 Owen Lars 178 120 + 9 Beru Whitesun Lars 165 75 + 10 R5-D4 97 32 + 11 Biggs Darklighter 183 84 + 12 Obi-Wan Kenobi 182 77</code></pre> +<p>What if we also separately have information about how much to repeat +those rows by?</p> +<pre class="r"><code>repeat_by &lt;- c(3, 4)</code></pre> +<p>You can apply the same <code>rep()</code> method for just the subset +of rows to repeat:</p> +<pre class="r"><code>starwars_sm |&gt; + slice( sort(c(row_number(), rep(repeat_at, times = repeat_by - 1))) )</code></pre> +<pre><code> # A tibble: 15 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Luke Skywalker 172 77 + 2 C-3PO 167 75 + 3 R2-D2 96 32 + 4 Darth Vader 202 136 + 5 Darth Vader 202 136 + 6 Darth Vader 202 136 + 7 Leia Organa 150 49 + 8 Leia Organa 150 49 + 9 Leia Organa 150 49 + 10 Leia Organa 150 49 + 11 Owen Lars 178 120 + 12 Beru Whitesun Lars 165 75 + 13 R5-D4 97 32 + 14 Biggs Darklighter 183 84 + 15 Obi-Wan Kenobi 182 77</code></pre> +<p>Circling back to <code>uncount()</code>, you could also initialize a +vector of <code>1s</code> and <code>replace()</code> where the rows +should be repeated:</p> +<pre class="r"><code>starwars_sm |&gt; + 
uncount( replace(rep(1, n()), repeat_at, repeat_by) )</code></pre> +<pre><code> # A tibble: 15 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Luke Skywalker 172 77 + 2 C-3PO 167 75 + 3 R2-D2 96 32 + 4 Darth Vader 202 136 + 5 Darth Vader 202 136 + 6 Darth Vader 202 136 + 7 Leia Organa 150 49 + 8 Leia Organa 150 49 + 9 Leia Organa 150 49 + 10 Leia Organa 150 49 + 11 Owen Lars 178 120 + 12 Beru Whitesun Lars 165 75 + 13 R5-D4 97 32 + 14 Biggs Darklighter 183 84 + 15 Obi-Wan Kenobi 182 77</code></pre> +<h3 id="subset-a-selection-of-rows-the-following-row">Subset a selection +of rows + the following row</h3> +<p>Row order can sometimes encode a meaningful continuous measure, like +time.</p> +<p>Take for example this subset of the <code>flights</code> dataset in +<code>{nycflights13}</code>:</p> +<pre class="r"><code>flights_df &lt;- nycflights13::flights |&gt; + filter(month == 3, day == 3, origin == &quot;JFK&quot;) |&gt; + select(dep_time, flight, carrier) |&gt; + slice(1:100) |&gt; + arrange(dep_time) +flights_df</code></pre> +<pre><code> # A tibble: 100 × 3 + dep_time flight carrier + &lt;int&gt; &lt;int&gt; &lt;chr&gt; + 1 535 1141 AA + 2 551 5716 EV + 3 555 145 B6 + 4 556 208 B6 + 5 556 79 B6 + 6 601 501 B6 + 7 604 725 B6 + 8 606 135 B6 + 9 606 600 UA + 10 607 829 US + # ℹ 90 more rows</code></pre> +<p>Here, the rows are ordered by <code>dep_time</code>, such that given +a row, the next row is a data point for the next flight that departed +from the airport.</p> +<p>And let’s say we’re interested in flights that took off immediately +after American Airlines (<code>"AA"</code>) flights. 
Given what we just +noted about the ordering of rows in the data frame, we can do this in +<code>slice()</code> by adding <code>1</code> to the row index of AA +flights:</p> +<pre class="r"><code>flights_df |&gt; + slice( which(carrier == &quot;AA&quot;) + 1 )</code></pre> +<pre><code> # A tibble: 14 × 3 + dep_time flight carrier + &lt;int&gt; &lt;int&gt; &lt;chr&gt; + 1 551 5716 EV + 2 627 905 B6 + 3 652 117 B6 + 4 714 825 AA + 5 717 987 B6 + 6 724 11 VX + 7 742 183 DL + 8 802 655 AA + 9 805 2143 DL + 10 847 59 B6 + 11 858 647 AA + 12 859 120 DL + 13 1031 179 AA + 14 1036 641 B6</code></pre> +<p>What if we also want to keep the observations for the preceding AA +flights themselves? We can just stick <code>which(carrier == "AA")</code> +inside <code>slice()</code> too:</p> +<pre class="r"><code>flights_df |&gt; + slice( + which(carrier == &quot;AA&quot;), + which(carrier == &quot;AA&quot;) + 1 + )</code></pre> +<pre><code> # A tibble: 28 × 3 + dep_time flight carrier + &lt;int&gt; &lt;int&gt; &lt;chr&gt; + 1 535 1141 AA + 2 626 413 AA + 3 652 1815 AA + 4 711 443 AA + 5 714 825 AA + 6 724 33 AA + 7 739 59 AA + 8 802 1838 AA + 9 802 655 AA + 10 843 1357 AA + # ℹ 18 more rows</code></pre> +<p>But now the rows are ordered such that all the AA flights come +before the other flights! 
How can we preserve the original order of +increasing <code>dep_time</code>?</p> +<p>We <em>could</em> reconstruct the initial row order by piping the +result into <code>arrange(dep_time)</code> again, but the simplest +solution would be to concatenate the set of row indices and +<code>sort()</code> them, since the output of <code>which()</code> is +already integer!</p> +<pre class="r"><code>flights_df |&gt; + slice( + sort(c( + which(carrier == &quot;AA&quot;), + which(carrier == &quot;AA&quot;) + 1 + )) + )</code></pre> +<pre><code> # A tibble: 28 × 3 + dep_time flight carrier + &lt;int&gt; &lt;int&gt; &lt;chr&gt; + 1 535 1141 AA + 2 551 5716 EV + 3 626 413 AA + 4 627 905 B6 + 5 652 1815 AA + 6 652 117 B6 + 7 711 443 AA + 8 714 825 AA + 9 714 825 AA + 10 717 987 B6 + # ℹ 18 more rows</code></pre> +<p>Notice how the 8th and 9th rows are repeated here - that’s because 2 +AA flights departed in a row (ha!). We can use <code>unique()</code> to +remove duplicate rows in the same call to <code>slice()</code>:</p> +<pre class="r"><code>flights_df |&gt; + slice( + unique(sort(c( + which(carrier == &quot;AA&quot;), + which(carrier == &quot;AA&quot;) + 1 + ))) + )</code></pre> +<pre><code> # A tibble: 24 × 3 + dep_time flight carrier + &lt;int&gt; &lt;int&gt; &lt;chr&gt; + 1 535 1141 AA + 2 551 5716 EV + 3 626 413 AA + 4 627 905 B6 + 5 652 1815 AA + 6 652 117 B6 + 7 711 443 AA + 8 714 825 AA + 9 717 987 B6 + 10 724 33 AA + # ℹ 14 more rows</code></pre> +<p>Importantly, we can do all of this inside <code>slice()</code> +because we’re working with <strong>integer sets</strong>. 
The +<strong>integer</strong> part allows us to do things like +<code>+ 1</code> and <code>sort()</code>, while the <strong>set</strong> +part allows us to combine with <code>c()</code> and remove duplicates +with <code>unique()</code>.</p> +<h3 id="subset-a-selection-of-rows-multiple-following-rows">Subset a +selection of rows + multiple following rows</h3> +<p>In this example, let’s problematize our approach with the repeated +<code>which()</code> calls in our previous solution.</p> +<p>Imagine another scenario where we want to filter for all AA flights +and <em>three</em> subsequent flights for each.</p> +<p>Do we need to write the solution out like this? That’s a lot of +repetition!</p> +<pre class="r"><code>flights_df |&gt; + slice( + which(carrier == &quot;AA&quot;), + which(carrier == &quot;AA&quot;) + 1, + which(carrier == &quot;AA&quot;) + 2, + which(carrier == &quot;AA&quot;) + 3 + )</code></pre> +<p>You might think we can get away with <code>+ 0:3</code>, but it +doesn’t work as we’d like. 
The <code>+</code> just forces +<code>0:3</code> to be (partially) recycled to the length of +<code>which(carrier == "AA")</code> for element-wise addition:</p> +<pre class="r"><code>which(flights_df$carrier == &quot;AA&quot;) + 0:3</code></pre> +<pre><code> Warning in which(flights_df$carrier == &quot;AA&quot;) + 0:3: longer object length is not + a multiple of shorter object length</code></pre> +<pre><code> [1] 1 14 20 27 25 28 34 40 38 62 66 68 91 93</code></pre> +<p>If only we could get the <strong>outer</strong> sum of the two arrays, +<code>0:3</code> and <code>which(carrier == "AA")</code> … Oh wait, we +can - that’s what <code>outer()</code> does!</p> +<pre class="r"><code>outer(0:3, which(flights_df$carrier == &quot;AA&quot;), `+`)</code></pre> +<pre><code> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] + [1,] 1 13 18 24 25 27 32 37 38 61 64 65 91 92 + [2,] 2 14 19 25 26 28 33 38 39 62 65 66 92 93 + [3,] 3 15 20 26 27 29 34 39 40 63 66 67 93 94 + [4,] 4 16 21 27 28 30 35 40 41 64 67 68 94 95</code></pre> +<p>This is essentially the repeated <code>which()</code> vectors stacked +on top of each other, but as a matrix:</p> +<pre class="r"><code>print( which(flights_df$carrier == &quot;AA&quot;) ) +print( which(flights_df$carrier == &quot;AA&quot;) + 1 ) +print( which(flights_df$carrier == &quot;AA&quot;) + 2 ) +print( which(flights_df$carrier == &quot;AA&quot;) + 3 )</code></pre> +<pre><code> [1] 1 13 18 24 25 27 32 37 38 61 64 65 91 92 + [1] 2 14 19 25 26 28 33 38 39 62 65 66 92 93 + [1] 3 15 20 26 27 29 34 39 40 63 66 67 93 94 + [1] 4 16 21 27 28 30 35 40 41 64 67 68 94 95</code></pre> +<p>The fact that <code>outer()</code> returns all the relevant row +indices inside a single matrix is nice because we can collect the +indices column-by-column, preserving row order. 
Matrices, like data +frames, are <strong>column-major</strong>, so coercing a matrix to a +vector collapses it column-wise:</p> +<pre class="r"><code>as.integer( outer(0:3, which(flights_df$carrier == &quot;AA&quot;), `+`) )</code></pre> +<pre><code> [1] 1 2 3 4 13 14 15 16 18 19 20 21 24 25 26 27 25 26 27 28 27 28 29 30 32 + [26] 33 34 35 37 38 39 40 38 39 40 41 61 62 63 64 64 65 66 67 65 66 67 68 91 92 + [51] 93 94 92 93 94 95</code></pre> +<details> +<summary> +Other ways to coerce matrix to vector +</summary> +<p>There are two other options for coercing a matrix to vector - +<code>c()</code> and <code>as.vector()</code>. I like to stick with +<code>as.integer()</code> because that enforces integer type (which +makes sense for row indices), and <code>c()</code> can be nice because +it’s less to type (although it’s <a +href="https://youtu.be/izFssYRsLZs?t=1143">off-label usage</a>):</p> +<pre class="r"><code># Not run, but equivalent to `as.integer()` method +as.vector( outer(0:3, which(flights_df$carrier == &quot;AA&quot;), `+`) ) +c( outer(0:3, which(flights_df$carrier == &quot;AA&quot;), `+`) )</code></pre> +<p>Somewhat relatedly - and this only works inside the tidy-eval context +of <code>slice()</code> - you can get a similar effect of “collapsing” a +matrix using the <a +href="https://rlang.r-lib.org/reference/topic-inject.html#splicing-with-">splice +operator</a> <code>!!!</code>:</p> +<pre class="r"><code>seq_matrix &lt;- matrix(1:9, byrow = TRUE, nrow = 3) +as.integer(seq_matrix)</code></pre> +<pre><code> [1] 1 4 7 2 5 8 3 6 9</code></pre> +<pre class="r"><code>identical( + mtcars |&gt; slice( as.vector(seq_matrix) ), + mtcars |&gt; slice( !!!seq_matrix ) +)</code></pre> +<pre><code> [1] TRUE</code></pre> +<p>Here, the <code>!!!seq_matrix</code> was slotting each individual +“cell” as argument to <code>slice()</code>:</p> +<pre class="r"><code>rlang::expr( slice(!!!seq_matrix) )</code></pre> +<pre><code> slice(1L, 4L, 7L, 2L, 5L, 8L, 3L, 6L, 
9L)</code></pre> +<p>A big difference in behavior between <code>as.integer()</code> +and <code>!!!</code> is that the latter works for <strong>lists</strong> +of indices too, by slotting each element of the list as an argument to +<code>slice()</code>:</p> +<pre class="r"><code>seq_list &lt;- list(c(1, 4, 7, 2), c(5, 8, 3, 6, 9)) +rlang::expr( slice( !!!seq_list ) )</code></pre> +<pre><code> slice(c(1, 4, 7, 2), c(5, 8, 3, 6, 9))</code></pre> +<p>However, as you may already know, <code>as.integer()</code> cannot +flatten lists:</p> +<pre class="r"><code>as.integer(seq_list)</code></pre> +<pre><code> Error in eval(expr, envir, enclos): &#39;list&#39; object cannot be coerced to type &#39;integer&#39;</code></pre> +<p>Note that <code>as.vector()</code> and <code>c()</code> leave lists +<em>as is</em>, which is another reason to prefer +<code>as.integer()</code> for type-checking:</p> +<pre class="r"><code>identical(seq_list, as.vector(seq_list)) +identical(seq_list, c(seq_list))</code></pre> +<pre><code> [1] TRUE + [1] TRUE</code></pre> +<p>Finally, back in our <code>!!!seq_matrix</code> example, we could +have applied <code>asplit(MARGIN = 2)</code> to chunk the splicing by +<em>matrix column</em>, although the overall effect would be the +same:</p> +<pre class="r"><code>rlang::expr(slice( !!!seq_matrix ))</code></pre> +<pre><code> slice(1L, 4L, 7L, 2L, 5L, 8L, 3L, 6L, 9L)</code></pre> +<pre class="r"><code>rlang::expr(slice( !!!asplit(seq_matrix, 2) ))</code></pre> +<pre><code> slice(c(1L, 4L, 7L), c(2L, 5L, 8L), c(3L, 6L, 9L))</code></pre> +</details> +<p>This lets us ask questions like: Which AA flights departed within 3 +flights of another AA flight?</p> +<pre class="r"><code>flights_df |&gt; + slice( as.integer( outer(0:3, which(carrier == &quot;AA&quot;), `+`) ) ) |&gt; + filter( carrier == &quot;AA&quot;, duplicated(flight) ) |&gt; + distinct(flight, carrier)</code></pre> +<pre><code> # A tibble: 6 × 2 + flight carrier + &lt;int&gt; &lt;chr&gt; + 1 825 AA + 
2 33 AA + 3 655 AA + 4 1 AA + 5 647 AA + 6 179 AA</code></pre> +<details> +<summary> +Slicing all the way down: Case 1 +</summary> +<p>With the addition of the <code>.by</code> argument to +<code>slice()</code> in <a +href="https://www.tidyverse.org/blog/2023/02/dplyr-1-1-0-per-operation-grouping/">dplyr +v1.1.0</a>, we can re-write the above code as three calls to +<code>slice()</code> (+ a call to <code>select()</code>):</p> +<pre class="r"><code>flights_df |&gt; + slice( as.integer( outer(0:3, which(carrier == &quot;AA&quot;), `+`) ) ) |&gt; + slice( which(carrier == &quot;AA&quot; &amp; duplicated(flight)) ) |&gt; # filter() + slice( 1, .by = c(flight, carrier) ) |&gt; # distinct() + select(flight, carrier)</code></pre> +<pre><code> # A tibble: 6 × 2 + flight carrier + &lt;int&gt; &lt;chr&gt; + 1 825 AA + 2 33 AA + 3 655 AA + 4 1 AA + 5 647 AA + 6 179 AA</code></pre> +</details> +<p>The next example will demonstrate another, perhaps more practical +use case for <code>outer()</code> in <code>slice()</code>.</p> +<h3 id="filter-and-encode-neighboring-rows">Filter (and encode) +neighboring rows</h3> +<p>Let’s use a subset of the <code>{gapminder}</code> data set for this +one. Here, we have data for each European country’s GDP-per-capita by +year, from 1992 to 2007:</p> +<pre class="r"><code>gapminder_df &lt;- gapminder::gapminder |&gt; + left_join(gapminder::country_codes, by = &quot;country&quot;) |&gt; # `multiple = &quot;all&quot;` + filter(year &gt;= 1992, continent == &quot;Europe&quot;) |&gt; + select(country, country_code = iso_alpha, year, gdpPercap) +gapminder_df</code></pre> +<pre><code> # A tibble: 120 × 4 + country country_code year gdpPercap + &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Albania ALB 1992 2497. + 2 Albania ALB 1997 3193. + 3 Albania ALB 2002 4604. + 4 Albania ALB 2007 5937. + 5 Austria AUT 1992 27042. + 6 Austria AUT 1997 29096. + 7 Austria AUT 2002 32418. + 8 Austria AUT 2007 36126. + 9 Belgium BEL 1992 25576. 
+ 10 Belgium BEL 1997 27561. + # ℹ 110 more rows</code></pre> +<p>This time, let’s see the desired output (plot) first and build our +way up. The goal is to plot the GDP growth of Germany over the years, +<em>and</em> its yearly <strong>GDP neighbors</strong> side-by-side:</p> +<p><img src="file86a47c2865f8_files/figure-html/final-gapminder-plot-1.png" width="672" /></p> +<p>First, let’s think about what a “GDP neighbor” means in +row-relational terms. If you arranged the data by GDP, the GDP neighbors +would be the rows that come immediately before and after the rows for +Germany. You need to recalculate neighbors every year though, so this +<code>arrange()</code> + <code>slice()</code> combo should happen +by-year.</p> +<p>With that in mind, let’s set up a <code>year</code> grouping and +arrange by <code>gdpPercap</code> within <code>year</code>:<a +href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p> +<pre class="r"><code>gapminder_df |&gt; + group_by(year) |&gt; + arrange(gdpPercap, .by_group = TRUE)</code></pre> +<pre><code> # A tibble: 120 × 4 + # Groups: year [4] + country country_code year gdpPercap + &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Albania ALB 1992 2497. + 2 Bosnia and Herzegovina BIH 1992 2547. + 3 Turkey TUR 1992 5678. + 4 Bulgaria BGR 1992 6303. + 5 Romania ROU 1992 6598. + 6 Montenegro MNE 1992 7003. + 7 Poland POL 1992 7739. + 8 Croatia HRV 1992 8448. + 9 Serbia SRB 1992 9325. + 10 Slovak Republic SVK 1992 9498. + # ℹ 110 more rows</code></pre> +<p>Now within each year, we want to grab the row for Germany +<em>and</em> its neighboring rows. 
We can do this by taking the +<code>outer()</code> sum of <code>-1:1</code> and the row indices for +Germany:</p> +<pre class="r"><code>gapminder_df |&gt; + group_by(year) |&gt; + arrange(gdpPercap, .by_group = TRUE) |&gt; + slice( as.integer(outer( -1:1, which(country == &quot;Germany&quot;), `+` )) )</code></pre> +<pre><code> # A tibble: 12 × 4 + # Groups: year [4] + country country_code year gdpPercap + &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Denmark DNK 1992 26407. + 2 Germany DEU 1992 26505. + 3 Netherlands NLD 1992 26791. + 4 Belgium BEL 1997 27561. + 5 Germany DEU 1997 27789. + 6 Iceland ISL 1997 28061. + 7 United Kingdom GBR 2002 29479. + 8 Germany DEU 2002 30036. + 9 Belgium BEL 2002 30486. + 10 France FRA 2007 30470. + 11 Germany DEU 2007 32170. + 12 United Kingdom GBR 2007 33203.</code></pre> +<details> +<summary> +Slicing all the way down: Case 2 +</summary> +<p>The new <code>.by</code> argument in <code>slice()</code> comes in +handy again here, allowing us to collapse the <code>group_by()</code> + +<code>arrange()</code> combo into one <code>slice()</code> call:</p> +<pre class="r"><code>gapminder_df |&gt; + slice( order(gdpPercap), .by = year) |&gt; + slice( as.integer(outer( -1:1, which(country == &quot;Germany&quot;), `+` )) )</code></pre> +<pre><code> # A tibble: 12 × 4 + country country_code year gdpPercap + &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Denmark DNK 1992 26407. + 2 Germany DEU 1992 26505. + 3 Netherlands NLD 1992 26791. + 4 Belgium BEL 1997 27561. + 5 Germany DEU 1997 27789. + 6 Iceland ISL 1997 28061. + 7 United Kingdom GBR 2002 29479. + 8 Germany DEU 2002 30036. + 9 Belgium BEL 2002 30486. + 10 France FRA 2007 30470. + 11 Germany DEU 2007 32170. 
+ 12 United Kingdom GBR 2007 33203.</code></pre> +For our purposes here we actually want the grouping to +<strong>persist</strong> for the following <code>mutate()</code> call, +but there may be other cases where you’d want to use +<code>slice(.by = )</code> for temporary grouping. +</details> +<p>Now we’re already starting to see the shape of the data that we want! +The last step is to encode the relationship of each row to Germany - +does a row represent Germany itself, or a country that’s one GDP ranking +below or above Germany?</p> +<p>Continuing with our grouped context, we make a new column +<code>grp</code> that assigns a factor value +<code>"lo"</code>-<code>"is"</code>-<code>"hi"</code> (for “lower” than +Germany, “is” Germany and “higher” than Germany) to each country trio by +year. Notice the use of <code>fct_inorder()</code> below - this ensures +that the factor levels are in the order of their occurrence (necessary +for the correct ordering of bars in <code>geom_col()</code> later):</p> +<pre class="r"><code>gapminder_df |&gt; + group_by(year) |&gt; + arrange(gdpPercap) |&gt; + slice( as.integer(outer( -1:1, which(country == &quot;Germany&quot;), `+` )) ) |&gt; + mutate(grp = forcats::fct_inorder(c(&quot;lo&quot;, &quot;is&quot;, &quot;hi&quot;)))</code></pre> +<pre><code> # A tibble: 12 × 5 + # Groups: year [4] + country country_code year gdpPercap grp + &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; &lt;fct&gt; + 1 Denmark DNK 1992 26407. lo + 2 Germany DEU 1992 26505. is + 3 Netherlands NLD 1992 26791. hi + 4 Belgium BEL 1997 27561. lo + 5 Germany DEU 1997 27789. is + 6 Iceland ISL 1997 28061. hi + 7 United Kingdom GBR 2002 29479. lo + 8 Germany DEU 2002 30036. is + 9 Belgium BEL 2002 30486. hi + 10 France FRA 2007 30470. lo + 11 Germany DEU 2007 32170. is + 12 United Kingdom GBR 2007 33203. 
hi</code></pre> +<p>We now have everything that’s necessary to make our desired plot, so +we <code>ungroup()</code>, write some <code>{ggplot2}</code> code, and +voila!</p> +<pre class="r"><code>gapminder_df |&gt; + group_by(year) |&gt; + arrange(gdpPercap) |&gt; + slice( as.integer(outer( -1:1, which(country == &quot;Germany&quot;), `+` )) ) |&gt; + mutate(grp = forcats::fct_inorder(c(&quot;lo&quot;, &quot;is&quot;, &quot;hi&quot;))) |&gt; + # Ungroup and make ggplot + ungroup() |&gt; + ggplot(aes(as.factor(year), gdpPercap, group = grp)) + + geom_col(aes(fill = grp == &quot;is&quot;), position = position_dodge()) + + geom_text( + aes(label = country_code), + vjust = 1.3, + position = position_dodge(width = .9) + ) + + scale_fill_manual( + values = c(&quot;grey75&quot;, &quot;steelblue&quot;), + guide = guide_none() + ) + + theme_classic() + + labs(x = &quot;Year&quot;, y = &quot;GDP per capita&quot;)</code></pre> +<p><img src="file86a47c2865f8_files/figure-html/final-gapminder-plot-1.png" width="672" /></p> +<details> +<summary> +Solving the harder version of the problem +</summary> +<p>The solution presented above relies on the fragile assumption that +Germany will always have a higher <em>and</em> lower ranking GDP +neighbor every year. But nothing about the problem description +guarantees this, so how can we re-write our code to be more robust?</p> +<p>First, let’s simulate a dataset where Germany is the lowest ranking +country in 2002 and the highest ranking in 2007. In other words, Germany +only has one GDP neighbor in those years:</p> +<pre class="r"><code>gapminder_harder_df &lt;- gapminder_df |&gt; + slice( order(gdpPercap), .by = year) |&gt; + slice( as.integer(outer( -1:1, which(country == &quot;Germany&quot;), `+` )) ) |&gt; + slice( -7, -12 ) +gapminder_harder_df</code></pre> +<pre><code> # A tibble: 10 × 4 + country country_code year gdpPercap + &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Denmark DNK 1992 26407. + 2 Germany DEU 1992 26505. 
+ 3 Netherlands NLD 1992 26791. + 4 Belgium BEL 1997 27561. + 5 Germany DEU 1997 27789. + 6 Iceland ISL 1997 28061. + 7 Germany DEU 2002 30036. + 8 Belgium BEL 2002 30486. + 9 France FRA 2007 30470. + 10 Germany DEU 2007 32170.</code></pre> +<p>Given this data, we cannot assign the full, length-3 lo-is-hi factor +by group, because the groups for year 2002 and 2007 only have 2 +observations:</p> +<pre class="r"><code>gapminder_harder_df |&gt; + group_by(year) |&gt; + mutate(grp = forcats::fct_inorder(c(&quot;lo&quot;, &quot;is&quot;, &quot;hi&quot;)))</code></pre> +<pre><code> Error in `mutate()`: + ℹ In argument: `grp = forcats::fct_inorder(c(&quot;lo&quot;, &quot;is&quot;, &quot;hi&quot;))`. + ℹ In group 3: `year = 2002`. + Caused by error: + ! `grp` must be size 2 or 1, not 3.</code></pre> +<p>The trick here is to turn each group of rows into an integer sequence +where Germany is “anchored” to 2, and then use that vector to subset the +lo-is-hi factor:</p> +<pre class="r"><code>gapminder_harder_df |&gt; + group_by(year) |&gt; + mutate( + Germany_anchored_to_2 = row_number() - which(country == &quot;Germany&quot;) + 2, + grp = forcats::fct_inorder(c(&quot;lo&quot;, &quot;is&quot;, &quot;hi&quot;))[Germany_anchored_to_2] + )</code></pre> +<pre><code> # A tibble: 10 × 6 + # Groups: year [4] + country country_code year gdpPercap Germany_anchored_to_2 grp + &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;fct&gt; + 1 Denmark DNK 1992 26407. 1 lo + 2 Germany DEU 1992 26505. 2 is + 3 Netherlands NLD 1992 26791. 3 hi + 4 Belgium BEL 1997 27561. 1 lo + 5 Germany DEU 1997 27789. 2 is + 6 Iceland ISL 1997 28061. 3 hi + 7 Germany DEU 2002 30036. 2 is + 8 Belgium BEL 2002 30486. 3 hi + 9 France FRA 2007 30470. 1 lo + 10 Germany DEU 2007 32170. 
2 is</code></pre> +<p>We find that the lessons of working with row indices from +<code>slice()</code> translated to solving this complex +<code>mutate()</code> problem - neat!</p> +</details> +<h3 id="aside-kronecker-as-as.vectorouter">Aside: +<code>kronecker()</code> as <code>as.vector(outer())</code></h3> +<p>Following from the <code>slice()</code> + <code>outer()</code> +strategy demoed above, imagine if we wanted to filter for +<code>"Luke Skywalker"</code> and 4 other characters that are neighbors +in the <code>height</code> and <code>mass</code> values.</p> +<pre class="r"><code>dplyr::starwars[, 1:3]</code></pre> +<pre><code> # A tibble: 87 × 3 + name height mass + &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; + 1 Luke Skywalker 172 77 + 2 C-3PO 167 75 + 3 R2-D2 96 32 + 4 Darth Vader 202 136 + 5 Leia Organa 150 49 + 6 Owen Lars 178 120 + 7 Beru Whitesun Lars 165 75 + 8 R5-D4 97 32 + 9 Biggs Darklighter 183 84 + 10 Obi-Wan Kenobi 182 77 + # ℹ 77 more rows</code></pre> +<p>In row-relational terms, “filtering neighboring values” just means +“filtering rows after arranging by the values we care about”. 
We can +express this using <code>slice()</code> and <code>outer()</code> as:</p> +<pre class="r"><code>starwars %&gt;% + select(name, mass, height) %&gt;% + arrange(mass, height) %&gt;% + slice( as.vector(outer(-2:2, which(name == &quot;Luke Skywalker&quot;), `+`)) )</code></pre> +<pre><code> # A tibble: 5 × 3 + name mass height + &lt;chr&gt; &lt;dbl&gt; &lt;int&gt; + 1 Palpatine 75 170 + 2 Wedge Antilles 77 170 + 3 Luke Skywalker 77 172 + 4 Obi-Wan Kenobi 77 182 + 5 Boba Fett 78.2 183</code></pre> +<p>I raised this example on an unrelated thread on the <a +href="https://fosstodon.org/@DSLC">R4DS/DSLC slack</a>, where Anthony +Durrant pointed me to <code>kronecker()</code> as a version of +<code>outer()</code> that flattens the output before returning it.</p> +<p>So in examples involving <code>outer()</code> to generate row indices +in <code>slice()</code>, we can also use <code>kronecker()</code> +instead to save a call to a flattening function like +<code>as.vector()</code>:</p> +<pre class="r"><code>starwars %&gt;% + select(name, mass, height) %&gt;% + arrange(mass, height) %&gt;% + slice( kronecker(-2:2, which(name == &quot;Luke Skywalker&quot;), `+`) )</code></pre> +<pre><code> # A tibble: 5 × 3 + name mass height + &lt;chr&gt; &lt;dbl&gt; &lt;int&gt; + 1 Palpatine 75 170 + 2 Wedge Antilles 77 170 + 3 Luke Skywalker 77 172 + 4 Obi-Wan Kenobi 77 182 + 5 Boba Fett 78.2 183</code></pre> +<h3 id="windowed-minmaxmedian-etc.">Windowed min/max/median (etc.)</h3> +<p>Let’s say we have this small time series dataset, and we want to +calculate a <strong>lagged 3-window moving minimum</strong> for the +<code>val</code> column:</p> +<pre class="r"><code>ts_df &lt;- tibble( + time = 1:6, + val = sample(1:6 * 10) +) +ts_df</code></pre> +<pre><code> # A tibble: 6 × 2 + time val + &lt;int&gt; &lt;dbl&gt; + 1 1 50 + 2 2 40 + 3 3 60 + 4 4 30 + 5 5 20 + 6 6 10</code></pre> +<p>If you’re new to window functions, think of them as a special kind of +<code>group_by()</code> + 
<code>summarize()</code> where groups are +chunks of observations along a (typically unique) continuous measure +like time, and observations can be shared between groups.</p> +<p>There are several packages implementing moving/sliding/rolling window +functions. My current favorite is <code>{r2c}</code> (see a <a +href="https://github.com/brodieG/r2c#fast-group-and-rolling-statistics">review +of other implementations therein</a>), but I also like +<code>{slider}</code> for an implementation that follows familiar <a +href="https://design.tidyverse.org/">“tidy” design principles</a>:</p> +<pre class="r"><code>library(slider) +ts_df |&gt; + mutate(moving_min = slide_min(val, before = 2L, complete = TRUE))</code></pre> +<pre><code> # A tibble: 6 × 3 + time val moving_min + &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; + 1 1 50 NA + 2 2 40 NA + 3 3 60 40 + 4 4 30 30 + 5 5 20 20 + 6 6 10 10</code></pre> +<p>Moving windows are a general class of operations that encompasses any +arbitrary summary statistic - so not just min but other reducing +functions like mean, standard deviation, etc. But what makes moving +<strong>min</strong> (along with max, median, etc.) a particularly +interesting case for our current discussion is that the value comes from +<strong>an existing observation</strong> in the data. And if our time +series is tidy, every observation makes up a row. See where I’m going +with this?</p> +<p>Using <code>outer()</code> again, we can take the outer sum of all +row indices of <code>ts_df</code> and <code>-2:0</code>. 
This gives us a +matrix where each column represents a lagged size-3 moving window:</p> +<pre class="r"><code>windows_3lag &lt;- outer(-2:0, 1:nrow(ts_df), &quot;+&quot;) +windows_3lag</code></pre> +<pre><code> [,1] [,2] [,3] [,4] [,5] [,6] + [1,] -1 0 1 2 3 4 + [2,] 0 1 2 3 4 5 + [3,] 1 2 3 4 5 6</code></pre> +<p>The “lagged size-3” property of this moving window means that the +first two windows are incomplete (consisting of fewer than 3 +observations). 
We want to treat those as invalid, so we can drop the +first two columns from our matrix:</p> +<pre class="r"><code>windows_3lag[,-(1:2)]</code></pre> +<pre><code> [,1] [,2] [,3] [,4] + [1,] 1 2 3 4 + [2,] 2 3 4 5 + [3,] 3 4 5 6</code></pre> +<p>For each remaining column, we want to grab the values of +<code>val</code> at the corresponding row indices and find which row has +the minimum <code>val</code>. In terms of code, we use +<code>apply()</code> with <code>MARGIN = 2L</code> to column-wise apply +a function where we use <code>which.min()</code> to find the location of +the minimum <code>val</code> and convert it back to a row index via +subsetting:</p> +<pre class="r"><code>windows_3lag[, -(1:2)] |&gt; + apply(MARGIN = 2L, \(i) i[which.min(ts_df$val[i])])</code></pre> +<pre><code> [1] 2 4 5 6</code></pre> +<p>Now let’s stick this inside <code>slice()</code>, exploiting the fact +that it’s <em>data-masked</em> (<code>ts_df$val</code> can just be +<code>val</code>) and exposes <em>context-dependent expressions</em> +(<code>1:nrow(ts_df)</code> can just be <code>row_number()</code>):</p> +<pre class="r"><code>moving_mins &lt;- ts_df |&gt; + slice( + outer(-2:0, row_number(), &quot;+&quot;)[,-(1:2)] |&gt; + apply(MARGIN = 2L, \(i) i[which.min(val[i])]) + ) +moving_mins</code></pre> +<pre><code> # A tibble: 4 × 2 + time val + &lt;int&gt; &lt;dbl&gt; + 1 2 40 + 2 4 30 + 3 5 20 + 4 6 10</code></pre> +<p>From here, we can grab the <code>val</code> column and pad it with +<code>NA</code> to add our desired <code>moving_min</code> column to the +original data frame:</p> +<pre class="r"><code>ts_df |&gt; + mutate(moving_min = c(NA, NA, moving_mins$val))</code></pre> +<pre><code> # A tibble: 6 × 3 + time val moving_min + &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; + 1 1 50 NA + 2 2 40 NA + 3 3 60 40 + 4 4 30 30 + 5 5 20 20 + 6 6 10 10</code></pre> +<p>At this point you might think that this is a very roundabout way of +solving the same problem. 
But actually I think that it’s a faster route +to solving a slightly more complicated problem - augmenting each +observation of a data frame with information about <strong>comparison +observations</strong>.</p> +<p>For example, our <code>slice()</code>-based solution sets us up +nicely for also bringing along information about the time at which the +<code>moving_min</code> occurred. After some <code>rename()</code>-ing +and adding the original time information back in, we get back a +relational data structure where <code>time</code> is a +<strong>key</strong> shared with <code>ts_df</code>:</p> +<pre class="r"><code>moving_mins2 &lt;- moving_mins |&gt; + rename(moving_min_val = val, moving_min_time = time) |&gt; + mutate(time = ts_df$time[-(1:2)], .before = 1L) +moving_mins2</code></pre> +<pre><code> # A tibble: 4 × 3 + time moving_min_time moving_min_val + &lt;int&gt; &lt;int&gt; &lt;dbl&gt; + 1 3 2 40 + 2 4 4 30 + 3 5 5 20 + 4 6 6 10</code></pre> +<p>We can then left-join this to the original data to augment it with +information about both the value of the 3-window minimum and the time +that the minimum occurred:</p> +<pre class="r"><code>left_join(ts_df, moving_mins2, by = &quot;time&quot;)</code></pre> +<pre><code> # A tibble: 6 × 4 + time val moving_min_time moving_min_val + &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; + 1 1 50 NA NA + 2 2 40 NA NA + 3 3 60 2 40 + 4 4 30 4 30 + 5 5 20 5 20 + 6 6 10 6 10</code></pre> +<p>This is particularly useful if rows contain other useful information +for comparison and you have memory to spare:</p> +<pre class="r"><code>ts_wide_df &lt;- ts_df |&gt; + mutate( + col1 = rnorm(6), + col2 = rnorm(6) + ) +ts_wide_df</code></pre> +<pre><code> # A tibble: 6 × 4 + time val col1 col2 + &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; + 1 1 50 0.0183 0.00501 + 2 2 40 0.705 -0.0376 + 3 3 60 -0.647 0.724 + 4 4 30 0.868 -0.497 + 5 5 20 0.376 0.0114 + 6 6 10 0.310 0.00986</code></pre> +<p>The below code augments each observation in 
the original +<code>ts_wide_df</code> data with information about the corresponding +3-window moving min (columns prefixed with <code>"min3val_"</code>)</p> +<pre class="r"><code>moving_mins_wide &lt;- ts_wide_df |&gt; + slice( + outer(-2:0, row_number(), &quot;+&quot;)[,-(1:2)] |&gt; + apply(MARGIN = 2L, \(i) i[which.min(val[i])]) + ) |&gt; + rename_with(~ paste0(&quot;min3val_&quot;, .x)) |&gt; + mutate(time = ts_wide_df$time[-(1:2)]) +left_join(ts_wide_df, moving_mins_wide, by = &quot;time&quot;)</code></pre> +<pre><code> # A tibble: 6 × 8 + time val col1 col2 min3val_time min3val_val min3val_col1 + &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; + 1 1 50 0.0183 0.00501 NA NA NA + 2 2 40 0.705 -0.0376 NA NA NA + 3 3 60 -0.647 0.724 2 40 0.705 + 4 4 30 0.868 -0.497 4 30 0.868 + 5 5 20 0.376 0.0114 5 20 0.376 + 6 6 10 0.310 0.00986 6 10 0.310 + # ℹ 1 more variable: min3val_col2 &lt;dbl&gt;</code></pre> +<!-- ### Dependency relations between rows --> +<!-- In tidy representations of nested, hierarchical data, the parent-child relationship between rows are often encoded in a column that reference other rows in the data. --> +<!-- Let's take the case of dependency parsing in NLP (using [spaCy](https://spacy.io/)) for example. 
Given a sentence like "June likes cute cats", the dependency relationship between tokens can be represented like so: --> +<!-- <details> --> +<!-- <summary>Code to produce the figure</summary> --> +<!-- ```{r displacy, eval = FALSE} --> +<!-- library(reticulate) --> +<!-- spacy <- import("spacy") --> +<!-- nlp <- spacy$load("en_core_web_sm") --> +<!-- py_parsed <- nlp("June likes cute cats") --> +<!-- displacy_render <- function(x) { --> +<!-- spacy$displacy$render(x) |> --> +<!-- htmltools::HTML() |> --> +<!-- htmltools::html_print() --> +<!-- } --> +<!-- displacy_render(py_parsed) --> +<!-- ``` --> +<!-- </details> --> +<!-- ```{r displacy, message = FALSE} --> +<!-- ``` --> +<!-- There's a lot going on here, but I want to draw attention to the fact that there's an arrow going from `"likes"` to `"cats"` and from `"cats"` to `"cute"`. In NLP terms, we say that in this sentence, the head of "cute" is "cats", and the head of "cats" is "likes". --> +<!-- To demonstrate this using **spacy** in *python*, we can parse the sentence and store the parsed object in `py_parsed`: --> +<!-- ```{r, eval = FALSE} --> +<!-- library(reticulate) --> +<!-- spacy <- import("spacy") --> +<!-- nlp <- spacy$load("en_core_web_sm") --> +<!-- py_parsed <- nlp("June likes cute cats") --> +<!-- ``` --> +<!-- Using python's 0-indexing, we extract the token object corresponding to "cats". From there, we can see the dependency relationship where the head of "cute" is "cats" and the head of "cats" is "likes": --> +<!-- ```{r} --> +<!-- py_parsed[2] --> +<!-- py_parsed[2]$head --> +<!-- py_parsed[2]$head$head --> +<!-- ``` --> +<!-- We see that this dependency relationship between "cute"-"cats"-"likes" is expressed concisely in object-oriented programming (you just follow the `head` property of tokens). This gets a bit trickier to work with in tidy data form. 
--> +<!-- Using the `{spacyr}` package, we can generate a dataframe equivalent of `py_parsed` which stores the dependency relationship in the `head_token_id` column: --> +<!-- ```{r, message = FALSE} --> +<!-- library(spacyr) --> +<!-- sentence <- "June likes cute cats" --> +<!-- parsed <- spacy_parse(sentence, dependency = TRUE, entity = FALSE)[,-c(1:2)] --> +<!-- parsed --> +<!-- ``` --> +<!-- To get from "cute" to "cats" and "likes" by following the arrows, we can use `slice()`: --> +<!-- ```{r} --> +<!-- # Child token --> +<!-- parsed |> --> +<!-- slice(3) --> +<!-- # Parent head --> +<!-- parsed |> --> +<!-- slice( head_token_id[3] ) --> +<!-- # Grandparent head --> +<!-- parsed |> --> +<!-- slice( head_token_id[token_id == head_token_id[3]] ) --> +<!-- ``` --> +<!-- <details> --> +<!-- <summary>The recursive generalization</summary> --> +<!-- ```{r} --> +<!-- # Special case for init --> +<!-- parsed |> --> +<!-- slice( 3 ) --> +<!-- # Recursive call to `head_token_id[token_id == PREV]` --> +<!-- parsed |> --> +<!-- slice( head_token_id[token_id == 3] ) --> +<!-- parsed |> --> +<!-- slice( head_token_id[token_id == head_token_id[token_id == 3]] ) --> +<!-- ``` --> +<!-- </details> --> +<!-- Now let's say that we have a paragraph of sentences describing what June likes and doesn't likes. --> +<!-- ```{r} --> +<!-- paragraph <- "June likes cute cats. June hates angry cats. June likes small dogs." --> +<!-- parsed_paragraph <- spacy_parse(paragraph, dependency = TRUE, entity = FALSE) --> +<!-- parsed_paragraph <- parsed_paragraph |> --> +<!-- filter(pos != "PUNCT") --> +<!-- parsed_paragraph --> +<!-- ``` --> +<!-- And let's say that our research question is: what kinds of cats does June like? In other words we want to end up with a set of `ADJ`s where the head token is `cats` and the head token of that is `likes`. 
In `parsed_paragraph`, there's only one token (row) meeting this criteria: --> +<!-- ```{r} --> +<!-- parsed_paragraph[3,] --> +<!-- ``` --> +<!-- In our solution we use a helper function `get_head_tokens()`, which recursively searches for token head using `slice()`. This blog post is already getting too long so I'll just dump the code for now and maybe I'll come back to add more explanations later... --> +<!-- ```{r} --> +<!-- get_head_tokens <- function(df, child, n) { --> +<!-- child <- rlang::eval_tidy(enquo(child), data = df) --> +<!-- purrr::accumulate( --> +<!-- .x = seq_len(n), --> +<!-- .f = ~ slice(df, head_token_id[token_id == .x$token_id]), --> +<!-- .init = slice(df, child) --> +<!-- ) |> --> +<!-- slice(-1) |> --> +<!-- pull(token) --> +<!-- } --> +<!-- parsed_paragraph |> --> +<!-- filter(sentence_id == 1) |> --> +<!-- get_head_tokens(child = which(pos == "ADJ"), n = 2) --> +<!-- ``` --> +<!-- Applying `get_head_tokens()` for each sentence gets us the solution: --> +<!-- ```{r} --> +<!-- parsed_paragraph |> --> +<!-- group_by(sentence_id) |> --> +<!-- filter( --> +<!-- pos == "ADJ", --> +<!-- identical( --> +<!-- c("cats", "likes"), --> +<!-- get_head_tokens(pick(everything()), which(pos == "ADJ"), 2) --> +<!-- ) --> +<!-- ) |> --> +<!-- ungroup() --> +<!-- ``` --> +<h3 id="evenly-distributed-row-shuffling-of-balanced-categories">Evenly +distributed row shuffling of balanced categories</h3> +<p>Sometimes the ordering of rows in a data frame can be meaningful for +an external application.</p> +<p>For example, many experiment-building platforms for psychology +research require researchers to specify the running order of trials in +an experiment via a csv, where each row represents a trial and each +column represents information about the trial.</p> +<p>So an experiment testing the classic <a +href="https://www.psytoolkit.org/lessons/stroop.html">Stroop effect</a> +may have the following template:</p> +<pre class="r"><code>mismatch_trials &lt;- tibble( 
+ item_id = 1:5, + trial = &quot;mismatch&quot;, + word = c(&quot;red&quot;, &quot;green&quot;, &quot;purple&quot;, &quot;brown&quot;, &quot;blue&quot;), + color = c(&quot;brown&quot;, &quot;red&quot;, &quot;green&quot;, &quot;blue&quot;, &quot;purple&quot;) +) +mismatch_trials</code></pre> +<pre><code> # A tibble: 5 × 4 + item_id trial word color + &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 mismatch red brown + 2 2 mismatch green red + 3 3 mismatch purple green + 4 4 mismatch brown blue + 5 5 mismatch blue purple</code></pre> +<p>We probably also want to mix in some <em>control</em> trials where +the word and color do match:</p> +<pre class="r"><code>match_trials &lt;- mismatch_trials |&gt; + mutate(trial = &quot;match&quot;, color = word) +match_trials</code></pre> +<pre><code> # A tibble: 5 × 4 + item_id trial word color + &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 match red red + 2 2 match green green + 3 3 match purple purple + 4 4 match brown brown + 5 5 match blue blue</code></pre> +<p>Now that we have all materials for our experiment, we next want the +running order to interleave the match and mismatch trials.</p> +<p>We first add them together into a longer data frame:</p> +<pre class="r"><code>stroop_trials &lt;- bind_rows(mismatch_trials, match_trials) +stroop_trials</code></pre> +<pre><code> # A tibble: 10 × 4 + item_id trial word color + &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 mismatch red brown + 2 2 mismatch green red + 3 3 mismatch purple green + 4 4 mismatch brown blue + 5 5 mismatch blue purple + 6 1 match red red + 7 2 match green green + 8 3 match purple purple + 9 4 match brown brown + 10 5 match blue blue</code></pre> +<p>And from here we can exploit the fact that all mismatch items come +before match items, and that they share the same length of 5:</p> +<pre class="r"><code>stroop_trials |&gt; + slice( as.integer(outer(c(0, 5), 1:5, &quot;+&quot;)) )</code></pre> +<pre><code> # A tibble: 10 × 4 + item_id 
trial word color + &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 mismatch red brown + 2 1 match red red + 3 2 mismatch green red + 4 2 match green green + 5 3 mismatch purple green + 6 3 match purple purple + 7 4 mismatch brown blue + 8 4 match brown brown + 9 5 mismatch blue purple + 10 5 match blue blue</code></pre> +<p>This relies on a strong assumption about the row order in the +original data, though. So a safer alternative is to represent the row +indices for <code>"match"</code> and <code>"mismatch"</code> trials as +rows of a matrix, and then collapse column-wise.</p> +<p>Let’s try this outside of <code>slice()</code> first. We start with a +call to <code>sapply()</code> to construct a matrix where the columns +contain row indices for each unique category of <code>trial</code>:</p> +<pre class="r"><code>sapply(unique(stroop_trials$trial), \(x) which(stroop_trials$trial == x))</code></pre> +<pre><code> mismatch match + [1,] 1 6 + [2,] 2 7 + [3,] 3 8 + [4,] 4 9 + [5,] 5 10</code></pre> +<p>Then we transpose the matrix with <code>t()</code>, which rotates +it:</p> +<pre class="r"><code>t( sapply(unique(stroop_trials$trial), \(x) which(stroop_trials$trial == x)) )</code></pre> +<pre><code> [,1] [,2] [,3] [,4] [,5] + mismatch 1 2 3 4 5 + match 6 7 8 9 10</code></pre> +<p>Now let’s stick that inside slice, remembering to collapse the +transposed matrix into a vector:</p> +<pre class="r"><code>interleaved_stroop_trials &lt;- stroop_trials |&gt; + slice( as.integer(t(sapply(unique(trial), \(x) which(trial == x)))) ) +interleaved_stroop_trials</code></pre> +<pre><code> # A tibble: 10 × 4 + item_id trial word color + &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 mismatch red brown + 2 1 match red red + 3 2 mismatch green red + 4 2 match green green + 5 3 mismatch purple green + 6 3 match purple purple + 7 4 mismatch brown blue + 8 4 match brown brown + 9 5 mismatch blue purple + 10 5 match blue blue</code></pre> +<p>At the moment, we have both “red”
word trials showing up together, +and then the “green”s, the “purple”s, and so on. If we wanted to +introduce some randomness to the presentation order within each type of +trial, we can wrap the row indices in <code>sample()</code> to shuffle +them first:</p> +<pre class="r"><code>shuffled_stroop_trials &lt;- stroop_trials |&gt; + slice( as.integer(t(sapply(unique(trial), \(x) sample(which(trial == x))))) ) +shuffled_stroop_trials</code></pre> +<pre><code> # A tibble: 10 × 4 + item_id trial word color + &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 mismatch red brown + 2 5 match blue blue + 3 2 mismatch green red + 4 4 match brown brown + 5 3 mismatch purple green + 6 1 match red red + 7 4 mismatch brown blue + 8 3 match purple purple + 9 5 mismatch blue purple + 10 2 match green green</code></pre> +<!-- applies to monotonically increasing continuous sequences too --> +<h3 id="inserting-a-new-row-at-specific-intervals">Inserting a new row +at specific intervals</h3> +<p>Continuing with our Stroop experiment template example, let’s say we +want to give participants a break every two trials.</p> +<p>In a matrix representation, this means constructing this 2-row matrix +of row indices:</p> +<pre class="r"><code>matrix(1:nrow(shuffled_stroop_trials), nrow = 2)</code></pre> +<pre><code> [,1] [,2] [,3] [,4] [,5] + [1,] 1 3 5 7 9 + [2,] 2 4 6 8 10</code></pre> +<p>And adding a row that represents a separator/break, before +collapsing column-wise:</p> +<pre class="r"><code>matrix(1:nrow(shuffled_stroop_trials), nrow = 2) |&gt; + rbind(11)</code></pre> +<pre><code> [,1] [,2] [,3] [,4] [,5] + [1,] 1 3 5 7 9 + [2,] 2 4 6 8 10 + [3,] 11 11 11 11 11</code></pre> +<p>Using slice, this means adding a row to the data representing a break +trial first, and then adding a row to the row index matrix representing +that row:</p> +<pre class="r"><code>stroop_with_breaks &lt;- shuffled_stroop_trials |&gt; + add_row(trial = &quot;BREAK&quot;) |&gt; + slice( +
matrix(row_number()[-n()], nrow = 2) |&gt; + rbind(n()) |&gt; + as.integer() + ) +stroop_with_breaks</code></pre> +<pre><code> # A tibble: 15 × 4 + item_id trial word color + &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 mismatch red brown + 2 5 match blue blue + 3 NA BREAK &lt;NA&gt; &lt;NA&gt; + 4 2 mismatch green red + 5 4 match brown brown + 6 NA BREAK &lt;NA&gt; &lt;NA&gt; + 7 3 mismatch purple green + 8 1 match red red + 9 NA BREAK &lt;NA&gt; &lt;NA&gt; + 10 4 mismatch brown blue + 11 3 match purple purple + 12 NA BREAK &lt;NA&gt; &lt;NA&gt; + 13 5 mismatch blue purple + 14 2 match green green + 15 NA BREAK &lt;NA&gt; &lt;NA&gt;</code></pre> +<p>If we don’t want a break after the last trial, we can use negative +indexing with <code>slice(-n())</code>:</p> +<pre class="r"><code>stroop_with_breaks |&gt; + slice(-n())</code></pre> +<pre><code> # A tibble: 14 × 4 + item_id trial word color + &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 mismatch red brown + 2 5 match blue blue + 3 NA BREAK &lt;NA&gt; &lt;NA&gt; + 4 2 mismatch green red + 5 4 match brown brown + 6 NA BREAK &lt;NA&gt; &lt;NA&gt; + 7 3 mismatch purple green + 8 1 match red red + 9 NA BREAK &lt;NA&gt; &lt;NA&gt; + 10 4 mismatch brown blue + 11 3 match purple purple + 12 NA BREAK &lt;NA&gt; &lt;NA&gt; + 13 5 mismatch blue purple + 14 2 match green green</code></pre> +<p>What about a break after every 3 trials, where the number of trials (10) is not +divisible by 3?
Can we still use a matrix?</p> +<p>Yes, you’d just need to explicitly fill in the “blanks”!</p> +<p>Conceptually, we want a matrix like this, where extra “cells” are +padded with 0s (recall that 0s are ignored in <code>slice()</code>):</p> +<pre class="r"><code>matrix(c(1:10, rep(0, 3 - 10 %% 3)), nrow = 3)</code></pre> +<pre><code> [,1] [,2] [,3] [,4] + [1,] 1 4 7 10 + [2,] 2 5 8 0 + [3,] 3 6 9 0</code></pre> +<p>And this is how that could be implemented inside +<code>slice()</code>, minding the fact that adding the break trial +increases the original row count by 1:</p> +<pre class="r"><code>shuffled_stroop_trials |&gt; + add_row(trial = &quot;BREAK&quot;) |&gt; + slice( + c(seq_len(n()-1), rep(0, 3 - (n()-1) %% 3)) |&gt; + matrix(nrow = 3) |&gt; + rbind(n()) |&gt; + as.integer() + ) |&gt; + slice(-n())</code></pre> +<pre><code> # A tibble: 13 × 4 + item_id trial word color + &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 mismatch red brown + 2 5 match blue blue + 3 2 mismatch green red + 4 NA BREAK &lt;NA&gt; &lt;NA&gt; + 5 4 match brown brown + 6 3 mismatch purple green + 7 1 match red red + 8 NA BREAK &lt;NA&gt; &lt;NA&gt; + 9 4 mismatch brown blue + 10 3 match purple purple + 11 5 mismatch blue purple + 12 NA BREAK &lt;NA&gt; &lt;NA&gt; + 13 2 match green green</code></pre> +<p>How about inserting a break trial after every <code>"purple"</code> +word trial?</p> +<p>Conceptually, we want a matrix that binds these two vectors as rows +before collapsing:</p> +<pre class="r"><code>print( 1:nrow(shuffled_stroop_trials) ) +print( + replace(rep(0, nrow(shuffled_stroop_trials)), + which(shuffled_stroop_trials$word == &quot;purple&quot;), 11) +)</code></pre> +<pre><code> [1] 1 2 3 4 5 6 7 8 9 10 + [1] 0 0 0 0 11 0 0 11 0 0</code></pre> +<p>And this is how you could do that inside <code>slice()</code>:</p> +<pre class="r"><code>shuffled_stroop_trials |&gt; + add_row(trial = &quot;BREAK&quot;) |&gt; + slice( + c(seq_len(n()-1), replace(rep(0, n()-1), which(word ==
&quot;purple&quot;), n())) |&gt; + matrix(nrow = 2, byrow = TRUE) |&gt; + as.integer() + )</code></pre> +<pre><code> # A tibble: 12 × 4 + item_id trial word color + &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 mismatch red brown + 2 5 match blue blue + 3 2 mismatch green red + 4 4 match brown brown + 5 3 mismatch purple green + 6 NA BREAK &lt;NA&gt; &lt;NA&gt; + 7 1 match red red + 8 4 mismatch brown blue + 9 3 match purple purple + 10 NA BREAK &lt;NA&gt; &lt;NA&gt; + 11 5 mismatch blue purple + 12 2 match green green</code></pre> +<p>You might protest that this is a pretty convoluted approach to a +seemingly simple problem of inserting rows, and you’d be right!<a +href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> Not only +is the code difficult to read, you can only insert the same single row +over and over.</p> +<p>It turns out that these cases of row insertion actually fall under +the broader class of interweaving <strong>unequal categories</strong> - +let’s see this next.</p> +<h3 id="evenly-distributed-row-shuffling-of-unequal-categories">Evenly +distributed row shuffling of unequal categories</h3> +<p>Let’s return to our solution for the initial “break every 2 trials” +problem:</p> +<pre class="r"><code>shuffled_stroop_trials |&gt; + add_row(trial = &quot;BREAK&quot;) |&gt; + slice( + matrix(row_number()[-n()], nrow = 2) |&gt; + rbind(n()) |&gt; + as.integer() + ) |&gt; + slice(-n())</code></pre> +<pre><code> # A tibble: 14 × 4 + item_id trial word color + &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 mismatch red brown + 2 5 match blue blue + 3 NA BREAK &lt;NA&gt; &lt;NA&gt; + 4 2 mismatch green red + 5 4 match brown brown + 6 NA BREAK &lt;NA&gt; &lt;NA&gt; + 7 3 mismatch purple green + 8 1 match red red + 9 NA BREAK &lt;NA&gt; &lt;NA&gt; + 10 4 mismatch brown blue + 11 3 match purple purple + 12 NA BREAK &lt;NA&gt; &lt;NA&gt; + 13 5 mismatch blue purple + 14 2 match green green</code></pre> +<p>Here, we were working with a 
matrix that looks like this, where +<code>11</code> is the index of the new row we added to represent a break +trial:</p> +<pre><code> [,1] [,2] [,3] [,4] [,5] + [1,] 1 3 5 7 9 + [2,] 2 4 6 8 10 + [3,] 11 11 11 11 11</code></pre> +<p>And recall that to insert a break every <em>3</em> rows, we needed to pad +with <code>0</code> first to satisfy the matrix’s rectangle +constraint:</p> +<pre><code> [,1] [,2] [,3] [,4] + [1,] 1 4 7 10 + [2,] 2 5 8 0 + [3,] 3 6 9 0 + [4,] 11 11 11 11</code></pre> +<p>But a better way of thinking about this is to have one matrix row +representing all row indices, and then add a <strong>sparse row</strong> +that represents breaks:</p> +<ul> +<li><p>Break after every 2 trials:</p> +<pre class="r"><code>matrix(c(1:10, rep_len(c(0, 11), 10)), nrow = 2, byrow = TRUE)</code></pre> +<pre><code> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] + [1,] 1 2 3 4 5 6 7 8 9 10 + [2,] 0 11 0 11 0 11 0 11 0 11</code></pre></li> +<li><p>Break after every 3 trials:</p> +<pre class="r"><code>matrix(c(1:10, rep_len(c(0, 0, 11), 10)), nrow = 2, byrow = TRUE)</code></pre> +<pre><code> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] + [1,] 1 2 3 4 5 6 7 8 9 10 + [2,] 0 0 11 0 0 11 0 0 11 0</code></pre></li> +<li><p>Break after every 4 trials:</p> +<pre class="r"><code>matrix(c(1:10, rep_len(c(0, 0, 0, 11), 10)), nrow = 2, byrow = TRUE)</code></pre> +<pre><code> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] + [1,] 1 2 3 4 5 6 7 8 9 10 + [2,] 0 0 0 11 0 0 0 11 0 0</code></pre></li> +</ul> +<p>And it turns out that this method generalizes to balanced shuffling +across categories that are not equal in size!</p> +<p>Let’s start with a really basic example - here we have three kinds of +fruits with varying counts:</p> +<pre class="r"><code>fruits &lt;- c(&quot;🍎&quot;, &quot;🍋&quot;, &quot;🍇&quot;)[c(2,1,3,3,2,3,1,2,2,1,2,2,3,3,3)] +fruits &lt;- factor(fruits, levels = c(&quot;🍇&quot;, &quot;🍋&quot;, &quot;🍎&quot;)) +table(fruits)</code></pre> +<pre><code> fruits + 🍇 🍋
🍎 + 6 6 3</code></pre> +<p>Their current order looks like this:</p> +<pre class="r"><code>cat(levels(fruits)[fruits])</code></pre> +<pre><code> 🍋 🍎 🍇 🍇 🍋 🍇 🍎 🍋 🍋 🍎 🍋 🍋 🍇 🍇 🍇</code></pre> +<p>But I want them to be ordered such that individuals of the same fruit +kind are maximally apart from one another. This effectively re-orders +the fruits to be distributed “evenly”:</p> +<pre class="r"><code>cat(levels(fruits)[fruits[c(3,1,2,4,5,0,6,8,10,13,9,0,14,11,7,15,12,0)]])</code></pre> +<pre><code> 🍇 🍋 🍎 🍇 🍋 🍇 🍋 🍎 🍇 🍋 🍇 🍋 🍎 🍇 🍋</code></pre> +<p>With our “build row-wise, collapse col-wise” approach, this takes the +following steps:</p> +<ol style="list-style-type: decimal"> +<li><p>Find the most frequent category - that N-max becomes the number +of columns in the matrix of row indices.</p> +<p>In this case it’s grapes and lemons, of which there are 6 each:</p> +<pre class="r"><code>grape_rows &lt;- which(fruits == &quot;🍇&quot;) +setNames(grape_rows, rep(&quot;🍇&quot;, 6))</code></pre> +<pre><code> 🍇 🍇 🍇 🍇 🍇 🍇 + 3 4 6 13 14 15</code></pre> +<pre class="r"><code>lemon_rows &lt;- which(fruits == &quot;🍋&quot;) +setNames(lemon_rows, rep(&quot;🍋&quot;, 6))</code></pre> +<pre><code> 🍋 🍋 🍋 🍋 🍋 🍋 + 1 5 8 9 11 12</code></pre></li> +<li><p>Normalize (“stretch”) all vectors to have the same length as +N.</p> +<p>In this case we need to stretch the apples vector, which is currently +only length-3:</p> +<pre class="r"><code>apple_rows &lt;- which(fruits == &quot;🍎&quot;) +apple_rows</code></pre> +<pre><code> [1] 2 7 10</code></pre> +<p>The desired “sparse” representation is something like this, where +each instance of apple is equidistant, with 0s in between:</p> +<pre class="r"><code>apple_rows_sparse &lt;- c(2, 0, 7, 0, 10, 0) +setNames(apple_rows_sparse, c(&quot;🍎&quot;, &quot;&quot;, &quot;🍎&quot;, &quot;&quot;, &quot;🍎&quot;, &quot;&quot;))</code></pre> +<pre><code> 🍎 🍎 🍎 + 2 0 7 0 10 0</code></pre> +<p>There are many ways to get at this, but one trick involves creating +an 
evenly spaced float sequence from 1 to N-apple over N-max steps:</p> +<pre class="r"><code>seq(1, 3, length.out = 6)</code></pre> +<pre><code> [1] 1.0 1.4 1.8 2.2 2.6 3.0</code></pre> +<p>From there, we round the numbers:</p> +<pre class="r"><code>round(seq(1, 3, length.out = 6))</code></pre> +<pre><code> [1] 1 1 2 2 3 3</code></pre> +<p>Then mark the first occurrence of each number using +<code>!duplicated()</code>:</p> +<pre class="r"><code>!duplicated(round(seq(1, 3, length.out = 6)))</code></pre> +<pre><code> [1] TRUE FALSE TRUE FALSE TRUE FALSE</code></pre> +<p>And lastly, we initialize a vector of 0s and <code>replace()</code> +the <code>TRUE</code>s with apple indices:</p> +<pre class="r"><code>replace( + rep(0, 6), + !duplicated(round(seq(1, 3, length.out = 6))), + which(fruits == &quot;🍎&quot;) +)</code></pre> +<pre><code> [1] 2 0 7 0 10 0</code></pre></li> +<li><p>Stack up the category vectors by row and collapse +column-wise:</p> +<p>Manually, we would build the full matrix row-by-row like this:</p> +<pre class="r"><code>fruits_matrix &lt;- matrix( + c(grape_rows, lemon_rows, apple_rows_sparse), + nrow = 3, byrow = TRUE +) +rownames(fruits_matrix) &lt;- c(&quot;🍇&quot;, &quot;🍋&quot;, &quot;🍎&quot;) +fruits_matrix</code></pre> +<pre><code> [,1] [,2] [,3] [,4] [,5] [,6] + 🍇 3 4 6 13 14 15 + 🍋 1 5 8 9 11 12 + 🍎 2 0 7 0 10 0</code></pre> +<p>And dynamically we can use <code>sapply()</code> to fill the matrix +column-by-column, and then transpose the output with <code>t()</code>:</p> +<pre class="r"><code>fruits_distributed &lt;- sapply(levels(fruits), \(x) { + n_max &lt;- max(table(fruits)) + ind &lt;- which(fruits == x) + nums &lt;- seq(1, length(ind), length.out = n_max) + replace(rep(0, n_max), !duplicated(round(nums)), ind) +}) |&gt; + t() +fruits_distributed</code></pre> +<pre><code> [,1] [,2] [,3] [,4] [,5] [,6] + 🍇 3 4 6 13 14 15 + 🍋 1 5 8 9 11 12 + 🍎 2 0 7 0 10 0</code></pre> +<p>Finally, we collapse the matrix and we see that it indeed distributed +the fruits
evenly!</p> +<pre class="r"><code>fruits[as.integer(fruits_distributed)]</code></pre> +<pre><code> [1] 🍇 🍋 🍎 🍇 🍋 🍇 🍋 🍎 🍇 🍋 🍇 🍋 🍎 🍇 🍋 + Levels: 🍇 🍋 🍎</code></pre></li> +</ol> +<p>We can go even further and wrap the dynamic, +<code>sapply()</code>-based solution into a function for use within +<code>slice()</code>. Here, I also added an optional argument for +shuffling within categories:</p> +<pre class="r"><code>rshuffle &lt;- function(x, shuffle_within = FALSE) { + categories &lt;- as.factor(x) + n_max &lt;- max(table(categories)) + sapply(levels(categories), \(lvl) { + ind &lt;- which(categories == lvl) + if (shuffle_within) ind &lt;- sample(ind) + nums &lt;- seq(1, length(ind), length.out = n_max) + replace(rep(0, n_max), !duplicated(round(nums)), ind) + }) |&gt; + t() |&gt; + as.integer() +}</code></pre> +<p>Returning to our Stroop experiment template example, imagine we +also had two filler trials, where no word is shown and just the color +flashes on the screen:</p> +<pre class="r"><code>stroop_fillers &lt;- tibble( + item_id = 1:2, + trial = &quot;filler&quot;, + word = NA, + color = c(&quot;red&quot;, &quot;blue&quot;) +) +stroop_with_fillers &lt;- bind_rows(stroop_fillers, stroop_trials) |&gt; + mutate(trial = factor(trial, c(&quot;match&quot;, &quot;mismatch&quot;, &quot;filler&quot;))) +stroop_with_fillers</code></pre> +<pre><code> # A tibble: 12 × 4 + item_id trial word color + &lt;int&gt; &lt;fct&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 filler &lt;NA&gt; red + 2 2 filler &lt;NA&gt; blue + 3 1 mismatch red brown + 4 2 mismatch green red + 5 3 mismatch purple green + 6 4 mismatch brown blue + 7 5 mismatch blue purple + 8 1 match red red + 9 2 match green green + 10 3 match purple purple + 11 4 match brown brown + 12 5 match blue blue</code></pre> +<p>We can evenly shuffle between the unequal trial types with our new +<code>rshuffle()</code> function:</p> +<pre class="r"><code>stroop_with_fillers |&gt; + slice( rshuffle(trial, shuffle_within = TRUE)
)</code></pre> +<pre><code> # A tibble: 12 × 4 + item_id trial word color + &lt;int&gt; &lt;fct&gt; &lt;chr&gt; &lt;chr&gt; + 1 2 match green green + 2 2 mismatch green red + 3 1 filler &lt;NA&gt; red + 4 1 match red red + 5 4 mismatch brown blue + 6 3 match purple purple + 7 3 mismatch purple green + 8 2 filler &lt;NA&gt; blue + 9 4 match brown brown + 10 5 mismatch blue purple + 11 5 match blue blue + 12 1 mismatch red brown</code></pre> +<h2 id="conclusion">Conclusion</h2> +<p>When I started drafting this blog post, I thought I’d come up with a +principled taxonomy of row-relational operations. Ha. This was a lot +trickier to think through than I thought.</p> +<p>But I hope that this gallery of esoteric use-cases for +<code>slice()</code> inspires you to use it more, and to think about +“tidy” solutions to seemingly “untidy” problems.</p> +<pre class="r distill-force-highlighting-css"><code></code></pre> +<div class="footnotes footnotes-end-of-document"> +<hr /> +<ol> +<li id="fn1"><p>The <code>.by_group = TRUE</code> is not strictly +necessary here, but it’s good for visually inspecting the within-group +ordering.<a href="#fnref1" class="footnote-back">↩︎</a></p></li> +<li id="fn2"><p>That said, row insertion is a generally tricky problem for +column-major data frame structures, which is partly why dplyr’s <a +href="https://dplyr.tidyverse.org/reference/rows.html">row manipulation +verbs</a> have stayed experimental for quite some time.<a href="#fnref2" +class="footnote-back">↩︎</a></p></li> +</ol> +</div> + 5f086a5db2d8a7a771b81537a421b2fb + data wrangling + dplyr + https://yjunechoe.github.io/posts/2023-06-11-row-relational-operations + Sun, 22 Sep 2024 00:00:00 +0000 + + Naming patterns for boolean enums June Choe @@ -96,17 +1936,6 @@ Mon, 10 Jul 2023 00:00:00 +0000 - - Row relational operations with slice() - June Choe - https://yjunechoe.github.io/posts/2023-06-11-row-relational-operations - A love letter to dplyr::slice() and a gallery of usecases - data
wrangling - dplyr - https://yjunechoe.github.io/posts/2023-06-11-row-relational-operations - Sun, 11 Jun 2023 00:00:00 +0000 - - First impressions of DataFrames.jl and accessories June Choe diff --git a/docs/posts/2023-06-11-row-relational-operations/index.html b/docs/posts/2023-06-11-row-relational-operations/index.html index 2d55264a..8b3b89f4 100644 --- a/docs/posts/2023-06-11-row-relational-operations/index.html +++ b/docs/posts/2023-06-11-row-relational-operations/index.html @@ -21,7 +21,7 @@