From 0d6bf65a80aefb3313f572bc1f508d2a2c8753c3 Mon Sep 17 00:00:00 2001
From: June Choe
Date: Mon, 4 Dec 2023 10:11:49 -0500
Subject: [PATCH] typo

---
 .../untidy-select.Rmd                         |    4 +-
 .../untidy-select.html                        |   20 +-
 docs/blog.html                                | 7006 ++++++++---------
 docs/blog.xml                                 |  540 +-
 .../posts/2023-12-03-untidy-select/index.html |   20 +-
 docs/posts/posts.json                         |    8 +-
 docs/sitemap.xml                              |    2 +-
 7 files changed, 4067 insertions(+), 3533 deletions(-)

diff --git a/_posts/2023-12-03-untidy-select/untidy-select.Rmd b/_posts/2023-12-03-untidy-select/untidy-select.Rmd
index 7f4ef97..38e8e64 100644
--- a/_posts/2023-12-03-untidy-select/untidy-select.Rmd
+++ b/_posts/2023-12-03-untidy-select/untidy-select.Rmd
@@ -428,10 +428,10 @@ cnd_zero_selection$i
 General evaluation errors are distinguished by having a `$parent`:
 
 ```{r}
-cnd_zero_selection <- rlang::catch_cnd(
+cnd_evaluation_error <- rlang::catch_cnd(
   eval_select(evaluation_error, df3)
 )
-cnd_zero_selection$parent
+cnd_evaluation_error$parent
 ```
 
 Again, this is more useful as a developer, if you're building something that integrates `{tidyselect}`.^[If you want some examples of post-processing tidyselect errors, there's some stuff I did for [pointblank](https://github.com/rstudio/pointblank/blob/7c4bdd0eb753db17b5213d03fd74f044df12be48/R/utils.R#L241-L318) that may be helpful as a reference.] But I personally find this interesting to know about anyways!
diff --git a/_posts/2023-12-03-untidy-select/untidy-select.html b/_posts/2023-12-03-untidy-select/untidy-select.html
index dbe6fd6..cdb0c9e 100644
--- a/_posts/2023-12-03-untidy-select/untidy-select.html
+++ b/_posts/2023-12-03-untidy-select/untidy-select.html
@@ -94,8 +94,8 @@
-
-
+
+
@@ -119,7 +119,7 @@
@@ -1534,7 +1534,7 @@
@@ -1559,7 +1559,7 @@ The many ways to (un)tidy-select

 June Choe (University of Pennsylvania Linguistics) https://live-sas-www-ling.pantheon.sas.upenn.edu/
-2023-12-03
+2023-12-04
@@ -1674,7 +1674,7 @@ tidy-select!
      out <- set_names(out, names(loc))
      out
  }
- <bytecode: 0x000002917a3967b8>
+ <bytecode: 0x0000012f8e6de148>
  <environment: namespace:dplyr>

tidy?-select

@@ -1831,7 +1831,7 @@ Tidying untidy-select

  $math_expr
   <quosure>
   expr: ^x + 1
-  env:  0x000002917c379bd0
+  env:  0x0000012f8e27cec8
   
   $columns
   [1] "x" "y" "z"
@@ -1841,7 +1841,7 @@ Tidying untidy-select
  1 1 2 3
 
  $mask
- <environment: 0x000002917cbc2600>
+ <environment: 0x0000012f8e3332f0>
 
  $out
  [1] 2
@@ -2111,10 +2111,10 @@ 2) Error handling

General evaluation errors are distinguished by having a $parent:

-cnd_zero_selection <- rlang::catch_cnd(
+cnd_evaluation_error <- rlang::catch_cnd(
   eval_select(evaluation_error, df3)
 )
-cnd_zero_selection$parent
+cnd_evaluation_error$parent
  <simpleError in eval_tidy(as_quosure(expr, env), context_mask): I'm a bad expression!>
diff --git a/docs/blog.html b/docs/blog.html index e194d3b..301c975 100644 --- a/docs/blog.html +++ b/docs/blog.html @@ -1,3503 +1,3503 @@ - - - - - - - - - - - - - - - - - - - - - - June Choe: Blog Posts - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-

Blog Posts

- - - -
- -
-
-

The many ways to (un)tidy-select

-
-
data wrangling
-
dplyr
-
tidyselect
-
-

Deconstructing {tidyselect} and building it back up

-
-
- - - -
- -
-
-

Fumbling my way through an XY problem

-
-
reflections
-
-

Some lessons learned from a (personal) case study

-
-
- - - -
- -
-
-

Row relational operations with slice()

-
-
data wrangling
-
dplyr
-
-

A love letter to dplyr::slice() and a gallery of usecases

-
-
- - - -
- -
-
-

First impressions of DataFrames.jl and accessories

-
-
julia
-
data wrangling
-
DataFrames.jl
-
dplyr
-
data.table
-
-

Perspectives from a {dplyr} and {data.table} useR

-
-
- - - -
- -
-
-

Reflections on useR! 2022

-
-
conference
-
ggtrace
-
-

Notes from attending and speaking at my first R conference

-
-
- - - -
- -
-
-

Demystifying delayed aesthetic evaluation: Part 2

-
-
data visualization
-
ggplot2
-
tutorial
-
-

Exposing the `Stat` ggproto in functional programming terms

-
-
- - - -
- -
-
-

Demystifying delayed aesthetic evaluation: Part 1

-
-
data visualization
-
ggplot2
-
ggplot internals
-
tutorial
-
-

Exploring the logic of `after_stat()` to peek inside ggplot internals

-
-
- - - -
- -
-
-

Setting up and debugging custom fonts

-
-
data visualization
-
ggplot2
-
typography
-
tutorial
-
-

A practical introduction to all (new) things font in R

-
-
- - - -
- -
-
-

Random Sampling: A table animation

-
-
data visualization
-
data wrangling
-
-

Plus a convenient way of rendering LaTeX expressions as images

-
-
- - - -
- -
-
-

Collapse repetitive piping with reduce()

-
-
data wrangling
-
tutorial
-
-

Featuring accumulate()

-
-
- - - -
- -
-
-

Plot Makeover #2

-
-
plot makeover
-
data visualization
-
ggplot2
-
-

Making a dodged-stacked hybrid bar plot in {ggplot2}

-
-
- - - -
- -
-
-

TidyTuesday 2020 week 45

-
-
ggplot2
-
data visualization
-
tidytuesday
-
-

Waffle chart of IKEA furnitures in stock

-
-
- - - -
- -
-
-

TidyTuesday 2020 week 44

-
-
ggplot2
-
gganimate
-
spatial
-
data visualization
-
tidytuesday
-
-

Patched animation of the location and cumulative capacity of wind turbines in Canada

-
-
- - - -
- -
-
-

Analysis of @everycolorbot's tweets

-
-
data visualization
-
ggplot2
-
rtweet
-
colors
-
-

And why you should avoid neon colors

-
-
- - - -
- -
-
-

Designing guiding aesthetics

-
-
data visualization
-
ggplot2
-
tidytuesday
-
-

The fine line between creativity and noise

-
-
- - - -
- -
-
-

Demystifying stat_ layers in {ggplot2}

-
-
data visualization
-
ggplot2
-
tutorial
-
-

The motivation behind stat, the distinction between stat and geom, and a case study of stat_summary()

-
-
- - - -
- -
-
-

TidyTuesday 2020 week 39

-
-
ggplot2
-
data visualization
-
tidytuesday
-
-

Stacked area plot of the heights of Himalayan peaks attempted over the last century

-
-
- - - -
- -
-
-

Plot Makeover #1

-
-
plot makeover
-
data visualization
-
ggplot2
-
-

Flattening a faceted grid for strictly horizontal comparisons

-
-
- - - -
- -
-
-

TidyTuesday 2020 week 38

-
-
tables
-
data visualization
-
tidytuesday
-
-

Visualizing two decades of primary and secondary education spending with {gt}

-
-
- - - -
- -
-
-

Embedding videos in {reactable} tables

-
-
tables
-
data visualization
-
-

Pushing the limits of expandable row details

-
-
- - - -
- -
-
-

Fonts for graphs

-
-
data visualization
-
typography
-
-

A small collection of my favorite fonts for data visualization

-
-
- - - -
- -
-
-

TidyTuesday 2020 Week 33

-
-
tidytuesday
-
gganimate
-
ggplot2
-
-

An animation of the main characters in Avatar

-
-
- - - -
- -
-
-

Saving a line of piping

-
-
data wrangling
-
dplyr
-
tutorial
-
-

Some notes on lesser known functions/functionalities that combine common chain of {dplyr} verbs.

-
-
- - - -
- -
-
-

TidyTuesday 2020 Week 32

-
-
tidytuesday
-
data visualization
-
ggplot2
-
-

A dumbbell chart visualization of energy production trends among European countries

-
-
- - - -
- -
-
-

Six years of my Spotify playlists

-
-
ggplot2
-
gganimate
-
spotifyr
-
data wrangling
-
data visualization
-
-

An analysis of acoustic features with {spotifyr}

-
-
- - - -
- -
-
-

Shiny tips - the first set

-
-
shiny
-
-

%||%, imap() + {shinybusy}, and user inputs in modalDialog()

-
-
- - - -
- -
-
-

geom_paired_raincloud()

-
-
data visualization
-
ggplot2
-
-

A {ggplot2} geom for visualizing change in distribution between two conditions.

-
-
- - - -
- -
-
-

Plotting treemaps with {treemap} and {ggplot2}

-
-
data visualization
-
treemap
-
ggplot2
-
tutorial
-
-

Using underlying plot data for maximum customization

-
-
- - - -
- -
-
-

Indexing tip for {spacyr}

-
-
data wrangling
-
NLP
-
spacyr
-
-

Speeding up the analysis of dependency relations.

-
-
- - - -
- -
-
-

The Correlation Parameter in Mixed Effects Models

-
-
statistics
-
mixed-effects models
-
tutorial
-
-

Notes on the Corr term in {lme4} output

-
-
-
-
- -
- -
- - -
-

Blog Posts

- - - - -
- - -
- -
- - -
- -
-
- - - - - -
- - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + June Choe: Blog Posts + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+

Blog Posts

+ + + +
+ +
+
+

The many ways to (un)tidy-select

+
+
data wrangling
+
dplyr
+
tidyselect
+
+

Deconstructing {tidyselect} and building it back up

+
+
+ + + +
+ +
+
+

Fumbling my way through an XY problem

+
+
reflections
+
+

Some lessons learned from a (personal) case study

+
+
+ + + +
+ +
+
+

Row relational operations with slice()

+
+
data wrangling
+
dplyr
+
+

A love letter to dplyr::slice() and a gallery of usecases

+
+
+ + + +
+ +
+
+

First impressions of DataFrames.jl and accessories

+
+
julia
+
data wrangling
+
DataFrames.jl
+
dplyr
+
data.table
+
+

Perspectives from a {dplyr} and {data.table} useR

+
+
+ + + +
+ +
+
+

Reflections on useR! 2022

+
+
conference
+
ggtrace
+
+

Notes from attending and speaking at my first R conference

+
+
+ + + +
+ +
+
+

Demystifying delayed aesthetic evaluation: Part 2

+
+
data visualization
+
ggplot2
+
tutorial
+
+

Exposing the `Stat` ggproto in functional programming terms

+
+
+ + + +
+ +
+
+

Demystifying delayed aesthetic evaluation: Part 1

+
+
data visualization
+
ggplot2
+
ggplot internals
+
tutorial
+
+

Exploring the logic of `after_stat()` to peek inside ggplot internals

+
+
+ + + +
+ +
+
+

Setting up and debugging custom fonts

+
+
data visualization
+
ggplot2
+
typography
+
tutorial
+
+

A practical introduction to all (new) things font in R

+
+
+ + + +
+ +
+
+

Random Sampling: A table animation

+
+
data visualization
+
data wrangling
+
+

Plus a convenient way of rendering LaTeX expressions as images

+
+
+ + + +
+ +
+
+

Collapse repetitive piping with reduce()

+
+
data wrangling
+
tutorial
+
+

Featuring accumulate()

+
+
+ + + +
+ +
+
+

Plot Makeover #2

+
+
plot makeover
+
data visualization
+
ggplot2
+
+

Making a dodged-stacked hybrid bar plot in {ggplot2}

+
+
+ + + +
+ +
+
+

TidyTuesday 2020 week 45

+
+
ggplot2
+
data visualization
+
tidytuesday
+
+

Waffle chart of IKEA furnitures in stock

+
+
+ + + +
+ +
+
+

TidyTuesday 2020 week 44

+
+
ggplot2
+
gganimate
+
spatial
+
data visualization
+
tidytuesday
+
+

Patched animation of the location and cumulative capacity of wind turbines in Canada

+
+
+ + + +
+ +
+
+

Analysis of @everycolorbot's tweets

+
+
data visualization
+
ggplot2
+
rtweet
+
colors
+
+

And why you should avoid neon colors

+
+
+ + + +
+ +
+
+

Designing guiding aesthetics

+
+
data visualization
+
ggplot2
+
tidytuesday
+
+

The fine line between creativity and noise

+
+
+ + + +
+ +
+
+

Demystifying stat_ layers in {ggplot2}

+
+
data visualization
+
ggplot2
+
tutorial
+
+

The motivation behind stat, the distinction between stat and geom, and a case study of stat_summary()

+
+
+ + + +
+ +
+
+

TidyTuesday 2020 week 39

+
+
ggplot2
+
data visualization
+
tidytuesday
+
+

Stacked area plot of the heights of Himalayan peaks attempted over the last century

+
+
+ + + +
+ +
+
+

Plot Makeover #1

+
+
plot makeover
+
data visualization
+
ggplot2
+
+

Flattening a faceted grid for strictly horizontal comparisons

+
+
+ + + +
+ +
+
+

TidyTuesday 2020 week 38

+
+
tables
+
data visualization
+
tidytuesday
+
+

Visualizing two decades of primary and secondary education spending with {gt}

+
+
+ + + +
+ +
+
+

Embedding videos in {reactable} tables

+
+
tables
+
data visualization
+
+

Pushing the limits of expandable row details

+
+
+ + + +
+ +
+
+

Fonts for graphs

+
+
data visualization
+
typography
+
+

A small collection of my favorite fonts for data visualization

+
+
+ + + +
+ +
+
+

TidyTuesday 2020 Week 33

+
+
tidytuesday
+
gganimate
+
ggplot2
+
+

An animation of the main characters in Avatar

+
+
+ + + +
+ +
+
+

Saving a line of piping

+
+
data wrangling
+
dplyr
+
tutorial
+
+

Some notes on lesser known functions/functionalities that combine common chain of {dplyr} verbs.

+
+
+ + + +
+ +
+
+

TidyTuesday 2020 Week 32

+
+
tidytuesday
+
data visualization
+
ggplot2
+
+

A dumbbell chart visualization of energy production trends among European countries

+
+
+ + + +
+ +
+
+

Six years of my Spotify playlists

+
+
ggplot2
+
gganimate
+
spotifyr
+
data wrangling
+
data visualization
+
+

An analysis of acoustic features with {spotifyr}

+
+
+ + + +
+ +
+
+

Shiny tips - the first set

+
+
shiny
+
+

%||%, imap() + {shinybusy}, and user inputs in modalDialog()

+
+
+ + + +
+ +
+
+

geom_paired_raincloud()

+
+
data visualization
+
ggplot2
+
+

A {ggplot2} geom for visualizing change in distribution between two conditions.

+
+
+ + + +
+ +
+
+

Plotting treemaps with {treemap} and {ggplot2}

+
+
data visualization
+
treemap
+
ggplot2
+
tutorial
+
+

Using underlying plot data for maximum customization

+
+
+ + + +
+ +
+
+

Indexing tip for {spacyr}

+
+
data wrangling
+
NLP
+
spacyr
+
+

Speeding up the analysis of dependency relations.

+
+
+ + + +
+ +
+
+

The Correlation Parameter in Mixed Effects Models

+
+
statistics
+
mixed-effects models
+
tutorial
+
+

Notes on the Corr term in {lme4} output

+
+
+
+
+ +
+ +
+ + +
+

Blog Posts

+ + + + +
+ + +
+ +
+ + +
+ +
+
+ + + + + +
+ + + + + + + + + diff --git a/docs/blog.xml b/docs/blog.xml index 6b59944..853cce5 100644 --- a/docs/blog.xml +++ b/docs/blog.xml @@ -12,17 +12,551 @@ https://yjunechoe.github.io Distill - Sun, 03 Dec 2023 00:00:00 +0000 + Mon, 04 Dec 2023 00:00:00 +0000 The many ways to (un)tidy-select June Choe https://yjunechoe.github.io/posts/2023-12-03-untidy-select - Deconstructing {tidyselect} and building it back up + + + +<h2 id="intro">Intro</h2> +<p>Recently, I’ve been having <a +href="https://github.com/rstudio/pointblank/pull/493">frequent</a> <a +href="https://github.com/rstudio/pointblank/pull/499">run-ins</a> with +<code>{tidyselect}</code> internals, discovering some weird and +interesting behaviors along the way. This blog post is my attempt at +documenting a couple of these. And as is the case with my usual style of +writing, I’m gonna talk about some of the weirder stuff first and then +touch on some of the “practical” side to this.</p> +<h2 id="some-observations">Some observations</h2> +<p>Let’s start with some facts about how <code>{tidyselect}</code> is +supposed to work. I’ll use this toy data for the demo:</p> +<pre class="r"><code>library(dplyr, warn.conflicts = FALSE) +library(tidyselect) +df &lt;- tibble(x = 1:2, y = letters[1:2], z = LETTERS[1:2]) +df</code></pre> +<pre><code> # A tibble: 2 × 3 + x y z + &lt;int&gt; &lt;chr&gt; &lt;chr&gt; + 1 1 a A + 2 2 b B</code></pre> +<h3 id="tidy-select">tidy-select!</h3> +<p><code>{tidyselect}</code> is the package that powers +<code>dplyr::select()</code>. If you’ve used <code>{dplyr}</code>, you +already know the behavior of <code>select()</code> pretty well. 
We can +specify a column as string, symbol, or by its position:</p> +<pre class="r"><code>df %&gt;% + select(&quot;x&quot;)</code></pre> +<pre><code> # A tibble: 2 × 1 + x + &lt;int&gt; + 1 1 + 2 2</code></pre> +<pre class="r"><code>df %&gt;% + select(x)</code></pre> +<pre><code> # A tibble: 2 × 1 + x + &lt;int&gt; + 1 1 + 2 2</code></pre> +<pre class="r"><code>df %&gt;% + select(1)</code></pre> +<pre><code> # A tibble: 2 × 1 + x + &lt;int&gt; + 1 1 + 2 2</code></pre> +<p>It’s not obvious from the outside, but the way this works is that +these user-supplied expressions (like <code>"x"</code>, <code>x</code>, +and <code>1</code>) all get <strong>resolved to integer</strong> before +the selection actually happens.</p> +<p>So to be more specific, the three calls to <code>select()</code> were +the same because these three calls to +<code>tidyselect::eval_select()</code> are the same:<a href="#fn1" +class="footnote-ref" id="fnref1"><sup>1</sup></a></p> +<pre class="r"><code>eval_select(quote(&quot;x&quot;), df)</code></pre> +<pre><code> x + 1</code></pre> +<pre class="r"><code>eval_select(quote(x), df)</code></pre> +<pre><code> x + 1</code></pre> +<pre class="r"><code>eval_select(quote(1), df)</code></pre> +<pre><code> x + 1</code></pre> +<p>You can also see <code>eval_select()</code> in action in the +<code>&lt;data.frame&gt;</code> method for <code>select()</code>:</p> +<pre class="r"><code>dplyr:::select.data.frame</code></pre> +<pre><code> function (.data, ...) 
+ { + error_call &lt;- dplyr_error_call() + loc &lt;- tidyselect::eval_select(expr(c(...)), data = .data, + error_call = error_call) + loc &lt;- ensure_group_vars(loc, .data, notify = TRUE) + out &lt;- dplyr_col_select(.data, loc) + out &lt;- set_names(out, names(loc)) + out + } + &lt;bytecode: 0x0000012f8e6de148&gt; + &lt;environment: namespace:dplyr&gt;</code></pre> +<h3 id="tidy-select-1">tidy?-select</h3> +<p>Because the column <em>subsetting</em> part is ultimately done using +integers, we can theoretically pass <code>select()</code> <em>any</em> +expression, as long as it resolves to an integer vector.</p> +<p>For example, we can use <code>1 + 1</code> to select the second +column:</p> +<pre class="r"><code>df %&gt;% + select(1 + 1)</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<p>And vector recycling is still a thing here too - we can use +<code>c(1, 2) + 1</code> to select the second and third columns:</p> +<pre class="r"><code>df %&gt;% + select(c(1, 2) + 1)</code></pre> +<pre><code> # A tibble: 2 × 2 + y z + &lt;chr&gt; &lt;chr&gt; + 1 a A + 2 b B</code></pre> +<p>Ordinary function calls work as well - we can select a random column +using <code>sample()</code>:</p> +<pre class="r"><code>df %&gt;% + select(sample(ncol(df), 1))</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<p>We can even use the <code>.env</code> pronoun to scope an integer +variable from the global environment:<a href="#fn2" class="footnote-ref" +id="fnref2"><sup>2</sup></a></p> +<pre class="r"><code>offset &lt;- 1 +df %&gt;% + select(1 + .env$offset)</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<p>So that’s kinda interesting.<a href="#fn3" class="footnote-ref" +id="fnref3"><sup>3</sup></a> But what if we try to mix the different +approaches to tidyselect-ing? 
Can we do math on columns that we’ve +selected using strings and symbols?</p> +<h3 id="untidy-select">untidy-select?</h3> +<p>Uh not quite. <code>select()</code> doesn’t like doing math on +strings and symbols.</p> +<pre class="r"><code>df %&gt;% + select(x + 1)</code></pre> +<pre><code> Error in `select()`: + ! Problem while evaluating `x + 1`. + Caused by error: + ! object &#39;x&#39; not found</code></pre> +<pre class="r"><code>df %&gt;% + select(&quot;x&quot; + 1)</code></pre> +<pre><code> Error in `select()`: + ! Problem while evaluating `&quot;x&quot; + 1`. + Caused by error in `&quot;x&quot; + 1`: + ! non-numeric argument to binary operator</code></pre> +<p>In fact, it doesn’t even like doing certain kinds of math like +multiplication (<code>*</code>), even with numeric constants:</p> +<pre class="r"><code>df %&gt;% + select(1 * 2)</code></pre> +<pre><code> Error in `select()`: + ! Can&#39;t use arithmetic operator `*` in selection context.</code></pre> +<p>This actually makes sense from a design POV. Adding numbers to +columns probably happens more often as a mistake than something +intentional. These safeguards exist to prevent users from running into +cryptic errors.</p> +<p>Unless…</p> +<h3 id="untidy-select-1">untidy-select!</h3> +<p>It turns out that <code>{tidyselect}</code> +<em><strong>helpers</strong></em> have an interesting behavior of +<em>immediately</em> resolving the column selection to integer. 
So we +can get addition (<code>+</code>) working if we wrap our columns in +redundant column selection helpers like <code>all_of()</code> and +<code>matches()</code></p> +<pre class="r"><code>df %&gt;% + select(all_of(&quot;x&quot;) + 1)</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<pre class="r"><code>df %&gt;% + select(matches(&quot;^x$&quot;) + 1)</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<p>For multiplication, we have to additionally circumvent the <a +href="https://github.com/r-lib/tidyselect/blob/7cc3ea6213838dbb3f9c19e9a8b97cd03f5063a9/R/eval-walk.R#L167">censoring</a> +of the <code>*</code> symbol. Here, we can simply use a different name +for the same operation:<a href="#fn4" class="footnote-ref" +id="fnref4"><sup>4</sup></a></p> +<pre class="r"><code>`%times%` &lt;- `*` +df %&gt;% + select(matches(&quot;^x$&quot;) %times% 2)</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<p>But geez, it’s so tiring to type <code>all_of()</code> and +<code>matches()</code> all the time. There must be a better way to break +the rule!</p> +<h2 id="tidying-untidy-select">Tidying untidy-select</h2> +<p>Let’s make a tidy design for the untidy pattern of selecting columns +by doing math on column locations. The idea is to make our own little +scope inside <code>select()</code> where all the existing safeguards are +suspended. Like a <a +href="https://en.wikipedia.org/wiki/Domain-specific_language">DSL</a> +within a DSL, if you will.</p> +<p>Let’s call this function <code>math()</code>. It should let us +express stuff like “give me the column to the right of column +<code>x</code>” via this intuitive(?) 
syntax:</p> +<pre class="r"><code>df %&gt;% + select(math(x + 1))</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<p>This is my take on <code>math()</code>:</p> +<pre class="r"><code>math &lt;- function(expr) { + math_expr &lt;- rlang::enquo(expr) + columns &lt;- tidyselect::peek_vars() + col_locs &lt;- as.data.frame.list(seq_along(columns), col.names = columns) + mask &lt;- rlang::as_data_mask(col_locs) + out &lt;- rlang::eval_tidy(math_expr, mask) + out +}</code></pre> +<p>There’s a lot of weird functions involved here, but it’s easier to +digest by focusing on its parts. Here’s what each local variable in the +function looks like for our <code>math(x + 1)</code> example above:</p> +<pre><code> $math_expr + &lt;quosure&gt; + expr: ^x + 1 + env: 0x0000012f8e27cec8 + + $columns + [1] &quot;x&quot; &quot;y&quot; &quot;z&quot; + + $col_locs + x y z + 1 1 2 3 + + $mask + &lt;environment: 0x0000012f8e3332f0&gt; + + $out + [1] 2</code></pre> +<p>Let’s walk through the pieces:</p> +<ol style="list-style-type: decimal"> +<li><p><code>math_expr</code>: the captured user expression, with the +environment attached</p></li> +<li><p><code>columns</code>: the column names of the current dataframe, +in order</p></li> +<li><p><code>col_locs</code>: a dataframe of column names and location, +created from <code>columns</code></p></li> +<li><p><code>mask</code>: a <a +href="https://rlang.r-lib.org/reference/topic-data-mask.html">data +mask</a> created from <code>col_locs</code></p></li> +<li><p><code>out</code>: location of column(s) to select</p></li> +</ol> +<p>Essentially, <code>math()</code> first captures the expression to +evaluate it in its own special environment, circumventing +<code>select()</code>’s safeguards. Then, it grabs the column names of +the data frame with <code>tidyselect::peek_vars()</code> to define +<code>col_locs</code> and then <code>mask</code>. 
The data mask +<code>mask</code> is then used inside <code>rlang::eval_tidy()</code> to +resolve symbols like <code>x</code> to integer <code>1</code> when +evaluating the captured expression <code>x + 1</code>. The expression +<code>math(x + 1)</code> thus evaluates to <code>1 + 1</code>. In turn, +<code>select(math(x + 1))</code> is evaluated to <code>select(2)</code>, +returning us the second column of the dataframe.</p> +<h2 id="writing-untidy-select-helpers">Writing untidy-select +helpers</h2> +<p>A small yet powerful detail in the implementation of +<code>math()</code> is the fact that it captures the expression as a <a +href="https://rlang.r-lib.org/reference/topic-quosure.html">quosure</a>. +This allows <code>math()</code> to appropriately scope dynamically +created variables, and not just bare symbols provided directly by the +user.</p> +<p>This makes more sense with some examples. Here, I define helper +functions that call <code>math()</code> under the hood with their own +templatic math expressions (and I have them <code>print()</code> the +expression as passed to <code>math()</code> for clarity). 
The fact that +<code>math()</code> captures its argument as a quosure is what allows +local variables like <code>n</code> to be correctly scoped in these +examples.</p> +<h3 id="times">1) <code>times()</code></h3> +<pre class="r"><code>times &lt;- function(col, n) { + col &lt;- rlang::ensym(col) + print(rlang::expr(math(!!col * n))) # for debugging + math(!!col * n) +} +df %&gt;% + select(times(x, 2))</code></pre> +<pre><code> math(x * n)</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<pre class="r"><code>num2 &lt;- 2 +df %&gt;% + select(times(x, num2))</code></pre> +<pre><code> math(x * n)</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<h3 id="offset">2) <code>offset()</code></h3> +<pre class="r"><code>offset &lt;- function(col, n) { + col &lt;- rlang::ensym(col) + print(rlang::expr(math(!!col + n))) # for debugging + math(!!col + n) +} +df %&gt;% + select(offset(x, 1))</code></pre> +<pre><code> math(x + n)</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<pre class="r"><code>num1 &lt;- 1 +df %&gt;% + select(offset(x, num1))</code></pre> +<pre><code> math(x + n)</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<h3 id="neighbors">3) <code>neighbors()</code></h3> +<pre class="r"><code>neighbors &lt;- function(col, n) { + col &lt;- rlang::ensym(col) + range &lt;- c(-(n:1), 1:n) + print(rlang::expr(math(!!col + !!range))) # for debugging + math(!!col + !!range) +} +df %&gt;% + select(neighbors(y, 1))</code></pre> +<pre><code> math(y + c(-1L, 1L))</code></pre> +<pre><code> # A tibble: 2 × 2 + x z + &lt;int&gt; &lt;chr&gt; + 1 1 A + 2 2 B</code></pre> +<pre class="r"><code>df %&gt;% + select(neighbors(y, num1))</code></pre> +<pre><code> math(y + c(-1L, 1L))</code></pre> +<pre><code> # A tibble: 2 × 2 + x z + &lt;int&gt; &lt;chr&gt; + 1 1 A + 2 2 B</code></pre> +<h3 id="diy">DIY!</h3> +<p>And of course, we can do 
arbitrary injections ourselves as well with +<code>!!</code> or <code>.env$</code>:</p> +<pre class="r"><code>df %&gt;% + select(math(x * !!num2))</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<pre class="r"><code>df %&gt;% + select(math(x * .env$num2))</code></pre> +<pre><code> # A tibble: 2 × 1 + y + &lt;chr&gt; + 1 a + 2 b</code></pre> +<p>That was fun but probably not super practical. Let’s set +<code>math()</code> aside to try our hands on something more useful.</p> +<h2 id="lets-get-practical">Let’s get practical</h2> +<h3 id="sorting-columns">1) Sorting columns</h3> +<p>Probably one of the hardest things to do idiomatically in the +tidyverse is sorting (a subset of) columns by their name. For example, +consider this dataframe which is a mix of columns that follow some fixed +pattern (<code>"x|y_\\d"</code>) and those outside that pattern +(<code>"year"</code>, <code>"day"</code>, etc.).</p> +<pre class="r"><code>data_cols &lt;- expand.grid(first = c(&quot;x&quot;, &quot;y&quot;), second = 1:3) %&gt;% + mutate(cols = paste0(first, &quot;_&quot;, second)) %&gt;% + pull(cols) +df2 &lt;- as.data.frame.list(seq_along(data_cols), col.names = data_cols) +df2 &lt;- cbind(df2, storms[1,1:5]) +df2 &lt;- df2[, sample(ncol(df2))] +df2</code></pre> +<pre><code> y_3 x_3 month day hour y_2 y_1 x_2 year name x_1 + 1 6 5 6 27 0 4 2 3 1975 Amy 1</code></pre> +<p>It’s trivial to select columns by pattern - we can use the +<code>matches()</code> helper:</p> +<pre class="r"><code>df2 %&gt;% + select(matches(&quot;(x|y)_(\\d)&quot;))</code></pre> +<pre><code> y_3 x_3 y_2 y_1 x_2 x_1 + 1 6 5 4 2 3 1</code></pre> +<p>But what if I also wanted to further sort these columns, <em>after I +select them</em>? There’s no easy way to do this “on the fly” inside of +select, especially if we want the flexibility to sort the columns by the +letter vs. 
the number.</p> +<p>But here’s one way of getting at that, exploiting two facts:</p> +<ol style="list-style-type: decimal"> +<li><code>matches()</code>, like other tidyselect helpers, immediately +resolves the selection to integer</li> +<li><code>peek_vars()</code> returns the column names in order, which +lets us recover the column names from location</li> +</ol> +<p>And that’s pretty much all there is to the tidyselect magic that goes +into my solution below. I define <code>locs</code> (integer vector of +column locations) and <code>cols</code> (character vector of column +names at those locations), and the rest is just regex and sorting:</p> +<pre class="r"><code>ordered_matches &lt;- function(matches, order) { + # tidyselect magic + locs &lt;- tidyselect::matches(matches) + cols &lt;- tidyselect::peek_vars()[locs] + # Ordinary evaluation + groups &lt;- simplify2array(regmatches(cols, regexec(matches, cols)))[-1,] + reordered &lt;- do.call(&quot;order&quot;, asplit(groups[order, ], 1)) + locs[reordered] +}</code></pre> +<p>Using <code>ordered_matches()</code>, we can not only select columns +but also sort them using regex capture groups.</p> +<p>This sorts the columns by letter first then number:</p> +<pre class="r"><code>df2 %&gt;% + select(ordered_matches(&quot;(x|y)_(\\d)&quot;, c(1, 2)))</code></pre> +<pre><code> x_1 x_2 x_3 y_1 y_2 y_3 + 1 1 3 5 2 4 6</code></pre> +<p>This sorts the columns by number first then letter:</p> +<pre class="r"><code>df2 %&gt;% + select(ordered_matches(&quot;(x|y)_(\\d)&quot;, c(2, 1)))</code></pre> +<pre><code> x_1 y_1 x_2 y_2 x_3 y_3 + 1 1 2 3 4 5 6</code></pre> +<p>And if we wanted the other columns too, we can use +<code>everything()</code> to grab the “rest”:</p> +<pre class="r"><code>df2 %&gt;% + select(ordered_matches(&quot;(x|y)_(\\d)&quot;, c(2, 1)), everything())</code></pre> +<pre><code> x_1 y_1 x_2 y_2 x_3 y_3 month day hour year name + 1 1 2 3 4 5 6 6 27 0 1975 Amy</code></pre> +<h3 id="error-handling">2) Error 
handling</h3> +<p>One of the really nice parts about the <code>{tidyselect}</code> +design is the fact that error messages are very informative.</p> +<p>For example, if you select a non-existing column, it errors while +pointing out that mistake:</p> +<pre class="r"><code>df3 &lt;- data.frame(x = 1) +nonexistent_selection &lt;- quote(c(x, y)) +eval_select(nonexistent_selection, df3)</code></pre> +<pre><code> Error in `write_feed_xml_html_content()`: + ! Can&#39;t subset columns that don&#39;t exist. + ✖ Column `y` doesn&#39;t exist.</code></pre> +<p>If you use a tidyselect helper that returns nothing, it won’t +complain by default:</p> +<pre class="r"><code>zero_selection &lt;- quote(starts_with(&quot;z&quot;)) +eval_select(zero_selection, df3)</code></pre> +<pre><code> named integer(0)</code></pre> +<p>But you can make that error with +<code>allow_empty = FALSE</code>:</p> +<pre class="r"><code>eval_select(zero_selection, df3, allow_empty = FALSE)</code></pre> +<pre><code> Error in `write_feed_xml_html_content()`: + ! Must select at least one item.</code></pre> +<p>General evaluation errors are caught and <a +href="https://rlang.r-lib.org/reference/topic-error-chaining.html">chained</a>:</p> +<pre class="r"><code>evaluation_error &lt;- quote(stop(&quot;I&#39;m a bad expression!&quot;)) +eval_select(evaluation_error, df3)</code></pre> +<pre><code> Error in `write_feed_xml_html_content()`: + ! Problem while evaluating `stop(&quot;I&#39;m a bad expression!&quot;)`. + Caused by error: + ! I&#39;m a bad expression!</code></pre> +<p>These error signalling patterns are clearly very useful for users,<a +href="#fn5" class="footnote-ref" id="fnref5"><sup>5</sup></a> but +there’s a little gem in there for developers too. 
It turns out that the +<strong>error condition object</strong> contains this information too, +which lets you detect different error types programmatically to forward +errors to your own error handling logic.</p> +<p>For example, the attempted non-existent column is stored in +<code>$i</code>:<a href="#fn6" class="footnote-ref" +id="fnref6"><sup>6</sup></a></p> +<pre class="r"><code>cnd_nonexistent &lt;- rlang::catch_cnd( + eval_select(nonexistent_selection, df3) +) +cnd_nonexistent$i</code></pre> +<pre><code> [1] &quot;y&quot;</code></pre> +<p>Zero column selections give you <code>NULL</code> in <code>$i</code> +when you set it to error:</p> +<pre class="r"><code>cnd_zero_selection &lt;- rlang::catch_cnd( + eval_select(zero_selection, df3, allow_empty = FALSE) +) +cnd_zero_selection$i</code></pre> +<pre><code> NULL</code></pre> +<p>General evaluation errors are distinguished by having a +<code>$parent</code>:</p> +<pre class="r"><code>cnd_evaluation_error &lt;- rlang::catch_cnd( + eval_select(evaluation_error, df3) +) +cnd_evaluation_error$parent</code></pre> +<pre><code> &lt;simpleError in eval_tidy(as_quosure(expr, env), context_mask): I&#39;m a bad expression!&gt;</code></pre> +<p>Again, this is more useful as a developer, if you’re building +something that integrates <code>{tidyselect}</code>.<a href="#fn7" +class="footnote-ref" id="fnref7"><sup>7</sup></a> But I personally find +this interesting to know about anyways!</p> +<h2 id="conclusion">Conclusion</h2> +<p>Here I end with the (usual) disclaimer to not actually just copy +paste these for production - they’re written with the very low standard +of scratching my itch, so they do not come with any warranty!</p> +<p>But I hope that this was a fun exercise in thinking through one of +the most mysterious magics in <code>{dplyr}</code>. 
I’m sure to +reference this many times in the future myself.</p> +<pre class="r distill-force-highlighting-css"><code></code></pre> +<div class="footnotes footnotes-end-of-document"> +<hr /> +<ol> +<li id="fn1"><p>The examples <code>quote("x")</code> and +<code>quote(1)</code> are redundant because <code>"x"</code> and +<code>1</code> are constants. I keep <code>quote()</code> in there just +to make the comparison clearer<a href="#fnref1" +class="footnote-back">↩︎</a></p></li> +<li id="fn2"><p>Not to be confused with <code>all_of()</code>. The +idiomatic pattern for scoping an external <em>character</em> vector is +to do <code>all_of(x)</code> not <code>.env$x</code>. It’s only when +you’re scoping a non-character vector that you’d use +<code>.env$</code>.<a href="#fnref2" +class="footnote-back">↩︎</a></p></li> +<li id="fn3"><p>It’s also strangely reminiscent of my <a +href="https://yjunechoe.github.io/posts/2023-06-11-row-relational-operations/">previous +blog post</a> on <code>dplyr::slice()</code><a href="#fnref3" +class="footnote-back">↩︎</a></p></li> +<li id="fn4"><p>Thanks to <a +href="https://fosstodon.org/@jonocarroll/111343255529231116">Jonathan +Carroll</a> for this suggestion!<a href="#fnref4" +class="footnote-back">↩︎</a></p></li> +<li id="fn5"><p>For those who actually read error messages, at least +(<em>points to self</em>) …<a href="#fnref5" +class="footnote-back">↩︎</a></p></li> +<li id="fn6"><p>Though <code>{tidyselect}</code> errors early, so it’ll +only record the first attempted column causing the error. 
You could use +a <code>while()</code> loop (catch and remove bad columns from the data +until there’s no more error) if you really wanted to get the full set of +offending columns.<a href="#fnref6" class="footnote-back">↩︎</a></p></li> +<li id="fn7"><p>If you want some examples of post-processing tidyselect +errors, there’s some stuff I did for <a +href="https://github.com/rstudio/pointblank/blob/7c4bdd0eb753db17b5213d03fd74f044df12be48/R/utils.R#L241-L318">pointblank</a> +that may be helpful as a reference.<a href="#fnref7" +class="footnote-back">↩︎</a></p></li> +</ol> +</div> + 32275e0b0a132d327f6605e22aa8b745 data wrangling dplyr tidyselect https://yjunechoe.github.io/posts/2023-12-03-untidy-select - Sun, 03 Dec 2023 00:00:00 +0000 + Mon, 04 Dec 2023 00:00:00 +0000 diff --git a/docs/posts/2023-12-03-untidy-select/index.html b/docs/posts/2023-12-03-untidy-select/index.html index f14523e..e71f497 100644 --- a/docs/posts/2023-12-03-untidy-select/index.html +++ b/docs/posts/2023-12-03-untidy-select/index.html @@ -96,8 +96,8 @@ - - + + @@ -125,7 +125,7 @@ @@ -2626,7 +2626,7 @@

${suggestion.title}

@@ -2693,7 +2693,7 @@

The many ways to (un)tidy-select

@@ -2808,7 +2808,7 @@

tidy-select!

       out <- set_names(out, names(loc))
       out
   }
-  <bytecode: 0x000002917a3967b8>
+  <bytecode: 0x0000012f8e6de148>
   <environment: namespace:dplyr>

tidy?-select

@@ -2965,7 +2965,7 @@

Tidying untidy-select

  $math_expr
   <quosure>
   expr: ^x + 1
-  env:  0x000002917c379bd0
+  env:  0x0000012f8e27cec8
   
   $columns
   [1] "x" "y" "z"
@@ -2975,7 +2975,7 @@ 

Tidying untidy-select

   1 1 2 3
   
   $mask
-  <environment: 0x000002917cbc2600>
+  <environment: 0x0000012f8e3332f0>
   
   $out
   [1] 2
@@ -3245,10 +3245,10 @@

2) Error handling

General evaluation errors are distinguished by having a $parent:

-cnd_zero_selection <- rlang::catch_cnd(
+cnd_evaluation_error <- rlang::catch_cnd(
   eval_select(evaluation_error, df3)
 )
-cnd_zero_selection$parent
+cnd_evaluation_error$parent
  <simpleError in eval_tidy(as_quosure(expr, env), context_mask): I'm a bad expression!>
diff --git a/docs/posts/posts.json b/docs/posts/posts.json index 0f61136..eef3ac1 100644 --- a/docs/posts/posts.json +++ b/docs/posts/posts.json @@ -9,16 +9,16 @@ "url": {} } ], - "date": "2023-12-03", + "date": "2023-12-04", "categories": [ "data wrangling", "dplyr", "tidyselect" ], - "contents": "\r\n\r\nContents\r\nIntro\r\nSome observations\r\ntidy-select!\r\ntidy?-select\r\nuntidy-select?\r\nuntidy-select!\r\n\r\nTidying untidy-select\r\nWriting untidy-select helpers\r\n1) times()\r\n2) offset()\r\n3) neighbors()\r\nDIY!\r\n\r\nLet’s get practical\r\n1) Sorting columns\r\n2) Error handling\r\n\r\nConclusion\r\n\r\nIntro\r\nRecently, I’ve been having frequent run-ins with {tidyselect} internals, discovering some weird and interesting behaviors along the way. This blog post is my attempt at documenting a couple of these. And as is the case with my usual style of writing, I’m gonna talk about some of the weirder stuff first and then touch on some of the “practical” side to this.\r\nSome observations\r\nLet’s start with some facts about how {tidyselect} is supposed to work. I’ll use this toy data for the demo:\r\n\r\n\r\nlibrary(dplyr, warn.conflicts = FALSE)\r\nlibrary(tidyselect)\r\ndf <- tibble(x = 1:2, y = letters[1:2], z = LETTERS[1:2])\r\ndf\r\n\r\n # A tibble: 2 × 3\r\n x y z \r\n \r\n 1 1 a A \r\n 2 2 b B\r\n\r\ntidy-select!\r\n{tidyselect} is the package that powers dplyr::select(). If you’ve used {dplyr}, you already know the behavior of select() pretty well. 
We can specify a column as string, symbol, or by its position:\r\n\r\n\r\ndf %>% \r\n select(\"x\")\r\n\r\n # A tibble: 2 × 1\r\n x\r\n \r\n 1 1\r\n 2 2\r\n\r\ndf %>% \r\n select(x)\r\n\r\n # A tibble: 2 × 1\r\n x\r\n \r\n 1 1\r\n 2 2\r\n\r\ndf %>% \r\n select(1)\r\n\r\n # A tibble: 2 × 1\r\n x\r\n \r\n 1 1\r\n 2 2\r\n\r\nIt’s not obvious from the outside, but the way this works is that these user-supplied expressions (like \"x\", x, and 1) all get resolved to integer before the selection actually happens.\r\nSo to be more specific, the three calls to select() were the same because these three calls to tidyselect::eval_select() are the same:1\r\n\r\n\r\neval_select(quote(\"x\"), df)\r\n\r\n x \r\n 1\r\n\r\neval_select(quote(x), df)\r\n\r\n x \r\n 1\r\n\r\neval_select(quote(1), df)\r\n\r\n x \r\n 1\r\n\r\nYou can also see eval_select() in action in the method for select():\r\n\r\n\r\ndplyr:::select.data.frame\r\n\r\n function (.data, ...) \r\n {\r\n error_call <- dplyr_error_call()\r\n loc <- tidyselect::eval_select(expr(c(...)), data = .data, \r\n error_call = error_call)\r\n loc <- ensure_group_vars(loc, .data, notify = TRUE)\r\n out <- dplyr_col_select(.data, loc)\r\n out <- set_names(out, names(loc))\r\n out\r\n }\r\n \r\n \r\n\r\ntidy?-select\r\nBecause the column subsetting part is ultimately done using integers, we can theoretically pass select() any expression, as long as it resolves to an integer vector.\r\nFor example, we can use 1 + 1 to select the second column:\r\n\r\n\r\ndf %>% \r\n select(1 + 1)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nAnd vector recycling is still a thing here too - we can use c(1, 2) + 1 to select the second and third columns:\r\n\r\n\r\ndf %>% \r\n select(c(1, 2) + 1)\r\n\r\n # A tibble: 2 × 2\r\n y z \r\n \r\n 1 a A \r\n 2 b B\r\n\r\nOrdinary function calls work as well - we can select a random column using sample():\r\n\r\n\r\ndf %>% \r\n select(sample(ncol(df), 1))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 
a \r\n 2 b\r\n\r\nWe can even use the .env pronoun to scope an integer variable from the global environment:2\r\n\r\n\r\noffset <- 1\r\ndf %>% \r\n select(1 + .env$offset)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nSo that’s kinda interesting.3 But what if we try to mix the different approaches to tidyselect-ing? Can we do math on columns that we’ve selected using strings and symbols?\r\nuntidy-select?\r\nUh not quite. select() doesn’t like doing math on strings and symbols.\r\n\r\n\r\ndf %>% \r\n select(x + 1)\r\n\r\n Error in `select()`:\r\n ! Problem while evaluating `x + 1`.\r\n Caused by error:\r\n ! object 'x' not found\r\n\r\ndf %>% \r\n select(\"x\" + 1)\r\n\r\n Error in `select()`:\r\n ! Problem while evaluating `\"x\" + 1`.\r\n Caused by error in `\"x\" + 1`:\r\n ! non-numeric argument to binary operator\r\n\r\nIn fact, it doesn’t even like doing certain kinds of math like multiplication (*), even with numeric constants:\r\n\r\n\r\ndf %>% \r\n select(1 * 2)\r\n\r\n Error in `select()`:\r\n ! Can't use arithmetic operator `*` in selection context.\r\n\r\nThis actually makes sense from a design POV. Adding numbers to columns probably happens more often as a mistake than something intentional. These safeguards exist to prevent users from running into cryptic errors.\r\nUnless…\r\nuntidy-select!\r\nIt turns out that {tidyselect} helpers have an interesting behavior of immediately resolving the column selection to integer. So we can get addition (+) working if we wrap our columns in redundant column selection helpers like all_of() and matches()\r\n\r\n\r\ndf %>% \r\n select(all_of(\"x\") + 1)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\ndf %>% \r\n select(matches(\"^x$\") + 1)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nFor multiplication, we have to additionally circumvent the censoring of the * symbol. 
Here, we can simply use a different name for the same operation:4\r\n\r\n\r\n`%times%` <- `*`\r\ndf %>% \r\n select(matches(\"^x$\") %times% 2)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nBut geez, it’s so tiring to type all_of() and matches() all the time. There must be a better way to break the rule!\r\nTidying untidy-select\r\nLet’s make a tidy design for the untidy pattern of selecting columns by doing math on column locations. The idea is to make our own little scope inside select() where all the existing safeguards are suspended. Like a DSL within a DSL, if you will.\r\nLet’s call this function math(). It should let us express stuff like “give me the column to the right of column x” via this intuitive(?) syntax:\r\n\r\n\r\n\r\n\r\n\r\ndf %>% \r\n select(math(x + 1))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nThis is my take on math():\r\n\r\n\r\nmath <- function(expr) {\r\n math_expr <- rlang::enquo(expr)\r\n columns <- tidyselect::peek_vars()\r\n col_locs <- as.data.frame.list(seq_along(columns), col.names = columns)\r\n mask <- rlang::as_data_mask(col_locs)\r\n out <- rlang::eval_tidy(math_expr, mask)\r\n out\r\n}\r\n\r\n\r\nThere’s a lot of weird functions involved here, but it’s easier to digest by focusing on its parts. 
Here’s what each local variable in the function looks like for our math(x + 1) example above:\r\n\r\n $math_expr\r\n \r\n expr: ^x + 1\r\n env: 0x000002917c379bd0\r\n \r\n $columns\r\n [1] \"x\" \"y\" \"z\"\r\n \r\n $col_locs\r\n x y z\r\n 1 1 2 3\r\n \r\n $mask\r\n \r\n \r\n $out\r\n [1] 2\r\n\r\nLet’s walk through the pieces:\r\nmath_expr: the captured user expression, with the environment attached\r\ncolumns: the column names of the current dataframe, in order\r\ncol_locs: a dataframe of column names and location, created from columns\r\nmask: a data mask created from col_locs\r\nout: location of column(s) to select\r\nEssentially, math() first captures the expression to evaluate it in its own special environment, circumventing select()’s safeguards. Then, it grabs the column names of the data frame with tidyselect::peek_vars() to define col_locs and then mask. The data mask mask is then used inside rlang::eval_tidy() to resolve symbols like x to integer 1 when evaluating the captured expression x + 1. The expression math(x + 1) thus evaluates to 1 + 1. In turn, select(math(x + 1)) is evaluated to select(2), returning us the second column of the dataframe.\r\nWriting untidy-select helpers\r\nA small yet powerful detail in the implementation of math() is the fact that it captures the expression as a quosure. This allows math() to appropriately scope dynamically created variables, and not just bare symbols provided directly by the user.\r\nThis makes more sense with some examples. Here, I define helper functions that call math() under the hood with their own templatic math expressions (and I have them print() the expression as passed to math() for clarity). 
The fact that math() captures its argument as a quosure is what allows local variables like n to be correctly scoped in these examples.\r\n1) times()\r\n\r\n\r\ntimes <- function(col, n) {\r\n col <- rlang::ensym(col)\r\n print(rlang::expr(math(!!col * n))) # for debugging\r\n math(!!col * n)\r\n}\r\ndf %>%\r\n select(times(x, 2))\r\n\r\n math(x * n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n\r\n\r\nnum2 <- 2\r\ndf %>%\r\n select(times(x, num2))\r\n\r\n math(x * n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n2) offset()\r\n\r\n\r\noffset <- function(col, n) {\r\n col <- rlang::ensym(col)\r\n print(rlang::expr(math(!!col + n))) # for debugging\r\n math(!!col + n)\r\n}\r\ndf %>%\r\n select(offset(x, 1))\r\n\r\n math(x + n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n\r\n\r\nnum1 <- 1\r\ndf %>%\r\n select(offset(x, num1))\r\n\r\n math(x + n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n3) neighbors()\r\n\r\n\r\nneighbors <- function(col, n) {\r\n col <- rlang::ensym(col)\r\n range <- c(-(n:1), 1:n)\r\n print(rlang::expr(math(!!col + !!range))) # for debugging\r\n math(!!col + !!range)\r\n}\r\ndf %>%\r\n select(neighbors(y, 1))\r\n\r\n math(y + c(-1L, 1L))\r\n # A tibble: 2 × 2\r\n x z \r\n \r\n 1 1 A \r\n 2 2 B\r\n\r\n\r\n\r\ndf %>%\r\n select(neighbors(y, num1))\r\n\r\n math(y + c(-1L, 1L))\r\n # A tibble: 2 × 2\r\n x z \r\n \r\n 1 1 A \r\n 2 2 B\r\n\r\nDIY!\r\nAnd of course, we can do arbitrary injections ourselves as well with !! or .env$:\r\n\r\n\r\ndf %>%\r\n select(math(x * !!num2))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\ndf %>%\r\n select(math(x * .env$num2))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nThat was fun but probably not super practical. 
Let’s set math() aside to try our hands on something more useful.\r\nLet’s get practical\r\n1) Sorting columns\r\nProbably one of the hardest things to do idiomatically in the tidyverse is sorting (a subset of) columns by their name. For example, consider this dataframe which is a mix of columns that follow some fixed pattern (\"x|y_\\\\d\") and those outside that pattern (\"year\", \"day\", etc.).\r\n\r\n\r\ndata_cols <- expand.grid(first = c(\"x\", \"y\"), second = 1:3) %>%\r\n mutate(cols = paste0(first, \"_\", second)) %>%\r\n pull(cols)\r\ndf2 <- as.data.frame.list(seq_along(data_cols), col.names = data_cols)\r\ndf2 <- cbind(df2, storms[1,1:5])\r\ndf2 <- df2[, sample(ncol(df2))]\r\ndf2\r\n\r\n y_3 x_3 month day hour y_2 y_1 x_2 year name x_1\r\n 1 6 5 6 27 0 4 2 3 1975 Amy 1\r\n\r\nIt’s trivial to select columns by pattern - we can use the matches() helper:\r\n\r\n\r\ndf2 %>%\r\n select(matches(\"(x|y)_(\\\\d)\"))\r\n\r\n y_3 x_3 y_2 y_1 x_2 x_1\r\n 1 6 5 4 2 3 1\r\n\r\nBut what if I also wanted to further sort these columns, after I select them? There’s no easy way to do this “on the fly” inside of select, especially if we want the flexibility to sort the columns by the letter vs. the number.\r\nBut here’s one way of getting at that, exploiting two facts:\r\nmatches(), like other tidyselect helpers, immediately resolves the selection to integer\r\npeek_vars() returns the column names in order, which lets us recover the column names from location\r\nAnd that’s pretty much all there is to the tidyselect magic that goes into my solution below. 
I define locs (integer vector of column locations) and cols (character vector of column names at those locations), and the rest is just regex and sorting:\r\n\r\n\r\nordered_matches <- function(matches, order) {\r\n # tidyselect magic\r\n locs <- tidyselect::matches(matches)\r\n cols <- tidyselect::peek_vars()[locs]\r\n # Ordinary evaluation\r\n groups <- simplify2array(regmatches(cols, regexec(matches, cols)))[-1,]\r\n reordered <- do.call(\"order\", asplit(groups[order, ], 1))\r\n locs[reordered]\r\n}\r\n\r\n\r\nUsing ordered_matches(), we can not only select columns but also sort them using regex capture groups.\r\nThis sorts the columns by letter first then number:\r\n\r\n\r\ndf2 %>%\r\n select(ordered_matches(\"(x|y)_(\\\\d)\", c(1, 2)))\r\n\r\n x_1 x_2 x_3 y_1 y_2 y_3\r\n 1 1 3 5 2 4 6\r\n\r\nThis sorts the columns by number first then letter:\r\n\r\n\r\ndf2 %>%\r\n select(ordered_matches(\"(x|y)_(\\\\d)\", c(2, 1)))\r\n\r\n x_1 y_1 x_2 y_2 x_3 y_3\r\n 1 1 2 3 4 5 6\r\n\r\nAnd if we wanted the other columns too, we can use everything() to grab the “rest”:\r\n\r\n\r\ndf2 %>%\r\n select(ordered_matches(\"(x|y)_(\\\\d)\", c(2, 1)), everything())\r\n\r\n x_1 y_1 x_2 y_2 x_3 y_3 month day hour year name\r\n 1 1 2 3 4 5 6 6 27 0 1975 Amy\r\n\r\n2) Error handling\r\nOne of the really nice parts about the {tidyselect} design is the fact that error messages are very informative.\r\nFor example, if you select a non-existing column, it errors while pointing out that mistake:\r\n\r\n\r\ndf3 <- data.frame(x = 1)\r\nnonexistent_selection <- quote(c(x, y))\r\neval_select(nonexistent_selection, df3)\r\n\r\n Error:\r\n ! 
Can't subset columns that don't exist.\r\n ✖ Column `y` doesn't exist.\r\n\r\nIf you use a tidyselect helper that returns nothing, it won’t complain by default:\r\n\r\n\r\nzero_selection <- quote(starts_with(\"z\"))\r\neval_select(zero_selection, df3)\r\n\r\n named integer(0)\r\n\r\nBut you can make that error with allow_empty = FALSE:\r\n\r\n\r\neval_select(zero_selection, df3, allow_empty = FALSE)\r\n\r\n Error:\r\n ! Must select at least one item.\r\n\r\nGeneral evaluation errors are caught and chained:\r\n\r\n\r\nevaluation_error <- quote(stop(\"I'm a bad expression!\"))\r\neval_select(evaluation_error, df3)\r\n\r\n Error:\r\n ! Problem while evaluating `stop(\"I'm a bad expression!\")`.\r\n Caused by error:\r\n ! I'm a bad expression!\r\n\r\nThese error signalling patterns are clearly very useful for users,5 but there’s a little gem in there for developers too. It turns out that the error condition object contains these information too, which lets you detect different error types programmatically to forward errors to your own error handling logic.\r\nFor example, the attempted non-existent column is stored in $i:6\r\n\r\n\r\ncnd_nonexistent <- rlang::catch_cnd(\r\n eval_select(nonexistent_selection, df3)\r\n)\r\ncnd_nonexistent$i\r\n\r\n [1] \"y\"\r\n\r\nZero column selections give you NULL in $i when you set it to error:\r\n\r\n\r\ncnd_zero_selection <- rlang::catch_cnd(\r\n eval_select(zero_selection, df3, allow_empty = FALSE)\r\n)\r\ncnd_zero_selection$i\r\n\r\n NULL\r\n\r\nGeneral evaluation errors are distinguished by having a $parent:\r\n\r\n\r\ncnd_zero_selection <- rlang::catch_cnd(\r\n eval_select(evaluation_error, df3)\r\n)\r\ncnd_zero_selection$parent\r\n\r\n \r\n\r\nAgain, this is more useful as a developer, if you’re building something that integrates {tidyselect}.7 But I personally find this interesting to know about anyways!\r\nConclusion\r\nHere I end with the (usual) disclaimer to not actually just copy paste these for production - they’re 
written with the very low standard of scratching my itch, so they do not come with any warranty!\r\nBut I hope that this was a fun exercise in thinking through one of the most mysterious magics in {dplyr}. I’m sure to reference this many times in the future myself.\r\n\r\nThe examples quote(\"x\") and quote(1) are redundant because \"x\" and 1 are constants. I keep quote() in there just to make the comparison clearer↩︎\r\nNot to be confused with all_of(). The idiomatic pattern for scoping an external character vector is to do all_of(x) not .env$x. It’s only when you’re scoping a non-character vector that you’d use .env$.↩︎\r\nIt’s also strangely reminiscent of my previous blog post on dplyr::slice()↩︎\r\nThanks to Jonathan Carroll for this suggestion!↩︎\r\nFor those who actually read error messages, at least (points to self) …↩︎\r\nThough {tidyselect} errors early, so it’ll only record the first attempted column causing the error. You could use a while() loop (catch and remove bad columns from the data until there’s no more error) if you really wanted to get the full set of offending columns.↩︎\r\nIf you want some examples of post-processing tidyselect errors, there’s some stuff I did for pointblank that may be helpful as a reference.↩︎\r\n", + "contents": "\r\n\r\nContents\r\nIntro\r\nSome observations\r\ntidy-select!\r\ntidy?-select\r\nuntidy-select?\r\nuntidy-select!\r\n\r\nTidying untidy-select\r\nWriting untidy-select helpers\r\n1) times()\r\n2) offset()\r\n3) neighbors()\r\nDIY!\r\n\r\nLet’s get practical\r\n1) Sorting columns\r\n2) Error handling\r\n\r\nConclusion\r\n\r\nIntro\r\nRecently, I’ve been having frequent run-ins with {tidyselect} internals, discovering some weird and interesting behaviors along the way. This blog post is my attempt at documenting a couple of these. 
And as is the case with my usual style of writing, I’m gonna talk about some of the weirder stuff first and then touch on some of the “practical” side to this.\r\nSome observations\r\nLet’s start with some facts about how {tidyselect} is supposed to work. I’ll use this toy data for the demo:\r\n\r\n\r\nlibrary(dplyr, warn.conflicts = FALSE)\r\nlibrary(tidyselect)\r\ndf <- tibble(x = 1:2, y = letters[1:2], z = LETTERS[1:2])\r\ndf\r\n\r\n # A tibble: 2 × 3\r\n x y z \r\n \r\n 1 1 a A \r\n 2 2 b B\r\n\r\ntidy-select!\r\n{tidyselect} is the package that powers dplyr::select(). If you’ve used {dplyr}, you already know the behavior of select() pretty well. We can specify a column as string, symbol, or by its position:\r\n\r\n\r\ndf %>% \r\n select(\"x\")\r\n\r\n # A tibble: 2 × 1\r\n x\r\n \r\n 1 1\r\n 2 2\r\n\r\ndf %>% \r\n select(x)\r\n\r\n # A tibble: 2 × 1\r\n x\r\n \r\n 1 1\r\n 2 2\r\n\r\ndf %>% \r\n select(1)\r\n\r\n # A tibble: 2 × 1\r\n x\r\n \r\n 1 1\r\n 2 2\r\n\r\nIt’s not obvious from the outside, but the way this works is that these user-supplied expressions (like \"x\", x, and 1) all get resolved to integer before the selection actually happens.\r\nSo to be more specific, the three calls to select() were the same because these three calls to tidyselect::eval_select() are the same:1\r\n\r\n\r\neval_select(quote(\"x\"), df)\r\n\r\n x \r\n 1\r\n\r\neval_select(quote(x), df)\r\n\r\n x \r\n 1\r\n\r\neval_select(quote(1), df)\r\n\r\n x \r\n 1\r\n\r\nYou can also see eval_select() in action in the method for select():\r\n\r\n\r\ndplyr:::select.data.frame\r\n\r\n function (.data, ...) 
\r\n {\r\n error_call <- dplyr_error_call()\r\n loc <- tidyselect::eval_select(expr(c(...)), data = .data, \r\n error_call = error_call)\r\n loc <- ensure_group_vars(loc, .data, notify = TRUE)\r\n out <- dplyr_col_select(.data, loc)\r\n out <- set_names(out, names(loc))\r\n out\r\n }\r\n \r\n \r\n\r\ntidy?-select\r\nBecause the column subsetting part is ultimately done using integers, we can theoretically pass select() any expression, as long as it resolves to an integer vector.\r\nFor example, we can use 1 + 1 to select the second column:\r\n\r\n\r\ndf %>% \r\n select(1 + 1)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nAnd vector recycling is still a thing here too - we can use c(1, 2) + 1 to select the second and third columns:\r\n\r\n\r\ndf %>% \r\n select(c(1, 2) + 1)\r\n\r\n # A tibble: 2 × 2\r\n y z \r\n \r\n 1 a A \r\n 2 b B\r\n\r\nOrdinary function calls work as well - we can select a random column using sample():\r\n\r\n\r\ndf %>% \r\n select(sample(ncol(df), 1))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nWe can even use the .env pronoun to scope an integer variable from the global environment:2\r\n\r\n\r\noffset <- 1\r\ndf %>% \r\n select(1 + .env$offset)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nSo that’s kinda interesting.3 But what if we try to mix the different approaches to tidyselect-ing? Can we do math on columns that we’ve selected using strings and symbols?\r\nuntidy-select?\r\nUh not quite. select() doesn’t like doing math on strings and symbols.\r\n\r\n\r\ndf %>% \r\n select(x + 1)\r\n\r\n Error in `select()`:\r\n ! Problem while evaluating `x + 1`.\r\n Caused by error:\r\n ! object 'x' not found\r\n\r\ndf %>% \r\n select(\"x\" + 1)\r\n\r\n Error in `select()`:\r\n ! Problem while evaluating `\"x\" + 1`.\r\n Caused by error in `\"x\" + 1`:\r\n ! 
non-numeric argument to binary operator\r\n\r\nIn fact, it doesn’t even like doing certain kinds of math like multiplication (*), even with numeric constants:\r\n\r\n\r\ndf %>% \r\n select(1 * 2)\r\n\r\n Error in `select()`:\r\n ! Can't use arithmetic operator `*` in selection context.\r\n\r\nThis actually makes sense from a design POV. Adding numbers to columns probably happens more often as a mistake than something intentional. These safeguards exist to prevent users from running into cryptic errors.\r\nUnless…\r\nuntidy-select!\r\nIt turns out that {tidyselect} helpers have an interesting behavior of immediately resolving the column selection to integer. So we can get addition (+) working if we wrap our columns in redundant column selection helpers like all_of() and matches()\r\n\r\n\r\ndf %>% \r\n select(all_of(\"x\") + 1)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\ndf %>% \r\n select(matches(\"^x$\") + 1)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nFor multiplication, we have to additionally circumvent the censoring of the * symbol. Here, we can simply use a different name for the same operation:4\r\n\r\n\r\n`%times%` <- `*`\r\ndf %>% \r\n select(matches(\"^x$\") %times% 2)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nBut geez, it’s so tiring to type all_of() and matches() all the time. There must be a better way to break the rule!\r\nTidying untidy-select\r\nLet’s make a tidy design for the untidy pattern of selecting columns by doing math on column locations. The idea is to make our own little scope inside select() where all the existing safeguards are suspended. Like a DSL within a DSL, if you will.\r\nLet’s call this function math(). It should let us express stuff like “give me the column to the right of column x” via this intuitive(?) 
syntax:\r\n\r\n\r\n\r\n\r\n\r\ndf %>% \r\n select(math(x + 1))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nThis is my take on math():\r\n\r\n\r\nmath <- function(expr) {\r\n math_expr <- rlang::enquo(expr)\r\n columns <- tidyselect::peek_vars()\r\n col_locs <- as.data.frame.list(seq_along(columns), col.names = columns)\r\n mask <- rlang::as_data_mask(col_locs)\r\n out <- rlang::eval_tidy(math_expr, mask)\r\n out\r\n}\r\n\r\n\r\nThere’s a lot of weird functions involved here, but it’s easier to digest by focusing on its parts. Here’s what each local variable in the function looks like for our math(x + 1) example above:\r\n\r\n $math_expr\r\n \r\n expr: ^x + 1\r\n env: 0x0000012f8e27cec8\r\n \r\n $columns\r\n [1] \"x\" \"y\" \"z\"\r\n \r\n $col_locs\r\n x y z\r\n 1 1 2 3\r\n \r\n $mask\r\n \r\n \r\n $out\r\n [1] 2\r\n\r\nLet’s walk through the pieces:\r\nmath_expr: the captured user expression, with the environment attached\r\ncolumns: the column names of the current dataframe, in order\r\ncol_locs: a dataframe of column names and location, created from columns\r\nmask: a data mask created from col_locs\r\nout: location of column(s) to select\r\nEssentially, math() first captures the expression to evaluate it in its own special environment, circumventing select()’s safeguards. Then, it grabs the column names of the data frame with tidyselect::peek_vars() to define col_locs and then mask. The data mask mask is then used inside rlang::eval_tidy() to resolve symbols like x to integer 1 when evaluating the captured expression x + 1. The expression math(x + 1) thus evaluates to 1 + 1. In turn, select(math(x + 1)) is evaluated to select(2), returning us the second column of the dataframe.\r\nWriting untidy-select helpers\r\nA small yet powerful detail in the implementation of math() is the fact that it captures the expression as a quosure. 
This allows math() to appropriately scope dynamically created variables, and not just bare symbols provided directly by the user.\r\nThis makes more sense with some examples. Here, I define helper functions that call math() under the hood with their own templatic math expressions (and I have them print() the expression as passed to math() for clarity). The fact that math() captures its argument as a quosure is what allows local variables like n to be correctly scoped in these examples.\r\n1) times()\r\n\r\n\r\ntimes <- function(col, n) {\r\n col <- rlang::ensym(col)\r\n print(rlang::expr(math(!!col * n))) # for debugging\r\n math(!!col * n)\r\n}\r\ndf %>%\r\n select(times(x, 2))\r\n\r\n math(x * n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n\r\n\r\nnum2 <- 2\r\ndf %>%\r\n select(times(x, num2))\r\n\r\n math(x * n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n2) offset()\r\n\r\n\r\noffset <- function(col, n) {\r\n col <- rlang::ensym(col)\r\n print(rlang::expr(math(!!col + n))) # for debugging\r\n math(!!col + n)\r\n}\r\ndf %>%\r\n select(offset(x, 1))\r\n\r\n math(x + n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n\r\n\r\nnum1 <- 1\r\ndf %>%\r\n select(offset(x, num1))\r\n\r\n math(x + n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n3) neighbors()\r\n\r\n\r\nneighbors <- function(col, n) {\r\n col <- rlang::ensym(col)\r\n range <- c(-(n:1), 1:n)\r\n print(rlang::expr(math(!!col + !!range))) # for debugging\r\n math(!!col + !!range)\r\n}\r\ndf %>%\r\n select(neighbors(y, 1))\r\n\r\n math(y + c(-1L, 1L))\r\n # A tibble: 2 × 2\r\n x z \r\n \r\n 1 1 A \r\n 2 2 B\r\n\r\n\r\n\r\ndf %>%\r\n select(neighbors(y, num1))\r\n\r\n math(y + c(-1L, 1L))\r\n # A tibble: 2 × 2\r\n x z \r\n \r\n 1 1 A \r\n 2 2 B\r\n\r\nDIY!\r\nAnd of course, we can do arbitrary injections ourselves as well with !! 
or .env$:\r\n\r\n\r\ndf %>%\r\n select(math(x * !!num2))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\ndf %>%\r\n select(math(x * .env$num2))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nThat was fun but probably not super practical. Let’s set math() aside to try our hands on something more useful.\r\nLet’s get practical\r\n1) Sorting columns\r\nProbably one of the hardest things to do idiomatically in the tidyverse is sorting (a subset of) columns by their name. For example, consider this dataframe which is a mix of columns that follow some fixed pattern (\"x|y_\\\\d\") and those outside that pattern (\"year\", \"day\", etc.).\r\n\r\n\r\ndata_cols <- expand.grid(first = c(\"x\", \"y\"), second = 1:3) %>%\r\n mutate(cols = paste0(first, \"_\", second)) %>%\r\n pull(cols)\r\ndf2 <- as.data.frame.list(seq_along(data_cols), col.names = data_cols)\r\ndf2 <- cbind(df2, storms[1,1:5])\r\ndf2 <- df2[, sample(ncol(df2))]\r\ndf2\r\n\r\n y_3 x_3 month day hour y_2 y_1 x_2 year name x_1\r\n 1 6 5 6 27 0 4 2 3 1975 Amy 1\r\n\r\nIt’s trivial to select columns by pattern - we can use the matches() helper:\r\n\r\n\r\ndf2 %>%\r\n select(matches(\"(x|y)_(\\\\d)\"))\r\n\r\n y_3 x_3 y_2 y_1 x_2 x_1\r\n 1 6 5 4 2 3 1\r\n\r\nBut what if I also wanted to further sort these columns, after I select them? There’s no easy way to do this “on the fly” inside of select, especially if we want the flexibility to sort the columns by the letter vs. the number.\r\nBut here’s one way of getting at that, exploiting two facts:\r\nmatches(), like other tidyselect helpers, immediately resolves the selection to integer\r\npeek_vars() returns the column names in order, which lets us recover the column names from location\r\nAnd that’s pretty much all there is to the tidyselect magic that goes into my solution below. 
I define locs (integer vector of column locations) and cols (character vector of column names at those locations), and the rest is just regex and sorting:\r\n\r\n\r\nordered_matches <- function(matches, order) {\r\n # tidyselect magic\r\n locs <- tidyselect::matches(matches)\r\n cols <- tidyselect::peek_vars()[locs]\r\n # Ordinary evaluation\r\n groups <- simplify2array(regmatches(cols, regexec(matches, cols)))[-1,]\r\n reordered <- do.call(\"order\", asplit(groups[order, ], 1))\r\n locs[reordered]\r\n}\r\n\r\n\r\nUsing ordered_matches(), we can not only select columns but also sort them using regex capture groups.\r\nThis sorts the columns by letter first then number:\r\n\r\n\r\ndf2 %>%\r\n select(ordered_matches(\"(x|y)_(\\\\d)\", c(1, 2)))\r\n\r\n x_1 x_2 x_3 y_1 y_2 y_3\r\n 1 1 3 5 2 4 6\r\n\r\nThis sorts the columns by number first then letter:\r\n\r\n\r\ndf2 %>%\r\n select(ordered_matches(\"(x|y)_(\\\\d)\", c(2, 1)))\r\n\r\n x_1 y_1 x_2 y_2 x_3 y_3\r\n 1 1 2 3 4 5 6\r\n\r\nAnd if we wanted the other columns too, we can use everything() to grab the “rest”:\r\n\r\n\r\ndf2 %>%\r\n select(ordered_matches(\"(x|y)_(\\\\d)\", c(2, 1)), everything())\r\n\r\n x_1 y_1 x_2 y_2 x_3 y_3 month day hour year name\r\n 1 1 2 3 4 5 6 6 27 0 1975 Amy\r\n\r\n2) Error handling\r\nOne of the really nice parts about the {tidyselect} design is the fact that error messages are very informative.\r\nFor example, if you select a non-existing column, it errors while pointing out that mistake:\r\n\r\n\r\ndf3 <- data.frame(x = 1)\r\nnonexistent_selection <- quote(c(x, y))\r\neval_select(nonexistent_selection, df3)\r\n\r\n Error:\r\n ! 
Can't subset columns that don't exist.\r\n ✖ Column `y` doesn't exist.\r\n\r\nIf you use a tidyselect helper that returns nothing, it won’t complain by default:\r\n\r\n\r\nzero_selection <- quote(starts_with(\"z\"))\r\neval_select(zero_selection, df3)\r\n\r\n named integer(0)\r\n\r\nBut you can make that error with allow_empty = FALSE:\r\n\r\n\r\neval_select(zero_selection, df3, allow_empty = FALSE)\r\n\r\n Error:\r\n ! Must select at least one item.\r\n\r\nGeneral evaluation errors are caught and chained:\r\n\r\n\r\nevaluation_error <- quote(stop(\"I'm a bad expression!\"))\r\neval_select(evaluation_error, df3)\r\n\r\n Error:\r\n ! Problem while evaluating `stop(\"I'm a bad expression!\")`.\r\n Caused by error:\r\n ! I'm a bad expression!\r\n\r\nThese error signalling patterns are clearly very useful for users,5 but there’s a little gem in there for developers too. It turns out that the error condition object contains this information too, which lets you detect different error types programmatically to forward errors to your own error handling logic.\r\nFor example, the attempted non-existent column is stored in $i:6\r\n\r\n\r\ncnd_nonexistent <- rlang::catch_cnd(\r\n eval_select(nonexistent_selection, df3)\r\n)\r\ncnd_nonexistent$i\r\n\r\n [1] \"y\"\r\n\r\nZero column selections give you NULL in $i when you set it to error:\r\n\r\n\r\ncnd_zero_selection <- rlang::catch_cnd(\r\n eval_select(zero_selection, df3, allow_empty = FALSE)\r\n)\r\ncnd_zero_selection$i\r\n\r\n NULL\r\n\r\nGeneral evaluation errors are distinguished by having a $parent:\r\n\r\n\r\ncnd_evaluation_error <- rlang::catch_cnd(\r\n eval_select(evaluation_error, df3)\r\n)\r\ncnd_evaluation_error$parent\r\n\r\n \r\n\r\nAgain, this is more useful as a developer, if you’re building something that integrates {tidyselect}.7 But I personally find this interesting to know about anyways!\r\nConclusion\r\nHere I end with the (usual) disclaimer to not actually just copy paste these for production - they’re 
written with the very low standard of scratching my itch, so they do not come with any warranty!\r\nBut I hope that this was a fun exercise in thinking through one of the most mysterious magics in {dplyr}. I’m sure to reference this many times in the future myself.\r\n\r\nThe examples quote(\"x\") and quote(1) are redundant because \"x\" and 1 are constants. I keep quote() in there just to make the comparison clearer↩︎\r\nNot to be confused with all_of(). The idiomatic pattern for scoping an external character vector is to do all_of(x) not .env$x. It’s only when you’re scoping a non-character vector that you’d use .env$.↩︎\r\nIt’s also strangely reminiscent of my previous blog post on dplyr::slice()↩︎\r\nThanks to Jonathan Carroll for this suggestion!↩︎\r\nFor those who actually read error messages, at least (points to self) …↩︎\r\nThough {tidyselect} errors early, so it’ll only record the first attempted column causing the error. You could use a while() loop (catch and remove bad columns from the data until there’s no more error) if you really wanted to get the full set of offending columns.↩︎\r\nIf you want some examples of post-processing tidyselect errors, there’s some stuff I did for pointblank that may be helpful as a reference.↩︎\r\n", "preview": "posts/2023-12-03-untidy-select/preview.png", - "last_modified": "2023-12-03T19:43:14-05:00", - "input_file": {}, + "last_modified": "2023-12-04T10:11:18-05:00", + "input_file": "untidy-select.knit.md", "preview_width": 957, "preview_height": 664 }, diff --git a/docs/sitemap.xml b/docs/sitemap.xml index e25217e..e95e3ae 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -30,7 +30,7 @@ https://yjunechoe.github.io/posts/2023-12-03-untidy-select/ - 2023-12-03T19:43:14-05:00 + 2023-12-04T10:11:18-05:00 https://yjunechoe.github.io/posts/2023-07-09-x-y-problem/