diff --git a/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.Rmd b/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.Rmd
index 90127ce..63bcb91 100644
--- a/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.Rmd
+++ b/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.Rmd
@@ -37,7 +37,7 @@ I think it's safe to say that the average `{dplyr}` user does not know the `ave(
 
 ## `ave()`
 
-`ave()` is a split-apply-function in base R (specifically, `{stats}`). It's a pretty short function - maybe you can make out what it does from just reading the code^[And check out the elusive `split<-` function!]
+`ave()` is a split-apply-combine function in base R (specifically, `{stats}`). It's a pretty short function - maybe you can make out what it does from just reading the code^[And check out the elusive `split<-` function!]
 
 ```{r}
 ave
@@ -130,7 +130,7 @@ input %>%
   ...
 ```
 
-I'll note that there's actually also an idiomatic `{dplyr}`-solution to this using the lesser-known function `add_count()`, but you can't avoid the repetitiveness problem because it doesn't vectorize on the first argument:
+I'll note that there's actually also an idiomatic `{dplyr}`-solution to this using the lesser-known function `add_count()`, but you can't avoid the repetitiveness problem because it doesn't vectorize over the first argument:
 
 ```{r}
 input %>%
diff --git a/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.html b/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.html
index 36cec4d..a63ad2d 100644
--- a/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.html
+++ b/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.html
@@ -1569,7 +1569,7 @@
 I think it’s safe to say that the average {dplyr} user does not know the ave() function. For that audience, this is a short appreciation post on ave(), a case of tidyverse and base R.
 ave()
-ave() is a split-apply-function in base R (specifically, {stats}). It’s a pretty short function - maybe you can make out what it does from just reading the code1
+ave() is a split-apply-combine function in base R (specifically, {stats}). It’s a pretty short function - maybe you can make out what it does from just reading the code1
 ave
@@ -1584,7 +1584,7 @@ ave()
 Despite its (rather generic and uninformative) name, I like to think of ave() as actually belonging to the *apply() family of functions, having particularly close ties to tapply().
 {tidyverse} solutions
-I’ll note that there’s actually also an idiomatic {dplyr}-solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because it doesn’t vectorize on the first argument:
+I’ll note that there’s actually also an idiomatic {dplyr}-solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because it doesn’t vectorize over the first argument:
 input %>%
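
Reviewer note: the wording change above, from "split-apply-function" to "split-apply-combine", matches what the body of `ave()` does. Here is a minimal base R sketch of the three steps, assuming a single grouping vector. The helper name `ave_by_hand()` is made up for illustration and is not part of `{stats}`.

```r
# Hypothetical helper for illustration only; stats::ave() is the real function.
ave_by_hand <- function(x, g, FUN = mean) {
  g <- as.factor(g)
  pieces  <- split(x, g)           # split: one vector of values per group
  results <- lapply(pieces, FUN)   # apply: run FUN on each group's values
  split(x, g) <- results           # combine: `split<-` writes results back by position
  x
}

freq <- c(5, 12, 3, 7)
a    <- c("A", "A", "A", "B")

ave_by_hand(freq, a, FUN = sum)

# Agrees with the real function on this input
stopifnot(identical(ave_by_hand(freq, a, FUN = sum), ave(freq, a, FUN = sum)))
```

The "combine" step is why the result always has the same length as the input: each group's result is recycled over that group's original positions.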
diff --git a/docs/blog.html b/docs/blog.html
index 9c24c0e..4a82238 100644
--- a/docs/blog.html
+++ b/docs/blog.html
@@ -1,3576 +1,3576 @@
- June Choe: Blog Posts
-Blog Posts
+ June Choe: Blog Posts
+Blog Posts
diff --git a/docs/blog.xml b/docs/blog.xml
index c1750e8..de5efd8 100644
--- a/docs/blog.xml
+++ b/docs/blog.xml
@@ -12,15 +12,278 @@
https://yjunechoe.github.io
Distill
- Sun, 09 Jun 2024 00:00:00 +0000
+ Sat, 08 Jun 2024 00:00:00 +0000
-
`ave()` for the average {dplyr} user
June Choe
https://yjunechoe.github.io/posts/2024-06-09-ave-for-the-average
- tidyverse 🤝 base R
+
+
+
+<p>I think it’s safe to say that the average <code>{dplyr}</code> user
+does not know the <code>ave()</code> function. For that audience, this
+is a short appreciation post on <code>ave()</code>, a case of tidyverse
+<em>and</em> base R.</p>
+<h2 id="ave"><code>ave()</code></h2>
+<p><code>ave()</code> is a split-apply-combine function in base R
+(specifically, <code>{stats}</code>). It’s a pretty short function -
+maybe you can make out what it does from just reading the code<a
+href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
+<pre class="r"><code>ave</code></pre>
+<pre><code> function (x, ..., FUN = mean)
+ {
+ if (missing(...))
+ x[] <- FUN(x)
+ else {
+ g <- interaction(...)
+ split(x, g) <- lapply(split(x, g), FUN)
+ }
+ x
+ }
+ <bytecode: 0x0000020fc12974b8>
+ <environment: namespace:stats></code></pre>
+<p>Despite its (rather generic and uninformative) name, I like to think
+of <code>ave()</code> as actually belonging to the <code>*apply()</code>
+family of functions, having particularly close ties to
+<code>tapply()</code>.</p>
+<p>A unique feature of <code>ave()</code> is the invariant that it
+<strong>returns a vector of the same length as the input</strong>. And
+if you use an aggregating function like <code>sum()</code> or
+<code>mean()</code>, it simply repeats those values over the
+observations on the basis of their grouping.</p>
+<p>For example, whereas <code>tapply()</code> can be used to summarize
+the average <code>mpg</code> by <code>cyl</code>:</p>
+<pre class="r"><code>tapply(mtcars$mpg, mtcars$cyl, FUN = mean)</code></pre>
+<pre><code> 4 6 8
+ 26.66364 19.74286 15.10000</code></pre>
+<p>The same syntax with <code>ave()</code> will repeat those values over
+each element of the input vector:</p>
+<pre class="r"><code>ave(mtcars$mpg, mtcars$cyl, FUN = mean)</code></pre>
+<pre><code> [1] 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364
+ [9] 26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000
+ [17] 15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000
+ [25] 15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364</code></pre>
+<p>You can also get to this output from <code>tapply()</code> with an
+extra step of vectorized indexing:</p>
+<pre class="r"><code>tapply(mtcars$mpg, mtcars$cyl, FUN = mean)[as.character(mtcars$cyl)]</code></pre>
+<pre><code> 6 6 4 6 8 6 8 4
+ 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364
+ 4 6 6 8 8 8 8 8
+ 26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000
+ 8 4 4 4 4 8 8 8
+ 15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000
+ 8 4 4 4 8 6 8 4
+ 15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364</code></pre>
+<h2 id="the-problem">The problem</h2>
+<p>Nothing sparks more joy than when a base R function helps you write
+more “tidy” code. I’ve talked about this at length before with
+<code>outer()</code> in a <a
+href="https://yjunechoe.github.io/posts/2023-06-11-row-relational-operations/">prior
+blog post on <code>dplyr::slice()</code></a>, and here I want to show a
+cool <code>ave()</code> + <code>dplyr::mutate()</code> combo.</p>
+<p>This example is adapted from a reprex by <a
+href="https://cedricscherer.netlify.app/">Cédric Scherer</a><a
+href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> on the
+DSLC (previously R4DS) slack.</p>
+<p>Given an input of multiple discrete columns and the frequencies of
+these values:</p>
+<pre class="r"><code>input <- data.frame(
+ a = c("A", "A", "A", "B"),
+ b = c("X", "Y", "Y", "Z"),
+ c = c("M", "N", "O", "O"),
+ freq = c(5, 12, 3, 7)
+)
+input</code></pre>
+<pre><code> a b c freq
+ 1 A X M 5
+ 2 A Y N 12
+ 3 A Y O 3
+ 4 B Z O 7</code></pre>
+<p>The task is to add new columns named <code>freq_*</code> that show
+the total frequency of the values in each column:</p>
+<pre class="r"><code>output <- data.frame(
+ a = c("A", "A", "A", "B"),
+ freq_a = c(20, 20, 20, 7),
+ b = c("X", "Y", "Y", "Z"),
+ freq_b = c(5, 15, 15, 7),
+ c = c("M", "N", "O", "O"),
+ freq_c = c(5, 12, 10, 10),
+ freq = c(5, 12, 3, 7)
+)
+output</code></pre>
+<pre><code> a freq_a b freq_b c freq_c freq
+ 1 A 20 X 5 M 5 5
+ 2 A 20 Y 15 N 12 12
+ 3 A 20 Y 15 O 10 3
+ 4 B 7 Z 7 O 10 7</code></pre>
+<p>So for example, in column <code>a</code> the value <code>"A"</code>
+is associated with values <code>5</code>, <code>12</code>, and
+<code>3</code> in the <code>freq</code> column, so a new
+<code>freq_a</code> column should be created to hold their total
+frequency <code>5 + 12 + 3</code> and associate that value
+(<code>20</code>) with all occurrences of <code>"A"</code> in the
+<code>a</code> column.</p>
+<h2 id="some-tidyverse-solutions">Some <code>{tidyverse}</code>
+solutions</h2>
+<p>The gut feeling is that this seems to lack a straightforwardly “tidy”
+solution. I mean, the input isn’t even <strong>tidy</strong><a
+href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a> in the
+first place!</p>
+<p>So maybe we’d be better off starting with pivoted, tidy data to
+construct a tidy solution:</p>
+<pre class="r"><code>library(tidyverse)
+input %>%
+ pivot_longer(-freq)</code></pre>
+<pre><code> # A tibble: 12 × 3
+ freq name value
+ <dbl> <chr> <chr>
+ 1 5 a A
+ 2 5 b X
+ 3 5 c M
+ 4 12 a A
+ 5 12 b Y
+ 6 12 c N
+ 7 3 a A
+ 8 3 b Y
+ 9 3 c O
+ 10 7 a B
+ 11 7 b Z
+ 12 7 c O</code></pre>
+<p>But recall that the desired output is of a wide form like the input,
+so it looks like our tidy solution will require some indirection,
+involving something like:</p>
+<pre class="r"><code>input %>%
+ pivot_longer(-freq) %>%
+ ... %>%
+ pivot_wider(...)</code></pre>
+<p>Or maybe you’d rather tackle this with some
+<code>left_join()</code>s, like:</p>
+<pre class="r"><code>input %>%
+ left_join(summarize(input, freq_a = sum(freq), .by = a)) %>%
+ ...</code></pre>
+<p>I’ll note that there’s actually also an idiomatic
+<code>{dplyr}</code>-solution to this using the lesser-known function
+<code>add_count()</code>, but you can’t avoid the repetitiveness problem
+because it doesn’t vectorize over the first argument:</p>
+<pre class="r"><code>input %>%
+ add_count(a, wt = freq, name = "freq_a") %>%
+ add_count(b, wt = freq, name = "freq_b") %>%
+ add_count(c, wt = freq, name = "freq_c")</code></pre>
+<pre><code> a b c freq freq_a freq_b freq_c
+ 1 A X M 5 20 5 5
+ 2 A Y N 12 20 15 12
+ 3 A Y O 3 20 15 10
+ 4 B Z O 7 7 7 10</code></pre>
+<p>You could try to scale this <code>add_count()</code> solution with
+<code>reduce()</code> (see my previous blog post on <a
+href="https://yjunechoe.github.io/posts/2020-12-13-collapse-repetitive-piping-with-reduce/">collapsing
+repetitive piping</a>), but now we’re straying very far from “tidy”
+territory:</p>
+<pre class="r"><code>input %>%
+ purrr::reduce(
+ c("a", "b", "c"),
+ ~ .x %>%
+ add_count(.data[[.y]], wt = freq, name = paste0("freq_", .y)),
+ .init = .
+ )</code></pre>
+<pre><code> a b c freq freq_a freq_b freq_c
+ 1 A X M 5 20 5 5
+ 2 A Y N 12 20 15 12
+ 3 A Y O 3 20 15 10
+ 4 B Z O 7 7 7 10</code></pre>
+<p>IMO this problem is actually a really good thinking exercise for the
+“average {dplyr} user”, so I encourage you to take a stab at this
+yourself before proceeding if you’ve read this far!</p>
+<h2 id="an-ave-dplyr-solution">An <code>ave()</code> +
+<code>{dplyr}</code> solution</h2>
+<p>The crucial piece of the puzzle here is to think a little outside the
+box, beyond “data(frame) wrangling”.</p>
+<p>The problem becomes simpler once we think about it in terms of
+“(column) vector wrangling” first, and that’s where
+<code>ave()</code> comes in!</p>
+<p>I’ll start with the cake first - this is the one-liner
+<code>ave()</code> solution I advocated for:</p>
+<pre class="r"><code>input %>%
+ mutate(across(a:c, ~ ave(freq, .x, FUN = sum), .names = "freq_{.col}"))</code></pre>
+<pre><code> a b c freq freq_a freq_b freq_c
+ 1 A X M 5 20 5 5
+ 2 A Y N 12 20 15 12
+ 3 A Y O 3 20 15 10
+ 4 B Z O 7 7 7 10</code></pre>
+<p>Taking column <code>freq_a</code> as an example, the
+<code>ave()</code> part of the solution essentially creates this vector of
+summed-up <code>freq</code> values by the categories of
+<code>a</code>:</p>
+<pre class="r"><code>ave(input$freq, input$a, FUN = sum)</code></pre>
+<pre><code> [1] 20 20 20 7</code></pre>
+<p>From there, <code>across()</code> handles the iteration over columns
+and, as an added bonus, the naming of the new columns in convenient
+<code>{glue}</code> syntax (<code>"freq_{.col}"</code>).</p>
+<p>It’s the perfect mashup of base R + tidyverse. Base R takes care of
+the problem at the vector level with a split-apply-combine that’s
+concisely expressed with <code>ave()</code>, and tidyverse scales that
+solution up to the dataframe level with <code>mutate()</code> and
+<code>across()</code>.</p>
+<p>tidyverse 🤝 base R</p>
+<h2 id="sessioninfo">sessionInfo()</h2>
+<pre class="r"><code>sessionInfo()</code></pre>
+<pre><code> R version 4.3.3 (2024-02-29 ucrt)
+ Platform: x86_64-w64-mingw32/x64 (64-bit)
+ Running under: Windows 11 x64 (build 22631)
+
+ Matrix products: default
+
+
+ locale:
+ [1] LC_COLLATE=English_United States.utf8
+ [2] LC_CTYPE=English_United States.utf8
+ [3] LC_MONETARY=English_United States.utf8
+ [4] LC_NUMERIC=C
+ [5] LC_TIME=English_United States.utf8
+
+ time zone: Asia/Seoul
+ tzcode source: internal
+
+ attached base packages:
+ [1] stats graphics grDevices utils datasets methods base
+
+ other attached packages:
+ [1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
+ [5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
+ [9] tidyverse_2.0.0 ggplot2_3.5.1
+
+ loaded via a namespace (and not attached):
+ [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 xml2_1.3.6
+ [5] stringi_1.8.4 distill_1.6 hms_1.1.3 digest_0.6.35
+ [9] magrittr_2.0.3 evaluate_0.23 grid_4.3.3 timechange_0.2.0
+ [13] bookdown_0.38 fastmap_1.1.1 rprojroot_2.0.4 jsonlite_1.8.8
+ [17] fansi_1.0.6 scales_1.3.0 jquerylib_0.1.4 cli_3.6.2
+ [21] rlang_1.1.4 munsell_0.5.0 withr_3.0.0 cachem_1.0.8
+ [25] yaml_2.3.8 tools_4.3.3 tzdb_0.4.0 memoise_2.0.1
+ [29] colorspace_2.1-0 vctrs_0.6.5 R6_2.5.1 mime_0.12
+ [33] png_0.1-8 lifecycle_1.0.4 fontawesome_0.5.2 pkgconfig_2.0.3
+ [37] pillar_1.9.0 bslib_0.7.0 gtable_0.3.5 glue_1.7.0
+ [41] xfun_0.44 tidyselect_1.2.1 rstudioapi_0.16.0 knitr_1.47
+ [45] htmltools_0.5.8.1 rmarkdown_2.27 compiler_4.3.3 askpass_1.2.0
+ [49] downlit_0.4.3 openssl_2.1.1</code></pre>
+<pre class="r distill-force-highlighting-css"><code></code></pre>
+<div class="footnotes footnotes-end-of-document">
+<hr />
+<ol>
+<li id="fn1"><p>And check out the elusive <code>split<-</code>
+function!<a href="#fnref1" class="footnote-back">↩︎</a></p></li>
+<li id="fn2"><p>Who I can only assume was needing this for a fancy data
+viz thing 😆<a href="#fnref2" class="footnote-back">↩︎</a></p></li>
+<li id="fn3"><p>I mean that in the technical sense here. In this
+problem, the unit of observation is the “cells” of the input columns
+(the values “A”, “B”, “X”, “Y”, etc.).<a href="#fnref3"
+class="footnote-back">↩︎</a></p></li>
+</ol>
+</div>
+ 478d5b66974022ae5d6b9c8ba58783a7
dplyr
https://yjunechoe.github.io/posts/2024-06-09-ave-for-the-average
- Sun, 09 Jun 2024 00:00:00 +0000
+ Sat, 08 Jun 2024 00:00:00 +0000
-
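
Reviewer note: to sanity-check the `freq_*` values shown in the feed content above without loading `{dplyr}`, the following base R sketch reproduces them. The loop over `cols` stands in for `across(a:c, ...)`; it illustrates the same idea and is not the post's own method.

```r
input <- data.frame(
  a = c("A", "A", "A", "B"),
  b = c("X", "Y", "Y", "Z"),
  c = c("M", "N", "O", "O"),
  freq = c(5, 12, 3, 7)
)

cols <- c("a", "b", "c")
result <- input

# One ave() call per grouping column, stored as freq_<column>
result[paste0("freq_", cols)] <- lapply(cols, function(col) {
  ave(input$freq, input[[col]], FUN = sum)
})

result
```

Because `ave()` returns a vector as long as its input, each result can be assigned straight into a new column with no join or pivot step.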
diff --git a/docs/posts/2024-06-09-ave-for-the-average/index.html b/docs/posts/2024-06-09-ave-for-the-average/index.html
index 018d471..e957f89 100644
--- a/docs/posts/2024-06-09-ave-for-the-average/index.html
+++ b/docs/posts/2024-06-09-ave-for-the-average/index.html
@@ -2703,7 +2703,7 @@
Contents
 I think it’s safe to say that the average {dplyr} user does not know the ave() function. For that audience, this is a short appreciation post on ave(), a case of tidyverse and base R.
 ave()
-ave() is a split-apply-function in base R (specifically, {stats}). It’s a pretty short function - maybe you can make out what it does from just reading the code1
+ave() is a split-apply-combine function in base R (specifically, {stats}). It’s a pretty short function - maybe you can make out what it does from just reading the code1
 ave
@@ -2718,7 +2718,7 @@ ave()
 Despite its (rather generic and uninformative) name, I like to think of ave() as actually belonging to the *apply() family of functions, having particularly close ties to tapply().
 {tidyverse} solutions
-I’ll note that there’s actually also an idiomatic {dplyr}-solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because it doesn’t vectorize on the first argument:
+I’ll note that there’s actually also an idiomatic {dplyr}-solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because it doesn’t vectorize over the first argument:
 input %>%
diff --git a/docs/posts/posts.json b/docs/posts/posts.json
index 9baa7bf..e607f47 100644
--- a/docs/posts/posts.json
+++ b/docs/posts/posts.json
@@ -13,10 +13,10 @@
"categories": [
"dplyr"
],
- "contents": "\r\n\r\nContents\r\nave()\r\nThe problem\r\nSome {tidyverse} solutions\r\nAn ave() + {dplyr} solution\r\nsessionInfo()\r\n\r\nI think it’s safe to say that the average {dplyr} user does not know the ave() function. For that audience, this is a short appreciation post on ave(), a case of tidyverse and base R.\r\nave()\r\nave() is a split-apply-function in base R (specifically, {stats}). It’s a pretty short function - maybe you can make out what it does from just reading the code1\r\n\r\n\r\nave\r\n\r\n function (x, ..., FUN = mean) \r\n {\r\n if (missing(...)) \r\n x[] <- FUN(x)\r\n else {\r\n g <- interaction(...)\r\n split(x, g) <- lapply(split(x, g), FUN)\r\n }\r\n x\r\n }\r\n \r\n \r\n\r\nDespite its (rather generic and uninformative) name, I like to think of ave() as actually belonging to the *apply() family of functions, having particularly close ties to tapply().\r\nA unique feature of ave() is the invariant that it returns a vector of the same length as the input. And if you use an aggregating function like sum() or mean(), it simply repeats those values over the observations on the basis of their grouping.\r\nFor example, whereas tapply() can be used to summarize the average mpg by cyl:\r\n\r\n\r\ntapply(mtcars$mpg, mtcars$cyl, FUN = mean)\r\n\r\n 4 6 8 \r\n 26.66364 19.74286 15.10000\r\n\r\nThe same syntax with ave() will repeat those values over each element of the input vector:\r\n\r\n\r\nave(mtcars$mpg, mtcars$cyl, FUN = mean)\r\n\r\n [1] 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364\r\n [9] 26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000\r\n [17] 15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000\r\n [25] 15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364\r\n\r\nYou can also get to this output from tapply() with an extra step of vectorized indexing:\r\n\r\n\r\ntapply(mtcars$mpg, mtcars$cyl, FUN = mean)[as.character(mtcars$cyl)]\r\n\r\n 6 6 4 6 8 6 
8 4 \r\n 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364 \r\n 4 6 6 8 8 8 8 8 \r\n 26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000 \r\n 8 4 4 4 4 8 8 8 \r\n 15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000 \r\n 8 4 4 4 8 6 8 4 \r\n 15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364\r\n\r\nThe problem\r\nNothing sparks more joy than when a base R function helps you write more “tidy” code. I’ve talked about this in length before with outer() in a prior blog post on dplyr::slice(), and here I want to show a cool ave() + dplyr::mutate() combo.\r\nThis example is adapted from a reprex by Cédric Scherer2 on the DSLC (previously R4DS) slack.\r\nGiven an input of multiple discrete columns and the frequencies of these values:\r\n\r\n\r\ninput <- data.frame(\r\n a = c(\"A\", \"A\", \"A\", \"B\"), \r\n b = c(\"X\", \"Y\", \"Y\", \"Z\"), \r\n c = c(\"M\", \"N\", \"O\", \"O\"), \r\n freq = c(5, 12, 3, 7)\r\n)\r\ninput\r\n\r\n a b c freq\r\n 1 A X M 5\r\n 2 A Y N 12\r\n 3 A Y O 3\r\n 4 B Z O 7\r\n\r\nThe task is to add new columns named freq_* that show the total frequency of the values in each column:\r\n\r\n\r\noutput <- data.frame(\r\n a = c(\"A\", \"A\", \"A\", \"B\"), \r\n freq_a = c(20, 20, 20, 7),\r\n b = c(\"X\", \"Y\", \"Y\", \"Z\"),\r\n freq_b = c(5, 15, 15, 7), \r\n c = c(\"M\", \"N\", \"O\", \"O\"), \r\n freq_c = c(5, 12, 10, 10), \r\n freq = c(5, 12, 3, 7)\r\n)\r\noutput\r\n\r\n a freq_a b freq_b c freq_c freq\r\n 1 A 20 X 5 M 5 5\r\n 2 A 20 Y 15 N 12 12\r\n 3 A 20 Y 15 O 10 3\r\n 4 B 7 Z 7 O 10 7\r\n\r\nSo for example, in column a the value \"A\" is associated with values 5, 12, and 3 in the freq column, so a new freq_a column should be created to track their total frequencies 5 + 12 + 3 and associate that value (20) for all occurrences of \"A\" in the a column.\r\nSome {tidyverse} solutions\r\nThe gut feeling is that this seems to lack a straightforwardly “tidy” solution. 
I mean, the input isn’t even tidy3 in the first place!\r\nSo maybe we’d be better off starting with a pivoted tidy data for constructing a tidy solution:\r\n\r\n\r\nlibrary(tidyverse)\r\ninput %>% \r\n pivot_longer(-freq)\r\n\r\n # A tibble: 12 × 3\r\n freq name value\r\n \r\n 1 5 a A \r\n 2 5 b X \r\n 3 5 c M \r\n 4 12 a A \r\n 5 12 b Y \r\n 6 12 c N \r\n 7 3 a A \r\n 8 3 b Y \r\n 9 3 c O \r\n 10 7 a B \r\n 11 7 b Z \r\n 12 7 c O\r\n\r\nBut recall that the desired output is of a wide form like the input, so it looks like our tidy solution will require some indirection, involving something like:\r\n\r\n\r\ninput %>% \r\n pivot_longer(-freq) %>% \r\n ... %>% \r\n pivot_wider(...)\r\n\r\n\r\nOr maybe you’d rather tackle this with some left_join()s, like:\r\n\r\n\r\ninput %>% \r\n left_join(summarize(input, freq_a = sum(freq), .by = a)) %>% \r\n ...\r\n\r\n\r\nI’ll note that there’s actually also an idiomatic {dplyr}-solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because it doesn’t vectorize on the first argument:\r\n\r\n\r\ninput %>% \r\n add_count(a, wt = freq, name = \"freq_a\") %>% \r\n add_count(b, wt = freq, name = \"freq_b\") %>% \r\n add_count(c, wt = freq, name = \"freq_c\")\r\n\r\n a b c freq freq_a freq_b freq_c\r\n 1 A X M 5 20 5 5\r\n 2 A Y N 12 20 15 12\r\n 3 A Y O 3 20 15 10\r\n 4 B Z O 7 7 7 10\r\n\r\nYou could try to scale this add_count() solution with reduce() (see my previous blog post on collapsing repetitive piping), but now we’re straying very far from the “tidy” territory:\r\n\r\n\r\ninput %>% \r\n purrr::reduce(\r\n c(\"a\", \"b\", \"c\"),\r\n ~ .x %>% \r\n add_count(.data[[.y]], wt = freq, name = paste0(\"freq_\", .y)),\r\n .init = .\r\n )\r\n\r\n a b c freq freq_a freq_b freq_c\r\n 1 A X M 5 20 5 5\r\n 2 A Y N 12 20 15 12\r\n 3 A Y O 3 20 15 10\r\n 4 B Z O 7 7 7 10\r\n\r\nIMO this problem is actually a really good thinking exercise for the “average {dplyr} user”, so I encourage 
you to take a stab at this yourself before proceeding if you’ve read this far!\r\nAn ave() + {dplyr} solution\r\nThe crucial piece of the puzzle here is to think a little outside the box, beyond “data(frame) wrangling”.\r\nIt helps to simplify the problem once we think about the problem in terms of “(column) vector wrangling” first, and that’s where ave() comes in!\r\nI’ll start with the cake first - this is the one-liner ave() solution I advocated for:\r\n\r\n\r\ninput %>% \r\n mutate(across(a:c, ~ ave(freq, .x, FUN = sum), .names = \"freq_{.col}\"))\r\n\r\n a b c freq freq_a freq_b freq_c\r\n 1 A X M 5 20 5 5\r\n 2 A Y N 12 20 15 12\r\n 3 A Y O 3 20 15 10\r\n 4 B Z O 7 7 7 10\r\n\r\nTaking column freq_a as an example, the ave() part of the solution essential creates this vector of summed-up freq values by the categories of a:\r\n\r\n\r\nave(input$freq, input$a, FUN = sum)\r\n\r\n [1] 20 20 20 7\r\n\r\nFrom there, across() handles the iteration over columns and, as an added bonus, the naming of the new columns in convenient {glue} syntax (\"freq_{.col}\").\r\nIt’s the perfect mashup of base R + tidyverse. 
Base R takes care of the problem at the vector level with a split-apply-combine that’s concisely expressed with ave(), and tidyverse scales that solution up to the dataframe level with mutate() and across().\r\ntidyverse 🤝 base R\r\nsessionInfo()\r\n\r\n\r\nsessionInfo()\r\n\r\n R version 4.3.3 (2024-02-29 ucrt)\r\n Platform: x86_64-w64-mingw32/x64 (64-bit)\r\n Running under: Windows 11 x64 (build 22631)\r\n \r\n Matrix products: default\r\n \r\n \r\n locale:\r\n [1] LC_COLLATE=English_United States.utf8 \r\n [2] LC_CTYPE=English_United States.utf8 \r\n [3] LC_MONETARY=English_United States.utf8\r\n [4] LC_NUMERIC=C \r\n [5] LC_TIME=English_United States.utf8 \r\n \r\n time zone: Asia/Seoul\r\n tzcode source: internal\r\n \r\n attached base packages:\r\n [1] stats graphics grDevices utils datasets methods base \r\n \r\n other attached packages:\r\n [1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 \r\n [5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 \r\n [9] tidyverse_2.0.0 ggplot2_3.5.1 \r\n \r\n loaded via a namespace (and not attached):\r\n [1] gtable_0.3.5 jsonlite_1.8.8 compiler_4.3.3 tidyselect_1.2.1 \r\n [5] jquerylib_0.1.4 scales_1.3.0 yaml_2.3.8 fastmap_1.1.1 \r\n [9] R6_2.5.1 generics_0.1.3 knitr_1.47 distill_1.6 \r\n [13] munsell_0.5.0 tzdb_0.4.0 bslib_0.7.0 pillar_1.9.0 \r\n [17] rlang_1.1.4 utf8_1.2.4 stringi_1.8.4 cachem_1.0.8 \r\n [21] xfun_0.44 sass_0.4.9 timechange_0.2.0 memoise_2.0.1 \r\n [25] cli_3.6.2 withr_3.0.0 magrittr_2.0.3 digest_0.6.35 \r\n [29] grid_4.3.3 rstudioapi_0.16.0 hms_1.1.3 lifecycle_1.0.4 \r\n [33] vctrs_0.6.5 downlit_0.4.3 evaluate_0.23 glue_1.7.0 \r\n [37] fansi_1.0.6 colorspace_2.1-0 rmarkdown_2.27 tools_4.3.3 \r\n [41] pkgconfig_2.0.3 htmltools_0.5.8.1\r\n\r\n\r\nAnd check out the elusive split<- function!↩︎\r\nWho I can only assume was needing this for a fancy data viz thing 😆↩︎\r\nI mean that in the technical sense here. 
In this problem, the unit of observation is the “cells” of the input columns (the values “A”, “B”, “X”, “Y”, etc.).↩︎\r\n",
+ "contents": "\r\n\r\nContents\r\nave()\r\nThe problem\r\nSome {tidyverse} solutions\r\nAn ave() + {dplyr} solution\r\nsessionInfo()\r\n\r\nI think it’s safe to say that the average {dplyr} user does not know the ave() function. For that audience, this is a short appreciation post on ave(), a case of tidyverse and base R.\r\nave()\r\nave() is a split-apply-combine function in base R (specifically, {stats}). It’s a pretty short function - maybe you can make out what it does from just reading the code1\r\n\r\n\r\nave\r\n\r\n function (x, ..., FUN = mean) \r\n {\r\n if (missing(...)) \r\n x[] <- FUN(x)\r\n else {\r\n g <- interaction(...)\r\n split(x, g) <- lapply(split(x, g), FUN)\r\n }\r\n x\r\n }\r\n \r\n \r\n\r\nDespite its (rather generic and uninformative) name, I like to think of ave() as actually belonging to the *apply() family of functions, having particularly close ties to tapply().\r\nA unique feature of ave() is the invariant that it returns a vector of the same length as the input. 
And if you use an aggregating function like sum() or mean(), it simply repeats those values over the observations on the basis of their grouping.\r\nFor example, whereas tapply() can be used to summarize the average mpg by cyl:\r\n\r\n\r\ntapply(mtcars$mpg, mtcars$cyl, FUN = mean)\r\n\r\n 4 6 8 \r\n 26.66364 19.74286 15.10000\r\n\r\nThe same syntax with ave() will repeat those values over each element of the input vector:\r\n\r\n\r\nave(mtcars$mpg, mtcars$cyl, FUN = mean)\r\n\r\n [1] 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364\r\n [9] 26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000\r\n [17] 15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000\r\n [25] 15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364\r\n\r\nYou can also get to this output from tapply() with an extra step of vectorized indexing:\r\n\r\n\r\ntapply(mtcars$mpg, mtcars$cyl, FUN = mean)[as.character(mtcars$cyl)]\r\n\r\n 6 6 4 6 8 6 8 4 \r\n 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364 \r\n 4 6 6 8 8 8 8 8 \r\n 26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000 \r\n 8 4 4 4 4 8 8 8 \r\n 15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000 \r\n 8 4 4 4 8 6 8 4 \r\n 15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364\r\n\r\nThe problem\r\nNothing sparks more joy than when a base R function helps you write more “tidy” code. 
I’ve talked about this in length before with outer() in a prior blog post on dplyr::slice(), and here I want to show a cool ave() + dplyr::mutate() combo.\r\nThis example is adapted from a reprex by Cédric Scherer2 on the DSLC (previously R4DS) slack.\r\nGiven an input of multiple discrete columns and the frequencies of these values:\r\n\r\n\r\ninput <- data.frame(\r\n a = c(\"A\", \"A\", \"A\", \"B\"), \r\n b = c(\"X\", \"Y\", \"Y\", \"Z\"), \r\n c = c(\"M\", \"N\", \"O\", \"O\"), \r\n freq = c(5, 12, 3, 7)\r\n)\r\ninput\r\n\r\n a b c freq\r\n 1 A X M 5\r\n 2 A Y N 12\r\n 3 A Y O 3\r\n 4 B Z O 7\r\n\r\nThe task is to add new columns named freq_* that show the total frequency of the values in each column:\r\n\r\n\r\noutput <- data.frame(\r\n a = c(\"A\", \"A\", \"A\", \"B\"), \r\n freq_a = c(20, 20, 20, 7),\r\n b = c(\"X\", \"Y\", \"Y\", \"Z\"),\r\n freq_b = c(5, 15, 15, 7), \r\n c = c(\"M\", \"N\", \"O\", \"O\"), \r\n freq_c = c(5, 12, 10, 10), \r\n freq = c(5, 12, 3, 7)\r\n)\r\noutput\r\n\r\n a freq_a b freq_b c freq_c freq\r\n 1 A 20 X 5 M 5 5\r\n 2 A 20 Y 15 N 12 12\r\n 3 A 20 Y 15 O 10 3\r\n 4 B 7 Z 7 O 10 7\r\n\r\nSo for example, in column a the value \"A\" is associated with values 5, 12, and 3 in the freq column, so a new freq_a column should be created to track their total frequencies 5 + 12 + 3 and associate that value (20) for all occurrences of \"A\" in the a column.\r\nSome {tidyverse} solutions\r\nThe gut feeling is that this seems to lack a straightforwardly “tidy” solution. 
I mean, the input isn’t even tidy3 in the first place!\r\nSo maybe we’d be better off starting with a pivoted tidy data for constructing a tidy solution:\r\n\r\n\r\nlibrary(tidyverse)\r\ninput %>% \r\n pivot_longer(-freq)\r\n\r\n # A tibble: 12 × 3\r\n freq name value\r\n \r\n 1 5 a A \r\n 2 5 b X \r\n 3 5 c M \r\n 4 12 a A \r\n 5 12 b Y \r\n 6 12 c N \r\n 7 3 a A \r\n 8 3 b Y \r\n 9 3 c O \r\n 10 7 a B \r\n 11 7 b Z \r\n 12 7 c O\r\n\r\nBut recall that the desired output is of a wide form like the input, so it looks like our tidy solution will require some indirection, involving something like:\r\n\r\n\r\ninput %>% \r\n pivot_longer(-freq) %>% \r\n ... %>% \r\n pivot_wider(...)\r\n\r\n\r\nOr maybe you’d rather tackle this with some left_join()s, like:\r\n\r\n\r\ninput %>% \r\n left_join(summarize(input, freq_a = sum(freq), .by = a)) %>% \r\n ...\r\n\r\n\r\nI’ll note that there’s actually also an idiomatic {dplyr}-solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because it doesn’t vectorize over the first argument:\r\n\r\n\r\ninput %>% \r\n add_count(a, wt = freq, name = \"freq_a\") %>% \r\n add_count(b, wt = freq, name = \"freq_b\") %>% \r\n add_count(c, wt = freq, name = \"freq_c\")\r\n\r\n a b c freq freq_a freq_b freq_c\r\n 1 A X M 5 20 5 5\r\n 2 A Y N 12 20 15 12\r\n 3 A Y O 3 20 15 10\r\n 4 B Z O 7 7 7 10\r\n\r\nYou could try to scale this add_count() solution with reduce() (see my previous blog post on collapsing repetitive piping), but now we’re straying very far from the “tidy” territory:\r\n\r\n\r\ninput %>% \r\n purrr::reduce(\r\n c(\"a\", \"b\", \"c\"),\r\n ~ .x %>% \r\n add_count(.data[[.y]], wt = freq, name = paste0(\"freq_\", .y)),\r\n .init = .\r\n )\r\n\r\n a b c freq freq_a freq_b freq_c\r\n 1 A X M 5 20 5 5\r\n 2 A Y N 12 20 15 12\r\n 3 A Y O 3 20 15 10\r\n 4 B Z O 7 7 7 10\r\n\r\nIMO this problem is actually a really good thinking exercise for the “average {dplyr} user”, so I encourage 
you to take a stab at this yourself before proceeding if you’ve read this far!\r\nAn ave() + {dplyr} solution\r\nThe crucial piece of the puzzle here is to think a little outside the box, beyond “data(frame) wrangling”.\r\nIt helps to simplify the problem once we think about the problem in terms of “(column) vector wrangling” first, and that’s where ave() comes in!\r\nI’ll start with the cake first - this is the one-liner ave() solution I advocated for:\r\n\r\n\r\ninput %>% \r\n mutate(across(a:c, ~ ave(freq, .x, FUN = sum), .names = \"freq_{.col}\"))\r\n\r\n a b c freq freq_a freq_b freq_c\r\n 1 A X M 5 20 5 5\r\n 2 A Y N 12 20 15 12\r\n 3 A Y O 3 20 15 10\r\n 4 B Z O 7 7 7 10\r\n\r\nTaking column freq_a as an example, the ave() part of the solution essentially creates this vector of summed-up freq values by the categories of a:\r\n\r\n\r\nave(input$freq, input$a, FUN = sum)\r\n\r\n [1] 20 20 20 7\r\n\r\nFrom there, across() handles the iteration over columns and, as an added bonus, the naming of the new columns in convenient {glue} syntax (\"freq_{.col}\").\r\nIt’s the perfect mashup of base R + tidyverse. 
Base R takes care of the problem at the vector level with a split-apply-combine that’s concisely expressed with ave(), and tidyverse scales that solution up to the dataframe level with mutate() and across().\r\ntidyverse 🤝 base R\r\nsessionInfo()\r\n\r\n\r\nsessionInfo()\r\n\r\n R version 4.3.3 (2024-02-29 ucrt)\r\n Platform: x86_64-w64-mingw32/x64 (64-bit)\r\n Running under: Windows 11 x64 (build 22631)\r\n \r\n Matrix products: default\r\n \r\n \r\n locale:\r\n [1] LC_COLLATE=English_United States.utf8 \r\n [2] LC_CTYPE=English_United States.utf8 \r\n [3] LC_MONETARY=English_United States.utf8\r\n [4] LC_NUMERIC=C \r\n [5] LC_TIME=English_United States.utf8 \r\n \r\n time zone: Asia/Seoul\r\n tzcode source: internal\r\n \r\n attached base packages:\r\n [1] stats graphics grDevices utils datasets methods base \r\n \r\n other attached packages:\r\n [1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 \r\n [5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 \r\n [9] tidyverse_2.0.0 ggplot2_3.5.1 \r\n \r\n loaded via a namespace (and not attached):\r\n [1] gtable_0.3.5 jsonlite_1.8.8 compiler_4.3.3 tidyselect_1.2.1 \r\n [5] jquerylib_0.1.4 scales_1.3.0 yaml_2.3.8 fastmap_1.1.1 \r\n [9] R6_2.5.1 generics_0.1.3 knitr_1.47 distill_1.6 \r\n [13] munsell_0.5.0 tzdb_0.4.0 bslib_0.7.0 pillar_1.9.0 \r\n [17] rlang_1.1.4 utf8_1.2.4 stringi_1.8.4 cachem_1.0.8 \r\n [21] xfun_0.44 sass_0.4.9 timechange_0.2.0 memoise_2.0.1 \r\n [25] cli_3.6.2 withr_3.0.0 magrittr_2.0.3 digest_0.6.35 \r\n [29] grid_4.3.3 rstudioapi_0.16.0 hms_1.1.3 lifecycle_1.0.4 \r\n [33] vctrs_0.6.5 downlit_0.4.3 evaluate_0.23 glue_1.7.0 \r\n [37] fansi_1.0.6 colorspace_2.1-0 rmarkdown_2.27 tools_4.3.3 \r\n [41] pkgconfig_2.0.3 htmltools_0.5.8.1\r\n\r\n\r\nAnd check out the elusive split<- function!↩︎\r\nWho I can only assume was needing this for a fancy data viz thing 😆↩︎\r\nI mean that in the technical sense here. 
In this problem, the unit of observation is the “cells” of the input columns (the values “A”, “B”, “X”, “Y”, etc.).↩︎\r\n",
"preview": "posts/2024-06-09-ave-for-the-average/preview.png",
- "last_modified": "2024-06-09T16:47:50+09:00",
- "input_file": {},
+ "last_modified": "2024-06-09T20:10:07+09:00",
+ "input_file": "ave-for-the-average.knit.md",
"preview_width": 926,
"preview_height": 328
},
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 55c9233..30d77e4 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -30,7 +30,7 @@
https://yjunechoe.github.io/posts/2024-06-09-ave-for-the-average/
- 2024-06-09T16:47:50+09:00
+ 2024-06-09T20:10:07+09:00
https://yjunechoe.github.io/posts/2024-03-04-args-args-args-args/