diff --git a/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.Rmd b/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.Rmd
index 90127ce..63bcb91 100644
--- a/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.Rmd
+++ b/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.Rmd
@@ -37,7 +37,7 @@ I think it's safe to say that the average `{dplyr}` user does not know the `ave(
 
 ## `ave()`
 
-`ave()` is a split-apply-function in base R (specifically, `{stats}`). It's a pretty short function - maybe you can make out what it does from just reading the code^[And check out the elusive `split<-` function!]
+`ave()` is a split-apply-combine function in base R (specifically, `{stats}`). It's a pretty short function - maybe you can make out what it does from just reading the code^[And check out the elusive `split<-` function!]
 
 ```{r}
 ave
@@ -130,7 +130,7 @@ input %>%
   ...
 ```
 
-I'll note that there's actually also an idiomatic `{dplyr}`-solution to this using the lesser-known function `add_count()`, but you can't avoid the repetitiveness problem because it doesn't vectorize on the first argument:
+I'll note that there's actually also an idiomatic `{dplyr}`-solution to this using the lesser-known function `add_count()`, but you can't avoid the repetitiveness problem because it doesn't vectorize over the first argument:
 
 ```{r}
 input %>%
diff --git a/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.html b/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.html
index 36cec4d..a63ad2d 100644
--- a/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.html
+++ b/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.html
@@ -1569,7 +1569,7 @@

Contents

I think it’s safe to say that the average {dplyr} user does not know the ave() function. For that audience, this is a short appreciation post on ave(), a case of tidyverse and base R.

ave()

-ave() is a split-apply-function in base R (specifically, {stats}). It’s a pretty short function - maybe you can make out what it does from just reading the code1
+ave() is a split-apply-combine function in base R (specifically, {stats}). It’s a pretty short function - maybe you can make out what it does from just reading the code1

ave
@@ -1584,7 +1584,7 @@

ave()

  } x }
- <bytecode: 0x0000020b91307c48>
+ <bytecode: 0x0000020fc12974b8>
  <environment: namespace:stats>
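For reference, the hunk above only shows the tail of the printed function (the two renders differ solely in the bytecode address). The full body, as printed in the post, is short enough to quote here; it is just a split / lapply / split<- round trip:

```r
# stats::ave, as printed in the post
function (x, ..., FUN = mean) 
{
    if (missing(...)) 
        x[] <- FUN(x)            # no grouping variables: apply FUN to the whole vector
    else {
        g <- interaction(...)                     # collapse grouping variables into one factor
        split(x, g) <- lapply(split(x, g), FUN)   # apply FUN within groups, write results back in place
    }
    x                            # always returns a vector as long as the input
}
```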

Despite its (rather generic and uninformative) name, I like to think of ave() as actually belonging to the *apply() family of functions, having particularly close ties to tapply().
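That tie is easiest to see with the post's own mtcars example (calls and outputs reproduced here for context): tapply() returns one value per group, while ave() recycles those group values back over every element of the input.

```r
# One mean per cyl group: a named vector of length 3
tapply(mtcars$mpg, mtcars$cyl, FUN = mean)
#>        4        6        8 
#> 26.66364 19.74286 15.10000

# Same arguments to ave(): the group means are repeated over all 32 rows,
# so the result has the same length as the input vector
ave(mtcars$mpg, mtcars$cyl, FUN = mean)
#> [1] 19.74286 19.74286 26.66364 19.74286 15.10000 ...
```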

@@ -1704,7 +1704,7 @@

Some {tidyverse} solutions

...
-I’ll note that there’s actually also an idiomatic {dplyr}-solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because it doesn’t vectorize on the first argument:
+I’ll note that there’s actually also an idiomatic {dplyr}-solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because it doesn’t vectorize over the first argument:

input %>% 
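The hunk cuts off at input %>%. For context, here is the add_count() chain that line begins, together with the ave() + across() one-liner the post ultimately advocates; the snippets and the input data come from the post itself, and only the library() call is added so the block is self-contained.

```r
library(dplyr)

input <- data.frame(
  a = c("A", "A", "A", "B"),
  b = c("X", "Y", "Y", "Z"),
  c = c("M", "N", "O", "O"),
  freq = c(5, 12, 3, 7)
)

# Repetitive: add_count() groups by one set of columns per call,
# so each freq_* column needs its own call
input %>%
  add_count(a, wt = freq, name = "freq_a") %>%
  add_count(b, wt = freq, name = "freq_b") %>%
  add_count(c, wt = freq, name = "freq_c")

# The post's one-liner: ave() does the grouped sums at the vector level,
# across() iterates over the columns and names the results "freq_{.col}"
input %>%
  mutate(across(a:c, ~ ave(freq, .x, FUN = sum), .names = "freq_{.col}"))
```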
diff --git a/docs/blog.html b/docs/blog.html
index 9c24c0e..4a82238 100644
--- a/docs/blog.html
+++ b/docs/blog.html
@@ -1,3576 +1,3576 @@
[3,576 lines elided: the generated blog listing is removed and re-added in full by the re-render; the visible listing content (post titles and tags) is unchanged]
diff --git a/docs/blog.xml b/docs/blog.xml
index c1750e8..de5efd8 100644
--- a/docs/blog.xml
+++ b/docs/blog.xml
@@ -12,15 +12,278 @@
[feed entry for "`ave()` for the average {dplyr} user": the feed dates move from Sun, 09 Jun 2024 00:00:00 +0000 to Sat, 08 Jun 2024 00:00:00 +0000, and the full rendered post body replaces the short "tidyverse 🤝 base R" description; the embedded HTML carries the same "split-apply-combine" and "vectorize over" wording fixes and the new bytecode address 0x0000020fc12974b8]
diff --git a/docs/posts/2024-06-09-ave-for-the-average/index.html b/docs/posts/2024-06-09-ave-for-the-average/index.html
index 018d471..e957f89 100644
--- a/docs/posts/2024-06-09-ave-for-the-average/index.html
+++ b/docs/posts/2024-06-09-ave-for-the-average/index.html
@@ -2703,7 +2703,7 @@

Contents

I think it’s safe to say that the average {dplyr} user does not know the ave() function. For that audience, this is a short appreciation post on ave(), a case of tidyverse and base R.

ave()

-ave() is a split-apply-function in base R (specifically, {stats}). It’s a pretty short function - maybe you can make out what it does from just reading the code1
+ave() is a split-apply-combine function in base R (specifically, {stats}). It’s a pretty short function - maybe you can make out what it does from just reading the code1

ave
@@ -2718,7 +2718,7 @@

ave()

  } x }
- <bytecode: 0x0000020b91307c48>
+ <bytecode: 0x0000020fc12974b8>
  <environment: namespace:stats>

Despite its (rather generic and uninformative) name, I like to think of ave() as actually belonging to the *apply() family of functions, having particularly close ties to tapply().

@@ -2838,7 +2838,7 @@

Some {tidyverse} solutions

...
-I’ll note that there’s actually also an idiomatic {dplyr}-solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because it doesn’t vectorize on the first argument:
+I’ll note that there’s actually also an idiomatic {dplyr}-solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because it doesn’t vectorize over the first argument:

input %>% 
diff --git a/docs/posts/posts.json b/docs/posts/posts.json
index 9baa7bf..e607f47 100644
--- a/docs/posts/posts.json
+++ b/docs/posts/posts.json
@@ -13,10 +13,10 @@
     "categories": [
       "dplyr"
     ],
-    "contents": "[full rendered post body, elided]",
+    "contents": "[full rendered post body, elided; regenerated with the same 'split-apply-combine' and 'vectorize over' wording fixes]",
     "preview": "posts/2024-06-09-ave-for-the-average/preview.png",
-    "last_modified": "2024-06-09T16:47:50+09:00",
-    "input_file": {},
+    "last_modified": "2024-06-09T20:10:07+09:00",
+    "input_file": "ave-for-the-average.knit.md",
     "preview_width": 926,
     "preview_height": 328
   },
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 55c9233..30d77e4 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -30,7 +30,7 @@
   
   
     https://yjunechoe.github.io/posts/2024-06-09-ave-for-the-average/
-    2024-06-09T16:47:50+09:00
+    2024-06-09T20:10:07+09:00
   
   
     https://yjunechoe.github.io/posts/2024-03-04-args-args-args-args/