diff --git a/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.Rmd b/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.Rmd
index 2623c33..074aafc 100644
--- a/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.Rmd
+++ b/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.Rmd
@@ -178,9 +178,9 @@ It's the perfect mashup of base R + tidyverse. Base R takes care of the problem
 
 tidyverse š¤ base R
 
-## Aside: data.table š¤ collapse
+## Aside: `{data.table}` š¤ `{collapse}`
 
-Since I wrote this blog post, I discovered that `{data.table}` recently added support for using `names(.SD)` in the LHS of the walrus `:=`. I'm so excited for this to hit the next release (v1.6.0)!
+Since I wrote this blog post, I discovered that `{data.table}` recently added support for using `names(.SD)` in the LHS of the walrus `:=`. I'm so excited for this to hit the [next release](https://rdatatable.gitlab.io/data.table/news/index.html) (v1.16.0)!
 
 I've been trying to be more mindful of showcasing `{data.table}` whenever I talk about `{dplyr}`, so here's a solution to compare with the `dplyr::across()` solution above.
 
@@ -209,7 +209,7 @@ ave(input_dt$freq, input_dt$a, FUN = sum)
 fsum(input_dt$freq, input_dt$a, TRA = "replace") # Also, TRA = 2
 ```
 
-So the version of the solution integrating `fsum()` would be:^[I couldn't show this here with this particular example, but another nice feature of `{collapse}` š¤ `{data.table}` is the fact that they do not shy away from consuming/producing matrices: see `scale()[,1]` vs. `fscale()` for a good example of this.]
+So a version of the solution integrating `fsum()` would be:^[I couldn't show this here with this particular example, but another nice feature of `{collapse}` š¤ `{data.table}` is the fact that they do not shy away from consuming/producing matrices: see `scale()[,1]` vs. `fscale()` for a good example of this.]
 
 ```{r}
 input_dt[, names(.SD) := NULL, .SDcols = patterns("^freq_")]

diff --git a/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.html b/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.html
index 2f753f1..398fccb 100644
--- a/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.html
+++ b/_posts/2024-06-09-ave-for-the-average/ave-for-the-average.html
@@ -1563,7 +1563,7 @@
 [rendered post text: table of contents and ave() discussion ā unchanged context]
-Aside: data.table š¤ collapse
+Aside: {data.table} š¤ {collapse}
 [unchanged context]
-Since I wrote this blog post, I discovered that {data.table} recently added support for using names(.SD) in the LHS of the walrus :=. Iām so excited for this to hit the next release (v1.6.0)!
+Since I wrote this blog post, I discovered that {data.table} recently added support for using names(.SD) in the LHS of the walrus :=. Iām so excited for this to hit the next release (v1.16.0)!
 [unchanged context]
-So the version of the solution integrating fsum() would be:4
+So a version of the solution integrating fsum() would be:4
 input_dt[, names(.SD) := NULL, .SDcols = patterns("^freq_")]
diff --git a/docs/posts/2024-06-09-ave-for-the-average/index.html b/docs/posts/2024-06-09-ave-for-the-average/index.html
index 2de8576..880ad62 100644
--- a/docs/posts/2024-06-09-ave-for-the-average/index.html
+++ b/docs/posts/2024-06-09-ave-for-the-average/index.html
@@ -2697,7 +2697,7 @@ Contents
 The problem
 Some {tidyverse} solutions
 An ave() + {dplyr} solution
-Aside: data.table š¤ collapse
+Aside: {data.table} š¤ {collapse}
sessionInfo()
@@ -2719,7 +2719,7 @@ ave()
     }
     x
 }
-<bytecode: 0x000002dc894a8f80>
+<bytecode: 0x0000029931326f80>
 <environment: namespace:stats>
 [rendered post text: ave() discussion and {tidyverse} solutions ā unchanged context]
-Since I wrote this blog post, I discovered that {data.table} recently added support for using names(.SD) in the LHS of the walrus :=. Iām so excited for this to hit the next release (v1.6.0)!
+Since I wrote this blog post, I discovered that {data.table} recently added support for using names(.SD) in the LHS of the walrus :=. Iām so excited for this to hit the next release (v1.16.0)!
 [unchanged context]
-So the version of the solution integrating fsum() would be:4
+So a version of the solution integrating fsum() would be:4
 input_dt[, names(.SD) := NULL, .SDcols = patterns("^freq_")]
diff --git a/docs/posts/posts.json b/docs/posts/posts.json
index ea930df..c2a2f8c 100644
--- a/docs/posts/posts.json
+++ b/docs/posts/posts.json
@@ -13,9 +13,9 @@
"categories": [
"dplyr"
],
-    "contents": "[full rendered post text, old version ā includes āAside: data.table š¤ collapseā, ā(v1.6.0)ā, and āSo the version of the solution integrating fsum() would be:ā]",
+    "contents": "[full rendered post text, new version ā includes āAside: {data.table} š¤ {collapse}ā, ā(v1.16.0)ā, and āSo a version of the solution integrating fsum() would be:ā]",
"preview": "posts/2024-06-09-ave-for-the-average/preview.png",
- "last_modified": "2024-06-23T13:26:59+09:00",
+ "last_modified": "2024-06-23T13:35:29+09:00",
"input_file": {},
"preview_width": 926,
"preview_height": 328
diff --git a/docs/search.json b/docs/search.json
index d058485..c746167 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -5,7 +5,7 @@
"title": "Blog Posts",
"author": [],
"contents": "\r\n\r\n\r\n\r\n\r\n",
- "last_modified": "2024-06-23T13:27:47+09:00"
+ "last_modified": "2024-06-23T13:35:47+09:00"
},
{
"path": "index.html",
@@ -13,21 +13,21 @@
"description": "Ph.D. Candidate in Linguistics",
"author": [],
"contents": "\r\n\r\n\r\n\r\n\r\n\r\n\r\n Education\r\n\r\n\r\nB.A. (hons.) Northwestern University (2016ā20)\r\n\r\n\r\nPh.D.Ā University of Pennsylvania (2020 ~)\r\n\r\n\r\n Interests\r\n\r\n\r\n(Computational) Psycholinguistics\r\n\r\n\r\nLanguage Acquisition\r\n\r\n\r\nSentence Processing\r\n\r\n\r\nProsody\r\n\r\n\r\nQuantitative Methods\r\n\r\n\r\n\r\n\r\n\r\n Methods:\r\n\r\nWeb-based experiments, eye-tracking, self-paced reading, corpus analysis\r\n\r\n\r\n\r\n Programming:\r\n\r\nR (fluent) | HTML/CSS, Javascript, Julia (proficient) | Python (coursework)\r\n\r\n\r\n\r\n\r\n\r\nI am a PhD candidate in Linguistics at the University of Pennsylvania, and a student affiliate of Penn MindCORE and the Language and Communication Sciences program. I am a psycholinguist broadly interested in experimental approaches to studying meaning, of various flavors. My advisor is Anna Papafragou and I am a member of the Language & Cognition Lab.\r\nI received my B.A. in Linguistics from Northwestern University, where I worked with Jennifer Cole, Masaya Yoshida, and Annette DāOnofrio. I also worked as a research assistant for the Language, Education, and Reading Neuroscience Lab. My thesis explored the role of prosodic focus in garden-path reanalysis.\r\nBeyond linguistics research, I have interests in data visualization, science communication, and the R programming language. I author packages in statistical computing and graphics (ex: ggtrace, jlmerclusterperm) and collaborate on other open-source software (ex: openalexR, pointblank). I also maintain a technical blog as a hobby and occasionally take on small statistical consulting projects.\r\n\r\n\r\n\r\n\r\ncontact me: yjchoe@sas.upenn.edu\r\n\r\n\r\n\r\n\r\n\r\n\r\n",
- "last_modified": "2024-06-23T13:27:49+09:00"
+ "last_modified": "2024-06-23T13:35:49+09:00"
},
{
"path": "news.html",
"title": "News",
"author": [],
"contents": "\r\n\r\n\r\nFor more of my personal news external/tangential to research\r\n2023\r\nAugust\r\nI was unfortunately not able to make it in person to JSM 2023, but my pre-recorded talk has been uploaded!\r\nJune\r\nMy package jlmerclusterperm was published on CRAN!\r\nApril\r\nI was accepted to SMLP (Summer School on Statistical Methods for Linguistics and Psychology), to be held in September at the University of Potsdam, Germany! I will be joining the āAdvanced methods in frequentist statistics with Juliaā stream. Huge thanks to MindCORE for funding my travels to attend!\r\nJanuary\r\nI received the ASA Statistical Computing and Graphics student award for my paper Sublayer modularity in the Grammar of Graphics! I will be presenting my work at the 2023 Joint Statistical Meetings in Toronto in August.\r\n2022\r\nSeptember\r\nI was invited to a Korean data science podcast dataholic (ė°ģ“ķ°ķė¦) to talk about my experience presenting at the RStudio and useR conferences! Part 1, Part 2\r\nAugust\r\nI led a workshop on IBEX and PCIbex with Nayoun Kim at the Seoul International Conference on Linguistics (SICOL 2022).\r\nJuly\r\nI attended my first in-person R conference at rstudio::conf(2022) and gave a talk on ggplot internals.\r\nJune\r\nI gave a talk on my package {ggtrace} at the useR! 2022 conference. I was awarded the diversity scholarship which covered my registration and workshop fees. My reflections\r\nI gave a talk at RLadies philly on using dplyrās slice() function for row-relational operations.\r\n2021\r\nJuly\r\nMy tutorial on custom fonts in R was featured as a highlight on the R Weekly podcast!\r\nJune\r\nI gave a talk at RLadies philly on using icon fonts for data viz! I also wrote a follow-up blog post that goes deeper into font rendering in R.\r\nMay\r\nSnowGlobe, a project started in my undergrad, was featured in an article by the Northwestern University Library. We also had a workshop for SnowGlobe which drew participants from over a hundred universities!\r\nJanuary\r\nI joined Nayoun Kim for a workshop on experimental syntax conducted in Korean and held at Sungkyunkwan University (Korea). I helped design materials for a session on scripting online experiments with IBEX, including interactive slides made with R!\r\n2020\r\nNovember\r\nI joined designer Will Chase on his stream to talk about the psycholinguistics of speech production for a data viz project on Michaelās speech errors in The Office. It was a very cool and unique opportunity to bring my two interests together!\r\nOctober\r\nMy tutorial on {ggplot2} stat_*() functions was featured as a highlight on the R Weekly podcast, which curates weekly updates from the R community.\r\nI became a data science tutor at MindCORE to help researchers at Penn with data visualization and R programming.\r\nSeptember\r\nI have moved to Philadelphia to start my PhD in Linguistics at the University of Pennsylvania!\r\nJune\r\nI graduated from Northwestern University with a B.A. in Linguistics (with honors)! I was also elected into Phi Beta Kappa and appointed as the Senior Marshal for Linguistics.\r\n\r\n\r\n\r\n",
- "last_modified": "2024-06-23T13:27:51+09:00"
+ "last_modified": "2024-06-23T13:35:50+09:00"
},
{
"path": "research.html",
"title": "Research",
"author": [],
"contents": "\r\n\r\nContents\r\nPeer-reviewed Papers\r\nConference Talks\r\nConference Presentations\r\nWorkshops led\r\nGuest lectures\r\nResearch activities in FOSS\r\nPapers\r\nTalks\r\nSoftware\r\n\r\nService\r\nEditor\r\nReviewer\r\n\r\n\r\nLinks: Google Scholar, Github, OSF\r\nPeer-reviewed Papers\r\nJune Choe, and Anna Papafragou. (2023). The acquisition of subordinate nouns as pragmatic inference. Journal of Memory and Language, 132, 104432. DOI: https://doi.org/10.1016/j.jml.2023.104432. PDF OSF\r\nJune Choe, Yiran Chen, May Pik Yu Chan, Aini Li, Xin Gao, and Nicole Holliday. (2022). Language-specific Effects on Automatic Speech Recognition Errors for World Englishes. In Proceedings of the 29th International Conference on Computational Linguistics, 7177ā7186.\r\nMay Pik Yu Chan, June Choe, Aini Li, Yiran Chen, Xin Gao, and Nicole Holliday. (2022). Training and typological bias in ASR performance for world Englishes. In Proceedings of Interspeech 2022, 1273-1277. DOI: 10.21437/Interspeech.2022-10869\r\nJune Choe, Masaya Yoshida, and Jennifer Cole. (2022). The role of prosodic focus in the reanalysis of garden path sentences: Depth of semantic processing impedes the revision of an erroneous local analysis. Glossa Psycholinguistics, 1(1). DOI: 10.5070/G601136\r\nJune Choe, and Anna Papafragou. (2022). The acquisition of subordinate nouns as pragmatic inference: Semantic alternatives modulate subordinate meanings. In Proceedings of the Annual Meeting of the Cognitive Science Society, 44, 2745-2752.\r\nSean McWeeny, Jinnie S. Choi, June Choe, Alexander LaTourette, Megan Y. Roberts, and Elizabeth S. Norton. (2022). Rapid automatized naming (RAN) as a kindergarten predictor of future reading in English: A systematic review and meta-analysis. Reading Research Quarterly, 57(4), 1187ā1211. DOI: 10.1002/rrq.467\r\nConference Talks\r\nJune Choe. Distributional signatures of superordinate nouns. Talk at the 10th MACSIM conference. 6 April 2024. 
University of Maryland, College Park, MD.\r\nJune Choe. Sub-layer modularity in the Grammar of Graphics. Talk at the 2023 Joint Statistical Meetings, 5-10 August 2023. Toronto, Canada. American Statistical Association (ASA) student paper award in Statistical Computing and Graphics. Paper\r\nJune Choe. Persona-based social expectations in sentence processing and comprehension. Talk at the Language, Stereotypes & Social Cognition workshop, 22-23 May, 2023. University of Pennsylvania, PA.\r\nJune Choe, and Anna Papafragou. Lexical alternatives and the acquisition of subordinate nouns. Talk at the 47th Boston University Conference on Language Development (BUCLD), 3-6 November, 2022. Boston University, Boston, MA. Slides\r\nJune Choe, Yiran Chen, May Pik Yu Chan, Aini Li, Xin Gao and Nicole Holliday. (2022). Language-specific Effects on Automatic Speech Recognition Errors in American English. Talk at the 28th International Conference on Computational Linguistics (CoLing), 12-17 October, 2022. Gyeongju, South Korea. Slides\r\nMay Pik Yu Chan, June Choe, Aini Li, Yiran Chen, Xin Gao and Nicole Holliday. (2022). Training and typological bias in ASR performance for world Englishes. Talk at the 23rd Conference of the International Speech Communication Association (INTERSPEECH), 18-22 September, 2022. Incheon, South Korea.\r\nConference Presentations\r\nJune Choe, and Anna Papafragou. Distributional signatures of superordinate nouns. Poster presented at the 48th Boston University Conference on Language Development (BUCLD), 2-5 November, 2023. Boston University, Boston, MA. Abstract Poster\r\nJune Choe, and Anna Papafragou. Pragmatic underpinnings of the basic-level bias. Poster presented at the 48th Boston University Conference on Language Development (BUCLD), 2-5 November, 2023. Boston University, Boston, MA. Abstract Poster\r\nJune Choe and Anna Papafragou. Discourse effects on the acquisition of subordinate nouns. 
Poster presented at the 9th Mid-Atlantic Colloquium of Studies in Meaning (MACSIM), 15 April 2023. University of Pennsylvania, PA.\r\nJune Choe and Anna Papafragou. Discourse effects on the acquisition of subordinate nouns. Poster presented at the 36th Annual Conference on Human Sentence Processing, 9-11 March 2022. University of Pittsburg, PA. Abstract Poster\r\nJune Choe, and Anna Papafragou. Acquisition of subordinate nouns as pragmatic inference: Semantic alternatives modulate subordinate meanings. Poster at the 2nd Experiments in Linguistic Meaning (ELM) conference, 18-20 May 2022. University of Pennsylvania, Philadelphia, PA.\r\nJune Choe, and Anna Papafragou. Beyond the basic level: Levels of informativeness and the acquisition of subordinate nouns. Poster at the 35th Annual Conference on Human Sentence Processing (HSP), 24-26 March 2022. University of California, Santa Cruz, CA.\r\nJune Choe, Jennifer Cole, and Masaya Yoshida. Prosodic Focus Strengthens Semantic Persistence. Poster at The 26th Architectures and Mechanisms for Language Processing (AMLaP), 3-5 September 2020. Potsdam, Germany. Abstract Video Slides\r\nJune Choe. Computer-assisted snowball search for meta-analysis research. Poster at The 2020 Undergraduate Research & Arts Exposition. 27-28 May 2020. Northwestern University, Evanston, IL. 2nd Place Poster Award. Abstract\r\nJune Choe. Social Information in Sentence Processing. Talk at The 2019 Undergraduate Research & Arts Exposition. 29 May 2019. Northwestern University, Evanston, IL. Abstract\r\nJune Choe, Shayne Sloggett, Masaya Yoshida and Annette DāOnofrio. Personae in syntactic processing: Socially-specific agents bias expectations of verb transitivity. Poster at The 32nd CUNY Conference on Human Sentence Processing. 29-31 March 2019. University of Colorado, Boulder, CO.\r\nDāOnofrio, Annette, June Choe and Masaya Yoshida. Personae in syntactic processing: Socially-specific agents bias expectations of verb transitivity. 
Poster at The 93rd Annual Meeting of the Linguistics Society of America. 3-6 January 2019. New York City, NY.\r\nWorkshops led\r\nIntroduction to mixed-effects models in Julia. Workshop at Penn MindCORE. 1 December 2023. Philadelphia, PA. Github Colab notebook\r\nExperimental syntax using IBEX/PCIBEX with Dr.Ā Nayoun Kim. Workshop at the 2022 Seoul International Conference on Linguistics. 11-12 August 2022. Seoul, South Korea. PDF\r\nExperimental syntax using IBEX: a walkthrough with Dr.Ā Nayoun Kim. 2021 BK Winter School-Workshop on Experimental Linguistics/Syntax at Sungkyunkwan University, 19-22 January 2021. Seoul, South Korea. PDF\r\nGuest lectures\r\nHard words and (syntactic) bootstrapping. LING 5750 āThe Acquisition of Meaningā. Instructor: Dr.Ā Anna Papafragou. Spring 2024, University of Pennsylvania.\r\nIntroduction to R for psychology research. PSYC 4997 āSenior Honors Seminar in Psychologyā. Instructor: Dr.Ā Coren Apicella. Spring 2024, University of Pennsylvania. Colab notebook\r\nModel fitting and diagnosis with MixedModels.jl in Julia. LING 5670 āQuantitative Study of Linguistic Variationā. Instructor: Dr.Ā Meredith Tamminga. Fall 2023, University of Pennsylvania.\r\nSimulation-based power analysis for mixed-effects models. LING 5670 āQuantitative Study of Linguistic Variationā. Instructor: Dr.Ā Meredith Tamminga. Spring 2023, University of Pennsylvania.\r\nResearch activities in FOSS\r\nPapers\r\nMassimo Aria, Trang Le, Corrado Cuccurullo, Alessandra Belfiore, and June Choe. (2024). openalexR: An R-tool for collecting bibliometric data from OpenAlex. The R Journal, 15(4), 166-179. Paper, Github\r\nJune Choe. (2022). Sub-layer modularity in the Grammar of Graphics. American Statistical Association (ASA) student paper award in Statistical Computing and Graphics. Paper, Github\r\nTalks\r\nJune Choe. Sub-layer modularity in the Grammar of Graphics. Talk at the 2023 Joint Statistical Meetings, 5-10 August 2023. Toronto, Canada.\r\nJune Choe. 
Fast cluster-based permutation test using mixed-effects models. Talk at the Integrated Language Science and Technology (ILST) seminar, 21 April 2023. University of Pennsylvania, PA.\r\nJune Choe. Cracking open ggplot internals with {ggtrace}. Talk at the 2022 RStudio Conference, 25-28 July 2022. Washington D.C. https://github.com/yjunechoe/ggtrace-rstudioconf2022\r\nJune Choe. Stepping into {ggplot2} internals with {ggtrace}. Talk at the 2022 useR! Conference, 20-23 June 2022. Vanderbilt University, TN. https://github.com/yjunechoe/ggtrace-user2022\r\nSoftware\r\nRich Iannone, June Choe, Mauricio Vargas Sepulveda. (2024). pointblank: Data Validation and Organization of Metadata for Local and Remote Tables. R package version 0.12.1. https://CRAN.R-project.org/package=pointblank. Github\r\nMassimo Aria, Corrado Cuccurullo, Trang Le, June Choe. (2024). openalexR: Getting Bibliographic Records from āOpenAlexā Database Using āDSLā API. R package version 1.2.3. https://CRAN.R-project.org/package=openalexR. Github\r\nJune Choe. (2024). jlmerclusterperm: Cluster-Based Permutation Analysis for Densely Sampled Time Data. R package version 1.1.3. https://cran.r-project.org/package=jlmerclusterperm. Github\r\nSean McWeeny, June Choe, & Elizabeth S. Norton. (2021). SnowGlobe: An Iterative Search Tool for Systematic Reviews and Meta-Analyses [Computer Software]. OSF\r\nService\r\nEditor\r\nPenn Working Papers in Linguistics (PWPL), Volumne 30, Issue 1.\r\nReviewer\r\nLanguage Learning and Development\r\nJournal of Open Source Software\r\nProceedings of the Annual Meeting of the Cognitive Science Society\r\n\r\n\r\n\r\n",
- "last_modified": "2024-06-23T13:27:55+09:00"
+ "last_modified": "2024-06-23T13:35:52+09:00"
},
{
"path": "resources.html",
@@ -35,14 +35,14 @@
"description": "Mostly for R and data visualization\n",
"author": [],
"contents": "\r\n\r\nContents\r\nLinguistics\r\nData Visualization\r\nPackages and software\r\nTutorial Blog Posts\r\nBy others\r\n\r\nLinguistics\r\nScripting online experiments with IBEX (workshop slides & materials with Nayoun Kim)\r\nData Visualization\r\n{ggplot2} style guide and showcase - most recent version (2/10/2021)\r\nCracking open the internals of ggplot: A {ggtrace} showcase - slides\r\nPackages and software\r\n{ggtrace}: R package for exploring, debugging, and manipulating ggplot internals by exposing the underlying object-oriented system in functional programming terms.\r\n{penngradlings}: R package for the University of Pennsylvania Graduate Linguistics Society.\r\n{LingWER}: R package for linguistic analysis of Word Error Rate for evaluating transcriptions and other speech-to-text output, using a deterministic matrix-based search algorithm optimized for R.\r\n{gridAnnotate}: R package for interactively annotating figures from the plot pane, using {grid} graphical objects.\r\nSnowGlobe: A tool for meta-analysis research. Developed with Jinnie Choi, Sean McWeeny, and Elizabeth Norton, with funding from the Northwestern University Library. Currently under development but basic features are functional. Validation experiments and guides at OSF repo.\r\nTutorial Blog Posts\r\n{ggplot2} stat_*() functions [post]\r\nCustom fonts in R [post]\r\n{purrr} reduce() family [post1, post2]\r\nThe correlation parameter in {lme4} mixed effects models [post]\r\nShortcuts for common chain of {dplyr} functions [post]\r\nPlotting highly-customizable treemaps with {treemap} and {ggplot2} [post]\r\nBy others\r\nTutorials:\r\nA ggplot2 Tutorial for Beautiful Plotting in R by CĆ©dric Scherer\r\nggplot2 Wizardry Hands-On by CĆ©dric Scherer\r\nggplot2 workshop by Thomas Lin Pedersen\r\nBooks:\r\nR for Data Science by Hadley Wickham and Garrett Grolemund\r\nR Markdown: The Definitive Guide by Yihui Xie, J. J. 
Allaire, and Garrett Grolemund\r\nggplot2: elegant graphics for data analysis by Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen\r\nFundamentals of Data Visualization by Claus O. Wilke\r\nEfficient R Programming by Colin Gillespie and Robin Lovelace\r\nAdvanced R by Hadley Wickham\r\n\r\n\r\n\r\n",
- "last_modified": "2024-06-23T13:27:59+09:00"
+ "last_modified": "2024-06-23T13:35:54+09:00"
},
{
"path": "software.html",
"title": "Software",
"author": [],
"contents": "\r\n\r\nContents\r\nggtrace\r\njlmerclusterperm\r\npointblank\r\nopenalexR\r\nggcolormeter\r\nddplot\r\nSnowglobe (retired)\r\n\r\nMain: Github profile, R-universe profile\r\nggtrace\r\n\r\n\r\n\r\nRole: Author\r\nLanguage: R\r\nLinks: Github, website, talks (useR! 2022, rstudio::conf 2022), paper\r\n\r\nProgrammatically explore, debug, and manipulate ggplot internals. Package {ggtrace} offers a low-level interface that extends base R capabilities of trace, as well as a family of workflow functions that make interactions with ggplot internals more accessible.\r\n\r\njlmerclusterperm\r\n\r\n\r\n\r\nRole: Author\r\nLanguage: R, Julia\r\nLinks: CRAN, Github, website\r\n\r\nAn implementation of fast cluster-based permutation analysis (CPA) for densely-sampled time data developed in Maris & Oostenveld (2007). Supports (generalized, mixed-effects) regression models for the calculation of timewise statistics. Provides both a wholesale and a piecemeal interface to the CPA procedure with an emphasis on interpretability and diagnostics. 
Integrates Julia libraries MixedModels.jl and GLM.jl for performance improvements, with additional functionalities for interfacing with Julia from ‘R’ powered by the JuliaConnectoR package.\r\n\r\npointblank\r\n\r\n\r\n\r\nRole: Author\r\nLanguage: R, HTML/CSS, Javascript\r\nLinks: Github, website\r\n\r\nData quality assessment and metadata reporting for data frames and database tables\r\n\r\nopenalexR\r\n\r\n\r\n\r\nRole: Contributor\r\nLanguage: R\r\nLinks: Github, website\r\n\r\nA set of tools to extract bibliographic content from the OpenAlex database using API https://docs.openalex.org.\r\n\r\nggcolormeter\r\nRole: Author\r\nLanguage: R\r\nLinks: Github\r\n\r\n{ggcolormeter} adds guide_colormeter(), a {ggplot2} color/fill legend guide extension in the style of a dashboard meter.\r\n\r\nddplot\r\nRole: Contributor\r\nLanguage: R, JavaScript\r\nLinks: Github, website\r\n\r\nCreate ‘D3’ based ‘SVG’ (‘Scalable Vector Graphics’) graphics using a simple ‘R’ API. The package aims to simplify the creation of many ‘SVG’ plot types using a straightforward ‘R’ API. The package relies on the ‘r2d3’ ‘R’ package and the ‘D3’ ‘JavaScript’ library. See https://rstudio.github.io/r2d3/ and https://d3js.org/ respectively.\r\n\r\nSnowglobe (retired)\r\nRole: Author\r\nLanguage: R, SQL\r\nLinks: Github, OSF, poster\r\n\r\nAn iterative search tool for systematic reviews and meta-analyses, implemented as a Shiny app. Retired due to the discontinuation of the Microsoft Academic Graph service in 2021. I now contribute to {openalexR}.\r\n\r\n\r\n\r\n\r\n",
- "last_modified": "2024-06-23T13:28:01+09:00"
+ "last_modified": "2024-06-23T13:35:59+09:00"
},
{
"path": "visualizations.html",
@@ -50,7 +50,7 @@
"description": "Select data visualizations",
"author": [],
"contents": "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n",
- "last_modified": "2024-06-23T13:28:03+09:00"
+ "last_modified": "2024-06-23T13:36:02+09:00"
}
],
"collections": ["posts/posts.json"]
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 6458de3..89ec247 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -30,7 +30,7 @@
https://yjunechoe.github.io/posts/2024-06-09-ave-for-the-average/
- 2024-06-23T13:26:59+09:00
+ 2024-06-23T13:35:29+09:00
https://yjunechoe.github.io/posts/2024-03-04-args-args-args-args/