diff --git a/docs/index.html b/docs/index.html
index 76b63e7..18c959c 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -2742,7 +2742,7 @@
June Choe
-I am a linguist, software developer, and scientific programmer. In my PhD research in Linguistics at the University of Pennsylvania, I use experimental, corpus, and computational approaches to studying linguistic meaning, of various flavors. My dissertation investigates the learning problem of acquiring hierarchical noun meanings in early childhood, from the angles of semantics/pragmatics, distributional learning, and conceptual development.
+I am a fifth-year PhD candidate in Linguistics at the University of Pennsylvania. I use experimental, corpus, and computational approaches to study linguistic meaning in its various flavors. My dissertation investigates the learning problem of acquiring hierarchical noun meanings in early childhood, from the angles of semantics/pragmatics, distributional learning, and conceptual development.
I am also active in the R programming community as a mentor and open-source developer. I maintain and collaborate on several open-source software packages in statistical computing (jlmerclusterperm), data visualization (ggtrace), data quality assurance (pointblank), and interfaces to data APIs (openalexR). My work on graphics received an award from the American Statistical Association in 2023. I enjoy writing in my free time and have been maintaining a technical blog for the past 5 years, covering topics including R language internals, software design principles, and practical tutorials for everyday data analysis.
diff --git a/docs/posts/2020-06-07-correlation-parameter-mem/index.html b/docs/posts/2020-06-07-correlation-parameter-mem/index.html
index 0d0fe55..244f91b 100644
--- a/docs/posts/2020-06-07-correlation-parameter-mem/index.html
+++ b/docs/posts/2020-06-07-correlation-parameter-mem/index.html
@@ -2612,7 +2612,7 @@
${suggestion.title}
diff --git a/docs/posts/2020-06-25-indexing-tip-for-spacyr/index.html b/docs/posts/2020-06-25-indexing-tip-for-spacyr/index.html
index a7eb1f3..ded9385 100644
--- a/docs/posts/2020-06-25-indexing-tip-for-spacyr/index.html
+++ b/docs/posts/2020-06-25-indexing-tip-for-spacyr/index.html
@@ -2612,7 +2612,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-06-30-treemap-with-ggplot/index.html b/docs/posts/2020-06-30-treemap-with-ggplot/index.html
index 58a5f3d..ee08096 100644
--- a/docs/posts/2020-06-30-treemap-with-ggplot/index.html
+++ b/docs/posts/2020-06-30-treemap-with-ggplot/index.html
@@ -2612,7 +2612,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-07-13-geom-paired-raincloud/index.html b/docs/posts/2020-07-13-geom-paired-raincloud/index.html
index e878169..460a823 100644
--- a/docs/posts/2020-07-13-geom-paired-raincloud/index.html
+++ b/docs/posts/2020-07-13-geom-paired-raincloud/index.html
@@ -2612,7 +2612,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-07-20-shiny-tips-1/index.html b/docs/posts/2020-07-20-shiny-tips-1/index.html
index ef2c5b0..a109769 100644
--- a/docs/posts/2020-07-20-shiny-tips-1/index.html
+++ b/docs/posts/2020-07-20-shiny-tips-1/index.html
@@ -2612,7 +2612,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-07-29-six-years-of-my-spotify-playlists/index.html b/docs/posts/2020-07-29-six-years-of-my-spotify-playlists/index.html
index f215cb1..9aaf784 100644
--- a/docs/posts/2020-07-29-six-years-of-my-spotify-playlists/index.html
+++ b/docs/posts/2020-07-29-six-years-of-my-spotify-playlists/index.html
@@ -2612,7 +2612,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-08-04-tidytuesday-2020-week-32/index.html b/docs/posts/2020-08-04-tidytuesday-2020-week-32/index.html
index ea1b16a..02d6caa 100644
--- a/docs/posts/2020-08-04-tidytuesday-2020-week-32/index.html
+++ b/docs/posts/2020-08-04-tidytuesday-2020-week-32/index.html
@@ -2612,7 +2612,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-08-07-saving-a-line-of-piping/index.html b/docs/posts/2020-08-07-saving-a-line-of-piping/index.html
index 0fca849..65d4da6 100644
--- a/docs/posts/2020-08-07-saving-a-line-of-piping/index.html
+++ b/docs/posts/2020-08-07-saving-a-line-of-piping/index.html
@@ -2618,7 +2618,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-08-17-tidytuesday-2020-week-33/index.html b/docs/posts/2020-08-17-tidytuesday-2020-week-33/index.html
index 3735453..c71a884 100644
--- a/docs/posts/2020-08-17-tidytuesday-2020-week-33/index.html
+++ b/docs/posts/2020-08-17-tidytuesday-2020-week-33/index.html
@@ -2612,7 +2612,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-09-06-fonts-for-graphs/index.html b/docs/posts/2020-09-06-fonts-for-graphs/index.html
index 2f04d59..655a9a7 100644
--- a/docs/posts/2020-09-06-fonts-for-graphs/index.html
+++ b/docs/posts/2020-09-06-fonts-for-graphs/index.html
@@ -2612,7 +2612,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-09-12-videos-in-reactable/index.html b/docs/posts/2020-09-12-videos-in-reactable/index.html
index 3f61e54..673e5ee 100644
--- a/docs/posts/2020-09-12-videos-in-reactable/index.html
+++ b/docs/posts/2020-09-12-videos-in-reactable/index.html
@@ -2623,7 +2623,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-09-14-tidytuesday-2020-week-38/index.html b/docs/posts/2020-09-14-tidytuesday-2020-week-38/index.html
index 9bf230a..7f51ace 100644
--- a/docs/posts/2020-09-14-tidytuesday-2020-week-38/index.html
+++ b/docs/posts/2020-09-14-tidytuesday-2020-week-38/index.html
@@ -2612,7 +2612,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-09-20-plot-makeover-1/index.html b/docs/posts/2020-09-20-plot-makeover-1/index.html
index de874e7..4080775 100644
--- a/docs/posts/2020-09-20-plot-makeover-1/index.html
+++ b/docs/posts/2020-09-20-plot-makeover-1/index.html
@@ -2619,7 +2619,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-09-23-tidytuesday-2020-week-39/index.html b/docs/posts/2020-09-23-tidytuesday-2020-week-39/index.html
index 4d11e65..473e4d9 100644
--- a/docs/posts/2020-09-23-tidytuesday-2020-week-39/index.html
+++ b/docs/posts/2020-09-23-tidytuesday-2020-week-39/index.html
@@ -2612,7 +2612,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-09-26-demystifying-stat-layers-ggplot2/index.html b/docs/posts/2020-09-26-demystifying-stat-layers-ggplot2/index.html
index 28fbfee..b9572db 100644
--- a/docs/posts/2020-09-26-demystifying-stat-layers-ggplot2/index.html
+++ b/docs/posts/2020-09-26-demystifying-stat-layers-ggplot2/index.html
@@ -2614,7 +2614,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-10-13-designing-guiding-aesthetics/index.html b/docs/posts/2020-10-13-designing-guiding-aesthetics/index.html
index 12b1464..67169b0 100644
--- a/docs/posts/2020-10-13-designing-guiding-aesthetics/index.html
+++ b/docs/posts/2020-10-13-designing-guiding-aesthetics/index.html
@@ -2617,7 +2617,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-10-22-analysis-of-everycolorbots-tweets/index.html b/docs/posts/2020-10-22-analysis-of-everycolorbots-tweets/index.html
index 51bcfbd..10c1222 100644
--- a/docs/posts/2020-10-22-analysis-of-everycolorbots-tweets/index.html
+++ b/docs/posts/2020-10-22-analysis-of-everycolorbots-tweets/index.html
@@ -2618,7 +2618,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-10-28-tidytuesday-2020-week-44/index.html b/docs/posts/2020-10-28-tidytuesday-2020-week-44/index.html
index 4626475..7bc8e83 100644
--- a/docs/posts/2020-10-28-tidytuesday-2020-week-44/index.html
+++ b/docs/posts/2020-10-28-tidytuesday-2020-week-44/index.html
@@ -2612,7 +2612,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-11-03-tidytuesday-2020-week-45/index.html b/docs/posts/2020-11-03-tidytuesday-2020-week-45/index.html
index d1391ce..1ae279d 100644
--- a/docs/posts/2020-11-03-tidytuesday-2020-week-45/index.html
+++ b/docs/posts/2020-11-03-tidytuesday-2020-week-45/index.html
@@ -2612,7 +2612,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-11-08-plot-makeover-2/index.html b/docs/posts/2020-11-08-plot-makeover-2/index.html
index 3178c45..ee9e1a4 100644
--- a/docs/posts/2020-11-08-plot-makeover-2/index.html
+++ b/docs/posts/2020-11-08-plot-makeover-2/index.html
@@ -2625,7 +2625,7 @@ ${suggestion.title}
diff --git a/docs/posts/2020-12-13-collapse-repetitive-piping-with-reduce/index.html b/docs/posts/2020-12-13-collapse-repetitive-piping-with-reduce/index.html
index c9151f3..7173cb9 100644
--- a/docs/posts/2020-12-13-collapse-repetitive-piping-with-reduce/index.html
+++ b/docs/posts/2020-12-13-collapse-repetitive-piping-with-reduce/index.html
@@ -2620,7 +2620,7 @@ ${suggestion.title}
diff --git a/docs/posts/2021-01-17-random-sampling-a-table-animation/index.html b/docs/posts/2021-01-17-random-sampling-a-table-animation/index.html
index fe8f186..a14ade0 100644
--- a/docs/posts/2021-01-17-random-sampling-a-table-animation/index.html
+++ b/docs/posts/2021-01-17-random-sampling-a-table-animation/index.html
@@ -2614,7 +2614,7 @@ ${suggestion.title}
diff --git a/docs/posts/2021-06-24-setting-up-and-debugging-custom-fonts/index.html b/docs/posts/2021-06-24-setting-up-and-debugging-custom-fonts/index.html
index 2666606..aa5a8b8 100644
--- a/docs/posts/2021-06-24-setting-up-and-debugging-custom-fonts/index.html
+++ b/docs/posts/2021-06-24-setting-up-and-debugging-custom-fonts/index.html
@@ -2619,7 +2619,7 @@ ${suggestion.title}
diff --git a/docs/posts/2022-03-10-ggplot2-delayed-aes-1/index.html b/docs/posts/2022-03-10-ggplot2-delayed-aes-1/index.html
index 3b4e2bd..285789b 100644
--- a/docs/posts/2022-03-10-ggplot2-delayed-aes-1/index.html
+++ b/docs/posts/2022-03-10-ggplot2-delayed-aes-1/index.html
@@ -2621,7 +2621,7 @@ ${suggestion.title}
diff --git a/docs/posts/2022-07-06-ggplot2-delayed-aes-2/index.html b/docs/posts/2022-07-06-ggplot2-delayed-aes-2/index.html
index 53672e4..ce0b649 100644
--- a/docs/posts/2022-07-06-ggplot2-delayed-aes-2/index.html
+++ b/docs/posts/2022-07-06-ggplot2-delayed-aes-2/index.html
@@ -2630,7 +2630,7 @@ ${suggestion.title}
diff --git a/docs/posts/2022-07-30-user2022/index.html b/docs/posts/2022-07-30-user2022/index.html
index 4d532e4..3bbacaf 100644
--- a/docs/posts/2022-07-30-user2022/index.html
+++ b/docs/posts/2022-07-30-user2022/index.html
@@ -2620,7 +2620,7 @@ ${suggestion.title}
diff --git a/docs/posts/2022-11-13-dataframes-jl-and-accessories/index.html b/docs/posts/2022-11-13-dataframes-jl-and-accessories/index.html
index 08b84f5..dec515d 100644
--- a/docs/posts/2022-11-13-dataframes-jl-and-accessories/index.html
+++ b/docs/posts/2022-11-13-dataframes-jl-and-accessories/index.html
@@ -2622,7 +2622,7 @@ ${suggestion.title}
diff --git a/docs/posts/2023-06-11-row-relational-operations/index.html b/docs/posts/2023-06-11-row-relational-operations/index.html
index a908a1d..37b5432 100644
--- a/docs/posts/2023-06-11-row-relational-operations/index.html
+++ b/docs/posts/2023-06-11-row-relational-operations/index.html
@@ -2626,7 +2626,7 @@ ${suggestion.title}
diff --git a/docs/posts/2023-07-09-x-y-problem/index.html b/docs/posts/2023-07-09-x-y-problem/index.html
index 24ef487..1e53001 100644
--- a/docs/posts/2023-07-09-x-y-problem/index.html
+++ b/docs/posts/2023-07-09-x-y-problem/index.html
@@ -2620,7 +2620,7 @@ ${suggestion.title}
diff --git a/docs/posts/2023-12-03-untidy-select/index.html b/docs/posts/2023-12-03-untidy-select/index.html
index c72ee28..440a5f0 100644
--- a/docs/posts/2023-12-03-untidy-select/index.html
+++ b/docs/posts/2023-12-03-untidy-select/index.html
@@ -2626,7 +2626,7 @@ ${suggestion.title}
diff --git a/docs/posts/2023-12-31-2023-year-in-review/index.html b/docs/posts/2023-12-31-2023-year-in-review/index.html
index 20a84b7..d291bc4 100644
--- a/docs/posts/2023-12-31-2023-year-in-review/index.html
+++ b/docs/posts/2023-12-31-2023-year-in-review/index.html
@@ -2620,7 +2620,7 @@ ${suggestion.title}
diff --git a/docs/posts/2024-02-20-helloworld-print/index.html b/docs/posts/2024-02-20-helloworld-print/index.html
index 47029a6..da8a78e 100644
--- a/docs/posts/2024-02-20-helloworld-print/index.html
+++ b/docs/posts/2024-02-20-helloworld-print/index.html
@@ -2620,7 +2620,7 @@ ${suggestion.title}
diff --git a/docs/posts/2024-03-04-args-args-args-args/index.html b/docs/posts/2024-03-04-args-args-args-args/index.html
index 73f8124..1892af8 100644
--- a/docs/posts/2024-03-04-args-args-args-args/index.html
+++ b/docs/posts/2024-03-04-args-args-args-args/index.html
@@ -2620,7 +2620,7 @@ ${suggestion.title}
diff --git a/docs/posts/2024-06-09-ave-for-the-average/index.html b/docs/posts/2024-06-09-ave-for-the-average/index.html
index 073379e..347f80d 100644
--- a/docs/posts/2024-06-09-ave-for-the-average/index.html
+++ b/docs/posts/2024-06-09-ave-for-the-average/index.html
@@ -2620,7 +2620,7 @@ ${suggestion.title}
diff --git a/docs/posts/2024-07-21-enumerate-possible-options/index.html b/docs/posts/2024-07-21-enumerate-possible-options/index.html
index 135b919..40402aa 100644
--- a/docs/posts/2024-07-21-enumerate-possible-options/index.html
+++ b/docs/posts/2024-07-21-enumerate-possible-options/index.html
@@ -2616,7 +2616,7 @@ ${suggestion.title}
diff --git a/docs/posts/2024-09-22-fetch-files-web/index.html b/docs/posts/2024-09-22-fetch-files-web/index.html
index 8016c1c..5013530 100644
--- a/docs/posts/2024-09-22-fetch-files-web/index.html
+++ b/docs/posts/2024-09-22-fetch-files-web/index.html
@@ -2616,7 +2616,7 @@ ${suggestion.title}
diff --git a/docs/posts/posts.json b/docs/posts/posts.json
index db24bfc..28605cc 100644
--- a/docs/posts/posts.json
+++ b/docs/posts/posts.json
@@ -15,7 +15,7 @@
],
"contents": "\r\n\r\nContents\r\nGitHub (public repos)\r\nGitHub (gists)\r\nGitHub (private repos)\r\nOSF\r\nAside: Canât go wrong with a copy-paste!\r\nOther goodies\r\nStreaming with {duckdb}\r\nOther sources for data\r\nMiscellaneous tips and tricks\r\n\r\nsessionInfo()\r\n\r\nEvery so often Iâll have a link to some file on hand and want to read it in R without going out of my way to browse the web page, find a download link, download it somewhere onto my computer, grab the path to it, and then finally read it into R.\r\nOver the years Iâve accumulated some tricks to get data into R âstraight from a urlâ, even if the url does not point to the raw file contents itself. The method varies between data sources though, and I have a hard time keeping track of them in my head, so I thought Iâd write some of these down for my own reference. This is not meant to be comprehensive though - keep in mind that Iâm someone who primarily works with tabular data and interface with GitHub and OSF as data repositories.\r\nGitHub (public repos)\r\nGitHub has nice a point-and-click interface for browsing repositories and previewing files. For example, you can navigate to the dplyr::starwars dataset from tidyverse/dplyr, at https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv:\r\n\r\n\r\n\r\nThat url, despite ending in a .csv, does not point to the raw data - instead, the contents of the page is a full html document:\r\n\r\n\r\nrvest::read_html(\"https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv\")\r\n\r\n\r\n {html_document}\r\n \\n \r\n dplyr::glimpse()\r\n\r\n Rows: 87\r\n Columns: 14\r\n $ name \"Luke Skywalker\", \"C-3PO\", \"R2-D2\", \"Darth Vader\", \"Leia OrâŚ\r\n $ height 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2âŚ\r\n $ mass 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.âŚ\r\n $ hair_color \"blond\", NA, NA, \"none\", \"brown\", \"brown, grey\", \"brown\", NâŚ\r\n $ skin_color \"fair\", \"gold\", \"white, blue\", \"white\", \"light\", \"light\", \"âŚ\r\n $ eye_color \"blue\", \"yellow\", \"red\", \"yellow\", \"brown\", \"blue\", \"blue\",âŚ\r\n $ birth_year 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, âŚ\r\n $ sex \"male\", \"none\", \"none\", \"male\", \"female\", \"male\", \"female\",âŚ\r\n $ gender \"masculine\", \"masculine\", \"masculine\", \"masculine\", \"feminiâŚ\r\n $ homeworld \"Tatooine\", \"Tatooine\", \"Naboo\", \"Tatooine\", \"Alderaan\", \"TâŚ\r\n $ species \"Human\", \"Droid\", \"Droid\", \"Human\", \"Human\", \"Human\", \"HumaâŚ\r\n $ films \"A New Hope, The Empire Strikes Back, Return of the Jedi, RâŚ\r\n $ vehicles \"Snowspeeder, Imperial Speeder Bike\", \"\", \"\", \"\", \"ImperialâŚ\r\n $ starships \"X-wing, Imperial shuttle\", \"\", \"\", \"TIE Advanced x1\", \"\", âŚ\r\n\r\nBut note that this method of âclick the Raw button to get the corresponding raw.githubusercontent.com/⌠url to the file contentsâ will not work for file formats that cannot be displayed in plain text (clicking the button will instead download the file via your browser). So sometimes (especially when you have a binary file) you have to construct this âremote-readableâ url to the file manually.\r\nFortunately, going from one link to the other is pretty formulaic. 
To demonstrate the difference with the url for the starwars dataset again:\r\n\r\n\r\nemphatic::hl_diff(\r\n \"https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv\",\r\n \"https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv\"\r\n)\r\n\r\n\r\n[1] \"https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv\"\r\n[1] \"https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv\"\r\n\r\n\r\nGitHub (gists)\r\nIt's a similar idea with GitHub Gists, where I sometimes like to store small toy datasets for use in demos. For example, here's a link to simulated data for a Stroop experiment stroop.csv: https://gist.github.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6.\r\nBut that's again a full-on webpage. The url which actually hosts the csv contents is https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv, which you can again get to by clicking the Raw button at the top-right corner of the gist.\r\n\r\n\r\n\r\nBut actually, that long link you get by default points to the current commit, specifically. If you instead want the link to be kept up to date with the most recent commit, you can omit the second hash that comes after raw/:\r\n\r\n\r\nemphatic::hl_diff(\r\n \"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv\",\r\n \"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/stroop.csv\"\r\n)\r\n\r\n\r\n[1] \"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv\"\r\n[1] \"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/stroop.csv\"\r\n\r\n\r\nIn practice, I don't use gists to store replicability-sensitive data, so I prefer to just use the shorter link that's not tied to a specific commit.\r\n\r\n\r\nread.csv(\"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/stroop.csv\") |> \r\n dplyr::glimpse()\r\n\r\n Rows: 240\r\n Columns: 5\r\n $ subj <chr> \"S01\", \"S01\", \"S01\", \"S01\", \"S01\", \"S01\", \"S01\", \"S01\", \"S02…\r\n $ word <chr> \"blue\", \"blue\", \"green\", \"green\", \"red\", \"red\", \"yellow\", \"y…\r\n $ condition <chr> \"match\", \"mismatch\", \"match\", \"mismatch\", \"match\", \"mismatch…\r\n $ accuracy <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …\r\n $ RT <int> 400, 549, 576, 406, 296, 231, 433, 1548, 561, 1751, 286, 710…\r\n\r\nGitHub (private repos)\r\nWe now turn to the harder problem of accessing a file in a private GitHub repository. If you already have the GitHub webpage open and you're signed in, you can follow the same step of copying the link that the Raw button redirects to.\r\nExcept this time, when you open the file at that url (assuming it can display in plain text), you'll see the url come with a "token" attached at the end (I'll show an example further down). This token is necessary to remotely access the data in a private repo. Once a token is generated, the file can be accessed using that token from anywhere, but note that it will expire at some point as GitHub refreshes tokens periodically (so treat them as if they're for single use).\r\nFor a more robust approach, you can use the GitHub Contents API. 
If you have your credentials set up in {gh} (which you can check with gh::gh_whoami()), you can request a token-tagged url to the private file using the syntax:1\r\n\r\n\r\ngh::gh(\"/repos/{user}/{repo}/contents/{path}\")$download_url\r\n\r\n\r\nNote that this is actually also a general solution to getting a url to GitHub file contents. So for example, even without any credentials set up you can point to dplyr's starwars.csv since that's publicly accessible. This method produces the same "raw.githubusercontent.com/…" url we saw earlier:\r\n\r\n\r\ngh::gh(\"/repos/tidyverse/dplyr/contents/data-raw/starwars.csv\")$download_url\r\n\r\n [1] \"https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv\"\r\n\r\nNow for a demonstration with a private repo, here is one of mine that you cannot access: https://github.com/yjunechoe/my-super-secret-repo. But because I set up my credentials in {gh}, I can generate a link to a file within that repo with the access token attached ("?token=…"):\r\n\r\n\r\ngh::gh(\"/repos/yjunechoe/my-super-secret-repo/contents/README.md\")$download_url |> \r\n # truncating\r\n gsub(x = _, \"^(.{100}).*\", \"\\\\1...\")\r\n\r\n [1] \"https://raw.githubusercontent.com/yjunechoe/my-super-secret-repo/main/README.md?token=AMTCUR2JPXCIX5...\"\r\n\r\nI can then use this url to read the private file:2\r\n\r\n\r\ngh::gh(\"/repos/yjunechoe/my-super-secret-repo/contents/README.md\")$download_url |> \r\n readLines()\r\n\r\n [1] \"Surprise!\"\r\n\r\nOSF\r\nOSF (the Open Science Framework) is another data repository that I interact with a lot, and reading files off of OSF follows a similar strategy to fetching public files on GitHub.\r\nConsider, for example, the dyestuff.arrow file in the OSF repository for MixedModels.jl. Browsing the repository through the point-and-click interface can get you to the page for the file at https://osf.io/9vztj/, where it shows:\r\n\r\n\r\n\r\nThe download button can be found inside the dropdown menubar to the right:\r\n\r\n\r\n\r\nBut instead of clicking on the icon (which will start a download via the browser), we can grab the embedded link address: https://osf.io/download/9vztj/. That url can then be passed directly into a read function:\r\n\r\n\r\narrow::read_feather(\"https://osf.io/download/9vztj/\") |> \r\n dplyr::glimpse()\r\n\r\n Rows: 30\r\n Columns: 2\r\n $ batch <fct> A, A, A, A, A, B, B, B, B, B, C, C, C, C, C, D, D, D, D, D, E, E…\r\n $ yield <dbl> 1545, 1440, 1440, 1520, 1580, 1540, 1555, 1490, 1560, 1495, 1595…\r\n\r\nYou might have already caught on to this, but the pattern is to simply point to osf.io/download/ instead of osf.io/.\r\nThis method also works for view-only links to anonymized OSF projects. For example, this is an anonymized link to a csv file from one of my projects: https://osf.io/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad. 
Navigating to this link will show a web preview of the csv file contents.\r\nBy inserting /download into this url, we can read the csv file contents directly:\r\n\r\n\r\nread.csv(\"https://osf.io/download/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad\") |> \r\n head()\r\n\r\n Item plaus_bias trans_bias\r\n 1 Awakened -0.29631221 -1.2200901\r\n 2 Calmed 0.09877074 -0.4102332\r\n 3 Choked 1.28401957 -1.4284905\r\n 4 Dressed -0.59262442 -1.2087228\r\n 5 Failed -0.98770736 0.1098839\r\n 6 Groomed -1.08647810 0.9889550\r\n\r\nSee also the {osfr} package for a more principled interface to OSF.\r\nAside: Can't go wrong with a copy-paste!\r\nReading remote files aside, I think it's severely underrated how base R has a readClipboard() function and a collection of read.*() functions which can also read directly from a \"clipboard\" connection.3\r\nI sometimes do this for html/markdown summary tables that a website might display, or sometimes even for entire excel/googlesheets tables after doing a select-all + copy. For such relatively small chunks of data that you just want to quickly get into R, you can lean on base R's clipboard functionalities.\r\nFor example, given this markdown table:\r\n\r\n\r\naggregate(mtcars, mpg ~ cyl, mean) |> \r\n knitr::kable()\r\n\r\n| cyl|      mpg|\r\n|---:|--------:|\r\n|   4| 26.66364|\r\n|   6| 19.74286|\r\n|   8| 15.10000|\r\n\r\nYou can copy its contents and run the following code to get that data back as an R data frame:\r\n\r\n\r\nread.delim(\"clipboard\")\r\n# Or, `read.delim(text = readClipboard())`\r\n\r\n\r\n\r\n cyl mpg\r\n 1 4 26.66364\r\n 2 6 19.74286\r\n 3 8 15.10000\r\n\r\nIf you're instead copying something flat like a list of numbers or strings, you can also use scan() and specify the appropriate sep to get that data back as a vector:4\r\n\r\n\r\npaste(1:10, collapse = \", \") |> \r\n cat()\r\n\r\n 1, 2, 3, 4, 5, 6, 7, 8, 9, 10\r\n\r\n\r\n\r\nscan(\"clipboard\", sep = \",\")\r\n# Or, `scan(textConnection(readClipboard()), sep = \",\")`\r\n\r\n\r\n\r\n [1] 1 2 3 4 5 6 7 8 9 10\r\n\r\nIt should be noted though that parsing clipboard contents is not a robust feature in base R. If you want a more principled approach to reading data from clipboard, you should use {datapasta}. And for printing data for others to copy-paste into R, use {constructive}. See also {clipr} which extends clipboard read/write functionalities.\r\nOther goodies\r\n⚠️ What lies ahead is denser than the kinds of "low-tech" advice I wrote about above.\r\nStreaming with {duckdb}\r\nOne caveat to all the "read from web" approaches I covered above is that they often do not actually circumvent the action of downloading the file onto your computer. For example, when you read a file from "raw.githubusercontent.com/…" with read.csv(), there is an implicit download.file() of the data into the current R session's tempdir().\r\nAn alternative that actually reads the data straight into memory is streaming. Streaming is more a feature of database languages, but there's good integration of such tools with R, so this option is available from within R as well.\r\nHere, I briefly outline what I learned from (mostly) reading a blog post by François Michonneau, which covers how to stream remote files using {duckdb}. 
It's pretty comprehensive but I wanted to make a template for just one method that I prefer.\r\nWe start by loading the {duckdb} package, creating a connection to an in-memory database, installing the httpfs extension (if not installed already), and loading httpfs for the database.\r\n\r\n\r\nlibrary(duckdb)\r\ncon <- dbConnect(duckdb())\r\n# dbExecute(con, \"INSTALL httpfs;\") # You may also need to \"INSTALL parquet;\"\r\ninvisible(dbExecute(con, \"LOAD httpfs;\"))\r\n\r\n\r\nFor this example I will use a parquet file from one of my projects which is hosted on GitHub: https://github.com/yjunechoe/repetition_events. The data I want to read is at the relative path /data/tokens_data/childID=1/part-7.parquet. I went ahead and converted that into the "raw contents" url shown below:\r\n\r\n\r\n# A parquet file of tokens from a sample of child-directed speech\r\nfile <- \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet\"\r\n\r\n# For comparison, reading its contents with {arrow}\r\narrow::read_parquet(file) |> \r\n head(5)\r\n\r\n # A tibble: 5 × 3\r\n utterance_id gloss part_of_speech\r\n \r\n 1 1 www \"\" \r\n 2 2 bye \"co\" \r\n 3 3 mhm \"co\" \r\n 4 4 Mommy's \"n:prop\" \r\n 5 4 here \"adv\"\r\n\r\nIn duckdb, the httpfs extension we loaded above allows PARQUET_SCAN5 to read a remote parquet file.\r\n\r\n\r\nquery1 <- glue::glue_sql(\"\r\n SELECT *\r\n FROM PARQUET_SCAN({`file`})\r\n LIMIT 5;\r\n\", .con = con)\r\ncat(query1)\r\n\r\n SELECT *\r\n FROM PARQUET_SCAN(\"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet\")\r\n LIMIT 5;\r\n\r\ndbGetQuery(con, query1)\r\n\r\n utterance_id gloss part_of_speech\r\n 1 1 www \r\n 2 2 bye co\r\n 3 3 mhm co\r\n 4 4 Mommy's n:prop\r\n 5 4 here adv\r\n\r\nAnd actually, in my case, the parquet file represents one of many files that had been previously split up via hive partitioning. To preserve this metadata even as I read in just a single file, I need to do two things:\r\n1) Specify hive_partitioning=true when calling PARQUET_SCAN.\r\n2) Ensure that the hive-partitioning syntax is represented in the url with URLdecode() (since the = character can sometimes be escaped, as in this case).\r\n\r\n\r\nemphatic::hl_diff(file, URLdecode(file))\r\n\r\n\r\n[1] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet\"\r\n[1] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=1/part-7.parquet\"\r\n\r\n\r\nWith that, the data now shows that the observations are from child #1 in the sample.\r\n\r\n\r\nfile <- URLdecode(file)\r\nquery2 <- glue::glue_sql(\"\r\n SELECT *\r\n FROM PARQUET_SCAN(\r\n {`file`},\r\n hive_partitioning=true\r\n )\r\n LIMIT 5;\r\n\", .con = con)\r\ncat(query2)\r\n\r\n SELECT *\r\n FROM PARQUET_SCAN(\r\n \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=1/part-7.parquet\",\r\n hive_partitioning=true\r\n )\r\n LIMIT 5;\r\n\r\ndbGetQuery(con, query2)\r\n\r\n utterance_id gloss part_of_speech childID\r\n 1 1 www 1\r\n 2 2 bye co 1\r\n 3 3 mhm co 1\r\n 4 4 Mommy's n:prop 1\r\n 5 4 here adv 1\r\n\r\nTo do this more programmatically over all parquet files under /tokens_data in the repository, we need to transition to using the GitHub Trees API. 
The idea is similar to using the Contents API but now we are requesting a list of all files using the following syntax:\r\n\r\n\r\ngh::gh(\"/repos/{user}/{repo}/git/trees/{branch/tag/commitSHA}?recursive=true\")$tree\r\n\r\n\r\nTo get the file tree of the repo on the master branch, we use:\r\n\r\n\r\nfiles <- gh::gh(\"/repos/yjunechoe/repetition_events/git/trees/master?recursive=true\")$tree\r\n\r\n\r\nWith recursive=true, this returns all files in the repo. Then, we can filter for just the parquet files we want with a little regex:\r\n\r\n\r\nparquet_files <- sapply(files, `[[`, \"path\") |> \r\n grep(x = _, pattern = \".*/tokens_data/.*parquet$\", value = TRUE)\r\nlength(parquet_files)\r\n\r\n [1] 70\r\n\r\nhead(parquet_files)\r\n\r\n [1] \"data/tokens_data/childID=1/part-7.parquet\" \r\n [2] \"data/tokens_data/childID=10/part-0.parquet\"\r\n [3] \"data/tokens_data/childID=11/part-6.parquet\"\r\n [4] \"data/tokens_data/childID=12/part-3.parquet\"\r\n [5] \"data/tokens_data/childID=13/part-1.parquet\"\r\n [6] \"data/tokens_data/childID=14/part-2.parquet\"\r\n\r\nFinally, we complete the path using the "https://raw.githubusercontent.com/…" url:\r\n\r\n\r\nparquet_files <- paste0(\r\n \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/\",\r\n parquet_files\r\n)\r\nhead(parquet_files)\r\n\r\n [1] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=1/part-7.parquet\" \r\n [2] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=10/part-0.parquet\"\r\n [3] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=11/part-6.parquet\"\r\n [4] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=12/part-3.parquet\"\r\n [5] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=13/part-1.parquet\"\r\n [6] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=14/part-2.parquet\"\r\n\r\nBack in duckdb, we can use PARQUET_SCAN to read multiple files by supplying a vector ['file1.parquet', 'file2.parquet', ...].6 This time, we also ask for a quick computation to count the number of distinct childIDs:\r\n\r\n\r\nquery3 <- glue::glue_sql(\"\r\n SELECT count(DISTINCT childID)\r\n FROM PARQUET_SCAN(\r\n [{parquet_files*}],\r\n hive_partitioning=true\r\n )\r\n\", .con = con)\r\ncat(gsub(\"^(.{80}).*(.{60})$\", \"\\\\1 ... \\\\2\", query3))\r\n\r\n SELECT count(DISTINCT childID)\r\n FROM PARQUET_SCAN(\r\n ['https://raw.githubusercont ... 
data/childID=9/part-64.parquet'],\r\n hive_partitioning=true\r\n )\r\n\r\ndbGetQuery(con, query3)\r\n\r\n count(DISTINCT childID)\r\n 1 70\r\n\r\nThis returns 70, which matches the length of the parquet_files vector listing the files that had been partitioned by childID.\r\nFor further analyses, we can CREATE TABLE7 our data in our in-memory database con:\r\n\r\n\r\nquery4 <- glue::glue_sql(\"\r\n CREATE TABLE tokens_data AS\r\n SELECT *\r\n FROM PARQUET_SCAN([{parquet_files*}], hive_partitioning=true)\r\n\", .con = con)\r\ninvisible(dbExecute(con, query4))\r\ndbListTables(con)\r\n\r\n [1] \"tokens_data\"\r\n\r\nThat lets us reference the table via dplyr::tbl(), at which point we can switch over to a high-level interface like {dplyr} to query it using its familiar functions:\r\n\r\n\r\nlibrary(dplyr)\r\ntokens_data <- tbl(con, \"tokens_data\")\r\n\r\n# Q: What are the most common verbs spoken to children in this sample?\r\ntokens_data |> \r\n filter(part_of_speech == \"v\") |> \r\n count(gloss, sort = TRUE) |> \r\n head() |> \r\n collect()\r\n\r\n # A tibble: 6 × 2\r\n gloss n\r\n \r\n 1 go 13614\r\n 2 see 13114\r\n 3 do 11829\r\n 4 have 10794\r\n 5 want 10560\r\n 6 put 9190\r\n\r\nCombined, here's one (hastily put together) attempt at wrapping this workflow into a function:\r\n\r\n\r\nload_dataset_from_gh <- function(con, tblname, user, repo, branch, regex,\r\n partition = TRUE, lazy = TRUE) {\r\n \r\n allfiles <- gh::gh(glue::glue(\"/repos/{user}/{repo}/git/trees/{branch}?recursive=true\"))$tree\r\n files_relpath <- grep(regex, sapply(allfiles, `[[`, \"path\"), value = TRUE)\r\n # Use the actual Contents API here instead, if the repo is private\r\n files <- glue::glue(\"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{files_relpath}\")\r\n \r\n type <- if (lazy) quote(VIEW) else quote(TABLE)\r\n partition <- as.integer(partition)\r\n \r\n dbExecute(con, \"LOAD httpfs;\")\r\n dbExecute(con, glue::glue_sql(\"\r\n CREATE {type} {`tblname`} AS\r\n SELECT *\r\n FROM PARQUET_SCAN([{files*}], hive_partitioning={partition})\r\n \", .con = con))\r\n \r\n invisible(TRUE)\r\n\r\n}\r\n\r\ncon2 <- dbConnect(duckdb())\r\nload_dataset_from_gh(\r\n con = con2,\r\n tblname = \"tokens_data\",\r\n user = \"yjunechoe\",\r\n repo = \"repetition_events\",\r\n branch = \"master\",\r\n regex = \".*data/tokens_data/.*parquet$\"\r\n)\r\ntbl(con2, \"tokens_data\")\r\n\r\n # Source: table<tokens_data> [?? x 4]\r\n # Database: DuckDB v1.0.0 [jchoe@Windows 10 x64:R 4.4.1/:memory:]\r\n utterance_id gloss part_of_speech childID\r\n \r\n 1 1 www \"\" 1\r\n 2 2 bye \"co\" 1\r\n 3 3 mhm \"co\" 1\r\n 4 4 Mommy's \"n:prop\" 1\r\n 5 4 here \"adv\" 1\r\n 6 5 wanna \"mod:aux\" 1\r\n 7 5 sit \"v\" 1\r\n 8 5 down \"adv\" 1\r\n 9 6 there \"adv\" 1\r\n 10 7 let's \"v\" 1\r\n # ℹ more rows\r\n\r\nOther sources for data\r\nIn writing this blog post, I'm indebted to all the knowledgeable folks on Mastodon who suggested their own recommended tools and workflows for various kinds of remote data. Unfortunately, I'm not familiar enough with most of them to do them justice, but I still wanted to record the suggestions I got from there for posterity.\r\nFirst, a post about reading remote files would not be complete without a mention of the wonderful {googlesheets4} package for reading from Google Sheets. 
I debated whether I should include a larger discussion of {googlesheets4}, and despite using it quite often myself I ultimately decided to omit it for the sake of space and because the package website is already very comprehensive. I would suggest starting from the Get Started vignette if you are new and interested.\r\nSecond, along the lines of {osfr}, there are other similar rOpenSci packages for retrieving data from the kinds of data sources that may be of interest to academics, such as {deposits} for Zenodo and Figshare, and {piggyback} for GitHub release assets (Maëlle Salmon's comment pointed me to the first two; I responded with some of my experiences). I was also reminded that {pins} exists - I'm not familiar with it myself so I thought I wouldn't write anything for it here BUT Isabella Velásquez came in clutch sharing a recent talk on dynamically loading up-to-date data with {pins} which is a great demo of the unique strengths of {pins}.\r\nLastly, I inadvertently(?) started some discussion around remotely accessing spatial files. I don't work with spatial data at all but I can totally imagine how the hassle of the traditional click-download-find-load workflow would be even more pronounced for spatial data which are presumably much larger in size and more difficult to preview. On this note, I'll just link to Carl Boettiger's comment about the fact that GDAL has a virtual file system that you can interface with from R packages wrapping this API (ex: {gdalraster}), and to Michael Sumner's comment/gist + Chris Toney's comment on the fact that you can even use this feature to stream non-spatial data!\r\nMiscellaneous tips and tricks\r\nI also have some random tricks that are more situational. Unfortunately, I can only recall like 20% of them at any given moment, so I'll be updating this space as more come back to me:\r\nWhen reading remote .rda or .RData files with load(), you may need to wrap the link in url() first (ref: stackoverflow).\r\n{vroom} can remotely read gzipped files, without having to download.file() and unzip() first.\r\n{curl}, of course, will always have the most comprehensive set of low-level tools you need to read any arbitrary data remotely. 
For example, using curl::curl_fetch_memory() to read the dplyr::starwars data again from the GitHub raw contents link:\r\n\r\n\r\nfetched <- curl::curl_fetch_memory(\r\n \"https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv\"\r\n)\r\nread.csv(text = rawToChar(fetched$content)) |> \r\n dplyr::glimpse()\r\n\r\n Rows: 87\r\n Columns: 14\r\n $ name <chr> \"Luke Skywalker\", \"C-3PO\", \"R2-D2\", \"Darth Vader\", \"Leia Or…\r\n $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…\r\n $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…\r\n $ hair_color <chr> \"blond\", NA, NA, \"none\", \"brown\", \"brown, grey\", \"brown\", N…\r\n $ skin_color <chr> \"fair\", \"gold\", \"white, blue\", \"white\", \"light\", \"light\", \"…\r\n $ eye_color <chr> \"blue\", \"yellow\", \"red\", \"yellow\", \"brown\", \"blue\", \"blue\",…\r\n $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …\r\n $ sex <chr> \"male\", \"none\", \"none\", \"male\", \"female\", \"male\", \"female\",…\r\n $ gender <chr> \"masculine\", \"masculine\", \"masculine\", \"masculine\", \"femini…\r\n $ homeworld <chr> \"Tatooine\", \"Tatooine\", \"Naboo\", \"Tatooine\", \"Alderaan\", \"T…\r\n $ species <chr> \"Human\", \"Droid\", \"Droid\", \"Human\", \"Human\", \"Human\", \"Huma…\r\n $ films <chr> \"A New Hope, The Empire Strikes Back, Return of the Jedi, R…\r\n $ vehicles <chr> \"Snowspeeder, Imperial Speeder Bike\", \"\", \"\", \"\", \"Imperial…\r\n $ starships <chr> \"X-wing, Imperial shuttle\", \"\", \"\", \"TIE Advanced x1\", \"\", …\r\n\r\nEven if you're going the route of downloading the file first, curl::multi_download() can offer big performance improvements over download.file().8 Many {curl} functions can also handle retries and stop/resumes, which is cool too.\r\n{httr2} can capture a continuous data stream with httr2::req_perform_stream() up to a set time or size.\r\nsessionInfo()\r\n\r\n\r\nsessionInfo()\r\n\r\n R version 4.4.1 (2024-06-14 ucrt)\r\n Platform: x86_64-w64-mingw32/x64\r\n Running under: Windows 11 x64 (build 22631)\r\n \r\n Matrix products: default\r\n \r\n \r\n locale:\r\n [1] LC_COLLATE=English_United States.utf8 \r\n [2] LC_CTYPE=English_United States.utf8 \r\n [3] LC_MONETARY=English_United States.utf8\r\n [4] LC_NUMERIC=C \r\n [5] LC_TIME=English_United States.utf8 \r\n \r\n time zone: America/New_York\r\n tzcode source: internal\r\n \r\n attached base packages:\r\n [1] stats graphics grDevices utils datasets methods base \r\n \r\n other attached packages:\r\n [1] dplyr_1.1.4 duckdb_1.0.0 DBI_1.2.3 ggplot2_3.5.1.9000\r\n \r\n loaded via a namespace (and not attached):\r\n [1] rappdirs_0.3.3 sass_0.4.9 utf8_1.2.4 generics_0.1.3 \r\n [5] xml2_1.3.6 distill_1.6 digest_0.6.35 magrittr_2.0.3 \r\n [9] evaluate_0.24.0 grid_4.4.1 blob_1.2.4 fastmap_1.1.1 \r\n [13] jsonlite_1.8.8 processx_3.8.4 chromote_0.3.1 ps_1.7.5 \r\n [17] promises_1.3.0 httr_1.4.7 rvest_1.0.4 purrr_1.0.2 \r\n [21] fansi_1.0.6 scales_1.3.0 httr2_1.0.3.9000 jquerylib_0.1.4 \r\n [25] cli_3.6.2 rlang_1.1.4 dbplyr_2.5.0 gitcreds_0.1.2 \r\n [29] bit64_4.0.5 munsell_0.5.1 withr_3.0.1 cachem_1.0.8 \r\n [33] yaml_2.3.8 tools_4.4.1 tzdb_0.4.0 memoise_2.0.1 \r\n [37] colorspace_2.1-1 assertthat_0.2.1 curl_5.2.1 vctrs_0.6.5 \r\n [41] R6_2.5.1 lifecycle_1.0.4 emphatic_0.1.8 bit_4.0.5 \r\n [45] arrow_16.1.0 pkgconfig_2.0.3 pillar_1.9.0 bslib_0.7.0 \r\n [49] later_1.3.2 gtable_0.3.5 glue_1.7.0 gh_1.4.0 \r\n [53] Rcpp_1.0.12 xfun_0.47 tibble_3.2.1 tidyselect_1.2.1 \r\n [57] highr_0.11 rstudioapi_0.16.0 knitr_1.47 htmltools_0.5.8.1\r\n 
[61] websocket_1.4.1 rmarkdown_2.27 compiler_4.4.1 downlit_0.4.4\r\n\r\n\r\n\r\n\r\nThanks @tanho for pointing me to this at the R4DS/DSLC slack.↩︎\r\nNote that the API will actually generate a new token every time you send a request (and again, these tokens will expire with time).↩︎\r\nThe special value \"clipboard\" works for most base-R read functions that take a file or con argument.↩︎\r\nThanks @coolbutuseless for pointing me to textConnection()!↩︎\r\nOr READ_PARQUET - same thing.↩︎\r\nWe can also get this formatting with a combination of shQuote() and toString().↩︎\r\nWhereas CREATE TABLE results in a physical copy of the data in memory, CREATE VIEW will dynamically fetch the data from the source every time you query the table. If the data fits into memory (as in this case), I prefer CREATE TABLE as queries will be much faster (though you pay up-front for the time copying the data). If the data is larger than memory, CREATE VIEW will be your only option.↩︎\r\nSee an example implemented for {openalexR}, an API package.↩︎\r\n",
"preview": "posts/2024-09-22-fetch-files-web/github-dplyr-starwars.jpg",
- "last_modified": "2024-09-22T18:49:08-04:00",
+ "last_modified": "2024-09-22T15:49:08-07:00",
"input_file": {}
},
{
@@ -34,7 +34,7 @@
],
"contents": "\r\n\r\nContents\r\nTake the argument name and negate it - is the intention clear?\r\nLook at the argument name - is it verb-y without an object?\r\nIs the argument a scalar adjective? Consider naming the scale.\r\nIs the argument truly binary? Still prefer enum and name the obvious/absence.\r\nMove shared strings across options into the argument name\r\n\r\nIâve been having a blast reading through the Tidy design principles book lately - itâs packed with just the kind of stuff I needed to hear at this stage of my developer experience. And actually, I started writing packages in the post-{devtools}/R Packages era, so I wasnât too surprised to find that my habits already align with many of the design principles advocated for in the book.1\r\nBut there was one pattern which took me a bit to fully wrap my head around (and be fully convinced by). Itâs first introduced in the chapter âEnumerate possible optionsâ which gives a pretty convincing example of the base R function rank(). rank() has a couple options for resolving ties between values which are exposed to the user via the ties.method argument. The default value of this argument is a vector that enumerates all the possible options, and the userâs choice of (or the lack of) an option is resolved through match.arg() and then the appropriate algorithm is called via a switch() statement.\r\nThis is all good and well, but the book takes it a step further in a later chapter âPrefer an enum, even if only two choicesâ, which outlines what I personally consider to be one of the more controversial (and newer2) strategies advocated for in the book. Itâs a specific case of the âenumerate possible optionsâ principle applied to boolean arguments, and is best understood with an example (of sort() vs. vctrs::vec_sort(), from the book):\r\n\r\n\r\n# Booolean options\r\nsort(x, decreasing = TRUE)\r\nsort(x, decreasing = FALSE)\r\n\r\n# Enumerated options\r\nvctrs::vec_sort(x, direction = \"desc\")\r\nvctrs::vec_sort(x, direction = \"asc\")\r\n\r\n\r\nThe main argument for this pattern is one of clarity. In the case of the example above, it is unclear from reading decreasing = FALSE whether that expresses âsort in the opposite of decreasing order (i.e., increasing/ascending)â or âdo not sort in decreasing order (ex: leave it alone)â. The former is the correct interpretation, and this is expressed much clearer with direction = \"asc\", which contrasts with the other option direction = \"desc\".3\r\nIâve never used this pattern for boolean options previously, but itâs been growing on me and Iâm starting to get convinced. But in thinking through its implementation for refactoring code that I own and/or use, I got walled by the hardest problem in CS: naming things. 
A lot has been said on how to name things, but I've realized that the case of "turn booleans into enums" raises a whole different naming problem, one where you have to be precise about what's being negated, the alternatives that are being contrasted, and the scale that the enums lie on.\r\nWhat follows are my somewhat half-baked, unstructured thoughts on some heuristics that I hope can be useful for determining when to apply the "enumerate possible options" principle for boolean options, and how to rename them in the refactoring.\r\nTake the argument name and negate it - is the intention clear?\r\nOne good litmus test for whether you should convert your boolean option into an enum is to take the argument name X and turn it into "X" and "not-X" - is the intended behavior expressed clearly in the context of the function? If, conceptually, the options are truly and unambiguously binary, then it should still make sense. But if the TRUE/FALSE options assume a very particular contrast which is difficult to recover from just reading "X" vs. "not-X", consider using an enum for the two options.\r\nTo take sort() as an example again, imagine if we were to re-write it as:\r\n\r\n\r\nsort(option = \"decreasing\")\r\nsort(option = \"not-decreasing\")\r\n\r\n\r\nIf \"decreasing\" vs. \"not-decreasing\" is ambiguous, then maybe that's a sign to consider ditching the boolean pattern and spell out the options more explicitly with e.g., direction = \"desc\" and direction = \"asc\", as vctrs::vec_sort() does. I also think this is a useful exercise because it reflects the user's experience when encountering boolean options.\r\nLook at the argument name - is it verb-y without an object?\r\nLet's take a bigger offender of this principle as an example: ggplot2::facet_grid(). facet_grid() is a function that I use all the time, and it has a couple boolean arguments which make no immediate sense to me. Admittedly, I've never actually used them in practice, but from all my experience with {ggplot2} and facet_grid(), shouldn't I be able to get at least some clues as to what they do from reading the arguments?4\r\n\r\n\r\nFilter(is.logical, formals(ggplot2::facet_grid))\r\n\r\n $shrink\r\n [1] TRUE\r\n \r\n $as.table\r\n [1] TRUE\r\n \r\n $drop\r\n [1] TRUE\r\n \r\n $margins\r\n [1] FALSE\r\n\r\nTake for example the shrink argument. Right off the bat it already runs into the problem where it's not clear what we're shrinking. I find this to be a general problem with boolean arguments: they're often verbs with the object omitted (presumably to save keystrokes). Using the heuristic of negating the argument, we get "shrink" vs. "don't shrink", which not only repeats the problem of the ambiguity of negation as we saw with sort() previously, but also exposes how serious the problem of missing the object of the verb is.\r\nAt this point you may be wondering what exactly the shrink argument does at all. From the docs:\r\n\r\nIf TRUE, will shrink scales to fit output of statistics, not raw data. If FALSE, will be range of raw data before statistical summary.\r\n\r\nThe intended contrast seems to be one of "statistics" (default) vs. "raw data", so these are obvious candidates for our enum refactoring. But something like shrink = c(\"statistics\", \"raw-data\") doesn't quite cut it yet, because the object of shrinking is not the data, but the scales. 
So to be fully informative, the argument name should complete the verb phrase (i.e., include the object).\r\nCombining the observations from above, I think the following makes more sense:\r\n\r\n\r\n# Boolean options\r\nfacet_grid(shrink = TRUE)\r\nfacet_grid(shrink = FALSE)\r\n\r\n# Enumerated options\r\nfacet_grid(shrink_scales_to = \"statistics\")\r\nfacet_grid(shrink_scales_to = \"raw-data\")\r\n\r\n\r\nThis last point is a bit of a tangent, but after tinkering with the behavior of shrink more, I don't think "shrink" is a particularly useful description here. I might actually prefer something more neutral like fit_scales_to.\r\nIs the argument a scalar adjective? Consider naming the scale.\r\nLoosely speaking, scalar (a.k.a. gradable) adjectives are adjectives that can be strengthened (or weakened) - English grammar can express this with the suffixes "-er" and "-est". For example, "tall" is a scalar adjective because you can say "taller" and "tallest", and scalar adjectives are called such because they lie on a scale (in this case, the scale of height). Note that the quality of an adjective as a scalar one is not so clear though, as you can "more X" or "most X" just about any adjective X (e.g., even true vs. false can lie on a scale of more true or more false) - what matters more is if saying something like "more X" makes sense in the context of where X is found (e.g., the context of the function).5 If so, you're dealing with a scalar adjective.\r\nThis Linguistics 101 tangent is relevant here because I often see boolean arguments named after scalar adjectives, but I feel like in those cases it's better to just name the scale itself (which in turn makes the switch to enum more natural).\r\nA contrived example would be if a function had a boolean argument called tall. To refactor this into an enum, we can rename the argument to the scale itself (height) and enumerate the two end points:\r\n\r\n\r\n# Boolean options\r\nfun(tall = TRUE)\r\nfun(tall = FALSE)\r\n\r\n# Enumerated options\r\nfun(height = \"tall\")\r\nfun(height = \"short\")\r\n\r\n\r\nA frequent offender of the enum principle in the wild is the verbose argument. verbose is an interesting case study because it suffers from the additional problem of there possibly being more than 2 options as the function matures. The book offers some strategies for remedying these kinds of problems after-the-fact, but I think a proactive solution is to name the argument verbosity (the name of the scale) with the possible options enumerated (see also a recent Mastodon thread that has great suggestions on this topic).\r\n\r\n\r\n# Boolean options\r\nfun(verbose = TRUE)\r\nfun(verbose = FALSE)\r\n\r\n# Enumerated options\r\nfun(verbosity = \"all\")\r\nfun(verbosity = \"none\")\r\n\r\n\r\nI like this strategy of "naming the scale" because it gives off the impression to users that the possible options are values that lie on the scale. In the example above, it could either be the extremes \"all\" or \"none\", but also possibly somewhere in between if the writer of the function chooses to introduce more granular settings later.\r\nIs the argument truly binary? Still prefer enum and name the obvious/absence.\r\nSometimes a boolean argument may encode a genuinely binary choice of a true/false, on/off, yes/no option. But refactoring the boolean options as enum may still offer some benefits. In those cases, I prefer the strategy of name the obvious/absence.\r\nSome cases for improvement are easier to spot than others. 
An easy case is something like the REML argument in lme4::lmer(). Without going into too much detail, when REML = TRUE (default), the model optimizes the REML (restricted/residualized maximum likelihood) criterion in finding the best fitting model. But it's not like the model doesn't use any criteria for goodness of fit when REML = FALSE. Instead, when REML = FALSE, the function uses a different criterion of ML (maximum likelihood). So the choice is not really between toggling REML on or off, but rather between the choice of REML vs. ML. The enum version lets us spell out the assumed default and make the choice between the two explicit (again, with room for introducing other criteria in the future):\r\n\r\n\r\n# Boolean options\r\nlme4::lmer(REML = TRUE)\r\nlme4::lmer(REML = FALSE)\r\n\r\n# Enumerated options\r\nlme4::lmer(criterion = \"REML\")\r\nlme4::lmer(criterion = \"ML\")\r\n\r\n\r\nA somewhat harder case is a true presence-or-absence kind of a situation, where setting the argument to true/false essentially boils down to triggering an if block inside the function. For example, say a function has an option to use an optimizer called "MyOptim". This may be implemented as:\r\n\r\n\r\n# Boolean options\r\nfun(optimize = TRUE)\r\nfun(optimize = FALSE)\r\n\r\n\r\nEven if the absence of optimization is not nameable, you could just call that option something like \"none\" for the enum pattern, which makes the choices explicit:\r\n\r\n\r\n# Enumerated options\r\nfun(optimizer = \"MyOptim\")\r\nfun(optimizer = \"none\")\r\n\r\n\r\nOf course, the more difficult case is when the thing that's being toggled isn't really nameable. I think this is more often the case in practice, and may be the reason why there are many verb-y names for arguments with boolean options. Like, you wrote some code that optimizes something, but you have no name for it, so the argument that toggles it simply refers to its function, like "should the function optimize?".\r\nBut not all is lost. I think one way out of this would be to enumerate over placeholders, not necessarily names. So something like:\r\n\r\n\r\n# Enumerated options (placeholders)\r\nfun(optimizer = 1) # bespoke optimizer\r\nfun(optimizer = 0) # none\r\n\r\n\r\nThen the documentation can clarify what the placeholder values 0, 1, etc. represent in longer, paragraph form, to describe what they do without the pressure of having to name the options.6 It's not pretty, but I don't think there will ever be a pretty solution to this problem if you want to avoid naming things entirely.\r\nMove shared strings across options into the argument name\r\nThis one is simple and easily demonstrated with an example. Consider the matrix() function for constructing a matrix. It has an argument byrow which fills the matrix by column when FALSE (default) or by row when TRUE. The argument controls the margin of fill, so we could re-write it as a fill argument like so:\r\n\r\n\r\n# Boolean options\r\nmatrix(byrow = FALSE)\r\nmatrix(byrow = TRUE)\r\n\r\n# Enumerated options\r\nmatrix(fill = \"bycolumn\")\r\nmatrix(fill = \"byrow\")\r\n\r\n\r\nThe options \"bycolumn\" and \"byrow\" share the "by" string, so we could move that into the argument name:\r\n\r\n\r\nmatrix(fill_by = \"column\")\r\nmatrix(fill_by = \"row\")\r\n\r\n\r\nAt this point I was also wondering whether the enumerated options should have the shortened \"col\" or the full \"column\" name. 
At the moment I'm less decided about this, but note that given the partial matching behavior in match.arg(), you could get away with matrix(fill_by = \"col\") in both cases.\r\nAt least from the book's examples, it looks like shortening is ok for the options. To repeat the vctrs::vec_sort() example from earlier:\r\n\r\n\r\nvctrs::vec_sort(x, direction = \"desc\") # vs. \"descending\"\r\nvctrs::vec_sort(x, direction = \"asc\") # vs. \"ascending\"\r\n\r\n\r\nI was actually kind of surprised by this when I first saw it, and I have mixed feelings especially for \"asc\" since that's not very frequent as a shorthand for "ascending" (e.g., {dplyr} has desc() but not an asc() equivalent - see also the previous section on "naming the obvious"). So I feel like I'd prefer for this to be spelled out in full in the function, and users can still loosely do partial matching in practice.7\r\n\r\nThe fun part of reading the book for me is not necessarily about discovering new patterns, but about being able to put a name to them and think more critically about their pros and cons.↩︎\r\nTo quote the book: "… this is a pattern that we only discovered relatively recently"↩︎\r\nThe book describes the awkwardness of decreasing = FALSE as "feels like a double negative", but I think this is just a general, pervasive problem of pragmatic ambiguity with negation, and this issue of "what exactly is being negated?" is actually one of my research topics! Negation is interpreted with respect to the relevant and accessible alternatives (which "desc" vs. "asc" establishes very well) - in turn, recovering the intended meaning of the negation is difficult deprived of that context (like in the case of "direction = TRUE/FALSE"). See: Alternative Semantics.↩︎\r\nTo pre-empt the preference for short argument names, the fact that users don't reach for these arguments in everyday use of facet_grid() should loosen that constraint for short, easy-to-type names. IMO the "too much to type" complaint since time immemorial is already obviated by auto-complete, and should frankly just be ignored when designing these kinds of esoteric arguments that only experienced users would reach for in very specific circumstances.↩︎\r\nTry this from the viewpoint of both the developer and the user!↩︎\r\nIMO, {collapse} does a very good job at this (see ?TRA).↩︎\r\nOf course, the degree to which you'd encourage this should depend on how sure you are about the stability of the current set of enumerated options.↩︎\r\n",
"preview": "posts/2024-07-21-enumerate-possible-options/preview.jpg",
- "last_modified": "2024-09-01T17:53:55-04:00",
+ "last_modified": "2024-09-01T14:53:55-07:00",
"input_file": {}
},
{
@@ -53,7 +53,7 @@
],
"contents": "\r\n\r\nContents\r\nave()\r\nThe problem\r\nSome {tidyverse} solutions\r\nAn ave() + {dplyr} solution\r\nAside: {data.table} đ¤ {collapse}\r\nsessionInfo()\r\n\r\nI think itâs safe to say that the average {dplyr} user does not know the ave() function. For that audience, this is a short appreciation post on ave(), a case of tidyverse and base R.\r\nave()\r\nave() is a split-apply-combine function in base R (specifically, {stats}). Itâs a pretty short function - maybe you can make out what it does from just reading the code1\r\n\r\n\r\nave\r\n\r\n function (x, ..., FUN = mean) \r\n {\r\n if (missing(...)) \r\n x[] <- FUN(x)\r\n else {\r\n g <- interaction(...)\r\n split(x, g) <- lapply(split(x, g), FUN)\r\n }\r\n x\r\n }\r\n \r\n \r\n\r\nDespite its (rather generic and uninformative) name, I like to think of ave() as actually belonging to the *apply() family of functions, having particularly close ties to tapply().\r\nA unique feature of ave() is the invariant that it returns a vector of the same length as the input. And if you use an aggregating function like sum() or mean(), it simply repeats those values over the observations on the basis of their grouping.\r\nFor example, whereas tapply() can be used to summarize the average mpg by cyl:\r\n\r\n\r\ntapply(mtcars$mpg, mtcars$cyl, FUN = mean)\r\n\r\n 4 6 8 \r\n 26.66364 19.74286 15.10000\r\n\r\nThe same syntax with ave() will repeat those values over each element of the input vector:\r\n\r\n\r\nave(mtcars$mpg, mtcars$cyl, FUN = mean)\r\n\r\n [1] 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364\r\n [9] 26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000\r\n [17] 15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000\r\n [25] 15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364\r\n\r\nYou can also get to this output from tapply() with an extra step of vectorized indexing:\r\n\r\n\r\ntapply(mtcars$mpg, mtcars$cyl, FUN = mean)[as.character(mtcars$cyl)]\r\n\r\n 6 6 4 6 8 6 8 4 \r\n 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364 \r\n 4 6 6 8 8 8 8 8 \r\n 26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000 \r\n 8 4 4 4 4 8 8 8 \r\n 15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000 \r\n 8 4 4 4 8 6 8 4 \r\n 15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364\r\n\r\nThe problem\r\nNothing sparks more joy than when a base R function helps you write more âtidyâ code. 
The problem\r\nNothing sparks more joy than when a base R function helps you write more “tidy” code. I’ve talked about this at length before with outer() in a prior blog post on dplyr::slice(), and here I want to show a cool ave() + dplyr::mutate() combo.\r\nThis example is adapted from a reprex by Cédric Scherer2 on the DSLC (previously R4DS) slack.\r\nGiven an input of multiple discrete columns and the frequencies of these values:\r\n\r\n\r\ninput <- data.frame(\r\n a = c(\"A\", \"A\", \"A\", \"B\"), \r\n b = c(\"X\", \"Y\", \"Y\", \"Z\"), \r\n c = c(\"M\", \"N\", \"O\", \"O\"), \r\n freq = c(5, 12, 3, 7)\r\n)\r\ninput\r\n\r\n a b c freq\r\n 1 A X M 5\r\n 2 A Y N 12\r\n 3 A Y O 3\r\n 4 B Z O 7\r\n\r\nThe task is to add new columns named freq_* that show the total frequency of the values in each column:\r\n\r\n\r\noutput <- data.frame(\r\n a = c(\"A\", \"A\", \"A\", \"B\"), \r\n freq_a = c(20, 20, 20, 7),\r\n b = c(\"X\", \"Y\", \"Y\", \"Z\"),\r\n freq_b = c(5, 15, 15, 7), \r\n c = c(\"M\", \"N\", \"O\", \"O\"), \r\n freq_c = c(5, 12, 10, 10), \r\n freq = c(5, 12, 3, 7)\r\n)\r\noutput\r\n\r\n a freq_a b freq_b c freq_c freq\r\n 1 A 20 X 5 M 5 5\r\n 2 A 20 Y 15 N 12 12\r\n 3 A 20 Y 15 O 10 3\r\n 4 B 7 Z 7 O 10 7\r\n\r\nSo for example, in column a the value \"A\" is associated with values 5, 12, and 3 in the freq column, so a new freq_a column should be created to track their total frequencies 5 + 12 + 3 and associate that value (20) with all occurrences of \"A\" in the a column.\r\nSome {tidyverse} solutions\r\nThe gut feeling is that this seems to lack a straightforwardly “tidy” solution. I mean, the input isn’t even tidy3 in the first place!\r\nSo maybe we’d be better off starting with a pivoted, tidy form of the data for constructing a tidy solution:\r\n\r\n\r\nlibrary(tidyverse)\r\ninput %>% \r\n pivot_longer(-freq)\r\n\r\n # A tibble: 12 × 3\r\n freq name value\r\n \r\n 1 5 a A \r\n 2 5 b X \r\n 3 5 c M \r\n 4 12 a A \r\n 5 12 b Y \r\n 6 12 c N \r\n 7 3 a A \r\n 8 3 b Y \r\n 9 3 c O \r\n 10 7 a B \r\n 11 7 b Z \r\n 12 7 c O\r\n\r\nBut recall that the desired output is of a wide form like the input, so it looks like our tidy solution will require some indirection, involving something like:\r\n\r\n\r\ninput %>% \r\n pivot_longer(-freq) %>% \r\n ... 
%>% \r\n pivot_wider(...)\r\n\r\n\r\nOr maybe you’d rather tackle this with some left_join()s, like:\r\n\r\n\r\ninput %>% \r\n left_join(summarize(input, freq_a = sum(freq), .by = a)) %>% \r\n ...\r\n\r\n\r\nI’ll note that there’s actually also an idiomatic {dplyr} solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because it doesn’t vectorize over the first argument:\r\n\r\n\r\ninput %>% \r\n add_count(a, wt = freq, name = \"freq_a\") %>% \r\n add_count(b, wt = freq, name = \"freq_b\") %>% \r\n add_count(c, wt = freq, name = \"freq_c\")\r\n\r\n a b c freq freq_a freq_b freq_c\r\n 1 A X M 5 20 5 5\r\n 2 A Y N 12 20 15 12\r\n 3 A Y O 3 20 15 10\r\n 4 B Z O 7 7 7 10\r\n\r\nYou could try to scale this add_count() solution with reduce() (see my previous blog post on collapsing repetitive piping), but now we’re straying very far from “tidy” territory:\r\n\r\n\r\ninput %>% \r\n purrr::reduce(\r\n c(\"a\", \"b\", \"c\"),\r\n ~ .x %>% \r\n add_count(.data[[.y]], wt = freq, name = paste0(\"freq_\", .y)),\r\n .init = .\r\n )\r\n\r\n a b c freq freq_a freq_b freq_c\r\n 1 A X M 5 20 5 5\r\n 2 A Y N 12 20 15 12\r\n 3 A Y O 3 20 15 10\r\n 4 B Z O 7 7 7 10\r\n\r\nIMO this problem is actually a really good thinking exercise for the “average {dplyr} user”, so I encourage you to take a stab at this yourself before proceeding if you’ve read this far!\r\nAn ave() + {dplyr} solution\r\nThe crucial piece of the puzzle here is to think a little outside the box, beyond “data(frame) wrangling”.\r\nThe problem becomes simpler once we think about it in terms of “(column) vector wrangling” first, and that’s where ave() comes in!\r\nI’ll start with the cake first - this is the one-liner ave() solution I advocated for:\r\n\r\n\r\ninput %>% \r\n mutate(across(a:c, ~ ave(freq, .x, FUN = sum), .names = \"freq_{.col}\"))\r\n\r\n a b c freq freq_a freq_b freq_c\r\n 1 A X M 5 20 5 5\r\n 2 A Y N 12 20 15 12\r\n 3 A Y O 3 20 15 10\r\n 4 B Z O 7 7 7 10\r\n\r\nTaking column freq_a as an example, the ave() part of the solution essentially creates this vector of summed-up freq values by the categories of a:\r\n\r\n\r\nave(input$freq, input$a, FUN = sum)\r\n\r\n [1] 20 20 20 7\r\n\r\nFrom there, across() handles the iteration over columns and, as an added bonus, the naming of the new columns in convenient {glue} syntax (\"freq_{.col}\").\r\nIt’s the perfect mashup of base R + tidyverse. Base R takes care of the problem at the vector level with a split-apply-combine that’s concisely expressed with ave(), and tidyverse scales that solution up to the dataframe level with mutate() and across().\r\ntidyverse 🤝 base R\r\n
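If you wanted to stay entirely in base R, the same ave() trick also scales with plain lapply() and [<- assignment - a rough sketch (not from the original post):\r\n\r\n\r\ncols <- c(\"a\", \"b\", \"c\")\r\n# For each grouping column, compute the ave() vector and assign it to a new freq_* column\r\ninput[paste0(\"freq_\", cols)] <- lapply(input[cols], function(x) ave(input$freq, x, FUN = sum))\r\n\r\n\r\n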
Aside: {data.table} 🤝 {collapse}\r\nSince I wrote this blog post, I discovered that {data.table} recently added support for using names(.SD) in the LHS of the walrus :=. I’m so excited for this to hit the next release (v1.16.0)!\r\nI’ve been trying to be more mindful of showcasing {data.table} whenever I talk about {dplyr}, so here’s a solution to compare with the dplyr::across() solution above.\r\n\r\n\r\n\r\n\r\n\r\n# data.table::update_dev_pkg()\r\nlibrary(data.table)\r\ninput_dt <- as.data.table(input)\r\ninput_dt\r\n\r\n a b c freq\r\n \r\n 1: A X M 5\r\n 2: A Y N 12\r\n 3: A Y O 3\r\n 4: B Z O 7\r\n\r\n\r\n\r\ninput_dt[, paste0(\"freq_\", names(.SD)) := lapply(.SD, \\(x) ave(freq, x, FUN = sum)), .SDcols = a:c]\r\ninput_dt\r\n\r\n a b c freq freq_a freq_b freq_c\r\n \r\n 1: A X M 5 20 5 5\r\n 2: A Y N 12 20 15 12\r\n 3: A Y O 3 20 15 10\r\n 4: B Z O 7 7 7 10\r\n\r\nIn practice, I often pair {data.table} with {collapse}, where the latter provides a rich and performant set of split-apply-combine vector operations, much like ave(). In {collapse}, ave(..., FUN = sum) can be expressed as fsum(..., TRA = \"replace\"):\r\n\r\n\r\nlibrary(collapse)\r\nave(input_dt$freq, input_dt$a, FUN = sum)\r\n\r\n [1] 20 20 20 7\r\n\r\nfsum(input_dt$freq, input_dt$a, TRA = \"replace\") # Also, TRA = 2\r\n\r\n [1] 20 20 20 7\r\n\r\nSo a version of the solution integrating fsum() would be:4\r\n\r\n\r\ninput_dt[, names(.SD) := NULL, .SDcols = patterns(\"^freq_\")]\r\ninput_dt[, paste0(\"freq_\", names(.SD)) := lapply(.SD, \\(x) fsum(freq, x, TRA = 2)), .SDcols = a:c]\r\ninput_dt\r\n\r\n a b c freq freq_a freq_b freq_c\r\n \r\n 1: A X M 5 20 5 5\r\n 2: A Y N 12 20 15 12\r\n 3: A Y O 3 20 15 10\r\n 4: B Z O 7 7 7 10\r\n\r\ndata.table 🤝 collapse\r\nsessionInfo()\r\n\r\n\r\nsessionInfo()\r\n\r\n R version 4.4.1 (2024-06-14 ucrt)\r\n Platform: x86_64-w64-mingw32/x64\r\n Running under: Windows 11 x64 (build 22631)\r\n \r\n Matrix products: default\r\n \r\n \r\n locale:\r\n [1] LC_COLLATE=English_United States.utf8 \r\n [2] LC_CTYPE=English_United States.utf8 \r\n [3] LC_MONETARY=English_United States.utf8\r\n [4] LC_NUMERIC=C \r\n [5] LC_TIME=English_United States.utf8 \r\n \r\n time zone: Asia/Seoul\r\n tzcode source: internal\r\n \r\n attached base packages:\r\n [1] stats graphics grDevices utils datasets methods base \r\n \r\n other attached packages:\r\n [1] collapse_2.0.14 data.table_1.15.99 lubridate_1.9.3 forcats_1.0.0 \r\n [5] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 \r\n [9] tidyr_1.3.1 tibble_3.2.1 tidyverse_2.0.0 ggplot2_3.5.1 \r\n \r\n loaded via a namespace (and not attached):\r\n [1] gtable_0.3.5 jsonlite_1.8.8 compiler_4.4.1 Rcpp_1.0.12 \r\n [5] tidyselect_1.2.1 parallel_4.4.1 jquerylib_0.1.4 scales_1.3.0 \r\n [9] yaml_2.3.8 fastmap_1.1.1 R6_2.5.1 generics_0.1.3 \r\n [13] knitr_1.47 distill_1.6 munsell_0.5.0 tzdb_0.4.0 \r\n [17] bslib_0.7.0 pillar_1.9.0 rlang_1.1.4 utf8_1.2.4 \r\n [21] stringi_1.8.4 cachem_1.0.8 xfun_0.44 sass_0.4.9 \r\n [25] timechange_0.2.0 memoise_2.0.1 cli_3.6.2 withr_3.0.0 \r\n [29] magrittr_2.0.3 digest_0.6.35 grid_4.4.1 rstudioapi_0.16.0\r\n [33] hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5 downlit_0.4.3 \r\n [37] evaluate_0.23 glue_1.7.0 fansi_1.0.6 colorspace_2.1-0 \r\n [41] rmarkdown_2.27 tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.8.1\r\n\r\n\r\nAnd check out the elusive split<- function!↩︎\r\nWho I can only assume was needing this for a fancy data viz thing 😄↩︎\r\nI mean that in the technical sense here. 
In this problem, the unit of observation is the “cells” of the input columns (the values “A”, “B”, “X”, “Y”, etc.).↩︎\r\nI couldn’t show this here with this particular example, but another nice feature of {collapse} 🤝 {data.table} is the fact that they do not shy away from consuming/producing matrices: see scale()[,1] vs. fscale() for a good example of this.↩︎\r\n",
"preview": "posts/2024-06-09-ave-for-the-average/preview.png",
- "last_modified": "2024-06-23T00:35:29-04:00",
+ "last_modified": "2024-06-22T21:35:29-07:00",
"input_file": {},
"preview_width": 926,
"preview_height": 328
@@ -75,7 +75,7 @@
],
"contents": "\r\n\r\nContents\r\nargs()\r\nargs(args)\r\nargs(args)(args)\r\nargs(args(args)(args))\r\nad infinitum\r\nHad enough args() yet?\r\nTL;DR: str()\r\nCoda (serious): redesigning args()\r\nTake 1) Display is the side-effect; output is trivial\r\nTake 2) Display is the side-effect; output is meaningful\r\nTake 3) Just remove the NULL\r\n\r\nsessionInfo()\r\n\r\nThe kind of blog posts that I have the most fun writing are those where I hyperfocus on a single function, like dplyr::slice(), purrr::reduce(), and ggplot2::stat_summary(). In writing blog posts of this kind, I naturally come across a point where I need to introduce the argument(s) that the function takes. I usually talk about them one at a time as needed, but I could start by front-loading that important piece of information first.\r\nIn fact, thereâs a function in R that lets me do exactly that, called args().\r\nargs()\r\nargs() is, in theory, a very neat function. According to ?args:\r\n\r\nDisplays the argument names and corresponding default values of a (non-primitive or primitive) function.\r\n\r\nSo, for example, I know that sum() takes the arguments ... and na.rm (with the na.rm = FALSE default). The role of args() is to display exactly that piece of information using R code. This blog runs on rmarkdown, so surely I can use args() as a convenient and fancy way of showing information about a functionâs arguments to my readers.\r\nIn this blog post, I want to talk about args(). So letâs start by looking at the argument that args() takes.\r\nOf course, I could just print args in the console:\r\n\r\n\r\nargs\r\n\r\n function (name) \r\n .Internal(args(name))\r\n \r\n \r\n\r\nBut wouldnât it be fun if I used args() itself to get this information?\r\nargs(args)\r\n\r\n\r\nargs(args)\r\n\r\n function (name) \r\n NULL\r\n\r\nOkay, so I get the function (name) piece, which is the information I wanted to show. We can see that args() takes one argument, called name, with no defaults.\r\nBut wait - whatâs that NULL doing there in the second line?\r\nHmm, I wonder if they forgot to invisible()-y return the NULL. args() is a function for displaying a functionâs arguments after all, so maybe the arguments are printed to the console as a side-effect and the actual output of args() is NULL.\r\nIf that is true, we should be able to suppress the printing of NULL with invisible():\r\n\r\n\r\ninvisible(args(args))\r\n\r\n\r\nUh oh, now everything is invisible.\r\nAlright, enough games! What exactly are you, output of args()?!\r\n\r\n\r\ntypeof(args(args))\r\n\r\n [1] \"closure\"\r\n\r\nWhat?\r\nargs(args)(args)\r\nTurns out that args(args) is actually returning a whole function thatâs a copy of args(), except with its body replaced with NULL.\r\nSo args(args) is itself a function that takes an argument called name and then returns NULL. Letâs assign it to a variable and call it like a function:\r\n\r\n\r\nabomination <- args(args)\r\n\r\n\r\n\r\n\r\nabomination(123)\r\n\r\n NULL\r\n\r\nabomination(mtcars)\r\n\r\n NULL\r\n\r\nabomination(stop())\r\n\r\n NULL\r\n\r\nThe body is just NULL, so the function doesnât care what it receives1 - it just returns NULL.\r\nIn fact, we could even pass it⌠args:\r\n\r\n\r\nargs(args)(args)\r\n\r\n NULL\r\n\r\nargs(args(args)(args))\r\nBut wait, thatâs not all! args() doesnât just accept a function as its argument. 
args(args(args)(args))\r\nBut wait, that’s not all! args() doesn’t just accept a function as its argument. From the documentation:\r\n\r\nValue\r\nNULL in case of a non-function.\r\n\r\nSo yeah - if args() receives a non-function, it just returns NULL:\r\n\r\n\r\nargs(123)\r\n\r\n NULL\r\n\r\nargs(mtcars)\r\n\r\n NULL\r\n\r\nThis applies to any non-function, including… NULL:\r\n\r\n\r\nargs(NULL)\r\n\r\n NULL\r\n\r\nAnd recall that:\r\n\r\n\r\nis.null( args(args)(args) )\r\n\r\n [1] TRUE\r\n\r\nTherefore, this is a valid expression in base R:\r\n\r\n\r\nargs(args(args)(args))\r\n\r\n NULL\r\n\r\nad infinitum\r\nFor our cursed use case of using args(f) to return a copy of f with its body replaced with NULL, only to then immediately call args(f)(f) to return NULL, it really doesn’t matter what the identity of f is as long as it’s a function.\r\nThat function can even be… args(args)!\r\nSo let’s take our args(args(args)(args)):\r\n\r\n\r\nargs( args( args )( args ))\r\n\r\n NULL\r\n\r\nAnd swap every args() with args(args):\r\n\r\n\r\nargs(args)( args(args)( args(args) )( args(args) ))\r\n\r\n NULL\r\n\r\nOr better yet, swap every args() with args(args(args)):\r\n\r\n\r\nargs(args(args))( args(args(args))( args(args(args)) )( args(args(args)) ))\r\n\r\n NULL\r\n\r\nThe above unhinged examples are a product of two patterns:\r\nThe fact that you always get function (name) NULL from wrapping args()s over args:\r\n\r\n\r\nlist(\r\n args( args),\r\n args( args(args)),\r\n args(args(args(args)))\r\n )\r\n\r\n [[1]]\r\n function (name) \r\n NULL\r\n\r\n [[2]]\r\n function (name) \r\n NULL\r\n\r\n [[3]]\r\n function (name) \r\n NULL\r\n\r\nThe fact that you can get this whole thing to return NULL by having function (name) NULL call the function object args. You can do this anywhere in the stack and the NULL will simply propagate:\r\n\r\n\r\nlist(\r\n args(args(args(args))) (args) ,\r\n args(args(args(args)) (args) ) ,\r\n args(args(args(args) (args) ))\r\n )\r\n\r\n [[1]]\r\n NULL\r\n\r\n [[2]]\r\n NULL\r\n\r\n [[3]]\r\n NULL\r\n\r\nWe could keep going but it’s tiring to type out and read all these nested args()… but did you know that there’s this thing called the pipe %>% that’s the solution to all code readability issues?\r\nHad enough args() yet?\r\nLet’s make an args() factory ARGS() …\r\n\r\n\r\nlibrary(magrittr)\r\nARGS <- function(n) {\r\n Reduce(\r\n f = \\(x,y) bquote(.(x) %>% args()),\r\n x = seq_len(n),\r\n init = quote(args)\r\n )\r\n}\r\n\r\n\r\n… to produce a sequence of args() …\r\n\r\n\r\nARGS(10)\r\n\r\n args %>% args() %>% args() %>% args() %>% args() %>% args() %>% \r\n args() %>% args() %>% args() %>% args() %>% args()\r\n\r\neval(ARGS(10))\r\n\r\n function (name) \r\n NULL\r\n\r\n… and tidy it up!\r\n\r\n\r\nARGS(10) %>% \r\n deparse1() %>% \r\n styler::style_text()\r\n\r\n args %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args()\r\n\r\nWanna see even more unhinged?\r\nLet’s try to produce a “matrix” of args(). 
You get a choice of i “rows” of piped lines, and j “columns” of args()-around-args each time - all to produce a NULL.\r\nReady?\r\n\r\n\r\nARGS2 <- function(i, j) {\r\n Reduce(\r\n f = \\(x,y) bquote(.(x) %>% (.(y))),\r\n x = rep(list(Reduce(\\(x,y) call(\"args\", x), seq_len(j), quote(args))), i)\r\n )\r\n}\r\n\r\n\r\n\r\n\r\nARGS2(5, 1) %>% \r\n deparse1() %>%\r\n styler::style_text()\r\n\r\n args(args) %>%\r\n (args(args)) %>%\r\n (args(args)) %>%\r\n (args(args)) %>%\r\n (args(args))\r\n\r\n\r\n\r\nARGS2(5, 3) %>% \r\n deparse1() %>%\r\n styler::style_text()\r\n\r\n args(args(args(args))) %>%\r\n (args(args(args(args)))) %>%\r\n (args(args(args(args)))) %>%\r\n (args(args(args(args)))) %>%\r\n (args(args(args(args))))\r\n\r\n\r\n\r\nARGS2(10, 5) %>% \r\n deparse1() %>%\r\n styler::style_text()\r\n\r\n args(args(args(args(args(args))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args))))))\r\n\r\n\r\n\r\nlist(\r\n eval(ARGS2(5, 1)),\r\n eval(ARGS2(5, 3)),\r\n eval(ARGS2(10, 5))\r\n)\r\n\r\n [[1]]\r\n NULL\r\n \r\n [[2]]\r\n NULL\r\n \r\n [[3]]\r\n NULL\r\n\r\nYay!\r\nTL;DR: str()\r\nIf you want a version of args() that does what it’s supposed to, use str() instead:2\r\n\r\n\r\nstr(args)\r\n\r\n function (name)\r\n\r\nstr(sum)\r\n\r\n function (..., na.rm = FALSE)\r\n\r\nargs() is hereafter banned from my blog.\r\nCoda (serious): redesigning args()\r\nThe context for my absurd rant above is that I was just complaining about how I think args() is a rather poorly designed function.\r\nLet’s try to redesign args(). I’ll do three takes:\r\nTake 1) Display is the side-effect; output is trivial\r\nIf the whole point of args() is to display a function’s arguments for inspection in interactive usage, then that can simply be done as a side-effect.\r\nAs I said above, str() surprisingly has this more sensible behavior out of the box. So let’s write our first redesign of args(), which just calls str():\r\n\r\n\r\nargs1 <- function(name) {\r\n str(name)\r\n}\r\nargs1(sum)\r\n\r\n function (..., na.rm = FALSE)\r\n\r\nIn args1()/str(), information about the function arguments is sent to the console.3 We know this because we can’t suppress it with invisible() but we can grab it via capture.output():\r\n\r\n\r\ninvisible( args1(sum) )\r\n\r\n function (..., na.rm = FALSE)\r\n\r\ncapture.output( args1(sum) )\r\n\r\n [1] \"function (..., na.rm = FALSE) \"\r\n\r\nFor functions whose purpose is to signal information to the console (and whose usage is limited to interactive contexts), we don’t particularly care about the output. In fact, because the focus isn’t on the output, the return value should be as trivial as possible.\r\nA recommended option is to just invisibly return NULL. This is in fact how args1() does it (via str()).4:\r\n\r\n\r\nprint( args1(sum) )\r\n\r\n function (..., na.rm = FALSE) \r\n NULL\r\n\r\nis.null( args1(sum) )\r\n\r\n function (..., na.rm = FALSE)\r\n [1] TRUE\r\n\r\nAlternatively, the function could just invisibly return what it receives,5 which is another common pattern for cases like this. 
Again, we return invisibly to avoid distracting from the fact that the point of the function is to display as the side-effect.\r\n\r\n\r\nargs2 <- function(name) {\r\n str(name)\r\n invisible(name)\r\n}\r\n\r\n\r\n\r\n\r\nargs2(rnorm)\r\n\r\n function (n, mean = 0, sd = 1)\r\n\r\n\r\n\r\nargs2(rnorm)(5)\r\n\r\n function (n, mean = 0, sd = 1)\r\n [1] -0.5494891 1.2861975 -1.2755454 1.0817387 -0.7248563\r\n\r\nTake 2) Display is the side-effect; output is meaningful\r\nOne thing I neglected to mention in this blog post is that there are other ways to extract a function’s arguments. One of them is formals():6\r\n\r\n\r\nformals(args)\r\n\r\n $name\r\n\r\nformals(rnorm)\r\n\r\n $n\r\n \r\n \r\n $mean\r\n [1] 0\r\n \r\n $sd\r\n [1] 1\r\n\r\nformals() returns the information about a function’s arguments in a list, which is pretty boring, but it’s an object we can manipulate (unlike the return value of str()). So there are some pros and cons.\r\nActually, we could just combine both formals() and str():\r\n\r\n\r\nargs3 <- function(name) {\r\n str(name)\r\n invisible(formals(name))\r\n}\r\n\r\n\r\n\r\n\r\narguments <- args3(rnorm)\r\n\r\n function (n, mean = 0, sd = 1)\r\n\r\narguments\r\n\r\n $n\r\n \r\n \r\n $mean\r\n [1] 0\r\n \r\n $sd\r\n [1] 1\r\n\r\narguments$mean\r\n\r\n [1] 0\r\n\r\nYou get the nice display as a side-effect (via str()) and then an informative output (via formals()). You could even turn this into a class with a print method, which is definitely the better way to go about this, but I’m running out of steam here and I don’t like OOP, so I won’t touch that here.\r\nTake 3) Just remove the NULL\r\nThis last redesign is the simplest of the three, and narrowly deals with the problem of that pesky NULL shown alongside the function arguments:\r\n\r\n\r\nargs(sum)\r\n\r\n function (..., na.rm = FALSE) \r\n NULL\r\n\r\nFine, I’ll give them that args() must, for compatibility with S or whatever reason, return a whole new function object, which in turn requires a function body. 
But if that function is just a placeholder and not meant to be called, can’t you just make the function body, like, empty?\r\n\r\n\r\nargs4 <- function(name) {\r\n f <- args(name)\r\n body(f) <- quote(expr=)\r\n f\r\n}\r\nargs4(sum)\r\n\r\n function (..., na.rm = FALSE)\r\n\r\nargs4(rnorm)\r\n\r\n function (n, mean = 0, sd = 1)\r\n\r\ntypeof( args4(rnorm) )\r\n\r\n [1] \"closure\"\r\n\r\nLike, come on!\r\nsessionInfo()\r\n\r\n\r\nsessionInfo()\r\n\r\n R version 4.3.3 (2024-02-29 ucrt)\r\n Platform: x86_64-w64-mingw32/x64 (64-bit)\r\n Running under: Windows 11 x64 (build 22631)\r\n \r\n Matrix products: default\r\n \r\n \r\n locale:\r\n [1] LC_COLLATE=English_United States.utf8 \r\n [2] LC_CTYPE=English_United States.utf8 \r\n [3] LC_MONETARY=English_United States.utf8\r\n [4] LC_NUMERIC=C \r\n [5] LC_TIME=English_United States.utf8 \r\n \r\n time zone: America/New_York\r\n tzcode source: internal\r\n \r\n attached base packages:\r\n [1] stats graphics grDevices utils datasets methods base \r\n \r\n other attached packages:\r\n [1] magrittr_2.0.3\r\n \r\n loaded via a namespace (and not attached):\r\n [1] crayon_1.5.2 vctrs_0.6.5 cli_3.6.1 knitr_1.45 \r\n [5] rlang_1.1.2 xfun_0.41 purrr_1.0.2 styler_1.10.2 \r\n [9] jsonlite_1.8.8 htmltools_0.5.7 sass_0.4.7 fansi_1.0.5 \r\n [13] rmarkdown_2.25 R.cache_0.16.0 evaluate_0.23 jquerylib_0.1.4 \r\n [17] distill_1.6 fastmap_1.1.1 yaml_2.3.7 lifecycle_1.0.4 \r\n [21] memoise_2.0.1 compiler_4.3.3 prettycode_1.1.0 downlit_0.4.3 \r\n [25] rstudioapi_0.15.0 R.oo_1.25.0 R.utils_2.12.3 digest_0.6.33 \r\n [29] R6_2.5.1 R.methodsS3_1.8.2 bslib_0.6.1 tools_4.3.3 \r\n [33] withr_3.0.0 cachem_1.0.8\r\n\r\n\r\nYou can even see lazy evaluation in action when it receives stop() without erroring.↩︎\r\nThough you have to remove the \"srcref\" attribute if the function has one. But also don’t actually do this!↩︎\r\nTechnically, the \"output\" stream.↩︎\r\nFor the longest time, I thought args() was doing this from how its output looked.↩︎\r\nEssentially acting like identity().↩︎\r\nBut note that it has a special behavior of returning NULL for primitive functions (written in C) that clearly have user-facing arguments on the R side. See also formalArgs(), for a shortcut to names(formals()).↩︎\r\n",
"preview": "posts/2024-03-04-args-args-args-args/preview.png",
- "last_modified": "2024-03-05T17:04:54-05:00",
+ "last_modified": "2024-03-05T14:04:54-08:00",
"input_file": {},
"preview_width": 419,
"preview_height": 300
@@ -96,7 +96,7 @@
],
"contents": "\r\n\r\nContents\r\nHello World\r\nQuirks in R syntax\r\nFlipping\r\nRegistering arguments\r\nUnflipping\r\nCurrying\r\nFin.\r\n\r\nHello World\r\nGetting a program to print âHello Worldâ is one of the earliest things people are taught to do when picking up a new programming language. This universal experience among programmers has also turned it into a running joke about the complexity of programming languages.\r\nFor example, whereas in R we can express what we want transparently in the following:\r\n\r\n\r\nprint(\"Hello World\")\r\n\r\n\r\nThis simple task can get absurdly complex in other languages; perhaps most notoriously, Java:\r\n\r\nclass HelloWorld {\r\n public static void main(String[] args) {\r\n System.out.println(\"Hello, World!\"); \r\n }\r\n}\r\n\r\nThis joke around âHello Worldâ has also evolved into other forms. Every once in a while I come across a variant of the joke in the style of something like:\r\n\r\n\r\nHelloWorld(\"print\")\r\n\r\n [1] \"HelloWorld\"\r\n\r\nThis is funny because it seemingly swaps the role of the argument and the function in an expression. Itâs also a good educational example because it demonstrates the arbitrariness of signs as a universal design principle of programming (and human!) languages.1 Crucially, you should be able to produce this behavior in any reasonable programming language - the ability to do this is a feature, not a bug.\r\nThe most trivial implementation of the above is to define HelloWorld() as a function thatâs been hardcoded to simply print âHello Worldâ:\r\n\r\n\r\nHelloWorld <- function(x) print(\"HelloWorld\")\r\n\r\n\r\nBut here, too, languages show differences. Not so much in their ability to implement this specific solution, but in their ability to formulate a generalizable solution in an idiomatic way, using tools and concepts that are native to the language.\r\nWhen it comes to R, it turns out that R has certain quirks which can give us a surprisingly principled and lean solution to the problem. So thatâs what this blog post will be about.\r\nQuirks in R syntax\r\nIn R, functions are distinguished from non-functions in part by their role as a caller. This role is defined by its syntactic position in an expression: it always occupies the first position [[1]] of a object.2\r\n\r\n\r\n\r\n\r\nplus_expr <- quote(1 + 2)\r\nplus_expr\r\n\r\n 1 + 2\r\n\r\nplus_expr[[1]]\r\n\r\n `+`\r\n\r\n\r\n\r\n\r\n\r\nsum_expr <- quote(sum(1, 2))\r\nsum_expr\r\n\r\n sum(1, 2)\r\n\r\nsum_expr[[1]]\r\n\r\n sum\r\n\r\n\r\n\r\nWhen R sees a variable in an expression and needs to resolve its value, it firstly determines whether the value must be a function, by virtue of its position in the expression. Here, R eagerly commits to the assumption that whatever appears in the caller position must be a function.\r\nThis gives rise to a somewhat surprising behavior. In evaluating the expression f(1, 2) inside a local scope below, R smartly skips the immediately-adjacent, local value of f (a numeric constant) to scope the global value of f() (alias of the function sum()) thatâs âfurther awayâ.\r\n\r\n\r\nf <- sum\r\nlocal({\r\n f <- 0\r\n f(1, 2)\r\n})\r\n\r\n [1] 3\r\n\r\nSo the point here is that, R knows to only scope values of f that are functions, because it found f in the caller position of the expression:\r\n\r\n\r\nf_expr <- quote(f(1, 2))\r\nf_expr[[1]]\r\n\r\n f\r\n\r\nThis in and of itself is interesting, but I want to return to my characterization of R as âeagerly committingâ to this. 
Consider the fact that the above example works even if you swapped f with the string \"f\" in the expression:\r\n\r\n\r\nf <- sum\r\nlocal({\r\n f <- 0\r\n \"f\"(1, 2)\r\n})\r\n\r\n [1] 3\r\n\r\nBecause R eagerly commits to the invariant that the first position is reserved for functions, it repairs \"f\"() to f() at the level of the parser, before the evaluation engine even sees the expression.\r\n\r\n\r\nf_expr2 <- quote(\"f\"(1, 2))\r\nf_expr2\r\n\r\n f(1, 2)\r\n\r\nf_expr2[[1]]\r\n\r\n f\r\n\r\nAll of this to say that the following syntax, which looks even more flipped, is also valid in R:\r\n\r\n\r\nHelloWorld <- function(x) print(\"HelloWorld\")\r\n\"HelloWorld\"(print)\r\n\r\n [1] \"HelloWorld\"\r\n\r\nThis is trivially true about R’s syntax and its parser but funny nonetheless, so this deserves a mention first. Now let’s talk about the implementation side of things - how well does R fare in letting us express something like “arg(f) should evaluate to f(arg)”?\r\nFlipping\r\nI’ll cut right to the chase - the following definition for HelloWorld() gives us the ability to pass in a function that is then called with \"HelloWorld\" as the argument.\r\n\r\n\r\nHelloWorld <- function(x) {\r\n fun <- match.fun(x)\r\n arg <- deparse(sys.call()[[1]])\r\n fun(arg)\r\n}\r\nHelloWorld(\"print\")\r\n\r\n [1] \"HelloWorld\"\r\n\r\nHelloWorld(toupper)\r\n\r\n [1] \"HELLOWORLD\"\r\n\r\nThere are two pieces to this solution.\r\nFirst is match.fun(), which allows HelloWorld() to receive the name of a function as a string and match the function with that name. This is kind of like what we talked about in the previous section with \"f\"(), but it’s a more explicit, less auto-magic way of handling functions specified as a string:\r\n\r\n\r\nidentical(match.fun(\"print\"), print)\r\n\r\n [1] TRUE\r\n\r\nA nice convenience feature is that when match.fun() receives a function, it simply passes it through. That also gives us this equality:\r\n\r\n\r\nidentical(match.fun(print), print)\r\n\r\n [1] TRUE\r\n\r\nIn sum, match.fun() gives us a choice in whether HelloWorld() receives its argument as a string vs. symbol. Combined with our observation from the previous section, this gives us a full 2-by-2 variation in whether the function or the argument is a string (vs. a symbol):\r\n\r\n\r\nHelloWorld(print)\r\nHelloWorld(\"print\")\r\n\"HelloWorld\"(print)\r\n\"HelloWorld\"(\"print\")\r\n\r\n\r\nThe second piece of the solution is sys.call(), which returns the expression that called the function where sys.call() is called from. It’s hard to explain in words but actually pretty intuitive once you see some examples:\r\n\r\n\r\nf <- function(...) {\r\n sys.call()\r\n}\r\nf()\r\n\r\n f()\r\n\r\nf(arg = val)\r\n\r\n f(arg = val)\r\n\r\nf(pi)\r\n\r\n f(pi)\r\n\r\nAnd that’s it! When sys.call() is called from f(), it captures the expression that makes up f(...). So in the case of HelloWorld(\"print\"), the call to sys.call() evaluates to the following language object:\r\n\r\n HelloWorld(\"print\")\r\n\r\n… which is essentially a list of length-2:\r\n\r\n [[1]]\r\n HelloWorld\r\n \r\n [[2]]\r\n [1] \"print\"\r\n\r\n
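You can poke at that structure directly - as.list() on the quoted call lays out the two elements:\r\n\r\n\r\nas.list(quote(HelloWorld(\"print\")))\r\n\r\n [[1]]\r\n HelloWorld\r\n \r\n [[2]]\r\n [1] \"print\"\r\n\r\n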
So the code deparse(sys.call()[[1]]) grabs the symbol HelloWorld and deparse()s it into a string, resulting in \"HelloWorld\". And as I mentioned before, we grab the string \"print\" and pass it to match.fun() to get back the print() function.\r\nOnce we have these two pieces, the line fun(arg) evaluates to the un-flipped version print(\"HelloWorld\").\r\nAnd of course, as far as the argument is concerned, HelloWorld() takes any function that can operate on the string \"HelloWorld\":\r\n\r\n\r\ncaps_split <- function(x) {\r\n strsplit(x, \"(? print()), the next best tool for this job is probably currying.\r\nHere’s the simplest attempt at that:6\r\n\r\n\r\ncurry <- function(arg) {\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n}\r\ncurry(\"HelloWorld\")(print)\r\n\r\n [1] \"HelloWorld\"\r\n\r\nEssentially, curry(\"HelloWorld\") is returning a function that takes a function and calls that function with \"HelloWorld\" as its argument. Although, unfortunately, that’s not so obvious from the function definition, which just looks generic:\r\n\r\n\r\ncurry(\"HelloWorld\")\r\n\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n \r\n \r\n\r\nFor us to see \"HelloWorld\" in the function body for curry(\"HelloWorld\"), we would need to in-line the value of arg when the curried function is defined.7 Let’s take this up in steps.\r\nFirst, we can use substitute() (or bquote()) to create an expression where the value of arg is in-lined. Both methods produce the contextualized function definition we want.\r\n\r\n\r\ncurry2 <- function(arg) {\r\n list(\r\n substitute = substitute(\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n ),\r\n bquote = bquote(\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(.(arg))\r\n }\r\n )\r\n )\r\n}\r\ncurry2(\"HelloWorld\")\r\n\r\n $substitute\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(\"HelloWorld\")\r\n }\r\n \r\n $bquote\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(\"HelloWorld\")\r\n }\r\n\r\nLet’s stick with substitute() and move on. Now that we have an expression of the function definition, we can eval()-uate it to get an actual function object back.\r\n\r\n\r\ncurry2 <- function(arg) {\r\n eval(substitute(\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n ))\r\n}\r\ncurry2(\"HelloWorld\")\r\n\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n \r\n\r\nWait… \"HelloWorld\" just turned back into arg! Turns out that functions in R have a “memory” of how they were defined. It’s stored in the srcref attribute of functions, and this is the function definition that gets shown when we print functions.\r\n\r\n\r\nHelloWorld <- curry2(\"HelloWorld\")\r\nattr(HelloWorld, \"srcref\")\r\n\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n\r\nAnd actually, if we just strip this attribute away, we can see our work of in-lining arg:\r\n\r\n\r\nattr(HelloWorld, \"srcref\") <- NULL\r\nHelloWorld\r\n\r\n function (fun) \r\n {\r\n fun <- match.fun(fun)\r\n fun(\"HelloWorld\")\r\n }\r\n \r\n
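Worth noting: base R ships a helper for exactly this - utils::removeSource() strips the srcref attribute, so the manual attr()<- above could also be written as:\r\n\r\n\r\nHelloWorld <- utils::removeSource(curry2(\"HelloWorld\"))\r\n\r\n\r\n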
We can now go back to the currying function and implement this solution there:\r\n\r\n\r\ncurry3 <- function(arg) {\r\n inlined <- eval(substitute(\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n ))\r\n attr(inlined, \"srcref\") <- NULL\r\n inlined\r\n}\r\ncurry3(\"HelloWorld\")\r\n\r\n function (fun) \r\n {\r\n fun <- match.fun(fun)\r\n fun(\"HelloWorld\")\r\n }\r\n \r\n\r\nTo avoid all this mess, you could also inline arg first, and then piece together the function from scratch:\r\n\r\n\r\ncurry4 <- function(arg) {\r\n inlined_body <- rlang::expr({\r\n fun <- match.fun(fun)\r\n fun(!!arg)\r\n })\r\n rlang::new_function(\r\n args = rlang::pairlist2(fun=),\r\n body = inlined_body\r\n )\r\n}\r\ncurry4(\"HelloWorld\")\r\n\r\n function (fun) \r\n {\r\n fun <- match.fun(fun)\r\n fun(\"HelloWorld\")\r\n }\r\n \r\n\r\nFin.\r\n\r\n\r\nregister(`But don't do this in practice!`)\r\nget(ls()[order(-nchar(ls()))][1])(\"print\")\r\n\r\n [1] \"But don't do this in practice!\"\r\n\r\n\r\nA topic close to my heart as a linguist. This is one of the first things we teach in intro to linguistics.↩︎\r\nWhere objects of class call are essentially a (nested) list of symbols and constants.↩︎\r\nThe notorious <<- is evidence of this.↩︎\r\n\r\nThis is starting to look something like a very butchered form of string interning…↩︎\r\nYou might protest that as.character(substitute()) is bad practice, which is true, but it’s idiomatic in the sense that it’s the first line of the function definition of require().↩︎\r\nA version with stricter safeguards would probably use force() among other things (see Adv R).↩︎\r\nThis in-lining also resolves the need for force().↩︎\r\n",
"preview": "posts/2024-02-20-helloworld-print/preview.png",
- "last_modified": "2024-02-20T15:25:44-05:00",
+ "last_modified": "2024-02-20T12:25:44-08:00",
"input_file": {},
"preview_width": 462,
"preview_height": 184
@@ -117,7 +117,7 @@
],
"contents": "\r\n\r\nContents\r\nIntro\r\nResearch\r\nBlogging\r\nR stuff\r\nPersonal\r\n\r\n\r\n\r\n\r\nFigure 1: New yearâs eve celebration fireworks at Long Beach, CA.\r\n\r\n\r\n\r\nIntro\r\nIâve been seeing a couple folks on Mastodon sharing their âyear in reviewâ blog posts, and I thought that was really cool, so I decided to write my own too! Iâm mostly documenting for myself but hopefully this also serves as an update of a sort for my friends over the internet since Iâve been pretty silent online this year.\r\nResearch\r\nBeing the Good Grad Student⢠I am, Iâm forefronting my academia happenings first. In numbers, I published one paper, gave two talks, and presented three posters. Iâm not super proud of those numbers: I think theyâre a lot less than what people might expect from a 4th year PhD student. But a lot of effort went into each1 and 2023 overall has been a great year for refining and narrowing down on my dissertation topic.2 I did a ton of readings and I hope it pays off for next year when I actually get started on writing the thing.\r\nI already document my research happenings elsewhere and I know that the primarily audience of my blog isnât linguists, so I wonât expand on that more here.\r\nBlogging\r\n2023 was the year when it became painfully obvious to me that I donât have much in terms of a portfolio in the sense of the buzzword-y âdata science portfolioâ that industry recruiters purportedly look for. This ironically coincided with another realization I had, which is that Iâm increasingly becoming âthe department tech/stats guyâ where I take on many small tasks and favors from faculty and other students here and there; I truly do enjoy doing this work, but itâs completely invisible to my CV/resume. Iâm still navigating this weird position Iâm in, but Iâve found some nice tips3 and at least I still have another year until Iâm on the job market to fully figure this out.\r\nThe reason why I put the above rant under the âBloggingâ section is because my blog is the closest thing I have a portfolio - thereâs not much here, but itâs a public-facing space I own where I get to show people what I know and how I think. So in 2023 I was more conscious about what I blog about and how. The change was subtle - my blog persona is still my usual self, but Iâve tried to diversify the style of my blogs. Whereas I mostly wrote long-form, tutorial-style blog posts in the past, I only wrote one such post (on dplyr::slice()) this year. My other blog posts were one reflecting on how to better answer other peopleâs questions, and another where I nerd out on the internals of {tidyselect} with little regard for its practicality.4.\r\nAll in all, I wrote three blog posts this year (not including this one). This is the usual rate of publishing blog posts for me, but I hope to write more frequently next year (and write shorter posts overall, and in less formal tone).\r\nR stuff\r\nI didnât think Iâd have much to say about the R stuff I did this year until I sat down to write this blog. Even though this year was the busiest Iâve ever been with research, it turns out that I still ended up doing quite a bit of R stuff in my free time. Iâll cover this chronologically.\r\n\r\nAt the beginning of the year, I was really lucky to receive the student paper award from the Statistical Computing and Graphics section of the ASA, writing about {ggtrace}.5 In the paper, I focused on {ggtrace} as a pedagogical tool for aspiring {ggplot2} extension developers. 
In the process, I rediscovered the power of reframing ggplot internals as data wrangling and went back to {ggtrace} to add a couple convenience functions for interactive use cases. After over two years since its inception, {ggtrace} now feels pretty complete in terms of its core features (but suggestions and requests are always welcome!).\r\n\r\nIn Spring, I began writing {jlmerclusterperm}, a statistical package implementing the cluster-based permutation test for time series data, using mixed-effects models. This was a new challenge for me for two reasons. First, I wrote much of the package in Julia - this was my first time writing Julia code for “production” and within an R package.6 Second, I wrote this package for a seminar on eye movements that I was taking that Spring in the psychology department. I wrote {jlmerclusterperm} in an intense burst - most of it was complete by the end of May and I turned in the package as my final.7 I also gave a school-internal talk on it in April; my first time talking about R in front of an entirely academic audience.\r\nIn Summer, I continued polishing {jlmerclusterperm} with another ambitious goal of getting it to CRAN, at the suggestion of a couple researchers who said they’d like to use it for their own research. The already-hard task of getting through my first CRAN submission was compounded by the fact that the package contained Julia code - it took nine resubmissions in the span of two months to finally get {jlmerclusterperm} stably on CRAN.8\r\n\r\n\r\n\r\nFigure 2: Group photo taken at SMLP2023.\r\n\r\n\r\n\r\nAt the beginning of Fall, I attended the Advanced Frequentist stream of the SMLP2023 workshop, taught by Phillip Alday, Reinhold Kliegl and Douglas Bates. The topic was mixed-effects regression models in Julia, one that I became very excited about especially after working on {jlmerclusterperm}. It was an absolute blast and I wish that everyone in linguistics/psychology research appreciated good stats/data analysis as much as the folks I met there. The workshop was far away in Germany (my first time ever in Europe!) and I’m really thankful to MindCORE for giving me a grant to help with travel expenses.\r\n\r\nFor most of Fall, I didn’t do much R stuff, especially with the start of the Fall semester and a big conference looming on the horizon. But in the little time I did spend on it, I worked on maintenance and upkeep for {openalexR}, one of my few collaborative projects. It’s also one of the few packages I’m an author of that I actually frequently use myself. I used {openalexR} a lot during the Fall semester for conducting literature reviews in preparation for my dissertation proposal, so I had a few opportunities to catch bugs and work on other improvements. I also spent a lot of my time in the Fall TA-ing for an undergraduate data science class that we recently started offering in our department. This was actually my third year in a row TA-ing it, so it went pretty smoothly. I even learned some new quirky R behaviors from my students along the way.\r\n\r\nIn October, I virtually attended the R/Pharma conference and joined a workshop on data validation using the {pointblank} package by Rich Iannone. I had used {pointblank} a little before, but I didn’t explore its features much because I thought it had some odd behaviors that I couldn’t comprehend. The workshop cleared up some of the confusion for me, and Rich made it clear in the workshop that he welcomed contributions to improve the package. 
So I made a PR addressing the biggest pain point I personally had with using {pointblank}. This turned out to be a pretty big undertaking which took over a month to complete. In the process, I became a co-author of {pointblank}, and I merged a series of PRs that improved the consistency of function designs, among other things.\r\nThe last R thing I did this year was actually secretly Julia - in December I gave a school-internal workshop on fitting mixed-effects models in Julia, geared towards an academic audience with prior experience in R. I advocated for a middle-ground approach where you can keep doing everything in R and RStudio, except move just the modelling workflow into Julia. I live-coded some Julia code and ran it from RStudio, which I think wasn’t too difficult to grasp.9 I have a half-baked package of addins to make R-Julia interoperability smoother in RStudio; I hope to wrap it up and share it some day.\r\nThat brings me to the present moment, where I’m currently taking a break from FOSS to focus on my research, as my dissertation proposal defense is coming up soon. I will continue to be responsive with maintaining {jlmerclusterperm} during this time (since there’s an active user-base of researchers who find it useful) but my other projects will become low priority. I also don’t think I’ll be starting a new project any time soon, but in the near future I hope I come up with something cool that lets me test-drive {S7}!\r\nPersonal\r\nThis year, I tried to be less of a workaholic. I think I did an okay job at it, and it mostly came in the form of diversifying my hobbies (R used to be my only hobby since starting grad school). I got back into ice skating10 and, briefly, swimming,11 and I’m fortunate that both are available literally two blocks away from my department. My girlfriend and I got really into escape rooms this year, mostly playing online ones due to budget constraints.12 I also got back into playing Steam games13 and racked up over 300 hours on Slay the Spire, mostly from the ~2 weeks recovering from covid in September.14\r\nAnd of course, I have many people to thank for making this a wonderful year.15 Happy new year to all!\r\n\r\nI was the first author for all research that I presented, as is often the case in linguistics.↩︎\r\nBroadly, how kids learn words with overlapping meanings like “dalmatian” < “dog” < “animal” from the language input.↩︎\r\nLike this blog post.↩︎\r\nA style heavily inspired by some of my favorite R bloggers like Matt Dray and Jonathan Carroll.↩︎\r\nCoincidentally, my girlfriend also won a student award this year from another ASA - the Acoustical Society of America.↩︎\r\nI can’t recommend {JuliaConnectoR} enough for this.↩︎\r\nI’m actually quite proud of myself for pulling this off - writing an R package for the final was unprecedented for the class.↩︎\r\nIn the process, I received the elusive CRAN Note for exceeding 6 updates in under a month (CRAN recommends one update every 1-2 months).↩︎\r\nUsing some tricks described in the workshop materials.↩︎\r\nI used to play ice hockey competitively as a kid.↩︎\r\nTurns out that swimming does not play well with my preexisting ear conditions.↩︎\r\nMost recently we played Hallows Hill, which I think is the best one we’ve played so far.↩︎\r\nI’m very into roguelike genres but haven’t really played video games since high school.↩︎\r\nFor the fellow nerds, I reached A20 on Ironclad, Defect, and Watcher. I’m working my way up for Silent.↩︎\r\nI’m feeling shy so this goes in the footnotes. 
In roughly chronological order, I’m firstly indebted to Sam Tyner-Monroe, who encouraged me to write up {ggtrace} for the ASA paper award after my rstudio::conf talk on it last year. I’m grateful to Gina Reynolds and Teun van den Brand (and others in the ggplot extension club) for engaging in many insightful data viz/ggplot internals discussions with me. I’m also grateful to my FOSS collaborators, especially Trang Le, from whom I’ve learned a lot about code review and package design principles while working on {openalexR} together. Last but not least, I owe a lot to Daniel Sjoberg and Shannon Pileggi for a recent development that I’m not ready to publicly share yet 🤫.↩︎\r\n",
"preview": "posts/2023-12-31-2023-year-in-review/preview.png",
- "last_modified": "2024-01-01T15:43:40-05:00",
+ "last_modified": "2024-01-01T12:43:40-08:00",
"input_file": {},
"preview_width": 1512,
"preview_height": 1371
@@ -140,7 +140,7 @@
],
"contents": "\r\n\r\nContents\r\nIntro\r\nSome observations\r\ntidy-select!\r\ntidy?-select\r\nuntidy-select?\r\nuntidy-select!\r\n\r\nTidying untidy-select\r\nWriting untidy-select helpers\r\n1) times()\r\n2) offset()\r\n3) neighbors()\r\nDIY!\r\n\r\nLetâs get practical\r\n1) Sorting columns\r\n2) Error handling\r\n\r\nConclusion\r\n\r\nIntro\r\nRecently, Iâve been having frequent run-ins with {tidyselect} internals, discovering some weird and interesting behaviors along the way. This blog post is my attempt at documenting a couple of these. And as is the case with my usual style of writing, Iâm gonna talk about some of the weirder stuff first and then touch on some of the âpracticalâ side to this.\r\nSome observations\r\nLetâs start with some facts about how {tidyselect} is supposed to work. Iâll use this toy data for the demo:\r\n\r\n\r\nlibrary(dplyr, warn.conflicts = FALSE)\r\nlibrary(tidyselect)\r\ndf <- tibble(x = 1:2, y = letters[1:2], z = LETTERS[1:2])\r\ndf\r\n\r\n # A tibble: 2 Ă 3\r\n x y z \r\n \r\n 1 1 a A \r\n 2 2 b B\r\n\r\ntidy-select!\r\n{tidyselect} is the package that powers dplyr::select(). If youâve used {dplyr}, you already know the behavior of select() pretty well. We can specify a column as string, symbol, or by its position:\r\n\r\n\r\ndf %>% \r\n select(\"x\")\r\n\r\n # A tibble: 2 Ă 1\r\n x\r\n \r\n 1 1\r\n 2 2\r\n\r\ndf %>% \r\n select(x)\r\n\r\n # A tibble: 2 Ă 1\r\n x\r\n \r\n 1 1\r\n 2 2\r\n\r\ndf %>% \r\n select(1)\r\n\r\n # A tibble: 2 Ă 1\r\n x\r\n \r\n 1 1\r\n 2 2\r\n\r\nItâs not obvious from the outside, but the way this works is that these user-supplied expressions (like \"x\", x, and 1) all get resolved to integer before the selection actually happens.\r\nSo to be more specific, the three calls to select() were the same because these three calls to tidyselect::eval_select() are the same:1\r\n\r\n\r\neval_select(quote(\"x\"), df)\r\n\r\n x \r\n 1\r\n\r\neval_select(quote(x), df)\r\n\r\n x \r\n 1\r\n\r\neval_select(quote(1), df)\r\n\r\n x \r\n 1\r\n\r\nYou can also see eval_select() in action in the method for select():\r\n\r\n\r\ndplyr:::select.data.frame\r\n\r\n function (.data, ...) \r\n {\r\n error_call <- dplyr_error_call()\r\n loc <- tidyselect::eval_select(expr(c(...)), data = .data, \r\n error_call = error_call)\r\n loc <- ensure_group_vars(loc, .data, notify = TRUE)\r\n out <- dplyr_col_select(.data, loc)\r\n out <- set_names(out, names(loc))\r\n out\r\n }\r\n \r\n \r\n\r\ntidy?-select\r\nBecause the column subsetting part is ultimately done using integers, we can theoretically pass select() any expression, as long as it resolves to an integer vector.\r\nFor example, we can use 1 + 1 to select the second column:\r\n\r\n\r\ndf %>% \r\n select(1 + 1)\r\n\r\n # A tibble: 2 Ă 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nAnd vector recycling is still a thing here too - we can use c(1, 2) + 1 to select the second and third columns:\r\n\r\n\r\ndf %>% \r\n select(c(1, 2) + 1)\r\n\r\n # A tibble: 2 Ă 2\r\n y z \r\n \r\n 1 a A \r\n 2 b B\r\n\r\nOrdinary function calls work as well - we can select a random column using sample():\r\n\r\n\r\ndf %>% \r\n select(sample(ncol(df), 1))\r\n\r\n # A tibble: 2 Ă 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nWe can even use the .env pronoun to scope an integer variable from the global environment:2\r\n\r\n\r\noffset <- 1\r\ndf %>% \r\n select(1 + .env$offset)\r\n\r\n # A tibble: 2 Ă 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nSo thatâs kinda interesting.3 But what if we try to mix the different approaches to tidyselect-ing? 
Can we do math on columns that we’ve selected using strings and symbols?\r\nuntidy-select?\r\nUh, not quite. select() doesn’t like doing math on strings and symbols.\r\n\r\n\r\ndf %>% \r\n select(x + 1)\r\n\r\n Error in `select()`:\r\n ! Problem while evaluating `x + 1`.\r\n Caused by error:\r\n ! object 'x' not found\r\n\r\ndf %>% \r\n select(\"x\" + 1)\r\n\r\n Error in `select()`:\r\n ! Problem while evaluating `\"x\" + 1`.\r\n Caused by error in `\"x\" + 1`:\r\n ! non-numeric argument to binary operator\r\n\r\nIn fact, it doesn’t even like doing certain kinds of math like multiplication (*), even with numeric constants:\r\n\r\n\r\ndf %>% \r\n select(1 * 2)\r\n\r\n Error in `select()`:\r\n ! Can't use arithmetic operator `*` in selection context.\r\n\r\nThis actually makes sense from a design POV. Adding numbers to columns probably happens more often as a mistake than something intentional. These safeguards exist to prevent users from running into cryptic errors.\r\nUnless…\r\nuntidy-select!\r\nIt turns out that {tidyselect} helpers have an interesting behavior of immediately resolving the column selection to integer. So we can get addition (+) working if we wrap our columns in redundant column selection helpers like all_of() and matches():\r\n\r\n\r\ndf %>% \r\n select(all_of(\"x\") + 1)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\ndf %>% \r\n select(matches(\"^x$\") + 1)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nFor multiplication, we have to additionally circumvent the censoring of the * symbol. Here, we can simply use a different name for the same operation:4\r\n\r\n\r\n`%times%` <- `*`\r\ndf %>% \r\n select(matches(\"^x$\") %times% 2)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nBut geez, it’s so tiring to type all_of() and matches() all the time. There must be a better way to break the rule!\r\nTidying untidy-select\r\nLet’s make a tidy design for the untidy pattern of selecting columns by doing math on column locations. The idea is to make our own little scope inside select() where all the existing safeguards are suspended. Like a DSL within a DSL, if you will.\r\nLet’s call this function math(). It should let us express stuff like “give me the column to the right of column x” via this intuitive(?) syntax:\r\n\r\n\r\n\r\n\r\n\r\ndf %>% \r\n select(math(x + 1))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nThis is my take on math():\r\n\r\n\r\nmath <- function(expr) {\r\n math_expr <- rlang::enquo(expr)\r\n columns <- tidyselect::peek_vars()\r\n col_locs <- as.data.frame.list(seq_along(columns), col.names = columns)\r\n mask <- rlang::as_data_mask(col_locs)\r\n out <- rlang::eval_tidy(math_expr, mask)\r\n out\r\n}\r\n\r\n\r\nThere are a lot of weird functions involved here, but it’s easier to digest by focusing on its parts. 
Here’s what each local variable in the function looks like for our math(x + 1) example above:\r\n\r\n $math_expr\r\n \r\n expr: ^x + 1\r\n env: 0x0000012f8e27cec8\r\n \r\n $columns\r\n [1] \"x\" \"y\" \"z\"\r\n \r\n $col_locs\r\n x y z\r\n 1 1 2 3\r\n \r\n $mask\r\n \r\n \r\n $out\r\n [1] 2\r\n\r\nLet’s walk through the pieces:\r\nmath_expr: the captured user expression, with the environment attached\r\ncolumns: the column names of the current dataframe, in order\r\ncol_locs: a dataframe of column names and location, created from columns\r\nmask: a data mask created from col_locs\r\nout: location of column(s) to select\r\nEssentially, math() first captures the expression to evaluate it in its own special environment, circumventing select()’s safeguards. Then, it grabs the column names of the data frame with tidyselect::peek_vars() to define col_locs and then mask. The data mask mask is then used inside rlang::eval_tidy() to resolve symbols like x to integer 1 when evaluating the captured expression x + 1. The expression math(x + 1) thus evaluates to 1 + 1. In turn, select(math(x + 1)) is evaluated to select(2), returning us the second column of the dataframe.\r\nWriting untidy-select helpers\r\nA small yet powerful detail in the implementation of math() is the fact that it captures the expression as a quosure. This allows math() to appropriately scope dynamically created variables, and not just bare symbols provided directly by the user.\r\nThis makes more sense with some examples. Here, I define helper functions that call math() under the hood with their own templatic math expressions (and I have them print() the expression as passed to math() for clarity). The fact that math() captures its argument as a quosure is what allows local variables like n to be correctly scoped in these examples.\r\n1) times()\r\n\r\n\r\ntimes <- function(col, n) {\r\n col <- rlang::ensym(col)\r\n print(rlang::expr(math(!!col * n))) # for debugging\r\n math(!!col * n)\r\n}\r\ndf %>%\r\n select(times(x, 2))\r\n\r\n math(x * n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n\r\n\r\nnum2 <- 2\r\ndf %>%\r\n select(times(x, num2))\r\n\r\n math(x * n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n2) offset()\r\n\r\n\r\noffset <- function(col, n) {\r\n col <- rlang::ensym(col)\r\n print(rlang::expr(math(!!col + n))) # for debugging\r\n math(!!col + n)\r\n}\r\ndf %>%\r\n select(offset(x, 1))\r\n\r\n math(x + n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n\r\n\r\nnum1 <- 1\r\ndf %>%\r\n select(offset(x, num1))\r\n\r\n math(x + n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n3) neighbors()\r\n\r\n\r\nneighbors <- function(col, n) {\r\n col <- rlang::ensym(col)\r\n range <- c(-(n:1), 1:n)\r\n print(rlang::expr(math(!!col + !!range))) # for debugging\r\n math(!!col + !!range)\r\n}\r\ndf %>%\r\n select(neighbors(y, 1))\r\n\r\n math(y + c(-1L, 1L))\r\n # A tibble: 2 × 2\r\n x z \r\n \r\n 1 1 A \r\n 2 2 B\r\n\r\n\r\n\r\ndf %>%\r\n select(neighbors(y, num1))\r\n\r\n math(y + c(-1L, 1L))\r\n # A tibble: 2 × 2\r\n x z \r\n \r\n 1 1 A \r\n 2 2 B\r\n\r\nDIY!\r\nAnd of course, we can do arbitrary injections ourselves as well with !! or .env$:\r\n\r\n\r\ndf %>%\r\n select(math(x * !!num2))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\ndf %>%\r\n select(math(x * .env$num2))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n
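As an aside, {tidyselect} ships one location-arithmetic helper of its own: last_col() takes an offset argument, so “the second-to-last column” is already expressible without any of the tricks above:\r\n\r\n\r\ndf %>%\r\n select(last_col(offset = 1))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n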
Let's set math() aside to try our hands on something more useful.\r\nLet's get practical\r\n1) Sorting columns\r\nProbably one of the hardest things to do idiomatically in the tidyverse is sorting (a subset of) columns by their name. For example, consider this dataframe which is a mix of columns that follow some fixed pattern (\"x|y_\\\\d\") and those outside that pattern (\"year\", \"day\", etc.).\r\n\r\n\r\ndata_cols <- expand.grid(first = c(\"x\", \"y\"), second = 1:3) %>%\r\n mutate(cols = paste0(first, \"_\", second)) %>%\r\n pull(cols)\r\ndf2 <- as.data.frame.list(seq_along(data_cols), col.names = data_cols)\r\ndf2 <- cbind(df2, storms[1,1:5])\r\ndf2 <- df2[, sample(ncol(df2))]\r\ndf2\r\n\r\n y_3 x_3 month day hour y_2 y_1 x_2 year name x_1\r\n 1 6 5 6 27 0 4 2 3 1975 Amy 1\r\n\r\nIt's trivial to select columns by pattern - we can use the matches() helper:\r\n\r\n\r\ndf2 %>%\r\n select(matches(\"(x|y)_(\\\\d)\"))\r\n\r\n y_3 x_3 y_2 y_1 x_2 x_1\r\n 1 6 5 4 2 3 1\r\n\r\nBut what if I also wanted to further sort these columns, after I select them? There's no easy way to do this "on the fly" inside of select(), especially if we want the flexibility to sort the columns by the letter vs. the number.\r\nBut here's one way of getting at that, exploiting two facts:\r\nmatches(), like other tidyselect helpers, immediately resolves the selection to integer\r\npeek_vars() returns the column names in order, which lets us recover the column names from locations\r\nAnd that's pretty much all there is to the tidyselect magic that goes into my solution below. I define locs (an integer vector of column locations) and cols (a character vector of column names at those locations), and the rest is just regex and sorting:\r\n\r\n\r\nordered_matches <- function(matches, order) {\r\n # tidyselect magic\r\n locs <- tidyselect::matches(matches)\r\n cols <- tidyselect::peek_vars()[locs]\r\n # Ordinary evaluation\r\n groups <- simplify2array(regmatches(cols, regexec(matches, cols)))[-1,]\r\n reordered <- do.call(\"order\", asplit(groups[order, ], 1))\r\n locs[reordered]\r\n}\r\n\r\n\r\nUsing ordered_matches(), we can not only select columns but also sort them using regex capture groups.\r\nThis sorts the columns by letter first then number:\r\n\r\n\r\ndf2 %>%\r\n select(ordered_matches(\"(x|y)_(\\\\d)\", c(1, 2)))\r\n\r\n x_1 x_2 x_3 y_1 y_2 y_3\r\n 1 1 3 5 2 4 6\r\n\r\nThis sorts the columns by number first then letter:\r\n\r\n\r\ndf2 %>%\r\n select(ordered_matches(\"(x|y)_(\\\\d)\", c(2, 1)))\r\n\r\n x_1 y_1 x_2 y_2 x_3 y_3\r\n 1 1 2 3 4 5 6\r\n\r\nAnd if we wanted the other columns too, we can use everything() to grab the "rest":\r\n\r\n\r\ndf2 %>%\r\n select(ordered_matches(\"(x|y)_(\\\\d)\", c(2, 1)), everything())\r\n\r\n x_1 y_1 x_2 y_2 x_3 y_3 month day hour year name\r\n 1 1 2 3 4 5 6 6 27 0 1975 Amy\r\n\r\n2) Error handling\r\nOne of the really nice parts about the {tidyselect} design is the fact that error messages are very informative.\r\nFor example, if you select a non-existent column, it errors while pointing out that mistake:\r\n\r\n\r\ndf3 <- data.frame(x = 1)\r\nnonexistent_selection <- quote(c(x, y))\r\neval_select(nonexistent_selection, df3)\r\n\r\n Error:\r\n ! Can't subset columns that don't exist.\r\n ✖ Column `y` doesn't exist.\r\n\r\nIf you use a tidyselect helper that returns nothing, it won't complain by default:\r\n\r\n\r\nzero_selection <- quote(starts_with(\"z\"))\r\neval_select(zero_selection, df3)\r\n\r\n named integer(0)\r\n\r\nBut you can make that error with allow_empty = FALSE:\r\n\r\n\r\neval_select(zero_selection, df3, allow_empty = FALSE)\r\n\r\n Error:\r\n ! Must select at least one item.\r\n\r\nGeneral evaluation errors are caught and chained:\r\n\r\n\r\nevaluation_error <- quote(stop(\"I'm a bad expression!\"))\r\neval_select(evaluation_error, df3)\r\n\r\n Error:\r\n ! Problem while evaluating `stop(\"I'm a bad expression!\")`.\r\n Caused by error:\r\n ! I'm a bad expression!\r\n\r\nThese error signalling patterns are clearly very useful for users,5 but there's a little gem in there for developers too. It turns out that the error condition object contains this information too, which lets you detect different error types programmatically and forward errors to your own error handling logic.\r\nFor example, the attempted non-existent column is stored in $i:6\r\n\r\n\r\ncnd_nonexistent <- rlang::catch_cnd(\r\n eval_select(nonexistent_selection, df3)\r\n)\r\ncnd_nonexistent$i\r\n\r\n [1] \"y\"\r\n\r\nZero-column selections give you NULL in $i when you set them to error:\r\n\r\n\r\ncnd_zero_selection <- rlang::catch_cnd(\r\n eval_select(zero_selection, df3, allow_empty = FALSE)\r\n)\r\ncnd_zero_selection$i\r\n\r\n NULL\r\n\r\nGeneral evaluation errors are distinguished by having a $parent:\r\n\r\n\r\ncnd_evaluation_error <- rlang::catch_cnd(\r\n eval_select(evaluation_error, df3)\r\n)\r\ncnd_evaluation_error$parent\r\n\r\n \r\n\r\nAgain, this is more useful for developers, if you're building something that integrates {tidyselect}.7 But I personally find this interesting to know about anyways!
Conclusion\r\nHere I end with the (usual) disclaimer to not actually just copy paste these for production - they're written with the very low standard of scratching my itch, so they do not come with any warranty!\r\nBut I hope that this was a fun exercise in thinking through one of the most mysterious magics in {dplyr}. I'm sure to reference this many times in the future myself.\r\n\r\nThe examples quote(\"x\") and quote(1) are redundant because \"x\" and 1 are constants. I keep quote() in there just to make the comparison clearer.↩︎\r\nNot to be confused with all_of(). The idiomatic pattern for scoping an external character vector is to do all_of(x), not .env$x. It's only when you're scoping a non-character vector that you'd use .env$.↩︎\r\nIt's also strangely reminiscent of my previous blog post on dplyr::slice().↩︎\r\nThanks to Jonathan Carroll for this suggestion!↩︎\r\nFor those who actually read error messages, at least (points to self) …↩︎\r\nThough {tidyselect} errors early, so it'll only record the first attempted column causing the error. You could use a while() loop (catch and remove bad columns from the data until there's no more error) if you really wanted to get the full set of offending columns.↩︎\r\nIf you want some examples of post-processing tidyselect errors, there's some stuff I did for pointblank that may be helpful as a reference.↩︎\r\n",
"preview": "posts/2023-12-03-untidy-select/preview.png",
- "last_modified": "2023-12-04T10:11:22-05:00",
+ "last_modified": "2023-12-04T07:11:22-08:00",
"input_file": {},
"preview_width": 957,
"preview_height": 664
@@ -161,7 +161,7 @@
],
"contents": "\r\n\r\nContents\r\nIntro\r\nWhat is an XY problem?\r\nThe question\r\nAttempt 1: after_stat()? I know that!\r\nAttempt 2: Hmm but why not after_scale()?\r\nAttempt 3: Oh. You just wanted a scale_fill_*()âŚ\r\nReflections\r\nEnding on a fun aside - accidentally escaping an XY problem\r\n\r\nIntro\r\nA few months ago, over at the R4DS slack (http://r4ds.io/join), someone posted a ggplot question that was within my area of âexpertiseâ. I got tagged in the thread, I went in, and it took me 3 tries to arrive at the correct solution that the poster was asking for.\r\nThe embarrassing part of the exchange was that I would write one solution, think about what I wrote for a bit, and then write a different solution after realizing that I had misunderstood the intent of the original question. In other words, I was consistently missing the point.\r\nThis is a microcosm of a bigger problem of mine that Iâve been noticing lately, as my role in the R community has shifted from mostly asking questions to mostly answering questions. By this point Iâve sort of pin-pointed the problem: I have a hard time recognizing that Iâm stuck in an XY problem.\r\nI have a lot of thoughts on this and I want to document them for future me,1 so here goes a rant. I hope itâs useful to whoever is reading this too.\r\nWhat is an XY problem?\r\nAccording to Wikipedia:\r\n\r\nThe XY problem is a communication problem⌠where the question is about an end userâs attempted solution (Y) rather than the root problem itself (X).\r\n\r\nThe classic example of this is when a (novice) user asks how to extract the last 3 characters in a filename. Thereâs no good reason to blindly grab the last 3 characters, so what they probably meant to ask is how to get the file extension (which is not always 3 characters long, like .R or .Rproj).2\r\nAnother somewhat related cult-classic, copypasta3 example is the âDonât use regex to parse HTMLâ answer on stackoverflow. Here, a user asks how to use regular expressions to match HTML tags, to which the top-voted answer is donât (instead, you should use a dedicated parser). The delivery of this answer is a work of art, so I highly suggest you giving it a read if you havenât seen it already (the link is above for your amusement).\r\nAn example of an XY problem in R that might hit closer to home is when a user complains about the notorious Object of type 'closure' is not subsettable error. Itâs often brought up as a cautionary tale for novice users (error messages can only tell you so much, so you must develop debugging strategies), but it has a special meaning for more experienced users whoâve been bit by this multiple times. So for me, when I see novice users reporting this specific error, I usually ask them if they have a variable called data and whether they forgot to run the line assigning that variable. Of course, this answer does not explain what the error means,4 but oftentimes itâs the solution that the user is looking for.\r\n\r\n\r\n# Oops forgot to define `data`!\r\n# `data` is a function (in {base}), which is not subsettable\r\ndata$value\r\n\r\n Error in data$value: object of type 'closure' is not subsettable\r\n\r\nAs one last example, check out this lengthy exchange on splitting a string (Y) to parse JSON (X). 
I felt compelled to include this example because it does a good job capturing the degree of frustration (very high) that normally comes with XY problems.\r\nBut the thing about the XY problem is that it often prompts the lesson of asking good questions: donât skip steps in your reasoning, make your goals/intentions clear, use a reprex,5 and so on. But in so far as itâs a communication problem involving both parties, I think we should also talk about what the person answering the question can do to recognize an XY problem and break out of it.\r\nEnter me, someone who really needs to do a better job of recognizing when Iâm stuck in an XY problem. So with the definition out of the way, letâs break down how I messed up.\r\nThe question\r\nThe question asks:\r\n\r\nDoes anyone know how to access the number of bars in a barplot? Iâm looking for something that will return â15â for the following code, that can be used within ggplot, like after_stat()\r\n\r\nThe question comes with an example code. Not exactly a reprex, but something to help understand the question:\r\n\r\n\r\np <- ggplot(mpg, aes(manufacturer, fill = manufacturer)) +\r\n geom_bar()\r\np\r\n\r\n\r\n\r\nThe key phrase in the question is âcan be used within ggplotâ. So the user isnât looking for something like this even though itâs conceptually equivalent:\r\n\r\n\r\nlength(unique(mpg$manufacturer))\r\n\r\n [1] 15\r\n\r\nThe idea here is that ggplot knows that there are 15 bars, so this fact must represented somewhere in the internals. The user wants to be able to access that value dynamically.\r\nAttempt 1: after_stat()? I know that!\r\nThe very last part of the question â⌠like after_stat()â triggered some alarms in the thread and got me called in. For those unfamiliar, after_stat() is part of the new and obscure family of delayed aesthetic evaluation functions introduced in ggplot 3.3.0. Itâs something that you normally donât think about in ggplot, but itâs a topic that Iâve been obsessed with for the last 2 years or so: it has resulted in a paper, a package (ggtrace), blog posts, and talks (useR!, rstudio::conf, JSM).\r\nThe user asked about after_stat(), so naturally I came up with an after_stat() solution. In the after-stat stage of the bar layerâs data, the layer data looks like this:\r\n\r\n\r\n# remotes::install_github(\"yjunechoe/ggtrace\")\r\nlibrary(ggtrace)\r\n# Grab the state of the layer data in the after-stat\r\nlayer_after_stat(p)\r\n\r\n # A tibble: 15 Ă 8\r\n count prop x width flipped_aes fill PANEL group\r\n \r\n 1 18 1 1 0.9 FALSE audi 1 1\r\n 2 19 1 2 0.9 FALSE chevrolet 1 2\r\n 3 37 1 3 0.9 FALSE dodge 1 3\r\n 4 25 1 4 0.9 FALSE ford 1 4\r\n 5 9 1 5 0.9 FALSE honda 1 5\r\n 6 14 1 6 0.9 FALSE hyundai 1 6\r\n 7 8 1 7 0.9 FALSE jeep 1 7\r\n 8 4 1 8 0.9 FALSE land rover 1 8\r\n 9 3 1 9 0.9 FALSE lincoln 1 9\r\n 10 4 1 10 0.9 FALSE mercury 1 10\r\n 11 13 1 11 0.9 FALSE nissan 1 11\r\n 12 5 1 12 0.9 FALSE pontiac 1 12\r\n 13 14 1 13 0.9 FALSE subaru 1 13\r\n 14 34 1 14 0.9 FALSE toyota 1 14\r\n 15 27 1 15 0.9 FALSE volkswagen 1 15\r\n\r\nItâs a tidy data where each row represents a barplot. 
So the number of bars is the length of any column in the after-stat data, but itâd be most principled to take the length of the group column in this case.6\r\nSo the after-stat expression that returns the desired value 15 is after_stat(length(group)), which essentially evaluates to the following:\r\n\r\n\r\nlength(layer_after_stat(p)$group)\r\n\r\n [1] 15\r\n\r\nFor example, you can use this inside the aes() to annotate the total number of bars on top of each bar:\r\n\r\n\r\nggplot(mpg, aes(manufacturer, fill = manufacturer)) +\r\n geom_bar() +\r\n geom_label(\r\n aes(label = after_stat(length(group))),\r\n fill = \"white\",\r\n stat = \"count\"\r\n )\r\n\r\n\r\n\r\nThe after_stat(length(group)) solution returns the number of bars using after_stat(), as the user asked. But as you can see this is extremely useless: there are many technical constraints on what you can actually do with this information in the after-stat stage.\r\nI should have checked if they actually wanted an after_stat() solution first, before providing this answer. But I got distracted by the after_stat() keyword and got too excited by the prospect of someone else taking interest in the thing that Iâm obsessed with. Alas this wasnât the case - they were trying to do something practical - so I went back into the thread to figure out their goal for my second attempt.\r\nAttempt 2: Hmm but why not after_scale()?\r\nWhat I had neglected in my first attempt was the fact that the user talked more about their problem with someone else who got to the question before I did. That discussion turned out to include an important clue to the intent behind the original question: the user wanted the number of bars in order to interpolate the color of the bars.\r\nSo for example, a palette function like topo.colors() takes n to produce interpolated color values:\r\n\r\n\r\ntopo.colors(n = 16)\r\n\r\n [1] \"#4C00FF\" \"#0F00FF\" \"#002EFF\" \"#006BFF\" \"#00A8FF\" \"#00E5FF\" \"#00FF4D\"\r\n [8] \"#00FF00\" \"#4DFF00\" \"#99FF00\" \"#E6FF00\" \"#FFFF00\" \"#FFEA2D\" \"#FFDE59\"\r\n [15] \"#FFDB86\" \"#FFE0B3\"\r\n\r\nchroma::show_col(topo.colors(16))\r\n\r\n\r\n\r\nIf the intent is to use the number of bars to generate a vector of colors to assign to the bars, then a better place to do it would be in the after_scale(), where the state of the layer data in the after-scale looks like this:\r\n\r\n\r\nlayer_after_scale(p)\r\n\r\n # A tibble: 15 Ă 16\r\n fill y count prop x flipped_aes PANEL group ymin ymax xmin xmax \r\n \r\n 1 #F87⌠18 18 1 1 FALSE 1 1 0 18 0.55 1.45\r\n 2 #E58⌠19 19 1 2 FALSE 1 2 0 19 1.55 2.45\r\n 3 #C99⌠37 37 1 3 FALSE 1 3 0 37 2.55 3.45\r\n 4 #A3A⌠25 25 1 4 FALSE 1 4 0 25 3.55 4.45\r\n 5 #6BB⌠9 9 1 5 FALSE 1 5 0 9 4.55 5.45\r\n 6 #00B⌠14 14 1 6 FALSE 1 6 0 14 5.55 6.45\r\n 7 #00B⌠8 8 1 7 FALSE 1 7 0 8 6.55 7.45\r\n 8 #00C⌠4 4 1 8 FALSE 1 8 0 4 7.55 8.45\r\n 9 #00B⌠3 3 1 9 FALSE 1 9 0 3 8.55 9.45\r\n 10 #00B⌠4 4 1 10 FALSE 1 10 0 4 9.55 10.45\r\n 11 #619⌠13 13 1 11 FALSE 1 11 0 13 10.55 11.45\r\n 12 #B98⌠5 5 1 12 FALSE 1 12 0 5 11.55 12.45\r\n 13 #E76⌠14 14 1 13 FALSE 1 13 0 14 12.55 13.45\r\n 14 #FD6⌠34 34 1 14 FALSE 1 14 0 34 13.55 14.45\r\n 15 #FF6⌠27 27 1 15 FALSE 1 15 0 27 14.55 15.45\r\n # âš 4 more variables: colour , linewidth , linetype ,\r\n # alpha \r\n\r\nItâs still a tidy data where each row represents a bar. But the important distinction between the after-stat and the after-scale is that the after-scale data reflects the work of the (non-positional) scales. 
So the fill column here is now the actual hexadecimal color values for the bars:\r\n\r\n\r\nlayer_after_scale(p)$fill\r\n\r\n [1] \"#F8766D\" \"#E58700\" \"#C99800\" \"#A3A500\" \"#6BB100\" \"#00BA38\" \"#00BF7D\"\r\n [8] \"#00C0AF\" \"#00BCD8\" \"#00B0F6\" \"#619CFF\" \"#B983FF\" \"#E76BF3\" \"#FD61D1\"\r\n [15] \"#FF67A4\"\r\n\r\nchroma::show_col(layer_after_scale(p)$fill)\r\n\r\n\r\n\r\nWhat after_scale()/stage(after_scale = ) allows you to do is override these color values right before the layer data is sent off to be drawn. So we again use the same expression length(group) to grab the number of bars in the after-scale data, pass that value to a color palette function like topo.colors(), and re-map to the fill aesthetic.\r\n\r\n\r\nggplot(mpg, aes(manufacturer)) +\r\n geom_bar(aes(fill = stage(manufacturer, after_scale = topo.colors(length(group))))) +\r\n scale_fill_identity()\r\n\r\n\r\n\r\nSo this solution achieves the desired effect, but itâs needlessly complicated. You need complex staging of the fill aesthetic via stage() and you also need to pair this with scale_fill_identity() to let ggplot know that youâre directly supplying the fill values (otherwise you get errors and warnings).\r\nWait hold up - a fill scale? Did this user actually just want a custom fill scale? OhhhâŚ\r\nAttempt 3: Oh. You just wanted a scale_fill_*()âŚ\r\nSo yeah. It turns out that they just wanted a custom scale that takes some set of colors and interpolate the colors across the bars in the plot.\r\nThe correct way to approach this problem is to create a new fill scale that wraps around discrete_scale(). The scale function should take a set of colors (cols) and pass discrete_scale() a palette function created via the function factory colorRampPalette().\r\n\r\n\r\nscale_fill_interpolate <- function(cols, ...) {\r\n discrete_scale(\r\n aesthetics = \"fill\",\r\n scale_name = \"interpolate\",\r\n palette = colorRampPalette(cols),\r\n ...\r\n )\r\n}\r\n\r\n\r\nOur new scale_fill_interpolate() function can now be added to the plot like any other scale:\r\n\r\n\r\np +\r\n scale_fill_interpolate(c(\"pink\", \"goldenrod\"))\r\n\r\n\r\n\r\n\r\n\r\np +\r\n scale_fill_interpolate(c(\"steelblue\", \"orange\", \"forestgreen\"))\r\n\r\n\r\n\r\n\r\n\r\nset.seed(123)\r\ncols <- sample(colors(), 5)\r\ncols\r\n\r\n [1] \"lightgoldenrodyellow\" \"mediumorchid1\" \"gray26\" \r\n [4] \"palevioletred2\" \"gray42\"\r\n\r\np +\r\n scale_fill_interpolate(cols)\r\n\r\n\r\n\r\nI sent (a variant of) this answer to the thread and the user marked it solved with a thanks, concluding my desperate spiral into finding the right solution to the intended question.\r\nReflections\r\nSo why was this so hard for me to get? The most immediate cause is because I quickly skimmed the wording of the question and extracted two key phrases:\r\nâaccess the number of bars in a barplotâ\r\nâthat can be used within ggplot, like after_stat()â\r\nBut neither of these turned out to be important (or even relevant) to the solution. The correct answer was just a clean custom fill scale, where you donât have to think about the number of bars or accessing that in the internals. Simply extending discrete_scale() allows you to abstract away from those details entirely.\r\nSo in fairness, it was a very difficult XY problem to get out of. But the wording of the question wasnât the root cause. 
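One design note on that solution: colorRampPalette() is itself a function factory - you give it colors and get back a palette function of n, which is exactly the shape discrete_scale() expects. A quick way to convince yourself of this (hex outputs omitted):

pal <- colorRampPalette(c("pink", "goldenrod"))
pal(4)  # four hex codes stepping from pink to goldenrod
pal(15) # or one color per bar - the scale calls the palette with n = number of levels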
Reflections\r\nSo why was this so hard for me to get? The most immediate cause is that I quickly skimmed the wording of the question and extracted two key phrases:\r\n"access the number of bars in a barplot"\r\n"that can be used within ggplot, like after_stat()"\r\nBut neither of these turned out to be important (or even relevant) to the solution. The correct answer was just a clean custom fill scale, where you don't have to think about the number of bars or accessing that in the internals. Simply extending discrete_scale() allows you to abstract away from those details entirely.\r\nSo in fairness, it was a very difficult XY problem to get out of. But the wording of the question wasn't the root cause. I think the root cause is some combination of the following:\r\nThere are many ways to do the same thing in R, so I automatically assume that my solution counts as a contribution as long as it gets the job done. But solutions should also be understandable for the person asking the question. Looking back, I was insane to even suggest my second attempt as the solution because it's so contrived and borderline incomprehensible. It only sets the user up for more confusion and bugs in the future, so that was a bit irresponsible and selfish of me (it only scratches my itch).\r\nSolutions to (practical) problems are usually boring, and I'm allergic to boring solutions. This is a bad attitude to have when offering to help people. I assumed that people share my excitement about ggplot internals, but actually most users don't care (that's why it's called the internals and hidden from users). An important context that I miss as the person answering questions on the other end is that users post questions when they're stuck and frustrated. Their goal is not to take a hard problem and turn it into a thinking exercise or a learning experience (that part happens organically, but is not the goal). If anything, that's what I'm doing when I choose to take interest in other people's (coding) problems.\r\nI imbue intent into questions that are clearly missing it. I don't think that's a categorically bad thing because it can sometimes land you in a shortcut out of an XY problem. But when you miss, it's catastrophic and pulls you deeper into the problem. I think that was the case for me here - I conflated the X with the Y and assumed that after_stat() was relevant on face value because I personally know it to be a very powerful tool. I let my own history of treating after_stat() like the X ("How can I use after_stat() to solve/simplify this problem?") guide my interpretation of the question, which is not good practice.\r\nOf course, there is likely more to this, but these are plenty for me to work on for now.\r\nLastly, I don't want this to detract from the fact that the onus is on users to ask good questions. I don't want to put question-answerers on the spot for their handling of XY problems. After all, most are volunteers who gain nothing from helping others besides status and some internet points.7 Just take this as me telling myself to be a better person.\r\nEnding on a fun aside - accidentally escaping an XY problem\r\nIt's not my style to write serious blog posts. I think I deserve a break from many paragraphs of self-induced beatdown.\r\nSo in that spirit, I want to end on a funny anecdote where I escaped an XY problem by pure luck.\r\nI came across a relatively straightforward question which can be summarized as the following:\r\n\r\n\r\ninput <- \"a + c + d + e\"\r\noutput <- c(\"a\", \"c\", \"d\", \"e\")\r\n\r\n\r\nThere are many valid approaches to this and some were already posted to the thread:\r\n\r\n\r\nstrsplit(input, \" + \", TRUE)[[1]]\r\n\r\n [1] \"a\" \"c\" \"d\" \"e\"\r\n\r\nall.vars(parse(text = input))\r\n\r\n [1] \"a\" \"c\" \"d\" \"e\"\r\n\r\nMe, knowing too many useless things (and knowing that the user already has the best answers), suggested a quirky alternative:8\r\n\r\nThis is super off-label usage but you can also use R's formula utilities to parse this:9\r\n\r\n\r\n\r\nattr(terms(reformulate(input)), \"term.labels\")\r\n\r\n [1] \"a\" \"c\" \"d\" \"e\"\r\n\r\nTo my surprise, the response I got was:\r\n\r\nLovely! These definitely originated from formula ages ago so it's actually not far off-label at all 🙂\r\n
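To make footnote 9's point concrete: because the parsing goes through R's formula utilities, model-formula operators expand the way they would in terms(). This is base R behavior I'm illustrating, not part of the original exchange:

attr(terms(reformulate("a*b")), "term.labels")
 [1] "a"   "b"   "a:b"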
\r\nEspecially before slack deletes the old messages.↩︎\r\nIn R, you can use tools::file_ext() or fs::path_ext().↩︎\r\nhttps://en.wikipedia.org/wiki/Copypasta↩︎\r\nGood luck trying to explain the actual error message. Especially closure, a kind of weird vocabulary in R (fun fact - the first edition of Advanced R used to have a section on closures which is absent in the second edition, probably because "In R, almost every function is a closure").↩︎\r\nParadoxically, XY problems sometimes arise when inexperienced users try to come up with a reprex. They might capture the error/problem too narrowly, such that the more important broader context is left out.↩︎\r\nOr the number of distinct combinations between PANEL and group, as in nlevels(interaction(PANEL, group, drop = TRUE)). But of course that's overkill and only of interest for "theoretical purity".↩︎\r\nAnd I like the R4DS slack because it doesn't have "internet points." There is status (moderator), though I don't wear the badge (literally - it's an emoji).↩︎\r\nActually, I only thought of this because I'd been writing a statistical package that required some nasty metaprogramming with the formula object.↩︎\r\nThe significance of this solution building on top of R's formula utilities is that it will also parse stuff like \"a*b\" as c(\"a\", \"b\", \"a:b\"). So given that the inputs originated as R formulas (as the user later clarifies), this is the principled approach.↩︎\r\n",
"preview": "posts/2023-07-09-x-y-problem/preview.png",
- "last_modified": "2023-07-10T04:24:43-04:00",
+ "last_modified": "2023-07-10T01:24:43-07:00",
"input_file": {},
"preview_width": 238,
"preview_height": 205
@@ -183,7 +183,7 @@
],
"contents": "\r\n\r\nContents\r\nIntro\r\nSpecial properties of dplyr::slice()\r\nBasic usage\r\nRe-imagining slice() with data-masking\r\nSpecial properties of slice()\r\n\r\nA gallery of row operations with slice()\r\nRepeat rows (in place)\r\nSubset a selection of rows + the following row\r\nSubset a selection of rows + multiple following rows\r\nFilter (and encode) neighboring rows\r\nAside: kronecker() as as.vector(outer())\r\nWindowed min/max/median (etc.)\r\nEvenly distributed row shuffling of balanced categories\r\nInserting a new row at specific intervals\r\nEvenly distributed row shuffling of unequal categories\r\n\r\nConclusion\r\n\r\nIntro\r\nIn data wrangling, there are a handful of classes of operations on data frames that we think of as theoretically well-defined and tackling distinct problems. To name a few, these include subsetting, joins, split-apply-combine, pairwise operations, nested-column workflows, and so on.\r\nAgainst this rich backdrop, thereâs one aspect of data wrangling that doesnât receive as much attention: ordering of rows. This isnât necessarily surprising - we often think of row order as an auxiliary attribute of data frames since they donât speak to the content of the data, per se. I think we all share the intuition that two dataframe that differ only in row order are practically the same for most analysis purposes.\r\nExcept when they arenât.\r\nIn this blog post I want to talk about a few, somewhat esoteric cases of what I like to call row-relational operations. My goal is to try to motivate row-relational operations as a full-blown class of data wrangling operation that includes not only row ordering, but also sampling, shuffling, repeating, interweaving, and so on (Iâll go over all of these later).\r\nWithout spoiling too much, I believe that dplyr::slice() offers a powerful context for operations over row indices, even those that at first seem to lack a âtidyâ solution. You may already know slice() as an indexing function, but my hope is to convince you that it can do so much more.\r\nLetâs start by first talking about some special properties of dplyr::slice(), and then see how we can use it for various row-relational operations.\r\nSpecial properties of dplyr::slice()\r\nBasic usage\r\nFor the following demonstration, Iâll use a small subset of the dplyr::starwars dataset:\r\n\r\n\r\nstarwars_sm <- dplyr::starwars[1:10, 1:3]\r\nstarwars_sm\r\n\r\n # A tibble: 10 Ă 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Leia Organa 150 49\r\n 6 Owen Lars 178 120\r\n 7 Beru Whitesun Lars 165 75\r\n 8 R5-D4 97 32\r\n 9 Biggs Darklighter 183 84\r\n 10 Obi-Wan Kenobi 182 77\r\n\r\n1) Row selection\r\nslice() is a row indexing verb - if you pass it a vector of integers, it subsets data frame rows:\r\n\r\n\r\nstarwars_sm |> \r\n slice(1:6) # First six rows\r\n\r\n # A tibble: 6 Ă 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Leia Organa 150 49\r\n 6 Owen Lars 178 120\r\n\r\nLike other dplyr verbs with mutate-semantics, you can use context-dependent expressions inside slice(). 
For example, you can use n() to grab the last row (or last couple of rows):\r\n\r\n\r\nstarwars_sm |> \r\n slice( n() ) # Last row\r\n\r\n # A tibble: 1 Ă 3\r\n name height mass\r\n \r\n 1 Obi-Wan Kenobi 182 77\r\n\r\nstarwars_sm |> \r\n slice( n() - 2:0 ) # Last three rows\r\n\r\n # A tibble: 3 Ă 3\r\n name height mass\r\n \r\n 1 R5-D4 97 32\r\n 2 Biggs Darklighter 183 84\r\n 3 Obi-Wan Kenobi 182 77\r\n\r\nAnother context-dependent expression that comes in handy is row_number(), which returns all row indices. Using it inside slice() essentially performs an identity transformation:\r\n\r\n\r\nidentical(\r\n starwars_sm,\r\n starwars_sm |> slice( row_number() )\r\n)\r\n\r\n [1] TRUE\r\n\r\nLastly, similar to in select(), you can use - for negative indexing (to remove rows):\r\n\r\n\r\nidentical(\r\n starwars_sm |> slice(1:3), # First three rows\r\n starwars_sm |> slice(-(4:n())) # All rows except fourth row to last row\r\n)\r\n\r\n [1] TRUE\r\n\r\n2) Dynamic dots\r\nslice() supports dynamic dots. If you pass row indices into multiple argument positions, slice() will concatenate them for you:\r\n\r\n\r\nidentical(\r\n starwars_sm |> slice(1:6),\r\n starwars_sm |> slice(1, 2:4, 5, 6)\r\n)\r\n\r\n [1] TRUE\r\n\r\nIf you have a list() of row indices, you can use the splice operator !!! to spread them out:\r\n\r\n\r\nstarwars_sm |> \r\n slice( !!!list(1, 2:4, 5, 6) )\r\n\r\n # A tibble: 6 Ă 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Leia Organa 150 49\r\n 6 Owen Lars 178 120\r\n\r\nThe above call to slice() evaluates to the following after splicing:\r\n\r\n\r\nrlang::expr( slice(!!!list(1, 2:4, 5, 6)) )\r\n\r\n slice(1, 2:4, 5, 6)\r\n\r\n3) Row ordering\r\nslice() respects the order in which you supplied the row indices:\r\n\r\n\r\nstarwars_sm |> \r\n slice(3, 1, 2, 5)\r\n\r\n # A tibble: 4 Ă 3\r\n name height mass\r\n \r\n 1 R2-D2 96 32\r\n 2 Luke Skywalker 172 77\r\n 3 C-3PO 167 75\r\n 4 Leia Organa 150 49\r\n\r\nThis means you can do stuff like random sampling with sample():\r\n\r\n\r\nstarwars_sm |> \r\n slice( sample(n()) )\r\n\r\n # A tibble: 10 Ă 3\r\n name height mass\r\n \r\n 1 Obi-Wan Kenobi 182 77\r\n 2 Owen Lars 178 120\r\n 3 Leia Organa 150 49\r\n 4 Darth Vader 202 136\r\n 5 Luke Skywalker 172 77\r\n 6 R5-D4 97 32\r\n 7 C-3PO 167 75\r\n 8 Beru Whitesun Lars 165 75\r\n 9 Biggs Darklighter 183 84\r\n 10 R2-D2 96 32\r\n\r\nYou can also shuffle a subset of rows (ex: just the first five):\r\n\r\n\r\nstarwars_sm |> \r\n slice( sample(5), 6:n() )\r\n\r\n # A tibble: 10 Ă 3\r\n name height mass\r\n \r\n 1 C-3PO 167 75\r\n 2 Leia Organa 150 49\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Luke Skywalker 172 77\r\n 6 Owen Lars 178 120\r\n 7 Beru Whitesun Lars 165 75\r\n 8 R5-D4 97 32\r\n 9 Biggs Darklighter 183 84\r\n 10 Obi-Wan Kenobi 182 77\r\n\r\nOr reorder all rows by their indices (ex: in reverse):\r\n\r\n\r\nstarwars_sm |> \r\n slice( rev(row_number()) )\r\n\r\n # A tibble: 10 Ă 3\r\n name height mass\r\n \r\n 1 Obi-Wan Kenobi 182 77\r\n 2 Biggs Darklighter 183 84\r\n 3 R5-D4 97 32\r\n 4 Beru Whitesun Lars 165 75\r\n 5 Owen Lars 178 120\r\n 6 Leia Organa 150 49\r\n 7 Darth Vader 202 136\r\n 8 R2-D2 96 32\r\n 9 C-3PO 167 75\r\n 10 Luke Skywalker 172 77\r\n\r\n4) Out-of-bounds handling\r\nIf you pass a row index thatâs out of bounds, slice() returns a 0-row data frame:\r\n\r\n\r\nstarwars_sm |> \r\n slice( n() + 1 ) # Select the row after the last row\r\n\r\n # A tibble: 0 Ă 3\r\n # âš 3 variables: name 
, height , mass \r\n\r\nWhen mixed with valid row indices, out-of-bounds indices are simply ignored (much đ for this behavior):\r\n\r\n\r\nstarwars_sm |> \r\n slice(\r\n 0, # 0th row - ignored\r\n 1:3, # first three rows\r\n n() + 1 # 1 after last row - ignored\r\n )\r\n\r\n # A tibble: 3 Ă 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n\r\nThis lets you do funky stuff like select all even numbered rows by passing slice() all row indices times 2:\r\n\r\n\r\nstarwars_sm |> \r\n slice( row_number() * 2 ) # Add `- 1` at the end for *odd* rows!\r\n\r\n # A tibble: 5 Ă 3\r\n name height mass\r\n \r\n 1 C-3PO 167 75\r\n 2 Darth Vader 202 136\r\n 3 Owen Lars 178 120\r\n 4 R5-D4 97 32\r\n 5 Obi-Wan Kenobi 182 77\r\n\r\nRe-imagining slice() with data-masking\r\nslice() is already pretty neat as it is, but thatâs just the tip of the iceberg.\r\nThe really cool, under-rated feature of slice() is that itâs data-masked, meaning that you can reference column vectors as if theyâre variables. Another way of describing this property of slice() is to say that it has mutate-semantics.\r\nAt a very basic level, this means that slice() can straightforwardly replicate the behavior of some dplyr verbs like arrange() and filter()!\r\nslice() as arrange()\r\nFrom our starwars_sm data, if we want to sort by height we can use arrange():\r\n\r\n\r\nstarwars_sm |> \r\n arrange(height)\r\n\r\n # A tibble: 10 Ă 3\r\n name height mass\r\n \r\n 1 R2-D2 96 32\r\n 2 R5-D4 97 32\r\n 3 Leia Organa 150 49\r\n 4 Beru Whitesun Lars 165 75\r\n 5 C-3PO 167 75\r\n 6 Luke Skywalker 172 77\r\n 7 Owen Lars 178 120\r\n 8 Obi-Wan Kenobi 182 77\r\n 9 Biggs Darklighter 183 84\r\n 10 Darth Vader 202 136\r\n\r\nBut we can also do this with slice() to the same effect, using order():\r\n\r\n\r\nstarwars_sm |> \r\n slice( order(height) )\r\n\r\n # A tibble: 10 Ă 3\r\n name height mass\r\n \r\n 1 R2-D2 96 32\r\n 2 R5-D4 97 32\r\n 3 Leia Organa 150 49\r\n 4 Beru Whitesun Lars 165 75\r\n 5 C-3PO 167 75\r\n 6 Luke Skywalker 172 77\r\n 7 Owen Lars 178 120\r\n 8 Obi-Wan Kenobi 182 77\r\n 9 Biggs Darklighter 183 84\r\n 10 Darth Vader 202 136\r\n\r\nThis is conceptually equivalent to combining the following 2-step process:\r\n\r\n\r\nordered_val_ind <- order(starwars_sm$height)\r\n ordered_val_ind\r\n\r\n [1] 3 8 5 7 2 1 6 10 9 4\r\n\r\n\r\n\r\nstarwars_sm |> \r\n slice( ordered_val_ind )\r\n\r\n # A tibble: 10 Ă 3\r\n name height mass\r\n \r\n 1 R2-D2 96 32\r\n 2 R5-D4 97 32\r\n 3 Leia Organa 150 49\r\n 4 Beru Whitesun Lars 165 75\r\n 5 C-3PO 167 75\r\n 6 Luke Skywalker 172 77\r\n 7 Owen Lars 178 120\r\n 8 Obi-Wan Kenobi 182 77\r\n 9 Biggs Darklighter 183 84\r\n 10 Darth Vader 202 136\r\n\r\nslice() as filter()\r\nWe can also use slice() to filter(), using which():\r\n\r\n\r\nidentical(\r\n starwars_sm |> filter( height > 150 ),\r\n starwars_sm |> slice( which(height > 150) )\r\n)\r\n\r\n [1] TRUE\r\n\r\nThus, we can think of filter() and slice() as two sides of the same coin:\r\nfilter() takes a logical vector thatâs the same length as the number of rows in the data frame\r\nslice() takes an integer vector thatâs a (sub)set of a data frameâs row indices.\r\nTo put it more concretely, this logical vector was being passed to the above filter() call:\r\n\r\n\r\nstarwars_sm$height > 150\r\n\r\n [1] TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE\r\n\r\nWhile this integer vector was being passed to the above slice() call, where which() returns the position of TRUE values, given a logical 
Special properties of slice()\r\nThis re-imagined slice() that heavily exploits data-masking gives us two interesting properties:\r\nWe can work with sets of row indices that need not be the same length as the data frame (vs. filter()).\r\nWe can work with row indices as integers, which are legible to arithmetic operations (ex: + and *).\r\nTo grok the significance of working with rows as integer sets, let's work through some examples where slice() comes in very handy.\r\nA gallery of row operations with slice()\r\nRepeat rows (in place)\r\nIn {tidyr}, there's a function called uncount() which does the opposite of dplyr::count():\r\n\r\n\r\nlibrary(tidyr)\r\n# Example from `tidyr::uncount()` docs\r\nuncount_df <- tibble(x = c(\"a\", \"b\"), n = c(1, 2))\r\nuncount_df\r\n\r\n # A tibble: 2 × 2\r\n x n\r\n \r\n 1 a 1\r\n 2 b 2\r\n\r\nuncount_df |> \r\n uncount(n)\r\n\r\n # A tibble: 3 × 1\r\n x \r\n \r\n 1 a \r\n 2 b \r\n 3 b\r\n\r\nWe can mimic this behavior with slice(), using rep(times = ...):\r\n\r\n\r\nrep(1:nrow(uncount_df), times = uncount_df$n)\r\n\r\n [1] 1 2 2\r\n\r\nuncount_df |> \r\n slice( rep(row_number(), times = n) ) |> \r\n select( -n )\r\n\r\n # A tibble: 3 × 1\r\n x \r\n \r\n 1 a \r\n 2 b \r\n 3 b\r\n\r\nWhat if instead of a whole column storing that information, we only have information about row position?\r\nLet's say we want to duplicate the rows of starwars_sm at the repeat_at positions:\r\n\r\n\r\nrepeat_at <- sample(5, 2)\r\nrepeat_at\r\n\r\n [1] 4 5\r\n\r\nIn slice(), you'd just select all rows plus those additional rows, then sort the integer row indices:\r\n\r\n\r\nstarwars_sm |> \r\n slice( sort(c(row_number(), repeat_at)) )\r\n\r\n # A tibble: 12 × 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Darth Vader 202 136\r\n 6 Leia Organa 150 49\r\n 7 Leia Organa 150 49\r\n 8 Owen Lars 178 120\r\n 9 Beru Whitesun Lars 165 75\r\n 10 R5-D4 97 32\r\n 11 Biggs Darklighter 183 84\r\n 12 Obi-Wan Kenobi 182 77\r\n\r\nWhat if we also separately have information about how much to repeat those rows by?\r\n\r\n\r\nrepeat_by <- c(3, 4)\r\n\r\n\r\nYou can apply the same rep() method for just the subset of rows to repeat:\r\n\r\n\r\nstarwars_sm |> \r\n slice( sort(c(row_number(), rep(repeat_at, times = repeat_by - 1))) )\r\n\r\n # A tibble: 15 × 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Darth Vader 202 136\r\n 6 Darth Vader 202 136\r\n 7 Leia Organa 150 49\r\n 8 Leia Organa 150 49\r\n 9 Leia Organa 150 49\r\n 10 Leia Organa 150 49\r\n 11 Owen Lars 178 120\r\n 12 Beru Whitesun Lars 165 75\r\n 13 R5-D4 97 32\r\n 14 Biggs Darklighter 183 84\r\n 15 Obi-Wan Kenobi 182 77\r\n\r\nCircling back to uncount(), you could also initialize a vector of 1s and replace() where the rows should be repeated:\r\n\r\n\r\nstarwars_sm |> \r\n uncount( replace(rep(1, n()), repeat_at, repeat_by) )\r\n\r\n # A tibble: 15 × 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Darth Vader 202 136\r\n 6 Darth Vader 202 136\r\n 7 Leia Organa 150 49\r\n 8 Leia Organa 150 49\r\n 9 Leia Organa 150 49\r\n 10 Leia Organa 150 49\r\n 11 Owen Lars 178 120\r\n 12 Beru Whitesun Lars 165 75\r\n 13 R5-D4 97 32\r\n 14 Biggs Darklighter 183 84\r\n 15 Obi-Wan Kenobi 182 77\r\n
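Relatedly, when the repeat count is constant rather than stored per row, rep(each = ) duplicates every row in place - a small sketch of mine, reusing uncount_df from above:

uncount_df |> 
  slice( rep(row_number(), each = 2) ) # each row twice, in place: a, a, b, b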
Subset a selection of rows + the following row\r\nRow order can sometimes encode a meaningful continuous measure, like time.\r\nTake for example this subset of the flights dataset in {nycflights13}:\r\n\r\n\r\nflights_df <- nycflights13::flights |> \r\n filter(month == 3, day == 3, origin == \"JFK\") |> \r\n select(dep_time, flight, carrier) |> \r\n slice(1:100) |> \r\n arrange(dep_time)\r\nflights_df\r\n\r\n # A tibble: 100 × 3\r\n dep_time flight carrier\r\n \r\n 1 535 1141 AA \r\n 2 551 5716 EV \r\n 3 555 145 B6 \r\n 4 556 208 B6 \r\n 5 556 79 B6 \r\n 6 601 501 B6 \r\n 7 604 725 B6 \r\n 8 606 135 B6 \r\n 9 606 600 UA \r\n 10 607 829 US \r\n # ℹ 90 more rows\r\n\r\nHere, the rows are ordered by dep_time, such that given a row, the next row is a data point for the next flight that departed from the airport.\r\nAnd let's say we're interested in flights that took off immediately after American Airlines (\"AA\") flights. Given what we just noted about the ordering of rows in the data frame, we can do this in slice() by adding 1 to the row index of AA flights:\r\n\r\n\r\nflights_df |> \r\n slice( which(carrier == \"AA\") + 1 )\r\n\r\n # A tibble: 14 × 3\r\n dep_time flight carrier\r\n \r\n 1 551 5716 EV \r\n 2 627 905 B6 \r\n 3 652 117 B6 \r\n 4 714 825 AA \r\n 5 717 987 B6 \r\n 6 724 11 VX \r\n 7 742 183 DL \r\n 8 802 655 AA \r\n 9 805 2143 DL \r\n 10 847 59 B6 \r\n 11 858 647 AA \r\n 12 859 120 DL \r\n 13 1031 179 AA \r\n 14 1036 641 B6\r\n\r\nWhat if we also want to keep observations for the preceding AA flights as well? We can just stick which(carrier == \"AA\") inside slice() too:\r\n\r\n\r\nflights_df |> \r\n slice(\r\n which(carrier == \"AA\"),\r\n which(carrier == \"AA\") + 1\r\n )\r\n\r\n # A tibble: 28 × 3\r\n dep_time flight carrier\r\n \r\n 1 535 1141 AA \r\n 2 626 413 AA \r\n 3 652 1815 AA \r\n 4 711 443 AA \r\n 5 714 825 AA \r\n 6 724 33 AA \r\n 7 739 59 AA \r\n 8 802 1838 AA \r\n 9 802 655 AA \r\n 10 843 1357 AA \r\n # ℹ 18 more rows\r\n\r\nBut now the rows are ordered such that all the AA flights come before the other flights! How can we preserve the original order of increasing dep_time?\r\nWe could reconstruct the initial row order by piping the result into arrange(dep_time) again, but the simplest solution would be to concatenate the set of row indices and sort() them, since the output of which() is already integer!\r\n\r\n\r\nflights_df |> \r\n slice(\r\n sort(c(\r\n which(carrier == \"AA\"),\r\n which(carrier == \"AA\") + 1\r\n ))\r\n )\r\n\r\n # A tibble: 28 × 3\r\n dep_time flight carrier\r\n \r\n 1 535 1141 AA \r\n 2 551 5716 EV \r\n 3 626 413 AA \r\n 4 627 905 B6 \r\n 5 652 1815 AA \r\n 6 652 117 B6 \r\n 7 711 443 AA \r\n 8 714 825 AA \r\n 9 714 825 AA \r\n 10 717 987 B6 \r\n # ℹ 18 more rows\r\n\r\nNotice how the 8th and 9th rows are repeated here - that's because 2 AA flights departed in a row (ha!). We can use unique() to remove duplicate rows in the same call to slice():\r\n\r\n\r\nflights_df |> \r\n slice(\r\n unique(sort(c(\r\n which(carrier == \"AA\"),\r\n which(carrier == \"AA\") + 1\r\n )))\r\n )\r\n\r\n # A tibble: 24 × 3\r\n dep_time flight carrier\r\n \r\n 1 535 1141 AA \r\n 2 551 5716 EV \r\n 3 626 413 AA \r\n 4 627 905 B6 \r\n 5 652 1815 AA \r\n 6 652 117 B6 \r\n 7 711 443 AA \r\n 8 714 825 AA \r\n 9 717 987 B6 \r\n 10 724 33 AA \r\n # ℹ 14 more rows\r\n\r\nImportantly, we can do all of this inside slice() because we're working with integer sets. The integer part allows us to do things like + 1 and sort(), while the set part allows us to combine with c() and remove duplicates with unique().\r\n
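For this particular single-offset case, there is also an equivalent filter() formulation using dplyr::lag(), since row order encodes departure order here (my aside, not from the post):

flights_df |> 
  filter( carrier == "AA" | lag(carrier, default = "") == "AA" )
# same rows as the sort() + unique() slice() call above, in the same order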
Subset a selection of rows + multiple following rows\r\nIn this example, let's problematize our approach with the repeated which() calls in our previous solution.\r\nImagine another scenario where we want to filter for all AA flights and the three subsequent flights for each.\r\nDo we need to write the solution out like this? That's a lot of repetition!\r\n\r\n\r\nflights_df |> \r\n slice(\r\n which(carrier == \"AA\"),\r\n which(carrier == \"AA\") + 1,\r\n which(carrier == \"AA\") + 2,\r\n which(carrier == \"AA\") + 3\r\n )\r\n\r\n\r\nYou might think we can get away with + 0:3, but it doesn't work as we'd like. The + just forces 0:3 to be (partially) recycled to the length of the which() vector for element-wise addition:\r\n\r\n\r\nwhich(flights_df$carrier == \"AA\") + 0:3\r\n\r\n Warning in which(flights_df$carrier == \"AA\") + 0:3: longer object length is not\r\n a multiple of shorter object length\r\n [1] 1 14 20 27 25 28 34 40 38 62 66 68 91 93\r\n\r\nIf only we could get the outer sum of the two arrays, 0:3 and which(carrier == \"AA\") … Oh wait, we can - that's what outer() does!\r\n\r\n\r\nouter(0:3, which(flights_df$carrier == \"AA\"), `+`)\r\n\r\n [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]\r\n [1,] 1 13 18 24 25 27 32 37 38 61 64 65 91 92\r\n [2,] 2 14 19 25 26 28 33 38 39 62 65 66 92 93\r\n [3,] 3 15 20 26 27 29 34 39 40 63 66 67 93 94\r\n [4,] 4 16 21 27 28 30 35 40 41 64 67 68 94 95\r\n\r\nThis is essentially the repeated which() vectors stacked on top of each other, but as a matrix:\r\n\r\n\r\nprint( which(flights_df$carrier == \"AA\") )\r\nprint( which(flights_df$carrier == \"AA\") + 1 )\r\nprint( which(flights_df$carrier == \"AA\") + 2 )\r\nprint( which(flights_df$carrier == \"AA\") + 3 )\r\n\r\n [1] 1 13 18 24 25 27 32 37 38 61 64 65 91 92\r\n [1] 2 14 19 25 26 28 33 38 39 62 65 66 92 93\r\n [1] 3 15 20 26 27 29 34 39 40 63 66 67 93 94\r\n [1] 4 16 21 27 28 30 35 40 41 64 67 68 94 95\r\n\r\nThe fact that outer() returns all the relevant row indices inside a single matrix is nice because we can collect the indices column-by-column, preserving row order. Matrices, like data frames, are column-major, so coercing a matrix to a vector collapses it column-wise:\r\n\r\n\r\nas.integer( outer(0:3, which(flights_df$carrier == \"AA\"), `+`) )\r\n\r\n [1] 1 2 3 4 13 14 15 16 18 19 20 21 24 25 26 27 25 26 27 28 27 28 29 30 32\r\n [26] 33 34 35 37 38 39 40 38 39 40 41 61 62 63 64 64 65 66 67 65 66 67 68 91 92\r\n [51] 93 94 92 93 94 95\r\n\r\n\r\nOther ways to coerce a matrix to vector\r\nThere are two other options for coercing a matrix to vector - c() and as.vector(). 
I like to stick with as.integer() because that enforces integer type (which makes sense for row indices), and c() can be nice because it's less to type (although it's off-label usage):\r\n\r\n\r\n# Not run, but equivalent to `as.integer()` method\r\nas.vector( outer(0:3, which(flights_df$carrier == \"AA\"), `+`) )\r\nc( outer(0:3, which(flights_df$carrier == \"AA\"), `+`) )\r\n\r\n\r\nSomewhat relatedly - and this only works inside the tidy-eval context of slice() - you can get a similar effect of "collapsing" a matrix using the splice operator !!!:\r\n\r\n\r\nseq_matrix <- matrix(1:9, byrow = TRUE, nrow = 3)\r\nas.integer(seq_matrix)\r\n\r\n [1] 1 4 7 2 5 8 3 6 9\r\n\r\nidentical(\r\n mtcars |> slice( as.vector(seq_matrix) ),\r\n mtcars |> slice( !!!seq_matrix )\r\n)\r\n\r\n [1] TRUE\r\n\r\nHere, the !!!seq_matrix was slotting each individual "cell" as an argument to slice():\r\n\r\n\r\nrlang::expr( slice(!!!seq_matrix) )\r\n\r\n slice(1L, 4L, 7L, 2L, 5L, 8L, 3L, 6L, 9L)\r\n\r\nA big difference in behavior between as.integer() and !!! is that the latter works for lists of indices too, by slotting each element of the list as an argument to slice():\r\n\r\n\r\nseq_list <- list(c(1, 4, 7, 2), c(5, 8, 3, 6, 9))\r\nrlang::expr( slice( !!!seq_list ) )\r\n\r\n slice(c(1, 4, 7, 2), c(5, 8, 3, 6, 9))\r\n\r\nHowever, as you may already know, as.integer() cannot flatten lists:\r\n\r\n\r\nas.integer(seq_list)\r\n\r\n Error in eval(expr, envir, enclos): 'list' object cannot be coerced to type 'integer'\r\n\r\nNote that as.vector() and c() leave lists as-is, which is another reason to prefer as.integer() for type-checking:\r\n\r\n\r\nidentical(seq_list, as.vector(seq_list))\r\nidentical(seq_list, c(seq_list))\r\n\r\n [1] TRUE\r\n [1] TRUE\r\n\r\nFinally, back in our !!!seq_matrix example, we could have applied asplit(MARGIN = 2) to chunk the splicing by matrix column, although the overall effect would be the same:\r\n\r\n\r\nrlang::expr(slice( !!!seq_matrix ))\r\n\r\n slice(1L, 4L, 7L, 2L, 5L, 8L, 3L, 6L, 9L)\r\n\r\nrlang::expr(slice( !!!asplit(seq_matrix, 2) ))\r\n\r\n slice(c(1L, 4L, 7L), c(2L, 5L, 8L), c(3L, 6L, 9L))\r\n\r\nThis lets us ask questions like: Which AA flights departed within 3 flights of another AA flight?\r\n\r\n\r\nflights_df |> \r\n slice( as.integer( outer(0:3, which(carrier == \"AA\"), `+`) ) ) |> \r\n filter( carrier == \"AA\", duplicated(flight) ) |> \r\n distinct(flight, carrier)\r\n\r\n # A tibble: 6 × 2\r\n flight carrier\r\n \r\n 1 825 AA \r\n 2 33 AA \r\n 3 655 AA \r\n 4 1 AA \r\n 5 647 AA \r\n 6 179 AA\r\n\r\n\r\nSlicing all the way down: Case 1\r\nWith the addition of the .by argument to slice() in dplyr 1.1.0, we can re-write the above code as three calls to slice() (+ a call to select()):\r\n\r\n\r\nflights_df |> \r\n slice( as.integer( outer(0:3, which(carrier == \"AA\"), `+`) ) ) |> \r\n slice( which(carrier == \"AA\" & duplicated(flight)) ) |> # filter()\r\n slice( 1, .by = c(flight, carrier) ) |> # distinct()\r\n select(flight, carrier)\r\n\r\n # A tibble: 6 × 2\r\n flight carrier\r\n \r\n 1 825 AA \r\n 2 33 AA \r\n 3 655 AA \r\n 4 1 AA \r\n 5 647 AA \r\n 6 179 AA\r\n\r\nThe next example will demonstrate another, perhaps more practical use case for outer() in slice().\r\n
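The outer() pattern generalizes nicely into a reusable helper. Below is a hypothetical wrapper (my naming) for "rows matching a condition, plus the next n rows", leaning on the fact shown earlier that slice() simply ignores out-of-bounds indices:

slice_following <- function(.data, cond, n = 1) {
  .data |> 
    dplyr::slice( unique(sort(as.integer( outer(0:n, which({{ cond }}), `+`) ))) )
}
flights_df |> 
  slice_following(carrier == "AA", n = 3)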
Filter (and encode) neighboring rows\r\nLet's use a subset of the {gapminder} data set for this one. Here, we have data for each European country's GDP-per-capita by year, between 1992 and 2007:\r\n\r\n\r\ngapminder_df <- gapminder::gapminder |> \r\n left_join(gapminder::country_codes, by = \"country\") |> # `multiple = \"all\"`\r\n filter(year >= 1992, continent == \"Europe\") |> \r\n select(country, country_code = iso_alpha, year, gdpPercap)\r\ngapminder_df\r\n\r\n # A tibble: 120 × 4\r\n country country_code year gdpPercap\r\n \r\n 1 Albania ALB 1992 2497.\r\n 2 Albania ALB 1997 3193.\r\n 3 Albania ALB 2002 4604.\r\n 4 Albania ALB 2007 5937.\r\n 5 Austria AUT 1992 27042.\r\n 6 Austria AUT 1997 29096.\r\n 7 Austria AUT 2002 32418.\r\n 8 Austria AUT 2007 36126.\r\n 9 Belgium BEL 1992 25576.\r\n 10 Belgium BEL 1997 27561.\r\n # ℹ 110 more rows\r\n\r\nThis time, let's see the desired output (plot) first and build our way up. The goal is to plot the GDP growth of Germany over the years, and its yearly GDP neighbors side-by-side:\r\n\r\n\r\n\r\nFirst, let's think about what a "GDP neighbor" means in row-relational terms. If you arranged the data by GDP, the GDP neighbors would be the rows that come immediately before and after the rows for Germany. You need to recalculate neighbors every year though, so this arrange() + slice() combo should happen by-year.\r\nWith that in mind, let's set up a year grouping and arrange by gdpPercap within year:1\r\n\r\n\r\ngapminder_df |> \r\n group_by(year) |> \r\n arrange(gdpPercap, .by_group = TRUE)\r\n\r\n # A tibble: 120 × 4\r\n # Groups: year [4]\r\n country country_code year gdpPercap\r\n \r\n 1 Albania ALB 1992 2497.\r\n 2 Bosnia and Herzegovina BIH 1992 2547.\r\n 3 Turkey TUR 1992 5678.\r\n 4 Bulgaria BGR 1992 6303.\r\n 5 Romania ROU 1992 6598.\r\n 6 Montenegro MNE 1992 7003.\r\n 7 Poland POL 1992 7739.\r\n 8 Croatia HRV 1992 8448.\r\n 9 Serbia SRB 1992 9325.\r\n 10 Slovak Republic SVK 1992 9498.\r\n # ℹ 110 more rows\r\n\r\nNow within each year, we want to grab the row for Germany and its neighboring rows. 
We can do this by taking the outer() sum of -1:1 and the row indices for Germany:\r\n\r\n\r\ngapminder_df |> \r\n group_by(year) |> \r\n arrange(gdpPercap, .by_group = TRUE) |> \r\n slice( as.integer(outer( -1:1, which(country == \"Germany\"), `+` )) )\r\n\r\n # A tibble: 12 × 4\r\n # Groups: year [4]\r\n country country_code year gdpPercap\r\n \r\n 1 Denmark DNK 1992 26407.\r\n 2 Germany DEU 1992 26505.\r\n 3 Netherlands NLD 1992 26791.\r\n 4 Belgium BEL 1997 27561.\r\n 5 Germany DEU 1997 27789.\r\n 6 Iceland ISL 1997 28061.\r\n 7 United Kingdom GBR 2002 29479.\r\n 8 Germany DEU 2002 30036.\r\n 9 Belgium BEL 2002 30486.\r\n 10 France FRA 2007 30470.\r\n 11 Germany DEU 2007 32170.\r\n 12 United Kingdom GBR 2007 33203.\r\n\r\n\r\nSlicing all the way down: Case 2\r\nThe new .by argument in slice() comes in handy again here, allowing us to collapse the group_by() + arrange() combo into one slice() call:\r\n\r\n\r\ngapminder_df |> \r\n slice( order(gdpPercap), .by = year) |> \r\n slice( as.integer(outer( -1:1, which(country == \"Germany\"), `+` )) )\r\n\r\n # A tibble: 12 × 4\r\n country country_code year gdpPercap\r\n \r\n 1 Denmark DNK 1992 26407.\r\n 2 Germany DEU 1992 26505.\r\n 3 Netherlands NLD 1992 26791.\r\n 4 Belgium BEL 1997 27561.\r\n 5 Germany DEU 1997 27789.\r\n 6 Iceland ISL 1997 28061.\r\n 7 United Kingdom GBR 2002 29479.\r\n 8 Germany DEU 2002 30036.\r\n 9 Belgium BEL 2002 30486.\r\n 10 France FRA 2007 30470.\r\n 11 Germany DEU 2007 32170.\r\n 12 United Kingdom GBR 2007 33203.\r\n\r\nFor our purposes here we actually want the grouping to persist for the following mutate() call, but there may be other cases where you'd want to use slice(.by = ) for temporary grouping.\r\nNow we're already starting to see the shape of the data that we want! The last step is to encode the relationship of each row to Germany - does a row represent Germany itself, or a country that's one GDP ranking below or above Germany?\r\nContinuing with our grouped context, we make a new column grp that assigns a factor value \"lo\"-\"is\"-\"hi\" (for "lower" than Germany, "is" Germany and "higher" than Germany) to each country trio by year. Notice the use of fct_inorder() below - this ensures that the factor levels are in the order of their occurrence (necessary for the correct ordering of bars in geom_col() later):\r\n\r\n\r\ngapminder_df |> \r\n group_by(year) |> \r\n arrange(gdpPercap) |> \r\n slice( as.integer(outer( -1:1, which(country == \"Germany\"), `+` )) ) |> \r\n mutate(grp = forcats::fct_inorder(c(\"lo\", \"is\", \"hi\")))\r\n\r\n # A tibble: 12 × 5\r\n # Groups: year [4]\r\n country country_code year gdpPercap grp \r\n \r\n 1 Denmark DNK 1992 26407. lo \r\n 2 Germany DEU 1992 26505. is \r\n 3 Netherlands NLD 1992 26791. hi \r\n 4 Belgium BEL 1997 27561. lo \r\n 5 Germany DEU 1997 27789. is \r\n 6 Iceland ISL 1997 28061. hi \r\n 7 United Kingdom GBR 2002 29479. lo \r\n 8 Germany DEU 2002 30036. is \r\n 9 Belgium BEL 2002 30486. hi \r\n 10 France FRA 2007 30470. lo \r\n 11 Germany DEU 2007 32170. is \r\n 12 United Kingdom GBR 2007 33203. hi\r\n
We now have everything that's necessary to make our desired plot, so we ungroup(), write some {ggplot2} code, and voila!\r\n\r\n\r\ngapminder_df |> \r\n group_by(year) |> \r\n arrange(gdpPercap) |> \r\n slice( as.integer(outer( -1:1, which(country == \"Germany\"), `+` )) ) |> \r\n mutate(grp = forcats::fct_inorder(c(\"lo\", \"is\", \"hi\"))) |> \r\n # Ungroup and make ggplot\r\n ungroup() |> \r\n ggplot(aes(as.factor(year), gdpPercap, group = grp)) +\r\n geom_col(aes(fill = grp == \"is\"), position = position_dodge()) +\r\n geom_text(\r\n aes(label = country_code),\r\n vjust = 1.3,\r\n position = position_dodge(width = .9)\r\n ) +\r\n scale_fill_manual(\r\n values = c(\"grey75\", \"steelblue\"),\r\n guide = guide_none()\r\n ) +\r\n theme_classic() +\r\n labs(x = \"Year\", y = \"GDP per capita\")\r\n\r\n\r\n\r\n\r\nSolving the harder version of the problem\r\nThe solution presented above relies on the fragile assumption that Germany will always have a higher- and a lower-ranking GDP neighbor every year. But nothing about the problem description guarantees this, so how can we re-write our code to be more robust?\r\nFirst, let's simulate data where Germany is the lowest-ranking country in 2002 and the highest-ranking in 2007. In other words, Germany only has one GDP neighbor in those years:\r\n\r\n\r\ngapminder_harder_df <- gapminder_df |> \r\n slice( order(gdpPercap), .by = year) |> \r\n slice( as.integer(outer( -1:1, which(country == \"Germany\"), `+` )) ) |> \r\n slice( -7, -12 )\r\ngapminder_harder_df\r\n\r\n # A tibble: 10 × 4\r\n country country_code year gdpPercap\r\n \r\n 1 Denmark DNK 1992 26407.\r\n 2 Germany DEU 1992 26505.\r\n 3 Netherlands NLD 1992 26791.\r\n 4 Belgium BEL 1997 27561.\r\n 5 Germany DEU 1997 27789.\r\n 6 Iceland ISL 1997 28061.\r\n 7 Germany DEU 2002 30036.\r\n 8 Belgium BEL 2002 30486.\r\n 9 France FRA 2007 30470.\r\n 10 Germany DEU 2007 32170.\r\n\r\nGiven this data, we cannot assign the full, length-3 lo-is-hi factor by group, because the groups for year 2002 and 2007 only have 2 observations:\r\n\r\n\r\ngapminder_harder_df |> \r\n group_by(year) |> \r\n mutate(grp = forcats::fct_inorder(c(\"lo\", \"is\", \"hi\")))\r\n\r\n Error in `mutate()`:\r\n ℹ In argument: `grp = forcats::fct_inorder(c(\"lo\", \"is\", \"hi\"))`.\r\n ℹ In group 3: `year = 2002`.\r\n Caused by error:\r\n ! `grp` must be size 2 or 1, not 3.\r\n\r\nThe trick here is to turn each group of rows into an integer sequence where Germany is "anchored" to 2, and then use that vector to subset the lo-is-hi factor:\r\n\r\n\r\ngapminder_harder_df |> \r\n group_by(year) |> \r\n mutate(\r\n Germany_anchored_to_2 = row_number() - which(country == \"Germany\") + 2,\r\n grp = forcats::fct_inorder(c(\"lo\", \"is\", \"hi\"))[Germany_anchored_to_2]\r\n )\r\n\r\n # A tibble: 10 × 6\r\n # Groups: year [4]\r\n country country_code year gdpPercap Germany_anchored_to_2 grp \r\n \r\n 1 Denmark DNK 1992 26407. 1 lo \r\n 2 Germany DEU 1992 26505. 2 is \r\n 3 Netherlands NLD 1992 26791. 3 hi \r\n 4 Belgium BEL 1997 27561. 1 lo \r\n 5 Germany DEU 1997 27789. 2 is \r\n 6 Iceland ISL 1997 28061. 3 hi \r\n 7 Germany DEU 2002 30036. 2 is \r\n 8 Belgium BEL 2002 30486. 3 hi \r\n 9 France FRA 2007 30470. 1 lo \r\n 10 Germany DEU 2007 32170. 2 is\r\n
We find that the lessons from working with row indices in slice() translate to solving this more complex mutate() problem as well. Neat!

Aside: kronecker() as as.vector(outer())

Following from the slice() + outer() strategy demoed above, imagine if we wanted to filter for Luke and Anakin Skywalker, along with the characters that neighbor them in the mass and height rankings.


dplyr::starwars[, 1:3]

 # A tibble: 87 × 3
    name               height  mass
    <chr>               <int> <dbl>
  1 Luke Skywalker        172    77
  2 C-3PO                 167    75
  3 R2-D2                  96    32
  4 Darth Vader           202   136
  5 Leia Organa           150    49
  6 Owen Lars             178   120
  7 Beru Whitesun Lars    165    75
  8 R5-D4                  97    32
  9 Biggs Darklighter     183    84
 10 Obi-Wan Kenobi        182    77
 # ℹ 77 more rows

In row-relational terms, "filtering neighboring values" just means "filtering rows after arranging by the values we care about". We can express this using slice() and outer() as:


starwars %>% 
  select(name, mass, height) %>% 
  arrange(mass, height) %>% 
  slice( as.vector(outer(-1:1, which(grepl("(Luke|Anakin) Skywalker", name)), `+`)) )

 # A tibble: 6 × 3
   name               mass height
   <chr>             <dbl>  <int>
 1 Wedge Antilles       77    170
 2 Luke Skywalker       77    172
 3 Obi-Wan Kenobi       77    182
 4 Biggs Darklighter    84    183
 5 Anakin Skywalker     84    188
 6 Mace Windu           84    188

I raised this example on an unrelated thread on the R4DS/DSLC Slack, where Anthony Durrant pointed me to kronecker() as a version of outer() that flattens its output into a vector before returning it.

So in examples that use outer() to generate row indices for slice(), we can reach for kronecker() instead and save a call to a flattening function like as.vector():


starwars %>% 
  select(name, mass, height) %>% 
  arrange(mass, height) %>% 
  slice( kronecker(-1:1, which(grepl("(Luke|Anakin) Skywalker", name)), `+`) )

 # A tibble: 6 × 3
   name               mass height
   <chr>             <dbl>  <int>
 1 Wedge Antilles       77    170
 2 Biggs Darklighter    84    183
 3 Luke Skywalker       77    172
 4 Anakin Skywalker     84    188
 5 Obi-Wan Kenobi       77    182
 6 Mace Windu           84    188
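One thing that's easy to miss in the output above: the rows come back in a different order. as.vector(outer(x, y, f)) flattens the matrix column by column, so the indices are grouped by target row, whereas kronecker(x, y, f) varies its first argument slowest, so the indices are grouped by offset. A minimal comparison with toy indices c(2, 5):


as.vector(outer(-1:1, c(2, 5), `+`))

 [1] 1 2 3 4 5 6

kronecker(-1:1, c(2, 5), `+`)

 [1] 1 4 2 5 3 6

Swapping the argument order, as in kronecker(inds, -n:n, `+`), keeps each target's neighbors together again; the function below does exactly that.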
Returning to this problem with kronecker() also inspired me to write a function around it. Since slice()-ing with which(...) is just filter(...)-ing, I call it filter_around() and give it defaults that make it behave just like filter().


filter_around <- function(.data, ..., by, n = 0, name = NULL) {
  # Arrange by the ranking variable(s) so that "neighbors" end up on adjacent rows
  data <- .data |>
    dplyr::arrange(dplyr::pick({{ by }}))
  # Evaluate the filter conditions in the data and combine them with AND
  dots <- rlang::enquos(...)
  lgls <- lapply(dots, rlang::eval_tidy, data = data)
  inds <- which(as.logical(do.call(pmin, lgls)))
  # Expand each matched row index to include its n neighbors on either side
  inds_around <- kronecker(inds, -n:n, `+`)
  # Optionally flag the original matches in a new logical column
  if (!is.null(name)) {
    data[[name]] <- replace(rep(FALSE, nrow(data)), inds, TRUE)
  }
  data |>
    dplyr::slice(.env$inds_around)
}


Passing conditions to the dots makes it behave like filter():


dplyr::starwars[, 1:3] |>
  filter_around(
    # The two conditions below jointly evaluate to `grepl("(Luke|Anakin) Skywalker", name)`
    grepl("Skywalker", name),
    name != "Shmi Skywalker"
  )

 # A tibble: 2 × 3
   name             height  mass
   <chr>             <int> <dbl>
 1 Luke Skywalker      172    77
 2 Anakin Skywalker    188    84

And the other arguments opt into the "around" behavior:


# Filter the target rows *and* a pair of neighbors for each row by height + mass
dplyr::starwars[, 1:3] |>
  filter_around(
    grepl("Skywalker", name),
    name != "Shmi Skywalker",
    # Extra args
    by = c(height, mass),
    n = 1,
    name = "target"
  )

 # A tibble: 6 × 4
   name height mass target