diff --git a/docs/index.html b/docs/index.html index 76b63e7..18c959c 100644 --- a/docs/index.html +++ b/docs/index.html @@ -2742,7 +2742,7 @@

June Choe

-

I am a linguist, software developer, and scientific programmer. In my PhD research in Linguistics at the University of Pennsylvania, I use experimental, corpus, and computational approaches to studying linguistic meaning, of various flavors. My dissertation investigates the learning problem of acquiring hierarchical noun meanings in early childhood, from the angles of semantics/pragmatics, distributional learning, and conceptual development.

+

I am a fifth-year PhD candidate in Linguistics at the University of Pennsylvania. I use experimental, corpus, and computational approaches to studying linguistic meaning, of various flavors. My dissertation investigates the learning problem of acquiring hierarchical noun meanings in early childhood, from the angles of semantics/pragmatics, distributional learning, and conceptual development.

I am also active in the R programming community as a mentor and open-source developer. I maintain and collaborate on several open-source software in statistical computing (jlmerclusterperm), data visualization (ggtrace), data quality assurance (pointblank), and interfaces to data APIs (openalexR). My work on graphics received an award from the American Statistical Association in 2023. I enjoy writing in my free time and have been maintaining a technical blog over the past 5 years, covering topics including R language internals, software design principles, and practical tutorials for everyday data analysis.

diff --git a/docs/posts/2020-06-07-correlation-parameter-mem/index.html b/docs/posts/2020-06-07-correlation-parameter-mem/index.html index 0d0fe55..244f91b 100644 --- a/docs/posts/2020-06-07-correlation-parameter-mem/index.html +++ b/docs/posts/2020-06-07-correlation-parameter-mem/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-06-25-indexing-tip-for-spacyr/index.html b/docs/posts/2020-06-25-indexing-tip-for-spacyr/index.html index a7eb1f3..ded9385 100644 --- a/docs/posts/2020-06-25-indexing-tip-for-spacyr/index.html +++ b/docs/posts/2020-06-25-indexing-tip-for-spacyr/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-06-30-treemap-with-ggplot/index.html b/docs/posts/2020-06-30-treemap-with-ggplot/index.html index 58a5f3d..ee08096 100644 --- a/docs/posts/2020-06-30-treemap-with-ggplot/index.html +++ b/docs/posts/2020-06-30-treemap-with-ggplot/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-07-13-geom-paired-raincloud/index.html b/docs/posts/2020-07-13-geom-paired-raincloud/index.html index e878169..460a823 100644 --- a/docs/posts/2020-07-13-geom-paired-raincloud/index.html +++ b/docs/posts/2020-07-13-geom-paired-raincloud/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-07-20-shiny-tips-1/index.html b/docs/posts/2020-07-20-shiny-tips-1/index.html index ef2c5b0..a109769 100644 --- a/docs/posts/2020-07-20-shiny-tips-1/index.html +++ b/docs/posts/2020-07-20-shiny-tips-1/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-07-29-six-years-of-my-spotify-playlists/index.html b/docs/posts/2020-07-29-six-years-of-my-spotify-playlists/index.html index f215cb1..9aaf784 100644 --- a/docs/posts/2020-07-29-six-years-of-my-spotify-playlists/index.html +++ b/docs/posts/2020-07-29-six-years-of-my-spotify-playlists/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-08-04-tidytuesday-2020-week-32/index.html b/docs/posts/2020-08-04-tidytuesday-2020-week-32/index.html index ea1b16a..02d6caa 100644 --- a/docs/posts/2020-08-04-tidytuesday-2020-week-32/index.html +++ b/docs/posts/2020-08-04-tidytuesday-2020-week-32/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-08-07-saving-a-line-of-piping/index.html b/docs/posts/2020-08-07-saving-a-line-of-piping/index.html index 0fca849..65d4da6 100644 --- a/docs/posts/2020-08-07-saving-a-line-of-piping/index.html +++ b/docs/posts/2020-08-07-saving-a-line-of-piping/index.html @@ -2618,7 +2618,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-08-17-tidytuesday-2020-week-33/index.html b/docs/posts/2020-08-17-tidytuesday-2020-week-33/index.html index 3735453..c71a884 100644 --- a/docs/posts/2020-08-17-tidytuesday-2020-week-33/index.html +++ b/docs/posts/2020-08-17-tidytuesday-2020-week-33/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-09-06-fonts-for-graphs/index.html b/docs/posts/2020-09-06-fonts-for-graphs/index.html index 2f04d59..655a9a7 100644 --- a/docs/posts/2020-09-06-fonts-for-graphs/index.html +++ b/docs/posts/2020-09-06-fonts-for-graphs/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-09-12-videos-in-reactable/index.html b/docs/posts/2020-09-12-videos-in-reactable/index.html index 3f61e54..673e5ee 100644 --- a/docs/posts/2020-09-12-videos-in-reactable/index.html +++ b/docs/posts/2020-09-12-videos-in-reactable/index.html @@ -2623,7 +2623,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-09-14-tidytuesday-2020-week-38/index.html b/docs/posts/2020-09-14-tidytuesday-2020-week-38/index.html index 9bf230a..7f51ace 100644 --- a/docs/posts/2020-09-14-tidytuesday-2020-week-38/index.html +++ b/docs/posts/2020-09-14-tidytuesday-2020-week-38/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-09-20-plot-makeover-1/index.html b/docs/posts/2020-09-20-plot-makeover-1/index.html index de874e7..4080775 100644 --- a/docs/posts/2020-09-20-plot-makeover-1/index.html +++ b/docs/posts/2020-09-20-plot-makeover-1/index.html @@ -2619,7 +2619,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-09-23-tidytuesday-2020-week-39/index.html b/docs/posts/2020-09-23-tidytuesday-2020-week-39/index.html index 4d11e65..473e4d9 100644 --- a/docs/posts/2020-09-23-tidytuesday-2020-week-39/index.html +++ b/docs/posts/2020-09-23-tidytuesday-2020-week-39/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-09-26-demystifying-stat-layers-ggplot2/index.html b/docs/posts/2020-09-26-demystifying-stat-layers-ggplot2/index.html index 28fbfee..b9572db 100644 --- a/docs/posts/2020-09-26-demystifying-stat-layers-ggplot2/index.html +++ b/docs/posts/2020-09-26-demystifying-stat-layers-ggplot2/index.html @@ -2614,7 +2614,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-10-13-designing-guiding-aesthetics/index.html b/docs/posts/2020-10-13-designing-guiding-aesthetics/index.html index 12b1464..67169b0 100644 --- a/docs/posts/2020-10-13-designing-guiding-aesthetics/index.html +++ b/docs/posts/2020-10-13-designing-guiding-aesthetics/index.html @@ -2617,7 +2617,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-10-22-analysis-of-everycolorbots-tweets/index.html b/docs/posts/2020-10-22-analysis-of-everycolorbots-tweets/index.html index 51bcfbd..10c1222 100644 --- a/docs/posts/2020-10-22-analysis-of-everycolorbots-tweets/index.html +++ b/docs/posts/2020-10-22-analysis-of-everycolorbots-tweets/index.html @@ -2618,7 +2618,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-10-28-tidytuesday-2020-week-44/index.html b/docs/posts/2020-10-28-tidytuesday-2020-week-44/index.html index 4626475..7bc8e83 100644 --- a/docs/posts/2020-10-28-tidytuesday-2020-week-44/index.html +++ b/docs/posts/2020-10-28-tidytuesday-2020-week-44/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-11-03-tidytuesday-2020-week-45/index.html b/docs/posts/2020-11-03-tidytuesday-2020-week-45/index.html index d1391ce..1ae279d 100644 --- a/docs/posts/2020-11-03-tidytuesday-2020-week-45/index.html +++ b/docs/posts/2020-11-03-tidytuesday-2020-week-45/index.html @@ -2612,7 +2612,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-11-08-plot-makeover-2/index.html b/docs/posts/2020-11-08-plot-makeover-2/index.html index 3178c45..ee9e1a4 100644 --- a/docs/posts/2020-11-08-plot-makeover-2/index.html +++ b/docs/posts/2020-11-08-plot-makeover-2/index.html @@ -2625,7 +2625,7 @@

${suggestion.title}

diff --git a/docs/posts/2020-12-13-collapse-repetitive-piping-with-reduce/index.html b/docs/posts/2020-12-13-collapse-repetitive-piping-with-reduce/index.html index c9151f3..7173cb9 100644 --- a/docs/posts/2020-12-13-collapse-repetitive-piping-with-reduce/index.html +++ b/docs/posts/2020-12-13-collapse-repetitive-piping-with-reduce/index.html @@ -2620,7 +2620,7 @@

${suggestion.title}

diff --git a/docs/posts/2021-01-17-random-sampling-a-table-animation/index.html b/docs/posts/2021-01-17-random-sampling-a-table-animation/index.html index fe8f186..a14ade0 100644 --- a/docs/posts/2021-01-17-random-sampling-a-table-animation/index.html +++ b/docs/posts/2021-01-17-random-sampling-a-table-animation/index.html @@ -2614,7 +2614,7 @@

${suggestion.title}

diff --git a/docs/posts/2021-06-24-setting-up-and-debugging-custom-fonts/index.html b/docs/posts/2021-06-24-setting-up-and-debugging-custom-fonts/index.html index 2666606..aa5a8b8 100644 --- a/docs/posts/2021-06-24-setting-up-and-debugging-custom-fonts/index.html +++ b/docs/posts/2021-06-24-setting-up-and-debugging-custom-fonts/index.html @@ -2619,7 +2619,7 @@

${suggestion.title}

diff --git a/docs/posts/2022-03-10-ggplot2-delayed-aes-1/index.html b/docs/posts/2022-03-10-ggplot2-delayed-aes-1/index.html index 3b4e2bd..285789b 100644 --- a/docs/posts/2022-03-10-ggplot2-delayed-aes-1/index.html +++ b/docs/posts/2022-03-10-ggplot2-delayed-aes-1/index.html @@ -2621,7 +2621,7 @@

${suggestion.title}

diff --git a/docs/posts/2022-07-06-ggplot2-delayed-aes-2/index.html b/docs/posts/2022-07-06-ggplot2-delayed-aes-2/index.html index 53672e4..ce0b649 100644 --- a/docs/posts/2022-07-06-ggplot2-delayed-aes-2/index.html +++ b/docs/posts/2022-07-06-ggplot2-delayed-aes-2/index.html @@ -2630,7 +2630,7 @@

${suggestion.title}

diff --git a/docs/posts/2022-07-30-user2022/index.html b/docs/posts/2022-07-30-user2022/index.html index 4d532e4..3bbacaf 100644 --- a/docs/posts/2022-07-30-user2022/index.html +++ b/docs/posts/2022-07-30-user2022/index.html @@ -2620,7 +2620,7 @@

${suggestion.title}

diff --git a/docs/posts/2022-11-13-dataframes-jl-and-accessories/index.html b/docs/posts/2022-11-13-dataframes-jl-and-accessories/index.html index 08b84f5..dec515d 100644 --- a/docs/posts/2022-11-13-dataframes-jl-and-accessories/index.html +++ b/docs/posts/2022-11-13-dataframes-jl-and-accessories/index.html @@ -2622,7 +2622,7 @@

${suggestion.title}

diff --git a/docs/posts/2023-06-11-row-relational-operations/index.html b/docs/posts/2023-06-11-row-relational-operations/index.html index a908a1d..37b5432 100644 --- a/docs/posts/2023-06-11-row-relational-operations/index.html +++ b/docs/posts/2023-06-11-row-relational-operations/index.html @@ -2626,7 +2626,7 @@

${suggestion.title}

diff --git a/docs/posts/2023-07-09-x-y-problem/index.html b/docs/posts/2023-07-09-x-y-problem/index.html index 24ef487..1e53001 100644 --- a/docs/posts/2023-07-09-x-y-problem/index.html +++ b/docs/posts/2023-07-09-x-y-problem/index.html @@ -2620,7 +2620,7 @@

${suggestion.title}

diff --git a/docs/posts/2023-12-03-untidy-select/index.html b/docs/posts/2023-12-03-untidy-select/index.html index c72ee28..440a5f0 100644 --- a/docs/posts/2023-12-03-untidy-select/index.html +++ b/docs/posts/2023-12-03-untidy-select/index.html @@ -2626,7 +2626,7 @@

${suggestion.title}

diff --git a/docs/posts/2023-12-31-2023-year-in-review/index.html b/docs/posts/2023-12-31-2023-year-in-review/index.html index 20a84b7..d291bc4 100644 --- a/docs/posts/2023-12-31-2023-year-in-review/index.html +++ b/docs/posts/2023-12-31-2023-year-in-review/index.html @@ -2620,7 +2620,7 @@

${suggestion.title}

diff --git a/docs/posts/2024-02-20-helloworld-print/index.html b/docs/posts/2024-02-20-helloworld-print/index.html index 47029a6..da8a78e 100644 --- a/docs/posts/2024-02-20-helloworld-print/index.html +++ b/docs/posts/2024-02-20-helloworld-print/index.html @@ -2620,7 +2620,7 @@

${suggestion.title}

diff --git a/docs/posts/2024-03-04-args-args-args-args/index.html b/docs/posts/2024-03-04-args-args-args-args/index.html index 73f8124..1892af8 100644 --- a/docs/posts/2024-03-04-args-args-args-args/index.html +++ b/docs/posts/2024-03-04-args-args-args-args/index.html @@ -2620,7 +2620,7 @@

${suggestion.title}

diff --git a/docs/posts/2024-06-09-ave-for-the-average/index.html b/docs/posts/2024-06-09-ave-for-the-average/index.html index 073379e..347f80d 100644 --- a/docs/posts/2024-06-09-ave-for-the-average/index.html +++ b/docs/posts/2024-06-09-ave-for-the-average/index.html @@ -2620,7 +2620,7 @@

${suggestion.title}

diff --git a/docs/posts/2024-07-21-enumerate-possible-options/index.html b/docs/posts/2024-07-21-enumerate-possible-options/index.html index 135b919..40402aa 100644 --- a/docs/posts/2024-07-21-enumerate-possible-options/index.html +++ b/docs/posts/2024-07-21-enumerate-possible-options/index.html @@ -2616,7 +2616,7 @@

${suggestion.title}

diff --git a/docs/posts/2024-09-22-fetch-files-web/index.html b/docs/posts/2024-09-22-fetch-files-web/index.html index 8016c1c..5013530 100644 --- a/docs/posts/2024-09-22-fetch-files-web/index.html +++ b/docs/posts/2024-09-22-fetch-files-web/index.html @@ -2616,7 +2616,7 @@

${suggestion.title}

diff --git a/docs/posts/posts.json b/docs/posts/posts.json index db24bfc..28605cc 100644 --- a/docs/posts/posts.json +++ b/docs/posts/posts.json @@ -15,7 +15,7 @@ ], "contents": "\r\n\r\nContents\r\nGitHub (public repos)\r\nGitHub (gists)\r\nGitHub (private repos)\r\nOSF\r\nAside: Can’t go wrong with a copy-paste!\r\nOther goodies\r\nStreaming with {duckdb}\r\nOther sources for data\r\nMiscellaneous tips and tricks\r\n\r\nsessionInfo()\r\n\r\nEvery so often I’ll have a link to some file on hand and want to read it in R without going out of my way to browse the web page, find a download link, download it somewhere onto my computer, grab the path to it, and then finally read it into R.\r\nOver the years I’ve accumulated some tricks to get data into R “straight from a url”, even if the url does not point to the raw file contents itself. The method varies between data sources though, and I have a hard time keeping track of them in my head, so I thought I’d write some of these down for my own reference. This is not meant to be comprehensive though - keep in mind that I’m someone who primarily works with tabular data and interface with GitHub and OSF as data repositories.\r\nGitHub (public repos)\r\nGitHub has nice a point-and-click interface for browsing repositories and previewing files. For example, you can navigate to the dplyr::starwars dataset from tidyverse/dplyr, at https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv:\r\n\r\n\r\n\r\nThat url, despite ending in a .csv, does not point to the raw data - instead, the contents of the page is a full html document:\r\n\r\n\r\nrvest::read_html(\"https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv\")\r\n\r\n\r\n {html_document}\r\n \\n \r\n dplyr::glimpse()\r\n\r\n Rows: 87\r\n Columns: 14\r\n $ name \"Luke Skywalker\", \"C-3PO\", \"R2-D2\", \"Darth Vader\", \"Leia Or…\r\n $ height 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…\r\n $ mass 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…\r\n $ hair_color \"blond\", NA, NA, \"none\", \"brown\", \"brown, grey\", \"brown\", N…\r\n $ skin_color \"fair\", \"gold\", \"white, blue\", \"white\", \"light\", \"light\", \"…\r\n $ eye_color \"blue\", \"yellow\", \"red\", \"yellow\", \"brown\", \"blue\", \"blue\",…\r\n $ birth_year 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …\r\n $ sex \"male\", \"none\", \"none\", \"male\", \"female\", \"male\", \"female\",…\r\n $ gender \"masculine\", \"masculine\", \"masculine\", \"masculine\", \"femini…\r\n $ homeworld \"Tatooine\", \"Tatooine\", \"Naboo\", \"Tatooine\", \"Alderaan\", \"T…\r\n $ species \"Human\", \"Droid\", \"Droid\", \"Human\", \"Human\", \"Human\", \"Huma…\r\n $ films \"A New Hope, The Empire Strikes Back, Return of the Jedi, R…\r\n $ vehicles \"Snowspeeder, Imperial Speeder Bike\", \"\", \"\", \"\", \"Imperial…\r\n $ starships \"X-wing, Imperial shuttle\", \"\", \"\", \"TIE Advanced x1\", \"\", …\r\n\r\nBut note that this method of “click the Raw button to get the corresponding raw.githubusercontent.com/… url to the file contents” will not work for file formats that cannot be displayed in plain text (clicking the button will instead download the file via your browser). So sometimes (especially when you have a binary file) you have to construct this “remote-readable” url to the file manually.\r\nFortunately, going from one link to the other is pretty formulaic. To demonstrate the difference with the url for the starwars dataset again:\r\n\r\n\r\nemphatic::hl_diff(\r\n \"https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv\",\r\n \"https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv\"\r\n)\r\n\r\n\r\n[1] \"https:// github .com/tidyverse/dplyr/blob/main/data-raw/starwars.csv\"[1] \"https://raw.githubusercontent.com/tidyverse/dplyr /main/data-raw/starwars.csv\"\r\n\r\n\r\nGitHub (gists)\r\nIt’s a similar idea with GitHub Gists, where I sometimes like to store small toy datasets for use in demos. For example, here’s a link to a simulated data for a Stroop experiment stroop.csv: https://gist.github.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6.\r\nBut that’s again a full-on webpage. The url which actually hosts the csv contents is https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv, which you can again get to by clicking the Raw button at the top-right corner of the gist\r\n\r\n\r\n\r\nBut actually, that long link you get by default points to the current commit, specifically. If you instead want the link to be kept up to date with the most recent commit, you can omit the second hash that comes after raw/:\r\n\r\n\r\nemphatic::hl_diff(\r\n \"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv\",\r\n \"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/stroop.csv\"\r\n)\r\n\r\n\r\n[1] \"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv\"[1] \"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw /stroop.csv\"\r\n\r\n\r\nIn practice, I don’t use gists to store replicability-sensitive data, so I prefer to just use the shorter link that’s not tied to a specific commit.\r\n\r\n\r\nread.csv(\"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/stroop.csv\") |> \r\n dplyr::glimpse()\r\n\r\n Rows: 240\r\n Columns: 5\r\n $ subj \"S01\", \"S01\", \"S01\", \"S01\", \"S01\", \"S01\", \"S01\", \"S01\", \"S02…\r\n $ word \"blue\", \"blue\", \"green\", \"green\", \"red\", \"red\", \"yellow\", \"y…\r\n $ condition \"match\", \"mismatch\", \"match\", \"mismatch\", \"match\", \"mismatch…\r\n $ accuracy 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …\r\n $ RT 400, 549, 576, 406, 296, 231, 433, 1548, 561, 1751, 286, 710…\r\n\r\nGitHub (private repos)\r\nWe now turn to the harder problem of accessing a file in a private GitHub repository. If you already have the GitHub webpage open and you’re signed in, you can follow the same step of copying the link that the Raw button redirects to.\r\nExcept this time, when you open the file at that url (assuming it can display in plain text), you’ll see the url come with a “token” attached at the end (I’ll show an example further down). This token is necessary to remotely access the data in a private repo. Once a token is generated, the file can be accessed using that token from anywhere, but note that it will expire at some point as GitHub refreshes tokens periodically (so treat them as if they’re for single use).\r\nFor a more robust approach, you can use the GitHub Contents API. If you have your credentials set up in {gh} (which you can check with gh::gh_whoami()), you can request a token-tagged url to the private file using the syntax:1\r\n\r\n\r\ngh::gh(\"/repos/{user}/{repo}/contents/{path}\")$download_url\r\n\r\n\r\nNote that this is actually also a general solution to getting a url to GitHub file contents. So for example, even without any credentials set up you can point to dplyr’s starwars.csv since that’s publicly accessible. This method produces the same “raw.githubusercontent.com/…” url we saw earlier:\r\n\r\n\r\ngh::gh(\"/repos/tidyverse/dplyr/contents/data-raw/starwars.csv\")$download_url\r\n\r\n [1] \"https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv\"\r\n\r\nNow for a demonstration with a private repo, here is one of mine that you cannot access https://github.com/yjunechoe/my-super-secret-repo. But because I set up my credentials in {gh}, I can generate a link to a content within that repo with the access token attached (“?token=…”):\r\n\r\n\r\ngh::gh(\"/repos/yjunechoe/my-super-secret-repo/contents/README.md\")$download_url |> \r\n # truncating\r\n gsub(x = _, \"^(.{100}).*\", \"\\\\1...\")\r\n\r\n [1] \"https://raw.githubusercontent.com/yjunechoe/my-super-secret-repo/main/README.md?token=AMTCUR2JPXCIX5...\"\r\n\r\nI can then use this url to read the private file:2\r\n\r\n\r\ngh::gh(\"/repos/yjunechoe/my-super-secret-repo/contents/README.md\")$download_url |> \r\n readLines()\r\n\r\n [1] \"Surprise!\"\r\n\r\nOSF\r\nOSF (the Open Science Framework) is another data repository that I interact with a lot, and reading files off of OSF follows a similar strategy to fetching public files on GitHub.\r\nConsider, for example, the dyestuff.arrow file in the OSF repository for MixedModels.jl. Browsing the repository through the point-and-click interface can get you to the page for the file at https://osf.io/9vztj/, where it shows:\r\n\r\n\r\n\r\nThe download button can be found inside the dropdown menubar to the right:\r\n\r\n\r\n\r\nBut instead of clicking on the icon (which will start a download via the browser), we can grab the embedded link address: https://osf.io/download/9vztj/. That url can then be passed directly into a read function:\r\n\r\n\r\narrow::read_feather(\"https://osf.io/download/9vztj/\") |> \r\n dplyr::glimpse()\r\n\r\n Rows: 30\r\n Columns: 2\r\n $ batch A, A, A, A, A, B, B, B, B, B, C, C, C, C, C, D, D, D, D, D, E, E…\r\n $ yield 1545, 1440, 1440, 1520, 1580, 1540, 1555, 1490, 1560, 1495, 1595…\r\n\r\nYou might have already caught on to this, but the pattern is to simply point to osf.io/download/ instead of osf.io/.\r\nThis method also works for view-only links to anonymized OSF projects as well. For example, this is an anonymized link to a csv file from one of my projects https://osf.io/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad. Navigating to this link will show a web preview of the csv file contents.\r\nBy inserting /download into this url, we can read the csv file contents directly:\r\n\r\n\r\nread.csv(\"https://osf.io/download/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad\") |> \r\n head()\r\n\r\n Item plaus_bias trans_bias\r\n 1 Awakened -0.29631221 -1.2200901\r\n 2 Calmed 0.09877074 -0.4102332\r\n 3 Choked 1.28401957 -1.4284905\r\n 4 Dressed -0.59262442 -1.2087228\r\n 5 Failed -0.98770736 0.1098839\r\n 6 Groomed -1.08647810 0.9889550\r\n\r\nSee also the {osfr} package for a more principled interface to OSF.\r\nAside: Can’t go wrong with a copy-paste!\r\nReading remote files aside, I think it’s severely underrated how base R has a readClipboard() function and a collection of read.*() functions which can also read directly from a \"clipboard\" connection.3\r\nI sometimes do this for html/markdown summary tables that a website might display, or sometimes even for entire excel/googlesheets tables after doing a select-all + copy. For such relatively small chunks of data that you just want to quickly get into R, you can lean on base R’s clipboard functionalities.\r\nFor example, given this markdown table:\r\n\r\n\r\naggregate(mtcars, mpg ~ cyl, mean) |> \r\n knitr::kable()\r\n\r\ncyl\r\nmpg\r\n4\r\n26.66364\r\n6\r\n19.74286\r\n8\r\n15.10000\r\n\r\nYou can copy its contents and run the following code to get that data back as an R data frame:\r\n\r\n\r\nread.delim(\"clipboard\")\r\n# Or, `read.delim(text = readClipboard())`\r\n\r\n\r\n\r\n cyl mpg\r\n 1 4 26.66364\r\n 2 6 19.74286\r\n 3 8 15.10000\r\n\r\nIf you’re instead copying something flat like a list of numbers or strings, you can also use scan() and specify the appropriate sep to get that data back as a vector:4\r\n\r\n\r\npaste(1:10, collapse = \", \") |> \r\n cat()\r\n\r\n 1, 2, 3, 4, 5, 6, 7, 8, 9, 10\r\n\r\n\r\n\r\nscan(\"clipboard\", sep = \",\")\r\n# Or, `scan(textConnection(readClipboard()), sep = \",\")`\r\n\r\n\r\n\r\n [1] 1 2 3 4 5 6 7 8 9 10\r\n\r\nIt should be noted though that parsing clipboard contents is not a robust feature in base R. If you want a more principled approach to reading data from clipboard, you should use {datapasta}. And for printing data for others to copy-paste into R, use {constructive}. See also {clipr} which extends clipboard read/write functionalities.\r\nOther goodies\r\n⚠️ What lies ahead are denser than the kinds of “low-tech” advice I wrote about above.\r\nStreaming with {duckdb}\r\nOne caveat to all the “read from web” approaches I covered above is that it often does not actually circumvent the action of downloading the file onto your computer. For example, when you read a file from “raw.githubusercontent.com/…” with read.csv(), there is an implicit download.file() of the data into the current R session’s tempdir().\r\nAn alternative that actually reads the data straight into memory is streaming. Streaming is moreso a feature of database languages, but there’s good integration of such tools with R, so this option is available from within R as well.\r\nHere, I briefly outline what I learned from (mostly) reading a blog post by François Michonneau, which covers how to stream remote files using {duckdb}. It’s pretty comprehensive but I wanted to make a template for just one method that I prefer.\r\nWe start by loading the {duckdb} package, creating a connection to an in-memory database, installing the httpfs extension (if not installed already), and loading httpfs for the database.\r\n\r\n\r\nlibrary(duckdb)\r\ncon <- dbConnect(duckdb())\r\n# dbExecute(con, \"INSTALL httpfs;\") # You may also need to \"INSTALL parquet;\"\r\ninvisible(dbExecute(con, \"LOAD httpfs;\"))\r\n\r\n\r\nFor this example I will use a parquet file from one of my projects which is hosted on GitHub: https://github.com/yjunechoe/repetition_events. The data I want to read is at the relative path /data/tokens_data/childID=1/part-7.parquet. I went ahead and converted that into the “raw contents” url shown below:\r\n\r\n\r\n# A parquet file of tokens from a sample of child-directed speech\r\nfile <- \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet\"\r\n\r\n# For comparison, reading its contents with {arrow}\r\narrow::read_parquet(file) |> \r\n head(5)\r\n\r\n # A tibble: 5 × 3\r\n utterance_id gloss part_of_speech\r\n \r\n 1 1 www \"\" \r\n 2 2 bye \"co\" \r\n 3 3 mhm \"co\" \r\n 4 4 Mommy's \"n:prop\" \r\n 5 4 here \"adv\"\r\n\r\nIn duckdb, the httpfs extension we loaded above allows PARQUET_SCAN5 to read a remote parquet file.\r\n\r\n\r\nquery1 <- glue::glue_sql(\"\r\n SELECT *\r\n FROM PARQUET_SCAN({`file`})\r\n LIMIT 5;\r\n\", .con = con)\r\ncat(query1)\r\n\r\n SELECT *\r\n FROM PARQUET_SCAN(\"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet\")\r\n LIMIT 5;\r\n\r\ndbGetQuery(con, query1)\r\n\r\n utterance_id gloss part_of_speech\r\n 1 1 www \r\n 2 2 bye co\r\n 3 3 mhm co\r\n 4 4 Mommy's n:prop\r\n 5 4 here adv\r\n\r\nAnd actually, in my case, the parquet file represents one of many files that had been previously split up via hive partitioning. To preserve this metadata even as I read in just a single file, I need to do two things:\r\nSpecify hive_partitioning=true when calling PARQUET_SCAN.\r\nEnsure that the hive-partitioning syntax is represented in the url with URLdecode() (since the = character can sometimes be escaped, as in this case).\r\n\r\n\r\nemphatic::hl_diff(file, URLdecode(file))\r\n\r\n\r\n[1] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet\"[1] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID= 1/part-7.parquet\"\r\n\r\n\r\nWith that, the data now shows that the observations are from child #1 in the sample.\r\n\r\n\r\nfile <- URLdecode(file)\r\nquery2 <- glue::glue_sql(\"\r\n SELECT *\r\n FROM PARQUET_SCAN(\r\n {`file`},\r\n hive_partitioning=true\r\n )\r\n LIMIT 5;\r\n\", .con = con)\r\ncat(query2)\r\n\r\n SELECT *\r\n FROM PARQUET_SCAN(\r\n \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=1/part-7.parquet\",\r\n hive_partitioning=true\r\n )\r\n LIMIT 5;\r\n\r\ndbGetQuery(con, query2)\r\n\r\n utterance_id gloss part_of_speech childID\r\n 1 1 www 1\r\n 2 2 bye co 1\r\n 3 3 mhm co 1\r\n 4 4 Mommy's n:prop 1\r\n 5 4 here adv 1\r\n\r\nTo do this more programmatically over all parquet files under /tokens_data in the repository, we need to transition to using the GitHub Trees API. The idea is similar to using the Contents API but now we are requesting a list of all files using the following syntax:\r\n\r\n\r\ngh::gh(\"/repos/{user}/{repo}/git/trees/{branch/tag/commitSHA}?recursive=true\")$tree\r\n\r\n\r\nTo get the file tree of the repo on the master branch, we use:\r\n\r\n\r\nfiles <- gh::gh(\"/repos/yjunechoe/repetition_events/git/trees/master?recursive=true\")$tree\r\n\r\n\r\nWith recursive=true, this returns all files in the repo. Then, we can filter for just the parquet files we want with a little regex:\r\n\r\n\r\nparquet_files <- sapply(files, `[[`, \"path\") |> \r\n grep(x = _, pattern = \".*/tokens_data/.*parquet$\", value = TRUE)\r\nlength(parquet_files)\r\n\r\n [1] 70\r\n\r\nhead(parquet_files)\r\n\r\n [1] \"data/tokens_data/childID=1/part-7.parquet\" \r\n [2] \"data/tokens_data/childID=10/part-0.parquet\"\r\n [3] \"data/tokens_data/childID=11/part-6.parquet\"\r\n [4] \"data/tokens_data/childID=12/part-3.parquet\"\r\n [5] \"data/tokens_data/childID=13/part-1.parquet\"\r\n [6] \"data/tokens_data/childID=14/part-2.parquet\"\r\n\r\nFinally, we complete the path using the “https://raw.githubusercontent.com/…” url:\r\n\r\n\r\nparquet_files <- paste0(\r\n \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/\",\r\n parquet_files\r\n)\r\nhead(parquet_files)\r\n\r\n [1] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=1/part-7.parquet\" \r\n [2] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=10/part-0.parquet\"\r\n [3] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=11/part-6.parquet\"\r\n [4] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=12/part-3.parquet\"\r\n [5] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=13/part-1.parquet\"\r\n [6] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=14/part-2.parquet\"\r\n\r\nBack on duckdb, we can use PARQUET_SCAN to read multiple files by supplying a vector ['file1.parquet', 'file2.parquet', ...].6 This time, we also ask for a quick computation to count the number of distinct childIDs:\r\n\r\n\r\nquery3 <- glue::glue_sql(\"\r\n SELECT count(DISTINCT childID)\r\n FROM PARQUET_SCAN(\r\n [{parquet_files*}],\r\n hive_partitioning=true\r\n )\r\n\", .con = con)\r\ncat(gsub(\"^(.{80}).*(.{60})$\", \"\\\\1 ... \\\\2\", query3))\r\n\r\n SELECT count(DISTINCT childID)\r\n FROM PARQUET_SCAN(\r\n ['https://raw.githubusercont ... data/childID=9/part-64.parquet'],\r\n hive_partitioning=true\r\n )\r\n\r\ndbGetQuery(con, query3)\r\n\r\n count(DISTINCT childID)\r\n 1 70\r\n\r\nThis returns 70 which matches the length of the parquet_files vector listing the files that had been partitioned by childID.\r\nFor further analyses, we can CREATE TABLE7 our data in our in-memory database con:\r\n\r\n\r\nquery4 <- glue::glue_sql(\"\r\n CREATE TABLE tokens_data AS\r\n SELECT *\r\n FROM PARQUET_SCAN([{parquet_files*}], hive_partitioning=true)\r\n\", .con = con)\r\ninvisible(dbExecute(con, query4))\r\ndbListTables(con)\r\n\r\n [1] \"tokens_data\"\r\n\r\nThat lets us reference the table via dplyr::tbl(), at which point we can switch over to another high-level interface like {dplyr} to query it using its familiar functions:\r\n\r\n\r\nlibrary(dplyr)\r\ntokens_data <- tbl(con, \"tokens_data\")\r\n\r\n# Q: What are the most common verbs spoken to children in this sample?\r\ntokens_data |> \r\n filter(part_of_speech == \"v\") |> \r\n count(gloss, sort = TRUE) |> \r\n head() |> \r\n collect()\r\n\r\n # A tibble: 6 × 2\r\n gloss n\r\n \r\n 1 go 13614\r\n 2 see 13114\r\n 3 do 11829\r\n 4 have 10794\r\n 5 want 10560\r\n 6 put 9190\r\n\r\nCombined, here’s one (hastily put together) attempt at wrapping this workflow into a function:\r\n\r\n\r\nload_dataset_from_gh <- function(con, tblname, user, repo, branch, regex,\r\n partition = TRUE, lazy = TRUE) {\r\n \r\n allfiles <- gh::gh(glue::glue(\"/repos/{user}/{repo}/git/trees/{branch}?recursive=true\"))$tree\r\n files_relpath <- grep(regex, sapply(allfiles, `[[`, \"path\"), value = TRUE)\r\n # Use the actual Contents API here instead, if the repo is private\r\n files <- glue::glue(\"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{files_relpath}\")\r\n \r\n type <- if (lazy) quote(VIEW) else quote(TABLE)\r\n partition <- as.integer(partition)\r\n \r\n dbExecute(con, \"LOAD httpfs;\")\r\n dbExecute(con, glue::glue_sql(\"\r\n CREATE {type} {`tblname`} AS\r\n SELECT *\r\n FROM PARQUET_SCAN([{parquet_files*}], hive_partitioning={partition})\r\n \", .con = con))\r\n \r\n invisible(TRUE)\r\n\r\n}\r\n\r\ncon2 <- dbConnect(duckdb())\r\nload_dataset_from_gh(\r\n con = con2,\r\n tblname = \"tokens_data\",\r\n user = \"yjunechoe\",\r\n repo = \"repetition_events\",\r\n branch = \"master\",\r\n regex = \".*data/tokens_data/.*parquet$\"\r\n)\r\ntbl(con2, \"tokens_data\")\r\n\r\n # Source: table [?? x 4]\r\n # Database: DuckDB v1.0.0 [jchoe@Windows 10 x64:R 4.4.1/:memory:]\r\n utterance_id gloss part_of_speech childID\r\n \r\n 1 1 www \"\" 1\r\n 2 2 bye \"co\" 1\r\n 3 3 mhm \"co\" 1\r\n 4 4 Mommy's \"n:prop\" 1\r\n 5 4 here \"adv\" 1\r\n 6 5 wanna \"mod:aux\" 1\r\n 7 5 sit \"v\" 1\r\n 8 5 down \"adv\" 1\r\n 9 6 there \"adv\" 1\r\n 10 7 let's \"v\" 1\r\n # ℹ more rows\r\n\r\nOther sources for data\r\nIn writing this blog post, I’m indebted to all the knowledgeable folks on Mastodon who suggested their own recommended tools and workflows for various kinds of remote data. Unfortunately, I’m not familiar enough with most of them enough to do them justice, but I still wanted to record the suggestions I got from there for posterity.\r\nFirst, a post about reading remote files would not be complete without a mention of the wonderful {googlesheets4} package for reading from Google Sheets. I debated whether I should include a larger discussion of {googlesheets4}, and despite using it quite often myself I ultimately decided to omit it for the sake of space and because the package website is already very comprehensive. I would suggest starting from the Get Started vignette if you are new and interested.\r\nSecond, along the lines of {osfr}, there are other similar rOpensci packages for retrieving data from the kinds of data sources that may be of interest to academics, such as {deposits} for zenodo and figshare, and {piggyback} for GitHub release assets (Maëlle Salmon’s comment pointed me to the first two; I responded with some of my experiences). I was also reminded that {pins} exists - I’m not familiar with it myself so I thought I wouldn’t write anything for it here BUT Isabella Velásquez came in clutch sharing a recent talk on dynamically loading up-to-date data with {pins} which is a great demo of the unique strengths of {pins}.\r\nLastly, I inadvertently(?) started some discussion around remotely accessing spatial files. I don’t work with spatial data at all but I can totally imagine how the hassle of the traditional click-download-find-load workflow would be even more pronounced for spatial data which are presumably much larger in size and more difficult to preview. On this note, I’ll just link to Carl Boettiger’s comment about the fact that GDAL has a virtual file system that you can interface with from R packages wrapping this API (ex: {gdalraster}), and to Michael Sumner’s comment/gist + Chris Toney’s comment on the fact that you can even use this feature to stream non-spatial data!\r\nMiscellaneous tips and tricks\r\nI also have some random tricks that are more situational. Unfortunately, I can only recall like 20% of them at any given moment, so I’ll be updating this space as more come back to me:\r\nWhen reading remote .rda or .RData files with load(), you may need to wrap the link in url() first (ref: stackoverflow).\r\n{vroom} can remotely read gzipped files, without having to download.file() and unzip() first.\r\n{curl}, of course, will always have the most comprehensive set of low-level tools you need to read any arbitrary data remotely. For example, using curl::curl_fetch_memory() to read the dplyr::storms data again from the GitHub raw contents link:\r\n\r\n\r\nfetched <- curl::curl_fetch_memory(\r\n \"https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv\"\r\n)\r\nread.csv(text = rawToChar(fetched$content)) |> \r\n dplyr::glimpse()\r\n\r\n Rows: 87\r\n Columns: 14\r\n $ name \"Luke Skywalker\", \"C-3PO\", \"R2-D2\", \"Darth Vader\", \"Leia Or…\r\n $ height 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…\r\n $ mass 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…\r\n $ hair_color \"blond\", NA, NA, \"none\", \"brown\", \"brown, grey\", \"brown\", N…\r\n $ skin_color \"fair\", \"gold\", \"white, blue\", \"white\", \"light\", \"light\", \"…\r\n $ eye_color \"blue\", \"yellow\", \"red\", \"yellow\", \"brown\", \"blue\", \"blue\",…\r\n $ birth_year 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …\r\n $ sex \"male\", \"none\", \"none\", \"male\", \"female\", \"male\", \"female\",…\r\n $ gender \"masculine\", \"masculine\", \"masculine\", \"masculine\", \"femini…\r\n $ homeworld \"Tatooine\", \"Tatooine\", \"Naboo\", \"Tatooine\", \"Alderaan\", \"T…\r\n $ species \"Human\", \"Droid\", \"Droid\", \"Human\", \"Human\", \"Human\", \"Huma…\r\n $ films \"A New Hope, The Empire Strikes Back, Return of the Jedi, R…\r\n $ vehicles \"Snowspeeder, Imperial Speeder Bike\", \"\", \"\", \"\", \"Imperial…\r\n $ starships \"X-wing, Imperial shuttle\", \"\", \"\", \"TIE Advanced x1\", \"\", …\r\n\r\nEven if you’re going the route of downloading the file first, curl::multi_download() can offer big performance improvements over download.file().8 Many {curl} functions can also handle retries and stop/resumes which is cool too.\r\n{httr2} can capture a continuous data stream with httr2::req_perform_stream() up to a set time or size.\r\nsessionInfo()\r\n\r\n\r\nsessionInfo()\r\n\r\n R version 4.4.1 (2024-06-14 ucrt)\r\n Platform: x86_64-w64-mingw32/x64\r\n Running under: Windows 11 x64 (build 22631)\r\n \r\n Matrix products: default\r\n \r\n \r\n locale:\r\n [1] LC_COLLATE=English_United States.utf8 \r\n [2] LC_CTYPE=English_United States.utf8 \r\n [3] LC_MONETARY=English_United States.utf8\r\n [4] LC_NUMERIC=C \r\n [5] LC_TIME=English_United States.utf8 \r\n \r\n time zone: America/New_York\r\n tzcode source: internal\r\n \r\n attached base packages:\r\n [1] stats graphics grDevices utils datasets methods base \r\n \r\n other attached packages:\r\n [1] dplyr_1.1.4 duckdb_1.0.0 DBI_1.2.3 ggplot2_3.5.1.9000\r\n \r\n loaded via a namespace (and not attached):\r\n [1] rappdirs_0.3.3 sass_0.4.9 utf8_1.2.4 generics_0.1.3 \r\n [5] xml2_1.3.6 distill_1.6 digest_0.6.35 magrittr_2.0.3 \r\n [9] evaluate_0.24.0 grid_4.4.1 blob_1.2.4 fastmap_1.1.1 \r\n [13] jsonlite_1.8.8 processx_3.8.4 chromote_0.3.1 ps_1.7.5 \r\n [17] promises_1.3.0 httr_1.4.7 rvest_1.0.4 purrr_1.0.2 \r\n [21] fansi_1.0.6 scales_1.3.0 httr2_1.0.3.9000 jquerylib_0.1.4 \r\n [25] cli_3.6.2 rlang_1.1.4 dbplyr_2.5.0 gitcreds_0.1.2 \r\n [29] bit64_4.0.5 munsell_0.5.1 withr_3.0.1 cachem_1.0.8 \r\n [33] yaml_2.3.8 tools_4.4.1 tzdb_0.4.0 memoise_2.0.1 \r\n [37] colorspace_2.1-1 assertthat_0.2.1 curl_5.2.1 vctrs_0.6.5 \r\n [41] R6_2.5.1 lifecycle_1.0.4 emphatic_0.1.8 bit_4.0.5 \r\n [45] arrow_16.1.0 pkgconfig_2.0.3 pillar_1.9.0 bslib_0.7.0 \r\n [49] later_1.3.2 gtable_0.3.5 glue_1.7.0 gh_1.4.0 \r\n [53] Rcpp_1.0.12 xfun_0.47 tibble_3.2.1 tidyselect_1.2.1 \r\n [57] highr_0.11 rstudioapi_0.16.0 knitr_1.47 htmltools_0.5.8.1\r\n [61] websocket_1.4.1 rmarkdown_2.27 compiler_4.4.1 downlit_0.4.4\r\n\r\n\r\n\r\n\r\n\r\nThanks @tanho for pointing me to this at the R4DS/DSLC slack.↩︎\r\nNote that the API will actually generate a new token every time you send a request (and again, these tokens will expire with time).↩︎\r\nThe special value \"clipboard\" works for most base-R read functions that take a file or con argument.↩︎\r\nThanks @coolbutuseless for pointing me to textConnection()!↩︎\r\nOr READ_PARQUET - same thing.↩︎\r\nWe can also get this formatting with a combination of shQuote() and toString().↩︎\r\nWhereas CREATE TABLE results in a physical copy of the data in memory, CREATE VIEW will dynamically fetch the data from the source every time you query the table. If the data fits into memory (as in this case), I prefer CREATE as queries will be much faster (though you pay up-front for the time copying the data). If the data is larger than memory, CREATE VIEW will be your only option.↩︎\r\nSee an example implemented for {openalexR}, an API package.↩︎\r\n", "preview": "posts/2024-09-22-fetch-files-web/github-dplyr-starwars.jpg", - "last_modified": "2024-09-22T18:49:08-04:00", + "last_modified": "2024-09-22T15:49:08-07:00", "input_file": {} }, { @@ -34,7 +34,7 @@ ], "contents": "\r\n\r\nContents\r\nTake the argument name and negate it - is the intention clear?\r\nLook at the argument name - is it verb-y without an object?\r\nIs the argument a scalar adjective? Consider naming the scale.\r\nIs the argument truly binary? Still prefer enum and name the obvious/absence.\r\nMove shared strings across options into the argument name\r\n\r\nI’ve been having a blast reading through the Tidy design principles book lately - it’s packed with just the kind of stuff I needed to hear at this stage of my developer experience. And actually, I started writing packages in the post-{devtools}/R Packages era, so I wasn’t too surprised to find that my habits already align with many of the design principles advocated for in the book.1\r\nBut there was one pattern which took me a bit to fully wrap my head around (and be fully convinced by). It’s first introduced in the chapter “Enumerate possible options” which gives a pretty convincing example of the base R function rank(). rank() has a couple options for resolving ties between values which are exposed to the user via the ties.method argument. The default value of this argument is a vector that enumerates all the possible options, and the user’s choice of (or the lack of) an option is resolved through match.arg() and then the appropriate algorithm is called via a switch() statement.\r\nThis is all good and well, but the book takes it a step further in a later chapter “Prefer an enum, even if only two choices”, which outlines what I personally consider to be one of the more controversial (and newer2) strategies advocated for in the book. It’s a specific case of the “enumerate possible options” principle applied to boolean arguments, and is best understood with an example (of sort() vs. vctrs::vec_sort(), from the book):\r\n\r\n\r\n# Booolean options\r\nsort(x, decreasing = TRUE)\r\nsort(x, decreasing = FALSE)\r\n\r\n# Enumerated options\r\nvctrs::vec_sort(x, direction = \"desc\")\r\nvctrs::vec_sort(x, direction = \"asc\")\r\n\r\n\r\nThe main argument for this pattern is one of clarity. In the case of the example above, it is unclear from reading decreasing = FALSE whether that expresses “sort in the opposite of decreasing order (i.e., increasing/ascending)” or “do not sort in decreasing order (ex: leave it alone)”. The former is the correct interpretation, and this is expressed much clearer with direction = \"asc\", which contrasts with the other option direction = \"desc\".3\r\nI’ve never used this pattern for boolean options previously, but it’s been growing on me and I’m starting to get convinced. But in thinking through its implementation for refactoring code that I own and/or use, I got walled by the hardest problem in CS: naming things. A lot has been said on how to name things, but I’ve realized that the case of “turn booleans into enums” raises a whole different naming problem, one where you have to be precise about what’s being negated, the alternatives that are being contrasted, and the scale that the enums lie on.\r\nWhat follows are my somewhat half-baked, unstructured thoughts on some heuristics that I hope can be useful for determining when to apply the “enumerate possible options” principle for boolean options, and how to rename them in the refactoring.\r\nTake the argument name and negate it - is the intention clear?\r\nOne good litmus test for whether you should convert your boolean option into an enum is to take the argument name X and turn it into “X” and “not-X” - is the intended behavior expressed clearly in the context of the function? If, conceptually, the options are truly and unambiguously binary, then it should still make sense. But if the TRUE/FALSE options assume a very particular contrast which is difficult to recover from just reading “X” vs. “not-X”, consider using an enum for the two options.\r\nTo take sort() as an example again, imagine if we were to re-write it as:\r\n\r\n\r\nsort(option = \"decreasing\")\r\nsort(option = \"not-decreasing\")\r\n\r\n\r\nIf \"decreasing\" vs. \"not-decreasing\" is ambiguous, then maybe that’s a sign to consider ditching the boolean pattern and spell out the options more explicitly with e.g., direction = \"desc\" and direction = \"asc\", as vctrs::vec_sort() does. I also think this is a useful exercise because it reflects the user’s experience when encountering boolean options.\r\nLook at the argument name - is it verb-y without an object?\r\nLet’s take a bigger offender of this principle as an example: ggplot2::facet_grid(). facet_grid() is a function that I use all the time, and it has a couple boolean arguments which makes no immediate sense to me. Admittedly, I’ve never actually used them in practice, but from all my experience with {ggplot2} and facet_grid(), shouldn’t I be able to get at least some clues as to what they do from reading the arguments?4\r\n\r\n\r\nFilter(is.logical, formals(ggplot2::facet_grid))\r\n\r\n $shrink\r\n [1] TRUE\r\n \r\n $as.table\r\n [1] TRUE\r\n \r\n $drop\r\n [1] TRUE\r\n \r\n $margins\r\n [1] FALSE\r\n\r\nTake for example the shrink argument. Right off the bat it already runs into the problem where it’s not clear what we’re shrinking. I find this to be a general problem with boolean arguments: they’re often verbs with the object omitted (presumably to save keystrokes). Using the heuristic of negating the argument, we get “shrink” vs. “don’t shrink”, which not only repeats the problem of the ambiguity of negation as we saw with sort() previously, but also exposes how serious the problem of missing the object of the verb is.\r\nAt this point you may be wondering what exactly the shrink argument does at all. From the docs:\r\n\r\nIf TRUE, will shrink scales to fit output of statistics, not raw data. If FALSE, will be range of raw data before statistical summary.\r\n\r\nThe intended contrast seems to be one of “statistics” (default) vs. “raw data”, so these are obvious candidates for our enum refactoring. But something like shrink = c(\"statistics\", \"raw-data\") doesn’t quite cut it yet, because the object of shrinking is not the data, but the scales. So to be fully informative, the argument name should complete the verb phrase (i.e., include the object).\r\nCombining the observations from above, I think the following makes more sense:\r\n\r\n\r\n# Boolean options\r\nfacet_grid(shrink = TRUE)\r\nfacet_grid(shrink = FALSE)\r\n\r\n# Enumerated options\r\nfacet_grid(shrink_scales_to = \"statistics\")\r\nfacet_grid(shrink_scales_to = \"raw-data\")\r\n\r\n\r\nThis last point is a bit of a tangent, but after tinkering with the behavior of shrink more, I don’t think “shrink” is a particularly useful description here. I might actually prefer something more neutral like fit_scales_to.\r\nIs the argument a scalar adjective? Consider naming the scale.\r\nLoosely speaking, scalar (a.k.a. gradable) adjectives are adjectives that can be strengthened (or weakened) - English grammar can express this with the suffixes “-er” and “-est”. For example, “tall” is a scalar adjective because you can say “taller” and “tallest”, and scalar adjectives are called such because they lie on a scale (in this case, the scale of height). Note that the quality of an adjective as a scalar one is not so clear though, as you can “more X” or “most X” just about any adjective X (e.g., even true vs. false can lie on a scale of more true or more false) - what matters more is if saying something like “more X” makes sense in the context of where X is found (e.g., the context of the function).5 If so, you’re dealing with a scalar adjective.\r\nThis Linguistics 101 tangent is relevant here because I often see boolean arguments named after scalar adjectives, but I feel like in those cases it’s better to just name the scale itself (which in turn makes the switch to enum more natural).\r\nA contrived example would be if a function had a boolean argument called tall. To refactor this into an enum, we can rename the argument to the scale itself (height) and enumerate the two end points:\r\n\r\n\r\n# Boolean options\r\nfun(tall = TRUE)\r\nfun(tall = FALSE)\r\n\r\n# Enumerated options\r\nfun(height = \"tall\")\r\nfun(height = \"short\")\r\n\r\n\r\nA frequent offender of the enum principle in the wild is the verbose argument. verbose is an interesting case study because it suffers from the additional problem of there possibly being more than 2 options as the function matures. The book offers some strategies for remedying these kinds of problems after-the-fact, but I think a proactive solution is to name the argument verbosity (the name of the scale) with the possible options enumerated (see also a recent Mastodon thread that has great suggestions on this topic).\r\n\r\n\r\n# Boolean options\r\nfun(verbose = TRUE)\r\nfun(verbose = FALSE)\r\n\r\n# Enumerated options\r\nfun(verbosity = \"all\")\r\nfun(verbosity = \"none\")\r\n\r\n\r\nI like this strategy of “naming the scale” because it gives off the impression to users that the possible options are values that lie on the scale. In the example above, it could either be the extremes \"all\" or \"none\", but also possibly somewhere in between if the writer of the function chooses to introduce more granular settings later.\r\nIs the argument truly binary? Still prefer enum and name the obvious/absence.\r\nSometimes a boolean argument may encode a genuinely binary choice of a true/false, on/off, yes/no option. But refactoring the boolean options as enum may still offer some benefits. In those cases, I prefer the strategy of name the obvious/absence.\r\nSome cases for improvement are easier to spot than others. An easy case is something like the REML argument in lme4::lmer(). Without going into too much detail, when REML = TRUE (default), the model optimizes the REML (restricted/residualized maximum likelihood) criterion in finding the best fitting model. But it’s not like the model doesn’t use any criteria for goodness of fit when REML = FALSE. Instead, when REML = FALSE, the function uses a different criterion of ML (maximum likelihood). So the choice is not really between toggling REML on or off, but rather between the choice of REML vs. ML. The enum version lets us spell out the assumed default and make the choice between the two explicit (again, with room for introducing other criteria in the future):\r\n\r\n\r\n# Boolean options\r\nlmer::lme4(REML = TRUE)\r\nlmer::lme4(REML = FALSE)\r\n\r\n# Enumerated options\r\nlmer::lme4(criterion = \"REML\")\r\nlmer::lme4(criterion = \"ML\")\r\n\r\n\r\nA somewhat harder case is a true presence-or-absence kind of a situation, where setting the argument to true/false essentially boils down to triggering an if block inside the function. For example, say a function has an option to use an optimizer called “MyOptim”. This may be implemented as:\r\n\r\n\r\n# Boolean options\r\nfun(optimize = TRUE)\r\nfun(optimize = FALSE)\r\n\r\n\r\nEven if the absence of optimization is not nameable, you could just call that option something like \"none\" for the enum pattern, which makes the choices explicit:\r\n\r\n\r\n# Enumerated options\r\nfun(optimizer = \"MyOptim\")\r\nfun(optimizer = \"none\")\r\n\r\n\r\nOf course, the more difficult case is when the thing that’s being toggled isn’t really nameable. I think this is more often the case in practice, and may be the reason why there are many verb-y names for arguments with boolean options. Like, you wrote some code that optimizes something, but you have no name for it, so the argument that toggles it simply refers to its function, like “should the function optimize?”.\r\nBut not all is lost. I think one way out of this would be to enumerate over placeholders, not necessarily names. So something like:\r\n\r\n\r\n# Enumerated options (placeholders)\r\nfun(optimizer = 1) # bespoke optimizer\r\nfun(optimizer = 0) # none\r\n\r\n\r\nThen the documentation can clarify what the placeholder values 0, 1, etc. represent in longer, paragraph form, to describe what they do without the pressure of having to name the options.6 It’s not pretty, but I don’t think there will ever be a pretty solution to this problem if you want to avoid naming things entirely.\r\nMove shared strings across options into the argument name\r\nThis one is simple and easily demonstrated with an example. Consider the matrix() function for constructing a matrix. It has an argument byrow which fills the matrix by column when FALSE (default) or by row when TRUE. The argument controls the margin of fill, so we could re-write it as a fill argument like so:\r\n\r\n\r\n# Boolean options\r\nmatrix(byrow = FALSE)\r\nmatrix(byrow = TRUE)\r\n\r\n# Enumerated options\r\nmatrix(fill = \"bycolumn\")\r\nmatrix(fill = \"byrow\")\r\n\r\n\r\nThe options \"bycolumn\" and \"byrow\" share the “by” string, so we could move that into the argument name:\r\n\r\n\r\nmatrix(fill_by = \"column\")\r\nmatrix(fill_by = \"row\")\r\n\r\n\r\nAt this point I was also wondering whether the enumerated options should have the shortened \"col\" or the full \"column\" name. At the moment I’m less decided about this, but note that given the partial matching behavior in match.arg(), you could get away with matrix(fill_by = \"col\") in both cases.\r\nAt least from the book’s examples, it looks like shortening is ok for the options. To repeat the vctrs::vec_sort() example from earlier:\r\n\r\n\r\nvctrs::vec_sort(x, direction = \"desc\") # vs. \"descending\"\r\nvctrs::vec_sort(x, direction = \"asc\") # vs. \"ascending\"\r\n\r\n\r\nI was actually kind of surprised by this when I first saw it, and I have mixed feelings especially for \"asc\" since that’s not very frequent as a shorthand for “ascending” (e.g., {dplyr} has desc() but not a asc() equivalent - see also the previous section on “naming the obvious”). So I feel like I’d prefer for this to be spelled out in full in the function, and users can still loosely do partial matching in practice.7\r\n\r\nThe fun part of reading the book for me is not necessarily about discovering new patterns, but about being able to put a name to them and think more critically about their pros and cons.↩︎\r\nTo quote the book: “… this is a pattern that we only discovered relatively recently”↩︎\r\nThe book describes the awkwardness of decreasing = FALSE as “feels like a double negative”, but I think this is just a general, pervasive problem of pragmatic ambiguity with negation, and this issue of “what exactly is being negated?” is actually one of my research topics! Negation is interpreted with respect to the relevant and accessible alternatives (which “desc” vs. “asc” establishes very well) - in turn, recovering the intended meaning of the negation is difficult deprived of that context (like in the case of “direction = TRUE/FALSE”). See: Alternative Semantics.↩︎\r\nTo pre-empt the preference for short argument names, the fact that users don’t reach for these arguments in everyday use of facet_grid() should loosen that constraint for short, easy-to-type names. IMO the “too much to type” complaint since time immemorial is already obviated by auto-complete, and should frankly just be ignored for the designing these kinds of esoteric arguments that only experienced users would reach for in very specific circumstances.↩︎\r\nTry this from the view point of both the developer and the user!↩︎\r\nIMO, {collapse} does a very good job at this (see ?TRA).↩︎\r\nOf course, the degree to which you’d encourage this should depend on how sure you are about the stability of the current set of enumerated options.↩︎\r\n", "preview": "posts/2024-07-21-enumerate-possible-options/preview.jpg", - "last_modified": "2024-09-01T17:53:55-04:00", + "last_modified": "2024-09-01T14:53:55-07:00", "input_file": {} }, { @@ -53,7 +53,7 @@ ], "contents": "\r\n\r\nContents\r\nave()\r\nThe problem\r\nSome {tidyverse} solutions\r\nAn ave() + {dplyr} solution\r\nAside: {data.table} 🤝 {collapse}\r\nsessionInfo()\r\n\r\nI think it’s safe to say that the average {dplyr} user does not know the ave() function. For that audience, this is a short appreciation post on ave(), a case of tidyverse and base R.\r\nave()\r\nave() is a split-apply-combine function in base R (specifically, {stats}). It’s a pretty short function - maybe you can make out what it does from just reading the code1\r\n\r\n\r\nave\r\n\r\n function (x, ..., FUN = mean) \r\n {\r\n if (missing(...)) \r\n x[] <- FUN(x)\r\n else {\r\n g <- interaction(...)\r\n split(x, g) <- lapply(split(x, g), FUN)\r\n }\r\n x\r\n }\r\n \r\n \r\n\r\nDespite its (rather generic and uninformative) name, I like to think of ave() as actually belonging to the *apply() family of functions, having particularly close ties to tapply().\r\nA unique feature of ave() is the invariant that it returns a vector of the same length as the input. And if you use an aggregating function like sum() or mean(), it simply repeats those values over the observations on the basis of their grouping.\r\nFor example, whereas tapply() can be used to summarize the average mpg by cyl:\r\n\r\n\r\ntapply(mtcars$mpg, mtcars$cyl, FUN = mean)\r\n\r\n 4 6 8 \r\n 26.66364 19.74286 15.10000\r\n\r\nThe same syntax with ave() will repeat those values over each element of the input vector:\r\n\r\n\r\nave(mtcars$mpg, mtcars$cyl, FUN = mean)\r\n\r\n [1] 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364\r\n [9] 26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000\r\n [17] 15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000\r\n [25] 15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364\r\n\r\nYou can also get to this output from tapply() with an extra step of vectorized indexing:\r\n\r\n\r\ntapply(mtcars$mpg, mtcars$cyl, FUN = mean)[as.character(mtcars$cyl)]\r\n\r\n 6 6 4 6 8 6 8 4 \r\n 19.74286 19.74286 26.66364 19.74286 15.10000 19.74286 15.10000 26.66364 \r\n 4 6 6 8 8 8 8 8 \r\n 26.66364 19.74286 19.74286 15.10000 15.10000 15.10000 15.10000 15.10000 \r\n 8 4 4 4 4 8 8 8 \r\n 15.10000 26.66364 26.66364 26.66364 26.66364 15.10000 15.10000 15.10000 \r\n 8 4 4 4 8 6 8 4 \r\n 15.10000 26.66364 26.66364 26.66364 15.10000 19.74286 15.10000 26.66364\r\n\r\nThe problem\r\nNothing sparks more joy than when a base R function helps you write more “tidy” code. I’ve talked about this in length before with outer() in a prior blog post on dplyr::slice(), and here I want to show a cool ave() + dplyr::mutate() combo.\r\nThis example is adapted from a reprex by Cédric Scherer2 on the DSLC (previously R4DS) slack.\r\nGiven an input of multiple discrete columns and the frequencies of these values:\r\n\r\n\r\ninput <- data.frame(\r\n a = c(\"A\", \"A\", \"A\", \"B\"), \r\n b = c(\"X\", \"Y\", \"Y\", \"Z\"), \r\n c = c(\"M\", \"N\", \"O\", \"O\"), \r\n freq = c(5, 12, 3, 7)\r\n)\r\ninput\r\n\r\n a b c freq\r\n 1 A X M 5\r\n 2 A Y N 12\r\n 3 A Y O 3\r\n 4 B Z O 7\r\n\r\nThe task is to add new columns named freq_* that show the total frequency of the values in each column:\r\n\r\n\r\noutput <- data.frame(\r\n a = c(\"A\", \"A\", \"A\", \"B\"), \r\n freq_a = c(20, 20, 20, 7),\r\n b = c(\"X\", \"Y\", \"Y\", \"Z\"),\r\n freq_b = c(5, 15, 15, 7), \r\n c = c(\"M\", \"N\", \"O\", \"O\"), \r\n freq_c = c(5, 12, 10, 10), \r\n freq = c(5, 12, 3, 7)\r\n)\r\noutput\r\n\r\n a freq_a b freq_b c freq_c freq\r\n 1 A 20 X 5 M 5 5\r\n 2 A 20 Y 15 N 12 12\r\n 3 A 20 Y 15 O 10 3\r\n 4 B 7 Z 7 O 10 7\r\n\r\nSo for example, in column a the value \"A\" is associated with values 5, 12, and 3 in the freq column, so a new freq_a column should be created to track their total frequencies 5 + 12 + 3 and associate that value (20) for all occurrences of \"A\" in the a column.\r\nSome {tidyverse} solutions\r\nThe gut feeling is that this seems to lack a straightforwardly “tidy” solution. I mean, the input isn’t even tidy3 in the first place!\r\nSo maybe we’d be better off starting with a pivoted tidy data for constructing a tidy solution:\r\n\r\n\r\nlibrary(tidyverse)\r\ninput %>% \r\n pivot_longer(-freq)\r\n\r\n # A tibble: 12 × 3\r\n freq name value\r\n \r\n 1 5 a A \r\n 2 5 b X \r\n 3 5 c M \r\n 4 12 a A \r\n 5 12 b Y \r\n 6 12 c N \r\n 7 3 a A \r\n 8 3 b Y \r\n 9 3 c O \r\n 10 7 a B \r\n 11 7 b Z \r\n 12 7 c O\r\n\r\nBut recall that the desired output is of a wide form like the input, so it looks like our tidy solution will require some indirection, involving something like:\r\n\r\n\r\ninput %>% \r\n pivot_longer(-freq) %>% \r\n ... %>% \r\n pivot_wider(...)\r\n\r\n\r\nOr maybe you’d rather tackle this with some left_join()s, like:\r\n\r\n\r\ninput %>% \r\n left_join(summarize(input, freq_a = sum(freq), .by = a)) %>% \r\n ...\r\n\r\n\r\nI’ll note that there’s actually also an idiomatic {dplyr}-solution to this using the lesser-known function add_count(), but you can’t avoid the repetitiveness problem because it doesn’t vectorize over the first argument:\r\n\r\n\r\ninput %>% \r\n add_count(a, wt = freq, name = \"freq_a\") %>% \r\n add_count(b, wt = freq, name = \"freq_b\") %>% \r\n add_count(c, wt = freq, name = \"freq_c\")\r\n\r\n a b c freq freq_a freq_b freq_c\r\n 1 A X M 5 20 5 5\r\n 2 A Y N 12 20 15 12\r\n 3 A Y O 3 20 15 10\r\n 4 B Z O 7 7 7 10\r\n\r\nYou could try to scale this add_count() solution with reduce() (see my previous blog post on collapsing repetitive piping), but now we’re straying very far from the “tidy” territory:\r\n\r\n\r\ninput %>% \r\n purrr::reduce(\r\n c(\"a\", \"b\", \"c\"),\r\n ~ .x %>% \r\n add_count(.data[[.y]], wt = freq, name = paste0(\"freq_\", .y)),\r\n .init = .\r\n )\r\n\r\n a b c freq freq_a freq_b freq_c\r\n 1 A X M 5 20 5 5\r\n 2 A Y N 12 20 15 12\r\n 3 A Y O 3 20 15 10\r\n 4 B Z O 7 7 7 10\r\n\r\nIMO this problem is actually a really good thinking exercise for the “average {dplyr} user”, so I encourage you to take a stab at this yourself before proceeding if you’ve read this far!\r\nAn ave() + {dplyr} solution\r\nThe crucial piece of the puzzle here is to think a little outside the box, beyond “data(frame) wrangling”.\r\nIt helps to simplify the problem once we think about the problem in terms of “(column) vector wrangling” first, and that’s where ave() comes in!\r\nI’ll start with the cake first - this is the one-liner ave() solution I advocated for:\r\n\r\n\r\ninput %>% \r\n mutate(across(a:c, ~ ave(freq, .x, FUN = sum), .names = \"freq_{.col}\"))\r\n\r\n a b c freq freq_a freq_b freq_c\r\n 1 A X M 5 20 5 5\r\n 2 A Y N 12 20 15 12\r\n 3 A Y O 3 20 15 10\r\n 4 B Z O 7 7 7 10\r\n\r\nTaking column freq_a as an example, the ave() part of the solution essential creates this vector of summed-up freq values by the categories of a:\r\n\r\n\r\nave(input$freq, input$a, FUN = sum)\r\n\r\n [1] 20 20 20 7\r\n\r\nFrom there, across() handles the iteration over columns and, as an added bonus, the naming of the new columns in convenient {glue} syntax (\"freq_{.col}\").\r\nIt’s the perfect mashup of base R + tidyverse. Base R takes care of the problem at the vector level with a split-apply-combine that’s concisely expressed with ave(), and tidyverse scales that solution up to the dataframe level with mutate() and across().\r\ntidyverse 🤝 base R\r\nAside: {data.table} 🤝 {collapse}\r\nSince I wrote this blog post, I discovered that {data.table} recently added in support for using names(.SD) in the LHS of the walrus :=. I’m so excited for this to hit the next release (v1.16.0)!\r\nI’ve trying to be more mindful of showcasing {data.table} whenever I talk about {dplyr}, so here’s a solution to compare with the dplyr::across() solution above.\r\n\r\n\r\n\r\n\r\n\r\n# data.table::update_dev_pkg()\r\nlibrary(data.table)\r\ninput_dt <- as.data.table(input)\r\ninput_dt\r\n\r\n a b c freq\r\n \r\n 1: A X M 5\r\n 2: A Y N 12\r\n 3: A Y O 3\r\n 4: B Z O 7\r\n\r\n\r\n\r\ninput_dt[, paste0(\"freq_\", names(.SD)) := lapply(.SD, \\(x) ave(freq, x, FUN = sum)), .SDcols = a:c]\r\ninput_dt\r\n\r\n a b c freq freq_a freq_b freq_c\r\n \r\n 1: A X M 5 20 5 5\r\n 2: A Y N 12 20 15 12\r\n 3: A Y O 3 20 15 10\r\n 4: B Z O 7 7 7 10\r\n\r\nIn practice, I often pair {data.table} with {collapse}, where the latter provides a rich and performant set of split-apply-combine vector operations, to the likes of ave(). In {collapse}, ave(..., FUN = sum) can be expressed as fsum(..., TRA = \"replace\"):\r\n\r\n\r\nlibrary(collapse)\r\nave(input_dt$freq, input_dt$a, FUN = sum)\r\n\r\n [1] 20 20 20 7\r\n\r\nfsum(input_dt$freq, input_dt$a, TRA = \"replace\") # Also, TRA = 2\r\n\r\n [1] 20 20 20 7\r\n\r\nSo a version of the solution integrating fsum() would be:4\r\n\r\n\r\ninput_dt[, names(.SD) := NULL, .SDcols = patterns(\"^freq_\")]\r\ninput_dt[, paste0(\"freq_\", names(.SD)) := lapply(.SD, \\(x) fsum(freq, x, TRA = 2)), .SDcols = a:c]\r\ninput_dt\r\n\r\n a b c freq freq_a freq_b freq_c\r\n \r\n 1: A X M 5 20 5 5\r\n 2: A Y N 12 20 15 12\r\n 3: A Y O 3 20 15 10\r\n 4: B Z O 7 7 7 10\r\n\r\ndata.table 🤝 collapse\r\nsessionInfo()\r\n\r\n\r\nsessionInfo()\r\n\r\n R version 4.4.1 (2024-06-14 ucrt)\r\n Platform: x86_64-w64-mingw32/x64\r\n Running under: Windows 11 x64 (build 22631)\r\n \r\n Matrix products: default\r\n \r\n \r\n locale:\r\n [1] LC_COLLATE=English_United States.utf8 \r\n [2] LC_CTYPE=English_United States.utf8 \r\n [3] LC_MONETARY=English_United States.utf8\r\n [4] LC_NUMERIC=C \r\n [5] LC_TIME=English_United States.utf8 \r\n \r\n time zone: Asia/Seoul\r\n tzcode source: internal\r\n \r\n attached base packages:\r\n [1] stats graphics grDevices utils datasets methods base \r\n \r\n other attached packages:\r\n [1] collapse_2.0.14 data.table_1.15.99 lubridate_1.9.3 forcats_1.0.0 \r\n [5] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 \r\n [9] tidyr_1.3.1 tibble_3.2.1 tidyverse_2.0.0 ggplot2_3.5.1 \r\n \r\n loaded via a namespace (and not attached):\r\n [1] gtable_0.3.5 jsonlite_1.8.8 compiler_4.4.1 Rcpp_1.0.12 \r\n [5] tidyselect_1.2.1 parallel_4.4.1 jquerylib_0.1.4 scales_1.3.0 \r\n [9] yaml_2.3.8 fastmap_1.1.1 R6_2.5.1 generics_0.1.3 \r\n [13] knitr_1.47 distill_1.6 munsell_0.5.0 tzdb_0.4.0 \r\n [17] bslib_0.7.0 pillar_1.9.0 rlang_1.1.4 utf8_1.2.4 \r\n [21] stringi_1.8.4 cachem_1.0.8 xfun_0.44 sass_0.4.9 \r\n [25] timechange_0.2.0 memoise_2.0.1 cli_3.6.2 withr_3.0.0 \r\n [29] magrittr_2.0.3 digest_0.6.35 grid_4.4.1 rstudioapi_0.16.0\r\n [33] hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5 downlit_0.4.3 \r\n [37] evaluate_0.23 glue_1.7.0 fansi_1.0.6 colorspace_2.1-0 \r\n [41] rmarkdown_2.27 tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.8.1\r\n\r\n\r\nAnd check out the elusive split<- function!↩︎\r\nWho I can only assume was needing this for a fancy data viz thing 😆↩︎\r\nI mean that in the technical sense here. In this problem, the unit of observation is the “cells” of the input columns (the values “A”, “B”, “X”, “Y”, etc.).↩︎\r\nI couldn’t show this here with this particular example, but another nice feature of {collapse} 🤝 {data.table} is the fact that they do not shy away from consuming/producing matrices: see scale()[,1] vs. fscale() for a good example of this.↩︎\r\n", "preview": "posts/2024-06-09-ave-for-the-average/preview.png", - "last_modified": "2024-06-23T00:35:29-04:00", + "last_modified": "2024-06-22T21:35:29-07:00", "input_file": {}, "preview_width": 926, "preview_height": 328 @@ -75,7 +75,7 @@ ], "contents": "\r\n\r\nContents\r\nargs()\r\nargs(args)\r\nargs(args)(args)\r\nargs(args(args)(args))\r\nad infinitum\r\nHad enough args() yet?\r\nTL;DR: str()\r\nCoda (serious): redesigning args()\r\nTake 1) Display is the side-effect; output is trivial\r\nTake 2) Display is the side-effect; output is meaningful\r\nTake 3) Just remove the NULL\r\n\r\nsessionInfo()\r\n\r\nThe kind of blog posts that I have the most fun writing are those where I hyperfocus on a single function, like dplyr::slice(), purrr::reduce(), and ggplot2::stat_summary(). In writing blog posts of this kind, I naturally come across a point where I need to introduce the argument(s) that the function takes. I usually talk about them one at a time as needed, but I could start by front-loading that important piece of information first.\r\nIn fact, there’s a function in R that lets me do exactly that, called args().\r\nargs()\r\nargs() is, in theory, a very neat function. According to ?args:\r\n\r\nDisplays the argument names and corresponding default values of a (non-primitive or primitive) function.\r\n\r\nSo, for example, I know that sum() takes the arguments ... and na.rm (with the na.rm = FALSE default). The role of args() is to display exactly that piece of information using R code. This blog runs on rmarkdown, so surely I can use args() as a convenient and fancy way of showing information about a function’s arguments to my readers.\r\nIn this blog post, I want to talk about args(). So let’s start by looking at the argument that args() takes.\r\nOf course, I could just print args in the console:\r\n\r\n\r\nargs\r\n\r\n function (name) \r\n .Internal(args(name))\r\n \r\n \r\n\r\nBut wouldn’t it be fun if I used args() itself to get this information?\r\nargs(args)\r\n\r\n\r\nargs(args)\r\n\r\n function (name) \r\n NULL\r\n\r\nOkay, so I get the function (name) piece, which is the information I wanted to show. We can see that args() takes one argument, called name, with no defaults.\r\nBut wait - what’s that NULL doing there in the second line?\r\nHmm, I wonder if they forgot to invisible()-y return the NULL. args() is a function for displaying a function’s arguments after all, so maybe the arguments are printed to the console as a side-effect and the actual output of args() is NULL.\r\nIf that is true, we should be able to suppress the printing of NULL with invisible():\r\n\r\n\r\ninvisible(args(args))\r\n\r\n\r\nUh oh, now everything is invisible.\r\nAlright, enough games! What exactly are you, output of args()?!\r\n\r\n\r\ntypeof(args(args))\r\n\r\n [1] \"closure\"\r\n\r\nWhat?\r\nargs(args)(args)\r\nTurns out that args(args) is actually returning a whole function that’s a copy of args(), except with its body replaced with NULL.\r\nSo args(args) is itself a function that takes an argument called name and then returns NULL. Let’s assign it to a variable and call it like a function:\r\n\r\n\r\nabomination <- args(args)\r\n\r\n\r\n\r\n\r\nabomination(123)\r\n\r\n NULL\r\n\r\nabomination(mtcars)\r\n\r\n NULL\r\n\r\nabomination(stop())\r\n\r\n NULL\r\n\r\nThe body is just NULL, so the function doesn’t care what it receives1 - it just returns NULL.\r\nIn fact, we could even pass it… args:\r\n\r\n\r\nargs(args)(args)\r\n\r\n NULL\r\n\r\nargs(args(args)(args))\r\nBut wait, that’s not all! args() doesn’t just accept a function as its argument. From the documentation:\r\n\r\nValue\r\nNULL in case of a non-function.\r\n\r\nSo yeah - if args() receives a non-function, it just returns NULL:\r\n\r\n\r\nargs(123)\r\n\r\n NULL\r\n\r\nargs(mtcars)\r\n\r\n NULL\r\n\r\nThis applies to any non-function, including… NULL:\r\n\r\n\r\nargs(NULL)\r\n\r\n NULL\r\n\r\nAnd recall that:\r\n\r\n\r\nis.null( args(args)(args) )\r\n\r\n [1] TRUE\r\n\r\nTherefore, this is a valid expression in base R:\r\n\r\n\r\nargs(args(args)(args))\r\n\r\n NULL\r\n\r\nad infinitum\r\nFor our cursed usecase of using args(f) to return a copy of f with it’s body replaced with NULL only to then immediately call args(f)(f) to return NULL, it really doesn’t matter what the identity of f is as long as it’s a function.\r\nThat function can even be … args(args)!\r\nSo let’s take our args(args(args)(args)):\r\n\r\n\r\nargs( args( args )( args ))\r\n\r\n NULL\r\n\r\nAnd swap every args() with args(args):\r\n\r\n\r\nargs(args)( args(args)( args(args) )( args(args) ))\r\n\r\n NULL\r\n\r\nOr better yet, swap every args() with args(args(args)):\r\n\r\n\r\nargs(args(args))( args(args(args))( args(args(args)) )( args(args(args)) ))\r\n\r\n NULL\r\n\r\nThe above unhinged examples are a product of two patterns:\r\nThe fact that you always get function (name) NULL from wrapping args()s over args:\r\n\r\n\r\nlist(\r\n args( args),\r\n args( args(args)),\r\n args(args(args(args)))\r\n )\r\n\r\n [[1]]\r\n function (name) \r\n NULL\r\n\r\n [[2]]\r\n function (name) \r\n NULL\r\n\r\n [[3]]\r\n function (name) \r\n NULL\r\n\r\nThe fact that you can get this whole thing to return NULL by having function (name) NULL call the function object args. You can do this anywhere in the stack and the NULL will simply propagate:\r\n\r\n\r\nlist(\r\n args(args(args(args))) (args) ,\r\n args(args(args(args)) (args) ) ,\r\n args(args(args(args) (args) ))\r\n )\r\n\r\n [[1]]\r\n NULL\r\n\r\n [[2]]\r\n NULL\r\n\r\n [[3]]\r\n NULL\r\n\r\nWe could keep going but it’s tiring to type out and read all these nested args()… but did you know that there’s this thing called the pipe %>% that’s the solution to all code readability issues?\r\nHad enough args() yet?\r\nLet’s make an args() factory ARGS() …\r\n\r\n\r\nlibrary(magrittr)\r\nARGS <- function(n) {\r\n Reduce(\r\n f = \\(x,y) bquote(.(x) %>% args()),\r\n x = seq_len(n),\r\n init = quote(args)\r\n )\r\n}\r\n\r\n\r\n… to produce a sequence of args() …\r\n\r\n\r\nARGS(10)\r\n\r\n args %>% args() %>% args() %>% args() %>% args() %>% args() %>% \r\n args() %>% args() %>% args() %>% args() %>% args()\r\n\r\neval(ARGS(10))\r\n\r\n function (name) \r\n NULL\r\n\r\n… and tidy it up!\r\n\r\n\r\nARGS(10) %>% \r\n deparse1() %>% \r\n styler::style_text()\r\n\r\n args %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args() %>%\r\n args()\r\n\r\nWanna see even more unhinged?\r\nLet’s try to produce a “matrix” of args(). You get a choice of i “rows” of piped lines, and j “columns” of args()-around-args each time - all to produce a NULL.\r\nReady?\r\n\r\n\r\nARGS2 <- function(i, j) {\r\n Reduce(\r\n f = \\(x,y) bquote(.(x) %>% (.(y))),\r\n x = rep(list(Reduce(\\(x,y) call(\"args\", x), seq_len(j), quote(args))), i)\r\n )\r\n}\r\n\r\n\r\n\r\n\r\nARGS2(5, 1) %>% \r\n deparse1() %>%\r\n styler::style_text()\r\n\r\n args(args) %>%\r\n (args(args)) %>%\r\n (args(args)) %>%\r\n (args(args)) %>%\r\n (args(args))\r\n\r\n\r\n\r\nARGS2(5, 3) %>% \r\n deparse1() %>%\r\n styler::style_text()\r\n\r\n args(args(args(args))) %>%\r\n (args(args(args(args)))) %>%\r\n (args(args(args(args)))) %>%\r\n (args(args(args(args)))) %>%\r\n (args(args(args(args))))\r\n\r\n\r\n\r\nARGS2(10, 5) %>% \r\n deparse1() %>%\r\n styler::style_text()\r\n\r\n args(args(args(args(args(args))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args)))))) %>%\r\n (args(args(args(args(args(args))))))\r\n\r\n\r\n\r\nlist(\r\n eval(ARGS2(5, 1)),\r\n eval(ARGS2(5, 3)),\r\n eval(ARGS2(10, 5))\r\n)\r\n\r\n [[1]]\r\n NULL\r\n \r\n [[2]]\r\n NULL\r\n \r\n [[3]]\r\n NULL\r\n\r\nYay!\r\nTL;DR: str()\r\nIf you want a version of args() that does what it’s supposed to, use str() instead:2\r\n\r\n\r\nstr(args)\r\n\r\n function (name)\r\n\r\nstr(sum)\r\n\r\n function (..., na.rm = FALSE)\r\n\r\nargs() is hereafter banned from my blog.\r\nCoda (serious): redesigning args()\r\nThe context for my absurd rant above is that I was just complaining about how I think args() is a rather poorly designed function.\r\nLet’s try to redesign args(). I’ll do three takes:\r\nTake 1) Display is the side-effect; output is trivial\r\nIf the whole point of args() is to display a function’s arguments for inspection in interactive usage, then that can simply be done as a side-effect.\r\nAs I said above, str() surprisingly has this more sensible behavior out of the box. So let’s write our first redesign of args() which just calls str():\r\n\r\n\r\nargs1 <- function(name) {\r\n str(name)\r\n}\r\nargs1(sum)\r\n\r\n function (..., na.rm = FALSE)\r\n\r\nIn args1()/str(), information about the function arguments are sent to the console.3 We know this because we can’t suppress this with invisible but we can grab this via capture.output:\r\n\r\n\r\ninvisible( args1(sum) )\r\n\r\n function (..., na.rm = FALSE)\r\n\r\ncapture.output( args1(sum) )\r\n\r\n [1] \"function (..., na.rm = FALSE) \"\r\n\r\nFor functions whose purpose is to signal information to the console (and whose usage is limited to interactive contexts), we don’t particularly care about the output. In fact, because the focus isn’t on the output, the return value should be as trivial as possible.\r\nA recommended option is to just invisibly return NULL. This is now how args1() does it (via str()).4:\r\n\r\n\r\nprint( args1(sum) )\r\n\r\n function (..., na.rm = FALSE) \r\n NULL\r\n\r\nis.null( args1(sum) )\r\n\r\n function (..., na.rm = FALSE)\r\n [1] TRUE\r\n\r\nAlternatively, the function could just invisibly return what it receives,5 which is another common pattern for cases like this. Again, we return invisibly to avoid distracting from the fact that the point of the function is to display as the side-effect.\r\n\r\n\r\nargs2 <- function(name) {\r\n str(sum)\r\n invisible(name)\r\n}\r\n\r\n\r\n\r\n\r\nargs2(rnorm)\r\n\r\n function (..., na.rm = FALSE)\r\n\r\n\r\n\r\nargs2(rnorm)(5)\r\n\r\n function (..., na.rm = FALSE)\r\n [1] -0.5494891 1.2861975 -1.2755454 1.0817387 -0.7248563\r\n\r\nTake 2) Display is the side-effect; output is meaningful\r\nOne thing I neglected to mention in this blog post is that there are other ways to extract a function’s arguments. One of them is formals():6\r\n\r\n\r\nformals(args)\r\n\r\n $name\r\n\r\nformals(rnorm)\r\n\r\n $n\r\n \r\n \r\n $mean\r\n [1] 0\r\n \r\n $sd\r\n [1] 1\r\n\r\nformals() returns the information about a function’s arguments in a list which is pretty boring, but it’s an object we can manipulate (unlike the return value of str()). So there’s some pros and cons.\r\nActually, we could just combine both formals() and str():\r\n\r\n\r\nargs3 <- function(name) {\r\n str(name)\r\n invisible(formals(name))\r\n}\r\n\r\n\r\n\r\n\r\narguments <- args3(rnorm)\r\n\r\n function (n, mean = 0, sd = 1)\r\n\r\narguments\r\n\r\n $n\r\n \r\n \r\n $mean\r\n [1] 0\r\n \r\n $sd\r\n [1] 1\r\n\r\narguments$mean\r\n\r\n [1] 0\r\n\r\nYou get the nice display as a side-effect (via str()) and then an informative output (via formals()). You could even turn this into a class with a print method, which is definitely the better way to go about this, but I’m running out of steam here and I don’t like OOP, so I won’t touch that here.\r\nTake 3) Just remove the NULL\r\nThis last redesign is the simplest of the three, and narrowly deals with the problem of that pesky NULL shown alongside the function arguments:\r\n\r\n\r\nargs(sum)\r\n\r\n function (..., na.rm = FALSE) \r\n NULL\r\n\r\nFine, I’ll give them that args() must, for compatibility with S whatever reason, return a whole new function object, which in turn requires a function body. But if that function is just as a placeholder and not meant to be called, can’t you just make the function body, like, empty?\r\n\r\n\r\nargs4 <- function(name) {\r\n f <- args(name)\r\n body(f) <- quote(expr=)\r\n f\r\n}\r\nargs4(sum)\r\n\r\n function (..., na.rm = FALSE)\r\n\r\nargs4(rnorm)\r\n\r\n function (n, mean = 0, sd = 1)\r\n\r\ntypeof( args4(rnorm) )\r\n\r\n [1] \"closure\"\r\n\r\nLike, come on!\r\nsessionInfo()\r\n\r\n\r\nsessionInfo()\r\n\r\n R version 4.3.3 (2024-02-29 ucrt)\r\n Platform: x86_64-w64-mingw32/x64 (64-bit)\r\n Running under: Windows 11 x64 (build 22631)\r\n \r\n Matrix products: default\r\n \r\n \r\n locale:\r\n [1] LC_COLLATE=English_United States.utf8 \r\n [2] LC_CTYPE=English_United States.utf8 \r\n [3] LC_MONETARY=English_United States.utf8\r\n [4] LC_NUMERIC=C \r\n [5] LC_TIME=English_United States.utf8 \r\n \r\n time zone: America/New_York\r\n tzcode source: internal\r\n \r\n attached base packages:\r\n [1] stats graphics grDevices utils datasets methods base \r\n \r\n other attached packages:\r\n [1] magrittr_2.0.3\r\n \r\n loaded via a namespace (and not attached):\r\n [1] crayon_1.5.2 vctrs_0.6.5 cli_3.6.1 knitr_1.45 \r\n [5] rlang_1.1.2 xfun_0.41 purrr_1.0.2 styler_1.10.2 \r\n [9] jsonlite_1.8.8 htmltools_0.5.7 sass_0.4.7 fansi_1.0.5 \r\n [13] rmarkdown_2.25 R.cache_0.16.0 evaluate_0.23 jquerylib_0.1.4 \r\n [17] distill_1.6 fastmap_1.1.1 yaml_2.3.7 lifecycle_1.0.4 \r\n [21] memoise_2.0.1 compiler_4.3.3 prettycode_1.1.0 downlit_0.4.3 \r\n [25] rstudioapi_0.15.0 R.oo_1.25.0 R.utils_2.12.3 digest_0.6.33 \r\n [29] R6_2.5.1 R.methodsS3_1.8.2 bslib_0.6.1 tools_4.3.3 \r\n [33] withr_3.0.0 cachem_1.0.8\r\n\r\n\r\nYou can even see lazy evaluation in action when it receives stop() without erroring.↩︎\r\nThough you have to remove the \"srcref\" attribute if the function has one. But also don’t actually do this!↩︎\r\nTechnically, the \"output\" stream.↩︎\r\nFor the longest time, I thought args() was doing this from how its output looked.↩︎\r\nEssentially acting like identity().↩︎\r\nBut note that it has a special behavior of returning NULL for primitive functions (written in C) that clearly have user-facing arguments on the R side. See also formalArgs(), for a shortcut to names(formals())↩︎\r\n", "preview": "posts/2024-03-04-args-args-args-args/preview.png", - "last_modified": "2024-03-05T17:04:54-05:00", + "last_modified": "2024-03-05T14:04:54-08:00", "input_file": {}, "preview_width": 419, "preview_height": 300 @@ -96,7 +96,7 @@ ], "contents": "\r\n\r\nContents\r\nHello World\r\nQuirks in R syntax\r\nFlipping\r\nRegistering arguments\r\nUnflipping\r\nCurrying\r\nFin.\r\n\r\nHello World\r\nGetting a program to print “Hello World” is one of the earliest things people are taught to do when picking up a new programming language. This universal experience among programmers has also turned it into a running joke about the complexity of programming languages.\r\nFor example, whereas in R we can express what we want transparently in the following:\r\n\r\n\r\nprint(\"Hello World\")\r\n\r\n\r\nThis simple task can get absurdly complex in other languages; perhaps most notoriously, Java:\r\n\r\nclass HelloWorld {\r\n public static void main(String[] args) {\r\n System.out.println(\"Hello, World!\"); \r\n }\r\n}\r\n\r\nThis joke around “Hello World” has also evolved into other forms. Every once in a while I come across a variant of the joke in the style of something like:\r\n\r\n\r\nHelloWorld(\"print\")\r\n\r\n [1] \"HelloWorld\"\r\n\r\nThis is funny because it seemingly swaps the role of the argument and the function in an expression. It’s also a good educational example because it demonstrates the arbitrariness of signs as a universal design principle of programming (and human!) languages.1 Crucially, you should be able to produce this behavior in any reasonable programming language - the ability to do this is a feature, not a bug.\r\nThe most trivial implementation of the above is to define HelloWorld() as a function that’s been hardcoded to simply print “Hello World”:\r\n\r\n\r\nHelloWorld <- function(x) print(\"HelloWorld\")\r\n\r\n\r\nBut here, too, languages show differences. Not so much in their ability to implement this specific solution, but in their ability to formulate a generalizable solution in an idiomatic way, using tools and concepts that are native to the language.\r\nWhen it comes to R, it turns out that R has certain quirks which can give us a surprisingly principled and lean solution to the problem. So that’s what this blog post will be about.\r\nQuirks in R syntax\r\nIn R, functions are distinguished from non-functions in part by their role as a caller. This role is defined by its syntactic position in an expression: it always occupies the first position [[1]] of a object.2\r\n\r\n\r\n\r\n\r\nplus_expr <- quote(1 + 2)\r\nplus_expr\r\n\r\n 1 + 2\r\n\r\nplus_expr[[1]]\r\n\r\n `+`\r\n\r\n\r\n\r\n\r\n\r\nsum_expr <- quote(sum(1, 2))\r\nsum_expr\r\n\r\n sum(1, 2)\r\n\r\nsum_expr[[1]]\r\n\r\n sum\r\n\r\n\r\n\r\nWhen R sees a variable in an expression and needs to resolve its value, it firstly determines whether the value must be a function, by virtue of its position in the expression. Here, R eagerly commits to the assumption that whatever appears in the caller position must be a function.\r\nThis gives rise to a somewhat surprising behavior. In evaluating the expression f(1, 2) inside a local scope below, R smartly skips the immediately-adjacent, local value of f (a numeric constant) to scope the global value of f() (alias of the function sum()) that’s “further away”.\r\n\r\n\r\nf <- sum\r\nlocal({\r\n f <- 0\r\n f(1, 2)\r\n})\r\n\r\n [1] 3\r\n\r\nSo the point here is that, R knows to only scope values of f that are functions, because it found f in the caller position of the expression:\r\n\r\n\r\nf_expr <- quote(f(1, 2))\r\nf_expr[[1]]\r\n\r\n f\r\n\r\nThis in and of itself is interesting, but I want to return to my characterization of R as “eagerly committing” to this. Consider the fact that the above example works even if you swapped f with the string \"f\" in the expression:\r\n\r\n\r\nf <- sum\r\nlocal({\r\n f <- 0\r\n \"f\"(1, 2)\r\n})\r\n\r\n [1] 3\r\n\r\nBecause R eagerly commits to the invariant that the first position is reserved for functions, it repairs \"f\"() to f() at the level of the parser, before the evaluation engine even sees the expression.\r\n\r\n\r\nf_expr2 <- quote(\"f\"(1, 2))\r\nf_expr2\r\n\r\n f(1, 2)\r\n\r\nf_expr2[[1]]\r\n\r\n f\r\n\r\nAll of this to say that the following syntax that looks even more flipped is also valid in R:\r\n\r\n\r\nHelloWorld <- function(x) print(\"HelloWorld\")\r\n\"HelloWorld\"(print)\r\n\r\n [1] \"HelloWorld\"\r\n\r\nThis is trivially true about R’s sytax and its parser but funny nonetheless, so this deserves a mention first. Now lets talk about the implementation side of things - how well does R fair in letting us express something like “arg(f) should evaluate to f(arg)”?\r\nFlipping\r\nI’ll get right to the chase - the following definition for HelloWorld() gives us the ability to pass in a function that is then called with \"HelloWorld\" as the argument.\r\n\r\n\r\nHelloWorld <- function(x) {\r\n fun <- match.fun(x)\r\n arg <- deparse(sys.call()[[1]])\r\n fun(arg)\r\n}\r\nHelloWorld(\"print\")\r\n\r\n [1] \"HelloWorld\"\r\n\r\nHelloWorld(toupper)\r\n\r\n [1] \"HELLOWORLD\"\r\n\r\nThere are two pieces to this solution.\r\nFirst is match.fun(), which allows HelloWorld() to receive the name of a function as a string and match the function with that name. This is kind of like what we talked about in the previous section with \"f\"(), but it’s a more explicit, less auto-magic way of handling functions specified as a string:\r\n\r\n\r\nidentical(match.fun(\"print\"), print)\r\n\r\n [1] TRUE\r\n\r\nA nice convenience feature is that when match.fun() receives a function, it simply passes it through. That also gives us this equality:\r\n\r\n\r\nidentical(match.fun(print), print)\r\n\r\n [1] TRUE\r\n\r\nIn sum, match.fun() gives us a choice in whether HelloWorld() receives its argument as a string vs. symbol. Combined with our observation from the previous section, this gives us a full 2-by-2 variation in whether the function or the argument is a string (vs. a symbol):\r\n\r\n\r\nHelloWorld(print)\r\nHelloWorld(\"print\")\r\n\"HelloWorld\"(print)\r\n\"HelloWorld\"(\"print\")\r\n\r\n\r\nThe second piece of the solution is sys.call(), which returns the expression that called the function where sys.call() is called from. It’s hard to explain in words but actually pretty intuitive once you see some examples:\r\n\r\n\r\nf <- function(...) {\r\n sys.call()\r\n}\r\nf()\r\n\r\n f()\r\n\r\nf(arg = val)\r\n\r\n f(arg = val)\r\n\r\nf(pi)\r\n\r\n f(pi)\r\n\r\nAnd that’s it! When sys.call() is called from f(), it captures the expression that makes up f(...). So in the case of HelloWorld(\"print\"), the call to sys.call() evaluates to the following language object:\r\n\r\n HelloWorld(\"print\")\r\n\r\n… which is essentially a list of length-2:\r\n\r\n [[1]]\r\n HelloWorld\r\n \r\n [[2]]\r\n [1] \"print\"\r\n\r\nSo the code deparse(sys.call()[[1]]) grabs the symbol HelloWorld and deparse()s it into a string, resulting in \"HelloWorld\". And as I mentioned before, we grab the string \"print\" and pass it to match.fun() to get back the print() function.\r\nOnce we have these two pieces, the line fun(arg) evaluates to the un-flipped version print(\"HelloWorld\").\r\nAnd of course, as far as the argument is concerned, HelloWorld() takes any function that can operate on the string \"HelloWorld\":\r\n\r\n\r\ncaps_split <- function(x) {\r\n strsplit(x, \"(? print()), the next best tool for this job is probably currying.\r\nHere’s the simplest attempt at that:6\r\n\r\n\r\ncurry <- function(arg) {\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n}\r\ncurry(\"HelloWorld\")(print)\r\n\r\n [1] \"HelloWorld\"\r\n\r\nEssentially, curry(\"HelloWorld\") is returning a function that takes a function and calls that function with \"HelloWorld\" as its argument. Although, unfortunately, that’s not so obvious from the function definition which just looks generic:\r\n\r\n\r\ncurry(\"HelloWorld\")\r\n\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n \r\n \r\n\r\nFor us to see \"HelloWorld\" in the function body for curry(\"HelloWorld\"), we would need to in-line the value of arg when the curried function is defined.7 Let’s take this up in steps.\r\nFirst, we can use substitute() (or bquote()) to create an expression where the value of arg is in-lined. Both methods produce the contextualized function definition we want.\r\n\r\n\r\ncurry2 <- function(arg) {\r\n list(\r\n substitute = substitute(\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n ),\r\n bquote = bquote(\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(.(arg))\r\n }\r\n )\r\n )\r\n}\r\ncurry2(\"HelloWorld\")\r\n\r\n $substitute\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(\"HelloWorld\")\r\n }\r\n \r\n $bquote\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(\"HelloWorld\")\r\n }\r\n\r\nLet’s stick with substitute() and move on. Now that we have an expression of the function definition, we can eval()-uate it to get an actual function object back.\r\n\r\n\r\ncurry2 <- function(arg) {\r\n eval(substitute(\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n ))\r\n}\r\ncurry2(\"HelloWorld\")\r\n\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n \r\n\r\nWait… \"HelloWorld\" just turned back into arg! Turns out that functions in R have a “memory” of how they were defined. It’s stored in the srcref attribute of functions, and this is the function definition that gets shown when we print functions.\r\n\r\n\r\nHelloWorld <- curry2(\"HelloWorld\")\r\nattr(HelloWorld, \"srcref\")\r\n\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n\r\nAnd actually, if we just strip this attribute away, we can see our work of in-lining arg:\r\n\r\n\r\nattr(HelloWorld, \"srcref\") <- NULL\r\nHelloWorld\r\n\r\n function (fun) \r\n {\r\n fun <- match.fun(fun)\r\n fun(\"HelloWorld\")\r\n }\r\n \r\n\r\nWe can now go back to the currying function and implement this solution there:\r\n\r\n\r\ncurry3 <- function(arg) {\r\n inlined <- eval(substitute(\r\n function(fun) {\r\n fun <- match.fun(fun)\r\n fun(arg)\r\n }\r\n ))\r\n attr(inlined, \"srcref\") <- NULL\r\n inlined\r\n}\r\ncurry3(\"HelloWorld\")\r\n\r\n function (fun) \r\n {\r\n fun <- match.fun(fun)\r\n fun(\"HelloWorld\")\r\n }\r\n \r\n\r\nTo avoid all this mess, you could also inline arg first, and then piece together the function from scratch:\r\n\r\n\r\ncurry4 <- function(arg) {\r\n inlined_body <- rlang::expr({\r\n fun <- match.fun(fun)\r\n fun(!!arg)\r\n })\r\n rlang::new_function(\r\n args = rlang::pairlist2(fun=),\r\n body = inlined_body\r\n )\r\n}\r\ncurry4(\"HelloWorld\")\r\n\r\n function (fun) \r\n {\r\n fun <- match.fun(fun)\r\n fun(\"HelloWorld\")\r\n }\r\n \r\n\r\nFin.\r\n\r\n\r\nregister(`But don't do this in practice!`)\r\nget(ls()[order(-nchar(ls()))][1])(\"print\")\r\n\r\n [1] \"But don't do this in practice!\"\r\n\r\n\r\nA topic close to my heart as a linguist. This is one of the first things we teach in intro to linguistics.↩︎\r\nWhere objects of class are essentially a (nested) list of symbols and constants.↩︎\r\nThe notorious <<- is evidence of this.↩︎\r\n\r\nThis is starting to look something like a very butchered form of string interning…↩︎\r\nYou might protest that as.character(substitute()) is bad practice which is true but it’s idiomatic in the sense that it’s the first line of the function definition of require().↩︎\r\nA version with stricter safeguards would probably use force() among other things (see Adv R).↩︎\r\nThis in-lining also resolves the need for force().↩︎\r\n", "preview": "posts/2024-02-20-helloworld-print/preview.png", - "last_modified": "2024-02-20T15:25:44-05:00", + "last_modified": "2024-02-20T12:25:44-08:00", "input_file": {}, "preview_width": 462, "preview_height": 184 @@ -117,7 +117,7 @@ ], "contents": "\r\n\r\nContents\r\nIntro\r\nResearch\r\nBlogging\r\nR stuff\r\nPersonal\r\n\r\n\r\n\r\n\r\nFigure 1: New year’s eve celebration fireworks at Long Beach, CA.\r\n\r\n\r\n\r\nIntro\r\nI’ve been seeing a couple folks on Mastodon sharing their “year in review” blog posts, and I thought that was really cool, so I decided to write my own too! I’m mostly documenting for myself but hopefully this also serves as an update of a sort for my friends over the internet since I’ve been pretty silent online this year.\r\nResearch\r\nBeing the Good Grad Student™ I am, I’m forefronting my academia happenings first. In numbers, I published one paper, gave two talks, and presented three posters. I’m not super proud of those numbers: I think they’re a lot less than what people might expect from a 4th year PhD student. But a lot of effort went into each1 and 2023 overall has been a great year for refining and narrowing down on my dissertation topic.2 I did a ton of readings and I hope it pays off for next year when I actually get started on writing the thing.\r\nI already document my research happenings elsewhere and I know that the primarily audience of my blog isn’t linguists, so I won’t expand on that more here.\r\nBlogging\r\n2023 was the year when it became painfully obvious to me that I don’t have much in terms of a portfolio in the sense of the buzzword-y “data science portfolio” that industry recruiters purportedly look for. This ironically coincided with another realization I had, which is that I’m increasingly becoming “the department tech/stats guy” where I take on many small tasks and favors from faculty and other students here and there; I truly do enjoy doing this work, but it’s completely invisible to my CV/resume. I’m still navigating this weird position I’m in, but I’ve found some nice tips3 and at least I still have another year until I’m on the job market to fully figure this out.\r\nThe reason why I put the above rant under the “Blogging” section is because my blog is the closest thing I have a portfolio - there’s not much here, but it’s a public-facing space I own where I get to show people what I know and how I think. So in 2023 I was more conscious about what I blog about and how. The change was subtle - my blog persona is still my usual self, but I’ve tried to diversify the style of my blogs. Whereas I mostly wrote long-form, tutorial-style blog posts in the past, I only wrote one such post (on dplyr::slice()) this year. My other blog posts were one reflecting on how to better answer other people’s questions, and another where I nerd out on the internals of {tidyselect} with little regard for its practicality.4.\r\nAll in all, I wrote three blog posts this year (not including this one). This is the usual rate of publishing blog posts for me, but I hope to write more frequently next year (and write shorter posts overall, and in less formal tone).\r\nR stuff\r\nI didn’t think I’d have much to say about the R stuff I did this year until I sat down to write this blog. Even though this year was the busiest I’ve ever been with research, it turns out that I still ended up doing quite a bit of R stuff in my free time. I’ll cover this chronologically.\r\n\r\nAt the beginning of the year, I was really lucky to receive the student paper award from the Statistical Computing and Graphics section of the ASA, writing about {ggtrace}.5 In the paper, I focused on {ggtrace} as a pedagogical tool for aspiring {ggplot2} extension developers. In the process, I rediscovered the power of reframing ggplot internals as data wrangling and went back to {ggtrace} to add a couple convenience functions for interactive use-cases. After over two years since its inception, {ggtrace} now feels pretty complete in terms of its core features (but suggestions and requests are always welcome!).\r\n\r\nIn Spring, I began writing {jlmerclusterperm}, a statistical package implementing the cluster-based permutation test for time series data, using mixed-effects models. This was a new challenge for me for two reasons. First, I wrote much of the package in Julia - this was my first time writing Julia code for “production” and within an R package.6 Second, I wrote this package for a seminar on eye movements that I was taking that Spring in the psychology department. I wrote {jlmerclusterperm} in an intense burst - most of it was complete by the end of May and I turned in the package as my final.7 I also gave a school-internal talk on it in April; my first time talking about R in front of an entirely academic audience.\r\nIn Summer, I continued polishing {jlmerclusterperm} with another ambitious goal of getting it to CRAN, at the suggestion of a couple researchers who said they’d like to use it for their own research. The already-hard task of getting through my first CRAN submission was compounded by the fact that the package contained Julia code - it took nine resubmissions in the span of two months to finally get {jlmerclusterperm} stably on CRAN.8\r\n\r\n\r\n\r\nFigure 2: Group photo taken at SMLP2023.\r\n\r\n\r\n\r\nAt the beginning of Fall, I attended the Advanced Frequentist stream of the SMLP2023 workshop, taught by Phillip Alday, Reinhold Kliegl and Douglas Bates. The topic was mixed-effects regression models in Julia, one that I became very excited about especially after working on {jlmerclusterperm}. It was an absolute blast and I wish that everyone in linguistics/psychology research appreciated good stats/data analysis as much as the folks I met there. The workshop was far away in Germany (my first time ever in Europe!) and I’m really thankful to MindCORE for giving me a grant to help with travel expenses.\r\n\r\nFor most of Fall, I didn’t do much R stuff, especially with the start of the Fall semester and a big conference looming on the horizon. But the little time I did spend on it, I worked on maintenance and upkeep for {openalexR}, one of my few collaborative projects. It’s also one of the few packages for which I’m an author of that I actually frequently use myself. I used {openalexR} a lot during the Fall semester for conducting literature reviews in preparation for my dissertation proposal, so I had a few opportunities to catch bugs and work on other improvements. I also spent a lot of my time in the Fall TA-ing for an undergraduate data science class that we recently started offering in our department. This was actually my third year in a row TA-ing it, so it went pretty smoothly. I even learned some new quirky R behaviors from my students along the way.\r\n\r\nIn October, I virtually attended the R/Pharma conference and joined a workshop on data validation using the {pointblank} package by Rich Iannone. I had used {pointblank} a little before, but I didn’t explore its features much because I thought it had some odd behaviors that I couldn’t comprehend. The workshop cleared up some of the confusion for me, and Rich made it clear in the workshop that he welcomed contributions to improve the package. So I made a PR addressing the biggest pain point I personally had with using {pointblank}. This turned out to be a pretty big undertaking which took over a month to complete. In the process, I become a co-author of {pointblank}, and I merged a series of PRs that improved the consistency of function designs, among other things.\r\nThe last R thing I did this year was actually secretly Julia - in December I gave a school-internal workshop on fitting mixed effects in Julia, geared towards an academic audience with prior experience in R. I advocated for a middle-ground approach where you can keep doing everything in R and RStudio, except move just the modelling workflow into Julia. I live-coded some Julia code and ran it from RStudio, which I think wasn’t too difficult to grasp.9 I have a half-baked package of addins to make R-Julia interoperability smoother in RStudio; I hope to wrap it up and share it some day.\r\nThat brings me to the present moment, where I’m currently taking a break from FOSS to focus on my research, as my dissertation proposal defense is coming up soon. I will continue to be responsive with maintaining {jlmerclusterperm} during this time (since there’s an active user-base of researchers who find it useful) but my other projects will become low priority. I also don’t think I’ll be starting a new project any time soon, but in the near future I hope I come up with something cool that lets me test-drive {S7}!\r\nPersonal\r\nThis year, I tried to be less of a workaholic. I think I did an okay job at it, and it mostly came in the form of diversifying my hobbies (R used to be my only hobby since starting grad school). I got back into ice skating10 and, briefly, swimming,11 and I’m fortunate that both are available literally two blocks away from my department. My girlfriend and I got really into escape rooms this year, mostly playing online ones due to budget constraints.12 I also got back into playing Steam games13 and racked up over 300 hours on Slay the Spire, mostly from the ~2 weeks recovering from covid in September.14\r\nAnd of course, I have many people to thank for making this a wonderful year.15 Happy new year to all!\r\n\r\nI was the first author for all research that I presented, as is often the case in linguistics.↩︎\r\nBroadly, how kids learn words with overlapping meanings like “dalmatian”<“dog”<“animal” from the language input.↩︎\r\nLike this blog post.↩︎\r\nA style heavily inspired by some of my favorite R bloggers like Matt Dray and Jonathan Carroll↩︎\r\nCoincidentally, my girlfriend also won a student award this year from another ASA - the Acoustical Society of America.↩︎\r\nI can’t recommend {JuliaConnectoR} enough for this.↩︎\r\nI’m actually quite proud of myself for pulling this off - writing an R package for the final was unprecedented for the class.↩︎\r\nIn the process, I received the elusive CRAN Note for exceeding 6 updates in under a month (CRAN recommends one update every 1-2 months).↩︎\r\nUsing some tricks described in the workshop materials.↩︎\r\nI used to play ice hockey competitively as a kid.↩︎\r\nTurns out that swimming does not play well with my preexisting ear conditions.↩︎\r\nMost recently we played Hallows Hill which I think is the best one we’ve played so far↩︎\r\nI’m very into roguelike genres but haven’t really played video games since high school.↩︎\r\nFor the fellow nerds, I reached A20 on Ironclad, Defect, and Watcher. I’m working my way up for Silent.↩︎\r\nI’m feeling shy so this goes in the footnotes. In roughly chronological order, I’m firstly indebted to Sam Tyner-Monroe who encouraged me write up {ggtrace} for the ASA paper award after my rstudio::conf talk on it last year. I’m grateful to Gina Reynolds and Teun van den Brand (and others in the ggplot extension club) for engaging in many insightful data viz/ggplot internals discussions with me. I’m also grateful to my FOSS collaborators, especially Trang Le, from whom I’ve learned a lot about code review and package design principles while working on {openalexR} together. Last but not least, I owe a lot to Daniel Sjoberg and Shannon Pileggi for a recent development that I’m not ready to publicly share yet 🤫.↩︎\r\n", "preview": "posts/2023-12-31-2023-year-in-review/preview.png", - "last_modified": "2024-01-01T15:43:40-05:00", + "last_modified": "2024-01-01T12:43:40-08:00", "input_file": {}, "preview_width": 1512, "preview_height": 1371 @@ -140,7 +140,7 @@ ], "contents": "\r\n\r\nContents\r\nIntro\r\nSome observations\r\ntidy-select!\r\ntidy?-select\r\nuntidy-select?\r\nuntidy-select!\r\n\r\nTidying untidy-select\r\nWriting untidy-select helpers\r\n1) times()\r\n2) offset()\r\n3) neighbors()\r\nDIY!\r\n\r\nLet’s get practical\r\n1) Sorting columns\r\n2) Error handling\r\n\r\nConclusion\r\n\r\nIntro\r\nRecently, I’ve been having frequent run-ins with {tidyselect} internals, discovering some weird and interesting behaviors along the way. This blog post is my attempt at documenting a couple of these. And as is the case with my usual style of writing, I’m gonna talk about some of the weirder stuff first and then touch on some of the “practical” side to this.\r\nSome observations\r\nLet’s start with some facts about how {tidyselect} is supposed to work. I’ll use this toy data for the demo:\r\n\r\n\r\nlibrary(dplyr, warn.conflicts = FALSE)\r\nlibrary(tidyselect)\r\ndf <- tibble(x = 1:2, y = letters[1:2], z = LETTERS[1:2])\r\ndf\r\n\r\n # A tibble: 2 × 3\r\n x y z \r\n \r\n 1 1 a A \r\n 2 2 b B\r\n\r\ntidy-select!\r\n{tidyselect} is the package that powers dplyr::select(). If you’ve used {dplyr}, you already know the behavior of select() pretty well. We can specify a column as string, symbol, or by its position:\r\n\r\n\r\ndf %>% \r\n select(\"x\")\r\n\r\n # A tibble: 2 × 1\r\n x\r\n \r\n 1 1\r\n 2 2\r\n\r\ndf %>% \r\n select(x)\r\n\r\n # A tibble: 2 × 1\r\n x\r\n \r\n 1 1\r\n 2 2\r\n\r\ndf %>% \r\n select(1)\r\n\r\n # A tibble: 2 × 1\r\n x\r\n \r\n 1 1\r\n 2 2\r\n\r\nIt’s not obvious from the outside, but the way this works is that these user-supplied expressions (like \"x\", x, and 1) all get resolved to integer before the selection actually happens.\r\nSo to be more specific, the three calls to select() were the same because these three calls to tidyselect::eval_select() are the same:1\r\n\r\n\r\neval_select(quote(\"x\"), df)\r\n\r\n x \r\n 1\r\n\r\neval_select(quote(x), df)\r\n\r\n x \r\n 1\r\n\r\neval_select(quote(1), df)\r\n\r\n x \r\n 1\r\n\r\nYou can also see eval_select() in action in the method for select():\r\n\r\n\r\ndplyr:::select.data.frame\r\n\r\n function (.data, ...) \r\n {\r\n error_call <- dplyr_error_call()\r\n loc <- tidyselect::eval_select(expr(c(...)), data = .data, \r\n error_call = error_call)\r\n loc <- ensure_group_vars(loc, .data, notify = TRUE)\r\n out <- dplyr_col_select(.data, loc)\r\n out <- set_names(out, names(loc))\r\n out\r\n }\r\n \r\n \r\n\r\ntidy?-select\r\nBecause the column subsetting part is ultimately done using integers, we can theoretically pass select() any expression, as long as it resolves to an integer vector.\r\nFor example, we can use 1 + 1 to select the second column:\r\n\r\n\r\ndf %>% \r\n select(1 + 1)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nAnd vector recycling is still a thing here too - we can use c(1, 2) + 1 to select the second and third columns:\r\n\r\n\r\ndf %>% \r\n select(c(1, 2) + 1)\r\n\r\n # A tibble: 2 × 2\r\n y z \r\n \r\n 1 a A \r\n 2 b B\r\n\r\nOrdinary function calls work as well - we can select a random column using sample():\r\n\r\n\r\ndf %>% \r\n select(sample(ncol(df), 1))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nWe can even use the .env pronoun to scope an integer variable from the global environment:2\r\n\r\n\r\noffset <- 1\r\ndf %>% \r\n select(1 + .env$offset)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nSo that’s kinda interesting.3 But what if we try to mix the different approaches to tidyselect-ing? Can we do math on columns that we’ve selected using strings and symbols?\r\nuntidy-select?\r\nUh not quite. select() doesn’t like doing math on strings and symbols.\r\n\r\n\r\ndf %>% \r\n select(x + 1)\r\n\r\n Error in `select()`:\r\n ! Problem while evaluating `x + 1`.\r\n Caused by error:\r\n ! object 'x' not found\r\n\r\ndf %>% \r\n select(\"x\" + 1)\r\n\r\n Error in `select()`:\r\n ! Problem while evaluating `\"x\" + 1`.\r\n Caused by error in `\"x\" + 1`:\r\n ! non-numeric argument to binary operator\r\n\r\nIn fact, it doesn’t even like doing certain kinds of math like multiplication (*), even with numeric constants:\r\n\r\n\r\ndf %>% \r\n select(1 * 2)\r\n\r\n Error in `select()`:\r\n ! Can't use arithmetic operator `*` in selection context.\r\n\r\nThis actually makes sense from a design POV. Adding numbers to columns probably happens more often as a mistake than something intentional. These safeguards exist to prevent users from running into cryptic errors.\r\nUnless…\r\nuntidy-select!\r\nIt turns out that {tidyselect} helpers have an interesting behavior of immediately resolving the column selection to integer. So we can get addition (+) working if we wrap our columns in redundant column selection helpers like all_of() and matches()\r\n\r\n\r\ndf %>% \r\n select(all_of(\"x\") + 1)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\ndf %>% \r\n select(matches(\"^x$\") + 1)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nFor multiplication, we have to additionally circumvent the censoring of the * symbol. Here, we can simply use a different name for the same operation:4\r\n\r\n\r\n`%times%` <- `*`\r\ndf %>% \r\n select(matches(\"^x$\") %times% 2)\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nBut geez, it’s so tiring to type all_of() and matches() all the time. There must be a better way to break the rule!\r\nTidying untidy-select\r\nLet’s make a tidy design for the untidy pattern of selecting columns by doing math on column locations. The idea is to make our own little scope inside select() where all the existing safeguards are suspended. Like a DSL within a DSL, if you will.\r\nLet’s call this function math(). It should let us express stuff like “give me the column to the right of column x” via this intuitive(?) syntax:\r\n\r\n\r\n\r\n\r\n\r\ndf %>% \r\n select(math(x + 1))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nThis is my take on math():\r\n\r\n\r\nmath <- function(expr) {\r\n math_expr <- rlang::enquo(expr)\r\n columns <- tidyselect::peek_vars()\r\n col_locs <- as.data.frame.list(seq_along(columns), col.names = columns)\r\n mask <- rlang::as_data_mask(col_locs)\r\n out <- rlang::eval_tidy(math_expr, mask)\r\n out\r\n}\r\n\r\n\r\nThere’s a lot of weird functions involved here, but it’s easier to digest by focusing on its parts. Here’s what each local variable in the function looks like for our math(x + 1) example above:\r\n\r\n $math_expr\r\n \r\n expr: ^x + 1\r\n env: 0x0000012f8e27cec8\r\n \r\n $columns\r\n [1] \"x\" \"y\" \"z\"\r\n \r\n $col_locs\r\n x y z\r\n 1 1 2 3\r\n \r\n $mask\r\n \r\n \r\n $out\r\n [1] 2\r\n\r\nLet’s walk through the pieces:\r\nmath_expr: the captured user expression, with the environment attached\r\ncolumns: the column names of the current dataframe, in order\r\ncol_locs: a dataframe of column names and location, created from columns\r\nmask: a data mask created from col_locs\r\nout: location of column(s) to select\r\nEssentially, math() first captures the expression to evaluate it in its own special environment, circumventing select()’s safeguards. Then, it grabs the column names of the data frame with tidyselect::peek_vars() to define col_locs and then mask. The data mask mask is then used inside rlang::eval_tidy() to resolve symbols like x to integer 1 when evaluating the captured expression x + 1. The expression math(x + 1) thus evaluates to 1 + 1. In turn, select(math(x + 1)) is evaluated to select(2), returning us the second column of the dataframe.\r\nWriting untidy-select helpers\r\nA small yet powerful detail in the implementation of math() is the fact that it captures the expression as a quosure. This allows math() to appropriately scope dynamically created variables, and not just bare symbols provided directly by the user.\r\nThis makes more sense with some examples. Here, I define helper functions that call math() under the hood with their own templatic math expressions (and I have them print() the expression as passed to math() for clarity). The fact that math() captures its argument as a quosure is what allows local variables like n to be correctly scoped in these examples.\r\n1) times()\r\n\r\n\r\ntimes <- function(col, n) {\r\n col <- rlang::ensym(col)\r\n print(rlang::expr(math(!!col * n))) # for debugging\r\n math(!!col * n)\r\n}\r\ndf %>%\r\n select(times(x, 2))\r\n\r\n math(x * n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n\r\n\r\nnum2 <- 2\r\ndf %>%\r\n select(times(x, num2))\r\n\r\n math(x * n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n2) offset()\r\n\r\n\r\noffset <- function(col, n) {\r\n col <- rlang::ensym(col)\r\n print(rlang::expr(math(!!col + n))) # for debugging\r\n math(!!col + n)\r\n}\r\ndf %>%\r\n select(offset(x, 1))\r\n\r\n math(x + n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n\r\n\r\nnum1 <- 1\r\ndf %>%\r\n select(offset(x, num1))\r\n\r\n math(x + n)\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\n3) neighbors()\r\n\r\n\r\nneighbors <- function(col, n) {\r\n col <- rlang::ensym(col)\r\n range <- c(-(n:1), 1:n)\r\n print(rlang::expr(math(!!col + !!range))) # for debugging\r\n math(!!col + !!range)\r\n}\r\ndf %>%\r\n select(neighbors(y, 1))\r\n\r\n math(y + c(-1L, 1L))\r\n # A tibble: 2 × 2\r\n x z \r\n \r\n 1 1 A \r\n 2 2 B\r\n\r\n\r\n\r\ndf %>%\r\n select(neighbors(y, num1))\r\n\r\n math(y + c(-1L, 1L))\r\n # A tibble: 2 × 2\r\n x z \r\n \r\n 1 1 A \r\n 2 2 B\r\n\r\nDIY!\r\nAnd of course, we can do arbitrary injections ourselves as well with !! or .env$:\r\n\r\n\r\ndf %>%\r\n select(math(x * !!num2))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\ndf %>%\r\n select(math(x * .env$num2))\r\n\r\n # A tibble: 2 × 1\r\n y \r\n \r\n 1 a \r\n 2 b\r\n\r\nThat was fun but probably not super practical. Let’s set math() aside to try our hands on something more useful.\r\nLet’s get practical\r\n1) Sorting columns\r\nProbably one of the hardest things to do idiomatically in the tidyverse is sorting (a subset of) columns by their name. For example, consider this dataframe which is a mix of columns that follow some fixed pattern (\"x|y_\\\\d\") and those outside that pattern (\"year\", \"day\", etc.).\r\n\r\n\r\ndata_cols <- expand.grid(first = c(\"x\", \"y\"), second = 1:3) %>%\r\n mutate(cols = paste0(first, \"_\", second)) %>%\r\n pull(cols)\r\ndf2 <- as.data.frame.list(seq_along(data_cols), col.names = data_cols)\r\ndf2 <- cbind(df2, storms[1,1:5])\r\ndf2 <- df2[, sample(ncol(df2))]\r\ndf2\r\n\r\n y_3 x_3 month day hour y_2 y_1 x_2 year name x_1\r\n 1 6 5 6 27 0 4 2 3 1975 Amy 1\r\n\r\nIt’s trivial to select columns by pattern - we can use the matches() helper:\r\n\r\n\r\ndf2 %>%\r\n select(matches(\"(x|y)_(\\\\d)\"))\r\n\r\n y_3 x_3 y_2 y_1 x_2 x_1\r\n 1 6 5 4 2 3 1\r\n\r\nBut what if I also wanted to further sort these columns, after I select them? There’s no easy way to do this “on the fly” inside of select, especially if we want the flexibility to sort the columns by the letter vs. the number.\r\nBut here’s one way of getting at that, exploiting two facts:\r\nmatches(), like other tidyselect helpers, immediately resolves the selection to integer\r\npeek_vars() returns the column names in order, which lets us recover the column names from location\r\nAnd that’s pretty much all there is to the tidyselect magic that goes into my solution below. I define locs (integer vector of column locations) and cols (character vector of column names at those locations), and the rest is just regex and sorting:\r\n\r\n\r\nordered_matches <- function(matches, order) {\r\n # tidyselect magic\r\n locs <- tidyselect::matches(matches)\r\n cols <- tidyselect::peek_vars()[locs]\r\n # Ordinary evaluation\r\n groups <- simplify2array(regmatches(cols, regexec(matches, cols)))[-1,]\r\n reordered <- do.call(\"order\", asplit(groups[order, ], 1))\r\n locs[reordered]\r\n}\r\n\r\n\r\nUsing ordered_matches(), we can not only select columns but also sort them using regex capture groups.\r\nThis sorts the columns by letter first then number:\r\n\r\n\r\ndf2 %>%\r\n select(ordered_matches(\"(x|y)_(\\\\d)\", c(1, 2)))\r\n\r\n x_1 x_2 x_3 y_1 y_2 y_3\r\n 1 1 3 5 2 4 6\r\n\r\nThis sorts the columns by number first then letter:\r\n\r\n\r\ndf2 %>%\r\n select(ordered_matches(\"(x|y)_(\\\\d)\", c(2, 1)))\r\n\r\n x_1 y_1 x_2 y_2 x_3 y_3\r\n 1 1 2 3 4 5 6\r\n\r\nAnd if we wanted the other columns too, we can use everything() to grab the “rest”:\r\n\r\n\r\ndf2 %>%\r\n select(ordered_matches(\"(x|y)_(\\\\d)\", c(2, 1)), everything())\r\n\r\n x_1 y_1 x_2 y_2 x_3 y_3 month day hour year name\r\n 1 1 2 3 4 5 6 6 27 0 1975 Amy\r\n\r\n2) Error handling\r\nOne of the really nice parts about the {tidyselect} design is the fact that error messages are very informative.\r\nFor example, if you select a non-existing column, it errors while pointing out that mistake:\r\n\r\n\r\ndf3 <- data.frame(x = 1)\r\nnonexistent_selection <- quote(c(x, y))\r\neval_select(nonexistent_selection, df3)\r\n\r\n Error:\r\n ! Can't subset columns that don't exist.\r\n ✖ Column `y` doesn't exist.\r\n\r\nIf you use a tidyselect helper that returns nothing, it won’t complain by default:\r\n\r\n\r\nzero_selection <- quote(starts_with(\"z\"))\r\neval_select(zero_selection, df3)\r\n\r\n named integer(0)\r\n\r\nBut you can make that error with allow_empty = FALSE:\r\n\r\n\r\neval_select(zero_selection, df3, allow_empty = FALSE)\r\n\r\n Error:\r\n ! Must select at least one item.\r\n\r\nGeneral evaluation errors are caught and chained:\r\n\r\n\r\nevaluation_error <- quote(stop(\"I'm a bad expression!\"))\r\neval_select(evaluation_error, df3)\r\n\r\n Error:\r\n ! Problem while evaluating `stop(\"I'm a bad expression!\")`.\r\n Caused by error:\r\n ! I'm a bad expression!\r\n\r\nThese error signalling patterns are clearly very useful for users,5 but there’s a little gem in there for developers too. It turns out that the error condition object contains these information too, which lets you detect different error types programmatically to forward errors to your own error handling logic.\r\nFor example, the attempted non-existent column is stored in $i:6\r\n\r\n\r\ncnd_nonexistent <- rlang::catch_cnd(\r\n eval_select(nonexistent_selection, df3)\r\n)\r\ncnd_nonexistent$i\r\n\r\n [1] \"y\"\r\n\r\nZero column selections give you NULL in $i when you set it to error:\r\n\r\n\r\ncnd_zero_selection <- rlang::catch_cnd(\r\n eval_select(zero_selection, df3, allow_empty = FALSE)\r\n)\r\ncnd_zero_selection$i\r\n\r\n NULL\r\n\r\nGeneral evaluation errors are distinguished by having a $parent:\r\n\r\n\r\ncnd_evaluation_error <- rlang::catch_cnd(\r\n eval_select(evaluation_error, df3)\r\n)\r\ncnd_evaluation_error$parent\r\n\r\n \r\n\r\nAgain, this is more useful as a developer, if you’re building something that integrates {tidyselect}.7 But I personally find this interesting to know about anyways!\r\nConclusion\r\nHere I end with the (usual) disclaimer to not actually just copy paste these for production - they’re written with the very low standard of scratching my itch, so they do not come with any warranty!\r\nBut I hope that this was a fun exercise in thinking through one of the most mysterious magics in {dplyr}. I’m sure to reference this many times in the future myself.\r\n\r\nThe examples quote(\"x\") and quote(1) are redundant because \"x\" and 1 are constants. I keep quote() in there just to make the comparison clearer↩︎\r\nNot to be confused with all_of(). The idiomatic pattern for scoping an external character vector is to do all_of(x) not .env$x. It’s only when you’re scoping a non-character vector that you’d use .env$.↩︎\r\nIt’s also strangely reminiscent of my previous blog post on dplyr::slice()↩︎\r\nThanks to Jonathan Carroll for this suggestion!↩︎\r\nFor those who actually read error messages, at least (points to self) …↩︎\r\nThough {tidyselect} errors early, so it’ll only record the first attempted column causing the error. You could use a while() loop (catch and remove bad columns from the data until there’s no more error) if you really wanted to get the full set of offending columns.↩︎\r\nIf you want some examples of post-processing tidyselect errors, there’s some stuff I did for pointblank that may be helpful as a reference.↩︎\r\n", "preview": "posts/2023-12-03-untidy-select/preview.png", - "last_modified": "2023-12-04T10:11:22-05:00", + "last_modified": "2023-12-04T07:11:22-08:00", "input_file": {}, "preview_width": 957, "preview_height": 664 @@ -161,7 +161,7 @@ ], "contents": "\r\n\r\nContents\r\nIntro\r\nWhat is an XY problem?\r\nThe question\r\nAttempt 1: after_stat()? I know that!\r\nAttempt 2: Hmm but why not after_scale()?\r\nAttempt 3: Oh. You just wanted a scale_fill_*()…\r\nReflections\r\nEnding on a fun aside - accidentally escaping an XY problem\r\n\r\nIntro\r\nA few months ago, over at the R4DS slack (http://r4ds.io/join), someone posted a ggplot question that was within my area of “expertise”. I got tagged in the thread, I went in, and it took me 3 tries to arrive at the correct solution that the poster was asking for.\r\nThe embarrassing part of the exchange was that I would write one solution, think about what I wrote for a bit, and then write a different solution after realizing that I had misunderstood the intent of the original question. In other words, I was consistently missing the point.\r\nThis is a microcosm of a bigger problem of mine that I’ve been noticing lately, as my role in the R community has shifted from mostly asking questions to mostly answering questions. By this point I’ve sort of pin-pointed the problem: I have a hard time recognizing that I’m stuck in an XY problem.\r\nI have a lot of thoughts on this and I want to document them for future me,1 so here goes a rant. I hope it’s useful to whoever is reading this too.\r\nWhat is an XY problem?\r\nAccording to Wikipedia:\r\n\r\nThe XY problem is a communication problem… where the question is about an end user’s attempted solution (Y) rather than the root problem itself (X).\r\n\r\nThe classic example of this is when a (novice) user asks how to extract the last 3 characters in a filename. There’s no good reason to blindly grab the last 3 characters, so what they probably meant to ask is how to get the file extension (which is not always 3 characters long, like .R or .Rproj).2\r\nAnother somewhat related cult-classic, copypasta3 example is the “Don’t use regex to parse HTML” answer on stackoverflow. Here, a user asks how to use regular expressions to match HTML tags, to which the top-voted answer is don’t (instead, you should use a dedicated parser). The delivery of this answer is a work of art, so I highly suggest you giving it a read if you haven’t seen it already (the link is above for your amusement).\r\nAn example of an XY problem in R that might hit closer to home is when a user complains about the notorious Object of type 'closure' is not subsettable error. It’s often brought up as a cautionary tale for novice users (error messages can only tell you so much, so you must develop debugging strategies), but it has a special meaning for more experienced users who’ve been bit by this multiple times. So for me, when I see novice users reporting this specific error, I usually ask them if they have a variable called data and whether they forgot to run the line assigning that variable. Of course, this answer does not explain what the error means,4 but oftentimes it’s the solution that the user is looking for.\r\n\r\n\r\n# Oops forgot to define `data`!\r\n# `data` is a function (in {base}), which is not subsettable\r\ndata$value\r\n\r\n Error in data$value: object of type 'closure' is not subsettable\r\n\r\nAs one last example, check out this lengthy exchange on splitting a string (Y) to parse JSON (X). I felt compelled to include this example because it does a good job capturing the degree of frustration (very high) that normally comes with XY problems.\r\nBut the thing about the XY problem is that it often prompts the lesson of asking good questions: don’t skip steps in your reasoning, make your goals/intentions clear, use a reprex,5 and so on. But in so far as it’s a communication problem involving both parties, I think we should also talk about what the person answering the question can do to recognize an XY problem and break out of it.\r\nEnter me, someone who really needs to do a better job of recognizing when I’m stuck in an XY problem. So with the definition out of the way, let’s break down how I messed up.\r\nThe question\r\nThe question asks:\r\n\r\nDoes anyone know how to access the number of bars in a barplot? I’m looking for something that will return “15” for the following code, that can be used within ggplot, like after_stat()\r\n\r\nThe question comes with an example code. Not exactly a reprex, but something to help understand the question:\r\n\r\n\r\np <- ggplot(mpg, aes(manufacturer, fill = manufacturer)) +\r\n geom_bar()\r\np\r\n\r\n\r\n\r\nThe key phrase in the question is “can be used within ggplot”. So the user isn’t looking for something like this even though it’s conceptually equivalent:\r\n\r\n\r\nlength(unique(mpg$manufacturer))\r\n\r\n [1] 15\r\n\r\nThe idea here is that ggplot knows that there are 15 bars, so this fact must represented somewhere in the internals. The user wants to be able to access that value dynamically.\r\nAttempt 1: after_stat()? I know that!\r\nThe very last part of the question “… like after_stat()” triggered some alarms in the thread and got me called in. For those unfamiliar, after_stat() is part of the new and obscure family of delayed aesthetic evaluation functions introduced in ggplot 3.3.0. It’s something that you normally don’t think about in ggplot, but it’s a topic that I’ve been obsessed with for the last 2 years or so: it has resulted in a paper, a package (ggtrace), blog posts, and talks (useR!, rstudio::conf, JSM).\r\nThe user asked about after_stat(), so naturally I came up with an after_stat() solution. In the after-stat stage of the bar layer’s data, the layer data looks like this:\r\n\r\n\r\n# remotes::install_github(\"yjunechoe/ggtrace\")\r\nlibrary(ggtrace)\r\n# Grab the state of the layer data in the after-stat\r\nlayer_after_stat(p)\r\n\r\n # A tibble: 15 × 8\r\n count prop x width flipped_aes fill PANEL group\r\n \r\n 1 18 1 1 0.9 FALSE audi 1 1\r\n 2 19 1 2 0.9 FALSE chevrolet 1 2\r\n 3 37 1 3 0.9 FALSE dodge 1 3\r\n 4 25 1 4 0.9 FALSE ford 1 4\r\n 5 9 1 5 0.9 FALSE honda 1 5\r\n 6 14 1 6 0.9 FALSE hyundai 1 6\r\n 7 8 1 7 0.9 FALSE jeep 1 7\r\n 8 4 1 8 0.9 FALSE land rover 1 8\r\n 9 3 1 9 0.9 FALSE lincoln 1 9\r\n 10 4 1 10 0.9 FALSE mercury 1 10\r\n 11 13 1 11 0.9 FALSE nissan 1 11\r\n 12 5 1 12 0.9 FALSE pontiac 1 12\r\n 13 14 1 13 0.9 FALSE subaru 1 13\r\n 14 34 1 14 0.9 FALSE toyota 1 14\r\n 15 27 1 15 0.9 FALSE volkswagen 1 15\r\n\r\nIt’s a tidy data where each row represents a barplot. So the number of bars is the length of any column in the after-stat data, but it’d be most principled to take the length of the group column in this case.6\r\nSo the after-stat expression that returns the desired value 15 is after_stat(length(group)), which essentially evaluates to the following:\r\n\r\n\r\nlength(layer_after_stat(p)$group)\r\n\r\n [1] 15\r\n\r\nFor example, you can use this inside the aes() to annotate the total number of bars on top of each bar:\r\n\r\n\r\nggplot(mpg, aes(manufacturer, fill = manufacturer)) +\r\n geom_bar() +\r\n geom_label(\r\n aes(label = after_stat(length(group))),\r\n fill = \"white\",\r\n stat = \"count\"\r\n )\r\n\r\n\r\n\r\nThe after_stat(length(group)) solution returns the number of bars using after_stat(), as the user asked. But as you can see this is extremely useless: there are many technical constraints on what you can actually do with this information in the after-stat stage.\r\nI should have checked if they actually wanted an after_stat() solution first, before providing this answer. But I got distracted by the after_stat() keyword and got too excited by the prospect of someone else taking interest in the thing that I’m obsessed with. Alas this wasn’t the case - they were trying to do something practical - so I went back into the thread to figure out their goal for my second attempt.\r\nAttempt 2: Hmm but why not after_scale()?\r\nWhat I had neglected in my first attempt was the fact that the user talked more about their problem with someone else who got to the question before I did. That discussion turned out to include an important clue to the intent behind the original question: the user wanted the number of bars in order to interpolate the color of the bars.\r\nSo for example, a palette function like topo.colors() takes n to produce interpolated color values:\r\n\r\n\r\ntopo.colors(n = 16)\r\n\r\n [1] \"#4C00FF\" \"#0F00FF\" \"#002EFF\" \"#006BFF\" \"#00A8FF\" \"#00E5FF\" \"#00FF4D\"\r\n [8] \"#00FF00\" \"#4DFF00\" \"#99FF00\" \"#E6FF00\" \"#FFFF00\" \"#FFEA2D\" \"#FFDE59\"\r\n [15] \"#FFDB86\" \"#FFE0B3\"\r\n\r\nchroma::show_col(topo.colors(16))\r\n\r\n\r\n\r\nIf the intent is to use the number of bars to generate a vector of colors to assign to the bars, then a better place to do it would be in the after_scale(), where the state of the layer data in the after-scale looks like this:\r\n\r\n\r\nlayer_after_scale(p)\r\n\r\n # A tibble: 15 × 16\r\n fill y count prop x flipped_aes PANEL group ymin ymax xmin xmax \r\n \r\n 1 #F87… 18 18 1 1 FALSE 1 1 0 18 0.55 1.45\r\n 2 #E58… 19 19 1 2 FALSE 1 2 0 19 1.55 2.45\r\n 3 #C99… 37 37 1 3 FALSE 1 3 0 37 2.55 3.45\r\n 4 #A3A… 25 25 1 4 FALSE 1 4 0 25 3.55 4.45\r\n 5 #6BB… 9 9 1 5 FALSE 1 5 0 9 4.55 5.45\r\n 6 #00B… 14 14 1 6 FALSE 1 6 0 14 5.55 6.45\r\n 7 #00B… 8 8 1 7 FALSE 1 7 0 8 6.55 7.45\r\n 8 #00C… 4 4 1 8 FALSE 1 8 0 4 7.55 8.45\r\n 9 #00B… 3 3 1 9 FALSE 1 9 0 3 8.55 9.45\r\n 10 #00B… 4 4 1 10 FALSE 1 10 0 4 9.55 10.45\r\n 11 #619… 13 13 1 11 FALSE 1 11 0 13 10.55 11.45\r\n 12 #B98… 5 5 1 12 FALSE 1 12 0 5 11.55 12.45\r\n 13 #E76… 14 14 1 13 FALSE 1 13 0 14 12.55 13.45\r\n 14 #FD6… 34 34 1 14 FALSE 1 14 0 34 13.55 14.45\r\n 15 #FF6… 27 27 1 15 FALSE 1 15 0 27 14.55 15.45\r\n # ℹ 4 more variables: colour , linewidth , linetype ,\r\n # alpha \r\n\r\nIt’s still a tidy data where each row represents a bar. But the important distinction between the after-stat and the after-scale is that the after-scale data reflects the work of the (non-positional) scales. So the fill column here is now the actual hexadecimal color values for the bars:\r\n\r\n\r\nlayer_after_scale(p)$fill\r\n\r\n [1] \"#F8766D\" \"#E58700\" \"#C99800\" \"#A3A500\" \"#6BB100\" \"#00BA38\" \"#00BF7D\"\r\n [8] \"#00C0AF\" \"#00BCD8\" \"#00B0F6\" \"#619CFF\" \"#B983FF\" \"#E76BF3\" \"#FD61D1\"\r\n [15] \"#FF67A4\"\r\n\r\nchroma::show_col(layer_after_scale(p)$fill)\r\n\r\n\r\n\r\nWhat after_scale()/stage(after_scale = ) allows you to do is override these color values right before the layer data is sent off to be drawn. So we again use the same expression length(group) to grab the number of bars in the after-scale data, pass that value to a color palette function like topo.colors(), and re-map to the fill aesthetic.\r\n\r\n\r\nggplot(mpg, aes(manufacturer)) +\r\n geom_bar(aes(fill = stage(manufacturer, after_scale = topo.colors(length(group))))) +\r\n scale_fill_identity()\r\n\r\n\r\n\r\nSo this solution achieves the desired effect, but it’s needlessly complicated. You need complex staging of the fill aesthetic via stage() and you also need to pair this with scale_fill_identity() to let ggplot know that you’re directly supplying the fill values (otherwise you get errors and warnings).\r\nWait hold up - a fill scale? Did this user actually just want a custom fill scale? Ohhh…\r\nAttempt 3: Oh. You just wanted a scale_fill_*()…\r\nSo yeah. It turns out that they just wanted a custom scale that takes some set of colors and interpolate the colors across the bars in the plot.\r\nThe correct way to approach this problem is to create a new fill scale that wraps around discrete_scale(). The scale function should take a set of colors (cols) and pass discrete_scale() a palette function created via the function factory colorRampPalette().\r\n\r\n\r\nscale_fill_interpolate <- function(cols, ...) {\r\n discrete_scale(\r\n aesthetics = \"fill\",\r\n scale_name = \"interpolate\",\r\n palette = colorRampPalette(cols),\r\n ...\r\n )\r\n}\r\n\r\n\r\nOur new scale_fill_interpolate() function can now be added to the plot like any other scale:\r\n\r\n\r\np +\r\n scale_fill_interpolate(c(\"pink\", \"goldenrod\"))\r\n\r\n\r\n\r\n\r\n\r\np +\r\n scale_fill_interpolate(c(\"steelblue\", \"orange\", \"forestgreen\"))\r\n\r\n\r\n\r\n\r\n\r\nset.seed(123)\r\ncols <- sample(colors(), 5)\r\ncols\r\n\r\n [1] \"lightgoldenrodyellow\" \"mediumorchid1\" \"gray26\" \r\n [4] \"palevioletred2\" \"gray42\"\r\n\r\np +\r\n scale_fill_interpolate(cols)\r\n\r\n\r\n\r\nI sent (a variant of) this answer to the thread and the user marked it solved with a thanks, concluding my desperate spiral into finding the right solution to the intended question.\r\nReflections\r\nSo why was this so hard for me to get? The most immediate cause is because I quickly skimmed the wording of the question and extracted two key phrases:\r\n“access the number of bars in a barplot”\r\n“that can be used within ggplot, like after_stat()”\r\nBut neither of these turned out to be important (or even relevant) to the solution. The correct answer was just a clean custom fill scale, where you don’t have to think about the number of bars or accessing that in the internals. Simply extending discrete_scale() allows you to abstract away from those details entirely.\r\nSo in fairness, it was a very difficult XY problem to get out of. But the wording of the question wasn’t the root cause. I think the root cause is some combination of the following:\r\nThere are many ways to do the same thing in R so I automatically assume that my solution counts as a contribution as long as it gets the job done. But solutions should also be understandable for the person asking the question. Looking back, I was insane to even suggest my second attempt as the solution because it’s so contrived and borderline incomprehensible. It only sets the user up for more confusion and bugs in the future, so that was a bit irresponsible and selfish of me (it only scratches my itch).\r\nSolutions to (practical) problems are usually boring and I’m allergic to boring solutions. This is a bad attitude to have when offering to help people. I assumed that people share my excitement about ggplot internals, but actually most users don’t care (that’s why it’s called the internals and hidden from users). An important context that I miss as the person answering questions on the other end is that users post questions when they’re stuck and frustrated. Their goal is not to take a hard problem and turn it into a thinking exercise or a learning experience (that part happens organically, but is not the goal). If anything, that’s what I’m doing when I choose to take interest in other people’s (coding) problems.\r\nI imbue intent to questions that are clearing missing it. I don’t think that’s a categorically bad thing because it can sometimes land you in a shortcut out of an XY problem. But when you miss, it’s catastrophic and pulls you deeper into the problem. I think that was the case for me here - I conflated the X with the Y and assumed that after_stat() was relevant on face value because I personally know it to be a very powerful tool. I let my own history of treating after_stat() like the X (“How can I use after_stat() to solve/simplify this problem?”) guide my interpretation of the question, which is not good practice.\r\nOf course, there are likely more to this, but these are plenty for me to work on for now.\r\nLastly, I don’t want this to detract from the fact that the onus is on users to ask good questions. I don’t want to put question-answer-ers on the spot for their handling of XY problems. After all, most are volunteers who gain nothing from helping others besides status and some internet points.7 Just take this as me telling myself to be a better person.\r\nEnding on a fun aside - accidentally escaping an XY problem\r\nIt’s not my style to write serious blog posts. I think I deserve a break from many paragraphs of self-induced beat down.\r\nSo in that spirit I want to end on a funny anecdote where I escaped an XY problem by pure luck.\r\nI came across a relatively straightforward question which can be summarized as the following:\r\n\r\n\r\ninput <- \"a + c + d + e\"\r\noutput <- c(\"a\", \"c\", \"d\", \"e\")\r\n\r\n\r\nThere are many valid approaches to this and some were already posted to the thread:\r\n\r\n\r\nstrsplit(input, \" + \", TRUE)[[1]]\r\n\r\n [1] \"a\" \"c\" \"d\" \"e\"\r\n\r\nall.vars(parse(text = input))\r\n\r\n [1] \"a\" \"c\" \"d\" \"e\"\r\n\r\nMe, knowing too many useless things (and knowing that the the user already has the best answers), suggested a quirky alternative:8\r\n\r\nThis is super off-label usage but you can also use R’s formula utilities to parse this:9\r\n\r\n\r\n\r\nattr(terms(reformulate(input)), \"term.labels\")\r\n\r\n [1] \"a\" \"c\" \"d\" \"e\"\r\n\r\nTo my surprise, the response I got was:\r\n\r\nLovely! These definitely originated from formula ages ago so it’s actually not far off-label at all 🙂\r\n\r\n\r\nEspecially before slack deletes the old messages.↩︎\r\nIn R, you can use tools::file_ext() or fs::path_ext().↩︎\r\nhttps://en.wikipedia.org/wiki/Copypasta↩︎\r\nGood luck trying to explain the actual error message. Especially closure, a kind of weird vocabulary in R (fun fact - the first edition of Advanced R used to have a section on closure which is absent in the second edition probably because “In R, almost every function is a closure”).↩︎\r\nParadoxically, XY problems sometimes arise when inexperienced users try to come up with a reprex. They might capture the error/problem too narrowly, such that the more important broader context is left out.↩︎\r\nOr the number of distinct combinations between PANEL and group, as in nlevels(interaction(PANEL, group, drop = TRUE)). But of course that’s overkill and only of interest for “theoretical purity”.↩︎\r\nAnd I like the R4DS slack because it doesn’t have “internet points.” There is status (moderator) though I don’t wear the badge (literally - it’s an emoji).↩︎\r\nActually I only thought of this because I’d been writing a statistical package that required some nasty metaprogramming with the formula object.↩︎\r\nThe significance of this solution building on top of R’s formula utilities is that it will also parse stuff like \"a*b\" as c(\"a\", \"b\", \"a:b\"). So given that the inputs originated as R formulas (as the user later clarifies), this is the principled approach.↩︎\r\n", "preview": "posts/2023-07-09-x-y-problem/preview.png", - "last_modified": "2023-07-10T04:24:43-04:00", + "last_modified": "2023-07-10T01:24:43-07:00", "input_file": {}, "preview_width": 238, "preview_height": 205 @@ -183,7 +183,7 @@ ], "contents": "\r\n\r\nContents\r\nIntro\r\nSpecial properties of dplyr::slice()\r\nBasic usage\r\nRe-imagining slice() with data-masking\r\nSpecial properties of slice()\r\n\r\nA gallery of row operations with slice()\r\nRepeat rows (in place)\r\nSubset a selection of rows + the following row\r\nSubset a selection of rows + multiple following rows\r\nFilter (and encode) neighboring rows\r\nAside: kronecker() as as.vector(outer())\r\nWindowed min/max/median (etc.)\r\nEvenly distributed row shuffling of balanced categories\r\nInserting a new row at specific intervals\r\nEvenly distributed row shuffling of unequal categories\r\n\r\nConclusion\r\n\r\nIntro\r\nIn data wrangling, there are a handful of classes of operations on data frames that we think of as theoretically well-defined and tackling distinct problems. To name a few, these include subsetting, joins, split-apply-combine, pairwise operations, nested-column workflows, and so on.\r\nAgainst this rich backdrop, there’s one aspect of data wrangling that doesn’t receive as much attention: ordering of rows. This isn’t necessarily surprising - we often think of row order as an auxiliary attribute of data frames since they don’t speak to the content of the data, per se. I think we all share the intuition that two dataframe that differ only in row order are practically the same for most analysis purposes.\r\nExcept when they aren’t.\r\nIn this blog post I want to talk about a few, somewhat esoteric cases of what I like to call row-relational operations. My goal is to try to motivate row-relational operations as a full-blown class of data wrangling operation that includes not only row ordering, but also sampling, shuffling, repeating, interweaving, and so on (I’ll go over all of these later).\r\nWithout spoiling too much, I believe that dplyr::slice() offers a powerful context for operations over row indices, even those that at first seem to lack a “tidy” solution. You may already know slice() as an indexing function, but my hope is to convince you that it can do so much more.\r\nLet’s start by first talking about some special properties of dplyr::slice(), and then see how we can use it for various row-relational operations.\r\nSpecial properties of dplyr::slice()\r\nBasic usage\r\nFor the following demonstration, I’ll use a small subset of the dplyr::starwars dataset:\r\n\r\n\r\nstarwars_sm <- dplyr::starwars[1:10, 1:3]\r\nstarwars_sm\r\n\r\n # A tibble: 10 × 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Leia Organa 150 49\r\n 6 Owen Lars 178 120\r\n 7 Beru Whitesun Lars 165 75\r\n 8 R5-D4 97 32\r\n 9 Biggs Darklighter 183 84\r\n 10 Obi-Wan Kenobi 182 77\r\n\r\n1) Row selection\r\nslice() is a row indexing verb - if you pass it a vector of integers, it subsets data frame rows:\r\n\r\n\r\nstarwars_sm |> \r\n slice(1:6) # First six rows\r\n\r\n # A tibble: 6 × 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Leia Organa 150 49\r\n 6 Owen Lars 178 120\r\n\r\nLike other dplyr verbs with mutate-semantics, you can use context-dependent expressions inside slice(). For example, you can use n() to grab the last row (or last couple of rows):\r\n\r\n\r\nstarwars_sm |> \r\n slice( n() ) # Last row\r\n\r\n # A tibble: 1 × 3\r\n name height mass\r\n \r\n 1 Obi-Wan Kenobi 182 77\r\n\r\nstarwars_sm |> \r\n slice( n() - 2:0 ) # Last three rows\r\n\r\n # A tibble: 3 × 3\r\n name height mass\r\n \r\n 1 R5-D4 97 32\r\n 2 Biggs Darklighter 183 84\r\n 3 Obi-Wan Kenobi 182 77\r\n\r\nAnother context-dependent expression that comes in handy is row_number(), which returns all row indices. Using it inside slice() essentially performs an identity transformation:\r\n\r\n\r\nidentical(\r\n starwars_sm,\r\n starwars_sm |> slice( row_number() )\r\n)\r\n\r\n [1] TRUE\r\n\r\nLastly, similar to in select(), you can use - for negative indexing (to remove rows):\r\n\r\n\r\nidentical(\r\n starwars_sm |> slice(1:3), # First three rows\r\n starwars_sm |> slice(-(4:n())) # All rows except fourth row to last row\r\n)\r\n\r\n [1] TRUE\r\n\r\n2) Dynamic dots\r\nslice() supports dynamic dots. If you pass row indices into multiple argument positions, slice() will concatenate them for you:\r\n\r\n\r\nidentical(\r\n starwars_sm |> slice(1:6),\r\n starwars_sm |> slice(1, 2:4, 5, 6)\r\n)\r\n\r\n [1] TRUE\r\n\r\nIf you have a list() of row indices, you can use the splice operator !!! to spread them out:\r\n\r\n\r\nstarwars_sm |> \r\n slice( !!!list(1, 2:4, 5, 6) )\r\n\r\n # A tibble: 6 × 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Leia Organa 150 49\r\n 6 Owen Lars 178 120\r\n\r\nThe above call to slice() evaluates to the following after splicing:\r\n\r\n\r\nrlang::expr( slice(!!!list(1, 2:4, 5, 6)) )\r\n\r\n slice(1, 2:4, 5, 6)\r\n\r\n3) Row ordering\r\nslice() respects the order in which you supplied the row indices:\r\n\r\n\r\nstarwars_sm |> \r\n slice(3, 1, 2, 5)\r\n\r\n # A tibble: 4 × 3\r\n name height mass\r\n \r\n 1 R2-D2 96 32\r\n 2 Luke Skywalker 172 77\r\n 3 C-3PO 167 75\r\n 4 Leia Organa 150 49\r\n\r\nThis means you can do stuff like random sampling with sample():\r\n\r\n\r\nstarwars_sm |> \r\n slice( sample(n()) )\r\n\r\n # A tibble: 10 × 3\r\n name height mass\r\n \r\n 1 Obi-Wan Kenobi 182 77\r\n 2 Owen Lars 178 120\r\n 3 Leia Organa 150 49\r\n 4 Darth Vader 202 136\r\n 5 Luke Skywalker 172 77\r\n 6 R5-D4 97 32\r\n 7 C-3PO 167 75\r\n 8 Beru Whitesun Lars 165 75\r\n 9 Biggs Darklighter 183 84\r\n 10 R2-D2 96 32\r\n\r\nYou can also shuffle a subset of rows (ex: just the first five):\r\n\r\n\r\nstarwars_sm |> \r\n slice( sample(5), 6:n() )\r\n\r\n # A tibble: 10 × 3\r\n name height mass\r\n \r\n 1 C-3PO 167 75\r\n 2 Leia Organa 150 49\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Luke Skywalker 172 77\r\n 6 Owen Lars 178 120\r\n 7 Beru Whitesun Lars 165 75\r\n 8 R5-D4 97 32\r\n 9 Biggs Darklighter 183 84\r\n 10 Obi-Wan Kenobi 182 77\r\n\r\nOr reorder all rows by their indices (ex: in reverse):\r\n\r\n\r\nstarwars_sm |> \r\n slice( rev(row_number()) )\r\n\r\n # A tibble: 10 × 3\r\n name height mass\r\n \r\n 1 Obi-Wan Kenobi 182 77\r\n 2 Biggs Darklighter 183 84\r\n 3 R5-D4 97 32\r\n 4 Beru Whitesun Lars 165 75\r\n 5 Owen Lars 178 120\r\n 6 Leia Organa 150 49\r\n 7 Darth Vader 202 136\r\n 8 R2-D2 96 32\r\n 9 C-3PO 167 75\r\n 10 Luke Skywalker 172 77\r\n\r\n4) Out-of-bounds handling\r\nIf you pass a row index that’s out of bounds, slice() returns a 0-row data frame:\r\n\r\n\r\nstarwars_sm |> \r\n slice( n() + 1 ) # Select the row after the last row\r\n\r\n # A tibble: 0 × 3\r\n # ℹ 3 variables: name , height , mass \r\n\r\nWhen mixed with valid row indices, out-of-bounds indices are simply ignored (much 💜 for this behavior):\r\n\r\n\r\nstarwars_sm |> \r\n slice(\r\n 0, # 0th row - ignored\r\n 1:3, # first three rows\r\n n() + 1 # 1 after last row - ignored\r\n )\r\n\r\n # A tibble: 3 × 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n\r\nThis lets you do funky stuff like select all even numbered rows by passing slice() all row indices times 2:\r\n\r\n\r\nstarwars_sm |> \r\n slice( row_number() * 2 ) # Add `- 1` at the end for *odd* rows!\r\n\r\n # A tibble: 5 × 3\r\n name height mass\r\n \r\n 1 C-3PO 167 75\r\n 2 Darth Vader 202 136\r\n 3 Owen Lars 178 120\r\n 4 R5-D4 97 32\r\n 5 Obi-Wan Kenobi 182 77\r\n\r\nRe-imagining slice() with data-masking\r\nslice() is already pretty neat as it is, but that’s just the tip of the iceberg.\r\nThe really cool, under-rated feature of slice() is that it’s data-masked, meaning that you can reference column vectors as if they’re variables. Another way of describing this property of slice() is to say that it has mutate-semantics.\r\nAt a very basic level, this means that slice() can straightforwardly replicate the behavior of some dplyr verbs like arrange() and filter()!\r\nslice() as arrange()\r\nFrom our starwars_sm data, if we want to sort by height we can use arrange():\r\n\r\n\r\nstarwars_sm |> \r\n arrange(height)\r\n\r\n # A tibble: 10 × 3\r\n name height mass\r\n \r\n 1 R2-D2 96 32\r\n 2 R5-D4 97 32\r\n 3 Leia Organa 150 49\r\n 4 Beru Whitesun Lars 165 75\r\n 5 C-3PO 167 75\r\n 6 Luke Skywalker 172 77\r\n 7 Owen Lars 178 120\r\n 8 Obi-Wan Kenobi 182 77\r\n 9 Biggs Darklighter 183 84\r\n 10 Darth Vader 202 136\r\n\r\nBut we can also do this with slice() to the same effect, using order():\r\n\r\n\r\nstarwars_sm |> \r\n slice( order(height) )\r\n\r\n # A tibble: 10 × 3\r\n name height mass\r\n \r\n 1 R2-D2 96 32\r\n 2 R5-D4 97 32\r\n 3 Leia Organa 150 49\r\n 4 Beru Whitesun Lars 165 75\r\n 5 C-3PO 167 75\r\n 6 Luke Skywalker 172 77\r\n 7 Owen Lars 178 120\r\n 8 Obi-Wan Kenobi 182 77\r\n 9 Biggs Darklighter 183 84\r\n 10 Darth Vader 202 136\r\n\r\nThis is conceptually equivalent to combining the following 2-step process:\r\n\r\n\r\nordered_val_ind <- order(starwars_sm$height)\r\n ordered_val_ind\r\n\r\n [1] 3 8 5 7 2 1 6 10 9 4\r\n\r\n\r\n\r\nstarwars_sm |> \r\n slice( ordered_val_ind )\r\n\r\n # A tibble: 10 × 3\r\n name height mass\r\n \r\n 1 R2-D2 96 32\r\n 2 R5-D4 97 32\r\n 3 Leia Organa 150 49\r\n 4 Beru Whitesun Lars 165 75\r\n 5 C-3PO 167 75\r\n 6 Luke Skywalker 172 77\r\n 7 Owen Lars 178 120\r\n 8 Obi-Wan Kenobi 182 77\r\n 9 Biggs Darklighter 183 84\r\n 10 Darth Vader 202 136\r\n\r\nslice() as filter()\r\nWe can also use slice() to filter(), using which():\r\n\r\n\r\nidentical(\r\n starwars_sm |> filter( height > 150 ),\r\n starwars_sm |> slice( which(height > 150) )\r\n)\r\n\r\n [1] TRUE\r\n\r\nThus, we can think of filter() and slice() as two sides of the same coin:\r\nfilter() takes a logical vector that’s the same length as the number of rows in the data frame\r\nslice() takes an integer vector that’s a (sub)set of a data frame’s row indices.\r\nTo put it more concretely, this logical vector was being passed to the above filter() call:\r\n\r\n\r\nstarwars_sm$height > 150\r\n\r\n [1] TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE\r\n\r\nWhile this integer vector was being passed to the above slice() call, where which() returns the position of TRUE values, given a logical vector:\r\n\r\n\r\nwhich( starwars_sm$height > 150 )\r\n\r\n [1] 1 2 4 6 7 9 10\r\n\r\nSpecial properties of slice()\r\nThis re-imagined slice() that heavily exploits data-masking gives us two interesting properties:\r\nWe can work with sets of row indices that need not to be the same length as the data frame (vs. filter()).\r\nWe can work with row indices as integers, which are legible to arithmetic operations (ex: + and *)\r\nTo grok the significance of working with rows as integer sets, let’s work through some examples where slice() comes in very handy.\r\nA gallery of row operations with slice()\r\nRepeat rows (in place)\r\nIn {tidyr}, there’s a function called uncount() which does the opposite of dplyr::count():\r\n\r\n\r\nlibrary(tidyr)\r\n# Example from `tidyr::uncount()` docs\r\nuncount_df <- tibble(x = c(\"a\", \"b\"), n = c(1, 2))\r\nuncount_df\r\n\r\n # A tibble: 2 × 2\r\n x n\r\n \r\n 1 a 1\r\n 2 b 2\r\n\r\nuncount_df |> \r\n uncount(n)\r\n\r\n # A tibble: 3 × 1\r\n x \r\n \r\n 1 a \r\n 2 b \r\n 3 b\r\n\r\nWe can mimic this behavior with slice(), using rep(times = ...):\r\n\r\n\r\nrep(1:nrow(uncount_df), times = uncount_df$n)\r\n\r\n [1] 1 2 2\r\n\r\nuncount_df |> \r\n slice( rep(row_number(), times = n) ) |> \r\n select( -n )\r\n\r\n # A tibble: 3 × 1\r\n x \r\n \r\n 1 a \r\n 2 b \r\n 3 b\r\n\r\nWhat if instead of a whole column storing that information, we only have information about row position?\r\nLet’s say we want to duplicate the rows of starwars_sm at the repeat_at positions:\r\n\r\n\r\nrepeat_at <- sample(5, 2)\r\nrepeat_at\r\n\r\n [1] 4 5\r\n\r\nIn slice(), you’d just select all rows plus those additional rows, then sort the integer row indices:\r\n\r\n\r\nstarwars_sm |> \r\n slice( sort(c(row_number(), repeat_at)) )\r\n\r\n # A tibble: 12 × 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Darth Vader 202 136\r\n 6 Leia Organa 150 49\r\n 7 Leia Organa 150 49\r\n 8 Owen Lars 178 120\r\n 9 Beru Whitesun Lars 165 75\r\n 10 R5-D4 97 32\r\n 11 Biggs Darklighter 183 84\r\n 12 Obi-Wan Kenobi 182 77\r\n\r\nWhat if we also separately have information about how much to repeat those rows by?\r\n\r\n\r\nrepeat_by <- c(3, 4)\r\n\r\n\r\nYou can apply the same rep() method for just the subset of rows to repeat:\r\n\r\n\r\nstarwars_sm |> \r\n slice( sort(c(row_number(), rep(repeat_at, times = repeat_by - 1))) )\r\n\r\n # A tibble: 15 × 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Darth Vader 202 136\r\n 6 Darth Vader 202 136\r\n 7 Leia Organa 150 49\r\n 8 Leia Organa 150 49\r\n 9 Leia Organa 150 49\r\n 10 Leia Organa 150 49\r\n 11 Owen Lars 178 120\r\n 12 Beru Whitesun Lars 165 75\r\n 13 R5-D4 97 32\r\n 14 Biggs Darklighter 183 84\r\n 15 Obi-Wan Kenobi 182 77\r\n\r\nCircling back to uncount(), you could also initialize a vector of 1s and replace() where the rows should be repeated:\r\n\r\n\r\nstarwars_sm |> \r\n uncount( replace(rep(1, n()), repeat_at, repeat_by) )\r\n\r\n # A tibble: 15 × 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Darth Vader 202 136\r\n 6 Darth Vader 202 136\r\n 7 Leia Organa 150 49\r\n 8 Leia Organa 150 49\r\n 9 Leia Organa 150 49\r\n 10 Leia Organa 150 49\r\n 11 Owen Lars 178 120\r\n 12 Beru Whitesun Lars 165 75\r\n 13 R5-D4 97 32\r\n 14 Biggs Darklighter 183 84\r\n 15 Obi-Wan Kenobi 182 77\r\n\r\nSubset a selection of rows + the following row\r\nRow order can sometimes encode a meaningful continuous measure, like time.\r\nTake for example this subset of the flights dataset in {nycflights13}:\r\n\r\n\r\nflights_df <- nycflights13::flights |> \r\n filter(month == 3, day == 3, origin == \"JFK\") |> \r\n select(dep_time, flight, carrier) |> \r\n slice(1:100) |> \r\n arrange(dep_time)\r\nflights_df\r\n\r\n # A tibble: 100 × 3\r\n dep_time flight carrier\r\n \r\n 1 535 1141 AA \r\n 2 551 5716 EV \r\n 3 555 145 B6 \r\n 4 556 208 B6 \r\n 5 556 79 B6 \r\n 6 601 501 B6 \r\n 7 604 725 B6 \r\n 8 606 135 B6 \r\n 9 606 600 UA \r\n 10 607 829 US \r\n # ℹ 90 more rows\r\n\r\nHere, the rows are ordered by dep_time, such that given a row, the next row is a data point for the next flight that departed from the airport.\r\nAnd let’s say we’re interested in flights that took off immediately after American Airlines (\"AA\") flights. Given what we just noted about the ordering of rows in the data frame, we can do this in slice() by adding 1 to the row index of AA flights:\r\n\r\n\r\nflights_df |> \r\n slice( which(carrier == \"AA\") + 1 )\r\n\r\n # A tibble: 14 × 3\r\n dep_time flight carrier\r\n \r\n 1 551 5716 EV \r\n 2 627 905 B6 \r\n 3 652 117 B6 \r\n 4 714 825 AA \r\n 5 717 987 B6 \r\n 6 724 11 VX \r\n 7 742 183 DL \r\n 8 802 655 AA \r\n 9 805 2143 DL \r\n 10 847 59 B6 \r\n 11 858 647 AA \r\n 12 859 120 DL \r\n 13 1031 179 AA \r\n 14 1036 641 B6\r\n\r\nWhat if we also want to keep observations for the preceding AA flights as well? We can just stick which(carrier == \"AA\") inside slice() too:\r\n\r\n\r\nflights_df |> \r\n slice(\r\n which(carrier == \"AA\"),\r\n which(carrier == \"AA\") + 1\r\n )\r\n\r\n # A tibble: 28 × 3\r\n dep_time flight carrier\r\n \r\n 1 535 1141 AA \r\n 2 626 413 AA \r\n 3 652 1815 AA \r\n 4 711 443 AA \r\n 5 714 825 AA \r\n 6 724 33 AA \r\n 7 739 59 AA \r\n 8 802 1838 AA \r\n 9 802 655 AA \r\n 10 843 1357 AA \r\n # ℹ 18 more rows\r\n\r\nBut now the rows are now ordered such that all the AA flights come before the other flights! How can we preserve the original order of increasing dep_time?\r\nWe could reconstruct the initial row order by piping the result into arrange(dep_time) again, but the simplest solution would be to concatenate the set of row indices and sort() them, since the output of which() is already integer!\r\n\r\n\r\nflights_df |> \r\n slice(\r\n sort(c(\r\n which(carrier == \"AA\"),\r\n which(carrier == \"AA\") + 1\r\n ))\r\n )\r\n\r\n # A tibble: 28 × 3\r\n dep_time flight carrier\r\n \r\n 1 535 1141 AA \r\n 2 551 5716 EV \r\n 3 626 413 AA \r\n 4 627 905 B6 \r\n 5 652 1815 AA \r\n 6 652 117 B6 \r\n 7 711 443 AA \r\n 8 714 825 AA \r\n 9 714 825 AA \r\n 10 717 987 B6 \r\n # ℹ 18 more rows\r\n\r\nNotice how the 8th and 9th rows are repeated here - that’s because 2 AA flights departed in a row (ha!). We can use unique() to remove duplicate rows in the same call to slice():\r\n\r\n\r\nflights_df |> \r\n slice(\r\n unique(sort(c(\r\n which(carrier == \"AA\"),\r\n which(carrier == \"AA\") + 1\r\n )))\r\n )\r\n\r\n # A tibble: 24 × 3\r\n dep_time flight carrier\r\n \r\n 1 535 1141 AA \r\n 2 551 5716 EV \r\n 3 626 413 AA \r\n 4 627 905 B6 \r\n 5 652 1815 AA \r\n 6 652 117 B6 \r\n 7 711 443 AA \r\n 8 714 825 AA \r\n 9 717 987 B6 \r\n 10 724 33 AA \r\n # ℹ 14 more rows\r\n\r\nImportantly, we can do all of this inside slice() because we’re working with integer sets. The integer part allows us to do things like + 1 and sort(), while the set part allows us to combine with c() and remove duplicates with unique().\r\nSubset a selection of rows + multiple following rows\r\nIn this example, let’s problematize our approach with the repeated which() calls in our previous solution.\r\nImagine another scenario where we want to filter for all AA flights and three subsequent flights for each.\r\nDo we need to write the solution out like this? That’s a lot of repetition!\r\n\r\n\r\nflights_df |> \r\n slice(\r\n which(carrier == \"AA\"),\r\n which(carrier == \"AA\") + 1,\r\n which(carrier == \"AA\") + 2,\r\n which(carrier == \"AA\") + 3\r\n )\r\n\r\n\r\nYou might think we can get away with + 0:3, but it doesn’t work as we’d like. The + just forces 0:3 to be (partially) recycled to the same length as carrier for element-wise addition:\r\n\r\n\r\nwhich(flights_df$carrier == \"AA\") + 0:3\r\n\r\n Warning in which(flights_df$carrier == \"AA\") + 0:3: longer object length is not\r\n a multiple of shorter object length\r\n [1] 1 14 20 27 25 28 34 40 38 62 66 68 91 93\r\n\r\nIf only we can get the outer sum of the two arrays, 0:3 and which(carrier == \"AA\") … Oh wait, we can - that’s what outer() does!\r\n\r\n\r\nouter(0:3, which(flights_df$carrier == \"AA\"), `+`)\r\n\r\n [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]\r\n [1,] 1 13 18 24 25 27 32 37 38 61 64 65 91 92\r\n [2,] 2 14 19 25 26 28 33 38 39 62 65 66 92 93\r\n [3,] 3 15 20 26 27 29 34 39 40 63 66 67 93 94\r\n [4,] 4 16 21 27 28 30 35 40 41 64 67 68 94 95\r\n\r\nThis is essentially the repeated which() vectors stacked on top of each other, but as a matrix:\r\n\r\n\r\nprint( which(flights_df$carrier == \"AA\") )\r\nprint( which(flights_df$carrier == \"AA\") + 1 )\r\nprint( which(flights_df$carrier == \"AA\") + 2 )\r\nprint( which(flights_df$carrier == \"AA\") + 3 )\r\n\r\n [1] 1 13 18 24 25 27 32 37 38 61 64 65 91 92\r\n [1] 2 14 19 25 26 28 33 38 39 62 65 66 92 93\r\n [1] 3 15 20 26 27 29 34 39 40 63 66 67 93 94\r\n [1] 4 16 21 27 28 30 35 40 41 64 67 68 94 95\r\n\r\nThe fact that outer() returns all the relevant row indices inside a single matrix is nice because we can collect the indices column-by-column, preserving row order. Matrices, like data frames, are column-major, so coercing a matrix to a vector collapses it column-wise:\r\n\r\n\r\nas.integer( outer(0:3, which(flights_df$carrier == \"AA\"), `+`) )\r\n\r\n [1] 1 2 3 4 13 14 15 16 18 19 20 21 24 25 26 27 25 26 27 28 27 28 29 30 32\r\n [26] 33 34 35 37 38 39 40 38 39 40 41 61 62 63 64 64 65 66 67 65 66 67 68 91 92\r\n [51] 93 94 92 93 94 95\r\n\r\n\r\nOther ways to coerce matrix to vector\r\nThere are two other options for coercing a matrix to vector - c() and as.vector(). I like to stick with as.integer() because that enforces integer type (which makes sense for row indices), and c() can be nice because it’s less to type (although it’s off-label usage):\r\n\r\n\r\n# Not run, but equivalent to `as.integer()` method\r\nas.vector( outer(0:3, which(flights_df$carrier == \"AA\"), `+`) )\r\nc( outer(0:3, which(flights_df$carrier == \"AA\"), `+`) )\r\n\r\n\r\nSomewhat relatedly - and this only works inside the tidy-eval context of slice() - you can get a similar effect of “collapsing” a matrix using the splice operator !!!:\r\n\r\n\r\nseq_matrix <- matrix(1:9, byrow = TRUE, nrow = 3)\r\nas.integer(seq_matrix)\r\n\r\n [1] 1 4 7 2 5 8 3 6 9\r\n\r\nidentical(\r\n mtcars |> slice( as.vector(seq_matrix) ),\r\n mtcars |> slice( !!!seq_matrix )\r\n)\r\n\r\n [1] TRUE\r\n\r\nHere, the !!!seq_matrix was slotting each individual “cell” as argument to slice():\r\n\r\n\r\nrlang::expr( slice(!!!seq_matrix) )\r\n\r\n slice(1L, 4L, 7L, 2L, 5L, 8L, 3L, 6L, 9L)\r\n\r\nA big difference in behavior between as.integer() vs. !!! is that the latter works for lists of indices too, by slotting each element of the list as an argument to slice():\r\n\r\n\r\nseq_list <- list(c(1, 4, 7, 2), c(5, 8, 3, 6, 9))\r\nrlang::expr( slice( !!!seq_list ) )\r\n\r\n slice(c(1, 4, 7, 2), c(5, 8, 3, 6, 9))\r\n\r\nHowever, as you may already know, as.integer() cannot flatten lists:\r\n\r\n\r\nas.integer(seq_list)\r\n\r\n Error in eval(expr, envir, enclos): 'list' object cannot be coerced to type 'integer'\r\n\r\nNote that as.vector() and c() leaves lists as is, which is another reason to prefer as.integer() for type-checking:\r\n\r\n\r\nidentical(seq_list, as.vector(seq_list))\r\nidentical(seq_list, c(seq_list))\r\n\r\n [1] TRUE\r\n [1] TRUE\r\n\r\nFinally, back in our !!!seq_matrix example, we could have applied asplit(MARGIN = 2) to chunk the splicing by matrix column, although the overall effect would be the same:\r\n\r\n\r\nrlang::expr(slice( !!!seq_matrix ))\r\n\r\n slice(1L, 4L, 7L, 2L, 5L, 8L, 3L, 6L, 9L)\r\n\r\nrlang::expr(slice( !!!asplit(seq_matrix, 2) ))\r\n\r\n slice(c(1L, 4L, 7L), c(2L, 5L, 8L), c(3L, 6L, 9L))\r\n\r\nThis lets us ask questions like: Which AA flights departed within 3 flights of another AA flight?\r\n\r\n\r\nflights_df |> \r\n slice( as.integer( outer(0:3, which(carrier == \"AA\"), `+`) ) ) |> \r\n filter( carrier == \"AA\", duplicated(flight) ) |> \r\n distinct(flight, carrier)\r\n\r\n # A tibble: 6 × 2\r\n flight carrier\r\n \r\n 1 825 AA \r\n 2 33 AA \r\n 3 655 AA \r\n 4 1 AA \r\n 5 647 AA \r\n 6 179 AA\r\n\r\n\r\nSlicing all the way down: Case 1\r\nWith the addition of the .by argument to slice() in dplyr v1.10, we can re-write the above code as three calls to slice() (+ a call to select()):\r\n\r\n\r\nflights_df |> \r\n slice( as.integer( outer(0:3, which(carrier == \"AA\"), `+`) ) ) |> \r\n slice( which(carrier == \"AA\" & duplicated(flight)) ) |> # filter()\r\n slice( 1, .by = c(flight, carrier) ) |> # distinct()\r\n select(flight, carrier)\r\n\r\n # A tibble: 6 × 2\r\n flight carrier\r\n \r\n 1 825 AA \r\n 2 33 AA \r\n 3 655 AA \r\n 4 1 AA \r\n 5 647 AA \r\n 6 179 AA\r\n\r\nThe next example will demonstrate another, perhaps more practical usecase for outer() in slice().\r\nFilter (and encode) neighboring rows\r\nLet’s use a subset of the {gapminder} data set for this one. Here, we have data for each European country’s GDP-per-capita by year, between 1992 to 2007:\r\n\r\n\r\ngapminder_df <- gapminder::gapminder |> \r\n left_join(gapminder::country_codes, by = \"country\") |> # `multiple = \"all\"`\r\n filter(year >= 1992, continent == \"Europe\") |> \r\n select(country, country_code = iso_alpha, year, gdpPercap)\r\ngapminder_df\r\n\r\n # A tibble: 120 × 4\r\n country country_code year gdpPercap\r\n \r\n 1 Albania ALB 1992 2497.\r\n 2 Albania ALB 1997 3193.\r\n 3 Albania ALB 2002 4604.\r\n 4 Albania ALB 2007 5937.\r\n 5 Austria AUT 1992 27042.\r\n 6 Austria AUT 1997 29096.\r\n 7 Austria AUT 2002 32418.\r\n 8 Austria AUT 2007 36126.\r\n 9 Belgium BEL 1992 25576.\r\n 10 Belgium BEL 1997 27561.\r\n # ℹ 110 more rows\r\n\r\nThis time, let’s see the desired output (plot) first and build our way up. The goal is to plot the GDP growth of Germany over the years, and its yearly GDP neighbors side-by-side:\r\n\r\n\r\n\r\nFirst, let’s think about what a “GDP neighbor” means in row-relational terms. If you arranged the data by GDP, the GDP neighbors would be the rows that come immediately before and after the rows for Germany. You need to recalculate neighbors every year though, so this arrange() + slice() combo should happen by-year.\r\nWith that in mind, let’s set up a year grouping and arrange by gdpPercap within year:1\r\n\r\n\r\ngapminder_df |> \r\n group_by(year) |> \r\n arrange(gdpPercap, .by_group = TRUE)\r\n\r\n # A tibble: 120 × 4\r\n # Groups: year [4]\r\n country country_code year gdpPercap\r\n \r\n 1 Albania ALB 1992 2497.\r\n 2 Bosnia and Herzegovina BIH 1992 2547.\r\n 3 Turkey TUR 1992 5678.\r\n 4 Bulgaria BGR 1992 6303.\r\n 5 Romania ROU 1992 6598.\r\n 6 Montenegro MNE 1992 7003.\r\n 7 Poland POL 1992 7739.\r\n 8 Croatia HRV 1992 8448.\r\n 9 Serbia SRB 1992 9325.\r\n 10 Slovak Republic SVK 1992 9498.\r\n # ℹ 110 more rows\r\n\r\nNow within each year, we want to grab the row for Germany and its neighboring rows. We can do this by taking the outer() sum of -1:1 and the row indices for Germany:\r\n\r\n\r\ngapminder_df |> \r\n group_by(year) |> \r\n arrange(gdpPercap, .by_group = TRUE) |> \r\n slice( as.integer(outer( -1:1, which(country == \"Germany\"), `+` )) )\r\n\r\n # A tibble: 12 × 4\r\n # Groups: year [4]\r\n country country_code year gdpPercap\r\n \r\n 1 Denmark DNK 1992 26407.\r\n 2 Germany DEU 1992 26505.\r\n 3 Netherlands NLD 1992 26791.\r\n 4 Belgium BEL 1997 27561.\r\n 5 Germany DEU 1997 27789.\r\n 6 Iceland ISL 1997 28061.\r\n 7 United Kingdom GBR 2002 29479.\r\n 8 Germany DEU 2002 30036.\r\n 9 Belgium BEL 2002 30486.\r\n 10 France FRA 2007 30470.\r\n 11 Germany DEU 2007 32170.\r\n 12 United Kingdom GBR 2007 33203.\r\n\r\n\r\nSlicing all the way down: Case 2\r\nThe new .by argument in slice() comes in handy again here, allowing us to collapse the group_by() + arrange() combo into one slice() call:\r\n\r\n\r\ngapminder_df |> \r\n slice( order(gdpPercap), .by = year) |> \r\n slice( as.integer(outer( -1:1, which(country == \"Germany\"), `+` )) )\r\n\r\n # A tibble: 12 × 4\r\n country country_code year gdpPercap\r\n \r\n 1 Denmark DNK 1992 26407.\r\n 2 Germany DEU 1992 26505.\r\n 3 Netherlands NLD 1992 26791.\r\n 4 Belgium BEL 1997 27561.\r\n 5 Germany DEU 1997 27789.\r\n 6 Iceland ISL 1997 28061.\r\n 7 United Kingdom GBR 2002 29479.\r\n 8 Germany DEU 2002 30036.\r\n 9 Belgium BEL 2002 30486.\r\n 10 France FRA 2007 30470.\r\n 11 Germany DEU 2007 32170.\r\n 12 United Kingdom GBR 2007 33203.\r\n\r\nFor our purposes here we want actually the grouping to persist for the following mutate() call, but there may be other cases where you’d want to use slice(.by = ) for temporary grouping.\r\nNow we’re already starting to see the shape of the data that we want! The last step is to encode the relationship of each row to Germany - does a row represent Germany itself, or a country that’s one GDP ranking below or above Germany?\r\nContinuing with our grouped context, we make a new column grp that assigns a factor value \"lo\"-\"is\"-\"hi\" (for “lower” than Germany, “is” Germany and “higher” than Germany) to each country trio by year. Notice the use of fct_inorder() below - this ensures that the factor levels are in the order of their occurrence (necessary for the correct ordering of bars in geom_col() later):\r\n\r\n\r\ngapminder_df |> \r\n group_by(year) |> \r\n arrange(gdpPercap) |> \r\n slice( as.integer(outer( -1:1, which(country == \"Germany\"), `+` )) ) |> \r\n mutate(grp = forcats::fct_inorder(c(\"lo\", \"is\", \"hi\")))\r\n\r\n # A tibble: 12 × 5\r\n # Groups: year [4]\r\n country country_code year gdpPercap grp \r\n \r\n 1 Denmark DNK 1992 26407. lo \r\n 2 Germany DEU 1992 26505. is \r\n 3 Netherlands NLD 1992 26791. hi \r\n 4 Belgium BEL 1997 27561. lo \r\n 5 Germany DEU 1997 27789. is \r\n 6 Iceland ISL 1997 28061. hi \r\n 7 United Kingdom GBR 2002 29479. lo \r\n 8 Germany DEU 2002 30036. is \r\n 9 Belgium BEL 2002 30486. hi \r\n 10 France FRA 2007 30470. lo \r\n 11 Germany DEU 2007 32170. is \r\n 12 United Kingdom GBR 2007 33203. hi\r\n\r\nWe now have everything that’s necessary to make our desired plot, so we ungroup(), write some {ggplot2} code, and voila!\r\n\r\n\r\ngapminder_df |> \r\n group_by(year) |> \r\n arrange(gdpPercap) |> \r\n slice( as.integer(outer( -1:1, which(country == \"Germany\"), `+` )) ) |> \r\n mutate(grp = forcats::fct_inorder(c(\"lo\", \"is\", \"hi\"))) |> \r\n # Ungroup and make ggplot\r\n ungroup() |> \r\n ggplot(aes(as.factor(year), gdpPercap, group = grp)) +\r\n geom_col(aes(fill = grp == \"is\"), position = position_dodge()) +\r\n geom_text(\r\n aes(label = country_code),\r\n vjust = 1.3,\r\n position = position_dodge(width = .9)\r\n ) +\r\n scale_fill_manual(\r\n values = c(\"grey75\", \"steelblue\"),\r\n guide = guide_none()\r\n ) +\r\n theme_classic() +\r\n labs(x = \"Year\", y = \"GDP per capita\")\r\n\r\n\r\n\r\n\r\nSolving the harder version of the problem\r\nThe solution presented above relies on a fragile assumption that Germany will always have a higher and lower ranking GDP neighbor every year. But nothing about the problem description guarantees this, so how can we re-write our code to be more robust?\r\nFirst, let’s simulate a data where Germany is the lowest ranking country in 2002 and the highest ranking in 2007. In other words, Germany only has one GDP neighbor in those years:\r\n\r\n\r\ngapminder_harder_df <- gapminder_df |> \r\n slice( order(gdpPercap), .by = year) |> \r\n slice( as.integer(outer( -1:1, which(country == \"Germany\"), `+` )) ) |> \r\n slice( -7, -12 )\r\ngapminder_harder_df\r\n\r\n # A tibble: 10 × 4\r\n country country_code year gdpPercap\r\n \r\n 1 Denmark DNK 1992 26407.\r\n 2 Germany DEU 1992 26505.\r\n 3 Netherlands NLD 1992 26791.\r\n 4 Belgium BEL 1997 27561.\r\n 5 Germany DEU 1997 27789.\r\n 6 Iceland ISL 1997 28061.\r\n 7 Germany DEU 2002 30036.\r\n 8 Belgium BEL 2002 30486.\r\n 9 France FRA 2007 30470.\r\n 10 Germany DEU 2007 32170.\r\n\r\nGiven this data, we cannot assign the full, length-3 lo-is-hi factor by group, because the groups for year 2002 and 2007 only have 2 observations:\r\n\r\n\r\ngapminder_harder_df |> \r\n group_by(year) |> \r\n mutate(grp = forcats::fct_inorder(c(\"lo\", \"is\", \"hi\")))\r\n\r\n Error in `mutate()`:\r\n ℹ In argument: `grp = forcats::fct_inorder(c(\"lo\", \"is\", \"hi\"))`.\r\n ℹ In group 3: `year = 2002`.\r\n Caused by error:\r\n ! `grp` must be size 2 or 1, not 3.\r\n\r\nThe trick here is to turn each group of rows into an integer sequence where Germany is “anchored” to 2, and then use that vector to subset the lo-is-hi factor:\r\n\r\n\r\ngapminder_harder_df |> \r\n group_by(year) |> \r\n mutate(\r\n Germany_anchored_to_2 = row_number() - which(country == \"Germany\") + 2,\r\n grp = forcats::fct_inorder(c(\"lo\", \"is\", \"hi\"))[Germany_anchored_to_2]\r\n )\r\n\r\n # A tibble: 10 × 6\r\n # Groups: year [4]\r\n country country_code year gdpPercap Germany_anchored_to_2 grp \r\n \r\n 1 Denmark DNK 1992 26407. 1 lo \r\n 2 Germany DEU 1992 26505. 2 is \r\n 3 Netherlands NLD 1992 26791. 3 hi \r\n 4 Belgium BEL 1997 27561. 1 lo \r\n 5 Germany DEU 1997 27789. 2 is \r\n 6 Iceland ISL 1997 28061. 3 hi \r\n 7 Germany DEU 2002 30036. 2 is \r\n 8 Belgium BEL 2002 30486. 3 hi \r\n 9 France FRA 2007 30470. 1 lo \r\n 10 Germany DEU 2007 32170. 2 is\r\n\r\nWe find that the lessons of working with row indices from slice() translated to solving this complex mutate() problem - neat!\r\nAside: kronecker() as as.vector(outer())\r\nFollowing from the slice() + outer() strategy demoed above, imagine if we wanted to filter for \"Luke Skywalker\" and 4 other characters that are neighbors in the height and mass values.\r\n\r\n\r\ndplyr::starwars[, 1:3]\r\n\r\n # A tibble: 87 × 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 C-3PO 167 75\r\n 3 R2-D2 96 32\r\n 4 Darth Vader 202 136\r\n 5 Leia Organa 150 49\r\n 6 Owen Lars 178 120\r\n 7 Beru Whitesun Lars 165 75\r\n 8 R5-D4 97 32\r\n 9 Biggs Darklighter 183 84\r\n 10 Obi-Wan Kenobi 182 77\r\n # ℹ 77 more rows\r\n\r\nIn row-relational terms, “filtering neighboring values” just means “filtering rows after arranging by the values we care about”. We can express this using slice() and outer() as:\r\n\r\n\r\nstarwars %>% \r\n select(name, mass, height) %>% \r\n arrange(mass, height) %>% \r\n slice( as.vector(outer(-1:1, which(grepl(\"(Luke|Anakin) Skywalker\", name)), `+`)) )\r\n\r\n # A tibble: 6 × 3\r\n name mass height\r\n \r\n 1 Wedge Antilles 77 170\r\n 2 Luke Skywalker 77 172\r\n 3 Obi-Wan Kenobi 77 182\r\n 4 Biggs Darklighter 84 183\r\n 5 Anakin Skywalker 84 188\r\n 6 Mace Windu 84 188\r\n\r\nI raised this example on an unrelated thread on the R4DS/DSLC slack, where Anthony Durrant pointed me to kronecker() as a version of outer() that unlist before returning the output.\r\nSo in examples involving outer() to generate row indices in slice(), we can also use kronecker() instead to save a call to a flattening function like as.vector():\r\n\r\n\r\nstarwars %>% \r\n select(name, mass, height) %>% \r\n arrange(mass, height) %>% \r\n slice( kronecker(-1:1, which(grepl(\"(Luke|Anakin) Skywalker\", name)), `+`) )\r\n\r\n # A tibble: 6 × 3\r\n name mass height\r\n \r\n 1 Wedge Antilles 77 170\r\n 2 Biggs Darklighter 84 183\r\n 3 Luke Skywalker 77 172\r\n 4 Anakin Skywalker 84 188\r\n 5 Obi-Wan Kenobi 77 182\r\n 6 Mace Windu 84 188\r\n\r\nI returning to this problem with kronecker() also inspired me to write a function around this. Since slice()-ing with which(...) is just filter(...), I call it filter_around() and give it a filter() with its defaults.\r\n\r\n\r\nfilter_around <- function(.data, ..., by, n = 0, name = NULL) {\r\n data <- .data |>\r\n dplyr::arrange(dplyr::pick({{ by }}))\r\n dots <- rlang::enquos(...)\r\n lgls <- lapply(dots, rlang::eval_tidy, data = data)\r\n inds <- which(as.logical(do.call(pmin, lgls)))\r\n inds_around <- kronecker(inds, -n:n, `+`)\r\n if (!is.null(name)) {\r\n data[[name]] <- replace(rep(FALSE, nrow(data)), inds, TRUE)\r\n }\r\n data |>\r\n dplyr::slice(.env$inds_around)\r\n}\r\n\r\n\r\nPassing conditions to the dots makes it behave like filter():\r\n\r\n\r\ndplyr::starwars[, 1:3] |>\r\n filter_around(\r\n # Two conditions below evaluate to `grepl(\"(Luke|Anakin) Skywalker\", name)`\r\n grepl(\"Skywalker\", name),\r\n name != \"Shmi Skywalker\"\r\n )\r\n\r\n # A tibble: 2 × 3\r\n name height mass\r\n \r\n 1 Luke Skywalker 172 77\r\n 2 Anakin Skywalker 188 84\r\n\r\nAnd other arguments can be used for the “around” behavior:\r\n\r\n\r\n# Filter the target rows *and* a pair of neighbors for each row by height + mass\r\ndplyr::starwars[, 1:3] |>\r\n filter_around(\r\n grepl(\"Skywalker\", name),\r\n name != \"Shmi Skywalker\",\r\n # Extra args\r\n by = c(height, mass),\r\n n = 1,\r\n name = \"target\"\r\n )\r\n\r\n # A tibble: 6 × 4\r\n name height mass target\r\n \r\n 1 Eeth Koth 171 NA FALSE \r\n 2 Luke Skywalker 172 77 TRUE \r\n 3 Greedo 173 74 FALSE \r\n 4 Plo Koon 188 80 FALSE \r\n 5 Anakin Skywalker 188 84 TRUE \r\n 6 Mace Windu 188 84 FALSE\r\n\r\nWindowed min/max/median (etc.)\r\nLet’s say we have this small time series data, and we want to calculate a lagged 3-window moving minimum for the val column:\r\n\r\n\r\nts_df <- tibble(\r\n time = 1:6,\r\n val = sample(1:6 * 10)\r\n)\r\nts_df\r\n\r\n # A tibble: 6 × 2\r\n time val\r\n \r\n 1 1 50\r\n 2 2 40\r\n 3 3 60\r\n 4 4 30\r\n 5 5 20\r\n 6 6 10\r\n\r\nIf you’re new to window functions, think of them as a special kind of group_by() + summarize() where groups are chunks of observations along a (typically unique) continuous measure like time, and observations can be shared between groups.\r\nThere are several packages implementing moving/sliding/rolling window functions. My current favorite is {r2c} (see a review of other implementations therein), but I also like {slider} for an implementation that follows familiar “tidy” design principles:\r\n\r\n\r\nlibrary(slider)\r\nts_df |> \r\n mutate(moving_min = slide_min(val, before = 2L, complete = TRUE))\r\n\r\n # A tibble: 6 × 3\r\n time val moving_min\r\n \r\n 1 1 50 NA\r\n 2 2 40 NA\r\n 3 3 60 40\r\n 4 4 30 30\r\n 5 5 20 20\r\n 6 6 10 10\r\n\r\nMoving window is a general class of operations that encompass any arbitrary summary statistic - so not just min but other reducing functions like mean, standard deviation, etc. But what makes moving min (along with max, median, etc.) a particularly interesting case for our current discussion is that the value comes from an existing observation in the data. And if our time series is tidy, every observation makes up a row. See where I’m going with this?\r\nUsing outer() again, we can take the outer sum of all row indices of ts_df and -2:0. This gives us a matrix where each column represents a lagged size-3 moving window:\r\n\r\n\r\nwindows_3lag <- outer(-2:0, 1:nrow(ts_df), \"+\")\r\nwindows_3lag\r\n\r\n [,1] [,2] [,3] [,4] [,5] [,6]\r\n [1,] -1 0 1 2 3 4\r\n [2,] 0 1 2 3 4 5\r\n [3,] 1 2 3 4 5 6\r\n\r\nThe “lagged size-3” property of this moving window means that the first two windows are incomplete (consisting of less than 3 observations). We want to treat those as invalid, so we can drop the first two columns from our matrix:\r\n\r\n\r\nwindows_3lag[,-(1:2)]\r\n\r\n [,1] [,2] [,3] [,4]\r\n [1,] 1 2 3 4\r\n [2,] 2 3 4 5\r\n [3,] 3 4 5 6\r\n\r\nFor each remaining column, we want to grab the values of val at the corresponding row indices and find which row has the minimum val. In terms of code, we use apply() with MARGIN = 2L to column-wise apply a function where we use which.min() to find the location of the minimum val and convert it back to row index via subsetting:\r\n\r\n\r\nwindows_3lag[, -(1:2)] |> \r\n apply(MARGIN = 2L, \\(i) i[which.min(ts_df$val[i])])\r\n\r\n [1] 2 4 5 6\r\n\r\nNow let’s stick this inside slice(), exploiting the fact that it’s data-masked (ts_df$val can just be val) and exposes context-dependent expressions (1:nrow(ts_df) can just be row_number()):\r\n\r\n\r\nmoving_mins <- ts_df |> \r\n slice(\r\n outer(-2:0, row_number(), \"+\")[,-(1:2)] |> \r\n apply(MARGIN = 2L, \\(i) i[which.min(val[i])])\r\n )\r\nmoving_mins\r\n\r\n # A tibble: 4 × 2\r\n time val\r\n \r\n 1 2 40\r\n 2 4 30\r\n 3 5 20\r\n 4 6 10\r\n\r\nFrom here, we can grab the val column and pad it with NA to add our desired window_min column to the original data frame:\r\n\r\n\r\nts_df |> \r\n mutate(moving_min = c(NA, NA, moving_mins$val))\r\n\r\n # A tibble: 6 × 3\r\n time val moving_min\r\n \r\n 1 1 50 NA\r\n 2 2 40 NA\r\n 3 3 60 40\r\n 4 4 30 30\r\n 5 5 20 20\r\n 6 6 10 10\r\n\r\nAt this point you might think that this is a very round-about way of solving the same problem. But actually I think that it’s a faster route to solving a slightly more complicated problem - augmenting each observation of a data frame with information about comparison observations.\r\nFor example, our slice()-based solution sets us up nicely for also bringing along information about the time at which the moving_min occurred. After some rename()-ing and adding the original time information back in, we get back a relational data structure where time is a key shared with ts_df:\r\n\r\n\r\nmoving_mins2 <- moving_mins |> \r\n rename(moving_min_val = val, moving_min_time = time) |> \r\n mutate(time = ts_df$time[-(1:2)], .before = 1L)\r\nmoving_mins2\r\n\r\n # A tibble: 4 × 3\r\n time moving_min_time moving_min_val\r\n \r\n 1 3 2 40\r\n 2 4 4 30\r\n 3 5 5 20\r\n 4 6 6 10\r\n\r\nWe can then left-join this to the original data to augment it with information about both the value of the 3-window minimum and the time that the minimum occurred:\r\n\r\n\r\nleft_join(ts_df, moving_mins2, by = \"time\")\r\n\r\n # A tibble: 6 × 4\r\n time val moving_min_time moving_min_val\r\n \r\n 1 1 50 NA NA\r\n 2 2 40 NA NA\r\n 3 3 60 2 40\r\n 4 4 30 4 30\r\n 5 5 20 5 20\r\n 6 6 10 6 10\r\n\r\nThis is particularly useful if rows contain other useful information for comparison and you have memory to spare:\r\n\r\n\r\nts_wide_df <- ts_df |> \r\n mutate(\r\n col1 = rnorm(6),\r\n col2 = rnorm(6)\r\n )\r\nts_wide_df\r\n\r\n # A tibble: 6 × 4\r\n time val col1 col2\r\n \r\n 1 1 50 0.0183 0.00501\r\n 2 2 40 0.705 -0.0376 \r\n 3 3 60 -0.647 0.724 \r\n 4 4 30 0.868 -0.497 \r\n 5 5 20 0.376 0.0114 \r\n 6 6 10 0.310 0.00986\r\n\r\nThe below code augments each observation in the original ts_wide_df data with information about the corresponding 3-window moving min (columns prefixed with \"min3val_\")\r\n\r\n\r\nmoving_mins_wide <- ts_wide_df |> \r\n slice(\r\n outer(-2:0, row_number(), \"+\")[,-(1:2)] |> \r\n apply(MARGIN = 2L, \\(i) i[which.min(val[i])])\r\n ) |> \r\n rename_with(~ paste0(\"min3val_\", .x)) |> \r\n mutate(time = ts_wide_df$time[-(1:2)])\r\nleft_join(ts_wide_df, moving_mins_wide, by = \"time\")\r\n\r\n # A tibble: 6 × 8\r\n time val col1 col2 min3val_time min3val_val min3val_col1\r\n \r\n 1 1 50 0.0183 0.00501 NA NA NA \r\n 2 2 40 0.705 -0.0376 NA NA NA \r\n 3 3 60 -0.647 0.724 2 40 0.705\r\n 4 4 30 0.868 -0.497 4 30 0.868\r\n 5 5 20 0.376 0.0114 5 20 0.376\r\n 6 6 10 0.310 0.00986 6 10 0.310\r\n # ℹ 1 more variable: min3val_col2 \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nEvenly distributed row shuffling of balanced categories\r\nSometimes the ordering of rows in a data frame can be meaningful for an external application.\r\nFor example, many experiment-building platforms for psychology research require researchers to specify the running order of trials in an experiment via a csv, where each row represents a trial and each column represents information about the trial.\r\nSo an experiment testing the classic Stroop effect may have the following template:\r\n\r\n\r\nmismatch_trials <- tibble(\r\n item_id = 1:5,\r\n trial = \"mismatch\",\r\n word = c(\"red\", \"green\", \"purple\", \"brown\", \"blue\"),\r\n color = c(\"brown\", \"red\", \"green\", \"blue\", \"purple\")\r\n)\r\nmismatch_trials\r\n\r\n # A tibble: 5 × 4\r\n item_id trial word color \r\n \r\n 1 1 mismatch red brown \r\n 2 2 mismatch green red \r\n 3 3 mismatch purple green \r\n 4 4 mismatch brown blue \r\n 5 5 mismatch blue purple\r\n\r\nWe probably also want to mix in some control trials where the word and color do match:\r\n\r\n\r\nmatch_trials <- mismatch_trials |> \r\n mutate(trial = \"match\", color = word)\r\nmatch_trials\r\n\r\n # A tibble: 5 × 4\r\n item_id trial word color \r\n \r\n 1 1 match red red \r\n 2 2 match green green \r\n 3 3 match purple purple\r\n 4 4 match brown brown \r\n 5 5 match blue blue\r\n\r\nNow that we have all materials for our experiment, we next want the running order to interleave the match and mismatch trials.\r\nWe first add them together into a longer data frame:\r\n\r\n\r\nstroop_trials <- bind_rows(mismatch_trials, match_trials)\r\nstroop_trials\r\n\r\n # A tibble: 10 × 4\r\n item_id trial word color \r\n \r\n 1 1 mismatch red brown \r\n 2 2 mismatch green red \r\n 3 3 mismatch purple green \r\n 4 4 mismatch brown blue \r\n 5 5 mismatch blue purple\r\n 6 1 match red red \r\n 7 2 match green green \r\n 8 3 match purple purple\r\n 9 4 match brown brown \r\n 10 5 match blue blue\r\n\r\nAnd from here we can exploit the fact that all mismatch items come before match items, and that they share the same length of 5:\r\n\r\n\r\nstroop_trials |> \r\n slice( as.integer(outer(c(0, 5), 1:5, \"+\")) )\r\n\r\n # A tibble: 10 × 4\r\n item_id trial word color \r\n \r\n 1 1 mismatch red brown \r\n 2 1 match red red \r\n 3 2 mismatch green red \r\n 4 2 match green green \r\n 5 3 mismatch purple green \r\n 6 3 match purple purple\r\n 7 4 mismatch brown blue \r\n 8 4 match brown brown \r\n 9 5 mismatch blue purple\r\n 10 5 match blue blue\r\n\r\nThis relies on a strong assumptions about the row order in the original data, though. So a safer alternative is to represent the row indices for \"match\" and \"mismatch\" trials as rows of a matrix, and then collapse column-wise.\r\nLet’s try this outside of slice() first. We start with a call to sapply() to construct a matrix where the columns contain row indices for each unique category of trial:\r\n\r\n\r\nsapply(unique(stroop_trials$trial), \\(x) which(stroop_trials$trial == x))\r\n\r\n mismatch match\r\n [1,] 1 6\r\n [2,] 2 7\r\n [3,] 3 8\r\n [4,] 4 9\r\n [5,] 5 10\r\n\r\nThen we transpose the matrix with t(), which rotates it:\r\n\r\n\r\nt( sapply(unique(stroop_trials$trial), \\(x) which(stroop_trials$trial == x)) )\r\n\r\n [,1] [,2] [,3] [,4] [,5]\r\n mismatch 1 2 3 4 5\r\n match 6 7 8 9 10\r\n\r\nNow lets stick that inside slice, remembering to collapse the transposed matrix into vector:\r\n\r\n\r\ninterleaved_stroop_trials <- stroop_trials |> \r\n slice( as.integer(t(sapply(unique(trial), \\(x) which(trial == x)))) )\r\ninterleaved_stroop_trials\r\n\r\n # A tibble: 10 × 4\r\n item_id trial word color \r\n \r\n 1 1 mismatch red brown \r\n 2 1 match red red \r\n 3 2 mismatch green red \r\n 4 2 match green green \r\n 5 3 mismatch purple green \r\n 6 3 match purple purple\r\n 7 4 mismatch brown blue \r\n 8 4 match brown brown \r\n 9 5 mismatch blue purple\r\n 10 5 match blue blue\r\n\r\nAt the moment, we have both “red” word trails showing up together, and then the “green”s, the “purple”s, and so on. If we wanted to introduce some randomness to the presentation order within each type of trial, we can wrap the row indices in sample() to shuffle them first:\r\n\r\n\r\nshuffled_stroop_trials <- stroop_trials |> \r\n slice( as.integer(t(sapply(unique(trial), \\(x) sample(which(trial == x))))) )\r\nshuffled_stroop_trials\r\n\r\n # A tibble: 10 × 4\r\n item_id trial word color \r\n \r\n 1 1 mismatch red brown \r\n 2 5 match blue blue \r\n 3 2 mismatch green red \r\n 4 4 match brown brown \r\n 5 3 mismatch purple green \r\n 6 1 match red red \r\n 7 4 mismatch brown blue \r\n 8 3 match purple purple\r\n 9 5 mismatch blue purple\r\n 10 2 match green green\r\n\r\n\r\nInserting a new row at specific intervals\r\nContinuing with our Stroop experiment template example, let’s say we want to give participants a break every two trials.\r\nIn a matrix representation, this means constructing this 2-row matrix of row indices:\r\n\r\n\r\nmatrix(1:nrow(shuffled_stroop_trials), nrow = 2)\r\n\r\n [,1] [,2] [,3] [,4] [,5]\r\n [1,] 1 3 5 7 9\r\n [2,] 2 4 6 8 10\r\n\r\nAnd adding a row of that represent a separator/break, before collapsing column-wise:\r\n\r\n\r\nmatrix(1:nrow(shuffled_stroop_trials), nrow = 2) |> \r\n rbind(11)\r\n\r\n [,1] [,2] [,3] [,4] [,5]\r\n [1,] 1 3 5 7 9\r\n [2,] 2 4 6 8 10\r\n [3,] 11 11 11 11 11\r\n\r\nUsing slice, this means adding a row to the data representing a break trial first, and then adding a row to the row index matrix representing that row:\r\n\r\n\r\nstroop_with_breaks <- shuffled_stroop_trials |> \r\n add_row(trial = \"BREAK\") |> \r\n slice(\r\n matrix(row_number()[-n()], nrow = 2) |> \r\n rbind(n()) |> \r\n as.integer()\r\n )\r\nstroop_with_breaks\r\n\r\n # A tibble: 15 × 4\r\n item_id trial word color \r\n \r\n 1 1 mismatch red brown \r\n 2 5 match blue blue \r\n 3 NA BREAK \r\n 4 2 mismatch green red \r\n 5 4 match brown brown \r\n 6 NA BREAK \r\n 7 3 mismatch purple green \r\n 8 1 match red red \r\n 9 NA BREAK \r\n 10 4 mismatch brown blue \r\n 11 3 match purple purple\r\n 12 NA BREAK \r\n 13 5 mismatch blue purple\r\n 14 2 match green green \r\n 15 NA BREAK \r\n\r\nIf we don’t want a break after the last trial, we can use negative indexing with slice(-n()):\r\n\r\n\r\nstroop_with_breaks |> \r\n slice(-n())\r\n\r\n # A tibble: 14 × 4\r\n item_id trial word color \r\n \r\n 1 1 mismatch red brown \r\n 2 5 match blue blue \r\n 3 NA BREAK \r\n 4 2 mismatch green red \r\n 5 4 match brown brown \r\n 6 NA BREAK \r\n 7 3 mismatch purple green \r\n 8 1 match red red \r\n 9 NA BREAK \r\n 10 4 mismatch brown blue \r\n 11 3 match purple purple\r\n 12 NA BREAK \r\n 13 5 mismatch blue purple\r\n 14 2 match green green\r\n\r\nWhat about after 3 trials, where the number of trials (10) is not divisibly by 3? Can we still use a matrix?\r\nYes, you’d just need to explicitly fill in the “blanks”!\r\nConceptually, we want a matrix like this, where extra “cells” are padded with 0s (recall that 0s are ignored in slice()):\r\n\r\n\r\nmatrix(c(1:10, rep(0, 3 - 10 %% 3)), nrow = 3)\r\n\r\n [,1] [,2] [,3] [,4]\r\n [1,] 1 4 7 10\r\n [2,] 2 5 8 0\r\n [3,] 3 6 9 0\r\n\r\nAnd this is how that could be implemented inside slice(), minding the fact that adding the break trial increases original row count by 1:\r\n\r\n\r\nshuffled_stroop_trials |> \r\n add_row(trial = \"BREAK\") |> \r\n slice(\r\n c(seq_len(n()-1), rep(0, 3 - (n()-1) %% 3)) |> \r\n matrix(nrow = 3) |> \r\n rbind(n()) |> \r\n as.integer()\r\n ) |> \r\n slice(-n())\r\n\r\n # A tibble: 13 × 4\r\n item_id trial word color \r\n \r\n 1 1 mismatch red brown \r\n 2 5 match blue blue \r\n 3 2 mismatch green red \r\n 4 NA BREAK \r\n 5 4 match brown brown \r\n 6 3 mismatch purple green \r\n 7 1 match red red \r\n 8 NA BREAK \r\n 9 4 mismatch brown blue \r\n 10 3 match purple purple\r\n 11 5 mismatch blue purple\r\n 12 NA BREAK \r\n 13 2 match green green\r\n\r\nHow about inserting a break trial after every \"purple\" word trials?\r\nConceptually, we want a matrix that binds these two vectors as rows before collapsing:\r\n\r\n\r\nprint( 1:nrow(shuffled_stroop_trials) )\r\nprint(\r\n replace(rep(0, nrow(shuffled_stroop_trials)),\r\n which(shuffled_stroop_trials$word == \"purple\"), 11)\r\n)\r\n\r\n [1] 1 2 3 4 5 6 7 8 9 10\r\n [1] 0 0 0 0 11 0 0 11 0 0\r\n\r\nAnd this is how you could do that inside slice():\r\n\r\n\r\nshuffled_stroop_trials |> \r\n add_row(trial = \"BREAK\") |> \r\n slice(\r\n c(seq_len(n()-1), replace(rep(0, n()-1), which(word == \"purple\"), n())) |>\r\n matrix(nrow = 2, byrow = TRUE) |> \r\n as.integer()\r\n )\r\n\r\n # A tibble: 12 × 4\r\n item_id trial word color \r\n \r\n 1 1 mismatch red brown \r\n 2 5 match blue blue \r\n 3 2 mismatch green red \r\n 4 4 match brown brown \r\n 5 3 mismatch purple green \r\n 6 NA BREAK \r\n 7 1 match red red \r\n 8 4 mismatch brown blue \r\n 9 3 match purple purple\r\n 10 NA BREAK \r\n 11 5 mismatch blue purple\r\n 12 2 match green green\r\n\r\nYou might protest that this is a pretty convoluted approach to a seemingly simple problem of inserting rows, and you’d be right!2 Not only is the code difficult to read, you can only insert the same single row over and over.\r\nIt turns out that these cases of row insertion actually fall under the broader class of interweaving unequal categories - let’s see this next.\r\nEvenly distributed row shuffling of unequal categories\r\nLet’s return to our solution for the initial “break every 2 trials” problem:\r\n\r\n\r\nshuffled_stroop_trials |> \r\n add_row(trial = \"BREAK\") |> \r\n slice(\r\n matrix(row_number()[-n()], nrow = 2) |> \r\n rbind(n()) |> \r\n as.integer()\r\n ) |> \r\n slice(-n())\r\n\r\n # A tibble: 14 × 4\r\n item_id trial word color \r\n \r\n 1 1 mismatch red brown \r\n 2 5 match blue blue \r\n 3 NA BREAK \r\n 4 2 mismatch green red \r\n 5 4 match brown brown \r\n 6 NA BREAK \r\n 7 3 mismatch purple green \r\n 8 1 match red red \r\n 9 NA BREAK \r\n 10 4 mismatch brown blue \r\n 11 3 match purple purple\r\n 12 NA BREAK \r\n 13 5 mismatch blue purple\r\n 14 2 match green green\r\n\r\nHere, we were working with a matrix that looks like this, where 11 represents the new row we added representing a break trial:\r\n\r\n [,1] [,2] [,3] [,4] [,5]\r\n [1,] 1 3 5 7 9\r\n [2,] 2 4 6 8 10\r\n [3,] 11 11 11 11 11\r\n\r\nAnd recall that to insert every 3 rows, we needed to pad with 0 first to satisfy the matrix’s rectangle constraint:\r\n\r\n [,1] [,2] [,3] [,4]\r\n [1,] 1 4 7 10\r\n [2,] 2 5 8 0\r\n [3,] 3 6 9 0\r\n [4,] 11 11 11 11\r\n\r\nBut a better way of thinking about this is to have one matrix row representing all row indices, and then add a sparse row that represent breaks:\r\nBreak after every 2 trials:\r\n\r\n\r\nmatrix(c(1:10, rep_len(c(0, 11), 10)), nrow = 2, byrow = TRUE)\r\n\r\n [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]\r\n [1,] 1 2 3 4 5 6 7 8 9 10\r\n [2,] 0 11 0 11 0 11 0 11 0 11\r\n\r\nBreak after every 3 trials:\r\n\r\n\r\nmatrix(c(1:10, rep_len(c(0, 0, 11), 10)), nrow = 2, byrow = TRUE)\r\n\r\n [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]\r\n [1,] 1 2 3 4 5 6 7 8 9 10\r\n [2,] 0 0 11 0 0 11 0 0 11 0\r\n\r\nBreak after every 4 trials:\r\n\r\n\r\nmatrix(c(1:10, rep_len(c(0, 0, 0, 11), 10)), nrow = 2, byrow = TRUE)\r\n\r\n [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]\r\n [1,] 1 2 3 4 5 6 7 8 9 10\r\n [2,] 0 0 0 11 0 0 0 11 0 0\r\n\r\nAnd it turns out that this method generalizes to balanced shuffling across categories that are not equal in size!\r\nLet’s start with a really basic example - here we have three kinds of fruits with varying counts:\r\n\r\n\r\nfruits <- c(\"🍎\", \"🍋\", \"🍇\")[c(2,1,3,3,2,3,1,2,2,1,2,2,3,3,3)]\r\nfruits <- factor(fruits, levels = c(\"🍇\", \"🍋\", \"🍎\"))\r\ntable(fruits)\r\n\r\n fruits\r\n 🍇 🍋 🍎 \r\n 6 6 3\r\n\r\nTheir current order looks like this:\r\n\r\n\r\ncat(levels(fruits)[fruits])\r\n\r\n 🍋 🍎 🍇 🍇 🍋 🍇 🍎 🍋 🍋 🍎 🍋 🍋 🍇 🍇 🍇\r\n\r\nBut I want them to be ordered such that individuals of the same fruit kind are maximally apart from one another. This effectively re-orders the fruits to be distributed “evenly”:\r\n\r\n\r\ncat(levels(fruits)[fruits[c(3,1,2,4,5,0,6,8,10,13,9,0,14,11,7,15,12,0)]])\r\n\r\n 🍇 🍋 🍎 🍇 🍋 🍇 🍋 🍎 🍇 🍋 🍇 🍋 🍎 🍇 🍋\r\n\r\nWith our “build row-wise, collapse col-wise” approach, this takes the following steps:\r\nFind the most frequent category - that N-max becomes the number of columns in the matrix of row indices.\r\nIn this case it’s grapes and lemons, of which there are 6 each:\r\n\r\n\r\ngrape_rows <- which(fruits == \"🍇\")\r\n setNames(grape_rows, rep(\"🍇\", 6))\r\n\r\n 🍇 🍇 🍇 🍇 🍇 🍇 \r\n 3 4 6 13 14 15\r\n\r\n\r\n\r\nlemon_rows <- which(fruits == \"🍋\")\r\n setNames(lemon_rows, rep(\"🍋\", 6))\r\n\r\n 🍋 🍋 🍋 🍋 🍋 🍋 \r\n 1 5 8 9 11 12\r\n\r\nNormalize (“stretch”) all vectors to have the same length as N.\r\nIn this case we need to stretch the apples vector, which is currently only length-3:\r\n\r\n\r\napple_rows <- which(fruits == \"🍎\")\r\n apple_rows\r\n\r\n [1] 2 7 10\r\n\r\nThe desired “sparse” representation is something like this, where each instance of apple is equidistant, with 0s in between:\r\n\r\n\r\napple_rows_sparse <- c(2, 0, 7, 0, 10, 0)\r\n setNames(apple_rows_sparse, c(\"🍎\", \"\", \"🍎\", \"\", \"🍎\", \"\"))\r\n\r\n 🍎 🍎 🍎 \r\n 2 0 7 0 10 0\r\n\r\nThere are many ways to get at this, but one trick involves creating an evenly spaced float sequence from 1 to N-apple over N-max steps:\r\n\r\n\r\nseq(1, 3, length.out = 6)\r\n\r\n [1] 1.0 1.4 1.8 2.2 2.6 3.0\r\n\r\nFrom there, we round the numbers:\r\n\r\n\r\nround(seq(1, 3, length.out = 6))\r\n\r\n [1] 1 1 2 2 3 3\r\n\r\nThen mark the first occurance of each number using !duplicated():\r\n\r\n\r\n!duplicated(round(seq(1, 3, length.out = 6)))\r\n\r\n [1] TRUE FALSE TRUE FALSE TRUE FALSE\r\n\r\nAnd lastly, we initialize a vector of 0s and replace() the TRUEs with apple indices:\r\n\r\n\r\nreplace(\r\n rep(0, 6),\r\n !duplicated(round(seq(1, 3, length.out = 6))),\r\n which(fruits == \"🍎\")\r\n )\r\n\r\n [1] 2 0 7 0 10 0\r\n\r\nStack up the category vectors by row and collapse column-wise:\r\nManually, we would build the full matrix row-by-row like this:\r\n\r\n\r\nfruits_matrix <- matrix(\r\n c(grape_rows, lemon_rows, apple_rows_sparse),\r\n nrow = 3, byrow = TRUE\r\n )\r\n rownames(fruits_matrix) <- c(\"🍇\", \"🍋\", \"🍎\")\r\n fruits_matrix\r\n\r\n [,1] [,2] [,3] [,4] [,5] [,6]\r\n 🍇 3 4 6 13 14 15\r\n 🍋 1 5 8 9 11 12\r\n 🍎 2 0 7 0 10 0\r\n\r\nAnd dynamically we can use sapply() to fill the matrix column-by-column, and then t()-ing the output:\r\n\r\n\r\nfruits_distributed <- sapply(levels(fruits), \\(x) {\r\n n_max <- max(table(fruits))\r\n ind <- which(fruits == x)\r\n nums <- seq(1, length(ind), length.out = n_max)\r\n replace(rep(0, n_max), !duplicated(round(nums)), ind)\r\n }) |> \r\n t()\r\n fruits_distributed\r\n\r\n [,1] [,2] [,3] [,4] [,5] [,6]\r\n 🍇 3 4 6 13 14 15\r\n 🍋 1 5 8 9 11 12\r\n 🍎 2 0 7 0 10 0\r\n\r\nFinally, we collapse the vector and we see that it indeed distributed the fruits evenly!\r\n\r\n\r\nfruits[as.integer(fruits_distributed)]\r\n\r\n [1] 🍇 🍋 🍎 🍇 🍋 🍇 🍋 🍎 🍇 🍋 🍇 🍋 🍎 🍇 🍋\r\n Levels: 🍇 🍋 🍎\r\n\r\nWe can go even further and wrap the dynamic, sapply()-based solution into a function for use within slice(). Here, I also added an optional argument for shuffling within categories:\r\n\r\n\r\nrshuffle <- function(x, shuffle_within = FALSE) {\r\n categories <- as.factor(x)\r\n n_max <- max(table(categories))\r\n sapply(levels(categories), \\(lvl) {\r\n ind <- which(categories == lvl)\r\n if (shuffle_within) ind <- sample(ind)\r\n nums <- seq(1, length(ind), length.out = n_max)\r\n replace(rep(0, n_max), !duplicated(round(nums)), ind)\r\n }) |> \r\n t() |> \r\n as.integer()\r\n}\r\n\r\n\r\nReturning back to our Stroop experiment template example, imagine we also had two filler trials, where no word is shown and just the color flashes on the screen:\r\n\r\n\r\nstroop_fillers <- tibble(\r\n item_id = 1:2,\r\n trial = \"filler\",\r\n word = NA,\r\n color = c(\"red\", \"blue\")\r\n)\r\nstroop_with_fillers <- bind_rows(stroop_fillers, stroop_trials) |> \r\n mutate(trial = factor(trial, c(\"match\", \"mismatch\", \"filler\")))\r\nstroop_with_fillers\r\n\r\n # A tibble: 12 × 4\r\n item_id trial word color \r\n \r\n 1 1 filler red \r\n 2 2 filler blue \r\n 3 1 mismatch red brown \r\n 4 2 mismatch green red \r\n 5 3 mismatch purple green \r\n 6 4 mismatch brown blue \r\n 7 5 mismatch blue purple\r\n 8 1 match red red \r\n 9 2 match green green \r\n 10 3 match purple purple\r\n 11 4 match brown brown \r\n 12 5 match blue blue\r\n\r\nWe can evenly shuffle between the unequal trial types with our new rshuffle() function:\r\n\r\n\r\nstroop_with_fillers |> \r\n slice( rshuffle(trial, shuffle_within = TRUE) )\r\n\r\n # A tibble: 12 × 4\r\n item_id trial word color \r\n \r\n 1 2 match green green \r\n 2 2 mismatch green red \r\n 3 1 filler red \r\n 4 1 match red red \r\n 5 4 mismatch brown blue \r\n 6 3 match purple purple\r\n 7 3 mismatch purple green \r\n 8 2 filler blue \r\n 9 4 match brown brown \r\n 10 5 mismatch blue purple\r\n 11 5 match blue blue \r\n 12 1 mismatch red brown\r\n\r\nConclusion\r\nWhen I started drafting this blog post, I thought I’d come with a principled taxonomy of row-relational operations. Ha. This was a lot trickier to think through than I thought.\r\nBut I hope that this gallery of esoteric use-cases for slice() inspires you to use it more, and to think about “tidy” solutions to seemingly “untidy” problems.\r\n\r\nThe .by_group = TRUE is not strictly necessary here, but it’s good for visually inspecting the within-group ordering.↩︎\r\nAlthough row insertion is a generally tricky problem for column-major data frame structures, which is partly why dplyr’s row manipulation verbs have stayed experimental for quite some time.↩︎\r\n", "preview": "posts/2023-06-11-row-relational-operations/preview.png", - "last_modified": "2024-09-23T10:29:43-04:00", + "last_modified": "2024-09-23T07:29:43-07:00", "input_file": {}, "preview_width": 1800, "preview_height": 1080 @@ -208,7 +208,7 @@ ], "contents": "\r\n\r\nContents\r\nIntro\r\nTL;DR - Big takeaways\r\nSetup\r\nQuick example\r\nList of 💜s and 💔s\r\n1) 💜 The distinctness of the “grouped df” type\r\n2) 💜 The imperative -! variants\r\n3) 💔 Competition between Base.filter() and DataFrames.subset()\r\n4) 💜 The operation specification syntax is like {data.table}’s j on steroids\r\n5) 💜 Rowwise operations with ByRow() and eachrow()\r\n6) 💔 Confusingly, select() is more like dplyr::transmute() than dplyr::select()\r\n7) 💔 Selection helpers are not powered by boolean algebra\r\n8) 💜 groupby() has select-semantics\r\n9) 💔 No special marking of context-dependent expressions\r\n10) 💜 The op-spec syntax gives you dplyr::across()/c_across() for free\r\n\r\nConcluding thoughts\r\nOverall impression\r\nNext steps\r\n\r\n\r\nIntro\r\nDataFrames.jl is a Julia package for data wrangling.\r\nAs of this writing it is at v1.4.x - it’s a mature library that’s been in active development for over a decade.1\r\nFor some background, I comfortably switch between {dplyr} and {data.table}, having used both for nearly 5 years.\r\nI love digging into the implementational details of both - I really appreciate the thoughtfulness behind {dplyr}’s tidyeval/tidyselect semantics, as well as {data.table}’s conciseness and abstraction in the j.\r\nI have not been exposed to any other data wrangling frameworks but was recently compelled to learn Julia for independent reasons,2 so I decided why not pick up Julia-flavored data wrangling while I’m at it?\r\nThis blog post is a rough (and possibly evolving?) list of my first impressions of DataFrames.jl and “DataFrames.jl accessories”, namely Chain.jl and DataFramesMeta.jl.3\r\nIf you’re Julia-curious and/or just want to hear an R person talk about how another language does data wrangling differently, you’re the target audience!\r\nHowever, this blog post is NOT:\r\nMy first impressions of the Julia language or a pitch for why you should use Julia. If you want that from an R user’s perspective, check out Trang Le’s blog post and the Julia documentation on “Noteworthy differences from R”.\r\nA DataFrames.jl tutorial. But if you’re curious, aside from the docs I learned almost exclusively from Bogumił Kamiński’s JuliaCon 2022 workshop, the Julia Data Science book, and the Julia for Data Analysis book.4\r\nA {dplyr}/{data.table} to DataFrames.jl translation cheatsheet since those already exist, though I’ll be doing some of that myself when it helps illustrate a point.\r\nAll of this to say that I have no skin in the game and I don’t endorse or represent anything I write here.\r\nIn fact I’m a Julia noob myself (it’s only been like 3 months) so take everything with a grain of salt and please feel free to let me know if I did anything wrong or inefficiently!\r\nTL;DR - Big takeaways\r\nThe syntax mimics {dplyr} but works more like {data.table} under the hood. There’s a bit of unlearning to do for {dplyr} users.\r\nThere are not as many idiomatic ways of doing things like in {dplyr}. Whereas you can get very far in {dplyr} without thinking much about base R, learning DataFrames.jl requires a good amount of “base” Julia first (especially distinctions between data types, which R lacks).\r\nI love Chain.jl but I’m not that drawn to DataFramesMeta.jl because it feels like {dtplyr}5 - I’d personally rather just focus on learning the thing itself.\r\nSome aspects of DataFrames.jl are relatively underdeveloped IMO (e.g., context dependent expressions) but it’s in active development and I plan to stick around to see more.\r\nSetup\r\n\r\n\r\nR\r\n\r\n\r\n# R v4.2.1\r\nlibrary(dplyr) # v1.0.10\r\nlibrary(data.table) # v1.14.5\r\nmtcars_df <- mtcars |>\r\n as_tibble(rownames = \"model\") |>\r\n type.convert(as.is = TRUE)\r\nmtcars_dt <- as.data.table(mtcars_df)\r\n\r\n\r\n\r\n\r\nJulia\r\n\r\n# Julia v1.8.2\r\nusing DataFrames # (v1.4.3)\r\nusing DataFramesMeta # (v0.12.0) Also imports Chain.jl\r\n# using Chain.jl (v0.5.0)\r\nusing StatsBase # (v0.33.21) Like base R {stats}\r\nusing RDatasets # (v0.7.7) Self-explanatory; like the {Rdatasets} package\r\nmtcars = RDatasets.dataset(\"datasets\", \"mtcars\")\r\n 32×12 DataFrame\r\n Row │ Model MPG Cyl Disp HP DRat WT QS ⋯\r\n │ String31 Float64 Int64 Float64 Int64 Float64 Float64 Fl ⋯\r\n ─────┼──────────────────────────────────────────────────────────────────────────\r\n 1 │ Mazda RX4 21.0 6 160.0 110 3.9 2.62 ⋯\r\n 2 │ Mazda RX4 Wag 21.0 6 160.0 110 3.9 2.875\r\n 3 │ Datsun 710 22.8 4 108.0 93 3.85 2.32\r\n 4 │ Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215\r\n 5 │ Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 ⋯\r\n 6 │ Valiant 18.1 6 225.0 105 2.76 3.46\r\n 7 │ Duster 360 14.3 8 360.0 245 3.21 3.57\r\n 8 │ Merc 240D 24.4 4 146.7 62 3.69 3.19\r\n ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱\r\n 26 │ Fiat X1-9 27.3 4 79.0 66 4.08 1.935 ⋯\r\n 27 │ Porsche 914-2 26.0 4 120.3 91 4.43 2.14\r\n 28 │ Lotus Europa 30.4 4 95.1 113 3.77 1.513\r\n 29 │ Ford Pantera L 15.8 8 351.0 264 4.22 3.17\r\n 30 │ Ferrari Dino 19.7 6 145.0 175 3.62 2.77 ⋯\r\n 31 │ Maserati Bora 15.0 8 301.0 335 3.54 3.57\r\n 32 │ Volvo 142E 21.4 4 121.0 109 4.11 2.78\r\n 5 columns and 17 rows omitted\r\n\r\n\r\n\r\nQuick example\r\nFrom mtcars…\r\nFilter for rows that represent \"Merc\"6 car models\r\nCalculate the average mpg by cyl\r\nReturn a new column called kmpg that converts miles to kilometers (1:1.61)\r\n\r\n\r\n{dplyr}\r\n\r\n\r\nmtcars_df |>\r\n filter(stringr::str_detect(model, \"^Merc \")) |>\r\n group_by(cyl) |>\r\n summarize(kmpg = mean(mpg) * 1.61)\r\n\r\n # A tibble: 3 × 2\r\n cyl kmpg\r\n \r\n 1 4 38.0\r\n 2 6 29.8\r\n 3 8 26.2\r\n\r\n\r\n\r\n{data.table}\r\n\r\n\r\nmtcars_dt[model %like% \"^Merc \", .(kmpg = mean(mpg) * 1.61), by = cyl]\r\n\r\n cyl kmpg\r\n \r\n 1: 4 37.996\r\n 2: 6 29.785\r\n 3: 8 26.243\r\n\r\n\r\n\r\nDataFrames.jl\r\n\r\n@chain mtcars begin\r\n subset(:Model => x -> occursin.(r\"^Merc \", x))\r\n groupby(:Cyl)\r\n combine(:MPG => (x -> mean(x) * 1.61) => :kmpg)\r\nend\r\n 3×2 DataFrame\r\n Row │ Cyl kmpg\r\n │ Int64 Float64\r\n ─────┼────────────────\r\n 1 │ 4 37.996\r\n 2 │ 6 29.785\r\n 3 │ 8 26.243\r\n\r\n\r\n\r\nDataFramesMeta.jl\r\n\r\n@chain mtcars begin\r\n @rsubset(occursin(r\"^Merc \", :Model))\r\n groupby(:Cyl)\r\n @combine(:kmpg = mean(:MPG) * 1.61)\r\nend\r\n 3×2 DataFrame\r\n Row │ Cyl kmpg\r\n │ Int64 Float64\r\n ─────┼────────────────\r\n 1 │ 4 37.996\r\n 2 │ 6 29.785\r\n 3 │ 8 26.243\r\n\r\n\r\n\r\nList of 💜s and 💔s\r\nDisclaimer: These are not a list of like/dislike or approve/disapprove. The 💜 and 💔 are just vibes - you know what I mean?\r\n1) 💜 The distinctness of the “grouped df” type\r\nIn {dplyr}, it’s easy to think of grouping as a transient operation.\r\nWe don’t really think about group_by() as a data-wrangling function because the returned object is visually very similar to the input.\r\nCoupled with the general expectation that object attributes in R tend to be ephemeral, group_by() often gets treated like a second-class citizen even though grouping is sticky, making it a frequent culprit of subtle bugs.\r\n\r\n\r\ngroup_by(mtcars_df, cyl)\r\n\r\n # A tibble: 32 × 12\r\n # Groups: cyl [3]\r\n model mpg cyl disp hp drat wt qsec vs am gear carb\r\n \r\n 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4\r\n 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4\r\n 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1\r\n 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1\r\n 5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2\r\n 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1\r\n 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4\r\n 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2\r\n 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2\r\n 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4\r\n # … with 22 more rows\r\n\r\nIn {data.table}, it’s at the opposite extreme.\r\nGrouping via by= is right there in the slogan dt[i,j,by], sharing a privileged status with row selection i and column manipulation j.\r\nGrouping as an operation is very prominent in the code but that also makes it feel a bit too integrated for my taste.\r\nIt feels too close to SQL, whereas I like my data wrangling operations to be more modular and portable.\r\nI think that DataFrames.jl hits a good middle ground.\r\nIt keeps groupby() as a stand-alone operation while marking the output as very clearly distinct from the input.\r\nThe returned GroupedDataFrame type, when printed, visually splits the dataframe by groups:7\r\n\r\ngroupby(mtcars, :Cyl)\r\n GroupedDataFrame with 3 groups based on key: Cyl\r\n First Group (11 rows): Cyl = 4\r\n Row │ Model MPG Cyl Disp HP DRat WT QSec ⋯\r\n │ String31 Float64 Int64 Float64 Int64 Float64 Float64 Float6 ⋯\r\n ─────┼──────────────────────────────────────────────────────────────────────────\r\n 1 │ Datsun 710 22.8 4 108.0 93 3.85 2.32 18.6 ⋯\r\n 2 │ Merc 240D 24.4 4 146.7 62 3.69 3.19 20.0\r\n ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱\r\n 10 │ Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9\r\n 11 │ Volvo 142E 21.4 4 121.0 109 4.11 2.78 18.6\r\n 5 columns and 7 rows omitted\r\n ⋮\r\n Last Group (14 rows): Cyl = 8\r\n Row │ Model MPG Cyl Disp HP DRat WT QS ⋯\r\n │ String31 Float64 Int64 Float64 Int64 Float64 Float64 Fl ⋯\r\n ─────┼──────────────────────────────────────────────────────────────────────────\r\n 1 │ Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 ⋯\r\n 2 │ Duster 360 14.3 8 360.0 245 3.21 3.57\r\n ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱\r\n 14 │ Maserati Bora 15.0 8 301.0 335 3.54 3.57\r\n 5 columns and 11 rows omitted\r\n\r\n2) 💜 The imperative -! variants\r\nMany verbs in DataFrames.jl have a -! suffix counterpart, like transform()/transform!().\r\nThese -! forms indicate a mutate-in-place operation, which modifies an object it’s called with as a side-effect.8\r\nIt’s conceptually like {data.table}’s set() or the walrus operator := in the j, with no equivalent in {dplyr}.9\r\nI love pure functions and I’m a sucker for functional programming, but one place where I can really see myself using the -! form a lot is for rename!() and sort!(), latter being equivalent to dplyr::arrange():10\r\n\r\nmtcars_sorted = copy(mtcars); # Make a copy\r\nsort!(mtcars_sorted, :MPG); # Sort the copy by ascending MPG value, in-place\r\nmtcars_sorted[ [begin, end], [:Model, :MPG] ] # Check in first and last row\r\n\r\n\r\n 2×2 DataFrame\r\n Row │ Model MPG\r\n │ String31 Float64\r\n ─────┼─────────────────────────────\r\n 1 │ Cadillac Fleetwood 10.4\r\n 2 │ Toyota Corolla 33.9\r\n\r\nThis is also an efficient way of dropping columns that you know you won’t use.11 In {data.table}, you can do this with := but in the funky form of assigning columns to NULL which I never really liked.12\r\n\r\n\r\nmtcars_dropcol <- copy(mtcars_dt)\r\nmtcars_dropcol[, mpg := NULL] # drop single column\r\nmtcars_dropcol[, `:=`(wt = NULL, vs = NULL)] # drop multiple columns in function form\r\nnames(mtcars_dropcol)\r\n\r\n [1] \"model\" \"cyl\" \"disp\" \"hp\" \"drat\" \"qsec\" \"am\" \"gear\" \"carb\"\r\n\r\nIn DataFrames.jl that’s just select!() instead of select(). No copies are created in the calls to select!():\r\n\r\nmtcars_dropcol = copy(mtcars);\r\nselect!(mtcars_dropcol, Not(:MPG));\r\nselect!(mtcars_dropcol, Not([:WT, :VS]));\r\nnames(mtcars_dropcol)\r\n 9-element Vector{String}:\r\n \"Model\"\r\n \"Cyl\"\r\n \"Disp\"\r\n \"HP\"\r\n \"DRat\"\r\n \"QSec\"\r\n \"AM\"\r\n \"Gear\"\r\n \"Carb\"\r\n\r\n3) 💔 Competition between Base.filter() and DataFrames.subset()\r\nIn DataFrames.jl, there are two ways of subsetting rows:\r\nThe Base.filter() generic which has been extended with a DataFrame method. It works kind of like base R’s Filter() in that it takes the object to filter as its second argument:\r\n\r\nfilter(:Gear => x -> x .== 5, mtcars)\r\n 5×12 DataFrame\r\n Row │ Model MPG Cyl Disp HP DRat WT QSec ⋯\r\n │ String31 Float64 Int64 Float64 Int64 Float64 Float64 Float ⋯\r\n ─────┼──────────────────────────────────────────────────────────────────────────\r\n 1 │ Porsche 914-2 26.0 4 120.3 91 4.43 2.14 16 ⋯\r\n 2 │ Lotus Europa 30.4 4 95.1 113 3.77 1.513 16\r\n 3 │ Ford Pantera L 15.8 8 351.0 264 4.22 3.17 14\r\n 4 │ Ferrari Dino 19.7 6 145.0 175 3.62 2.77 15\r\n 5 │ Maserati Bora 15.0 8 301.0 335 3.54 3.57 14 ⋯\r\n 5 columns omitted\r\n\r\nThe DataFrames.subset() function which, like other DataFrame transformation verbs, takes the dataframe as its first argument similar to dplyr::filter():\r\n\r\nsubset(mtcars, :Gear => x -> x .== 5)\r\n 5×12 DataFrame\r\n Row │ Model MPG Cyl Disp HP DRat WT QSec ⋯\r\n │ String31 Float64 Int64 Float64 Int64 Float64 Float64 Float ⋯\r\n ─────┼──────────────────────────────────────────────────────────────────────────\r\n 1 │ Porsche 914-2 26.0 4 120.3 91 4.43 2.14 16 ⋯\r\n 2 │ Lotus Europa 30.4 4 95.1 113 3.77 1.513 16\r\n 3 │ Ford Pantera L 15.8 8 351.0 264 4.22 3.17 14\r\n 4 │ Ferrari Dino 19.7 6 145.0 175 3.62 2.77 15\r\n 5 │ Maserati Bora 15.0 8 301.0 335 3.54 3.57 14 ⋯\r\n 5 columns omitted\r\n\r\nI understand the rationale for supporting Julia’s standard filter() function for DataFrame types, but I feel like it’s something that the average DataFrames.jl user shouldn’t encounter unless they specifically go looking for it.\r\nIn my experience this hasn’t been the case, and it’s caused me a lot of confusion when I was first learning DataFrames.jl - I consider the simultaneous teaching of filter() and subset() in the same chapter as the one flaw in the otherwise flawless book Julia Data Science.\r\n4) 💜 The operation specification syntax is like {data.table}’s j on steroids\r\nThe “operation specification syntax”, also called the “transformation mini language”13 refers to the templatic form:\r\n\r\n[input columns] => [transformation function] => [output columns]\r\n\r\n… available inside verbs like select(), transform(), combine(), and subset().\r\nIf I say “take the MPG column and multiply it by 1.61 to create a new column called KMPG”, then in DataFrames.jl that’s:\r\n\r\n@chain mtcars begin\r\n transform( :MPG => (x -> x * 1.61) => :KMPG )\r\n # Below operations cut down rows and cols to save printing space\r\n select(:MPG, :KMPG)\r\n first(5)\r\nend\r\n 5×2 DataFrame\r\n Row │ MPG KMPG\r\n │ Float64 Float64\r\n ─────┼──────────────────\r\n 1 │ 21.0 33.81\r\n 2 │ 21.0 33.81\r\n 3 │ 22.8 36.708\r\n 4 │ 21.4 34.454\r\n 5 │ 18.7 30.107\r\n\r\n… where x -> x * 1.61 inside transform() is an anonymous function14 like R’s \\(x) x * 1.61.\r\nI make the comparison to {data.table}’s j because of the flexibility with what you can compute and return inside that special environment.\r\nFor example, let’s say you want to treat cyl like a categorical variable and do a one-hot encoding for cyl==4, cyl==6, and cyl==8. To comply with R/Julia’s variable naming rules, let’s name these new columns cyl_4, cyl_6, and cyl_8.\r\nIn {data.table} it just takes an lapply() inside the j - you can rely on the invariant that as long as an expression in j evaluates to a list, you will get columns back:\r\n\r\n\r\nmtcars_onehot <- copy(mtcars_dt)\r\ncyl_vals <- sort(unique(mtcars_onehot$cyl)) # [1] 4 6 8\r\ncyl_cols <- paste0(\"cyl_\", cyl_vals) # [1] \"cyl_4\" \"cyl_6\" \"cyl_8\"\r\nmtcars_onehot[, (cyl_cols) := lapply(cyl_vals, \\(x) cyl == x)]\r\nmtcars_onehot[, (setdiff(names(mtcars_onehot), c(\"model\", cyl_cols))) := NULL]\r\nmtcars_onehot[1:5,]\r\n\r\n model cyl_4 cyl_6 cyl_8\r\n \r\n 1: Mazda RX4 FALSE TRUE FALSE\r\n 2: Mazda RX4 Wag FALSE TRUE FALSE\r\n 3: Datsun 710 TRUE FALSE FALSE\r\n 4: Hornet 4 Drive FALSE TRUE FALSE\r\n 5: Hornet Sportabout FALSE FALSE TRUE\r\n\r\nLikewise, in DataFrames.jl, you can rely on the invariant that as long as you’re complying with the op-spec and returning certain data types,15 they will become columns in the table:\r\n\r\nmtcars_onehot = copy(mtcars);\r\ncyl_vals = sort(unique(mtcars_onehot.Cyl));\r\ncyl_cols = \"cyl_\" .* string.(cyl_vals);\r\n@chain mtcars_onehot begin\r\n transform!( :Cyl .=> ByRow(x -> cyl_vals .== x) => cyl_cols )\r\n select!(:Model, Cols(r\"^cyl_\"))\r\nend;\r\nmtcars_onehot[1:5,:]\r\n 5×4 DataFrame\r\n Row │ Model cyl_4 cyl_6 cyl_8\r\n │ String31 Bool Bool Bool\r\n ─────┼────────────────────────────────────────\r\n 1 │ Mazda RX4 false true false\r\n 2 │ Mazda RX4 Wag false true false\r\n 3 │ Datsun 710 true false false\r\n 4 │ Hornet 4 Drive false true false\r\n 5 │ Hornet Sportabout false false true\r\n\r\nRead on for more about the ByRow() function wrapping the anonymous function inside transform!().\r\n5) 💜 Rowwise operations with ByRow() and eachrow()\r\nI mentioned at the beginning that DataFrames.jl doesn’t have a lot of idiomatic ways of doing things but one rare case that I really appreciate is ByRow().\r\nUnder the hood, ByRow() is actually just a function factory that takes a function as input and a broadcasted version of the function as output, kind of like R’s Vectorize():\r\n\r\nByRow(round)([1.2, 3.3]) # same as round.([1.2, 3.3])\r\n 2-element Vector{Float64}:\r\n 1.0\r\n 3.0\r\n\r\nI call ByRow() idiomatic because, as you can see, it doesn’t have anything to do with DataFrame rows.\r\nNevertheless it goes perfectly hand in hand with the op-spec syntax because you can just wrap the middle “[transformation function]” component in ByRow() and call it a day:\r\n\r\nselect(mtcars, :MPG => ByRow( x -> Integer(round(x)) ) => :MPG_whole )[1:3,:]\r\n 3×1 DataFrame\r\n Row │ MPG_whole\r\n │ Int64\r\n ─────┼───────────\r\n 1 │ 21\r\n 2 │ 21\r\n 3 │ 23\r\n\r\nWhereas the equivalent using the . would take two broadcasted functions:\r\n\r\nselect(mtcars, :MPG => ( x -> Integer.(round.(x)) ) => :MPG_whole )[1:3,:]\r\n 3×1 DataFrame\r\n Row │ MPG_whole\r\n │ Int64\r\n ─────┼───────────\r\n 1 │ 21\r\n 2 │ 21\r\n 3 │ 23\r\n\r\nI especially like ByRow() in these cases because I can use Julia’s function composition operator ∘ (\\circ) to re-write x -> Integer(round(x)) into Integer ∘ round, which just looks sooo clean:\r\n\r\nselect(mtcars, :MPG => ByRow(Integer ∘ round) => :MPG_whole )[1:3,:]\r\n 3×1 DataFrame\r\n Row │ MPG_whole\r\n │ Int64\r\n ─────┼───────────\r\n 1 │ 21\r\n 2 │ 21\r\n 3 │ 23\r\n\r\nThe one small qualm I have with ByRow() though is that the docs equate it to dplyr::rowwise() when really it’s more comparable to purrr::map_*().\r\ndplyr::rowwise() is unlike ByRow() because it’s a function that takes a dataframe as input and returns an object of class .\r\nIn that sense, it’s actually more similar to Base.eachrow() which you use to convert a DataFrame into a DataFrameRow object:16\r\n\r\nmtcars_rowwise = eachrow(mtcars[1:5, 1:3])\r\n 5×3 DataFrameRows\r\n Row │ Model MPG Cyl\r\n │ String31 Float64 Int64\r\n ─────┼───────────────────────────────────\r\n 1 │ Mazda RX4 21.0 6\r\n 2 │ Mazda RX4 Wag 21.0 6\r\n 3 │ Datsun 710 22.8 4\r\n 4 │ Hornet 4 Drive 21.4 6\r\n 5 │ Hornet Sportabout 18.7 8\r\n\r\nSimilar to how column-major dataframes are essentially a list of vectors representing each column, the row-major DataFrameRow object is essentially a vector of NamedTuples representing each row under the hood:\r\n\r\ncopy.(mtcars_rowwise)\r\n 5-element Vector{NamedTuple{(:Model, :MPG, :Cyl), Tuple{InlineStrings.String31, Float64, Int64}}}:\r\n (Model = InlineStrings.String31(\"Mazda RX4\"), MPG = 21.0, Cyl = 6)\r\n (Model = InlineStrings.String31(\"Mazda RX4 Wag\"), MPG = 21.0, Cyl = 6)\r\n (Model = InlineStrings.String31(\"Datsun 710\"), MPG = 22.8, Cyl = 4)\r\n (Model = InlineStrings.String31(\"Hornet 4 Drive\"), MPG = 21.4, Cyl = 6)\r\n (Model = InlineStrings.String31(\"Hornet Sportabout\"), MPG = 18.7, Cyl = 8)\r\n\r\n6) 💔 Confusingly, select() is more like dplyr::transmute() than dplyr::select()\r\nInterestingly, select() isn’t just for selecting columns - it can modify columns too:\r\n\r\n@chain mtcars begin\r\n select(:Model, :MPG => (x -> x .* 1.61) => :KMPG)\r\n first(3)\r\nend\r\n 3×2 DataFrame\r\n Row │ Model KMPG\r\n │ String31 Float64\r\n ─────┼────────────────────────\r\n 1 │ Mazda RX4 33.81\r\n 2 │ Mazda RX4 Wag 33.81\r\n 3 │ Datsun 710 36.708\r\n\r\nIn this sense DataFrames.select() feels more like dplyr::transmute() rather than the identically named dplyr::select():17\r\n\r\n\r\nmtcars_df |>\r\n transmute(model, kmpg = mpg * 1.61) |>\r\n head(3)\r\n\r\n # A tibble: 3 × 2\r\n model kmpg\r\n \r\n 1 Mazda RX4 33.8\r\n 2 Mazda RX4 Wag 33.8\r\n 3 Datsun 710 36.7\r\n\r\nI think it’s misleading that select() can also transform() in DataFrames.jl, although I don’t complain that dplyr::select() can also dplyr::rename() so idk:\r\n\r\n\r\nmtcars_df |>\r\n select(car_model = model, mpg) |>\r\n head(3)\r\n\r\n # A tibble: 3 × 2\r\n car_model mpg\r\n \r\n 1 Mazda RX4 21 \r\n 2 Mazda RX4 Wag 21 \r\n 3 Datsun 710 22.8\r\n\r\n7) 💔 Selection helpers are not powered by boolean algebra\r\nThe fact that DataFrames.select() behaves like dplyr::transmute() might explain why you don’t really get a rich {tidyselect}-esque interface to column selection.18\r\nThis has been very challenging as someone who loves {tidyselect}, but DataFrames.jl has been getting new column selection helpers like Cols(), Not(), Between(), etc. which makes things a bit easier compared to {data.table} at least.\r\nBut I don’t really vibe with the implementation of column selection helpers as set operation, as opposed to boolean algebra.19\r\nI wish I’m told that I’m wrong, but it feels really clunky to do something like “select columns that are numeric and has a name that’s two characters long”. In DataFrames.jl that’s:\r\n\r\n@chain mtcars begin\r\n select(_, intersect( names(_, Float64), names(_, r\"^..$\") ) )\r\n first(1)\r\nend\r\n 1×1 DataFrame\r\n Row │ WT\r\n │ Float64\r\n ─────┼─────────\r\n 1 │ 2.62\r\n\r\nWhereas in {dplyr} you just write each condition as a predicate joined by &:\r\n\r\n\r\nmtcars_df |>\r\n select( where(is.double) & matches(\"^..$\") ) |>\r\n head(1)\r\n\r\n # A tibble: 1 × 1\r\n wt\r\n \r\n 1 2.62\r\n\r\n8) 💜 groupby() has select-semantics\r\nOne thing that’s always bugged me a little in {dplyr} is the fact that dplyr::group_by() has mutate-semantics.\r\nSo like, despite the fact that people pretty much only ever20 use group_by() like select() in the form of group_by(col1, col2, col3), you can’t use {tidyselect} helpers:\r\n\r\n\r\nmtcars_df |>\r\n group_by( matches(\"^[va]\") ) |>\r\n summarize(mpg = mean(mpg), .groups = 'drop')\r\n\r\n Error in `group_by()`:\r\n ! Problem adding computed columns.\r\n Caused by error in `mutate()`:\r\n ! Problem while computing `..1 = matches(\"^[va]\")`.\r\n Caused by error:\r\n ! `matches()` must be used within a *selecting* function.\r\n ℹ See for\r\n details.\r\n\r\nInstead you need to bridge select-semantics and mutate-semantics using across()21, which ensures that group_by() receives column vectors:\r\n\r\n\r\nmtcars_df |>\r\n group_by( across(matches(\"^[va]\")) ) |>\r\n summarize(mpg = mean(mpg), .groups = 'drop')\r\n\r\n # A tibble: 4 × 3\r\n vs am mpg\r\n \r\n 1 0 0 15.0\r\n 2 0 1 19.8\r\n 3 1 0 20.7\r\n 4 1 1 28.4\r\n\r\nIn DataFrames.jl, however, groupby() has select-semantics.22\r\nThat lets you use column selection helpers like Cols() to dynamically choose columns like:\r\n\r\n@chain mtcars begin\r\n groupby( Cols(r\"^[VA]\") )\r\n combine(:MPG => mean, renamecols = false)\r\nend\r\n 4×3 DataFrame\r\n Row │ VS AM MPG\r\n │ Int64 Int64 Float64\r\n ─────┼───────────────────────\r\n 1 │ 0 0 15.05\r\n 2 │ 0 1 19.75\r\n 3 │ 1 0 20.7429\r\n 4 │ 1 1 28.3714\r\n\r\n9) 💔 No special marking of context-dependent expressions\r\nIn {dplyr} and {data.table} you get “context dependent expressions” like n() and .N which returns information about the dataframe that you’re currently manipulating.\r\nSo for example if I want to calculate counts by group, in {dplyr} I can do:\r\n\r\n\r\nmtcars_df |>\r\n group_by(cyl) |>\r\n summarize(n = n())\r\n\r\n # A tibble: 3 × 2\r\n cyl n\r\n \r\n 1 4 11\r\n 2 6 7\r\n 3 8 14\r\n\r\nAnd in {data.table} I can do:\r\n\r\n\r\nmtcars_dt[, .(n = .N), by = cyl]\r\n\r\n cyl n\r\n \r\n 1: 6 7\r\n 2: 4 11\r\n 3: 8 14\r\n\r\nLikewise in DataFrames.jl, I can use nrow:\r\n\r\n@chain mtcars begin\r\n groupby(:Cyl)\r\n combine(nrow => :n)\r\nend\r\n 3×2 DataFrame\r\n Row │ Cyl n\r\n │ Int64 Int64\r\n ─────┼──────────────\r\n 1 │ 4 11\r\n 2 │ 6 7\r\n 3 │ 8 14\r\n\r\nSpecial keywords like nrow can be used in the op-spec syntax, and they essentially take up the first and middle “[input columns] => [transformation function]” slots.\r\nBut here’s a thing that’s super confusing about nrow as a context dependent expression.\r\nUnlike {dplyr}’s n() or {data.table}’s .N, Julia’s nrow()23 is also a stand-alone function:\r\n\r\nnrow(mtcars)\r\n 32\r\n\r\nNow imagine I wrote a nrow2() function which does the same thing but with Base.size(), the equivalent of base R’s dim():\r\n\r\nnrow2 = function(x)\r\n size(x, 1) # returns the first dimension (row)\r\nend;\r\nnrow2(mtcars)\r\n 32\r\n\r\nYou might expect our new nrow2 to behave like nrow in the context we just saw, but it doesn’t!\r\n\r\n@chain mtcars begin\r\n groupby(:Cyl)\r\n combine(nrow2 => :n)\r\nend\r\n\r\n\r\n ArgumentError: Unrecognized column selector var\"#101#102\"() => :n in AsTable constructor\r\n\r\nBecause nrow2() is not a special operator like nrow, it gets evaluated ordinarily in the first component of the op-spec syntax (“[input columns]”), throwing a rather unhelpful error message about an unrecognized column selector.\r\nBut it’s really difficult to figure out this exceptionalism of nrow because it also works expectedly outside!\r\nNow compare this to the treatment of {data.table}’s .N and {dplyr}’s n(), where the former isn’t exported (it’s not even defined in the package) and the latter throws a helpful error message:\r\n\r\n\r\ndplyr::n()\r\n\r\n Error in `n()`:\r\n ! Must be used inside dplyr verbs.\r\n\r\nBut thankfully, this is an area of ongoing development in DataFrames.jl so I’m hopeful that the documentation will catch up too.\r\nFor example, v1.4 recently added a couple more context dependent expressions like groupindices, which is equivalent to {dplyr}’s cur_group_id() and {data.table}’s .GRP:\r\n\r\n@chain mtcars begin\r\n groupby(:Cyl)\r\n combine(groupindices)\r\nend\r\n 3×2 DataFrame\r\n Row │ Cyl groupindices\r\n │ Int64 Int64\r\n ─────┼─────────────────────\r\n 1 │ 4 1\r\n 2 │ 6 2\r\n 3 │ 8 3\r\n\r\nHowever, I still feel like there’s ways to go on the clarity front.\r\nLike, if the user doesn’t already have a mental model of context dependent expressions, then it might be confusing that you can use groupindices like this:\r\n\r\ngroupindices( groupby(mtcars[1:5,:], :Cyl) )\r\n 5-element Vector{Union{Missing, Int64}}:\r\n 2\r\n 2\r\n 1\r\n 2\r\n 3\r\n\r\nBut you can’t use nrow in the same way:\r\n\r\nnrow( groupby(mtcars[1:5,:], :Cyl) )\r\n\r\n\r\n MethodError: no method matching nrow(::GroupedDataFrame{DataFrame})\r\n Closest candidates are:\r\n nrow(!Matched::SubDataFrame) at C:\\Users\\jchoe\\.julia\\packages\\DataFrames\\KKiZW\\src\\subdataframe\\subdataframe.jl:157\r\n nrow(!Matched::DataFrame) at C:\\Users\\jchoe\\.julia\\packages\\DataFrames\\KKiZW\\src\\dataframe\\dataframe.jl:459\r\n\r\nBecause, independently of their shared status as context dependent expressions in the op-spec syntax, groupindices and nrow differ as stand-alone functions (only the former has a method defined for GroupedDataFrame types).\r\n10) 💜 The op-spec syntax gives you dplyr::across()/c_across() for free\r\nSo probably my favorite thing about the op-spec syntax is that the leftmost “[input columns]” component can be a vector of multiple columns.\r\nWhen combined with the broadcasted version of the => operator, you get dplyr::across() for free:24\r\n\r\nselect(mtcars, [:DRat, :QSec] .=> ByRow(Integer ∘ round) => uppercase)[1:5,:]\r\n 5×2 DataFrame\r\n Row │ DRAT QSEC\r\n │ Int64 Int64\r\n ─────┼──────────────\r\n 1 │ 4 16\r\n 2 │ 4 17\r\n 3 │ 4 19\r\n 4 │ 3 19\r\n 5 │ 3 17\r\n\r\nIn {dplyr}, the above would be:\r\n\r\n\r\nmtcars_df |>\r\n transmute(\r\n across(\r\n .cols = c(drat, qsec),\r\n .fns = \\(x) as.integer(round(x)),\r\n .names = \"{toupper(.col)}\"\r\n )\r\n ) |>\r\n head(5)\r\n\r\n # A tibble: 5 × 2\r\n DRAT QSEC\r\n \r\n 1 4 16\r\n 2 4 17\r\n 3 4 19\r\n 4 3 19\r\n 5 3 17\r\n\r\nHere’s another fun one replicating {dplyr}’s rowwise() + c_across() workflow.\r\nAs a very contrived example, let’s say I want to make two new columns, calculating the min/max across float/double columns by-row.\r\nIn DataFrames.jl that can be done by selecting multiple columns and sending them in batch to a vararg function:\r\n\r\n@chain mtcars begin\r\n select(_,\r\n :Model,\r\n names(_, Float64) => ByRow( (x...) -> extrema(x) ) => [:min, :max]\r\n )\r\n first(5)\r\nend\r\n\r\n\r\n 5×3 DataFrame\r\n Row │ Model min max\r\n │ String Float64 Float64\r\n ─────┼─────────────────────────────────────\r\n 1 │ Mazda RX4 2.62 160.0\r\n 2 │ Mazda RX4 Wag 2.875 160.0\r\n 3 │ Datsun 710 2.32 108.0\r\n 4 │ Hornet 4 Drive 3.08 258.0\r\n 5 │ Hornet Sportabout 3.15 360.0\r\n\r\nThat kind of operation is costly in {dplyr} because it requires a rowwise() context to operate on columns selected by c_across(), the result of which must then be converted to dataframe:\r\n\r\n\r\nmtcars_df |>\r\n rowwise() |>\r\n transmute(\r\n model,\r\n c_across(where(is.double)) |>\r\n range() |>\r\n as.data.frame.list(col.names = c(\"min\", \"max\"))\r\n ) |>\r\n ungroup() |>\r\n head(5)\r\n\r\n # A tibble: 5 × 3\r\n model min max\r\n \r\n 1 Mazda RX4 2.62 160\r\n 2 Mazda RX4 Wag 2.88 160\r\n 3 Datsun 710 2.32 108\r\n 4 Hornet 4 Drive 3.08 258\r\n 5 Hornet Sportabout 3.15 360\r\n\r\nFor this particular problem you can use pmax and pmin, but that’s of course not generalizable to other arbitrary operations:25\r\n\r\n\r\nmtcars_df |>\r\n transmute(\r\n model,\r\n min = do.call(pmin, across(where(is.double))),\r\n max = do.call(pmax, across(where(is.double)))\r\n ) |>\r\n head(5)\r\n\r\n # A tibble: 5 × 3\r\n model min max\r\n \r\n 1 Mazda RX4 2.62 160\r\n 2 Mazda RX4 Wag 2.88 160\r\n 3 Datsun 710 2.32 108\r\n 4 Hornet 4 Drive 3.08 258\r\n 5 Hornet Sportabout 3.15 360\r\n\r\nTake this benchmarking with a grain of salt but on my machine the DataFrames.jl solution takes ~0.1ms, {dplyr}’s rowwise() + c_across() solution takes ~30ms and {dplyr}’s across() + pmin()/pmax() solution takes ~3ms.\r\nUpdate: a {data.table} solution that’s probably not the most efficient it could be\r\n\r\n\r\nmtcars_dt_rowwise <- copy(mtcars_dt)\r\nmtcars_dt_rowwise[, c(\"min\", \"max\") := as.list(range(.SD)), by = .I, .SDcols = is.double]\r\nmtcars_dt_rowwise[1:5, .(model, min, max)]\r\n\r\n model min max\r\n \r\n 1: Mazda RX4 2.620 160\r\n 2: Mazda RX4 Wag 2.875 160\r\n 3: Datsun 710 2.320 108\r\n 4: Hornet 4 Drive 3.080 258\r\n 5: Hornet Sportabout 3.150 360\r\n\r\nConcluding thoughts\r\nOverall impression\r\nOverall, DataFrames.jl has a cool design.\r\nLearning it has been mostly painless, and writing this blog post has been super fun.\r\nI personally found it to be a great entry point from R to Julia as someone who primarily uses R for data analysis, and would recommend this route if you’re Julia-curious and you like learning by doing.\r\nI was also pleasantly surprised by how much of my experience with {dplyr} and {data.table} transferred over to me learning an entirely different data wrangling framework.\r\nAnd this goes the other way as well - DataFrames.jl helps me appreciate many aspects of {dplyr} and {data.table} that I used to take for granted.\r\nNext steps\r\nA list of other things that I want to explore in DataFrames.jl but haven’t had the chance to:\r\nI hear that Julia and DataFrames.jl have interesting ways of representing and operating on vectors/columns with missing values.\r\nUnlike R where you have NA_character_, NA_integer_, and so on, Julia just has one Missing type and vectors with missing values get a mixed type like Union{Missing, Int64}:\r\n\r\n[1, 2, missing]\r\n 3-element Vector{Union{Missing, Int64}}:\r\n 1\r\n 2\r\n missing\r\n\r\nNesting/unnesting looks amazing but I need to learn more about Julia’s data types first, and the pros and cons of each for different nested column workflows.\r\nJoins and pivoting. You have the usual set of *join() verbs and stack()/unstack() which correspond to base R functions of the same name. Haven’t tried them out yet but looks pretty straightforward from skimming the docs.\r\nRow manipulation operations like append()!/prepend!() which seems to works more like {data.table} than {dplyr}’s eternally-experimental row_*() functions.\r\nThe whole Metadata section of the docs. That includes stuff like keys I think.26\r\nInteraction with Arrow.jl, which is apparently written in pure Julia!\r\nJust more benchmarking and experimentation so I can contribute to the language-wars discourse with hot takes optimize my bilingual R & Julia workflows.\r\n\r\nMaking it slightly younger than {plyr}/{dplyr} and {data.table} by a few years↩︎\r\nI’m doing a semester of independent study on simulation-based power analysis for mixed effects models, where speed is crucial. For that I decided to switch over from {lme4} and pick up MixedModels.jl.↩︎\r\nThere’s also the relatively newer DataFramesMacros.jl that makes DataFrames.jl code even closer to {dplyr}, namely by making the transformation verbs rowwise by default to save you from reasoning about broadcasting, for better or for worse.↩︎\r\nThe Julia for Data Analysis book is estimated for Jan 2023 but I’ve been reading the previews and recommend it very strongly. Bogumił Kamiński, the author of the book and the JuliaCon 2022 workshop (also a core developer of DataFrames.jl) also has a blog full of DataFrames.jl goodies.↩︎\r\nAlthough @macroexpand is very cool and works like dplyr::show_query(), so learning DataFramesMeta.jl shouldn’t interfere with learning DataFrames.jl.↩︎\r\nAbbreviation of Mercedes-Benz, apparently.↩︎\r\nThe visual effect here is similar to what you get from dplyr::group_split(), a costly operation that returns the input dataframe split into a list of dataframes.↩︎\r\nThe -! suffix is a convention for side-effect (or “imperative”) functions also found in other programming languages, like my first language Racket.↩︎\r\nThough you can mimic it in {dplyr} with {magrittr}’s %<>% pipe.↩︎\r\nsort!() probably because I consider row order to be more like a dataframe’s metadata, given how fragile and not-guaranteed-to-be-stable it is.↩︎\r\nAlthough I usually do that inside the file-reading step, like in read_csv(file, col_select = ...).↩︎\r\nFamously the 8.1.55-56th circle of the R inferno.↩︎\r\nI’m just gonna call it op-spec syntax from now on because these names are way too long.↩︎\r\nWrapped in parantheses in the code to make sure that the right-side => operator is not parsed as part of the function.↩︎\r\nNamedTuple, AbstractDataFrame, DataFrameRow, AbstractMatrix, to name a few.↩︎\r\nAlso see the @eachrow! macro from DataFramesMeta.jl for a modify-in-place version.↩︎\r\nOr, keeping up with the current dev version {dplyr} which superceded transmute(), it’d be mutate(..., .keep = \"none\").↩︎\r\nAlthough I feel like you totally could since the “[input columns]” component of the op-spec gets evaluated in its own separate environment anyways (unlike {data.table}’s j where the three components are mashed together). There was an attempt at integrating the now-stale(?) Selections.jl which looked interesting, but alas…↩︎\r\nThe {tidyselect} backend of {dplyr} supports both, namely with ! and - variants, though they recently switched to emphasizing the boolean algebra technique with !.↩︎\r\nThough the mutate-semantics lets you do things like group_by(decade = 10 * (year %/% 10)).↩︎\r\nOr the new pick() function in the upcoming version.↩︎\r\nBut not mutate-semantics, although I won’t miss that.↩︎\r\nWhich is not part of Base but instead comes from DataAPI.jl.↩︎\r\nActually, dplyr::across(.cols = c(col1, col2)) is more like AsTable([:col1, :col2]) => ... => ... in op-spec, but I’ve found that distinction to be seldom important.↩︎\r\nAlso see @mariviere’s solution using pmap_dfr().↩︎\r\nUpdate: Bogumił reached out to let me know of the WIP TableMetadataTools.jl package designed to “make the most common operations on metadata convenient”. I didn’t know about this but apparently metadata is a big topic in the DataFrames.jl world - exciting!↩︎\r\n", "preview": "posts/2022-11-13-dataframes-jl-and-accessories/preview.jpg", - "last_modified": "2022-11-15T09:59:05-05:00", + "last_modified": "2022-11-15T06:59:05-08:00", "input_file": {} }, { @@ -228,7 +228,7 @@ ], "contents": "\r\n\r\nContents\r\nContext\r\nSubmission process\r\nWalkshops and talks I\r\nattended\r\nWorkshops\r\nKeynotes\r\nR in Teaching\r\nBuilding the R Community\r\n1\r\nInterfaces with C,\r\nC++, Rust, and V\r\nExpanding\r\nTidyverse\r\nLearning\r\nggplot2\r\n\r\n\r\nContext\r\nI’ve had my eyes on useR!\r\n2022 ever since they announced it - I had my package {ggtrace}\r\nthat I’d been developing for about half a year by the time calls for\r\nabstracts were announced, and thought that it’d make a good submission\r\nfor my “debut” into the “academic” R world. I’ve built up a bit of a\r\nresistance against going to online conferences over the covid years but\r\nI’ve never been to an R conference before and I’ve heard good things\r\nabout the last virtual useR!\r\n2021, so I was actually very excited about the prospect of\r\nattending.\r\nIn short I’m very very glad I did and it was very educational! It was\r\na great first R conference for me.\r\nAlso if you’re wondering why I’m writing this over a month after the\r\nconference ended, it’s because I had to immediately switch gears to rstudioconf\r\nprep (sorry!).\r\nSubmission process\r\nThe submission process was incredibly simple,1 and\r\nthat low barrier to entry was part of the reason that convinced me to\r\napply. The submission form consisted of a 250-word\r\nabstract that I typed up into a textbox input field, along with\r\nrelevant links about my project.2\r\nI also applied for the diversity scholarship, which was a separate\r\nform sent to me after I got accepted for a talk. I’ve never applied for\r\nthese kind of things before, but when it came to R I felt like I could\r\nreally use this to advocate for myself and others who share my\r\nbackground (students in traditionally “humanities” and “soft science”\r\ndepartments) to convince their programs about the benefits of students\r\nbeing involved in the R community. I’m fortunate enough to be in a\r\nprogram that’s recently been making a lot of effort to incorporate\r\nR/programming into doing science,3 but me, being selfish,\r\nwanted even more R in my life. The conference generously offered\r\nme the diversity scholarship and I am incredibly honored! It covered\r\nmy registration fee and allowed me to attend two workshops, which I’ll\r\ntalk about later.\r\nHere’s the timeline of how things went for me:\r\nMarch 4: I submitted my abstract for a talk\r\nMarch 15: Abstract deadline for talks and elevator pitches\r\nApril 19: Notification of acceptance\r\nApril 27: Invitation to apply for the diversity scholarship\r\nMay 13: Notification of diversity scholarship offer\r\nWalkshops and talks I\r\nattended\r\nHere are my notes from the workshops and some of the talks I\r\nattended, in no particular order. Keep in mind that I was attending from\r\nhome in Korea, so I missed a few live talks and didn’t have great notes\r\nfor others (I’ll be catching up through the recordings).\r\nRecordings\r\nare available on the conference youtube channel\r\nWorkshops\r\nIsabella\r\nBicalho Frazeto Introduction to dimensional reduction in\r\nR: First off, big respect to Isabella for leading a 3.5 hour\r\nworkshop all by herself. Content-wise, the workshop was a balanced,\r\nnot-too-overwhelming overview of dimensional reduction,4\r\nstarting with a mini lecture of dimensional reduction methods in Part 1,\r\nfollowed by a hands-on walkthrough of packages and functions in R\r\nimplementing these methods in Parts 2 and 3. I knew only very little\r\nabout dimensional reduction going in, and I came away from it with at\r\nleast a conceptual understanding of PCA and ICA, as well as the\r\nimportant difference between the two in theory and practice. Getting an\r\nanswer to “what the heck is PCA/ICA?” was my main goal going into the\r\nworkshop, so I’m very satisfied with what I learned. Isabella was also\r\nnice enough to answer my question about the difference between PCA and\r\nfPCA\r\nduring break time - she had never heard about it before,5 but\r\nshe did a quick research on it live, read a whole wikipedia article\r\nabout fPCA, and then dumb-ed it down for me all in the span of like 3\r\nminutes.\r\nDanielle\r\nNavarro, Jonathan Keane, & Stephanie Hazlitt\r\nLarger-than-memory data workflows with Apache Arrow: I\r\ndon’t personally work with big data much but a lot of people around me\r\ndo. And when they do, they often have run into issues trying to analyze\r\ntheir data in-memory, so I figured I’d learn Arrow and “spread the\r\nword,” so to speak. The workshop was AWESOME and used a ton of carefully crafted learning\r\nmaterials, which are also available online. To be honest I took this\r\nlive in the middle of the night for me in Korea time, so I wasn’t that\r\nengaged with the live exercises, but the online materials are so well\r\norganized that I feel like I didn’t miss much between that and the\r\nworkshop recording. Beyond the actual content of workshop exercises, I\r\nespecially liked that a good chunk of time was spent going through the\r\ndesign of Arrow and where Arrow fits in the data-analysis pipeline and\r\nhow it compares to alternatives in the data storage ecosystem. Also\r\n{arrow} + {duckdb} looks very cool - I’m now\r\nan official convert!\r\nKeynotes\r\nPaula\r\nMoraga: This was so interesting even for me knowing nothing\r\nabout public health and spatial statistics going in, and I learned a lot\r\nfrom the talk. It was very information-packed and really showed how you\r\ncould use R in literally all aspects of research. I was especially\r\nimpressed by the idea that you could analyze your uncertainty about a\r\nmeasure (e.g., %vaccinated in a region) as data itself, and use that to\r\ninform future data collection decisions. It was refreshing to hear\r\nuncertainty being talked about in that way because I often only hear\r\npeople talking about proactive solutions for minimizing\r\nuncertainty, or about improving precision in quantifying\r\nuncertainty. Paula’s work was like a special kind of applied statistics\r\nwhich explicitly leverages the fact that you can drive future data\r\ncollection practices using properties of uncertainty in current/prior\r\ndata.\r\nAmanda\r\nCox: Words cannot express how good this talk was. Amanda\r\nCox is a force of nature. I was super looking forward to this given my\r\ninterests in data viz and data journalism, and I was not disappointed.\r\nThere were so many “mind = blown” moments but to keep things short, here\r\nare my two favorites:\r\nYou\r\nDraw It: How Family Income Predicts Children’s College Chances:\r\nEveryone knows that income and education are highly linked, but just how\r\ntight is the connection? Turns out that when the NYT team was\r\ninvestigating this question, they ended up with data that showed a\r\nsuspiciously linear trend to the point of appearing almost uninteresting\r\nand sterile. But they turned this mundaneness of the data into a\r\nsurprise factor by testing it against people’s expectations, allowing\r\nreaders to draw their own trend line to compare to the data. Not only\r\ndid that get people engaged, people got creative with what they drew6 resulting in the piece creating a\r\nvery rich interactive experience. The team correctly determined that the\r\ndata was boring and uninteresting to plot, but then they turned it\r\naround and made something magnificent out of the situation.\r\nOne\r\nRace, Every Medalist Ever: Amanda introduced this piece as an\r\nexample of how R could be used anywhere, even in places where it didn’t\r\nneed to be used.7 She made us watch a whooping 2:30\r\nminute video about olympic sprinters, only to reveal that R was involved\r\nin something seemingly trivial.8 I love the fact that she\r\ndid this build-up in the talk though - I wouldn’t have fully appreciated\r\nthe point if she just said it outright.\r\n\r\nR\r\nin Teaching\r\nAmelia McNamara Teaching modeling in\r\nintroductory statistics: A comparison of formula and tidyverse\r\nsyntaxes: I’ve been hearing about this project on twitter for a\r\nwhile and this was my first time listening to a talk about it. I think\r\nthis is super neat and this kind of “meta” research on R use, with R, is\r\nsomething that I’ve been interested in as well9. A\r\nlot of thought went into the design of the teaching and I admired the\r\nrobustness in the experimental design and how much thought went into it.\r\nIt’s also just always interesting to hear about how things are like to\r\nnovices - you don’t get that from introspection anymore as experienced\r\nusers.\r\nJonathan Love jamovi: An\r\nR-based statistical spreadsheet for the masses: I’ve actually never\r\nseen a WYSIWYG-style statistical program in action. I was fortunate\r\nenough to go through school during the R revolution in university intro\r\nstats courses,10 but I always appreciated people who\r\nworked on these GUI-based tools for beginners because boy learning R at\r\nfirst was so hard! I was really surprised at the fact that jamovi could\r\nproduce full reports and that it also had like 40-50\r\ncommunity-contributed modules. Also TIL jamovi’s sister program jasp\r\nstands for “Jonathan’s Awesome Statistical Program,”\r\nCarsten Lange A better way to teach\r\nhistograms, using the TeachHist package: I love it when a package\r\njust aims to do one thing and does that one thing well.\r\n{TeachHist} is that package - it runs a shiny app that\r\nteaches intro stats students the motivation and intuition behind\r\nhistograms as a data viz tool, and how to read statistical properties of\r\nthe data off of histograms. Purists will not like the design of the\r\nfunctions in the package (e.g., you have big functions with lots of\r\narguments vs. modules composing a histogram like in\r\n{ggplot2}) but I really liked how this was designed with\r\nnon-programmer students in mind.11 I also really liked\r\nthat the talk live-demoed the package/app, since that also demos the\r\nreal-world case of when you’d be using it (live, in front of\r\nstudents).\r\nKirsten B. Gorman et al. The untold story of\r\npalmerpenguins: This talk was actually a call for a\r\nlot of self-reflection for me, because to be honest when\r\n{palmerpenguins} first came out and there was big hype\r\naround it on twitter, I didn’t understand really understand why. I\r\nthought, “it’s just a data package,” and found the dataset to be\r\nunderwhelming (as in, it’s not exactly a big dataset). Of course, I’ve\r\ndone a complete 180° since then and I adore this package now - the data\r\njust works for many different kinds of simple reprexes that I\r\nmake for teaching and answering questions.12\r\nBut also there’s the side of the “story of palmerpenguins” that’s not at\r\nall about the data itself, which was the focus of this talk - the side\r\nabout the people and the community around this project and all the\r\neffort they put in to make palmerpenguins dethrone the\r\niris dataset and get representation in other big-name\r\nplaces like tensorflow.\r\nBecause, you know what? If someone asked me to make a data package go\r\nviral I’d have no idea where to even start, and that thought racing past\r\nmy head while I was listening to this talk was really humbling. I guess\r\nthat’s really what a lot of what making a good data package for\r\npedagogical purposes is, right?13 Its partly about the\r\nactual data, but it’s also about the framing, the “marketing”,\r\nthe cuteness factor of penguins, all the hand-made art, the awesome\r\npkgdown website,14 and more. The fact that you can get\r\npeople excited about a data package is insane, and I really liked\r\nhearing about all the small-to-big issues the\r\n{palmerpenguins} team had to think through.15\r\nLike, did you know about palmerpenguins::path_to_file() for\r\nteaching students how to read in files from a path?\r\nBuilding\r\nthe R Community 1\r\nNjoki Lucy et al. Building an R-Ladies\r\ncommunity during the Covid-19 pandemic: I had been hearing A LOT\r\nabout the RLadies Nairobi team on twitter (from Njoki, Shel, and others) leading up\r\nto this talk, and I’ve always been super impressed at their efforts.\r\nThey’re so innovative and social-media-savvy despite being a new group.\r\nFor example, a while back they hosted a Twitter Space16\r\nand invited Shannon\r\nPileggi for a chat. I I attended that and thought it was so\r\nimpressive and creative - they have practically mastered the art of\r\ngetting people together online. One big wow moment from the talk was\r\nthat they apparently started planning the chapter before covid,\r\nand had to pivot to starting it online, all in the span of like half a\r\nyear!\r\nInterfaces\r\nwith C, C++, Rust, and V\r\nDavid B. Dahl Writing R extensions in\r\nRust: I’m not familiar with any “low level” programming languages,\r\nbut Rust has been catching my eye recently because a few R folks were\r\ndoing cool stuff in it (e.g., {gifski}, {sinab}, {string2path}).\r\nAs expected, all the technical details of the talk went over my head,\r\nbut it was well-motivated and I was made aware of the fact that there\r\nare competing frameworks for R-Rust interoperability (rextender\r\nand cargo).\r\nI wasn’t planning on picking up Rust anytime soon, so I guess I’ll watch\r\nout for a CRAN taskview or something.\r\nExpanding\r\nTidyverse\r\nPatrick Weiss et al. Tidy Finance with\r\nR: You know those projects that you stumble upon and it’s your\r\nfirst time hearing about it but it’s had years of work behind it and\r\nyou’re like “how did I not know about this before”? Well this is one of\r\nthose, and the Tidy Finance with\r\nR project deserves more love! I’m not in this field but it’s always\r\nexciting when people write intro R/data science books because there are\r\nalways hidden gems in there, even if you’re an experienced R user and\r\nnot from that field. For example, there’s a lot of cool\r\ntimeseries/{lubridate} stuff in the book, so I’ll probably\r\nbe using this as reference for those at least.\r\nBryan Shalloway Five ways to do “tidy”\r\npairwise operations: I’ve actually been following this project for\r\na while as Bryan has been\r\ntweeting about it, and I think it’s really neat. We don’t often do\r\noperations over pairs of vectors, but this kind of workflow is such a\r\nwell-defined class of problems to tackle that it’s definitely worthwhile\r\nspending time to figure out the right design for implementing it (kind\r\nof like how {slider}\r\ndoes that for the class of sliding-window functions). The talk was a\r\ngreat overview of something I knew close to nothing about, like the fact\r\nthat {corrr} has the colpair_map() functional\r\nfor arbitrary pairwise operations.\r\nLearning\r\nggplot2\r\nJonathan Carroll ggeasy: Easy\r\naccess to ggplot2 commands: I’ve been hearing about\r\nthis package a lot and I think it’s one-of-a-kind package addresses a\r\nvery common need among ggplot users. It works surprisingly well since\r\ntheme() is modular to begin with,17\r\nso you can layer the shorthand ggeasy::easy_*() functions\r\nas if you’re adding layers - no extra overhead for learners/users! I\r\nasked a question about dealing with vocabulary differences (e.g.,\r\n“panel” = “facet” = “small multiple”) and Jonathan answered that the aim\r\nis to support all variants, so go open that PR if your dialect/idiolect\r\nisn’t represented!\r\nNicola Rennie Learning ggplot2\r\nwith generative art: I LOVED this talk! The talk started with a\r\ngreat point about how it’s kinda intimidating/hard for people to get\r\ninto generative art because you don’t just start by copying code for\r\ngenerative art - art is about the process, so it doesn’t lend\r\nitself to how people normally learn things in programming, which is by\r\ncopying code just to get something working first and then going from\r\nthere. I totally related to that because that was my case too. I feel\r\nmore confident about it now, after seeing examples from the talk about\r\nhow code can inform art and vice-versa. For example, there was a really\r\nhandy trick of “adding arbitrary layers” in a plot by just adding whole\r\nggplots with transparent backgrounds on top of each other using {patchwork}.\r\nI’m definitely stealing this idea, for art or not.\r\nJames Otto (& David Kahle)\r\nggdensity: Improved bivariate density visualization in\r\nR: {ggdensity}\r\nis an example of my favorite kind of packages. It offers a drop-in\r\nreplacement for a standard function that improves upon its defaults\r\n(from ggplot2::geom_density_2d_filled() to\r\nggdensity::geom_hdr()), while providing additional\r\ncapabilities that satisfies folks who want/need to do more complicated\r\nthings. Like I’ve never thought more than 30 seconds about the design\r\nchoices of making a density plot - “It’s just a density plot, what more\r\ndo I need?” And then now I’m like “okay I want all of those cool new\r\nfeatures.” I’m gonna take this moment to put it on record that I\r\nactually used {ggdensity} the DAY that its CRAN release was\r\nannounced, to make a plot in a conference proceedings paper which\r\n(fingers crossed) will get accepted soon. We got reviews back recently\r\nand the reviewers really liked the plot, so this talk was extra\r\nspecial.\r\nMe Stepping into ggplot2\r\ninternals with ggtrace: I spoke in this session and\r\nhere’s the repo to the talk materials\r\nif you want to check it out. I don’t know what to say for my own talk\r\nbut I’ll just mention that I prepped this talk with the rstudioconf talk\r\nin mind, so the two talks touch on different aspects of {ggtrace}/ggplot\r\ninternals. For this talk, I focused more on giving a walkthrough of how\r\nthe Stat and Geom ggprotos work under the hood, exposing its\r\nfunctional-programming nature. I’ll probably have more to say about\r\n{ggtrace} when I write my rstudioconf reflection, so stay\r\ntuned!\r\n\r\nNothing like the 1-2 page length,\r\nproperly formatted abstracts like in academic conferences.↩︎\r\nI linked to the Github\r\nrepository and package website.↩︎\r\nFor example, our department offered\r\nits first data science course for undergrads last year, which I was the\r\nhead TA for.↩︎\r\nActually I still haven’t figures out\r\nthe proper(?) term between dimension/dimension-al/dimension-ality\r\nreduction.↩︎\r\nWhich makes sense - turns out that\r\nfPCA is kinda domain-specific and designed for the analysis of\r\nacoustic data specifically.↩︎\r\nA move that the team also\r\npre-empted with semi-customized feedback.↩︎\r\nBig mood - me too.↩︎\r\nIt was used for the last few seconds\r\nof the video to create the bell sound effect to “show” what the finish\r\ntimes would “sound” like, to emphasize the point that all sprinters are\r\nsimilarly fast in the grand scheme of things.↩︎\r\nWhen I TA-ed for an undergrad data\r\nscience course last year, we used Google Colab, and I wondered the pros\r\nand cons of that vs. RStudio Cloud.↩︎\r\nWhen I was a junior, my alma mater\r\nNorthwestern University started a data science program with a handful of\r\nteaching professor hires to focus on educating undergrads.↩︎\r\nLike, do you know what all the\r\npossible arguments to geom_histogram() is? You probably\r\ndon’t and they’re actually pretty hard to find, but you wouldn’t have\r\nthat problem if you had one function with a bunch of arguments that you\r\ncan find on just the function’s help page with no redirects.↩︎\r\nAnd it has a lot of “data lessons”\r\nbuilt-in, like you can use it to demonstrate the Simpson’s paradox and\r\nk-means clustering.↩︎\r\nIn fact, one of the really\r\ninsightful parts of the talk was Alison talking on the slide “Why has it\r\nbeen so popular?”.↩︎\r\nWhich I embarassingly never visited\r\nbefore this talk.↩︎\r\ne.g., Adelie is always capitalized\r\nbecause it’s named after a person, but optional for chinstraps and\r\ngentoos. The data itself capitalizes all of them.↩︎\r\nI didn’t even know Twitter Space was\r\na thing before this, despite spending a lot of time on Twitter.↩︎\r\nYou can chain\r\n+ theme()’s.↩︎\r\n", "preview": "posts/2022-07-30-user2022/preview.png", - "last_modified": "2022-11-13T09:16:58-05:00", + "last_modified": "2022-11-13T06:16:58-08:00", "input_file": {}, "preview_width": 520, "preview_height": 260 @@ -251,7 +251,7 @@ ], "contents": "\r\n\r\nContents\r\nIntroduction\r\nWhere it all\r\nhappens: Stat$compute_layer()\r\nWhere it\r\nspecifically happens: StatCount$compute_group()\r\nggproto,\r\nminus the “gg” and the “proto”\r\nMy conventions\r\n{ggplot2} conventions\r\n\r\nTemplates and\r\nextensions\r\nThe\r\n$compute_*() family of methods\r\n1) Split\r\n2) Apply\r\n3) Combine\r\n\r\nThe layer’s data\r\nframe representation\r\nOther $compute_*()\r\nextensions\r\nConclusion\r\nSneak peak of Part 3\r\nA taste of writing\r\nggproto extensions\r\n\r\nThis is a developing series of blog posts, scheduled for three\r\nparts:\r\nPart\r\n1: Exploring the logic of after_stat() to peek inside\r\nggplot internals\r\nPart 2: Exposing the Stat ggproto in\r\nfunctional programming terms (you are here)\r\nPart 3: Completing the picture with after_scale()\r\nand stage() (TBD)\r\nIntroduction\r\nLet’s pick up where we left off in Part\r\n1. If you’d like to follow along without going back to re-read it,\r\nthe relevant code that we’ll carry over here is provided below:\r\n\r\nSetup from Part 1\r\n\r\n\r\n# Top-level setup\r\nlibrary(ggplot2)\r\nlibrary(dplyr)\r\nlibrary(palmerpenguins)\r\n\r\nset.seed(2022)\r\ntheme_set(theme_classic(13))\r\n\r\n# Custom function to inspect `after_stat()`\r\ninspect_after_stat <- function(p, i = 1L) {\r\n ._env <- environment()\r\n .out <- NULL\r\n suppressMessages({\r\n trace(\r\n what = \"ggplot_build.ggplot\",\r\n tracer = substitute(assign(\".out\", data[[i]], envir = ._env), ._env),\r\n at = 19L,\r\n print = FALSE,\r\n where = asNamespace(\"ggplot2\")\r\n )\r\n })\r\n ggplot_build(p)\r\n suppressMessages({\r\n untrace(\"ggplot_build.ggplot\", where = asNamespace(\"ggplot2\"))\r\n })\r\n .out\r\n}\r\n\r\n\r\n\r\nTo recap, the big theme of Part\r\n1 was about how each layer of a ggplot transforms the raw\r\ndata under the hood to make it “drawing ready” (i.e., the dataframe\r\nreturned by layer_data()). We saw that a lot happens over\r\nthe course of this data transformation pipeline, one of which is the\r\nstatistical transformation step. For example, the\r\ngeom_bar() layer uses stat = \"count\"\r\n(shorthand for stat = StatCount) by default, which computes\r\nnew variables like count and prop\r\ninternally.\r\nWe also saw how after_stat() allows users to declare a\r\ndelayed aesthetic mapping, which waits to be applied\r\nuntil after this statistical transformation step. For example, the\r\nStatCount stat used by geom_bar() specifies\r\nthe default implicit mapping to after_stat(count).\r\nObjects like StatCount are called ggproto\r\nobjects, and they’re the focus of this Part 2 of the series.\r\nWe’ll be digging into the implementational details of the internal\r\nstatistical transformation step. Along the way, we’ll encounter some\r\nfunny looking functions in the form of\r\n$(). These are called\r\nggproto methods - they look pretty scary, but\r\nneedlessly so for our purposes: aside from their odd syntax, most of\r\nthem are essentially just data wrangling functions that we’re already\r\nfamiliar with.\r\nLet’s dive right in!\r\nWhere it all happens:\r\nStat$compute_layer()\r\nLet’s again use the penguins dataset and visualize\r\npenguin species counts with geom_bar(). This time we’ll\r\nalso give the bars width = 0.7 and facet by island:\r\n\r\n\r\nlibrary(palmerpenguins)\r\np_bar2 <- ggplot(penguins, aes(x = species)) +\r\n geom_bar(width = 0.7) +\r\n facet_wrap(~ island)\r\np_bar2\r\n\r\n\r\n\r\n\r\nAs expected, geom_bar() maps the internally computed\r\ncount variable to the y aesthetic by default.\r\nWe suspect that something like {dplyr}’s\r\ngroup_by() and summarize() (or just\r\ncount()) is happening in the statistical transformation\r\nstage:\r\n\r\n\r\npenguins %>% \r\n group_by(island, species) %>% \r\n summarize(count = n(), .groups = \"drop\")\r\n\r\n\r\n # A tibble: 5 × 3\r\n island species count\r\n \r\n 1 Biscoe Adelie 44\r\n 2 Biscoe Gentoo 124\r\n 3 Dream Adelie 56\r\n 4 Dream Chinstrap 68\r\n 5 Torgersen Adelie 52\r\n\r\n\r\nIn Part\r\n1, I introduced this mystery function called\r\ninspect_after_stat() to show you that this is indeed the\r\ncase:\r\n\r\n\r\ninspect_after_stat(p_bar2)\r\n\r\n\r\n count prop x width flipped_aes PANEL group\r\n 1 44 1 1 0.7 FALSE 1 1\r\n 2 124 1 3 0.7 FALSE 1 3\r\n 3 56 1 1 0.7 FALSE 2 1\r\n 4 68 1 2 0.7 FALSE 2 2\r\n 5 52 1 1 0.7 FALSE 3 1\r\n\r\n\r\nNow it’s time to unveil the mystery behind this function -\r\ninspect_after_stat() was grabbing the return value of the\r\nggproto method Stat$compute_layer() when it was called for\r\nthe first layer of our plot.1\r\nUsing the function ggtrace_inspect_return() from my\r\npackage {ggtrace},\r\nwe can achieve this more explicitly. We pass the function our plot and\r\nthe ggproto method we want to inspect, and it gives us what the method\r\nreturned:\r\n\r\n\r\n# install.packages(\"remotes\")\r\n# remotes::install_github(\"yjunechoe/ggtrace\")\r\nlibrary(ggtrace)\r\ncompute_layer_output <- ggtrace_inspect_return(p_bar2, Stat$compute_layer)\r\ncompute_layer_output\r\n\r\n\r\n count prop x width flipped_aes PANEL group\r\n 1 44 1 1 0.7 FALSE 1 1\r\n 2 124 1 3 0.7 FALSE 1 3\r\n 3 56 1 1 0.7 FALSE 2 1\r\n 4 68 1 2 0.7 FALSE 2 2\r\n 5 52 1 1 0.7 FALSE 3 1\r\n\r\n\r\nIf ggproto methods are essentially functions, they should have inputs\r\nand outputs. We just saw the output of\r\nStat$compute_layer(), but what’s its input?\r\nWe can simply swap out ggtrace_inspect_return() with\r\nggtrace_inspect_args() to look at the arguments that it was\r\ncalled with:\r\n\r\n\r\ncompute_layer_input <- ggtrace_inspect_args(p_bar2, Stat$compute_layer)\r\nnames( compute_layer_input )\r\n\r\n\r\n [1] \"self\" \"data\" \"params\" \"layout\"\r\n\r\n\r\nWe can ignore the self and layout arguments\r\nfor the moment. The crucial ones are the data and\r\nparams arguments, which look like this:\r\n\r\n\r\n\r\n\r\ncompute_layer_input$data\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\ncompute_layer_input$params\r\n\r\n\r\n $width\r\n [1] 0.7\r\n \r\n $na.rm\r\n [1] FALSE\r\n \r\n $orientation\r\n [1] NA\r\n \r\n $flipped_aes\r\n [1] FALSE\r\n\r\n\r\n\r\n\r\nRemember how I said ggproto methods are essentially data wrangling\r\nfunctions? We can think of Stat$compute_layer() as a\r\nfunction that takes a dataframe and a list of parameters, and does a\r\nsimple data wrangling after grouping by the PANEL and\r\ngroup columns:\r\n\r\n\r\ncompute_layer_fn <- function(data, params) {\r\n data %>% \r\n group_by(PANEL, group) %>% \r\n summarize(\r\n count = n(),\r\n x = unique(x),\r\n width = params$width,\r\n flipped_aes = params$flipped_aes,\r\n .groups = 'drop'\r\n ) %>% \r\n relocate(PANEL, group, .after = last_col())\r\n}\r\n\r\n\r\n\r\nThis gets us the a dataframe that’s very similar to the output of\r\nStat$compute_layer():\r\n\r\n\r\ncompute_layer_fn(compute_layer_input$data, compute_layer_input$params)\r\n\r\n\r\n # A tibble: 5 × 6\r\n count x width flipped_aes PANEL group\r\n \r\n 1 44 1 0.7 FALSE 1 1\r\n 2 124 3 0.7 FALSE 1 3\r\n 3 56 1 0.7 FALSE 2 1\r\n 4 68 2 0.7 FALSE 2 2\r\n 5 52 1 0.7 FALSE 3 1\r\n\r\n\r\nSo we see that the Stat$compute_layer() method was the\r\nstep in the internals responsible for calculating the statistical\r\nsummaries necessary to draw our bar layer.\r\nBut there’s a catch - the way that the method is written doesn’t\r\ndirectly reflect this. Nothing about the code for\r\nStat$compute_layer() says anything about counting:\r\n\r\n\r\nStat$compute_layer\r\n\r\n\r\n\r\n\r\n function (self, data, params, layout) \r\n {\r\n check_required_aesthetics(self$required_aes, c(names(data), \r\n names(params)), snake_class(self))\r\n required_aes <- intersect(names(data), unlist(strsplit(self$required_aes, \r\n \"|\", fixed = TRUE)))\r\n data <- remove_missing(data, params$na.rm, c(required_aes, \r\n self$non_missing_aes), snake_class(self), finite = TRUE)\r\n params <- params[intersect(names(params), self$parameters())]\r\n args <- c(list(data = quote(data), scales = quote(scales)), \r\n params)\r\n dapply(data, \"PANEL\", function(data) {\r\n scales <- layout$get_scales(data$PANEL[1])\r\n tryCatch(do.call(self$compute_panel, args), error = function(e) {\r\n warn(glue(\"Computation failed in `{snake_class(self)}()`:\\n{e$message}\"))\r\n new_data_frame()\r\n })\r\n })\r\n }\r\n \r\n \r\n\r\n\r\nWhere’s the relevant calculation actually happening? For that, we\r\nneed to go deeper.\r\nWhere it\r\nspecifically happens: StatCount$compute_group()\r\nThe actual calculation of counts for the bar layer happens inside\r\nanother ggproto method called\r\nStatCount$compute_group():\r\n\r\n\r\nStatCount$compute_group\r\n\r\n\r\n\r\n\r\n function (self, data, scales, width = NULL, flipped_aes = FALSE) \r\n {\r\n data <- flip_data(data, flipped_aes)\r\n x <- data$x\r\n weight <- data$weight %||% rep(1, length(x))\r\n count <- as.numeric(tapply(weight, x, sum, na.rm = TRUE))\r\n count[is.na(count)] <- 0\r\n bars <- new_data_frame(list(count = count, prop = count/sum(abs(count)), \r\n x = sort(unique(x)), width = width, flipped_aes = flipped_aes), \r\n n = length(count))\r\n flip_data(bars, flipped_aes)\r\n }\r\n \r\n \r\n\r\n\r\nWe don’t need to dwell on trying to understand the code - let’s dive\r\nstraight in with the same ggtrace::ggtrace_inspect_*()\r\nfunctions.\r\nWe see that the return value is actually just one row of the return\r\nvalue of Stat$compute_layer() that we saw earlier, minus\r\nthe PANEL and group columns:2\r\n\r\n\r\nggtrace_inspect_return(p_bar2, StatCount$compute_group)\r\n\r\n\r\n count prop x width flipped_aes\r\n 1 44 1 1 0.7 FALSE\r\n\r\n\r\nAs the name suggests, $compute_group() is called for\r\neach group (in this case, bar) in the geom_bar() layer. The\r\nmethod is called after the layer’s data is split by facet and\r\ngroup, much like how we used group_by(PANEL, group) to\r\nsimulate compute_layer_fn() above.\r\nUsing ggtrace_inspect_n() which returns how many times a\r\nggproto method was called in a plot, we confirm that\r\nStatCount$compute_group() was indeed called five times,\r\nonce for each bar in the plot:\r\n\r\n\r\nggtrace_inspect_n(p_bar2, StatCount$compute_group)\r\n\r\n\r\n [1] 5\r\n\r\n\r\nAnd we can (mostly) recover the output of\r\nStat$compute_layer() by combining the return values from\r\nStatCount$compute_group(), all five times it’s called for\r\nthe layer. The cond argument here lets us target the\r\nnth time the method is called:\r\n\r\n\r\nbind_rows(\r\n ggtrace_inspect_return(p_bar2, StatCount$compute_group, cond = 1),\r\n ggtrace_inspect_return(p_bar2, StatCount$compute_group, cond = 2),\r\n ggtrace_inspect_return(p_bar2, StatCount$compute_group, cond = 3),\r\n ggtrace_inspect_return(p_bar2, StatCount$compute_group, cond = 4),\r\n ggtrace_inspect_return(p_bar2, StatCount$compute_group, cond = 5)\r\n)\r\n\r\n\r\n count prop x width flipped_aes\r\n 1 44 1 1 0.7 FALSE\r\n 2 124 1 3 0.7 FALSE\r\n 3 56 1 1 0.7 FALSE\r\n 4 68 1 2 0.7 FALSE\r\n 5 52 1 1 0.7 FALSE\r\n\r\n\r\nSo we get a sense that Stat$compute_layer() simply\r\nsplits the layer’s data by panel and group, while\r\nStatCount$compute_group() does the heavy lifting.\r\nBut what’s the relationship between these two ggproto methods? How\r\ndoes Stat$compute_layer() know to use\r\nStatCount$compute_group()?\r\nLong story short, the statistical transformation step for\r\ngeom_bar() is actually ALL about StatCount. So\r\nStat$compute_layer() is essentially\r\nStatCount$compute_layer(), but also kind of not.\r\nTo understand this distinction fully, we need a slight detour into\r\nthe world of ggproto.\r\nggproto, minus\r\nthe “gg” and the “proto”\r\nThere are existing resources for learning how ggproto works, such as\r\nChapter\r\n20 of the ggplot2 book, so I won’t repeat all the details here. In\r\nfact, if you’re already familiar with object-oriented programming, the\r\nbook chapter has all you need. But you’re like me and found it\r\noverwhelming at first, then this is for you!\r\n\r\nOn notations and conventions\r\nMy conventions\r\nIn this blog post, I’ll refer to ggproto methods like\r\n$method() to distinguish it from normal functions. I’ll\r\nrefer to properties (non-function elements of ggproto objects) like\r\n$property to distinguish it from variables.\r\nAlso, whenever I’m printing ggproto methods, I’m actually using\r\nget(\"method\", object) behind the scenes. I do this to bring\r\nattention to the method body - printing ggproto methods as-is comes with\r\nextra baggage that distracts from the topic of this blog post. Just be\r\naware of this when you go exploring ggproto methods yourself:\r\n\r\n\r\n# This just prints the method body\r\nget(\"setup_data\", Stat)\r\n\r\n\r\n function (data, params) \r\n {\r\n data\r\n }\r\n \r\n \r\n\r\n\r\n\r\n\r\n# Prints extra stuff we don't care about\r\n# - see: `ggplot2:::format.ggproto_method`\r\nStat$setup_data\r\n\r\n\r\n \r\n \r\n function (...) \r\n f(...)\r\n \r\n \r\n function (data, params) \r\n {\r\n data\r\n }\r\n\r\n\r\n{ggplot2} conventions\r\nYou might have noticed how ggproto objects are written in upper camel\r\ncase like StatCount. This is by convention, but they’re\r\nvery important to know and follow. The upper camel case convention for\r\nggproto objects is closely related to two other conventions:\r\nThe first is in the naming of the layer functions we use to write\r\nggplot code and chain with the + operator, like\r\ngeom_bar() and stat_count(). These names are\r\nderived from ggproto objects like GeomBar and\r\nStatCount through the internal function\r\nggplot2:::snake_class(). Thus, the camel case convention is\r\nused to distinguish ggproto objects like StatCount from\r\nconstructor functions like stat_count(), while maintaining\r\na predictable connection between the two.\r\nThe second is in the use of character shorthands to refer to\r\nggproto objects inside layer functions, like\r\ngeom_bar(stat = \"count\"). These shorthands work by turning\r\nthe string into upper camel case using the internal function\r\nggplot2:::camelize(x, first = TRUE) and then prefixing that\r\nwith \"Stat\" (or \"Geom\" or\r\n\"Position\"). So \"count\" gets converted into\r\n\"StatCount\", which then gets looked up in the caller\r\nenvironment. If a ggproto object breaks this convention, the character\r\nshorthand would not work.\r\nRelatedly, note how variable names like StatCount match\r\nthe class name class(StatCount)[1]:\r\n\r\n\r\nclass(StatCount)[1]\r\n\r\n\r\n [1] \"StatCount\"\r\n\r\n\r\nThis is also by convention and it’s useful for figuring out what\r\nspecific Stat/Geom/Position a\r\nlayer uses:\r\n\r\n\r\nclass( geom_bar()$stat )[1] # uses `StatCount` ggproto\r\n\r\n\r\n [1] \"StatCount\"\r\n\r\n\r\nclass( geom_bar()$geom )[1] # uses `GeomBar` ggproto\r\n\r\n\r\n [1] \"GeomBar\"\r\n\r\n\r\nclass( geom_bar()$position )[1] # uses `PositionStack` ggproto\r\n\r\n\r\n [1] \"PositionStack\"\r\n\r\n\r\nWhen we distill it down to the very basics, ggprotos are essentially\r\nlists, and ggproto methods are essentially functions.\r\nFor our purposes, the only truly new concept you need to know about\r\nggproto methods is that they’re functions that live inside\r\nlists.\r\nSo StatCount$compute_group() calls the\r\n$compute_group() function defined inside a list called\r\nStatCount, kind of like this:\r\n\r\n\r\n# Not run\r\nStatCount <- list(\r\n compute_group = function(...) { ... },\r\n ...\r\n)\r\n\r\n\r\n\r\nAnd Stat$compute_layer() calls the\r\n$compute_layer() function defined inside a list called\r\nStat:\r\n\r\n\r\n# Not run\r\nStat <- list(\r\n compute_layer = function(...) { ... },\r\n ...\r\n)\r\n\r\n\r\n\r\nThe reason why the layer-level statistical transformation for\r\ngeom_bar() was handled by Stat$compute_layer()\r\nand not StatCount$compute_layer() is just by technicality.\r\nIn fact, StatCount$compute_layer() does “exist”:\r\n\r\n\r\nStatCount$compute_layer\r\n\r\n\r\n\r\n\r\n function (self, data, params, layout) \r\n {\r\n check_required_aesthetics(self$required_aes, c(names(data), \r\n names(params)), snake_class(self))\r\n required_aes <- intersect(names(data), unlist(strsplit(self$required_aes, \r\n \"|\", fixed = TRUE)))\r\n data <- remove_missing(data, params$na.rm, c(required_aes, \r\n self$non_missing_aes), snake_class(self), finite = TRUE)\r\n params <- params[intersect(names(params), self$parameters())]\r\n args <- c(list(data = quote(data), scales = quote(scales)), \r\n params)\r\n dapply(data, \"PANEL\", function(data) {\r\n scales <- layout$get_scales(data$PANEL[1])\r\n tryCatch(do.call(self$compute_panel, args), error = function(e) {\r\n warn(glue(\"Computation failed in `{snake_class(self)}()`:\\n{e$message}\"))\r\n new_data_frame()\r\n })\r\n })\r\n }\r\n \r\n \r\n\r\n\r\nBut only in a very specific sense - it just recycles the\r\n$compute_layer() function defined in Stat,\r\nsimilar to in this implementation:\r\n\r\n\r\n# Not run\r\nStatCount <- list(\r\n compute_group = function() { ... },\r\n compute_layer = Stat$compute_layer,\r\n ...\r\n)\r\n\r\n\r\n\r\nIn object-orientated programming terms, this is called\r\ninheritance - the StatCount ggproto is a\r\nchild of the parent Stat\r\nggproto that inherits some of the parent’s methods (like\r\n$compute_layer()) while overriding and defining some of its\r\nown (like $compute_group()).\r\n\r\nA note on class inheritance\r\nClass inheritance is reflected in the output of class(),\r\nwhere order of elements matter:\r\n\r\n\r\nclass(Stat)\r\n\r\n\r\n [1] \"Stat\" \"ggproto\" \"gg\"\r\n\r\n\r\n\r\n\r\nclass(StatCount) # see also: `inherits(StatCount, \"Stat\")`\r\n\r\n\r\n [1] \"StatCount\" \"Stat\" \"ggproto\" \"gg\"\r\n\r\n\r\nBy design, all Stat* ggprotos inherit from the top-level\r\nparent Stat ggproto. For example, the\r\ngeom_boxplot() layer uses StatBoxplot for its\r\nstat, which also inherits from Stat:\r\n\r\n\r\nclass( geom_boxplot()$stat ) # or `class(StatBoxplot)`\r\n\r\n\r\n [1] \"StatBoxplot\" \"Stat\" \"ggproto\" \"gg\"\r\n\r\n\r\nSometimes, an inheritance chain can be more complex - for example,\r\nStatDensity2dFilled inherits from\r\nStatDensity2d, which in turn inherits from\r\nStat:\r\n\r\n\r\nclass( StatDensity2dFilled )\r\n\r\n\r\n [1] \"StatDensity2dFilled\" \"StatDensity2d\" \"Stat\" \r\n [4] \"ggproto\" \"gg\"\r\n\r\n\r\nBut what’s the point of all this? Why do we bother with these clunky\r\nggproto objects instead of having a single function that does all\r\nbar-related things, all boxplot-related things, etc.?\r\nTemplates and extensions\r\nThe rationale behind this inheritance-based design is that the\r\ntop-level Stat ggproto serves as a\r\ntemplate that’s meant to be filled and customized.\r\nAnother term for this customization is extension -\r\nfor example, we say that StatCount is an extension\r\nof Stat. This is what’s technically meant by “ggplot2\r\nextension packages” - these packages provide new Stat* or\r\nGeom* ggproto objects (like ggforce::StatSina\r\nand ggtext::GeomRichtext) that are extensions of the\r\ntop-level Stat and Geom ggprotos.\r\nSince Stat is essentially a list and extensions are\r\nessentially a way of customizing certain elements of a template, each\r\nelement of Stat can be thought of as a possible\r\nextension point:\r\n\r\n\r\nnames(Stat)\r\n\r\n\r\n [1] \"compute_layer\" \"parameters\" \"aesthetics\" \"setup_data\" \r\n [5] \"retransform\" \"optional_aes\" \"non_missing_aes\" \"default_aes\" \r\n [9] \"finish_layer\" \"compute_panel\" \"extra_params\" \"compute_group\" \r\n [13] \"required_aes\" \"setup_params\"\r\n\r\n\r\nSome extension points are “methods” (functions) and others are\r\n“properties” (non-functions):\r\n\r\n\r\nsapply(Stat, class)\r\n\r\n\r\n compute_layer parameters aesthetics setup_data retransform \r\n \"function\" \"function\" \"function\" \"function\" \"logical\" \r\n optional_aes non_missing_aes default_aes finish_layer compute_panel \r\n \"character\" \"character\" \"uneval\" \"function\" \"function\" \r\n extra_params compute_group required_aes setup_params \r\n \"character\" \"function\" \"character\" \"function\"\r\n\r\n\r\nIf you want to know what each of these methods and properties are\r\nfor, you can read up on the package\r\nvignette on ggproto. But we don’t need to know every detail - only a\r\nhandful are productive extension points.\r\nBelow are a few examples of specific Stat* extensions\r\n(e.g., StatCount) and information about what\r\nmethods/properties they modify from the top-level Stat\r\nggproto:\r\nTry to get a feel for what kind of methods and properties are the\r\ncommon targets of extensions. Pay specific attention to the distribution\r\nof methods highlighted in green.\r\n\r\nStatCount\r\n\r\n\r\nget_method_inheritance(StatCount)\r\n\r\n\r\n $Stat\r\n [1] \"aesthetics\" \"compute_layer\" \"compute_panel\" \"finish_layer\" \r\n [5] \"non_missing_aes\" \"optional_aes\" \"parameters\" \"retransform\" \r\n [9] \"setup_data\" \r\n \r\n $StatCount\r\n [1] \"compute_group\" \"default_aes\" \"extra_params\" \"required_aes\" \r\n [5] \"setup_params\"\r\n\r\n\r\nStatBoxplot\r\n\r\n\r\nget_method_inheritance(StatBoxplot)\r\n\r\n\r\n $Stat\r\n [1] \"aesthetics\" \"compute_layer\" \"compute_panel\" \"default_aes\" \r\n [5] \"finish_layer\" \"optional_aes\" \"parameters\" \"retransform\" \r\n \r\n $StatBoxplot\r\n [1] \"compute_group\" \"extra_params\" \"non_missing_aes\" \"required_aes\" \r\n [5] \"setup_data\" \"setup_params\"\r\n\r\n\r\nStatDensity\r\n\r\n\r\nget_method_inheritance(StatBin)\r\n\r\n\r\n $Stat\r\n [1] \"aesthetics\" \"compute_layer\" \"compute_panel\" \"finish_layer\" \r\n [5] \"non_missing_aes\" \"optional_aes\" \"parameters\" \"retransform\" \r\n [9] \"setup_data\" \r\n \r\n $StatBin\r\n [1] \"compute_group\" \"default_aes\" \"extra_params\" \"required_aes\" \r\n [5] \"setup_params\"\r\n\r\n\r\nStatSmooth\r\n\r\n\r\nget_method_inheritance(StatSmooth)\r\n\r\n\r\n $Stat\r\n [1] \"aesthetics\" \"compute_layer\" \"compute_panel\" \"default_aes\" \r\n [5] \"finish_layer\" \"non_missing_aes\" \"optional_aes\" \"parameters\" \r\n [9] \"retransform\" \"setup_data\" \r\n \r\n $StatSmooth\r\n [1] \"compute_group\" \"extra_params\" \"required_aes\" \"setup_params\"\r\n\r\n\r\nStatBin\r\n\r\n\r\nget_method_inheritance(StatBin)\r\n\r\n\r\n $Stat\r\n [1] \"aesthetics\" \"compute_layer\" \"compute_panel\" \"finish_layer\" \r\n [5] \"non_missing_aes\" \"optional_aes\" \"parameters\" \"retransform\" \r\n [9] \"setup_data\" \r\n \r\n $StatBin\r\n [1] \"compute_group\" \"default_aes\" \"extra_params\" \"required_aes\" \r\n [5] \"setup_params\"\r\n\r\n\r\nStatBin\r\n\r\n\r\nget_method_inheritance(StatContour)\r\n\r\n\r\n $Stat\r\n [1] \"aesthetics\" \"compute_layer\" \"compute_panel\" \"extra_params\" \r\n [5] \"finish_layer\" \"non_missing_aes\" \"optional_aes\" \"parameters\" \r\n [9] \"retransform\" \"setup_data\" \r\n \r\n $StatContour\r\n [1] \"compute_group\" \"default_aes\" \"required_aes\" \"setup_params\"\r\n\r\n\r\n\r\nDid you notice how all these Stat* extensions define\r\ntheir own $compute_group() method while inheriting the\r\n$compute_layer() and $compute_panel() methods\r\nfrom the parent stat? This is the intended design - check\r\nout how Stat$compute_group() is defined:\r\n\r\n\r\nStat$compute_group\r\n\r\n\r\n\r\n\r\n function (self, data, scales) \r\n {\r\n abort(\"Not implemented\")\r\n }\r\n \r\n \r\n\r\n\r\nAnd this is what I mean by “the top-level Stat ggproto\r\nis a template”. The Stat provides an infrastructure that\r\nsplits the data up by layer (Stat$compute_layer()) and\r\npanel (Stat$compute_panel()), but leaves it to the child\r\nStat* ggprotos to fill in the details about what\r\nstatistical summaries are computed by group, after this splitting takes\r\nplace.\r\n\r\nHow are Stat$compute_layer() and\r\nStat$compute_panel() implemented?\r\nIn $compute_layer(), the data is split by values of the\r\nPANEL column using an internal function\r\nggplot2:::dapply(), and $compute_panel() is\r\ncalled on each split.\r\n\r\n\r\nStat$compute_layer\r\n\r\n\r\n\r\n\r\n function (self, data, params, layout) \r\n {\r\n check_required_aesthetics(self$required_aes, c(names(data), \r\n names(params)), snake_class(self))\r\n required_aes <- intersect(names(data), unlist(strsplit(self$required_aes, \r\n \"|\", fixed = TRUE)))\r\n data <- remove_missing(data, params$na.rm, c(required_aes, \r\n self$non_missing_aes), snake_class(self), finite = TRUE)\r\n params <- params[intersect(names(params), self$parameters())]\r\n args <- c(list(data = quote(data), scales = quote(scales)), \r\n params)\r\n dapply(data, \"PANEL\", function(data) {\r\n scales <- layout$get_scales(data$PANEL[1])\r\n tryCatch(do.call(self$compute_panel, args), error = function(e) {\r\n warn(glue(\"Computation failed in `{snake_class(self)}()`:\\n{e$message}\"))\r\n new_data_frame()\r\n })\r\n })\r\n }\r\n \r\n \r\n\r\n\r\nIn $compute_panel(), the data is first split by values\r\nof the group column using the split()\r\nfunction, then $compute_group() is called on each split\r\ninside lapply().\r\n\r\n\r\nStat$compute_panel\r\n\r\n\r\n\r\n\r\n function (self, data, scales, ...) \r\n {\r\n if (empty(data)) \r\n return(new_data_frame())\r\n groups <- split(data, data$group)\r\n stats <- lapply(groups, function(group) {\r\n self$compute_group(data = group, scales = scales, ...)\r\n })\r\n stats <- mapply(function(new, old) {\r\n if (empty(new)) \r\n return(new_data_frame())\r\n unique <- uniquecols(old)\r\n missing <- !(names(unique) %in% names(new))\r\n cbind(new, unique[rep(1, nrow(new)), missing, drop = FALSE])\r\n }, stats, groups, SIMPLIFY = FALSE)\r\n rbind_dfs(stats)\r\n }\r\n \r\n \r\n\r\n\r\nThese two methods are implemented slightly differently, though the\r\nexact details are not relevant to the current discussion.3\r\nLastly, a note about self. The self\r\nvariable is a reference to the ggproto object that called the method.4 In the context of\r\np_bar2, the self is\r\nStatCount:\r\n\r\n\r\nclass( ggtrace_inspect_args(p_bar2, Stat$compute_layer)$self )\r\n\r\n\r\n [1] \"StatCount\" \"Stat\" \"ggproto\" \"gg\"\r\n\r\n\r\nThis may be surprising given how $compute_layer() lives\r\ninside Stat, not StatCount. This is an odd\r\nthing about object oriented programming with ggproto that you’ll have to\r\nget used to. Just remember that the choice of the ggproto object is\r\ndetermined by the layer (e.g., stored in geom_bar()$stat)\r\nand that the self variable keeps track of which\r\nggproto object called a method (context-dependent);\r\nbut all of this is separate from the issue of where a method is\r\ndefined in (context-independent).\r\nIn other words, self is StatCount because\r\nthe method $compute_layer() is “looked up” by\r\nStatCount. The fact that the actual\r\n$compute_layer() method is defined inside\r\nStat$compute_layer() is irrelevant here (though it may be\r\nrelevant for other purposes).\r\nIn the next section, we’ll look at how the $compute_*()\r\nfamily of methods behave in the internals.\r\n\r\nWait - what about the other common extension points?\r\nYou might have also noticed a few more repeated extension points\r\nother than $compute_group(). They usually form some subset\r\nof $default_aes, $required_aes,\r\n$setup_data(), $setup_params(), and\r\n$extra_params. These are also like a family - their job is\r\nto prepare the layer’s data before it’s sent off to the\r\n$compute_*() family of methods.\r\nFor example, here’s a walkthrough of how StatCount\r\nprepares the data:\r\nFirst, the default aesthetic mappings are specified such that\r\nboth x and y are mapped to\r\nafter_stat(count):\r\n\r\n\r\n# or `geom_bar()$stat$default_aes`\r\nStatCount$default_aes\r\n\r\n\r\n Aesthetic mapping: \r\n * `x` -> `after_stat(count)`\r\n * `y` -> `after_stat(count)`\r\n * `weight` -> 1\r\n\r\n\r\nSecond, the required aesthetic mappings are specified such that\r\nexactly one of x or y must\r\nbe provided:\r\n\r\n\r\n# or `geom_bar()$stat$required_aes`\r\nStatCount$required_aes\r\n\r\n\r\n [1] \"x|y\"\r\n\r\n\r\nBy requiring the user to supply one of the two aesthetics, the one\r\nleft over takes on the “implicit” value of\r\nafter_stat(count). This is largely handled in\r\nStatCount$setup_params():\r\n\r\n\r\n# Lots of code here but it does 3 things:\r\n# 1) Check if user supplied `y`, not `x` (+ track this in `flipped_aes`)\r\n# 2) Make sure user supplied exactly one of `x` or `y`\r\n# 3) If `flipped_aes` (= `y` is supplied), pretend that it's `x` but\r\n# keep that for later (reverted in `StatCount$compute_group()`)\r\nStatCount$setup_params\r\n\r\n\r\n\r\n\r\n function (data, params) \r\n {\r\n params$flipped_aes <- has_flipped_aes(data, params, main_is_orthogonal = FALSE)\r\n has_x <- !(is.null(data$x) && is.null(params$x))\r\n has_y <- !(is.null(data$y) && is.null(params$y))\r\n if (!has_x && !has_y) {\r\n abort(\"stat_count() requires an x or y aesthetic.\")\r\n }\r\n if (has_x && has_y) {\r\n abort(\"stat_count() can only have an x or y aesthetic.\")\r\n }\r\n if (is.null(params$width)) {\r\n x <- if (params$flipped_aes) \r\n \"y\"\r\n else \"x\"\r\n params$width <- resolution(data[[x]]) * 0.9\r\n }\r\n params\r\n }\r\n \r\n \r\n\r\n\r\nLastly, the role of $extra_params is kind\r\nof weird, but in this case it simply says that\r\nStatCount has support for handling different orientations\r\n(which gets standardized to flipped_aes internally).\r\n\r\n\r\n\r\n\r\nStat$extra_params\r\n\r\n\r\n [1] \"na.rm\"\r\n\r\n\r\n\r\n\r\n\r\n\r\nStatCount$extra_params\r\n\r\n\r\n [1] \"na.rm\" \"orientation\"\r\n\r\n\r\n\r\n\r\nIn case you didn’t know already, orientation is a relatively new\r\n(v3.3.0) argument supported by some layers. It’s allows individual\r\nlayers to be “flipped” without applying coord_flip() to the\r\nwhole plot:\r\n\r\n\r\nggplot(mapping = aes(x = 2, y = 3)) +\r\n geom_col(fill = \"pink\") +\r\n geom_col(fill = \"steelblue\", orientation = \"y\") +\r\n xlim(0, 3) + ylim(0, 4)\r\n\r\n\r\n\r\n\r\nThe $compute_*()\r\nfamily of methods\r\nHere, let’s look at how the statistical transformation is implemented\r\nin the $compute_*() family of methods. The\r\np_bar2 ggplot is printed again below.\r\n\r\n\r\np_bar2\r\n\r\n\r\n\r\n\r\nThe geom_bar() layer in our plot computes and draws five\r\nbars across three panels. This is reflected in the number of calls to\r\nthe $compute_*() methods:\r\nOne call to Stat$compute_layer() for our one bar\r\nlayer\r\n\r\n\r\nggtrace_inspect_n(p_bar2, Stat$compute_layer)\r\n\r\n\r\n [1] 1\r\n\r\n\r\nThree calls to Stat$compute_panel() for three facets in\r\nthe bar layer\r\n\r\n\r\nggtrace_inspect_n(p_bar2, Stat$compute_panel)\r\n\r\n\r\n [1] 3\r\n\r\n\r\nFive calls to StatCount$compute_group() for five bars\r\nacross the three facets\r\n\r\n\r\nggtrace_inspect_n(p_bar2, StatCount$compute_group)\r\n\r\n\r\n [1] 5\r\n\r\n\r\nKeep in mind that there’s a hierarchy to how the\r\n$compute_*() functions are called:\r\n\r\n Stat$compute_layer()\r\n |--- Stat$compute_panel()\r\n |--- StatCount$compute_group()\r\n |--- StatCount$compute_group()\r\n |--- Stat$compute_panel()\r\n |--- StatCount$compute_group()\r\n |--- StatCount$compute_group()\r\n |--- Stat$compute_panel()\r\n |--- StatCount$compute_group()\r\n\r\nThe $compute_*() family of functions implement what’s\r\ncalled a split-apply-combine design. If you haven’t\r\nheard of it before, it’s basically like the “divide and conquer”\r\nstrategy: you break down the problem to the smallest, most essential\r\npieces, work on them individually, then bring them together into one\r\nsolution.\r\nPlay around with the expandable nested tables below to get a sense of\r\nwhat kind of information is passed down to the\r\n$compute_group() method, and what each\r\n$compute_group() call returns for the bar layer in our\r\nplot:\r\n\r\n\r\n\r\n1) Split\r\nThe $compute_layer() method first splits up the data by\r\nfacet and passes them down to $compute_panel(). Then, the\r\n$compute_panel() method splits up the data by group and\r\npasses them down to $compute_group().\r\n\r\n\r\n\r\n\r\n\r\n2) Apply\r\nThe $compute_group() method applies to each of the\r\nsplits and returns a modified data:\r\n\r\n\r\n\r\n\r\n\r\n3) Combine\r\nThe output of $compute_group() calls are combined by\r\npanel and returned by $compute_panel(), then combined again\r\nand returned by $compute_layer():\r\n\r\n\r\n\r\nThe layer’s data frame\r\nrepresentation\r\nRemember how I said that the output of\r\nStat$compute_layer() for our geom_bar() layer\r\nis the dataframe representation of the layer at a\r\nparticular stage in the pipeline? If this dataframe representation were\r\nto change, then it would affect how the layer gets drawn down the\r\nline.\r\nWe saw how Stat* extensions do this by changing how a\r\nggproto method is defined (namely, via $compute_group()).\r\nBut we can also test this on-the-fly as well, without writing a whole\r\nggproto extension ourselves.\r\nFor example, we can modify the output of\r\nStat$compute_layer() that we grabbed with\r\nggtrace_inspect_return() before and save it to a new\r\nvariable:\r\n\r\n\r\nmodified_compute_layer_output <- compute_layer_output %>% \r\n mutate(count = count ^ 2) #< square the counts\r\n\r\nmodified_compute_layer_output\r\n\r\n\r\n count prop x width flipped_aes PANEL group\r\n 1 1936 1 1 0.7 FALSE 1 1\r\n 2 15376 1 3 0.7 FALSE 1 3\r\n 3 3136 1 1 0.7 FALSE 2 1\r\n 4 4624 1 2 0.7 FALSE 2 2\r\n 5 2704 1 1 0.7 FALSE 3 1\r\n\r\n\r\nThen, we force Stat$compute_layer() to return that\r\nmodified dataframe instead when it’s called for p_bar2, by\r\npassing it to the value argument of\r\nggtrace_highjack_return():\r\n\r\n\r\nggtrace_highjack_return(\r\n p_bar2, Stat$compute_layer,\r\n value = modified_compute_layer_output\r\n)\r\n\r\n\r\n\r\n\r\nSee how this has direct consequences for our plot?\r\nHere’s another one - we can modify the dataframe about to be returned\r\nby StatCount$compute_group(), targeting just the third time\r\nit’s called, with cond = 3. This time we do the\r\nmodification in place by passing an expression to the\r\nvalue argument, where returnValue() evaluates\r\nto the value about to be returned by the method:\r\n\r\n\r\nggtrace_highjack_return(\r\n p_bar2, StatCount$compute_group, cond = 3,\r\n value = quote({\r\n returnValue() %>% \r\n mutate(count = count ^ 2)\r\n })\r\n)\r\n\r\n\r\n\r\n\r\nAgain, a big consequence for the plot down the line.\r\nThe lesson here is that the dataframe representation of the\r\nlayer undergoes incremental updates in the internals, gradually\r\nworking up to its final drawing-ready form that we can see using\r\nlayer_data().\r\n\r\n\r\nlayer_data(p_bar2)\r\n\r\n\r\n y count prop x flipped_aes PANEL group ymin ymax xmin xmax colour fill\r\n 1 44 44 1 1 FALSE 1 1 0 44 0.65 1.35 NA grey35\r\n 2 124 124 1 3 FALSE 1 3 0 124 2.65 3.35 NA grey35\r\n 3 56 56 1 1 FALSE 2 1 0 56 0.65 1.35 NA grey35\r\n 4 68 68 1 2 FALSE 2 2 0 68 1.65 2.35 NA grey35\r\n 5 52 52 1 1 FALSE 3 1 0 52 0.65 1.35 NA grey35\r\n size linetype alpha\r\n 1 0.5 1 NA\r\n 2 0.5 1 NA\r\n 3 0.5 1 NA\r\n 4 0.5 1 NA\r\n 5 0.5 1 NA\r\n\r\n\r\nIn this sense, ggproto methods are like functions that intervene at\r\ndifferent steps of what is essentially a data wrangling pipeline, making\r\npiecemeal changes to the layer data. Consequently, the work of a\r\nsingle ggproto method can have far reaching consequences for the plot\r\ndown the line.\r\nI promise this is the end of the {ggtrace} self-promo,\r\nbut I encourage you to play around with the internals using the workflow\r\nfunctions from the package - it can help refine your intuitions about\r\nhow ggplot internals work, with low barrier and no risk. Check out the\r\npackage website if\r\nyou’d like to know more.\r\nOther $compute_*()\r\nextensions\r\nBefore we wrap up, I have to come clean - I’ve actually been\r\nmisleading you in instilling this divide between\r\n$compute_layer() and $compute_panel() on one\r\nhand, and $compute_group() on the other. Of course,\r\n$compute_group() is the most natural extension point, but\r\nnothing stops you from doing the necessary calculations in\r\n$compute_layer() or $compute_panel()\r\ninstead.\r\nIn fact, there are several circumstances where you don’t want to do\r\nthings at the group level:\r\nThe first reason is for efficiency. Consider the case of\r\ngeom_point() - it’s a layer that draws points from\r\nx and y values, as is. The stat\r\nggproto for this layer is StatIdentity, and all it does in\r\nthe $compute_*() step is to return the data as it received\r\nit.\r\n\r\n\r\nclass( geom_point()$stat )\r\n\r\n\r\n [1] \"StatIdentity\" \"Stat\" \"ggproto\" \"gg\"\r\n\r\n\r\nOne way of implementing this is to define a\r\nStatIdentity$compute_group() that just returns\r\ndata, but this is unnecessarily complex - you’re still\r\nsplitting the data by panel and group, only to not do anything with\r\nit.\r\nTherefore, the appropriate extension point is actually\r\n$compute_layer() - you return the data as soon as you\r\nreceive it, without forwarding the data to $compute_panel()\r\nand $compute_group(). Indeed, that’s how\r\nStatIdentity is actually implemented.\r\n\r\n\r\nget_method_inheritance(StatIdentity)\r\n\r\n\r\n $Stat\r\n [1] \"aesthetics\" \"compute_group\" \"compute_panel\" \"default_aes\" \r\n [5] \"extra_params\" \"finish_layer\" \"non_missing_aes\" \"optional_aes\" \r\n [9] \"parameters\" \"required_aes\" \"retransform\" \"setup_data\" \r\n [13] \"setup_params\" \r\n \r\n $StatIdentity\r\n [1] \"compute_layer\"\r\n\r\n\r\n\r\n\r\nStatIdentity$compute_layer\r\n\r\n\r\n\r\n\r\n function (self, data, params, layout) \r\n {\r\n data\r\n }\r\n \r\n \r\n\r\n\r\nA second reason is that sometimes you need to do calculations at the\r\npanel level, not the group level. This is the case for\r\nStatUnique. It’s an uncommon stat that’s mainly used to\r\ndeal with overplotted text by removing duplicates in the dataframe\r\nrepresentation of the layer.5\r\nIt’s subtle (you need to zoom in to the plots to see), but there’s a\r\ndifference in quality between overlapping text on the left (with the\r\nStatIdentity default) versus the solution on the right with\r\nStatUnique.\r\n\r\n\r\n\r\n\r\ntibble(x = rep(1, 50), y = x) %>% \r\n ggplot(aes(x, y)) +\r\n geom_text(\r\n aes(label = \"Some Text\"),\r\n size = 25, fontface = \"bold\",\r\n stat = StatIdentity #< default\r\n )\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\ntibble(x = rep(1, 50), y = x) %>% \r\n ggplot(aes(x, y)) +\r\n geom_text(\r\n aes(label = \"Some Text\"),\r\n size = 25, fontface = \"bold\",\r\n stat = StatUnique\r\n )\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nFigure 1: Zoomed in to the letter “S” from the two\r\nplots.\r\n\r\n\r\n\r\nYou might think that StatUnique$compute_layer() is\r\nreturning unique(data) at the layer level, but consider\r\nthis behavior of StatUnique:\r\n\r\n\r\ntibble(x = rep(1, 50), y = rep(1, 50),\r\n g = rep_len(LETTERS[1:4], 50)) %>% \r\n ggplot(aes(x, y)) +\r\n geom_text(\r\n aes(label = \"Some Text\"),\r\n size = 10, fontface = \"bold\",\r\n stat = StatUnique\r\n ) +\r\n facet_wrap(~ g)\r\n\r\n\r\n\r\n\r\nWe see that StatUnique doesn’t remove just any data\r\npoint duplicates - it removes duplicates within each facet.\r\nThus, unique(data) is implemented inside\r\nStatUnique$compute_panel(), which is more in line with the\r\ngoal of preventing visually overlapping text.\r\n\r\n\r\nget_method_inheritance(StatUnique)\r\n\r\n\r\n $Stat\r\n [1] \"aesthetics\" \"compute_group\" \"compute_layer\" \"default_aes\" \r\n [5] \"extra_params\" \"finish_layer\" \"non_missing_aes\" \"optional_aes\" \r\n [9] \"parameters\" \"required_aes\" \"retransform\" \"setup_data\" \r\n [13] \"setup_params\" \r\n \r\n $StatUnique\r\n [1] \"compute_panel\"\r\n\r\n\r\n\r\n\r\nStatUnique$compute_panel\r\n\r\n\r\n\r\n\r\n function (data, scales) \r\n unique(data)\r\n \r\n \r\n\r\n\r\nLastly, there are rare cases of where you’d want to extend multiple\r\n$compute_*() methods at once. For example,\r\nStatYdensity is used by geom_violin() to\r\ncalculate the size and shape of violins, and it extends both\r\n$compute_panel() and $compute_group():\r\n\r\n\r\nclass( geom_violin()$stat )\r\n\r\n\r\n [1] \"StatYdensity\" \"Stat\" \"ggproto\" \"gg\"\r\n\r\n\r\n\r\n\r\nget_method_inheritance(StatYdensity)\r\n\r\n\r\n $Stat\r\n [1] \"aesthetics\" \"compute_layer\" \"default_aes\" \"finish_layer\" \r\n [5] \"optional_aes\" \"parameters\" \"retransform\" \"setup_data\" \r\n \r\n $StatYdensity\r\n [1] \"compute_group\" \"compute_panel\" \"extra_params\" \"non_missing_aes\"\r\n [5] \"required_aes\" \"setup_params\"\r\n\r\n\r\nThis is done to create an effect where densities are calculated per\r\ngroup, and then scaled within each facet. Note how violin areas are\r\nequal within a facet but not across facets.6\r\n\r\n\r\n\r\nFigure 2: https://stackoverflow.com/questions/47174825\r\n\r\n\r\n\r\nConclusion\r\nIn Part\r\n1, we were introduced to this idea of there being a data\r\nframe representation for each layer, which gets updated and\r\naugmented over the course of rendering a ggplot. We saw how one of the\r\nchanges to the layer data is the statistical\r\ntransformation whereby new variables like count\r\nare computed internally and become available for delayed\r\naesthetic mapping using after_stat(). We saw how\r\nthis statistical transformation step isn’t as scary as it looks - it\r\nlooked like a standard data wrangling procedure that we could express\r\nwith group_by() and summarize().\r\nThe goal of Part 2 was to expose the exact details of\r\nthe statistical transformation step. We had our first encounter with\r\nggproto methods which we can think of as data wrangling\r\nfunctions that live inside lists. We saw how a family of\r\n$compute_*() methods called by-layer, by-facet, and\r\nby-group implement a split-apply-combine process much\r\nlike the group_by() + summarize() combo. We\r\nalso learned that the motivation behind this odd-looking ggproto system\r\nis to support an extension mechanism that allows us to\r\nsubclass new Stat* and Geom* ggprotos that\r\ndefine their own custom behavior for a method like\r\n$compute_group().\r\nSorry if Part 2 was a bit too theoretical! In Part 3, we’ll leave\r\nthis whole ggproto thing behind to talk about after_scale()\r\nand stage(), two more delayed aes eval functions.\r\nSneak peak of Part 3\r\nI’ll leave you with two examples as a teaser:\r\nExample 1: after_scale() is like\r\nafter_stat(), but targets the data after the\r\n(non-positional) scale transformation step, which\r\nhappens towards the end of the build pipeline:\r\n\r\n\r\nlibrary(colorspace)\r\np_boxplot <- penguins %>% \r\n filter(!is.na(flipper_length_mm)) %>% \r\n ggplot(aes(x = species, y = flipper_length_mm, fill = species)) +\r\n geom_boxplot(\r\n aes(color = after_scale(darken(fill, .5))),\r\n width = .4\r\n ) +\r\n theme_classic()\r\np_boxplot\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nExample 2: stage() allows you to re-map to the same\r\naesthetic using variables from different points in the build\r\npipeline.\r\n\r\n\r\n# Generates warning in v3.3.6; fixed in dev version (PR #4707)\r\np_boxplot +\r\n geom_text(\r\n aes(y = stage(flipper_length_mm, c(lower, middle, upper)),\r\n label = after_stat(c(lower, middle, upper)),\r\n color = after_scale(darken(fill, .8))),\r\n size = 4, hjust = 1,\r\n position = position_nudge(x = .5),\r\n stat = StatBoxplot\r\n ) +\r\n theme_classic()\r\n\r\n\r\n\r\n\r\nA taste of writing\r\nggproto extensions\r\nThough writing ggproto extensions is beyond the scope of this blog\r\nseries,7 we’re now well prepared for it after\r\nworking through the logic of how the top-level template\r\nStat ggproto gets extended in child Stat*\r\nggprotos like StatCount. All that’s missing is the syntax\r\nthat implements this (that’s usually the easy part!).\r\nJust to give you a taste without going into it too deep, this is an\r\nexample stat extension inspired by Gina Reynolds that calculates\r\nan internal variable called rowid inside\r\n$compute_layer() and also sets a default aesthetic mapping\r\nof label = after_stat(rowid) in\r\n$default_aes:\r\n\r\n\r\nStatRowID <- ggplot2::ggproto(\r\n \r\n # Create a new ggproto of class \"StatRowID\"\r\n `_class` = \"StatRowID\",\r\n # That inherits from the top-level `Stat` ggproto\r\n `_inherit` = Stat,\r\n \r\n # Extension point: add a `rowid` column to the data at layer-level\r\n compute_layer = function(self, data, params, layout) {\r\n data$rowid <- seq_len(nrow(data))\r\n data\r\n },\r\n \r\n # Extension point: map the computed `rowid` variable to `label`\r\n default_aes = aes(label = after_stat(rowid))\r\n \r\n)\r\n\r\ntibble(x = runif(20), y = runif(20), g = rep(c(\"A\", \"B\"), each = 10)) %>% \r\n ggplot(aes(x, y, fill = g)) +\r\n geom_label(stat = StatRowID) # or \"RowID\"\r\n\r\n\r\n\r\n\r\nIf we wanted row IDs to be calculated by group, we simply move the\r\ncomputation to $compute_group()\r\n\r\n\r\nStatRowIDbyGroup <- ggplot2::ggproto(\r\n `_class` = \"StatRowIDbyGroup\",\r\n `_inherit` = Stat,\r\n default_aes = aes(label = after_stat(rowid)),\r\n # Extend `compute_group` instead of `compute_layer`\r\n compute_group = function(self, data, scales) {\r\n data$rowid <- seq_len(nrow(data))\r\n data\r\n }\r\n)\r\n\r\ntibble(x = runif(20), y = runif(20), g = rep(c(\"A\", \"B\"), each = 10)) %>% \r\n ggplot(aes(x, y, fill = g)) +\r\n geom_label(stat = StatRowIDbyGroup) # or \"RowIDbyGroup\"\r\n\r\n\r\n\r\n\r\nIf the sensible use of statRowIDbyGroup is for it to be\r\npaired with the label geom, then we write a layer function (also called\r\na constructor function) that wraps around\r\nggplot2::layer(). Inside stat_rowid(), we hard\r\ncode stat = StatRowIDbyGroup and expose the\r\ngeom = \"label\" default argument to the user:\r\n\r\n\r\nstat_rowid <- function(mapping = NULL, data = NULL,\r\n geom = \"label\", position = \"identity\",\r\n ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) {\r\n ggplot2::layer(\r\n # All layers need data and aesthetic mappings\r\n data = data, mapping = mapping,\r\n # The layer's choice of Stat, Geom, and Position\r\n stat = StatRowIDbyGroup, geom = geom, position = position,\r\n # Standard parameters available for all `layer()`s (there are more)\r\n show.legend = show.legend, inherit.aes = inherit.aes,\r\n # Arguments to be passed down to Stat/Geom/Position\r\n params = list(na.rm = na.rm, ...)\r\n )\r\n}\r\n\r\ntibble(x = runif(20), y = runif(20), g = rep(c(\"A\", \"B\"), each = 10)) %>% \r\n ggplot(aes(x, y, fill = g)) +\r\n stat_rowid()\r\n\r\n\r\n\r\n\r\nBy following these design principles, you get a lot of features for\r\nfree, like the ability to swap out the geom and map the internally\r\ncalculated rowid variable elsewhere via\r\nafter_stat():\r\n\r\n\r\ntibble(x = runif(20), y = runif(20), g = rep(c(\"A\", \"B\"), each = 10)) %>% \r\n ggplot(aes(x, y, color = g)) +\r\n stat_rowid(\r\n aes(\r\n size = after_stat(rowid),\r\n alpha = after_stat(rowid/max(rowid))\r\n ),\r\n geom = \"text\"\r\n )\r\n\r\n\r\n\r\n\r\nIf you’re feeling ambitious, I highly recommend starting with Thomas Lin Pedersen’s rstudio::conf\r\ntalk. Good luck!\r\n\r\nTechnically, it was returning the\r\noutput of ggplot2:::Layer$compute_statistic() which returns\r\nthe $compute_layer() method called by the layer’s stat. But\r\nwe don’t need to worry about this distinction for this blog post.↩︎\r\nMissing columns get added back in\r\nwhen $compute_panel() combines the output of\r\n$compute_group() calls.↩︎\r\nFor example, one big difference is\r\nthat parameters are passed in as a list to the params\r\nargument of compute_layer(), which is then spliced and\r\npassed in as the ... of compute_panel()\r\nthrough do.call().↩︎\r\nIt’s actually optional and only\r\navailable if the method specifies an argument called self.\r\nAs long as it’s present in the formals, it can appear in any position.\r\nThe convention is to define all ggproto methods with self\r\nas the first argument.↩︎\r\nAn alternative solution that I\r\npersonally prefer is annotate(geom = \"text\", ..).↩︎\r\nI’m not sure about the rationale for\r\nthis, and people\r\nhave brought up wanting to change this default. Given everything\r\nwe’ve discussed so far, we don’t have to read the source code to know\r\nthat if you want all violins in the layer to have the same\r\nsize, we should move the calculations in $compute_panel()\r\nto $compute_layer() instead.↩︎\r\nOther resources exist for that (in\r\nfact, most resources on learning the internals focus on writing\r\nextensions). Chapter 20 and 21 of the ggplot2\r\nbook is a good place to start.↩︎\r\n", "preview": "posts/2022-07-06-ggplot2-delayed-aes-2/preview.png", - "last_modified": "2022-11-13T09:16:57-05:00", + "last_modified": "2022-11-13T06:16:57-08:00", "input_file": {}, "preview_width": 3469, "preview_height": 2224 @@ -275,7 +275,7 @@ ], "contents": "\n\nContents\nIntroduction\nafter_stat()\nAnother practice with\nprop\nBaby steps to the Stat\nggproto\nSame geom, different\nstats\nSame stat, different\ngeoms\nSneak peak of Part 2\n\nSession Info\n\nThis is a developing series of blog posts, scheduled for three\nparts:\nPart 1: Exploring the logic of\nafter_stat() to peek inside ggplot internals (you\nare here)\nPart 2: Exposing the Stat ggproto in functional\nprogramming terms\nPart 3: Completing the picture with after_scale()\nand stage() (TBD)\nIntroduction\nThe version\n3.3.0 of {ggplot2} released back in early 2020 introduced a family\nof functions for finer control over mapping variables to aesthetics:\nafter_stat(), after_scale(), and\nstage().\nThe big idea behind these functions is that layers sometimes derive\nnew variables from the raw data that they receive, and\nso ggplot should offer a way of letting the user access with these\ninternally-computed variables inside the aes().\nWhile this may sound esoteric, the mapping of an internal\nvariable to an aesthetic is ubiquitous, both in\n{ggplot2} itself as well as in many extension packages. For\nexample, we take it for granted that layers like geom_bar()\nand geom_histogram() draw bars in the x/y-coordinate space\nand yet require just a single mapping to one of the two aesthetics.\n\n\nlibrary(ggplot2)\ntheme_set(theme_classic(14))\n\npenguins <- na.omit(palmerpenguins::penguins)\nrmarkdown::paged_table(penguins)\n\n\n\n\n\nggplot(penguins, aes(x = bill_depth_mm)) +\n geom_histogram()\nggplot(penguins, aes(x = species)) +\n geom_bar()\n\n\n\n\nDespite only specifying a mapping to x, there is clearly\nalso a variable mapped to y in both plots above - a\nvariable called count.\nWe can infer that the count variable is computed by the\nlayer since it doesn’t exist in our raw data\npenguins:\n\n\n\"count\" %in% colnames(penguins)\n\n\n [1] FALSE\n\nWhich also means that if we try to use count in\naes() like we would any other column in our data, we get an\nerror:\n\n\nggplot(penguins, aes(species, y = count)) +\n geom_bar()\n\n\n Error in FUN(X[[i]], ...): object 'count' not found\n\nBut we don’t just have to infer that count is\ncomputed by the layer. We can see it from the output of a layer’s data\ntransformation pipeline, which we can get with\nlayer_data().1 Here, we pass\nlayer_data() our bar plot, which by default returns us the\ndata for the first layer:\n\n\np <- ggplot(penguins, aes(x = species)) +\n geom_bar()\nlayer_data(p, i = 1L) # argument `i` specifies the layer index\n\n\n y count prop x flipped_aes PANEL group ymin ymax xmin xmax colour fill\n 1 146 146 1 1 FALSE 1 1 0 146 0.55 1.45 NA grey35\n 2 68 68 1 2 FALSE 1 2 0 68 1.55 2.45 NA grey35\n 3 119 119 1 3 FALSE 1 3 0 119 2.55 3.45 NA grey35\n size linetype alpha\n 1 0.5 1 NA\n 2 0.5 1 NA\n 3 0.5 1 NA\n\nThere’s a lot going on in this dataframe but pay attention to two\nthings:\nThere is a new column called count\nThe y column is the same as count\nBoth of these must happen for our plots above to have\ncount mapped to y.\nBut how does ggplot know to wait until count is\ncomputed to map it to y?\nafter_stat()\nThe technical details of how and where count is\ncomputed will be covered in the Part 2 of this\nseries. For now, all we need to know is that layers like\ngeom_bar() and geom_histogram() transforms the\nraw data that it receives to derive new values like\ncount.\nSo here’s the big idea: to allow aesthetics to be mapped to\ninternally calculated variables like count, there must be a\nway to delay some aesthetic mappings.\nMore specifically, when it comes to variables like\ncount, we need a way to reference it after\nthe statistical transformation has taken place\ninternally. As you might have guessed, we can do this with the\ntransparently-named function after_stat()! It takes one\nargument x which is captured as an expression, to be\nevaluated later.\nSo in our case, the fact that count is mapped to\ny after the statistical transformation takes place\nsuggests that y is mapped to\nafter_stat(count). This is indeed the case, and we can make\nthis default mapping explicit in the aes():\n\n\nggplot(penguins, aes(species, y = after_stat(count))) +\n geom_bar()\n\n\n\n\nThe expression inside after_stat() is evaluated in\nwhat’s called a data-masked\ncontext, much like in aes() or {dplyr}\nverbs like mutate() and summarize(). This is\nwhy you can just refer to count as if it were a variable\nname and after_stat() will grab that column vector from the\ntransformed data.\nBut now we’re faced with another question: what is the\ntransformed data that after_stat() looks for variables\nin? We know it’s a version of our original data, but what does\nthe output actually look like?\nTo help us answer this question I’m going to introduce a function\ncalled inspect_after_stat(). I’ll leave the code here for\nreproducibility purposes but it’s supposed to be a bit mysterious right\nnow, so don’t read too much into it!\ninspect_after_stat\nNo dependency version:\n\n\n#' Inspect a layer's data after computation by the Stat\n#'\n#' @param p A `ggplot` object\n#' @param i An integer. The position of the layer's data to return.\n#'\n#' @return A dataframe\ninspect_after_stat <- function(p, i = 1L) {\n ._env <- environment()\n .out <- NULL\n suppressMessages({\n trace(\n what = \"ggplot_build.ggplot\",\n tracer = substitute(assign(\".out\", data[[i]], envir = ._env), ._env),\n at = 19L,\n print = FALSE,\n where = asNamespace(\"ggplot2\")\n )\n })\n ggplot_build(p)\n suppressMessages({\n untrace(\"ggplot_build.ggplot\", where = asNamespace(\"ggplot2\"))\n })\n .out\n}\n\n\n\nTechnical details:\ninspect_after_stat() wraps around\nbase::trace(), a debugging function which can inject\nexpressions to be evaluated at a specific position of a function when it\nis called. The function being traced is the S3 method\nggplot_build() as defined for class\n. The function evaluates and logs the value\nof data[[i]] at Step 19 of\nggplot_build.ggplot. This is the value of the\nith layer’s data right after the statistical transformation\nhappens in the preceding step, Step 18:\n\n\n# https://github.com/tidyverse/ggplot2/blob/main/R/plot-build.r#L72\nbody(ggplot2:::ggplot_build.ggplot)[[18]]\n\n\n data <- by_layer(function(l, d) l$compute_statistic(d, layout))\n\nPart 2 will explore what happens here in more\ndetail.\nIf you’re interested in more functions like\ninspect_after_stat() that allow you to interact with ggplot\ninternals, take a look at my package {ggtrace}!\n\n\n# {ggtrace} version, not run\ninspect_after_stat <- function(p, i = 1L) {\n out <- ggtrace::ggtrace_inspect_vars(\n x = p,\n method = ggplot2:::ggplot_build.ggplot,\n at = 19L,\n vars = \"data\"\n )\n out[[1]][[i]]\n}\n\n\n\nWe’ll use inspect_after_stat() to “peak inside” ggplot\ninternals and inspect the state of a layer’s data after the statistical\ncomputation has taken place.\nLet’s start with a simple example of geom_bar():\n\n\np_bar <- ggplot(penguins, aes(species)) +\n geom_bar(aes(fill = species))\np_bar\n\n\n\n\nWhen we inspect the state of the data for the first (and only) layer\nof our plot after the statistical transformation, it looks like the\nfollowing:\n\n\ninspect_after_stat(p_bar)\n\n\n count prop x width flipped_aes fill PANEL group\n 1 146 1 1 0.9 FALSE Adelie 1 1\n 2 68 1 2 0.9 FALSE Chinstrap 1 2\n 3 119 1 3 0.9 FALSE Gentoo 1 3\n\nAt this stage, the data is collapsed such that each row contains\ninformation about one bar.\nNotice how the statistical transformation done by\ngeom_bar() is basically just\ndplyr::count():\n\n\nlibrary(dplyr)\npenguins %>% \n count(species)\n\n\n # A tibble: 3 x 2\n species n\n \n 1 Adelie 146\n 2 Chinstrap 68\n 3 Gentoo 119\n\nIt’s {dplyr} all the way down!2\nIf we take the notion of delayed aesthetic evaluation seriously, then\nall columns in the dataframe returned by\ninspect_after_stat() should be accessible with\nafter_stat(). Indeed this is the case.\nTo demonstrate, I can map y to each variable in the\ntransformed data, converting to numeric as needed to conform to the\nplot’s continuous y scale.\n\n\n# Plots are printed left -> right and top -> bottom\n# in order of `matrix(1:8, ncol = 2, byrow = TRUE)`\np_bar + aes(y = after_stat(count))\np_bar + aes(y = after_stat(prop))\np_bar + aes(y = after_stat(x))\np_bar + aes(y = after_stat(width))\np_bar + aes(y = after_stat(as.numeric(flipped_aes)))\np_bar + aes(y = after_stat(nchar(as.character(fill))))\np_bar + aes(y = after_stat(as.numeric(PANEL)))\np_bar + aes(y = after_stat(group))\n\n\n\n\nBefore I get sacked, I should add that just because you can,\ndoesn’t mean you should! More specifically,\nafter_stat() should only really be used to access variables\ncomputed from the stat stage like count\nand prop (we’ll go over how to tell what variables are from\nthe stat stage in Part 2). The\nstat transformation is just one of many transformations\nthat the data goes through to become drawing-ready, and so the state of\nthe “after stat” data also carries the output of other processes that\ncame before it like facet-ing and positional\nscale transformations which we shouldn’t touch.\nAnd in fact, many more transformations happen after the\nstat stage as well, so it’s neither the first\nnor the last thing to happen to the data. Using\nlayer_data() again, we see that the final form of the data\nbuilds on our data from that we saw from the stat\nstage:\n\n\nlayer_data(p_bar)\n\n\n fill y count prop x flipped_aes PANEL group ymin ymax xmin xmax colour\n 1 #F8766D 146 146 1 1 FALSE 1 1 0 146 0.55 1.45 NA\n 2 #00BA38 68 68 1 2 FALSE 1 2 0 68 1.55 2.45 NA\n 3 #619CFF 119 119 1 3 FALSE 1 3 0 119 2.55 3.45 NA\n size linetype alpha\n 1 0.5 1 NA\n 2 0.5 1 NA\n 3 0.5 1 NA\n\nLastly, note that while additional variables in\nlayer_data() like size is technically\nalso from after the stat stage, they aren’t accessible\nin the after_stat():\n\n\np_bar + aes(y = after_stat(size))\n\n\n Error in after_stat(size): object 'size' not found\n\nThis is because after_stat() references a\nsnapshot of the data right after the stat\ntransformation, which is the data we saw with\ninspect_after_stat(), repeated below. Notice how the\nsize column is not yet present at this stage:\n\n\ninspect_after_stat(p_bar)\n\n\n count prop x width flipped_aes fill PANEL group\n 1 146 1 1 0.9 FALSE Adelie 1 1\n 2 68 1 2 0.9 FALSE Chinstrap 1 2\n 3 119 1 3 0.9 FALSE Gentoo 1 3\n\nAnother practice with\nprop\nLet’s look at another example of after_stat() to drive\nthe point home.\nYou might have noticed in the after stat data how the internally\ncomputed prop column just has the value 1, which isn’t too\ninformative when we plot it:\n\n\nggplot(penguins, aes(species)) +\n geom_bar(aes(y = after_stat(prop)))\n\n\n\n\nWhat is prop and when does it ever take on a different\nvalue?\nIf you look at the documentation for geom_bar() under\nthe section “Computed variables”, you will see the following\ndescriptions for the two computed variables count and\nprop:\ncount: number of points in the bin\nprop: groupwise proportion\nThe value of prop was 1 for p_bar because\neach bar was assigned a unique group as indicated by the values of the\ngroup column, so of course the computation of\ngroupwise proportion was uninteresting!\nIf we want to change that, we’d need to explicitly specify a\ngroup-ing that creates fewer groups than the number of bars\npresent.\nIn this example, we have geom_bar() make a bar for each\ncategory of species as mapped to x, but\ninternally represent the three bars as belonging to one of two groups -\nthe “Adelie group” and the “not-Adelie group”:\n\n\np_bar_prop <- ggplot(penguins, aes(species)) +\n geom_bar(\n aes(\n y = after_stat(prop),\n group = species == \"Adelie\" # Assign the 3 bars into 2 groups\n )\n )\np_bar_prop\n\n\n\n\nIt doesn’t look like there are two groups underlyingly, but\nlet’s see what happens when we inspect the data transformed by the\nStat:\n\n\ninspect_after_stat(p_bar_prop)\n\n\n count prop x width flipped_aes group PANEL\n 1 68 0.3636364 2 0.9 FALSE 1 1\n 2 119 0.6363636 3 0.9 FALSE 1 1\n 3 146 1.0000000 1 0.9 FALSE 2 1\n\nWe see only two groups (values of group are either 1 or\n2), as we expect. And where two bars (rows) belong to the same group,\nthe values of prop add up to 1!\nOur p_bar_prop has grey fill for all bars that make the\ngrouping structure visually ambiguous, but that was for expository\npurposes only. If we wanted to fill the bars by group, we could do it in\nat least two ways.\nA hacky way that is 100% not recommended but possible is to mark the\ngroupings post-hoc, by grabbing the group variable in the\nafter_stat() and mapping that to something like\nfill:\n\n\np_bar_prop +\n aes(fill = after_stat(as.factor(group)))\n\n\n\n\nThis is dangerous not only because it adds an unnecessary line after\ngeom_bar(), but also because it uses + aes()\nto map an after_stat() variable globally. You\nshould avoid this because different layers do different statistical\ntransformations (we’ll discuss this shortly in the next section), and\nthat can lead to surprising behaviors like this:3\n\n\np_bar_prop +\n aes(fill = after_stat(as.factor(group))) +\n # This next line does something unexpected to the fill!\n geom_label(aes(x = 2, y = 0.5, label = \"Hello\"), stat = \"unique\")\n\n\n\n\nA safer, more principled way is to go back inside the\naes() for the bar layer and visually mark the underlying\ngrouping there:\n\n\np_bar_prop_group_fill <- ggplot(penguins, aes(species)) +\n geom_bar(\n aes(\n y = after_stat(prop),\n group = species == \"Adelie\",\n fill = species == \"Adelie\" #< Here!\n )\n )\np_bar_prop_group_fill\n\n\n\n\nFor a quick experiment, note what happens if you only map\nspecies == \"Adelie\" to fill:\n\n\np_bar_prop_only_fill <- ggplot(penguins, aes(species)) +\n geom_bar(\n aes(\n y = after_stat(prop),\n # group = species == \"Adelie\",\n fill = species == \"Adelie\"\n )\n )\np_bar_prop_only_fill\n\n\n\n\nAs you might have guessed, we get a different plot because the\nunderlying grouping structure is different. See how this does not change\nthe value of count (absolute) but does change the value of\nprop (relative):4\n\n\n# 2 groups present\ninspect_after_stat(p_bar_prop_group_fill)\n\n\n count prop x width flipped_aes fill group PANEL\n 1 68 0.3636364 2 0.9 FALSE FALSE 1 1\n 2 119 0.6363636 3 0.9 FALSE FALSE 1 1\n 3 146 1.0000000 1 0.9 FALSE TRUE 2 1\n\n# 3 groups present\ninspect_after_stat(p_bar_prop_only_fill)\n\n\n count prop x width flipped_aes fill PANEL group\n 1 68 1 2 0.9 FALSE FALSE 1 1\n 2 119 1 3 0.9 FALSE FALSE 1 2\n 3 146 1 1 0.9 FALSE TRUE 1 3\n\nLet’s conclude this section by putting the issue of practicality back\non the table.\nThe most common usecase for prop is to use it in\nconjunction with something like group = 15 to\n“normalize” the y-scale within each facet. This is useful when the data\nhas unbalanced samples across panels but you want to emphasize the\nrelative distribution of categories in the x-axis within each\npanel:\n\n\n# Absolute value `count` mapped to `y`\nggplot(penguins, aes(species)) +\n geom_bar(aes(y = after_stat(count))) +\n facet_grid(~ island, scales = \"free_x\", space = \"free\")\n\n\n\n# Relative value `prop` mapped to `y`\nggplot(penguins, aes(species)) +\n geom_bar(aes(y = after_stat(prop), group = 1)) +\n facet_grid(~ island, scales = \"free_x\", space = \"free\")\n\n\n\n\nWe see that aes(group = 1) had the effect of making sure\nthat prop adds up to 1 within each facet because\nall bars in a panel share the same group.6\nBaby steps to the Stat\nggproto\nSame geom, different stats\nOne thing that you might have noticed throughout this exercise is\nthat it’s not transparent from the name geom_bar() that\nthis layer is going to compute internal variables like\ncount and prop and use a default y-aesthetic\nof after_stat(count).\nThis is responsible for one of the most enduring points of confusion\nin ggplot: the difference between geom_col() and\ngeom_bar().\nWhether you call the rectangle a “column” or a “bar” sounds like\nsilly semantics, so why do they behave so differently?\n\n\nggplot(penguins, aes(species)) +\n geom_col() # but works with `geom_bar()`\n\n\n Error in `check_required_aesthetics()`:\n ! geom_col requires the following missing aesthetics: y\n\npenguins %>% \n count(species) %>% \n ggplot(aes(species, n)) +\n geom_bar() # but works with `geom_col()`\n\n\n Error in `f()`:\n ! stat_count() can only have an x or y aesthetic.\n\nYou get a small hint in the error message for the\ngeom_bar() example - it complains that\nstat_count() can only have an x or y aesthetic.\nBut we’ve never said anything about stat_count() - all\nwe have is a layer for geom_bar()!\nWell it turns out that geom_bar() and\nstat_count() are two sides of the same coin. Both these\nfunctions return a layer that has a “bar” geom\na “count” stat.7\n\n\ngeom_bar()\n\n\n geom_bar: width = NULL, na.rm = FALSE, orientation = NA\n stat_count: width = NULL, na.rm = FALSE, orientation = NA\n position_stack\n\nstat_count()\n\n\n geom_bar: na.rm = FALSE, orientation = NA, width = NULL\n stat_count: na.rm = FALSE, orientation = NA, width = NULL\n position_stack\n\nWith that in mind, let’s return to the code for our penguin species\nbar plot:\n\n\nggplot(penguins, aes(species)) +\n geom_bar()\n\n\n\n\nWe’ve said that geom_bar(), by default, has a mapping of\naes(y = after_stat(count)), so let’s make that explicit\nagain:\n\n\nggplot(penguins, aes(species)) +\n geom_bar(\n aes(y = after_stat(count))\n )\n\n\n\nAnd as we just saw, geom_bar() also uses a “count” stat\nby default, so let’s make that explicit as well:\n\n\nggplot(penguins, aes(species)) +\n geom_bar(\n aes(y = after_stat(count)),\n stat = \"count\"\n )\n\n\n\nHopefully the pieces are starting to come together: the\ncomputation of the internal variable count isn’t about the\n“bar” geom - it’s about the “count” stat.\nSo geom_col() errors with after_stat(count)\nnot because it’s drawing something different (it’s not!),\nit’s just calculating something different. In more technical\nterms, geom_col() doesn’t use stat = \"count\"\nlike geom_bar() does. It uses\nstat = \"identity\" instead, which does an “identity”\ntransformation on the data (i.e., leaves it alone):\n\n\ngeom_col()\n\n\n geom_col: width = NULL, na.rm = FALSE\n stat_identity: na.rm = FALSE\n position_stack\n\nTherefore, no new variables get calculated inside\ngeom_col(). In fact, the data before and after the\nstatistical transformation look nearly identical:\n\n\npenguins_counted <- penguins %>% \n count(species)\n\n# The raw data\npenguins_counted\n\n\n # A tibble: 3 x 2\n species n\n \n 1 Adelie 146\n 2 Chinstrap 68\n 3 Gentoo 119\n\n\n\np_col <- penguins_counted %>% \n ggplot(aes(species, n)) +\n geom_col()\n\n# The data after transformation\ninspect_after_stat(p_col)\n\n\n x y PANEL group\n 1 1 146 1 1\n 2 2 68 1 2\n 3 3 119 1 3\n\nWhat’s more, by overriding the stat, you can make\ngeom_bar() behave like geom_col():\n\n\np_bar_identity <- penguins_counted %>% \n ggplot(aes(species, n)) +\n geom_bar(stat = \"identity\") # `geom_bar()` no longer errors\n\n# The data after transformation\ninspect_after_stat(p_bar_identity)\n\n\n x y PANEL group\n 1 1 146 1 1\n 2 2 68 1 2\n 3 3 119 1 3\n\nIn sum, the variables that are available in the\nafter_stat() is determined by what kind of\nstat a layer uses. And as we just saw, even if a layer is\ncalled a geom_*() it still has a stat. In\nfact, every layer has a stat and a geom, even though\nthe functions that generate these layers have names like\nstat_*() and geom_*().\nIt should be noted that the choice of a default stat for\ngeom_*() layers can range from more transparent\n(like the pairing of geom = \"smooth\" and\nstat = \"smooth\" in geom/stat_smooth()) to\nless transparent (like\ngeom_col(stat = \"identity\")). But it’s also worth keeping\nin mind that all defaults are carefully curated by the developers to\nimprove user experience.8\nSame stat, different geoms\nIf all layers have a stat and a geom, and if geom_*()\nlayers have default stats, then stat_*() layers must also\nhave defaults geoms.\nIndeed, this is the case. Remember the stat_count()\nlayer from the error message earlier? Because it uses the “count”\nstat and has a default geom of “bar”,\nit works like geom_bar() right out of the box:\n\n\nggplot(penguins, aes(species)) +\n stat_count()\n\n\n\n\nAnd here are the defaults of stat_count() spelled\nout:\n\n\nggplot(penguins, aes(species)) +\n stat_count(\n aes(y = after_stat(count)), # default mapping to `y`\n geom = \"bar\" # default `geom`\n )\n\n\n\nYou might be wondering: why are there two ways of doing the same\nthing? Some people think its a design mistake but I appreciate the\noption because it can improve readability.\nFor example, we talked about how we can supply a different\nstat to a geom_*() layer like\ngeom_bar(stat = \"identity\"). Actually, we can\noverride the default geom/stat for any layer as long as we\nsatisfy the required set of aesthetic mappings for that\nlayer.\nThis allows us to do some nifty things like setting the\nstat of geom_text() to “count”, and then\nmapping the internally computed variable count to the\ny and label aesthetics to show the number of\ncounts at the top of each bar:\n\n\nggplot(penguins, aes(species)) +\n geom_bar() +\n geom_text(\n aes(y = after_stat(count), label = after_stat(count)),\n stat = \"count\", vjust = -.4 # little vertical nudging with `vjust`\n )\n\n\n\n\nBut notice that you can do the same with two calls to\nstat_count():\n\n\nggplot(penguins, aes(species)) +\n stat_count(geom = \"bar\") +\n stat_count(\n aes(label = after_stat(count)),\n geom = \"text\", vjust = -.4\n )\n\n\n\n\nI personally like the second style using stat_count()\nfor two reasons:\nWhen I read stat_count(), I know going in that the\nvariable count is going to be available in the\nafter_stat() for that layer, so I’m not surprised to see\nlabel = after_stat(count) in the aes() of the\ntext layer.\nIt makes the code tell a coherent story - the theme of this plot\nis about visualizing counts, and what gets drawn to visualize\ncounts is secondary to that overarching goal. This happens to also\naligns pretty well with my actual data viz thought process.\nWhich do you prefer?\nSneak peak of Part 2\nI’ve tried my best to split up this topic of delayed aesthetic\nevaluation to keep it more manageable, so hopefully this wasn’t too\noverwhelming!\nPart 2 will pick up where we left off by exploring\nwhen, where, and how the data actually\nundergoes the statistical transformation. We will do a lot more of\n“peaking inside” the internals similar to what we did with our custom\nfunction inspect_after_stat(), except we will use functions\nfrom my package {ggtrace}\nto interact with ggplots internals. It’s a package partly designed as a\npedagogical tool for folks in early stages of the user-to-developer\ntransition, and I’m very excited to showcase its potentials!\nOne last note: Part 2 will introduce the ggproto\nsystem which powers ggplot internals. It’s notoriously difficult to\ngrasp, and has a big learning curve even if you’re an experienced\nuser of ggplot. But fear not, because we already got a little\ntaste of ggproto! Remember how we were just passing strings to the\ngeom and stat arguments of layer functions?\nWell that was actually just a shorthand for specifying ggproto objects\nlike StatCount and GeomBar:\n\n\nggplot(penguins, aes(species)) +\n geom_bar(stat = StatCount) # same as `stat = \"count\"`\n\nggplot(penguins, aes(species)) +\n stat_count(geom = GeomBar) # same as `geom = \"bar\"`\n\n\n\nWe will talk more about these ggproto objects, especially\nStat ggprotos like StatCount.\nSession Info\n\n\nsessionInfo()\n\n\n R version 4.1.1 (2021-08-10)\n Platform: x86_64-w64-mingw32/x64 (64-bit)\n Running under: Windows 10 x64 (build 19044)\n \n Matrix products: default\n \n locale:\n [1] LC_COLLATE=English_United States.1252 \n [2] LC_CTYPE=English_United States.1252 \n [3] LC_MONETARY=English_United States.1252\n [4] LC_NUMERIC=C \n [5] LC_TIME=English_United States.1252 \n \n attached base packages:\n [1] stats graphics grDevices utils datasets methods base \n \n other attached packages:\n [1] dplyr_1.0.8 ggplot2_3.3.5\n \n loaded via a namespace (and not attached):\n [1] highr_0.8 pillar_1.6.4 bslib_0.3.1 \n [4] compiler_4.1.1 jquerylib_0.1.4 tools_4.1.1 \n [7] digest_0.6.28 downlit_0.4.0 jsonlite_1.7.2 \n [10] evaluate_0.14 memoise_2.0.0 lifecycle_1.0.1 \n [13] tibble_3.1.6 gtable_0.3.0 pkgconfig_2.0.3 \n [16] rlang_1.0.2 DBI_1.1.2 cli_3.1.1 \n [19] rstudioapi_0.13 distill_1.3 yaml_2.2.1 \n [22] xfun_0.29 fastmap_1.1.0 withr_2.4.2 \n [25] stringr_1.4.0 knitr_1.37 generics_0.1.0 \n [28] vctrs_0.3.8 sass_0.4.0 systemfonts_1.0.3 \n [31] tidyselect_1.1.1 grid_4.1.1 glue_1.6.1 \n [34] R6_2.5.1 textshaping_0.3.6 fansi_1.0.2 \n [37] rmarkdown_2.11 farver_2.1.0 purrr_0.3.4 \n [40] magrittr_2.0.1 palmerpenguins_0.1.0 scales_1.1.1 \n [43] htmltools_0.5.2 ellipsis_0.3.2 assertthat_0.2.1 \n [46] colorspace_2.0-2 labeling_0.4.2 ragg_1.2.0 \n [49] utf8_1.2.2 stringi_1.7.5 munsell_0.5.0 \n [52] cachem_1.0.1 crayon_1.4.2\n\n\nI introduce layer_data()\nin a previous\nblog post on stat layers↩︎\nOkay, not quite all the way\ndown because you still have to make and draw graphical objects that\nactually make up the figure, but this is a substantial part of the\ninternals and you get the point.↩︎\nThe label geom, by virtue of not\nhaving any groupings specified, is assigned a group value of -1 (see\nggplot2:::NO_GROUP). This introduces another group category\nto the pre-existing groups 1 and 2 created by the bar layer, and so this\nhas consequences for when the fill scale steps in to derive values of\nfill from the levels of group. ↩︎\nNote also how group\nappears in different column positions. When group is\nexplicitly supplied, it appears to the left of\nPANEL, and when group is derived it appears to\nthe right of PANEL. Where and how does\nthis happen? I’ll leave this question as an exercise (hint: look at\nLayer$compute_aesthetics and\nggplot2:::add_group()). ↩︎\nYou can replace 1 with\nany length-1 value - the point is to just not create any groupings.↩︎\nWhy not across facets?\nThat’s a good question and has to do with how by-panel and by-group\ncomputations are done inside ggplot. We’ll look at this more in detail\nin Part 2!↩︎\nAs well as a “stack”\nposition, but we won’t go into that in this blog post\nseries.↩︎\nDon’t you appreciate how you can just\nuse geom_histogram() to draw a histogram without having to\nthink about stat = \"bin\" every time?↩︎\n", "preview": "posts/2022-03-10-ggplot2-delayed-aes-1/preview.png", - "last_modified": "2024-06-09T03:55:07-04:00", + "last_modified": "2024-06-09T00:55:07-07:00", "input_file": {}, "preview_width": 1457, "preview_height": 872 @@ -299,7 +299,7 @@ ], "contents": "\n\nContents\nIntroduction\nDisclaimer\nfor {showtext} and {extrafont} users\nSetting up {ragg}\n1. Rendering in RStudio\nplot pane\n2. Saving as an external\nfile\n3. Rmarkdown\n4. Quarto\n5. Shiny\n\nInstalling custom fonts\nFinding the right file\nChecking\nthat a font is installed and available\n\nDebugging custom fonts\nHoisting font styles\nAdvanced font features\nLining\nOrdinals\n\nUsecases\nA mash-up\nIcon fonts\nWTF?\nMore by others\n\nSession info\n\n\nThis blog post was featured in the R\nWeekly highlights podcast! Thanks to the R Weekly team for the\ncomprehensive review!\n\nIntroduction\nGetting custom fonts to work in R has historically\nbeen pretty difficult.1 At a high level, it involves the\nnon-trivial task of unifying the graphics device, the\noperating system, and text rendering, (and also in our\ncase, R!) to seamlessly work with each other.\nLuckily for us in post-2021, we have an amazing solution to this problem\nthanks to recent developments in the {ragg},\n{systemfonts}, and {textshaping} packages by\nRStudio. This is great news because a lot of the work for getting custom\nfonts to work in R is already done for us!\nIn this blog post, I’ll start with the basics of setting up\ncustom fonts followed by a walkthrough of the font\ndebugging workflow, concluding with some practical\nuse-cases that can spice up your typography game for data\nviz. \nDisclaimer for\n{showtext} and {extrafont} users\nMany people are familiar with {showtext} and {extrafont}, and may\neven have existing workflows centered around those packages. In this\nblog post, there is no “installing” of fonts of any sort within the\nR session. The magic here is that {ragg} is designed to directly\naccess the fonts installed on your system (with some caveats which I’ll\ndiscuss later). This issue of the graphics device is\nindependent of the kind of things that {showtext} and\n{extrafont} does, which is why they’re not relevant here.\nSo if you want to adopt this font workflow don’t\nuse {showtext} and don’t\nuse {extrafont}.\nSetting up {ragg}\nThe first thing you should do, if you haven’t already, is to install\n{ragg}, {systemfonts}, and {textshaping}.\n\n\ninstall.packages('ragg')\ninstall.packages('systemfonts')\ninstall.packages('textshaping')\n\n\n\nNext, we want to make sure that whenever we output a plot2, we do so using the\nAGG graphics device (that’s the “agg” part of\n“ragg”).\nThere are a couple places where this is relevant:\n1. Rendering in RStudio plot\npane\nFor RStudio >= 1.4, go to Tools > Global\nOptions > General > Graphics and set the\nBackend to AGG.\n\n\n\nFigure 1: Where to set AGG as the graphic device for RStudio - image from\nhttps://ragg.r-lib.org\n\n\n\n2. Saving as an external file\nFor bitmap output, use any of the ragg::agg_*() function\nto render plots using the AGG device.\n\n\n# Set output path\npngfile <- here::here(\"img\", \"my_plot.png\")\n# Initialize device\nragg::agg_png(\n pngfile, width = 10, height = 6, units = \"in\",\n res = 300, scaling = 3\n)\n# Plot\nplot(hist(mtcars$mpg))\n# Close device\ninvisible(dev.off())\n\n\n\nFor ggplot2 figures: as of the new ggplot2\nv3.3.4 release (released 06-16-2021), ggsave()\nautomatically defaults to rendering the output using\nagg_*() devices!\n\nOld disclaimer for {ggplot2} < v3.3.4\n\nThis long-winded way works for any plot, but if you use\n{ggplot2} and ggplot2::ggsave() a lot, you\nmight wonder whether you can just pass in ragg::agg_png()\ninto the device argument and specify the arguments in\nggsave() instead. This turns out to actually not be so\nstraightforward, but will likely be patched in the next update\n(v3.3.4?). 3\n3. Rmarkdown\nTo render figures with {ragg} in knitted files, pass in\na ragg device has res and units specified to\nthe dev argument of knitr::chunk_opts$set() at\nthe top of the script.4\n\nragg_png <- function(..., res = 150) {\n ragg::agg_png(..., res = res, units = \"in\")\n}\nknitr::opts_chunk$$set(dev = \"ragg_png\")\n\nFor rmarkdown chunks that are executed inline (i.e., figures under\ncode chunks), there’s unfortunately no straightforward solution to get\nthem rendered with ragg. My current suggestion is to set chunk output\noption to “Chunk Output in Console”, instead of “Chunk Output inline”\nunder the gear icon next to the knit button in the rmarkdown\ntoolbar.\nIf you’re a diehard fan of inline-ing your plots as you work with\nyour rmarkdown document, keep an eye out on issue #10412\non the RStudio IDE github repo. If you want a hacky workaround in the\nmeantime, try out some of the suggestions from issue\n#9931.\n4. Quarto\nIn quarto, you can set the custom ragg_png device\n(defined above) in the YAML, like so:\n\n knitr:\n opts_chunk: \n dev: \"ragg_png\"\n\nThis essentially calls\nknitr::opts_chunk$set(dev = \"ragg_png\").\nNote that the dev argument does not go under\nexecute, which instead controls chunk execution options\nlike echo and eval.\n5. Shiny\nSimply set options(shiny.useragg = TRUE) before\nrendering. Also check out the {thematic} package for\nimporting/using custom fonts in shiny plot outputs.\nInstalling custom fonts\nNow that you have {ragg} and {systemfonts}\ninstalled, take it for a spin with a custom font! When you’re rendering\nplots using {ragg}, custom fonts should just work\nas long as you have them installed on your local machine.\nIf you haven’t really worked with custom fonts before, “installing a\ncustom font” simply means finding the font file on the internet,\ndownloading it, and drag-and-drop into a special folder on your local\nmachine. It’s something like Network/Library/Fonts for Macs and\nMicrosoft/Windows/Fonts for Windows. There can actually be a\nbit more to this process5, so make sure to google and check\nthe process for installing fonts on your machine.\nFinding the right file\nFont files come in many forms. In general, fonts files that match\nthese two criteria tend to work the best:\nFonts in .otf (OpenType Font) or .ttf\n(TrueType Font) formats. These are font formats that are\ninstallable on your local machine. You want to avoid other formats like\n.woff or .woff2, for example, which are designed for\nuse for the web. In theory both .otf and .ttf should\nwork with {ragg}, though I’ve sometimes had trouble with\n.otf. In those cases, I simply converted the .otf font\nfile to .ttf before installing it, using free online conversion\ntools that you can easily find on Google. I’m of course glossing over\nthe details here and I’m hardly an expert, but you can read\nmore about TrueType and OpenType formats here.\nStatic fonts. In static fonts, each\nmember of the family has their own set of glyphs\n(i.e., there is a font file for each style). This is in\ncontrast to variable fonts, where you have a single font file\nwhich can take the form of multiple styles (either by having many sets\nof glyphs or variable parameters).6 To illustrate, look at\nthe difference between the static (top) vs. variable (bottom) files for\nthe Alegreya\nfamily.\n\n\n\nFigure 2: Static font files for Alegreya\n\n\n\n\n\n\nFigure 3: Variable font files for Alegreya\n\n\n\nWe see that static fonts are differentiated from variable fonts by\nhaving a distinct file for each style, like Alegreya-Black.ttf.\nOn the other hand, variable fonts usually say “variable” somewhere in\nthe file name, and are slightly larger in size than any individual\nstatic member. Note that not all fonts have both static and variable\nfiles, and not all static font files are .ttf (there can be\nstatic .otf and variable .ttf files).7\nThe above two images show the contents of the .zip file that\nyou’d get if you went to Google\nFonts (an awesome repository of free and open-source professional\nfonts) and clicked the Download family button on the page for Alegreya.\nIf you want to use the Alegreya font family (Open\nFont License8) in R, then you simply drag-and-drop\nall the static font files in /static into your system’s font\nfolder (or in Settings > Fonts for Windows 10).\nChecking that a\nfont is installed and available\nOnce you install a custom font on your system, it should also be\navailable elsewhere locally on your machine. For example, I can use\nAlegreya in Microsoft Word after I download it (this is actually my\nfirst go-to sanity check).\n\n\n\nFigure 4: Alegreya in Microsoft Word\n\n\n\nAnd by extension Alegreya should now be available for figures\nrendered with {ragg}. Let’s try using Alegreya in ggplot by\npassing it to the family argument of\ngeom_text()\n\n\nlibrary(ggplot2)\nggplot(NULL, aes(0, 0)) +\n geom_text(\n aes(label = \"The Alegreya font\"),\n size = 18, family = \"Alegreya\"\n )\n\n\n\n\nIt just works!\nMore specifically, it works because Alegreya is visible to\n{systemfonts}, which handles text rendering for\n{ragg}. If we filter list of fonts from\nsystemfonts::system_fonts(), we indeed find the 12 styles\nof Alegreya from the static .ttf files that we installed!\n\n\nlibrary(systemfonts)\nlibrary(dplyr)\nlibrary(stringr)\n\nsystem_fonts() %>% \n filter(family == \"Alegreya\") %>% \n transmute(\n family, style,\n file = str_extract(path, \"[\\\\w-]+\\\\.ttf$\")\n )\n\n\n # A tibble: 12 × 3\n family style file \n \n 1 Alegreya Black Italic Alegreya-BlackItalic.ttf \n 2 Alegreya Bold Alegreya-Bold.ttf \n 3 Alegreya Bold Italic Alegreya-BoldItalic.ttf \n 4 Alegreya ExtraBold Alegreya-ExtraBold.ttf \n 5 Alegreya ExtraBold Italic Alegreya-ExtraBoldItalic.ttf\n 6 Alegreya Italic Alegreya-Italic.ttf \n 7 Alegreya Medium Alegreya-Medium.ttf \n 8 Alegreya Medium Italic Alegreya-MediumItalic.ttf \n 9 Alegreya Regular Alegreya-Regular.ttf \n 10 Alegreya SemiBold Alegreya-SemiBold.ttf \n 11 Alegreya SemiBold Italic Alegreya-SemiBoldItalic.ttf \n 12 Alegreya Black Alegreya-Black.ttf\n\nDebugging custom fonts\nSo far we’ve seen that the workflow for setting up and installing\nfonts is pretty straightforward. But what do we do in times when things\ninevitable go wrong?\nConsider the case of using Font\nAwesome, an icon font that renders special character\nsequences as icon glyphs (check the Icon fonts section for more!).\nFont Awesome has a free version (CC-BY\nand SIL OFL license), and let’s say we want to use it for personal\nuse for a TidyTuesday\nsubmission.\nThe first thing we do is locate the font file. Font Awesome is open\nsource, and the free version (Font Awesome 5 Free) is updated on Github.\nThe most recent release as of this blog post is v5.15.3.\nIf you unzip the file, you’ll find .otf font files\ncorresponding to the three variants available in the free version:\nRegular, Solid, and Brands.\n\n\n\nFigure 5: Font Awesome 5 files\n\n\n\nRemember how I said R tends to play nicer with .ttf than\n.otf fonts?9 Lets go ahead and convert the\n.otf files using an online converter, like https://convertio.co/otf-ttf.\nNow, with the three font files in .ttf format, follow the\ninstructions for installing fonts on your OS.\nOnce Font Awesome is installed on our local machine, it should be\nvisible to {systemfonts}, like this:\n\n\nsystem_fonts() %>% \n filter(str_detect(family, \"Font Awesome 5\")) %>% \n transmute(\n family, style,\n file = stringr::str_extract(path, \"[\\\\w-]+\\\\.ttf$\")\n )\n\n\n # A tibble: 3 × 3\n family style file \n \n 1 Font Awesome 5 Free Solid Font-Awesome-5-Free-Solid-900.ttf \n 2 Font Awesome 5 Brands Regular Font-Awesome-5-Brands-Regular-400.ttf\n 3 Font Awesome 5 Free Regular Font-Awesome-5-Free-Regular-400.ttf\n\nNow let’s try plotting some icons!\nWe see that we can render icons from the Regular variant (“clock”)\nand the Brands variant (“twitter”).\n\n\n# Left plot\nggplot(NULL, aes(0, 0)) +\n geom_text(\n aes(label = \"clock\"),\n size = 50, family = \"Font Awesome 5 Free\"\n )\n\n# Right plot\nggplot(NULL, aes(0, 0)) +\n geom_text(\n aes(label = \"twitter\"),\n size = 50, family = \"Font Awesome 5 Brands\"\n )\n\n\n\n\nBut what about rendering in the Solid variant? Font Awesome tells me\nthat the Solid variant has a “cat”\nicon, so let’s try it.\n\n\nggplot(NULL, aes(0, 0)) +\n geom_text(aes(label = \"cat\"), size = 50, family = \"Font Awesome 5 Solid\")\n\n\n\n\nUh oh, that didn’t work. Well that’s because Solid is actually a\nstyle, not a family! If you go back to the output from\nsystem_fonts(), we see that Font Awesome actually consists\nof two font families: Font Awesome 5 Brands\nwhich has a “Regular” style, and Font Awesome 5 Free\nwith a “Regular” style and a “Solid” style.\nThe structure is roughly like this:\n\n Font Awesome 5 Free\n |--- Regular\n |--- Solid\n Font Awesome 5 Brands\n |--- Regular\n\nIn geom_text(), the font style is set by the\nfontface argument. When we don’t specify\nfontface, such as in our working example for the clock and\ntwitter icons, it defaults to the Regular style.10\nSo the solution to our problem is to put in\nfontface = \"Solid\", right…?\n\n\nggplot(NULL, aes(0, 0)) +\n geom_text(\n aes(label = \"cat\"), size = 50,\n family = \"Font Awesome 5 Free\", fontface = \"solid\"\n )\n\n\n Error in FUN(X[[i]], ...): invalid fontface solid\n\nWell now it just errors!11 The issue here runs a\nbit deeper: if we track down the error,12\nit takes us to a function inside grid::gpar() that\nvalidates fontface. 13 If we take a look at\nthe code, we see that only a very few font styles are valid, and “solid”\nisn’t one of them.\n\n\nfunction (ch) \nswitch(ch, plain = 1L, bold = 2L, italic = , oblique = 3L, bold.italic = 4L, \n symbol = 5L, cyrillic = 5L, cyrillic.oblique = 6L, EUC = 7L, \n stop(\"invalid fontface \", ch))\n\n\n\nOkay, so then how can we ever access the Solid style of the Font\nAwesome 5 Free family? Luckily, there’s a solution: use\nsystemfonts::register_font() to register the Solid style as\nthe “plain” style of its own font family!\nWe can do this by passing in the name of the new font family in the\nname argument, and passing the path of the font file to the\nplain argument.\n\n\nfa_solid_path <- system_fonts() %>% \n filter(family == \"Font Awesome 5 Free\", style == \"Solid\") %>% \n pull(path)\n\nsystemfonts::register_font(\n name = \"Font Awesome 5 Free Solid\",\n plain = fa_solid_path\n)\n\n\n\nTo check if we were successful in registering this new font variant,\nwe can call systemfonts::registry_fonts() which returns all\nregistered custom fonts in the current session:\n\n\nsystemfonts::registry_fonts() %>% \n transmute(\n family, style,\n file = stringr::str_extract(path, \"[\\\\w-]+\\\\.ttf$\")\n )\n\n\n # A tibble: 4 × 3\n family style file \n \n 1 Font Awesome 5 Free Solid Regular Font-Awesome-5-Free-Solid-900.ttf\n 2 Font Awesome 5 Free Solid Bold Font-Awesome-5-Free-Solid-900.ttf\n 3 Font Awesome 5 Free Solid Italic Font-Awesome-5-Free-Solid-900.ttf\n 4 Font Awesome 5 Free Solid Bold Italic Font-Awesome-5-Free-Solid-900.ttf\n\nWe see that the Solid style is now available as the Regular (a.k.a.\n“plain”) style of its own font family: Font Awesome 5 Free\nSolid!14\nNow we’re back to our cat icon example. Again, because Font Awewsome\nsays there’s a cat icon in the Solid style, we’d expect a cat icon\nif we render the text “cat” in the Solid style. Let’s set the\nfamily argument to our newly registered “Font Awesome 5\nFree Solid” family and see what happens:\n\n\nggplot(NULL, aes(0, 0)) +\n geom_text(aes(label = \"cat\"), size = 50, family = \"Font Awesome 5 Free Solid\")\n\n\n\n\nThird time’s the charm !!!\nHoisting font styles\nHopefully the lesson is now clear: to make a custom font work in R,\nthe font must be visible to\nsystemfonts::system_fonts() in a style that is accessible\nto grid::gpar(). The nifty trick of registering an\ninaccessible style as the “plain” style of its own family can\nbe extended and automated as a utility function that is called purely\nfor this side effect. In my experimental package, I\nhave very simple function called font_hoist()\nwhich “hoists”15 all styles of a family as\nthe “plain”/Regular style of their own families. This way, you never\nhave to worry about things going wrong in the fontface\nargument.\n\njunebug::font_hoist()\n\n\nfont_hoist <- function(family, silent = FALSE) {\n font_specs <- systemfonts::system_fonts() %>%\n dplyr::filter(family == .env[[\"family\"]]) %>%\n dplyr::mutate(family = paste(.data[[\"family\"]], .data[[\"style\"]])) %>%\n dplyr::select(plain = .data[[\"path\"]], name = .data[[\"family\"]])\n\n purrr::pwalk(as.list(font_specs), systemfonts::register_font)\n\n if (!silent) message(paste0(\"Hoisted \", nrow(font_specs), \" variants:\\n\",\n paste(font_specs$name, collapse = \"\\n\")))\n}\n\n\n\nLet’s apply this to our Alegreya family. As we saw earlier, it has 12\nstyles, but only 4 can be accessed by grid::gpar().16 But once we hoist the styles, we\ncan access them all!\n\n\n# install_github(\"yjunechoe/junebug\")\njunebug::font_hoist(\"Alegreya\")\n\n\n Hoisted 12 variants:\n Alegreya Black Italic\n Alegreya Bold\n Alegreya Bold Italic\n Alegreya ExtraBold\n Alegreya ExtraBold Italic\n Alegreya Italic\n Alegreya Medium\n Alegreya Medium Italic\n Alegreya Regular\n Alegreya SemiBold\n Alegreya SemiBold Italic\n Alegreya Black\n\n\n\n# Grab the newly registered font families\nalegreya_styles <- systemfonts::registry_fonts() %>% \n filter(str_detect(family, \"Alegreya\"), style == \"Regular\") %>% \n pull(family)\n\n# Render a plot for all 12 styles\npurrr::walk(\n alegreya_styles,\n ~ print(ggplot(NULL, aes(0, 0)) +\n geom_text(aes(label = .x), size = 14, family = .x))\n)\n\n\n\n\nBut note that the registration of custom font variants is not\npersistent across sessions. If you restart R and run\nregistry_fonts() again, it will return an empty data frame,\nindicating that you have no font variants registered. You have to\nregister font variants for every session, which is why it’s nice to have\nthe register_fonts() workflow wrapped into a function like\nfont_hoist().\nAdvanced font features\nBut wait, that’s not all!\nMany modern professional fonts come with OpenType\nfeatures, which mostly consist of stylistic parameters that can\nbe turned on-and-off for a font. Note that despite being called\n“OpenType” features, it’s not something unique to .otf font\nformats. TrueType fonts (.ttf) can have OpenType\nfeatures as well. For a fuller picture, you can check out the\nfull\nlist of registered features and this article\nwith visual examples for commonly used features.\nIt looks overwhelming but only a handful are relevant for data\nvisualization. I’ll showcase two features here: lining and\nordinals.\nLining\nOne of the most practical font features is lining, also\ncalled \"lnum\" (the four-letter feature tag), where all\nnumbers share the same height and baseline.17\nLet’s use our Alegreya font as an example again. By default, Alegreya\nhas what are called “old style” numbers, where number glyphs have\nascending and descending strokes which can make a string of numbers look\nunbalanced. Notice how the digits share different baselines here:\n\n\nggplot(NULL, aes(0, 0)) +\n geom_text(\n aes(label = \"123456789\"),\n size = 35, family = \"Alegreya\"\n )\n\n\n\n\nLuckily, Alegreya supports the “lining” feature. We know this because\nthe get_font_features() function from the\n{textshaping} package returns a lists of OpenType features\nsupported by Alegreya, one of which is “lnum”.\n\n\nlibrary(textshaping)\nget_font_features(\"Alegreya\")\n\n\n [[1]]\n [1] \"cpsp\" \"kern\" \"mark\" \"mkmk\" \"aalt\" \"c2sc\" \"case\" \"ccmp\" \"dlig\" \"dnom\"\n [11] \"frac\" \"liga\" \"lnum\" \"locl\" \"numr\" \"onum\" \"ordn\" \"pnum\" \"sinf\" \"smcp\"\n [21] \"ss01\" \"ss02\" \"ss03\" \"ss04\" \"ss05\" \"subs\" \"sups\" \"tnum\"\n\nTo access the lining feature, we can use the\nsystemfonts::register_variant() function, which works\nsimilarly to systemfonts::register_font(). The former is\nsimply a wrapper around the latter, and we use it here for convenience\nbecause “Alegreya” (as in, the default Regular style) is already\naccessible without us having to point to the font file.\nTo turn the lining feature on, we need to set the\nfeatures argument of register_variant() using\nthe helper function systemfonts::font_feature(). The full\ncode looks like this:\n\n\nsystemfonts::register_variant(\n name = \"Alegreya-lining\",\n family = \"Alegreya\",\n features = systemfonts::font_feature(numbers = \"lining\")\n)\n\n\n\nAnd again, we can see if the font variant was successfully registered\nby checking registry_fonts():\n\n\nregistry_fonts() %>% \n filter(family == \"Alegreya-lining\", style == \"Regular\") %>% \n transmute(\n family, style,\n features = names(features[[1]])\n )\n\n\n # A tibble: 1 × 3\n family style features\n \n 1 Alegreya-lining Regular lnum\n\nAnd that’s it! Let’s try rendering the numbers again with the\noriginal “Alegreya” font (top) and the new “Alegreya-lining” variant\n(bottom):\n\n\nggplot(NULL) +\n geom_text(\n aes(0, 1, label = \"123456789\"),\n size = 35, family = \"Alegreya\") +\n geom_text(\n aes(0, 0, label = \"123456789\"),\n size = 35, family = \"Alegreya-lining\"\n ) +\n scale_y_continuous(expand = expansion(add = 0.5))\n\n\n\n\nA subtle but noticeable difference!\nIf we want a font variant to have a mix of different style\nand OpenType features, we have to go back to\nregister_font() (where we register styles as their own\nfamilies by pointing to the files) and set the features\nargument there.\n\n\n# Get file path\nAlegreyaBlackItalic_path <- system_fonts() %>% \n filter(family == \"Alegreya\", style == \"Black Italic\") %>% \n pull(path)\n\n# Register variant\nregister_font(\n name = \"Alegreya Black Italic-lining\",\n plain = AlegreyaBlackItalic_path,\n features = font_feature(numbers = \"lining\")\n)\n\nggplot(NULL) +\n geom_text(\n aes(0, 1, label = \"123456789\"),\n size = 35, family = \"Alegreya Black Italic\"\n ) +\n geom_text(\n aes(0, 0, label = \"123456789\"),\n size = 35, family = \"Alegreya Black Italic-lining\"\n ) +\n scale_y_continuous(expand = expansion(add = 0.5))\n\n\n\n\nOrdinals\nOrdinals (or “ordn”) is a font feature which works almost like a\nsuperscript. It targets all lower case letters, and is intended\nfor formatting ordinals like 1st,\n2nd, 3rd.\nLet’s try it out!\nFirst, we check that “ordn” is supported for Alegreya:\n\n\n\"ordn\" %in% unlist(get_font_features(\"Alegreya\"))\n\n\n [1] TRUE\n\nThen, we register the ordinal variant. Note that “ordn” is not\nbuilt-in as an option for the letters argument of\nfont_features(), unlike “lnum” which is a built-in option\nfor the numbers argument.18\nTherefore, we have to set the “ordn” feature inside the ...\nof font_feature() with \"ordn\" = TRUE. And\nlet’s also simultaneously turn on the lining feature from before as\nwell.\n\n\n# Register variant\nregister_variant(\n name = \"Alegreya-lnum_ordn\",\n family = \"Alegreya\",\n features = font_feature(numbers = \"lining\", \"ordn\" = TRUE)\n)\n\n# Double check registration\nregistry_fonts() %>% \n filter(family == \"Alegreya-lnum_ordn\", style == \"Regular\") %>% \n pull(features)\n\n\n [[1]]\n ordn lnum \n 1 1\n\n\n\nggplot(NULL) +\n geom_text(\n aes(0, 1, label = \"1st 2nd 3rd 4th\"),\n size = 20, family = \"Alegreya\"\n ) +\n geom_text(\n aes(0, 0, label = \"1st 2nd 3rd 4th\"),\n size = 20, family = \"Alegreya-lnum_ordn\"\n ) +\n scale_y_continuous(expand = expansion(add = 0.5))\n\n\n\n\nAgain, it’s important to note that this targets all lower\ncase letters. So something like this renders awkwardly:\n\n\nggplot(NULL) +\n geom_text(\n aes(0, 0, label = \"June 16th 2021\"),\n size = 20, family = \"Alegreya-lnum_ordn\"\n )\n\n\n\n\nWe could turn “June” into all caps, but that still looks pretty\nugly:\n\n\nggplot(NULL) +\n geom_text(\n aes(0, 0, label = \"JUNE 16th 2021\"),\n size = 20, family = \"Alegreya-lnum_ordn\"\n )\n\n\n\n\nOne solution is to render the month in the Regular style and the rest\nin the ordinal variant.19 We can combine text in multiple\nfonts in-line with html syntax supported by geom_richtext()\nfrom {ggtext}. If you’re already familiar with\n{ggtext}, this example shows that it works the same for\nregistered custom font variants!\n\n\nlibrary(ggtext)\n\nformatted_date <- \"16th 2021<\/span>\"\n\nggplot(NULL) +\n geom_richtext(\n aes(0, 0, label = paste(\"June\", formatted_date)),\n size = 20, family = \"Alegreya\",\n fill = NA, label.color = NA\n )\n\n\n\n\nWhat’s extra nice about this is that while {ggtext}\nalready supports the html tag (which formats\ntext as superscript), it’s not as good as the ordinals font\nfeature. Look how the generic solution\n(top) doesn’t look as aesthetically pleasing in comparison:\n\n\nsups <- \"1st<\/sup> 2nd<\/sup> 3rd<\/sup> 4th<\/sup>\"\n\nggplot(NULL) +\n geom_richtext(\n aes(0, 1, label = sups),\n size = 25, family = \"Alegreya-lining\",\n fill = NA, label.color = NA\n ) +\n geom_text(\n aes(0, 0, label = \"1st 2nd 3rd 4th\"),\n size = 25, family = \"Alegreya-lnum_ordn\"\n ) +\n scale_y_continuous(expand = expansion(add = 0.5))\n\n\n\n\nIn my opinion, you should always err towards using the\nsupported font features because they are designed with the\nparticular aesthetics of the font in mind.20\nHopefully this example has convinced you!\nUsecases\nA mash-up\nHere’s a made up plot that mashes up everything we went over so\nfar:\n\n\n\n\nPlot Code\n\n\n# Setting up fonts (repeat from above)\njunebug::font_hoist(\"Font Awesome 5 Free\")\njunebug::font_hoist(\"Alegreya\")\n\nsystemfonts::register_variant(\n name = \"Alegreya-lining\",\n family = \"Alegreya\",\n features = systemfonts::font_feature(numbers = \"lining\")\n)\nsystemfonts::register_variant(\n name = \"Alegreya-lnum_ordn\",\n family = \"Alegreya\",\n features = systemfonts::font_feature(numbers = \"lining\", \"ordn\" = TRUE)\n)\n\n# labelling function for ordinal format\nordinal_style <- function(ordn) {\n function (x) {\n scales::ordinal_format()(as.integer(x)) %>% \n stringr::str_replace(\n \"([a-z]+)$\",\n stringr::str_glue(\"\\\\1<\/span>\")\n )\n }\n}\n\n# data\nset.seed(2021)\nordinal_data <- tibble(\n Quarter = as.factor(1:4),\n Earnings = c(9, 7, 6, 3) * 1e6\n) %>% \n arrange(desc(Earnings)) %>% \n mutate(\n Mood = c(\"smile-beam\", \"meh-blank\", \"meh\", \"dizzy\"),\n color = c(\"forestgreen\", \"goldenrod\", \"goldenrod\", \"firebrick\")\n )\n\n# plot\nggplot(ordinal_data, aes(Quarter, Earnings)) +\n geom_text(\n aes(label = Mood, color = color),\n size = 18, family = \"Font Awesome 5 Free Solid\"\n ) +\n scale_color_identity() +\n scale_y_continuous(\n name = NULL,\n labels = scales::label_dollar(),\n expand = expansion(0.3)\n ) +\n scale_x_discrete(\n labels = ordinal_style(\"Alegreya-lnum_ordn\")\n ) +\n labs(title = \"Quarterly Earnings\") +\n theme_classic() +\n theme(\n text = element_text(\n size = 14,\n family = \"Alegreya\"\n ),\n axis.text.x = ggtext::element_markdown(\n size = 18,\n color = \"black\",\n family = \"Alegreya-lining\"\n ),\n axis.text.y = element_text(\n size= 14,\n color = \"black\",\n family = \"Alegreya-lining\"\n ),\n axis.ticks.x = element_blank(),\n axis.title.x = element_text(\n size = 18,\n family = \"Alegreya Medium\"\n ),\n plot.title = element_text(\n size = 24,\n family = \"Alegreya Black\",\n margin = margin(b = 5, unit = \"mm\")\n )\n )\n\n\n\nIcon fonts\nIf this blog post was your first time encountering icon fonts in R,\nyou probably have a lot of questions right now about using them in data\nvisualizations. You can check out my lightning talk on icon\nfonts that I gave at RLadies Philly for a quick\noverview as well as some tips & tricks!\n\n\n\n\n\n\nSome extra stuff not mentioned in the talk:\n{ragg} supports the rendering of colored fonts like\nemojis,\nwhich also means that it can render colored icon\nfonts.21 Icons don’t often come in colors,\nbut one example is Google’s Material Icons font (Apache\n2.0 license), which has a Two\nTone style where icons have a grey fill in addition to a black\nstroke:22\n\n\nggplot(NULL, aes(0, 0)) +\n geom_text(\n aes(label = \"real_estate_agent\"), size = 80,\n family = \"Material Icons Two Tone\"\n ) +\n theme_classic()\n \n\n\n\nAll fonts based on SVG (pretty much the case for all icon fonts)\nshould work with {ragg} as long as you can get it installed\non your local machine. For example, the Bootstrap Icons font (MIT\nlicense) only come in .woff\nand .woff2 formats for web use, but it’s fundamentally just\na collection of SVGs, so it can be installed on your local machine once\nyou convert it to .ttf. Then it should just work right out of\nthe box.\n\n\nggplot(NULL, aes(0, 0)) +\n geom_text(\n aes(label = \"bootstrap-fill\"), color = \"purple\",\n size = 80, family = \"bootstrap-icons\"\n )\n \n\n\n\nIf you’re design oriented, you can also make your own\nicon font for use in R. In Inkscape, you can do this in File\n> New From Template > Typography Canvas (here’s a guide).\nOnce you save your SVG font, you can convert it to .ttf and\nfollow the same installation process, and then it should be available in\nR if you render with {ragg}.\n\n\n\nFigure 6: Making a font in Inkscape\n\n\n\nFor example, here’s my super quick attempt (took me exactly 1 minute)\nat a one-glyph font that just contains my signature (and you could\nimagine a usecase where you put this in a corner of your data viz to\nsign your work):\n\n\nggplot(NULL) +\n geom_text(\n aes(0, 1, label = \"a\"),\n size = 90, family = \"SVGFont 1\"\n ) +\n geom_text(\n aes(0, 0, label = \"a\"),\n color = \"red\", angle = 15,\n size = 90, family = \"SVGFont 1\"\n ) +\n scale_y_continuous(expand = expansion(add = c(.5, 1)))\n \n\n\n\nWTF?\n@yutannihilat_en\nhas a thread\nabout how if you pass in a character to the shape argument\nof geom_point(), it acts like geom_text():\n\n\nggplot(NULL, aes(x = 0, y = 0)) +\n geom_point(\n shape = \"あ\",\n size = 50\n )\n\n\n\n\nNaturally, I wondered if changing the font family affects how the\ncharacter glyph is rendered. geom_point() doesn’t take a\nfamily argument, but we can try it out directly in grid by\nsetting fontfamily to a custom font:\n\n\nggplot(NULL, aes(x = 0, y = 0)) +\n geom_point(\n shape = \"あ\",\n size = 50\n ) +\n expand_limits(x = c(-.2, .5)) +\n annotation_custom(\n grid::pointsGrob(\n pch = \"あ\",\n x = .7, y = .5,\n gp = grid::gpar(fontfamily = \"Noto Sans JP\", fontsize = 50 * .pt))\n )\n\n\n\n\nEmojis work this way:\n\n\nggplot(NULL, aes(x = 0, y = 0)) +\n geom_point(\n shape = emo::ji(\"smile\"),\n size = 50\n )\n\n\n\n\nAnd so do icon fonts, when shape/pch is supplied as Unicode:\n\n\nggplot(NULL) +\n annotation_custom(\n grid::pointsGrob(\n pch = \"\\UF118\",\n x = .5, y = .5,\n gp = grid::gpar(\n fontfamily = \"Font Awesome 5 Free\",\n fontsize = 50 * .pt\n )\n )\n )\n\n\n\n\nNote sure what you’d use this for but and hey it works\nMore by others\nAn extremely detailed step-by-step video walkthrough\nof using custom fonts in R by @dgkeyes.\nThe {hrbragg}\npackage by @hrbrmstr for more\nutility functions for registering font variants and typography-centered\nggplot2 themes.\nThe text\nformatting chapter of Practical Typography by Matthew\nButterick for a general guideline on using different font\nfeatures.\nEverything from Thomas\nLin Pedersen, the main person responsible for these\ndevelopments.\nMany #TidyTuesday\nsubmissions.\nOfficial RStudio blog posts:\nModern\nText Features in R\nUpdates\nto ragg and systemfonts\nsvglite\n2.0.0\nSession info\n\nSession Info\n\n R version 4.2.0 (2022-04-22 ucrt)\n Platform: x86_64-w64-mingw32/x64 (64-bit)\n Running under: Windows 10 x64 (build 19044)\n \n Matrix products: default\n \n locale:\n [1] LC_COLLATE=English_United States.utf8 \n [2] LC_CTYPE=English_United States.utf8 \n [3] LC_MONETARY=English_United States.utf8\n [4] LC_NUMERIC=C \n [5] LC_TIME=English_United States.utf8 \n \n attached base packages:\n [1] stats graphics grDevices utils datasets methods base \n \n other attached packages:\n [1] ggtext_0.1.1 textshaping_0.3.6 stringr_1.4.0 dplyr_1.0.9 \n [5] systemfonts_1.0.4 ggplot2_3.3.6 knitr_1.39 \n \n loaded via a namespace (and not attached):\n [1] tidyselect_1.1.2 xfun_0.31 bslib_0.3.1 purrr_0.3.4 \n [5] colorspace_2.0-3 vctrs_0.4.1 generics_0.1.2 htmltools_0.5.2 \n [9] emo_0.0.0.9000 yaml_2.3.5 utf8_1.2.2 rlang_1.0.2 \n [13] gridtext_0.1.4 jquerylib_0.1.4 pillar_1.7.0 glue_1.6.2 \n [17] withr_2.5.0 DBI_1.1.2 lifecycle_1.0.1 munsell_0.5.0 \n [21] junebug_0.0.0.9000 gtable_0.3.0 ragg_1.2.2 memoise_2.0.1 \n [25] evaluate_0.15 labeling_0.4.2 fastmap_1.1.0 markdown_1.1 \n [29] fansi_1.0.3 highr_0.8 Rcpp_1.0.8.3 scales_1.2.0 \n [33] cachem_1.0.1 jsonlite_1.8.0 farver_2.1.0 distill_1.4 \n [37] png_0.1-7 digest_0.6.29 stringi_1.7.6 grid_4.2.0 \n [41] cli_3.3.0 tools_4.2.0 magrittr_2.0.3 sass_0.4.1 \n [45] tibble_3.1.7 crayon_1.5.1 pkgconfig_2.0.3 downlit_0.4.0.9000\n [49] ellipsis_0.3.2 xml2_1.3.3 data.table_1.14.3 lubridate_1.8.0 \n [53] assertthat_0.2.1 rmarkdown_2.14 rstudioapi_0.13 R6_2.5.1 \n [57] compiler_4.2.0\n\n\nIn fact, text rendering as a whole is\nan incredibly complicated task. Check out Text\nRendering Hates You for a fun and informative read.↩︎\nI’m focusing on outputing to\nbitmap (e.g., .png, .jpeg,\n.tiff). For other formats like SVG (which I often default\nto for online material), you can use svglite - read more on\nthe package website.↩︎\nCheck out the discussion on this issue\nand this\ncommit. There’s also been some talk of making AGG the default\nrenderer, though I don’t know if that’s been settled.↩︎\nThese are used to calculate DPI (dots\nper inch). Resolution is in pixels, so res=150 and\nunits=\"inch\" is the same as dpi=150.\n{ragg} devices don’t have a dpi argument like\nthe default device, so you have to specify both resolution and units.↩︎\nin Windows 10, for example, you have\nto drag and drop fonts onto the “Fonts” section of Settings↩︎\nVariable fonts are hit-or-miss\nbecause while {ragg} and {systemfonts}\ndo support some variable font features (see the section on Advanced font\nfeatures), “variable” can mean many different things, some\nof which are not supported (e.g., variable width). If you install a\nvariable font, it might render with {ragg} but you’re\nunlikely to be able to tweak its parameters (like change the weight, for\nexample).↩︎\nIn my experience, though, static\nfonts tend to be .ttf and variable fonts tend to be\n.otf.↩︎\n“You can use them freely in your\nproducts & projects - print or digital, commercial or otherwise.\nHowever, you can’t sell the fonts on their own.”↩︎\nAgain, YMMV, but myself and a couple\nother folks I’ve talked to share this.↩︎\nTechnically, it defaults to\nfontface = \"plain\" which is the same thing, but\n{systemfonts} and (also probably your OS) calls it the\n“Regular” style↩︎\nIn case you’re wondering, it still\nerrors with “solid”, no caps.↩︎\noptions(error = recover) is your\nfriend! And remember to set options(error = NULL) back once\nyou’re done!↩︎\nYou might wonder: what’s the\n{grid} package doing here? Well, {grid} is\nkinda the “engine” for {ggplot2} that handles the actual\n“drawing to the canvas”, which is why it’s relevant here. For example,\ngeom_text() returns a Graphical\nobject (“Grob”), specifically\ngrid::textGrob(), that inherits arguments like\nfamily and fontface (which are in turn passed\ninto grid::gpar(), where gpar stands for\ngraphical parameters).↩︎\nThe same font file also registered\nas the Bold, Italic, and Bold Italic styles of the family as well, which\nis what happens by default if you only supply the plain\nargument to register_font().↩︎\nBorrowing terminology from\ntidyr::hoist(), the under-appreciated beast of list-column\nworkflows↩︎\nRegular as “plain”, Bold as “bold”,\nItalic as “italic”, and Bold Italic as “bold.italic”.↩︎\nAlso check out the related “pnum”\n(proportional numbers) and “tnum” (tabular numbers) features.↩︎\ncheck the help page\n?systemfonts::font_feature for details.↩︎\nAnother solution would be to use the\nsmall-cap variant (the “smcp” feature) for “une”.↩︎\nBut this also means that not all\nfonts support “ordn”, while is always\navailable.↩︎\nBut not all colored fonts,\nin my experience.↩︎\nThe colors are fixed though - they\ncome colored in black and filled in grey.↩︎\n", "preview": "posts/2021-06-24-setting-up-and-debugging-custom-fonts/preview.png", - "last_modified": "2024-06-09T03:56:06-04:00", + "last_modified": "2024-06-09T00:56:06-07:00", "input_file": {}, "preview_width": 709, "preview_height": 612 @@ -321,7 +321,7 @@ ], "contents": "\r\n\r\nContents\r\nStatic\r\nAn aside on LaTeX equations\r\n\r\nAnimated\r\nFinal Product\r\n\r\n\r\n\r\n\r\nIn my last blogpost, I demonstrated a couple use cases for the higher-order functionals reduce() and accumulate() from the {purrr} package. In one example, I made an animated {kableExtra} table by accumulate()-ing over multiple calls to column_spec() that set a background color for a column.\r\nAnimated tables are virtually non-existent in the wild, and probably for a good reason. but I wanted to extend upon my previous table animation and create something that’s maybe a bit more on the “informative” side.\r\nTo that end, here’s an animate table that simulates sampling from a bivariate normal distribution.\r\nStatic\r\nLet’s first start by generating 100,000 data points:\r\n\r\n\r\nset.seed(2021)\r\n\r\nlibrary(dplyr)\r\n\r\nsamples_data <- MASS::mvrnorm(1e5, c(0, 0), matrix(c(1, .7, .7, 1), ncol = 2)) %>% \r\n as_tibble(.name_repair = ~c(\"x\", \"y\")) %>% \r\n mutate(across(everything(), ~ as.character(.x - .x %% 0.2)))\r\n\r\nsamples_data\r\n\r\n\r\n # A tibble: 100,000 x 2\r\n x y \r\n \r\n 1 0 -0.4 \r\n 2 0.2 0.6 \r\n 3 0.4 0.2 \r\n 4 0.6 -0.2 \r\n 5 0.6 0.8 \r\n 6 -1.8 -2 \r\n 7 0.8 -0.6 \r\n 8 1.2 0.4 \r\n 9 0.4 -0.4 \r\n 10 1.4 1.6 \r\n # ... with 99,990 more rows\r\n\r\nLet’s see how this looks when we turn this into a “matrix”1. To place continuous values into discrete cells in the table, I’m also binning both variables by 0.2:\r\n\r\n\r\nsamples_data_spread <- samples_data %>% \r\n count(x, y) %>% \r\n right_join(\r\n tidyr::crossing(\r\n x = as.character(seq(-3, 3, 0.2)),\r\n y = as.character(seq(-3, 3, 0.2))\r\n ),\r\n by = c(\"x\", \"y\")\r\n ) %>% \r\n tidyr::pivot_wider(names_from = y, values_from = n) %>% \r\n arrange(-as.numeric(x)) %>% \r\n select(c(\"x\", as.character(seq(-3, 3, 0.2)))) %>% \r\n rename(\" \" = x)\r\n\r\nsamples_data_spread\r\n\r\n\r\n # A tibble: 31 x 32\r\n ` ` `-3` `-2.8` `-2.6` `-2.4` `-2.2` `-2` `-1.8` `-1.6` `-1.4` `-1.2`\r\n \r\n 1 3 NA NA NA NA NA NA NA NA NA NA\r\n 2 2.8 NA NA NA NA NA NA NA NA NA NA\r\n 3 2.6 NA NA NA NA NA NA NA NA NA NA\r\n 4 2.4 NA NA NA NA NA NA NA NA NA NA\r\n 5 2.2 NA NA NA NA NA NA NA NA NA NA\r\n 6 2 NA NA NA NA NA NA NA NA NA NA\r\n 7 1.8 NA NA NA NA NA NA NA NA NA NA\r\n 8 1.6 NA NA NA NA NA NA NA 1 1 4\r\n 9 1.4 NA NA NA NA NA NA 1 NA 1 4\r\n 10 1.2 NA NA NA NA NA NA NA 3 2 7\r\n # ... with 21 more rows, and 21 more variables: `-1` , `-0.8` ,\r\n # `-0.6` , `-0.4` , `-0.2` , `0` , `0.2` ,\r\n # `0.4` , `0.6` , `0.8` , `1` , `1.2` , `1.4` ,\r\n # `1.6` , `1.8` , `2` , `2.2` , `2.4` , `2.6` ,\r\n # `2.8` , `3` \r\n\r\nNow we can turn this into a table and fill the cells according to the counts using reduce():\r\n\r\n\r\nlibrary(kableExtra)\r\n\r\nsamples_data_table <- samples_data_spread %>% \r\n kable() %>% \r\n kable_classic() %>% \r\n purrr::reduce(2L:length(samples_data_spread), ~ {\r\n column_spec(\r\n kable_input = .x,\r\n column = .y,\r\n background = spec_color(\r\n samples_data_spread[[.y]],\r\n scale_from = c(1, max(as.numeric(as.matrix(samples_data_spread)), na.rm = TRUE)),\r\n na_color = \"white\",\r\n option = \"plasma\"\r\n ),\r\n color = \"white\"\r\n )},\r\n .init = .\r\n )\r\n\r\nsamples_data_table\r\n\r\n\r\n\r\n\r\n\r\n-3\r\n\r\n\r\n-2.8\r\n\r\n\r\n-2.6\r\n\r\n\r\n-2.4\r\n\r\n\r\n-2.2\r\n\r\n\r\n-2\r\n\r\n\r\n-1.8\r\n\r\n\r\n-1.6\r\n\r\n\r\n-1.4\r\n\r\n\r\n-1.2\r\n\r\n\r\n-1\r\n\r\n\r\n-0.8\r\n\r\n\r\n-0.6\r\n\r\n\r\n-0.4\r\n\r\n\r\n-0.2\r\n\r\n\r\n0\r\n\r\n\r\n0.2\r\n\r\n\r\n0.4\r\n\r\n\r\n0.6\r\n\r\n\r\n0.8\r\n\r\n\r\n1\r\n\r\n\r\n1.2\r\n\r\n\r\n1.4\r\n\r\n\r\n1.6\r\n\r\n\r\n1.8\r\n\r\n\r\n2\r\n\r\n\r\n2.2\r\n\r\n\r\n2.4\r\n\r\n\r\n2.6\r\n\r\n\r\n2.8\r\n\r\n\r\n3\r\n\r\n\r\n3\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n2\r\n\r\n\r\n1\r\n\r\n\r\n1\r\n\r\n\r\n8\r\n\r\n\r\n2\r\n\r\n\r\n7\r\n\r\n\r\n7\r\n\r\n\r\n8\r\n\r\n\r\n4\r\n\r\n\r\n6\r\n\r\n\r\n6\r\n\r\n\r\n2\r\n\r\n\r\n1\r\n\r\n\r\n2.8\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n2\r\n\r\n\r\nNA\r\n\r\n\r\n3\r\n\r\n\r\n6\r\n\r\n\r\n6\r\n\r\n\r\n20\r\n\r\n\r\n14\r\n\r\n\r\n20\r\n\r\n\r\n15\r\n\r\n\r\n8\r\n\r\n\r\n18\r\n\r\n\r\n6\r\n\r\n\r\n10\r\n\r\n\r\n7\r\n\r\n\r\n3\r\n\r\n\r\n2.6\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\n7\r\n\r\n\r\n11\r\n\r\n\r\n21\r\n\r\n\r\n26\r\n\r\n\r\n17\r\n\r\n\r\n26\r\n\r\n\r\n29\r\n\r\n\r\n28\r\n\r\n\r\n21\r\n\r\n\r\n17\r\n\r\n\r\n10\r\n\r\n\r\n3\r\n\r\n\r\n8\r\n\r\n\r\n2.4\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n1\r\n\r\n\r\n7\r\n\r\n\r\n11\r\n\r\n\r\n17\r\n\r\n\r\n20\r\n\r\n\r\n32\r\n\r\n\r\n33\r\n\r\n\r\n43\r\n\r\n\r\n52\r\n\r\n\r\n43\r\n\r\n\r\n37\r\n\r\n\r\n23\r\n\r\n\r\n23\r\n\r\n\r\n17\r\n\r\n\r\n9\r\n\r\n\r\n7\r\n\r\n\r\n2.2\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n1\r\n\r\n\r\n3\r\n\r\n\r\n7\r\n\r\n\r\n6\r\n\r\n\r\n12\r\n\r\n\r\n20\r\n\r\n\r\n18\r\n\r\n\r\n46\r\n\r\n\r\n51\r\n\r\n\r\n59\r\n\r\n\r\n66\r\n\r\n\r\n58\r\n\r\n\r\n73\r\n\r\n\r\n53\r\n\r\n\r\n41\r\n\r\n\r\n21\r\n\r\n\r\n20\r\n\r\n\r\n16\r\n\r\n\r\n8\r\n\r\n\r\n2\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n3\r\n\r\n\r\n2\r\n\r\n\r\n12\r\n\r\n\r\n17\r\n\r\n\r\n20\r\n\r\n\r\n35\r\n\r\n\r\n53\r\n\r\n\r\n83\r\n\r\n\r\n103\r\n\r\n\r\n93\r\n\r\n\r\n117\r\n\r\n\r\n106\r\n\r\n\r\n111\r\n\r\n\r\n74\r\n\r\n\r\n52\r\n\r\n\r\n42\r\n\r\n\r\n27\r\n\r\n\r\n17\r\n\r\n\r\n5\r\n\r\n\r\n1.8\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n8\r\n\r\n\r\n10\r\n\r\n\r\n14\r\n\r\n\r\n40\r\n\r\n\r\n50\r\n\r\n\r\n81\r\n\r\n\r\n108\r\n\r\n\r\n128\r\n\r\n\r\n132\r\n\r\n\r\n149\r\n\r\n\r\n143\r\n\r\n\r\n146\r\n\r\n\r\n103\r\n\r\n\r\n89\r\n\r\n\r\n57\r\n\r\n\r\n39\r\n\r\n\r\n19\r\n\r\n\r\n23\r\n\r\n\r\n7\r\n\r\n\r\n1.6\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n6\r\n\r\n\r\n6\r\n\r\n\r\n14\r\n\r\n\r\n19\r\n\r\n\r\n39\r\n\r\n\r\n67\r\n\r\n\r\n99\r\n\r\n\r\n136\r\n\r\n\r\n148\r\n\r\n\r\n183\r\n\r\n\r\n197\r\n\r\n\r\n214\r\n\r\n\r\n185\r\n\r\n\r\n170\r\n\r\n\r\n109\r\n\r\n\r\n81\r\n\r\n\r\n60\r\n\r\n\r\n40\r\n\r\n\r\n24\r\n\r\n\r\n11\r\n\r\n\r\n11\r\n\r\n\r\n1.4\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n8\r\n\r\n\r\n17\r\n\r\n\r\n37\r\n\r\n\r\n50\r\n\r\n\r\n74\r\n\r\n\r\n115\r\n\r\n\r\n170\r\n\r\n\r\n225\r\n\r\n\r\n277\r\n\r\n\r\n307\r\n\r\n\r\n323\r\n\r\n\r\n292\r\n\r\n\r\n243\r\n\r\n\r\n186\r\n\r\n\r\n123\r\n\r\n\r\n110\r\n\r\n\r\n61\r\n\r\n\r\n47\r\n\r\n\r\n19\r\n\r\n\r\n7\r\n\r\n\r\n3\r\n\r\n\r\n1.2\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n3\r\n\r\n\r\n2\r\n\r\n\r\n7\r\n\r\n\r\n22\r\n\r\n\r\n29\r\n\r\n\r\n60\r\n\r\n\r\n88\r\n\r\n\r\n144\r\n\r\n\r\n204\r\n\r\n\r\n273\r\n\r\n\r\n317\r\n\r\n\r\n376\r\n\r\n\r\n381\r\n\r\n\r\n337\r\n\r\n\r\n323\r\n\r\n\r\n262\r\n\r\n\r\n208\r\n\r\n\r\n135\r\n\r\n\r\n92\r\n\r\n\r\n58\r\n\r\n\r\n41\r\n\r\n\r\n17\r\n\r\n\r\n8\r\n\r\n\r\n7\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n5\r\n\r\n\r\n6\r\n\r\n\r\n18\r\n\r\n\r\n26\r\n\r\n\r\n64\r\n\r\n\r\n83\r\n\r\n\r\n160\r\n\r\n\r\n239\r\n\r\n\r\n329\r\n\r\n\r\n375\r\n\r\n\r\n504\r\n\r\n\r\n501\r\n\r\n\r\n474\r\n\r\n\r\n455\r\n\r\n\r\n336\r\n\r\n\r\n315\r\n\r\n\r\n223\r\n\r\n\r\n154\r\n\r\n\r\n94\r\n\r\n\r\n52\r\n\r\n\r\n23\r\n\r\n\r\n11\r\n\r\n\r\n3\r\n\r\n\r\n2\r\n\r\n\r\n0.8\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n4\r\n\r\n\r\n8\r\n\r\n\r\n10\r\n\r\n\r\n29\r\n\r\n\r\n68\r\n\r\n\r\n104\r\n\r\n\r\n163\r\n\r\n\r\n269\r\n\r\n\r\n336\r\n\r\n\r\n375\r\n\r\n\r\n517\r\n\r\n\r\n566\r\n\r\n\r\n612\r\n\r\n\r\n572\r\n\r\n\r\n480\r\n\r\n\r\n355\r\n\r\n\r\n256\r\n\r\n\r\n190\r\n\r\n\r\n129\r\n\r\n\r\n58\r\n\r\n\r\n37\r\n\r\n\r\n14\r\n\r\n\r\n14\r\n\r\n\r\n1\r\n\r\n\r\n2\r\n\r\n\r\n0.6\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n9\r\n\r\n\r\n20\r\n\r\n\r\n39\r\n\r\n\r\n62\r\n\r\n\r\n104\r\n\r\n\r\n161\r\n\r\n\r\n272\r\n\r\n\r\n373\r\n\r\n\r\n459\r\n\r\n\r\n591\r\n\r\n\r\n674\r\n\r\n\r\n684\r\n\r\n\r\n641\r\n\r\n\r\n587\r\n\r\n\r\n487\r\n\r\n\r\n365\r\n\r\n\r\n251\r\n\r\n\r\n167\r\n\r\n\r\n97\r\n\r\n\r\n56\r\n\r\n\r\n25\r\n\r\n\r\n9\r\n\r\n\r\n5\r\n\r\n\r\n3\r\n\r\n\r\n1\r\n\r\n\r\n0.4\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n2\r\n\r\n\r\n4\r\n\r\n\r\n27\r\n\r\n\r\n32\r\n\r\n\r\n60\r\n\r\n\r\n101\r\n\r\n\r\n159\r\n\r\n\r\n260\r\n\r\n\r\n413\r\n\r\n\r\n535\r\n\r\n\r\n680\r\n\r\n\r\n794\r\n\r\n\r\n796\r\n\r\n\r\n780\r\n\r\n\r\n704\r\n\r\n\r\n537\r\n\r\n\r\n452\r\n\r\n\r\n345\r\n\r\n\r\n218\r\n\r\n\r\n119\r\n\r\n\r\n69\r\n\r\n\r\n38\r\n\r\n\r\n16\r\n\r\n\r\n8\r\n\r\n\r\n10\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\n0.2\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n2\r\n\r\n\r\n2\r\n\r\n\r\n9\r\n\r\n\r\n33\r\n\r\n\r\n46\r\n\r\n\r\n91\r\n\r\n\r\n152\r\n\r\n\r\n229\r\n\r\n\r\n388\r\n\r\n\r\n519\r\n\r\n\r\n654\r\n\r\n\r\n777\r\n\r\n\r\n851\r\n\r\n\r\n881\r\n\r\n\r\n712\r\n\r\n\r\n674\r\n\r\n\r\n535\r\n\r\n\r\n389\r\n\r\n\r\n285\r\n\r\n\r\n176\r\n\r\n\r\n102\r\n\r\n\r\n45\r\n\r\n\r\n29\r\n\r\n\r\n14\r\n\r\n\r\n7\r\n\r\n\r\n4\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\n0\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n2\r\n\r\n\r\n11\r\n\r\n\r\n18\r\n\r\n\r\n51\r\n\r\n\r\n77\r\n\r\n\r\n154\r\n\r\n\r\n210\r\n\r\n\r\n351\r\n\r\n\r\n521\r\n\r\n\r\n645\r\n\r\n\r\n778\r\n\r\n\r\n876\r\n\r\n\r\n866\r\n\r\n\r\n812\r\n\r\n\r\n685\r\n\r\n\r\n593\r\n\r\n\r\n459\r\n\r\n\r\n296\r\n\r\n\r\n190\r\n\r\n\r\n117\r\n\r\n\r\n50\r\n\r\n\r\n39\r\n\r\n\r\n24\r\n\r\n\r\n8\r\n\r\n\r\n2\r\n\r\n\r\n2\r\n\r\n\r\n2\r\n\r\n\r\nNA\r\n\r\n\r\n-0.2\r\n\r\n\r\nNA\r\n\r\n\r\n2\r\n\r\n\r\n1\r\n\r\n\r\n6\r\n\r\n\r\n15\r\n\r\n\r\n36\r\n\r\n\r\n59\r\n\r\n\r\n112\r\n\r\n\r\n196\r\n\r\n\r\n286\r\n\r\n\r\n410\r\n\r\n\r\n620\r\n\r\n\r\n747\r\n\r\n\r\n856\r\n\r\n\r\n854\r\n\r\n\r\n836\r\n\r\n\r\n721\r\n\r\n\r\n683\r\n\r\n\r\n493\r\n\r\n\r\n344\r\n\r\n\r\n215\r\n\r\n\r\n162\r\n\r\n\r\n70\r\n\r\n\r\n50\r\n\r\n\r\n21\r\n\r\n\r\n6\r\n\r\n\r\n5\r\n\r\n\r\n1\r\n\r\n\r\n2\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-0.4\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\n2\r\n\r\n\r\n12\r\n\r\n\r\n24\r\n\r\n\r\n60\r\n\r\n\r\n85\r\n\r\n\r\n168\r\n\r\n\r\n256\r\n\r\n\r\n373\r\n\r\n\r\n551\r\n\r\n\r\n689\r\n\r\n\r\n785\r\n\r\n\r\n842\r\n\r\n\r\n776\r\n\r\n\r\n773\r\n\r\n\r\n683\r\n\r\n\r\n504\r\n\r\n\r\n395\r\n\r\n\r\n233\r\n\r\n\r\n154\r\n\r\n\r\n94\r\n\r\n\r\n43\r\n\r\n\r\n35\r\n\r\n\r\n7\r\n\r\n\r\n4\r\n\r\n\r\n4\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-0.6\r\n\r\n\r\nNA\r\n\r\n\r\n4\r\n\r\n\r\n13\r\n\r\n\r\n16\r\n\r\n\r\n30\r\n\r\n\r\n62\r\n\r\n\r\n119\r\n\r\n\r\n219\r\n\r\n\r\n331\r\n\r\n\r\n447\r\n\r\n\r\n573\r\n\r\n\r\n714\r\n\r\n\r\n736\r\n\r\n\r\n787\r\n\r\n\r\n725\r\n\r\n\r\n658\r\n\r\n\r\n524\r\n\r\n\r\n389\r\n\r\n\r\n255\r\n\r\n\r\n219\r\n\r\n\r\n108\r\n\r\n\r\n66\r\n\r\n\r\n37\r\n\r\n\r\n5\r\n\r\n\r\n8\r\n\r\n\r\n2\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-0.8\r\n\r\n\r\n3\r\n\r\n\r\n8\r\n\r\n\r\n13\r\n\r\n\r\n40\r\n\r\n\r\n59\r\n\r\n\r\n81\r\n\r\n\r\n181\r\n\r\n\r\n263\r\n\r\n\r\n330\r\n\r\n\r\n469\r\n\r\n\r\n600\r\n\r\n\r\n661\r\n\r\n\r\n681\r\n\r\n\r\n652\r\n\r\n\r\n639\r\n\r\n\r\n484\r\n\r\n\r\n368\r\n\r\n\r\n274\r\n\r\n\r\n160\r\n\r\n\r\n123\r\n\r\n\r\n48\r\n\r\n\r\n26\r\n\r\n\r\n13\r\n\r\n\r\n10\r\n\r\n\r\n3\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-1\r\n\r\n\r\n6\r\n\r\n\r\n8\r\n\r\n\r\n21\r\n\r\n\r\n34\r\n\r\n\r\n80\r\n\r\n\r\n133\r\n\r\n\r\n195\r\n\r\n\r\n293\r\n\r\n\r\n386\r\n\r\n\r\n457\r\n\r\n\r\n556\r\n\r\n\r\n626\r\n\r\n\r\n574\r\n\r\n\r\n526\r\n\r\n\r\n461\r\n\r\n\r\n363\r\n\r\n\r\n246\r\n\r\n\r\n190\r\n\r\n\r\n105\r\n\r\n\r\n56\r\n\r\n\r\n26\r\n\r\n\r\n16\r\n\r\n\r\n6\r\n\r\n\r\n1\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-1.2\r\n\r\n\r\n10\r\n\r\n\r\n8\r\n\r\n\r\n21\r\n\r\n\r\n45\r\n\r\n\r\n77\r\n\r\n\r\n146\r\n\r\n\r\n198\r\n\r\n\r\n266\r\n\r\n\r\n360\r\n\r\n\r\n436\r\n\r\n\r\n469\r\n\r\n\r\n480\r\n\r\n\r\n457\r\n\r\n\r\n393\r\n\r\n\r\n344\r\n\r\n\r\n242\r\n\r\n\r\n169\r\n\r\n\r\n104\r\n\r\n\r\n79\r\n\r\n\r\n33\r\n\r\n\r\n23\r\n\r\n\r\n9\r\n\r\n\r\n2\r\n\r\n\r\n3\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-1.4\r\n\r\n\r\n6\r\n\r\n\r\n13\r\n\r\n\r\n31\r\n\r\n\r\n62\r\n\r\n\r\n96\r\n\r\n\r\n163\r\n\r\n\r\n200\r\n\r\n\r\n299\r\n\r\n\r\n337\r\n\r\n\r\n360\r\n\r\n\r\n364\r\n\r\n\r\n364\r\n\r\n\r\n319\r\n\r\n\r\n239\r\n\r\n\r\n190\r\n\r\n\r\n129\r\n\r\n\r\n84\r\n\r\n\r\n50\r\n\r\n\r\n33\r\n\r\n\r\n17\r\n\r\n\r\n11\r\n\r\n\r\n7\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-1.6\r\n\r\n\r\n18\r\n\r\n\r\n25\r\n\r\n\r\n39\r\n\r\n\r\n61\r\n\r\n\r\n110\r\n\r\n\r\n138\r\n\r\n\r\n184\r\n\r\n\r\n235\r\n\r\n\r\n281\r\n\r\n\r\n278\r\n\r\n\r\n294\r\n\r\n\r\n246\r\n\r\n\r\n211\r\n\r\n\r\n176\r\n\r\n\r\n148\r\n\r\n\r\n92\r\n\r\n\r\n40\r\n\r\n\r\n23\r\n\r\n\r\n17\r\n\r\n\r\n8\r\n\r\n\r\n5\r\n\r\n\r\n3\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-1.8\r\n\r\n\r\n11\r\n\r\n\r\n19\r\n\r\n\r\n33\r\n\r\n\r\n52\r\n\r\n\r\n90\r\n\r\n\r\n139\r\n\r\n\r\n165\r\n\r\n\r\n185\r\n\r\n\r\n225\r\n\r\n\r\n192\r\n\r\n\r\n206\r\n\r\n\r\n157\r\n\r\n\r\n125\r\n\r\n\r\n100\r\n\r\n\r\n62\r\n\r\n\r\n40\r\n\r\n\r\n28\r\n\r\n\r\n11\r\n\r\n\r\n10\r\n\r\n\r\n2\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-2\r\n\r\n\r\n11\r\n\r\n\r\n29\r\n\r\n\r\n34\r\n\r\n\r\n50\r\n\r\n\r\n78\r\n\r\n\r\n102\r\n\r\n\r\n139\r\n\r\n\r\n141\r\n\r\n\r\n144\r\n\r\n\r\n149\r\n\r\n\r\n110\r\n\r\n\r\n104\r\n\r\n\r\n76\r\n\r\n\r\n49\r\n\r\n\r\n41\r\n\r\n\r\n19\r\n\r\n\r\n19\r\n\r\n\r\n5\r\n\r\n\r\n4\r\n\r\n\r\n2\r\n\r\n\r\nNA\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-2.2\r\n\r\n\r\n11\r\n\r\n\r\n23\r\n\r\n\r\n38\r\n\r\n\r\n48\r\n\r\n\r\n76\r\n\r\n\r\n76\r\n\r\n\r\n99\r\n\r\n\r\n105\r\n\r\n\r\n88\r\n\r\n\r\n81\r\n\r\n\r\n81\r\n\r\n\r\n72\r\n\r\n\r\n36\r\n\r\n\r\n20\r\n\r\n\r\n25\r\n\r\n\r\n9\r\n\r\n\r\n6\r\n\r\n\r\n4\r\n\r\n\r\n2\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-2.4\r\n\r\n\r\n12\r\n\r\n\r\n21\r\n\r\n\r\n24\r\n\r\n\r\n44\r\n\r\n\r\n56\r\n\r\n\r\n53\r\n\r\n\r\n51\r\n\r\n\r\n69\r\n\r\n\r\n66\r\n\r\n\r\n54\r\n\r\n\r\n46\r\n\r\n\r\n24\r\n\r\n\r\n21\r\n\r\n\r\n9\r\n\r\n\r\n5\r\n\r\n\r\n7\r\n\r\n\r\n3\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-2.6\r\n\r\n\r\n12\r\n\r\n\r\n15\r\n\r\n\r\n20\r\n\r\n\r\n34\r\n\r\n\r\n32\r\n\r\n\r\n30\r\n\r\n\r\n40\r\n\r\n\r\n34\r\n\r\n\r\n36\r\n\r\n\r\n28\r\n\r\n\r\n21\r\n\r\n\r\n15\r\n\r\n\r\n8\r\n\r\n\r\n4\r\n\r\n\r\n2\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-2.8\r\n\r\n\r\n5\r\n\r\n\r\n17\r\n\r\n\r\n28\r\n\r\n\r\n27\r\n\r\n\r\n19\r\n\r\n\r\n14\r\n\r\n\r\n20\r\n\r\n\r\n26\r\n\r\n\r\n15\r\n\r\n\r\n10\r\n\r\n\r\n10\r\n\r\n\r\n4\r\n\r\n\r\n2\r\n\r\n\r\n4\r\n\r\n\r\n1\r\n\r\n\r\n2\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\n-3\r\n\r\n\r\n6\r\n\r\n\r\n10\r\n\r\n\r\n3\r\n\r\n\r\n11\r\n\r\n\r\n21\r\n\r\n\r\n11\r\n\r\n\r\n13\r\n\r\n\r\n6\r\n\r\n\r\n10\r\n\r\n\r\n8\r\n\r\n\r\n4\r\n\r\n\r\n1\r\n\r\n\r\n1\r\n\r\n\r\n3\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nNA\r\n\r\n\r\nAn aside on LaTeX equations\r\nAs an aside, let’s say we also want to annotate this table with the true distribution where this sample came from. As specified in our call to MASS::mvrnorm() used to make samples_data, the distribution is one where both variables have a mean of 0 and a standard deviation of 1, plus a correlation of 0.7:\r\n\\[\\begin{bmatrix} X \\\\ Y \\end{bmatrix}\\ \\sim\\ N(\\begin{bmatrix} 0 \\\\ 0 \\end{bmatrix},\\begin{bmatrix}1 & 0.7 \\\\ 0.7 & 1 \\end{bmatrix})\\]\r\nWhere the LaTeX code for the above formula is:\r\n\r\n \\begin{bmatrix} X \\\\ Y \\end{bmatrix}\\ \\sim\\\r\n N(\\begin{bmatrix} 0 \\\\ 0 \\end{bmatrix},\r\n \\begin{bmatrix}1 & 0.7 \\\\ 0.7 & 1 \\end{bmatrix})\r\n\r\nMany different solutions already exist to LaTeX math annotations. The most common is probably Non-Standard Evaluation (NSE) methods using parse(), expression(), bquote() etc. There are bulkier solutions like the {latex2exp} package that plots plotmath expressions, though it hasn’t been updated since 2015 and I personally had difficulty getting it to work.\r\nOne solution I’ve never heard of/considered before is querying a web LaTeX editor that has an API. The Online LaTeX Equation Editor by CodeCogs is the perfect example of this. A simple link that contains the LaTeX code in a URL-compatible encoding renders the resulting expression as an image!\r\nI wrote a function latex_query (not thoroughly tested) in my personal package that takes LaTeX code and generates a CodeCogs URL containing the rendered expression2\r\n\r\n\r\n# NOTE the string literal syntax using r\"(...)\" is only available in R 4.0.0 and up\r\nlatex_url <- junebug::latex_query(\r\n formula = r\"(\\begin{bmatrix} X \\\\ Y \\end{bmatrix}\\ \\sim\\\r\n N(\\begin{bmatrix} 0 \\\\ 0 \\end{bmatrix},\r\n \\begin{bmatrix}1 & 0.7 \\\\ 0.7 & 1 \\end{bmatrix}))\",\r\n dpi = 150\r\n)\r\n\r\nknitr::include_graphics(latex_url)\r\n\r\n\r\n\r\n\r\nThe variable latex_url is this really long URL which, as we see above, points to a rendered image of the LaTeX expression we fed it!\r\nAnnotating our table, then, is pretty straightforward. We save it as an image, read in the LaTeX equation as an image, then combine!\r\n\r\n\r\nsave_kable(samples_data_table, \"img/samples_data_table.png\")\r\n\r\nlibrary(magick)\r\n\r\nimage_composite(\r\n image_read(\"img/samples_data_table.png\"),\r\n image_read(latex_url),\r\n offset = \"+50+50\"\r\n)\r\n\r\n\r\n\r\n\r\n\r\n\r\nAnimated\r\nFor an animated version, we add a step where we split the data at every 10,000 additional samples before binning the observations into cells. We then draw a table at each point of the accumulation using {kableExtra} with the help of map() and reduce() (plus some more kable styling).\r\n\r\n\r\nsamples_tables <- purrr::map(1L:10L, ~{\r\n samples_slice <- samples_data %>% \r\n slice(1L:(.x * 1e4)) %>% \r\n count(x, y) %>% \r\n right_join(\r\n tidyr::crossing(\r\n x = as.character(seq(-3, 3, 0.2)),\r\n y = as.character(seq(-3, 3, 0.2))\r\n ),\r\n by = c(\"x\", \"y\")\r\n ) %>% \r\n tidyr::pivot_wider(names_from = y, values_from = n) %>% \r\n arrange(-as.numeric(x)) %>% \r\n select(c(\"x\", as.character(seq(-3, 3, 0.2)))) %>% \r\n rename(\" \" = x)\r\n\r\n \r\n samples_slice %>%\r\n kable() %>% \r\n kable_classic() %>% \r\n purrr::reduce(\r\n 2L:length(samples_slice),\r\n ~ {\r\n .x %>% \r\n column_spec(\r\n column = .y,\r\n width_min = \"35px\",\r\n background = spec_color(\r\n samples_slice[[.y]],\r\n scale_from = c(1, max(as.numeric(as.matrix(samples_slice)), na.rm = TRUE)),\r\n na_color = \"white\",\r\n option = \"plasma\"\r\n ),\r\n color = \"white\"\r\n ) %>% \r\n row_spec(\r\n row = .y - 1L,\r\n hline_after = FALSE,\r\n extra_css = \"border-top:none; padding-top:15px;\"\r\n )\r\n },\r\n .init = .\r\n ) %>% \r\n row_spec(0L, bold = TRUE) %>% \r\n column_spec(1L, bold = TRUE, border_right = TRUE) %>% \r\n kable_styling(\r\n full_width = F,\r\n font_size = 10,\r\n html_font = \"IBM Plex Mono\",\r\n )\r\n})\r\n\r\n\r\n\r\nThe result, samples_tables is a list of tables. We can walk() over that list with save_kable() to write them as images and then read them back in with {magick}:\r\n\r\n\r\npurrr::iwalk(samples_tables, ~ save_kable(.x, file = glue::glue(\"tbl_imgs/tbl{.y}.png\")))\r\n\r\ntable_imgs <- image_read(paste0(\"tbl_imgs/tbl\", 1:10, \".png\"))\r\n\r\n\r\n\r\nNow we can add our LaTeX expression from the previous section as an annotation to these table images using image_composite():\r\n\r\n\r\ntable_imgs_annotated <- table_imgs %>% \r\n image_composite(\r\n image_read(latex_url),\r\n offset = \"+100+80\"\r\n )\r\n\r\n\r\n\r\nFinally, we just patch the table images together into an animation using image_animate() and we have our animated table!\r\n\r\n\r\ntable_imgs_animated <- table_imgs_annotated %>% \r\n image_animate(optimize = TRUE)\r\n\r\n\r\n\r\nFinal Product\r\n\r\n\r\n\r\nYou can also see the difference in the degree of “interpolation” by directly comparing the table at 10 thousand vs 100 thousand samples (the first and last frames):\r\n\r\n\r\n\r\nNeat!\r\n\r\nVisually speaking. It’s still a dataframe object for compatibility with {kableExtra}↩︎\r\nDetails about the API - https://www.codecogs.com/latex/editor-api.php↩︎\r\n", "preview": "posts/2021-01-17-random-sampling-a-table-animation/table_preview.png", - "last_modified": "2022-11-13T09:16:57-05:00", + "last_modified": "2022-11-13T06:16:57-08:00", "input_file": {}, "preview_width": 1185, "preview_height": 1180 @@ -343,7 +343,7 @@ ], "contents": "\r\n\r\nContents\r\nIntroduction\r\nHappy pipes\r\nSad (repetitive) pipes\r\nIntroducing purrr::reduce()\r\n\r\nExample 1: {ggplot2}\r\nA reduce() solution\r\nfeat. accumulate()\r\n\r\nExample 2: {kableExtra}\r\nA reduce2() solution\r\nfeat. accumulate2()\r\n\r\nExample 3: {dplyr}\r\nA reduce() solution\r\nfeat. {data.table}\r\n\r\nMisc.\r\n\r\n\r\n\r\n\r\n\r\n\r\nIntroduction\r\nHappy pipes\r\nModern day programming with R is all about pipes.1 You start out with some object that undergoes incremental changes as it is passed (piped) into a chain of functions and finally returned as the desired output, like in this simple example. 2\r\n\r\n\r\nset.seed(2021) # Can 2020 be over already?\r\n\r\nsquare <- function(x) x^2\r\ndeviation <- function(x) x - mean(x)\r\n\r\nnums <- runif(100)\r\n\r\nnums %>%\r\n deviation() %>%\r\n square() %>%\r\n mean() %>%\r\n sqrt()\r\n\r\n\r\n [1] 0.3039881\r\n\r\nWhen we pipe (or pass anything through any function, for that matter), we often do one distinct thing at a time, like in the above example.\r\nSo, we rarely have a chain of functions that look like this:\r\n\r\n\r\nlibrary(dplyr)\r\n\r\nmtcars %>% \r\n mutate(kmpg = mpg/1.6) %>% \r\n mutate(disp = round(disp)) %>% \r\n select(-vs) %>% \r\n select(-am) %>% \r\n select(-gear) %>% \r\n select(-carb) %>% \r\n filter(mpg > 15) %>% \r\n filter(cyl == 6) %>% \r\n filter(wt < 3)\r\n\r\n\r\n\r\n… because many functions are vectorized, or designed to handle multiple values by other means, like this:\r\n\r\n\r\npenguins %>% \r\n mutate(kmpg = mpg/1.6, disp = round(disp)) %>% \r\n select(-(vs:carb)) %>% \r\n filter(mpg > 15, cyl == 6, wt < 3)\r\n\r\n\r\n\r\nSad (repetitive) pipes\r\nBut some functions do not handle multiple inputs the way we want it to, or just not at all. Here are some examples of what I’m talking about.\r\nIn {ggplot2}, this doesn’t plot 3 overlapping points with sizes 8, 4, and 2:\r\n\r\n\r\nlibrary(ggplot2)\r\n\r\nggplot(mtcars, aes(hp, mpg)) + \r\n geom_point(size = c(8, 4, 2), alpha = .5)\r\n\r\n\r\n Error: Aesthetics must be either length 1 or the same as the data (32): size\r\n\r\nSo you have to do this:\r\n\r\n\r\nggplot(mtcars, aes(hp, mpg)) + \r\n geom_point(size = 8, alpha = .5) +\r\n geom_point(size = 4, alpha = .5) +\r\n geom_point(size = 2, alpha = .5)\r\n\r\n\r\n\r\n\r\nIn {kableExtra}, this doesn’t color the third column “skyblue”, the fourth column “forestgreen”, and the fifth column “chocolate”:3\r\n\r\n\r\nlibrary(kableExtra)\r\n\r\nmtcars %>% \r\n head() %>% \r\n kbl() %>% \r\n kable_classic(html_font = \"Roboto\") %>% \r\n column_spec(3:5, background = c(\"skyblue\", \"forestgreen\", \"chocolate\"))\r\n\r\n\r\n Warning in ensure_len_html(background, nrows, \"background\"): The number of\r\n provided values in background does not equal to the number of rows.\r\n\r\n\r\n\r\nmpg\r\n\r\n\r\ncyl\r\n\r\n\r\ndisp\r\n\r\n\r\nhp\r\n\r\n\r\ndrat\r\n\r\n\r\nwt\r\n\r\n\r\nqsec\r\n\r\n\r\nvs\r\n\r\n\r\nam\r\n\r\n\r\ngear\r\n\r\n\r\ncarb\r\n\r\n\r\nMazda RX4\r\n\r\n\r\n21.0\r\n\r\n\r\n6\r\n\r\n\r\n160\r\n\r\n\r\n110\r\n\r\n\r\n3.90\r\n\r\n\r\n2.620\r\n\r\n\r\n16.46\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n4\r\n\r\n\r\nMazda RX4 Wag\r\n\r\n\r\n21.0\r\n\r\n\r\n6\r\n\r\n\r\n160\r\n\r\n\r\n110\r\n\r\n\r\n3.90\r\n\r\n\r\n2.875\r\n\r\n\r\n17.02\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n4\r\n\r\n\r\nDatsun 710\r\n\r\n\r\n22.8\r\n\r\n\r\n4\r\n\r\n\r\n108\r\n\r\n\r\n93\r\n\r\n\r\n3.85\r\n\r\n\r\n2.320\r\n\r\n\r\n18.61\r\n\r\n\r\n1\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n1\r\n\r\n\r\nHornet 4 Drive\r\n\r\n\r\n21.4\r\n\r\n\r\n6\r\n\r\n\r\n258\r\n\r\n\r\n110\r\n\r\n\r\n3.08\r\n\r\n\r\n3.215\r\n\r\n\r\n19.44\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\n3\r\n\r\n\r\n1\r\n\r\n\r\nHornet Sportabout\r\n\r\n\r\n18.7\r\n\r\n\r\n8\r\n\r\n\r\n360\r\n\r\n\r\n175\r\n\r\n\r\n3.15\r\n\r\n\r\n3.440\r\n\r\n\r\n17.02\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\n3\r\n\r\n\r\n2\r\n\r\n\r\nValiant\r\n\r\n\r\n18.1\r\n\r\n\r\n6\r\n\r\n\r\n225\r\n\r\n\r\n105\r\n\r\n\r\n2.76\r\n\r\n\r\n3.460\r\n\r\n\r\n20.22\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\n3\r\n\r\n\r\n1\r\n\r\n\r\nSo you have to do this:\r\n\r\n\r\nmtcars %>% \r\n head() %>% \r\n kbl() %>% \r\n kable_classic(html_font = \"Roboto\") %>% \r\n column_spec(3, background = \"skyblue\") %>% \r\n column_spec(4, background = \"forestgreen\") %>% \r\n column_spec(5, background = \"chocolate\")\r\n\r\n\r\n\r\n\r\n\r\nmpg\r\n\r\n\r\ncyl\r\n\r\n\r\ndisp\r\n\r\n\r\nhp\r\n\r\n\r\ndrat\r\n\r\n\r\nwt\r\n\r\n\r\nqsec\r\n\r\n\r\nvs\r\n\r\n\r\nam\r\n\r\n\r\ngear\r\n\r\n\r\ncarb\r\n\r\n\r\nMazda RX4\r\n\r\n\r\n21.0\r\n\r\n\r\n6\r\n\r\n\r\n160\r\n\r\n\r\n110\r\n\r\n\r\n3.90\r\n\r\n\r\n2.620\r\n\r\n\r\n16.46\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n4\r\n\r\n\r\nMazda RX4 Wag\r\n\r\n\r\n21.0\r\n\r\n\r\n6\r\n\r\n\r\n160\r\n\r\n\r\n110\r\n\r\n\r\n3.90\r\n\r\n\r\n2.875\r\n\r\n\r\n17.02\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n4\r\n\r\n\r\nDatsun 710\r\n\r\n\r\n22.8\r\n\r\n\r\n4\r\n\r\n\r\n108\r\n\r\n\r\n93\r\n\r\n\r\n3.85\r\n\r\n\r\n2.320\r\n\r\n\r\n18.61\r\n\r\n\r\n1\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n1\r\n\r\n\r\nHornet 4 Drive\r\n\r\n\r\n21.4\r\n\r\n\r\n6\r\n\r\n\r\n258\r\n\r\n\r\n110\r\n\r\n\r\n3.08\r\n\r\n\r\n3.215\r\n\r\n\r\n19.44\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\n3\r\n\r\n\r\n1\r\n\r\n\r\nHornet Sportabout\r\n\r\n\r\n18.7\r\n\r\n\r\n8\r\n\r\n\r\n360\r\n\r\n\r\n175\r\n\r\n\r\n3.15\r\n\r\n\r\n3.440\r\n\r\n\r\n17.02\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\n3\r\n\r\n\r\n2\r\n\r\n\r\nValiant\r\n\r\n\r\n18.1\r\n\r\n\r\n6\r\n\r\n\r\n225\r\n\r\n\r\n105\r\n\r\n\r\n2.76\r\n\r\n\r\n3.460\r\n\r\n\r\n20.22\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\n3\r\n\r\n\r\n1\r\n\r\n\r\nIn {dplyr}, this doesn’t make 3 new columns named “a”, “b”, and “c”, all filled with NA:4\r\n\r\n\r\nnew_cols <- c(\"a\", \"b\", \"c\")\r\n\r\nmtcars %>% \r\n head() %>% \r\n select(mpg) %>% \r\n mutate(!!new_cols := NA)\r\n\r\n\r\n Error: The LHS of `:=` must be a string or a symbol\r\n\r\nSo you have to do either one of these:5\r\n\r\n\r\nmtcars %>% \r\n head() %>% \r\n select(mpg) %>% \r\n mutate(\r\n !!new_cols[1] := NA,\r\n !!new_cols[2] := NA,\r\n !!new_cols[3] := NA\r\n )\r\n\r\nmtcars %>% \r\n head() %>% \r\n select(mpg) %>% \r\n mutate(!!new_cols[1] := NA) %>% \r\n mutate(!!new_cols[2] := NA) %>% \r\n mutate(!!new_cols[3] := NA)\r\n\r\n\r\n\r\n\r\n mpg a b c\r\n 1 21.0 NA NA NA\r\n 2 21.0 NA NA NA\r\n 3 22.8 NA NA NA\r\n 4 21.4 NA NA NA\r\n 5 18.7 NA NA NA\r\n 6 18.1 NA NA NA\r\n\r\nSo we’ve got functions being repeated, but in all these cases it looks like we can’t just throw in a vector and expect the function to loop/map over them internally in the specific way that we want it to. And the “correct ways” I provided here are not very satisfying: that’s a lot of copying and pasting!\r\nPersonally, I think it’d be nice to collapse these repetitive calls - but how?\r\nIntroducing purrr::reduce()\r\nThe reduce() function from the {purrr} package is a powerful functional that allows you to abstract away from a sequence of functions that are applied in a fixed direction. You should go give Advanced R Ch. 9.5 a read if you want an in-depth explanation, but here I’m just gonna give a quick crash course for our application of it to our current problem.6\r\nAll you need to know here is that reduce() takes a vector as its first argument, a function as its second argument, and an optional .init argument.7\r\nHere’s a schematic:\r\n\r\n\r\n\r\nFigure 1: From Advanced R by Hadley Wickham\r\n\r\n\r\n\r\nLet me really quickly demonstrate reduce() in action.\r\nSay you wanted to add up the numbers 1 through 5, but only using the plus operator +. You could do something like this:8\r\n\r\n\r\n1 + 2 + 3 + 4 + 5\r\n\r\n\r\n [1] 15\r\n\r\nWhich is the same as this:\r\n\r\n\r\nlibrary(purrr)\r\nreduce(1:5, `+`)\r\n\r\n\r\n [1] 15\r\n\r\nAnd if you want the start value to be something that’s not the first argument of the vector, pass that to the .init argument:\r\n\r\n\r\nidentical(\r\n 0.5 + 1 + 2 + 3 + 4 + 5,\r\n reduce(1:5, `+`, .init = 0.5)\r\n)\r\n\r\n\r\n [1] TRUE\r\n\r\nIf you want to be specific, you can use an {rlang}-style anonymous function where .x is the accumulated value being passed into the first argument fo the function and .y is the second argument of the function.9\r\n\r\n\r\nidentical(\r\n reduce(1:5, `+`, .init = 0.5),\r\n reduce(1:5, ~ .x + .y, .init = 0.5)\r\n)\r\n\r\n\r\n [1] TRUE\r\n\r\nAnd two more examples just to demonstrate that directionality matters:\r\n\r\n\r\nidentical(\r\n reduce(1:5, `^`, .init = 0.5),\r\n reduce(1:5, ~ .x ^ .y, .init = 0.5) # .x on left, .y on right\r\n)\r\n\r\n\r\n [1] TRUE\r\n\r\nidentical(\r\n reduce(1:5, `^`, .init = 0.5),\r\n reduce(1:5, ~ .y ^ .x, .init = 0.5) # .y on left, .x on right\r\n)\r\n\r\n\r\n [1] FALSE\r\n\r\nThat’s pretty much all you need to know - let’s jump right in!\r\nExample 1: {ggplot2}\r\nA reduce() solution\r\nRecall that we had this sad code:\r\n\r\n\r\nggplot(mtcars, aes(hp, mpg)) + \r\n geom_point(size = 8, alpha = .5) +\r\n geom_point(size = 4, alpha = .5) +\r\n geom_point(size = 2, alpha = .5)\r\n\r\n\r\n\r\nFor illustrative purposes, I’m going to move the + “pipes” to the beginning of each line:\r\n\r\n\r\nggplot(mtcars, aes(hp, mpg))\r\n + geom_point(size = 8, alpha = .5)\r\n + geom_point(size = 4, alpha = .5)\r\n + geom_point(size = 2, alpha = .5)\r\n\r\n\r\n\r\nAt this point, we see a clear pattern emerge line-by-line. We start with ggplot(mtcars, aes(hp, mpg)), which is kind of its own thing. Then we have three repetitions of + geom_point(size = X, alpha = .5) where the X varies between 8, 4, and 2. We also notice that the sequence of calls goes from left to right, as is the normal order of piping.\r\nNow let’s translate these observations into reduce(). I’m bad with words so here’s a visual:\r\n\r\n\r\n\r\nLet’s go over what we did in our call to reduce() above:\r\nIn the first argument, we have the vector of values that are iterated over.\r\nIn the second argument, we have an anonymous function composed of…\r\nThe .x variable, which represents the accumulated value. In this context, we keep the .x on the left because that is the left-hand side that we are carrying over to the next call via the +.\r\nThe .y variable, which takes on values from the first argument passed into reduce(). In this context, .y will be each value of the numeric vector c(8, 4, 2) since .init is given.\r\nThe repeating function call geom_point(size = .y, alpha = .5) that is called with each value of the vector passed in as the first argument.\r\n\r\nIn the third argument .init, we have ggplot(mtcars, aes(hp, mpg)) which is the non-repeating piece of code that we start with.\r\nIf you want to see the actual code run, here it is:\r\n\r\n\r\nreduce(\r\n c(8, 4, 2),\r\n ~ .x + geom_point(size = .y, alpha = .5),\r\n .init = ggplot(mtcars, aes(hp, mpg))\r\n)\r\n\r\n\r\n\r\n\r\nLet’s dig in a bit more, this time with an example that looks prettier.\r\nSuppose you want to collapse the repeated calls to geom_point() in this code:\r\n\r\n\r\nviridis_colors <- viridis::viridis(10)\r\n\r\nmtcars %>% \r\n ggplot(aes(hp, mpg)) +\r\n geom_point(size = 20, color = viridis_colors[10]) +\r\n geom_point(size = 18, color = viridis_colors[9]) +\r\n geom_point(size = 16, color = viridis_colors[8]) +\r\n geom_point(size = 14, color = viridis_colors[7]) +\r\n geom_point(size = 12, color = viridis_colors[6]) +\r\n geom_point(size = 10, color = viridis_colors[5]) +\r\n geom_point(size = 8, color = viridis_colors[4]) +\r\n geom_point(size = 6, color = viridis_colors[3]) +\r\n geom_point(size = 4, color = viridis_colors[2]) +\r\n geom_point(size = 2, color = viridis_colors[1]) +\r\n scale_x_discrete(expand = expansion(.2)) +\r\n scale_y_continuous(expand = expansion(.2)) +\r\n theme_void() +\r\n theme(panel.background = element_rect(fill = \"grey20\"))\r\n\r\n\r\n\r\n\r\nYou can do this with reduce() in a couple ways:10\r\n\r\n\r\nMethod 1\r\nMethod 1: Move all the “constant” parts to .init, since the order of these layers don’t matter.\r\n\r\n\r\nreduce(\r\n 10L:1L,\r\n ~ .x + geom_point(size = .y * 2, color = viridis_colors[.y]),\r\n \r\n .init = mtcars %>% \r\n ggplot(aes(hp, mpg)) +\r\n scale_x_discrete(expand = expansion(.2)) +\r\n scale_y_continuous(expand = expansion(.2)) +\r\n theme_void() +\r\n theme(panel.background = element_rect(fill = \"grey20\"))\r\n \r\n)\r\n\r\n\r\n\r\n\r\n\r\nMethod 2\r\nMethod 2: Use reduce() in place, with the help of the {magrittr} dot .\r\n\r\n\r\nmtcars %>% \r\n ggplot(aes(hp, mpg)) %>% \r\n \r\n reduce(\r\n 10L:1L,\r\n ~ .x + geom_point(size = .y * 2, color = viridis_colors[.y]),\r\n .init = . #<- right here!\r\n ) +\r\n \r\n scale_x_discrete(expand = expansion(.2)) +\r\n scale_y_continuous(expand = expansion(.2)) +\r\n theme_void() +\r\n theme(panel.background = element_rect(fill = \"grey20\"))\r\n\r\n\r\n\r\n\r\n\r\nMethod 3\r\nMethod 3: Move all the “constant” parts to the top, wrap it in parentheses, and pass the whole thing into .init using the {magrittr} dot .\r\n\r\n\r\n(mtcars %>% \r\n ggplot(aes(hp, mpg)) +\r\n scale_x_discrete(expand = expansion(.2)) +\r\n scale_y_continuous(expand = expansion(.2)) +\r\n theme_void() +\r\n theme(panel.background = element_rect(fill = \"grey20\"))) %>% \r\n \r\n reduce(\r\n 10L:1L,\r\n ~ .x + geom_point(size = .y * 2, color = viridis_colors[.y]),\r\n .init = . #<- right here!\r\n )\r\n\r\n\r\n\r\n\r\n\r\nAll in all, we see that reduce() allows us to write more succinct code!\r\nAn obvious advantage to this is that it is now really easy to make a single change that applies to all the repeated calls.\r\nFor example, if I want to make the radius of the points grow/shrink exponentially, I just need to modify the anonymous function in the second argument of reduce():\r\n\r\n\r\n# Using Method 3\r\n(mtcars %>% \r\n ggplot(aes(hp, mpg)) +\r\n scale_x_discrete(expand = expansion(.2)) +\r\n scale_y_continuous(expand = expansion(.2)) +\r\n theme_void() +\r\n theme(panel.background = element_rect(fill = \"grey20\"))) %>% \r\n reduce(\r\n 10L:1L,\r\n ~ .x + geom_point(size = .y ^ 1.5, color = viridis_colors[.y]), # exponential!\r\n .init = .\r\n )\r\n\r\n\r\n\r\n\r\nYay, we collapsed ten layers of geom_point()!\r\nfeat. accumulate()\r\nThere’s actually one more thing I want to show here, which is holding onto intermediate values using accumulate().\r\naccumulate() is like reduce(), except instead of returning a single value which is the output of the very last function call, it keeps all intermediate values and returns them in a list.\r\n\r\n\r\naccumulate(1:5, `+`)\r\n\r\n\r\n [1] 1 3 6 10 15\r\n\r\nCheck out what happens if I change reduce() to accumulate() and return each element of the resulting list:\r\n\r\n\r\nplots <- (mtcars %>% \r\n ggplot(aes(hp, mpg)) +\r\n scale_x_discrete(expand = expansion(.2)) +\r\n scale_y_continuous(expand = expansion(.2)) +\r\n theme_void() +\r\n theme(panel.background = element_rect(fill = \"grey20\"))) %>% \r\n accumulate(\r\n 10L:1L,\r\n ~ .x + geom_point(size = .y ^ 1.5, color = viridis_colors[.y]),\r\n .init = .\r\n )\r\n\r\nfor (i in plots) { plot(i) }\r\n\r\n\r\n\r\n\r\nWe got back the intermediate plots!\r\nAre you thinking what I’m thinking? Let’s animate this!\r\n\r\n\r\nlibrary(magick)\r\n\r\n# change ggplot2 objects into images\r\nimgs <- map(1:length(plots), ~ {\r\n img <- image_graph(width = 672, height = 480)\r\n plot(plots[[.x]])\r\n dev.off()\r\n img\r\n})\r\n\r\n# combine images as frames\r\nimgs <- image_join(imgs)\r\n\r\n# animate\r\nimage_animate(imgs)\r\n\r\n\r\n\r\n\r\n\r\n\r\nNeat!11\r\nExample 2: {kableExtra}\r\nA reduce2() solution\r\nRecall that we had this sad code:\r\n\r\n\r\nmtcars %>% \r\n head() %>% \r\n kbl() %>% \r\n kable_classic(html_font = \"Roboto\") %>% \r\n column_spec(3, background = \"skyblue\") %>% \r\n column_spec(4, background = \"forestgreen\") %>% \r\n column_spec(5, background = \"chocolate\")\r\n\r\n\r\n\r\nWe’ve got two things varying here: the column location 3:5 and the background color c(\"skyblue\", \"forestgreen\", \"chocolate\"). We could do the same trick I sneaked into the previous section by just passing one vector to reduce() that basically functions as an index:12\r\n\r\n\r\nnumbers <- 3:5\r\nbackground_colors <- c(\"skyblue\", \"forestgreen\", \"chocolate\")\r\n\r\n(mtcars %>% \r\n head() %>% \r\n kbl() %>% \r\n kable_classic(html_font = \"Roboto\")) %>% \r\n reduce(\r\n 1:3,\r\n ~ .x %>% column_spec(numbers[.y], background = background_colors[.y]),\r\n .init = .\r\n )\r\n\r\n\r\n\r\nBut I want to use this opportunity to showcase reduce2(), which explicitly takes a second varying argument to the function that you are reduce()-ing over.\r\nHere, ..1 is like the .x and ..2 is like the .y from reduce(). The only new part is ..3 which refers to the second varying argument.\r\n\r\n\r\n(mtcars %>% \r\n head() %>% \r\n kbl() %>% \r\n kable_classic(html_font = \"Roboto\")) %>% \r\n reduce2(\r\n 3:5, # 1st varying argument (represented by ..2)\r\n c(\"skyblue\", \"forestgreen\", \"chocolate\"), # 2nd varying argument (represented by ..3)\r\n ~ ..1 %>% column_spec(..2, background = ..3),\r\n .init = .\r\n )\r\n\r\n\r\n\r\nWe’re not done yet! We can actually skip the {magrittr} pipe %>% and just stick ..1 as the first argument inside column_spec().13 This actually improves performance because you’re removing the overhead from evaluating the pipe!\r\nAdditionally, because the pipe forces evaluation with each call unlike + in {ggplot2}, we don’t need the parantheses wrapped around the top part of the code for the {magrittr} dot . to work!\r\nHere is the final reduce2() solution for our sad code:\r\n\r\n\r\nmtcars %>% \r\n head() %>% \r\n kbl() %>% \r\n kable_classic(html_font = \"Roboto\") %>% # No need to wrap in parentheses!\r\n reduce2(\r\n 3:5, \r\n c(\"skyblue\", \"forestgreen\", \"chocolate\"), \r\n ~ column_spec(..1, ..2, background = ..3), # No need for the pipe!\r\n .init = .\r\n )\r\n\r\n\r\n\r\n\r\n\r\nmpg\r\n\r\n\r\ncyl\r\n\r\n\r\ndisp\r\n\r\n\r\nhp\r\n\r\n\r\ndrat\r\n\r\n\r\nwt\r\n\r\n\r\nqsec\r\n\r\n\r\nvs\r\n\r\n\r\nam\r\n\r\n\r\ngear\r\n\r\n\r\ncarb\r\n\r\n\r\nMazda RX4\r\n\r\n\r\n21.0\r\n\r\n\r\n6\r\n\r\n\r\n160\r\n\r\n\r\n110\r\n\r\n\r\n3.90\r\n\r\n\r\n2.620\r\n\r\n\r\n16.46\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n4\r\n\r\n\r\nMazda RX4 Wag\r\n\r\n\r\n21.0\r\n\r\n\r\n6\r\n\r\n\r\n160\r\n\r\n\r\n110\r\n\r\n\r\n3.90\r\n\r\n\r\n2.875\r\n\r\n\r\n17.02\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n4\r\n\r\n\r\nDatsun 710\r\n\r\n\r\n22.8\r\n\r\n\r\n4\r\n\r\n\r\n108\r\n\r\n\r\n93\r\n\r\n\r\n3.85\r\n\r\n\r\n2.320\r\n\r\n\r\n18.61\r\n\r\n\r\n1\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n1\r\n\r\n\r\nHornet 4 Drive\r\n\r\n\r\n21.4\r\n\r\n\r\n6\r\n\r\n\r\n258\r\n\r\n\r\n110\r\n\r\n\r\n3.08\r\n\r\n\r\n3.215\r\n\r\n\r\n19.44\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\n3\r\n\r\n\r\n1\r\n\r\n\r\nHornet Sportabout\r\n\r\n\r\n18.7\r\n\r\n\r\n8\r\n\r\n\r\n360\r\n\r\n\r\n175\r\n\r\n\r\n3.15\r\n\r\n\r\n3.440\r\n\r\n\r\n17.02\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\n3\r\n\r\n\r\n2\r\n\r\n\r\nValiant\r\n\r\n\r\n18.1\r\n\r\n\r\n6\r\n\r\n\r\n225\r\n\r\n\r\n105\r\n\r\n\r\n2.76\r\n\r\n\r\n3.460\r\n\r\n\r\n20.22\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\n3\r\n\r\n\r\n1\r\n\r\n\r\nAnd of course, we now have the flexibilty to do much more complicated manipulations!\r\n\r\n\r\nmtcars %>% \r\n head() %>% \r\n kbl() %>% \r\n kable_classic(html_font = \"Roboto\") %>% \r\n reduce2(\r\n 1:12, \r\n viridis::viridis(12), \r\n ~ column_spec(..1, ..2, background = ..3, color = if(..2 < 5){\"white\"}),\r\n .init = .\r\n )\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nmpg\r\n\r\n\r\ncyl\r\n\r\n\r\ndisp\r\n\r\n\r\nhp\r\n\r\n\r\ndrat\r\n\r\n\r\nwt\r\n\r\n\r\nqsec\r\n\r\n\r\nvs\r\n\r\n\r\nam\r\n\r\n\r\ngear\r\n\r\n\r\ncarb\r\n\r\n\r\nMazda RX4\r\n\r\n\r\n21.0\r\n\r\n\r\n6\r\n\r\n\r\n160\r\n\r\n\r\n110\r\n\r\n\r\n3.90\r\n\r\n\r\n2.620\r\n\r\n\r\n16.46\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n4\r\n\r\n\r\nMazda RX4 Wag\r\n\r\n\r\n21.0\r\n\r\n\r\n6\r\n\r\n\r\n160\r\n\r\n\r\n110\r\n\r\n\r\n3.90\r\n\r\n\r\n2.875\r\n\r\n\r\n17.02\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n4\r\n\r\n\r\nDatsun 710\r\n\r\n\r\n22.8\r\n\r\n\r\n4\r\n\r\n\r\n108\r\n\r\n\r\n93\r\n\r\n\r\n3.85\r\n\r\n\r\n2.320\r\n\r\n\r\n18.61\r\n\r\n\r\n1\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\n1\r\n\r\n\r\nHornet 4 Drive\r\n\r\n\r\n21.4\r\n\r\n\r\n6\r\n\r\n\r\n258\r\n\r\n\r\n110\r\n\r\n\r\n3.08\r\n\r\n\r\n3.215\r\n\r\n\r\n19.44\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\n3\r\n\r\n\r\n1\r\n\r\n\r\nHornet Sportabout\r\n\r\n\r\n18.7\r\n\r\n\r\n8\r\n\r\n\r\n360\r\n\r\n\r\n175\r\n\r\n\r\n3.15\r\n\r\n\r\n3.440\r\n\r\n\r\n17.02\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\n3\r\n\r\n\r\n2\r\n\r\n\r\nValiant\r\n\r\n\r\n18.1\r\n\r\n\r\n6\r\n\r\n\r\n225\r\n\r\n\r\n105\r\n\r\n\r\n2.76\r\n\r\n\r\n3.460\r\n\r\n\r\n20.22\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\n3\r\n\r\n\r\n1\r\n\r\n\r\nfeat. accumulate2()\r\nYep, that’s right - more animations with accumulate() and {magick}!\r\nActually, to be precise, we’re going to use the accumuate2() here to replace our reduce2().\r\nFirst, we save the list of intermediate outputs to tables:\r\n\r\n\r\ntables <- mtcars %>% \r\n head() %>% \r\n kbl() %>% \r\n kable_classic(html_font = \"Roboto\") %>% \r\n kable_styling(full_width = FALSE) %>% # Added to keep aspect ratio constant when saving\r\n accumulate2(\r\n 1:(length(mtcars)+1), \r\n viridis::viridis(length(mtcars)+1), \r\n ~ column_spec(..1, ..2, background = ..3, color = if(..2 < 5){\"white\"}),\r\n .init = .\r\n )\r\n\r\n\r\n\r\nThen, we save each table in tables as an image:\r\n\r\n\r\niwalk(tables, ~ save_kable(.x, file = here::here(\"img\", paste0(\"table\", .y, \".png\")), zoom = 2))\r\n\r\n\r\n\r\nFinally, we read them in and animate:\r\n\r\n\r\ntables <- map(\r\n paste0(\"table\", 1:length(tables), \".png\"),\r\n ~ image_read(here::here(\"img\", .x))\r\n)\r\n\r\ntables <- image_join(tables)\r\n\r\nimage_animate(tables)\r\n\r\n\r\n\r\n\r\n\r\n\r\nBet you don’t see animated tables often!\r\nExample 3: {dplyr}\r\nA reduce() solution\r\nRecall that we had this sad code:\r\n\r\n\r\nnew_cols <- c(\"a\", \"b\", \"c\")\r\n\r\nmtcars %>% \r\n head() %>% \r\n select(mpg) %>% \r\n mutate(!!new_cols[1] := NA) %>% \r\n mutate(!!new_cols[2] := NA) %>% \r\n mutate(!!new_cols[3] := NA)\r\n\r\n\r\n mpg a b c\r\n 1 21.0 NA NA NA\r\n 2 21.0 NA NA NA\r\n 3 22.8 NA NA NA\r\n 4 21.4 NA NA NA\r\n 5 18.7 NA NA NA\r\n 6 18.1 NA NA NA\r\n\r\nYou know the drill - a simple call to reduce() gives us three new columns with names corresponding to the elements of the new_cols character vector we defined above:\r\n\r\n\r\n# Converting to tibble for nicer printing\r\nmtcars <- as_tibble(mtcars)\r\n\r\nmtcars %>% \r\n head() %>% \r\n select(mpg) %>% \r\n reduce(\r\n new_cols,\r\n ~ mutate(.x, !!.y := NA),\r\n .init = .\r\n )\r\n\r\n\r\n # A tibble: 6 x 4\r\n mpg a b c \r\n \r\n 1 21 NA NA NA \r\n 2 21 NA NA NA \r\n 3 22.8 NA NA NA \r\n 4 21.4 NA NA NA \r\n 5 18.7 NA NA NA \r\n 6 18.1 NA NA NA\r\n\r\nAgain, this gives you a lot of flexibility, like the ability to dynamically assign values to each new column:\r\n\r\n\r\nmtcars %>% \r\n head() %>% \r\n select(mpg) %>% \r\n reduce(\r\n new_cols,\r\n ~ mutate(.x, !!.y := paste0(.y, \"-\", row_number())),\r\n .init = .\r\n )\r\n\r\n\r\n # A tibble: 6 x 4\r\n mpg a b c \r\n \r\n 1 21 a-1 b-1 c-1 \r\n 2 21 a-2 b-2 c-2 \r\n 3 22.8 a-3 b-3 c-3 \r\n 4 21.4 a-4 b-4 c-4 \r\n 5 18.7 a-5 b-5 c-5 \r\n 6 18.1 a-6 b-6 c-6\r\n\r\nWe can take this even further using context dependent expressions like cur_data(), and do something like keeping track of the columns present at each point a new column has been created via mutate():\r\n\r\n\r\nmtcars %>% \r\n head() %>% \r\n select(mpg) %>% \r\n reduce(\r\n new_cols,\r\n ~ mutate(.x, !!.y := paste(c(names(cur_data()), .y), collapse = \"-\")),\r\n .init = .\r\n )\r\n\r\n\r\n # A tibble: 6 x 4\r\n mpg a b c \r\n \r\n 1 21 mpg-a mpg-a-b mpg-a-b-c\r\n 2 21 mpg-a mpg-a-b mpg-a-b-c\r\n 3 22.8 mpg-a mpg-a-b mpg-a-b-c\r\n 4 21.4 mpg-a mpg-a-b mpg-a-b-c\r\n 5 18.7 mpg-a mpg-a-b mpg-a-b-c\r\n 6 18.1 mpg-a mpg-a-b mpg-a-b-c\r\n\r\nHere’s another example just for fun - an “addition matrix”:14\r\n\r\n\r\nmtcars %>% \r\n head() %>% \r\n select(mpg) %>% \r\n reduce(\r\n pull(., mpg),\r\n ~ mutate(.x, !!as.character(.y) := .y + mpg),\r\n .init = .\r\n )\r\n\r\n\r\n # A tibble: 6 x 6\r\n mpg `21` `22.8` `21.4` `18.7` `18.1`\r\n \r\n 1 21 42 43.8 42.4 39.7 39.1\r\n 2 21 42 43.8 42.4 39.7 39.1\r\n 3 22.8 43.8 45.6 44.2 41.5 40.9\r\n 4 21.4 42.4 44.2 42.8 40.1 39.5\r\n 5 18.7 39.7 41.5 40.1 37.4 36.8\r\n 6 18.1 39.1 40.9 39.5 36.8 36.2\r\n\r\nLet’s now look at a more practical application of this: explicit dummy coding!\r\nIn R, the factor data structure allows implicit dummy coding, which you can access using contrasts().\r\nHere, in our data penguins from the {palmerpenguins} package, we see that the 3-way contrast between “Adelie”, “Chinstrap”, and “Gentoo” in the species factor column is treatment coded, with “Adelie” set as the reference level:\r\n\r\n\r\ndata(\"penguins\", package = \"palmerpenguins\")\r\n\r\npenguins_implicit <- penguins %>% \r\n na.omit() %>% \r\n select(species, flipper_length_mm) %>% \r\n mutate(species = factor(species))\r\n\r\ncontrasts(penguins_implicit$species)\r\n\r\n\r\n Chinstrap Gentoo\r\n Adelie 0 0\r\n Chinstrap 1 0\r\n Gentoo 0 1\r\n\r\nWe can also infer that from the output of this simple linear model:15\r\n\r\n\r\nbroom::tidy(lm(flipper_length_mm ~ species, data = penguins_implicit))\r\n\r\n\r\n # A tibble: 3 x 5\r\n term estimate std.error statistic p.value\r\n \r\n 1 (Intercept) 190. 0.552 344. 0. \r\n 2 speciesChinstrap 5.72 0.980 5.84 1.25e- 8\r\n 3 speciesGentoo 27.1 0.824 32.9 2.68e-106\r\n\r\nWhat’s cool is that you can make this 3-way treatment coding explicit by expanding the matrix into actual columns of the data!\r\nHere’s a reduce() solution:\r\n\r\n\r\npenguins_explicit <- \r\n reduce(\r\n levels(penguins_implicit$species)[-1],\r\n ~ mutate(.x, !!paste0(\"species\", .y) := as.integer(species == .y)),\r\n .init = penguins_implicit\r\n )\r\n\r\n\r\n\r\n\r\n\r\n\r\nspecies\r\n\r\n\r\nflipper_length_mm\r\n\r\n\r\nspeciesChinstrap\r\n\r\n\r\nspeciesGentoo\r\n\r\n\r\nAdelie\r\n\r\n\r\n181\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n186\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n195\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n193\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n181\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n195\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n182\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n191\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n198\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n185\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n195\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n197\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n184\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n194\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n174\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n180\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n189\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n185\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n180\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n187\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n183\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n187\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n172\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n180\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n178\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n178\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n188\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n184\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n195\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n196\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n180\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n181\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n184\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n182\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n195\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n186\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n196\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n185\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n182\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n191\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n186\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n188\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n200\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n187\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n191\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n186\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n193\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n181\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n194\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n185\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n195\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n185\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n192\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n184\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n192\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n195\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n188\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n198\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n196\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n197\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n195\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n191\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n184\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n187\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n195\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n189\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n196\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n187\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n193\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n191\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n194\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n189\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n189\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n202\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n205\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n185\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n186\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n187\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n208\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n196\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n178\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n192\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n192\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n203\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n183\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n193\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n184\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n199\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n181\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n197\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n198\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n191\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n193\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n197\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n191\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n196\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n188\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n199\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n189\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n189\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n187\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n198\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n176\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n202\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n186\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n199\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n191\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n195\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n191\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n210\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n197\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n193\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n199\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n187\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n191\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n200\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n185\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n193\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n193\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n187\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n188\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n192\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n185\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n190\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n184\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n195\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n193\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n187\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nAdelie\r\n\r\n\r\n201\r\n\r\n\r\n0\r\n\r\n\r\n0\r\n\r\n\r\nGentoo\r\n\r\n\r\n211\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n230\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n210\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n218\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n215\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n210\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n211\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n219\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n209\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n215\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n214\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n216\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n214\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n213\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n210\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n217\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n210\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n221\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n209\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n222\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n218\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n215\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n213\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n215\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n215\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n215\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n215\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n210\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n220\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n222\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n209\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n207\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n230\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n220\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n220\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n213\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n219\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n208\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n208\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n208\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n225\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n210\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n216\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n222\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n217\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n210\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n225\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n213\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n215\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n210\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n220\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n210\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n225\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n217\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n220\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n208\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n220\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n208\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n224\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n208\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n221\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n214\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n231\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n219\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n230\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n229\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n220\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n223\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n216\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n221\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n221\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n217\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n216\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n230\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n209\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n220\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n215\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n223\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n212\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n221\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n212\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n224\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n212\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n228\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n218\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n218\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n212\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n230\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n218\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n228\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n212\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n224\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n214\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n226\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n216\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n222\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n203\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n225\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n219\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n228\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n215\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n228\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n215\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n210\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n219\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n208\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n209\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n216\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n229\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n213\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n230\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n217\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n230\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n222\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n214\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n215\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n222\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n212\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nGentoo\r\n\r\n\r\n213\r\n\r\n\r\n0\r\n\r\n\r\n1\r\n\r\n\r\nChinstrap\r\n\r\n\r\n192\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n196\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n193\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n188\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n197\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n198\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n178\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n197\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n195\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n198\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n193\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n194\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n185\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n201\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n190\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n201\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n197\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n181\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n190\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n195\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n181\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n191\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n187\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n193\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n195\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n197\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n200\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n200\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n191\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n205\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n187\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n201\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n187\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n203\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n195\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n199\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n195\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n210\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n192\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n205\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n210\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n187\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n196\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n196\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n196\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n201\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n190\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n212\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n187\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n198\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n199\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n201\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n193\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n203\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n187\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n197\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n191\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n203\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n202\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n194\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n206\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n189\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n195\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n207\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n202\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n193\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n210\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\nChinstrap\r\n\r\n\r\n198\r\n\r\n\r\n1\r\n\r\n\r\n0\r\n\r\n\r\n\r\nAnd we get the exact same output from lm() when we throw in the new columns speciesChinstrap and speciesGentoo as the predictors!\r\n\r\n\r\nbroom::tidy(lm(flipper_length_mm ~ speciesChinstrap + speciesGentoo, data = penguins_explicit))\r\n\r\n\r\n # A tibble: 3 x 5\r\n term estimate std.error statistic p.value\r\n \r\n 1 (Intercept) 190. 0.552 344. 0. \r\n 2 speciesChinstrap 5.72 0.980 5.84 1.25e- 8\r\n 3 speciesGentoo 27.1 0.824 32.9 2.68e-106\r\n\r\nBy the way, if you’re wondering how this is practical, some modeling packages in R (like {lavaan} for structural equation modeling) only accept dummy coded variables that exist as independent columns/vectors, not as a metadata of a factor vector.16 This is common enough that some packages like {psych} have a function that does the same transformation we just did, called dummy.code()17:\r\n\r\n\r\nbind_cols(\r\n penguins_implicit,\r\n psych::dummy.code(penguins_implicit$species)\r\n)\r\n\r\n\r\n # A tibble: 333 x 3\r\n species flipper_length_mm ...3[,\"Adelie\"] [,\"Gentoo\"] [,\"Chinstrap\"]\r\n \r\n 1 Adelie 181 1 0 0\r\n 2 Adelie 186 1 0 0\r\n 3 Adelie 195 1 0 0\r\n 4 Adelie 193 1 0 0\r\n 5 Adelie 190 1 0 0\r\n 6 Adelie 181 1 0 0\r\n 7 Adelie 195 1 0 0\r\n 8 Adelie 182 1 0 0\r\n 9 Adelie 191 1 0 0\r\n 10 Adelie 198 1 0 0\r\n # ... with 323 more rows\r\n\r\nfeat. {data.table}\r\nOf course, you could do all of this without reduce() in {data.table} because its walrus := is vectorized.\r\nHere’s the {data.table} solution for our sad code:\r\n\r\n\r\nlibrary(data.table)\r\nnew_cols <- c(\"a\", \"b\", \"c\")\r\n\r\nmtcars_dt <- mtcars %>% \r\n head() %>% \r\n select(mpg) %>% \r\n as.data.table()\r\n\r\nmtcars_dt[, (new_cols) := NA][]\r\n\r\n\r\n mpg a b c\r\n 1: 21.0 NA NA NA\r\n 2: 21.0 NA NA NA\r\n 3: 22.8 NA NA NA\r\n 4: 21.4 NA NA NA\r\n 5: 18.7 NA NA NA\r\n 6: 18.1 NA NA NA\r\n\r\nAnd here’s a {data.table} solution for the explicit dummy coding example:\r\n\r\n\r\npenguins_dt <- as.data.table(penguins_implicit)\r\n\r\ntreatment_lvls <- levels(penguins_dt$species)[-1]\r\ntreatment_cols <- paste0(\"species\", treatment_lvls)\r\n\r\npenguins_dt[, (treatment_cols) := lapply(treatment_lvls, function(x){as.integer(species == x)})][]\r\n\r\n\r\n species flipper_length_mm speciesChinstrap speciesGentoo\r\n 1: Adelie 181 0 0\r\n 2: Adelie 186 0 0\r\n 3: Adelie 195 0 0\r\n 4: Adelie 193 0 0\r\n 5: Adelie 190 0 0\r\n --- \r\n 329: Chinstrap 207 1 0\r\n 330: Chinstrap 202 1 0\r\n 331: Chinstrap 193 1 0\r\n 332: Chinstrap 210 1 0\r\n 333: Chinstrap 198 1 0\r\n\r\nI personally default to using {data.table} over {dplyr} in these cases.\r\nMisc.\r\nYou can also pass in a list of functions instead of a list of arguments because why not.\r\nFor example, this replicates the very first code I showed in this blog post:\r\n\r\n\r\nmy_funs <- list(deviation, square, mean, sqrt)\r\n\r\nreduce(\r\n my_funs,\r\n ~ .y(.x),\r\n .init = nums\r\n)\r\n\r\n\r\n [1] 0.3039881\r\n\r\nYou could also pass in both a list of functions and a list of their arguments if you really want to abstract away from, like, literally everything:\r\n\r\n\r\nLawful Good\r\n\r\n\r\nlibrary(janitor)\r\n\r\nmtcars %>% \r\n clean_names(case = \"title\") %>% \r\n tabyl(2) %>% \r\n adorn_rounding(digits = 2) %>% \r\n adorn_totals()\r\n\r\n\r\n Cyl n percent\r\n 4 11 0.34\r\n 6 7 0.22\r\n 8 14 0.44\r\n Total 32 1.00\r\n\r\n\r\n\r\nChaotic Evil\r\n\r\n\r\njanitor_funs <- list(clean_names, tabyl, adorn_rounding, adorn_totals)\r\njanitor_args <- list(list(case = \"title\"), list(2), list(digits = 2), NULL)\r\n\r\nreduce2(\r\n janitor_funs,\r\n janitor_args,\r\n ~ do.call(..2, c(list(dat = ..1), ..3)),\r\n .init = mtcars\r\n)\r\n\r\n\r\n Cyl n percent\r\n 4 11 0.34\r\n 6 7 0.22\r\n 8 14 0.44\r\n Total 32 1.00\r\n\r\n\r\n\r\nHave fun reducing repetitions in your code with reduce()!\r\n\r\nSo much so that there’s going to be a native pipe operator!↩︎\r\nTaken from Advanced R Ch. 6↩︎\r\nIf you aren’t familiar with {kableExtra}, you just need to know that column_spec() can take a column index as its first argument and a color as the background argument to set the background color of a column to the provided color. And as we see here, if a color vector is passed into background, it’s just recycled to color the rows which is not what we want.↩︎\r\nIf this is your first time seeing the “bang bang” !! operator and the “walrus” := operator being used this way, check out the documentation on quasiquotation.↩︎\r\nFor those of you more familiar with quasiquation in {dplyr}, I should also mention that using “big bang” !!! like in mutate(!!!new_cols := NA) doesn’t work either. As far as I know, := is just an alias of = for the {rlang} parser, and as we know = cannot assign more than one variable at once (unlike Python, for example), which explains the error.↩︎\r\nNote that there are more motivated usescases of reduce() out there, mostly in doing mathy-things, and I’m by no means advocating that you should always use reduce() in our context - I just think it’s fun to play around with!↩︎\r\nThere’s also .dir argument that allows you to specify the direction, but not relevant here because when you pipe, the left-hand side is always the first input to the next function.↩︎\r\nIf it helps, think of it like ((((1 + 2) + 3) + 4) + 5)↩︎\r\nThe function passed into reduce() doesn’t have to be in {rlang} anonymous function syntax, but I like it so I’ll keep using it here.↩︎\r\nBy the way, we could also do this with purrr::map() since multiple ggplot2 layers can be stored into a list and added all together in one step. But then we can’t do this cool thing I’m going to show with accumulate() next!↩︎\r\nBy the way, if you want a whole package dedicated to animating and incrementally building {ggplot2} code, check out @EvaMaeRey’s {flipbookr} package!↩︎\r\nWe are still “iterating” over the numbers and background_colors vectors but in a round-about way by passing a vector of indices for reduce() to iterate over instead and using the indices to access elements of the two vectors. This actually seems like the way to go when you have more than two varying arguments because there’s no pmap() equavalent for reduce() like preduce().↩︎\r\nNote that we couldn’t do this with + in our {ggplot2} example because geom_point() doesn’t take a ggplot object as its first argument. Basically, the + operator is re-purposed as a class method for ggplot objects but it’s kinda complicated so that’s all I’ll say about that.↩︎\r\nNote the use of as.character() to make sure that the left-hand side of the walrus := is converted from numeric to character. Alternatively, using the new glue syntax support from dplyr > 1.0.0, we can simplify !!as.character(.y) := to \"{.y}\" :=↩︎\r\nIf you aren’t familiar with linear models in R, we know that “Adelie” is the reference level because there is no “speciesAdelie” term. The estimate for “Adelie” is represented by the “(Intercept)”!↩︎\r\nFiguring this out has caused some headaches and that’s what I get for not carefully reading the docs↩︎\r\nExcept dummy.code() also returns a column for the reference level whose value is always 1, which is kinda pointless↩︎\r\n", "preview": "posts/2020-12-13-collapse-repetitive-piping-with-reduce/reduce_ggplot.png", - "last_modified": "2022-11-13T09:16:57-05:00", + "last_modified": "2022-11-13T06:16:57-08:00", "input_file": {}, "preview_width": 1233, "preview_height": 775 @@ -366,7 +366,7 @@ ], "contents": "\r\n\r\nContents\r\nBefore\r\nMy Plan\r\nAfter\r\nFirst draft\r\nFinal touch-up\r\n\r\n\r\n\r\n\r\n\r\nThis is the second installment of plot makeover where I take a plot in the wild and make very opinionated modifications to it.\r\nBefore\r\nOur plot-in-the-wild comes from (Yurovsky and Yu 2008), a paper on statistical word learning. The plot that I’ll be looking at here is Figure 2, a bar plot of accuracy in a 3-by-3 experimental design.\r\n\r\n\r\n\r\nFigure 1: Plot from Yurovsky and Yu (2008)\r\n\r\n\r\n\r\nAs you might notice, there’s something interesting going on in this bar plot. It looks like the red and green bars stack together but dodge from the blue bar. It’s looks a bit weird for me as someone who mainly uses {ggplot2} because this kind of a hybrid design is not explicitly supported in the API.\r\nFor this plot makeover, I’ll leave aside the issue of whether having a half-stacked, half-dodged bar plot is a good idea.1 In fact, I’m not even gonna focus much on the “makeover” part. Instead I’m just going to take a shot at recreating this plot (likely made in MATLAB with post-processing in PowerPoint) in {ggplot2}.\r\nMy Plan\r\nAgain, my primary goal here is replication. But I do want to touch up on some aesthetics while I’m at it.\r\nMajor Changes:\r\nMove the title to above the plot\r\nMove the legend inside the plot\r\nMove/remove the y-axis title so it’s not vertically aligned\r\nMinor Changes:\r\nRemove grid lines\r\nPut y-axis in percentages\r\nAdd white borders around the bars for clearer color contrast\r\nAfter\r\nFirst draft\r\nFor a first pass on the makeover, I wanted to get the hybrid design right.\r\nThe plot below isn’t quite there in terms of covering everything I laid out in my plan, but it does replicate the bar plot design specifically.\r\n\r\n\r\nPlot\r\n\r\n\r\n\r\n\r\n\r\nCode\r\n\r\n\r\nlibrary(tidyverse)\r\nlibrary(extrafont)\r\n\r\ndf <- tribble(\r\n ~Condition, ~Referent, ~Accuracy,\r\n \"Primacy\", \"Single\", 0.63,\r\n \"Primacy\", \"Primacy\", 0.59,\r\n \"Recency\", \"Single\", 0.63,\r\n \"Recency\", \"Recency\", 0.5,\r\n \"Both\", \"Single\", 0.63,\r\n \"Both\", \"Primacy\", 0.5,\r\n \"Both\", \"Recency\", 0.31\r\n) %>% \r\n mutate(\r\n error_low = runif(7, .04, .06),\r\n error_high = runif(7, .04, .06),\r\n Condition_name = factor(Condition, levels = c(\"Primacy\", \"Recency\", \"Both\")),\r\n Condition = as.numeric(Condition_name),\r\n Referent = factor(Referent, levels = c(\"Single\", \"Recency\", \"Primacy\")),\r\n left = Referent == \"Single\",\r\n color = case_when(\r\n Referent == \"Single\" ~ \"#29476B\",\r\n Referent == \"Primacy\" ~ \"#AD403D\",\r\n Referent == \"Recency\" ~ \"#9BBB58\"\r\n )\r\n )\r\n\r\n\r\nggplot(mapping = aes(x = Condition, y = Accuracy, fill = color)) +\r\n geom_col(\r\n data = filter(df, left),\r\n width = .3,\r\n color = \"white\",\r\n position = position_nudge(x = -.3)\r\n ) +\r\n geom_errorbar(\r\n aes(ymin = Accuracy - error_low, ymax = Accuracy + error_high),\r\n data = filter(df, left),\r\n width = .1,\r\n position = position_nudge(x = -.3)\r\n ) +\r\n geom_col(\r\n data = filter(df, !left),\r\n color = \"white\",\r\n width = .3,\r\n ) +\r\n geom_errorbar(\r\n aes(y = y, ymin = y - error_low, ymax = y + error_high),\r\n data = filter(df, !left) %>% \r\n group_by(Condition) %>% \r\n mutate(y = accumulate(Accuracy, sum)),\r\n width = .1\r\n ) +\r\n scale_fill_identity(\r\n labels = levels(df$Referent),\r\n guide = guide_legend(title = \"Referent\")\r\n ) +\r\n scale_x_continuous(\r\n breaks = 1:3 - .15,\r\n labels = levels(df$Condition_name),\r\n expand = expansion(.1)\r\n ) +\r\n scale_y_continuous(\r\n breaks = scales::pretty_breaks(6),\r\n labels = str_remove(scales::pretty_breaks(6)(0:1), \"\\\\.0+\"),\r\n limits = 0:1,\r\n expand = expansion(0)\r\n ) +\r\n labs(\r\n title = \"Exp1: Accuracy by Condition and Word Type\"\r\n ) +\r\n theme_classic(\r\n base_family = \"Roboto\",\r\n base_size = 16\r\n )\r\n\r\n\r\n\r\n\r\n\r\nAs you might guess from my two calls to geom_col() and geom_errorbar(), I actually split the plotting of the bars into two parts. First I drew the blue bars and their errorbars, then I drew the green and red bars and their errorbars.\r\nEffectively, the above plot is a combination of these two:2\r\n\r\n\r\n\r\nA bit hacky, I guess, but it works!\r\n\r\n\r\n\r\n\r\nFinal touch-up\r\n\r\n\r\n\r\n\r\n\r\nggplot(mapping = aes(x = Condition, y = Accuracy, fill = color)) +\r\n geom_col(\r\n data = filter(df, left),\r\n width = .3,\r\n color = \"white\",\r\n position = position_nudge(x = -.3),\r\n ) +\r\n geom_errorbar(\r\n aes(ymin = Accuracy - error_low, ymax = Accuracy + error_high),\r\n data = filter(df, left),\r\n width = .1,\r\n position = position_nudge(x = -.3)\r\n ) +\r\n geom_col(\r\n data = filter(df, !left),\r\n color = \"white\",\r\n width = .3, \r\n ) +\r\n geom_errorbar(\r\n aes(y = y, ymin = y - error_low, ymax = y + error_high),\r\n data = filter(df, !left) %>% \r\n group_by(Condition) %>% \r\n mutate(y = accumulate(Accuracy, sum)),\r\n width = .1\r\n ) +\r\n geom_hline(\r\n aes(yintercept = .25),\r\n linetype = 2,\r\n size = 1,\r\n ) +\r\n geom_text(\r\n aes(x = 3.4, y = .29),\r\n label = \"Chance\",\r\n family = \"Adelle\",\r\n color = \"grey20\",\r\n inherit.aes = FALSE\r\n ) +\r\n scale_fill_identity(\r\n labels = c(\"Single\", \"Primacy\", \"Recency\"),\r\n guide = guide_legend(\r\n title = NULL,\r\n direction = \"horizontal\",\r\n override.aes = list(fill = c(\"#29476B\", \"#AD403D\", \"#9BBB58\"))\r\n )\r\n ) +\r\n scale_x_continuous(\r\n breaks = 1:3 - .15,\r\n labels = levels(df$Condition_name),\r\n expand = expansion(c(.1, .05))\r\n ) +\r\n scale_y_continuous(\r\n breaks = scales::pretty_breaks(6),\r\n labels = scales::percent_format(1),\r\n limits = 0:1,\r\n expand = expansion(0)\r\n ) +\r\n labs(\r\n title = \"Accuracy by Condition and Referent\",\r\n y = NULL\r\n ) +\r\n theme_classic(\r\n base_family = \"Roboto\",\r\n base_size = 16\r\n ) +\r\n theme(\r\n plot.title.position = \"plot\",\r\n plot.title = element_text(\r\n family = \"Roboto Slab\",\r\n margin = margin(0, 0, 1, 0, \"cm\")\r\n ),\r\n legend.position = c(.35, .9),\r\n axis.title.x = element_text(margin = margin(t = .4, unit = \"cm\")),\r\n plot.margin = margin(1, 1, .7, 1, \"cm\")\r\n )\r\n\r\n\r\n\r\n\r\n\r\n\r\nYurovsky, Daniel, and C. Yu. 2008. Mutual Exclusivity in Cross-Situational Statistical Learning. https://dll.sitehost.iu.edu/papers/Yurovsky_cs08.pdf.\r\n\r\n\r\nI actually don’t even have a strong feeling about this. It does look kinda cool.↩︎\r\nI used a neat trick from the R Markdown Cookbook to get the plots printed side-by-side↩︎\r\n", "preview": "posts/2020-11-08-plot-makeover-2/plot-makeover-2_files/figure-html5/final-1.png", - "last_modified": "2022-11-13T09:16:57-05:00", + "last_modified": "2022-11-13T06:16:57-08:00", "input_file": {}, "preview_width": 1344, "preview_height": 1152 @@ -389,7 +389,7 @@ ], "contents": "\r\n\r\nContents\r\nVisualization\r\nThings I learned\r\nThings to improve\r\n\r\nCode\r\n\r\n\r\n\r\n\r\nVisualization\r\n\r\n\r\n\r\nThings I learned\r\nHow to make waffle charts with {waffle} (finally!)\r\nUsing {patchwork} for a large list of plots using wrap_plots() and theme styling inside plot_annotation()\r\nWorking with a long canvas using the cairo_pdf() device\r\nUsing {ggfittext} for dynamically re-sizing annotations.\r\nThings to improve\r\nCouldn’t figure out background color for the entire visual and white ended up looking a bit too harsh on the eye\r\nIdeally would like to replace the squares with icons. Maybe I could’ve pursued that if I only plotted a couple furnitures.\r\nThe plot ended up being a bit too long. Again could’ve cut down a bit there, but I don’t mind it for this submission because I was more focused on learning how to make waffle charts at all.\r\nOops forgot to put in the data source\r\nCode\r\nAlso available on github\r\n\r\n\r\nlibrary(tidyverse)\r\nlibrary(waffle)\r\nlibrary(extrafont)\r\nlibrary(patchwork)\r\n\r\ntuesdata <- tidytuesdayR::tt_load(2020, week = 45)\r\n\r\nikea_counts <- tuesdata$ikea %>% \r\n count(category) %>% \r\n mutate(n = round(n/5)) %>% \r\n arrange(-n)\r\n\r\nikea_colors <- c(nord::nord_palettes$algoma_forest, dutchmasters::dutchmasters_pal()(13)[-c(1, 8, 12)])\r\n\r\nikea_waffles <- map(1:nrow(ikea_counts), ~ {\r\n df <- slice(ikea_counts, .x)\r\n ggplot(df) +\r\n geom_waffle(\r\n aes(fill = category, values = n),\r\n n_rows = 20,\r\n size = 1.5,\r\n flip = TRUE,\r\n show.legend = FALSE\r\n ) +\r\n scale_fill_manual(values = ikea_colors[.x]) +\r\n ggfittext::geom_fit_text(\r\n aes(xmin = -15, xmax = -5, ymin = .5, ymax = .5 + ceiling(df$n/20), label = category),\r\n size = 54, grow = FALSE, fullheight = FALSE, place = \"left\" ,\r\n family = \"Roboto Slab\", fontface = \"bold\"\r\n ) +\r\n coord_equal(xlim = c(-16, 21)) +\r\n theme_void()\r\n})\r\n\r\nlegend_key <- ggplot() +\r\n annotation_custom(rectGrob(0.5, 0.5, height = .02, width = .02, gp = gpar(fill = \"grey50\", color = \"black\", lwd = 1))) +\r\n annotation_custom(textGrob(\"= 5 units\", gp = gpar(fontfamily = \"Roboto Slab\", fontface = \"bold\", fontsize = 12)), 3, 2.6) +\r\n coord_equal(xlim = c(0, 5), ylim = c(0, 5)) +\r\n theme_void()\r\n\r\npatched <- wrap_plots(ikea_waffles, ncol = 1) +\r\n plot_annotation(\r\n title = \"IKEA<\/span> Furnitures in Stock<\/span>\",\r\n caption = \"@yjunechoe\",\r\n theme = theme(\r\n plot.title = ggtext::element_markdown(\r\n size = 100,\r\n family = \"Noto\",\r\n face = \"bold\",\r\n hjust = .5,\r\n margin = margin(t = 1.5, b = 2, unit = \"in\")\r\n ),\r\n plot.caption = element_text(\r\n size = 32,\r\n family = \"IBM Plex Mono\",\r\n face = \"bold\",\r\n margin = margin(t = 1, b = 1, unit = \"in\")\r\n ),\r\n plot.margin = margin(2, 2, 2, 2, unit = \"in\")\r\n )\r\n ) &\r\n theme(plot.margin = margin(t = .5, b = .5, unit = \"in\")) \r\n\r\n\r\nggsave(\"tidytuesday_2020_45.pdf\", patched, device = cairo_pdf, scale = 2, width = 12, height = 26, limitsize = FALSE)\r\n\r\n\r\n\r\n\r\n\r\n\r\n", "preview": "posts/2020-11-03-tidytuesday-2020-week-45/preview.png", - "last_modified": "2022-11-13T09:16:57-05:00", + "last_modified": "2022-11-13T06:16:57-08:00", "input_file": {}, "preview_width": 4443, "preview_height": 2950 @@ -414,7 +414,7 @@ ], "contents": "\r\n\r\nContents\r\nVisualization\r\nThings I learned\r\nThings to improve\r\n\r\nCode\r\n\r\n\r\n\r\n\r\nVisualization\r\n\r\n\r\n\r\nThings I learned\r\nUsing {magick} for animation composition, thanks to the {gganimate} wiki\r\nThe very basics of working with spatial data with {rnaturalearth} and {sf}1\r\nA bit about color schemes for maps (I particularly love this color as a way of de-emphasizing territories in the background)\r\nThings to improve\r\nI couldn’t figure out how to add margins to the bottom, but I now realize that I could’ve just played around with expansion() for the y-axis of the bar animation plot.\r\nImage composition took a while to render, which was a bit frustrating. Need to find a way to speed that up.\r\nCode\r\nAlso available on github\r\n\r\n\r\nlibrary(tidyverse)\r\nlibrary(gganimate)\r\nlibrary(extrafont)\r\n\r\ntuesdata <- tidytuesdayR::tt_load(2020, week = 44)\r\n\r\nwind_turbine <- tuesdata$`wind-turbine` %>% \r\n select(\r\n ID = objectid,\r\n Province = province_territory,\r\n Capacity = total_project_capacity_mw,\r\n Diameter = rotor_diameter_m,\r\n Height = hub_height_m,\r\n Year = commissioning_date,\r\n Lat = latitude,\r\n Lon = longitude\r\n ) %>% \r\n arrange(Year, -Diameter) %>% \r\n mutate(\r\n Year = as.integer(str_match(Year, \"^\\\\d{4}\")[,1])\r\n )\r\n\r\n\r\n\r\nne_map <- rnaturalearth::ne_countries(scale='medium', returnclass = 'sf')\r\n\r\nturbine_anim <- wind_turbine %>% \r\n ggplot() +\r\n geom_rect(\r\n aes(xmin = -150, xmax = -50, ymin = 40, ymax = 72),\r\n fill = \"#B6D0D1\"\r\n ) +\r\n geom_sf(\r\n aes(fill = ifelse(admin == \"Canada\", \"#7BC86C\", \"#FFF8DC\")),\r\n show.legend = FALSE,\r\n data = filter(ne_map, admin %in% c(\"Canada\", \"United States of America\"))\r\n ) +\r\n scale_fill_identity() +\r\n geom_point(\r\n aes(Lon, Lat, group = ID, size = Capacity),\r\n show.legend = FALSE, alpha = 0.5, color = \"#3C59FF\"\r\n ) +\r\n geom_text(\r\n aes(x = -138, y = 43, label = as.character(Year)),\r\n size = 24, color = \"grey35\", family = \"Roboto Slab\"\r\n ) +\r\n geom_rect(\r\n aes(xmin = -150, xmax = -50, ymin = 40, ymax = 72),\r\n fill = \"transparent\", color = \"black\"\r\n ) +\r\n coord_sf(\r\n xlim = c(-150, -50),\r\n ylim = c(40, 72),\r\n expand = FALSE,\r\n clip = \"on\"\r\n ) +\r\n ggtitle(\"Canadian Wind Turbines\") +\r\n theme_void() +\r\n theme(\r\n plot.title = element_text(family = \"Adelle\", s),\r\n plot.margin = margin(1, 1, 1, 1, \"cm\")\r\n ) +\r\n transition_reveal(Year)\r\n\r\nanimate(turbine_anim, width = 1000, height = 600, nframes = 100)\r\n\r\n\r\n\r\ncapacity_data <- wind_turbine %>% \r\n group_by(Year) %>% \r\n summarize(\r\n Capacity = sum(Capacity),\r\n .groups = 'drop'\r\n ) %>% \r\n mutate(\r\n Capacity = accumulate(Capacity, sum),\r\n width = (Capacity/max(Capacity)) * 70\r\n )\r\n\r\ncapacity_anim <- capacity_data %>% \r\n ggplot(aes(x = 1, y = Capacity)) +\r\n geom_col(\r\n fill = \"#3C59FF\",\r\n ) +\r\n geom_text(\r\n aes(label = paste(as.character(round(Capacity * 0.001)), \"GW\")),\r\n hjust = -.2,\r\n family = \"IBM Plex Mono\"\r\n ) +\r\n scale_y_continuous(expand = expansion(c(.1, .4))) +\r\n coord_flip() +\r\n theme_void() +\r\n transition_states(Year)\r\n\r\nanimate(capacity_anim, res = 300, width = 1000, height = 100, nframes = 100)\r\n\r\n\r\nlibrary(magick)\r\n\r\nmap_gif <- image_read(\"turbine_map.gif\")\r\nbar_gif <- image_read(\"capacity_bar.gif\")\r\n\r\nnew_gif <- image_append(c(map_gif[1], bar_gif[1]), stack = TRUE)\r\n\r\nfor(i in 2:100){\r\n combined <- image_append(c(map_gif[i], bar_gif[i]), stack = TRUE)\r\n new_gif <- c(new_gif, combined)\r\n}\r\n\r\nnew_gif\r\n\r\n\r\n\r\n\r\nIf I don’t count all the convenient US-centric data/packages I’ve used to plot American maps before, this would be the first map I’ve made from scratch.↩︎\r\n", "preview": "posts/2020-10-28-tidytuesday-2020-week-44/preview.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 735, "preview_height": 541 @@ -438,7 +438,7 @@ ], "contents": "\r\n\r\nContents\r\nIntroduction\r\nSetup\r\nAnalysis\r\nConclusion\r\n\r\nIntroduction\r\n\r\n\r\nI, along with nearly two-hundred thousand other people, follow @everycolorbot on twitter. @everycolorbot is a twitter bot that tweets an image of a random color every hour (more details on github). It’s 70% a source of inspiration for new color schemes, and 30% a comforting source of constant in my otherwise hectic life.\r\nWhat I’ve been noticing about @everycolorbot’s tweets is that bright, highly saturated neon colors (yellow~green) tend to get less likes compared to cool blue colors and warm pastel colors. You can get a feel of this difference in the number of likes between the two tweets below, tweeted an hour apart:\r\n\r\n\r\n\r\n\r\n0x54e14b pic.twitter.com/Aw0cwm7uy8\r\n\r\n— Every Color (@everycolorbot) October 15, 2020\r\n\r\n\r\n\r\n\r\n\r\n0xaa70a5 pic.twitter.com/NMBF3mffS4\r\n\r\n— Every Color (@everycolorbot) October 15, 2020\r\n\r\n\r\n\r\nThis is actually not a big surprise. Bright pure colors are very harsh and straining to the eye, especially on a white background.1 For this reason bright colors are almost never used in professional web design, and are also discouraged in data visualization.\r\nSo here’s a mini experiment testing that claim: I’ll use @everycolorbot’s tweets (more specifically, the likes on the tweets) as a proxy for likeability/readability/comfortableness/etc. It’ll be a good exercise for getting more familiar with different colors! I’m also going to try a simple descriptive analysis using the HSV color representation, which is a psychologically-motivated mental model of color that I like a lot (and am trying to get a better feel for).\r\n\r\n\r\n\r\nFigure 1: HSV cylinder\r\n\r\n\r\n\r\nSetup\r\nUsing {rtweet} requires authentication from twitter. The steps to do so are very well documented on the package website so I wouldn’t expect too much trouble setting it up if it’s your first time using it. But just for illustration, here’s what my setup looks like:\r\n\r\n\r\napi_key <- 'XXXXXXXXXXXXXXXXXXXXX'\r\napi_secret_key <- 'XXXXXXXXXXXXXXXXXXXXX'\r\naccess_token <- \"XXXXXXXXXXXXXXXXXXXXX\"\r\naccess_token_secret <- \"XXXXXXXXXXXXXXXXXXXXX\"\r\n\r\ntoken <- create_token(\r\n app = \"XXXXXXXXXXXXXXXXXXXXX\",\r\n consumer_key = api_key,\r\n consumer_secret = api_secret_key,\r\n access_token = access_token,\r\n access_secret = access_token_secret\r\n)\r\n\r\n\r\n\r\nAfter authorizing, I queried the last 10,000 tweets made by @everycolorbot. It ended up only returning about a 1/3 of that because the twitter API only allows you to go back so far in time, but that’s plenty for my purposes here.\r\n\r\n\r\n\r\n\r\n\r\ncolortweets <- rtweet::get_timeline(\"everycolorbot\", 10000)\r\n\r\ndim(colortweets)\r\n\r\n\r\n\r\n\r\n [1] 3238 90\r\n\r\nAs you see above, I also got back 90 variables (columns). I only care about the time of the tweet, the number of likes it got, and the color it tweeted, so those are what I’m going to grab. I also want to clean things up a bit for plotting, so I’m going to grab just the hour from the time and just the hex code from the text.\r\n\r\n\r\ncolortweets_df <- colortweets %>% \r\n select(created_at, text, favorite_count) %>%\r\n mutate(\r\n created_at = lubridate::hour(created_at),\r\n text = paste0(\"#\", str_extract(text, \"(?<=0x).*(?= )\"))\r\n ) %>% \r\n rename(\r\n likes = favorite_count,\r\n hour = created_at,\r\n hex = text\r\n )\r\n\r\n\r\n\r\nAnd here’s what we end up with:\r\n\r\n\r\nlikes\r\n\r\n\r\nhour\r\n\r\n\r\nhex\r\n\r\n\r\n36\r\n\r\n\r\n15\r\n\r\n\r\n#65f84e\r\n\r\n\r\n65\r\n\r\n\r\n14\r\n\r\n\r\n#32fc27\r\n\r\n\r\n89\r\n\r\n\r\n13\r\n\r\n\r\n#997e13\r\n\r\n\r\n140\r\n\r\n\r\n12\r\n\r\n\r\n#ccbf09\r\n\r\n\r\n303\r\n\r\n\r\n11\r\n\r\n\r\n#665f84\r\n\r\n\r\n75\r\n\r\n\r\n10\r\n\r\n\r\n#b32fc2\r\n\r\n\r\nHere is the link to this data if you’d like to replicate or extend this analysis yourself.\r\nAnalysis\r\nBelow is a bar plot of colors where the height corresponds to the number of likes. It looks cooler than your usual bar plot because I transformed the x dimension into polar coordinates. My intent in doing this was to control for the hour of day in my analysis and visualize it like a clock (turned out better than expected!)\r\n\r\n\r\ncolortweets_df %>% \r\n arrange(-likes) %>% \r\n ggplot(aes(hour, likes, color = hex)) +\r\n geom_col(\r\n aes(size = likes),\r\n position = \"dodge\",\r\n show.legend = FALSE\r\n ) +\r\n scale_color_identity() +\r\n theme_void() +\r\n theme(\r\n plot.background = element_rect(fill = \"#222222\", color = NA),\r\n ) +\r\n coord_polar()\r\n\r\n\r\n\r\n\r\nCheck out my use of arrange() here: it’s how I tell ggplot to plot the longer bars first then the smaller bars, minimizing the overlap!\r\n\r\n\r\n\r\nI notice at least two interesting contrasts in this visualization:\r\nNeon colors (yellow, green, pink) and dark brown and black seems to dominate the center (least liked colors) while warm red~blue pastel colors dominate around the edges (most liked colors)\r\nThere also seems to be a distinction between pure blue and red in the inner-middle circle vs. the green~blue pastel colors in the outer-middle circle.\r\nSo maybe we can say that there are four clusters here:\r\nLeast liked: Bright neon colors + highly saturated dark colors\r\nLesser liked: Bright pure/near-pure colors\r\nMore liked: Darker pastel RGB\r\nMost liked: Lighter pastel mixed colors\r\nNow’s let’s try to quantitatively describe each cluster.\r\nFirst, as a sanity check, I’m just gonna eyeball the range of likes for each cluster using an un-transformed version of the above plot with units. I think we can roughly divide up the clusters at 100 likes, 200 likes, and 400 likes.\r\n\r\n\r\ncolortweets_df %>% \r\n arrange(-likes) %>% \r\n ggplot(aes(hour, likes, color = hex)) +\r\n geom_col(\r\n aes(size = likes),\r\n position = \"dodge\",\r\n show.legend = FALSE\r\n ) +\r\n geom_hline(\r\n yintercept = c(100, 200, 400), \r\n color = \"white\", \r\n linetype = 2, \r\n size = 2\r\n ) +\r\n scale_y_continuous(breaks = scales::pretty_breaks(10)) +\r\n scale_color_identity() +\r\n theme_void() +\r\n theme(\r\n plot.background = element_rect(fill = \"#222222\", color = NA),\r\n axis.line.y = element_line(color = \"white\"),\r\n axis.text.y = element_text(\r\n size = 14,\r\n color = \"white\",\r\n margin = margin(l = 3, r = 3, unit = \"mm\")\r\n )\r\n )\r\n\r\n\r\n\r\n\r\nIf our initial hypothesis about the four clusters are true, we should see these clusters having distinct profiles. Here, I’m going to use the HSV representation to quantitatively test this. To convert our hex values into HSV, I use the as.hsv() function from the {chroma} package - an R wrapper for the javascript library of the same name.\r\n\r\n\r\ncolortweets_df_hsv <- colortweets_df %>% \r\n mutate(hsv = map(hex, ~as_tibble(chroma::as.hsv(.x)))) %>% \r\n unnest(hsv)\r\n\r\n\r\n\r\n\r\nActually, I used furrr::future_map() here myself because I found the hex-hsv conversion to be sorta slow.\r\nAnd now we have the HSV values (hue, saturation, value)!\r\n\r\n\r\nlikes\r\n\r\n\r\nhour\r\n\r\n\r\nhex\r\n\r\n\r\nh\r\n\r\n\r\ns\r\n\r\n\r\nv\r\n\r\n\r\n36\r\n\r\n\r\n15\r\n\r\n\r\n#65f84e\r\n\r\n\r\n111.88235\r\n\r\n\r\n0.6854839\r\n\r\n\r\n0.9725490\r\n\r\n\r\n65\r\n\r\n\r\n14\r\n\r\n\r\n#32fc27\r\n\r\n\r\n116.90141\r\n\r\n\r\n0.8452381\r\n\r\n\r\n0.9882353\r\n\r\n\r\n89\r\n\r\n\r\n13\r\n\r\n\r\n#997e13\r\n\r\n\r\n47.91045\r\n\r\n\r\n0.8758170\r\n\r\n\r\n0.6000000\r\n\r\n\r\n140\r\n\r\n\r\n12\r\n\r\n\r\n#ccbf09\r\n\r\n\r\n56.00000\r\n\r\n\r\n0.9558824\r\n\r\n\r\n0.8000000\r\n\r\n\r\n303\r\n\r\n\r\n11\r\n\r\n\r\n#665f84\r\n\r\n\r\n251.35135\r\n\r\n\r\n0.2803030\r\n\r\n\r\n0.5176471\r\n\r\n\r\n75\r\n\r\n\r\n10\r\n\r\n\r\n#b32fc2\r\n\r\n\r\n293.87755\r\n\r\n\r\n0.7577320\r\n\r\n\r\n0.7607843\r\n\r\n\r\nWhat do we get if we average across the dimensions of HSV for each cluster?\r\n\r\n\r\ncolortweets_df_hsv <- colortweets_df_hsv %>% \r\n mutate(\r\n cluster = case_when(\r\n likes < 100 ~ \"Center\",\r\n between(likes, 100, 200) ~ \"Inner-Mid\",\r\n between(likes, 201, 400) ~ \"Outer-Mid\",\r\n likes > 400 ~ \"Edge\"\r\n ),\r\n cluster = fct_reorder(cluster, likes)\r\n )\r\n\r\ncolortweets_df_hsv %>% \r\n group_by(cluster) %>% \r\n summarize(across(h:v, mean), .groups = 'drop')\r\n\r\n\r\n # A tibble: 4 x 4\r\n cluster h s v\r\n * \r\n 1 Center 121. 0.770 0.710\r\n 2 Inner-Mid 191. 0.734 0.737\r\n 3 Outer-Mid 197. 0.589 0.710\r\n 4 Edge 249. 0.304 0.832\r\n\r\nThis actually matches up pretty nicely with our initial analysis! We find a general dislike for green colors (h value close to 120) over blue colors (h value close to 240), as well as a dislike for highly saturated colors (intense, bright) over those with low saturation (which is what gives off the “pastel” look). To help make the hue values more interpretable, here’s a color wheel with angles that correspond to the hue values in HSV.2\r\n\r\n\r\n\r\nFigure 2: Hue color wheel\r\n\r\n\r\n\r\nBut we also expect to find within-cluster variation along HSV. In particular, hue is kind of uninterpretable on a scale so it probably doesn’t make a whole lot of sense to take a mean of that. So back to the drawing plotting board!\r\nSince saturation and value do make more sense on a continuous scale, let’s draw a scatterplot for each cluster with saturation on the x-axis and value on the y-axis. I’m also going to map hue to the color of each point, but since hue is abstract on its own, I’m actually just going to replace it with the hex values (i.e., the actual color).\r\n\r\n\r\ncolortweets_df_hsv %>% \r\n ggplot(aes(s, v, color = hex)) +\r\n geom_point() +\r\n scale_color_identity() +\r\n lemon::facet_rep_wrap(~cluster) +\r\n theme_void(base_size = 16, base_family = \"Montserrat Medium\") +\r\n theme(\r\n plot.margin = margin(3, 5, 5, 5, \"mm\"),\r\n strip.text = element_text(margin = margin(b = 3, unit = \"mm\")),\r\n panel.border = element_rect(color = \"black\", fill = NA),\r\n panel.background = element_rect(fill = \"grey75\", color = NA)\r\n )\r\n\r\n\r\n\r\n\r\nHere’s the mappings spelled out again:\r\nsaturation (how colorful a color is) is mapped to the X-dimension\r\nvalue (how light a color is) is mapped to the Y-dimension\r\nhex (the actual color itself) is mapped to the COLOR dimension\r\n\r\nOur plot above reinforce what we’ve found before. Colors are more likeable (literally) the more they…\r\nMove away from green: Neon-green dominates the least-liked cluster, and that’s a blatant fact. Some forest-greens survive to the lesser-liked cluster, but is practically absent in the more-liked cluster and most-liked cluster. It looks like the only way for green to be redeemable is to either mix in with blue to become cyan and turquoise, which dominates the more-liked cluster, or severly drop in saturation to join the ranks of other pastel colors in the most-liked cluster.\r\nIncrease in value and decrease in saturation: It’s clear that the top-left corner is dominated by the more-liked and the most-liked cluster. That region is, again, where pastel colors live. They’re calmer than the bright neon colors that plague the least-liked cluster, and are more liked than highly-saturated and intense colors like those in the top right of the Outer-Mid panel. So perhaps this is a lesson that being “colorful” can only get you so far.\r\nConclusion\r\nObviously, all of this should be taken with a grain of salt. We don’t know the people behind the likes - their tastes, whether they see color differently, what medium they saw the tweet through, their experiences, etc.\r\nAnd of course, we need to remind ourselves that we rarely see a color just by itself in the world. It contrasts and harmonizes with other colors in the environment in very complex ways.\r\nBut that’s what kinda makes our analysis cool - despite all these complexities, we see evidence for many things that experts working with color emphasize: avoid pure neon, mix colors, etc. This dataset also opens us up to many more types of analyses (like an actual cluster analysis) that might be worth looking into.\r\nGood stuff.\r\n\r\nDark mode ftw!↩︎\r\nWhile all color wheels look the same, they aren’t all oriented the same. When using HSV, make sure to reference the color wheel where the red is at 0, green is as 120, and blue is at 240.↩︎\r\n", "preview": "posts/2020-10-22-analysis-of-everycolorbots-tweets/preview.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 2433, "preview_height": 2259 @@ -461,7 +461,7 @@ ], "contents": "\r\n\r\nContents\r\nVisualization\r\nReflections on guiding aesthetics\r\nWhy bother with guiding aesthetics?\r\nDesigning guiding aesthetics\r\nMy thought process\r\nCode\r\n\r\n\r\n\r\n\r\nVisualization\r\n\r\n\r\n\r\nReflections on guiding aesthetics\r\nAdmittedly, the plot itself is quite simple, but I learned a lot from this process. So, breaking from the usual format of my tidytuesday blogposts, I want to talk about the background and motivation behind the plot as this was a big step in a new (and exciting!) direction for me that I’d like to document.\r\nJust for clarification, I’m using the term guiding aesthetics to refer to elements of the plot that do not represent a variable in the data, but serves to emphasize the overall theme or topic of the being visualized. So the mountains in my plot do not themselves contain any data, but it’s the thing that tells readers that the plot is about mountains (a valuable, but different kind of info!). But more on that later.\r\nWhy bother with guiding aesthetics?\r\nThis was my first time adding a huge element to a plot that wasn’t meaningful, in the sense of representing the data. As someone in academia working on obscure topics (as one does in academia), I’m a firm believer in making your plots as simple and minimal as possible. So I, like many others, think that it’s always a huge risk to add elements that are not absolutely necessary.\r\nBut as you might imagine, I first fell in love with data visualization not because of how objective and straightforward they are, but because of how eye-catching they can be. Like, when I was young I used to be really into insects. In fact, I read insect encyclopedias as a hobby. That TMI is relevant because those are full of data(!) visualizations that employ literal mappings of data, since those are easy to interpret for children. For example, consider the following diagram of the life cycle of the Japanese beetle from the USDA.\r\n\r\n\r\n\r\nFigure 1: Diagram of the life cycle of the Japanese beetle\r\n\r\n\r\n\r\nAs a child, I never appreciated/realized all the data that was seamlessly packed into diagrams like this. But with my Grammar of Graphics lens on, I can now see that:\r\nThe developmental stage at each month is mapped to x\r\nThe depth at which the developing beetle lives at each stage is mapped to y\r\nThe appearance of the beetle at each developmental stage is mapped to shape\r\nThe size of the beetle at each developmental stage is mapped to, well, size\r\nBut I could have easily plotted something like this instead:\r\n\r\n\r\nbeetles <- tibble::tribble(\r\n ~Month, ~Depth, ~Stage, ~Size,\r\n \"JAN\", -10, \"Larva\", 10,\r\n \"FEB\", -8, \"Larva\", 12,\r\n \"MAR\", -7, \"Larva\", 14,\r\n \"APR\", -7, \"Larva\", 14,\r\n \"MAY\", -6, \"Pupa\", 11,\r\n \"JUN\", 0, \"Beetle\", 12,\r\n \"JUL\", 0, \"Beetle\", 12,\r\n \"AUG\", -3, \"Larva\", 1,\r\n \"SEP\", -2, \"Larva\", 2,\r\n \"OCT\", -1, \"Larva\", 4,\r\n \"NOV\", -3, \"Larva\", 5,\r\n \"DEC\", -8, \"Larva\", 7\r\n)\r\n\r\nbeetles$Month <- fct_inorder(beetles$Month)\r\n\r\nggplot(beetles, aes(x = Month, y = Depth, size = Size, shape = Stage)) +\r\n geom_point() +\r\n ggtitle(\"The lifecycle of the Japanese beetle\")\r\n\r\n\r\n\r\n\r\nBoth visuals represent the data accurately, but of course the diagram looks better. And not just because it’s complex, but also because it exploits the associations between aesthetic dimensions and their meanings, as well as the strength of those associations.\r\nFor example, the shape dimension literally corresponds to shape, and the shape of the different developmental stages of the beetle are unique enough for there to be interesting within-stage variation and still be recognizable - e.g., the beetle looks different crawling out the ground in June and entering back in August, but it’s still recognizable as a beetle (and the same beetle at that!). This would’ve been difficult if the variable being mapped to shape was something arbitrary and abstract, like the type of protein that’s most produced at a particular stage. The diagram thus exploits the strength of the association between the shape dimension and the literal shapes of the beetle to represent the developmental stages. And it should be clear now that the same goes for y and size.\r\nHow about x? There’s no such strong/literal interpretation of the x-dimension - at best it just means something horizontal. So it’s actually fitting that a similarly abstract concept like the passage of time is mapped to x. We understand time linearly, and often see time as the x-axis in other plots, so it fits pretty naturally here.\r\nLastly, let’s talk about the color dimension. Even though no information was actually mapped to color, we certainly are getting some kind of information from the colors used in the plot. Literally put, we’re getting the information that the grass is green and the soil is brown. Now, that information is actually not representing any data that we care about so it’s technically visual noise, but it helps bring forward the overall theme of the diagram. While this worked out in the end, notice now that you have effectively thrown out color as a dimension that can convey meaningful information. That was a necessary tradeoff, but a well motivated one, since the information that the diagram is trying to convey doesn’t really need color.\r\nDesigning guiding aesthetics\r\nI’m hardly the expert, but I found it helpful to think of the process as mediating the tug of war between the guiding aesthetic and the variables in the data as they fight over space in different mapping dimensions.\r\nThis meant I had to make some changes to my usual workflow for explanatory data visualization, which mostly goes something like this:\r\nTake my response variable and map it to y\r\nFigure out the distribution of my response variables and choose the geom (e.g., boxplot, histogram).\r\nMap my dependent variables to other dimensions - this usually ends at either just x or x + facet groupings\r\nBut if I’m trying to incorporate guiding aesthetics, my workflow would look more like this:\r\nStart with a couple ideas for a visual representation of the topic (scenes, objects, etc.)\r\nFigure out the dimensions that the variables in the data can be mapped to\r\nFigure out the dimensions that each visual representation would intrude in\r\nMake compromises between (2) and (3) in a way that maximizes the quality of the data and the visual representation\r\nOf course, this kind of flexibility is unique to exploratory data visualization, in particular to the kinds where none of the variable is significant or interesting a priori. Of course in real life there will be a lot more constraints, but because we can assume a great degree of naivety towards the data for #tidytuesday, I get to pick and choose what I want to plot (which makes #tidytuesday such a great place to practice baby steps)!\r\nFor illustration, here’s my actual thought process while I was making the #tidytuesday plot.\r\nWARNING: a very non-linear journey ahead!\r\nMy thought process\r\nThe topic was about Himalayan climbing expeditions, so I wanted my visual to involve a mountain shape. The most obvious idea that came to mind was to draw a mountain where the peak of that mountain corresponded with their height_metres, a variable from the peaks data. It’s straightforward and intuitive! So I drew a sketch (these are the original rough sketches - please forgive the quality!)\r\n\r\n\r\n\r\nFigure 2: The first guiding aesthetic idea\r\n\r\n\r\n\r\nBut this felt… a bit empty. I threw in Mount Everest, but now what? I still had data on hundreds of more Himalayan peaks that were in the dataset. Just visualizing Mount Everest is not badly motivated per se (it’s the most famous peak afterall), but it wouldn’t make an interesting plot since there wasn’t much data on individual peaks. I wanted to add a few more mountain shapes, but I struggled to find a handful of peaks that formed a coherent group. I knew that if I wanted to go with the idea of mountain shapes as the guiding aesthetic, I could only manage to fit about a dozen or so without it looking too crowded.\r\nI put that issue aside for the moment and moved on while trying to accommodate for the possibility that I may have to fit in many more peaks. I thought about having a single mountain shape just for Mount Everest, and a point with a vertical spikeline for all other peaks to emphasize the y-axis representing height_metres.\r\n\r\n\r\n\r\nFigure 3: The second guiding aesthetic idea\r\n\r\n\r\n\r\nAt this point I started thinking about the x-axis. If I do use points to represent peaks (specifically, peak height), where would I actually position each point? Just randomly along the x-axis? It really started hitting me at this point that the quality of my data was pretty abysmal. Even if I ended up with a pretty visualization, I didn’t think I could justify calling it a data visualization. I felt that it’d be a reach to use complex visuals just to communicate a single variable.\r\nI toyed around with enriching the data being plotted. What if I use size of the dots to represent the average age of climbers who reached the peak, from the memmbers data? Or what if I used shape of country flags on top of the dots to represent the country that first reached the peak, from the expeditions data?\r\nThese were all cool ideas, but I kept coming back to the need to make the x-dimension meaningful. It just stood out too much. I didn’t think I could prevent the reader from expecting some sort of a meaning from the positioning of the dots along the x-axis.\r\nSo I went back to Step #2. I gathered up all the continuous variables across the three data in the #tidytuesday dataset (peaks, members, expeditions) and evaluated how good of a candidate each of them were for being mapped to x. This was the most time-consuming part of the process, and I narrowed it down to three candidates:\r\nexpeditions$members: looked okay at first, but once I started aggregating (averaging) by peak, the distribution became quite narrow. That made it less interesting and not very ideal for mountain shapes (the typical mountain shape is wider than they are tall).\r\nmembers$age: has a nice distribution and a manageable range with no extreme outliers.\r\npeaks$first_ascent_year: also has the above features + doesn’t need to be aggregated in some way, so the x-axis would have a very straight forward interpretation.\r\nThe first_ascent_year variable looked the most promising, so that’s what I pursued (and ended up ultimately adopting!).\r\n\r\n\r\n\r\nFigure 4: The third guiding aesthetic idea\r\n\r\n\r\n\r\nNow I felt like I had more direction to tackle the very first issue that I ran into during this process: the problem of picking out a small set of peaks that were interesting and well-motivated. I played around more with several options, but I ultimately settled on something very simple - the top 10 most popular peaks. Sure it’s overused and not particularly exciting, but that was a sacrifice that my over-worked brain was willing to make at the time.\r\nAnd actually, it turned out to be a great fit with my new x variable! It turns out that the top 10 most climbed peaks are also those that were among the first to be climbed (a correlation that sorta makes sense), so this set of peaks had an additional benefit of localizing the range of x to between the 1940s-1960s. And because 10 was a manageable number, I went ahead with my very first idea of having a mountain accompanying each point, where the peaks represent the peak of the guiding aesthetic (the mountain shape) as well as the height_metres and first_ascent_year.\r\nFinally, it came time for me to polish up on the mountains. I needed to decide on features of the mountains like how wide the base is, how many valleys and peaks it has, how tall the peaks are relative to each other, etc. I had to be careful that these superfluous features do not encroach on the dimensions where I mapped my data to - the x and y. Here, I had concerns about two of the mountain features in particular: base width and smaller peaks:\r\nThe base width was troubling because how wide the base of the mountain stretches could be interpreted as representing another variable that has to do with year (like the first and last time it was climbed, for example). This was a bit difficult to deal with, but I settled on a solution which was to keep the base width constant. By not having that feature vary at all, I could suppress any expectation for it to carry some sort of meaning. It’s kind of like how when you make a scatterplot with variables mapped to x and y, you don’t imbue any special meaning to the fact that the observations are represented by a circle (point), beacuse all of them are that shape. If they varied in any way, say you also have some rectangles and triangles, then you’d start expecting the shape to represent something meaningful.\r\nThe smaller peaks of the mountain shapes were troubling because I was already using the peak to represent the height. It helped that the actual peaks representing the data were also marked by a point and a label of the peak name. But to make it extra clear that the they were pure noise, I decided to randomly generate peaks and valleys, and tried to make that obvious. In the code attached at the bottom of this post, several parameters of the mountain-generating function allowed me to do this. It also helped that I added a note saying that the mountains were randomly generated when I tweeted it, which is kind of cheating perhaps, but it worked!\r\n\r\n\r\nI've been feeling particularly inspired by this week's #TidyTuesday so I made another plot! This is a simple scatterplot of peak height by year of first ascent, but with a twist: each point is also represented by the peak of a randomly generated mountain! #rstats pic.twitter.com/CqNQjdMYXP\r\n\r\n— June (@yjunechoe) September 25, 2020\r\n\r\nThat wraps up my long rant on how I made my mountains plot! For more context, making the plot took about a half a day worth of work, which isn’t too bad for a first attempt! Definitely looking forward to getting more inspirations like this in the future.\r\nCode\r\nAlso available on github\r\n\r\n\r\nlibrary(tidyverse)\r\n\r\nmake_mountain <- function(x_start, x_end, base = 0, peak_x, peak_y, n_peaks = 3, peaks_ratio = 0.3, side.first = \"left\") {\r\n \r\n midpoint_abs <- (peak_y - base)/2 + base\r\n midpoint_rel <- (peak_y - base)/2\r\n \r\n side_1_n_peaks <- floor(n_peaks/2)\r\n side_2_n_peaks <- n_peaks - side_1_n_peaks -1\r\n \r\n side_1_x <- seq(x_start, peak_x, length.out = side_1_n_peaks * 2 + 2)\r\n side_1_x <- side_1_x[-c(1, length(side_1_x))]\r\n \r\n side_2_x <- seq(peak_x, x_end, length.out = side_2_n_peaks * 2 + 2)\r\n side_2_x <- side_2_x[-c(1, length(side_2_x))]\r\n \r\n side_1_y <- numeric(length(side_1_x))\r\n side_1_y[c(TRUE, FALSE)] <- runif(length(side_1_y)/2, midpoint_abs, midpoint_abs + midpoint_rel * peaks_ratio)\r\n side_1_y[c(FALSE, TRUE)] <- runif(length(side_1_y)/2, midpoint_abs - midpoint_rel * peaks_ratio, midpoint_abs)\r\n \r\n side_2_y <- numeric(length(side_2_x))\r\n side_2_y[c(TRUE, FALSE)] <- runif(length(side_2_y)/2, midpoint_abs, midpoint_abs + midpoint_rel * peaks_ratio)\r\n side_2_y[c(FALSE, TRUE)] <- runif(length(side_2_y)/2, midpoint_abs - midpoint_rel * peaks_ratio, midpoint_abs)\r\n \r\n if (side.first == \"left\") {\r\n side_left <- data.frame(x = side_1_x, y = side_1_y)\r\n side_right <- data.frame(x = side_2_x, y = rev(side_2_y))\r\n } else if (side.first == \"right\") {\r\n side_left <- data.frame(x = side_2_x, y = side_2_y)\r\n side_right <- data.frame(x = side_1_x, y = rev(side_1_y))\r\n } else {\r\n error('Inavlid value for side.first - choose between \"left\" (default) or \"right\"')\r\n }\r\n \r\n polygon_points <- rbind(\r\n data.frame(x = c(x_start, peak_x, x_end), y = c(base, peak_y, base)),\r\n side_left,\r\n side_right\r\n )\r\n \r\n polygon_points[order(polygon_points$x),]\r\n\r\n}\r\n\r\ntuesdata <- tidytuesdayR::tt_load(\"2020-09-22\")\r\n\r\npeaks <- tuesdata$peaks\r\nexpeditions <- tuesdata$expeditions\r\n\r\ntop_peaks <- expeditions %>% \r\n count(peak_name) %>% \r\n slice_max(n, n = 10)\r\n\r\nplot_df <- peaks %>% \r\n filter(peak_name %in% top_peaks$peak_name) %>% \r\n arrange(-height_metres) %>% \r\n mutate(peak_name = fct_inorder(peak_name))\r\n\r\nplot_df %>% \r\n ggplot(aes(x = first_ascent_year, y = height_metres)) +\r\n pmap(list(plot_df$first_ascent_year, plot_df$height_metres, plot_df$peak_name),\r\n ~ geom_polygon(aes(x, y, fill = ..3), alpha = .6,\r\n make_mountain(x_start = 1945, x_end = 1965, base = 5000,\r\n peak_x = ..1, peak_y = ..2, n_peaks = sample(3:5, 1)))\r\n ) +\r\n geom_point(color = \"white\") +\r\n ggrepel::geom_text_repel(aes(label = peak_name),\r\n nudge_y = 100, segment.color = 'white',\r\n family = \"Montserrat\", fontface = \"bold\", color = \"white\") +\r\n guides(fill = guide_none()) +\r\n scale_x_continuous(expand = expansion(0.01, 0)) +\r\n scale_y_continuous(limits = c(5000, 9000), expand = expansion(0.02, 0)) +\r\n theme_minimal(base_family = \"Montserrat\", base_size = 12) +\r\n labs(title = \"TOP 10 Most Attempted Himalayan Peaks\",\r\n x = \"First Ascent Year\", y = \"Peak Height (m)\") +\r\n palettetown::scale_fill_poke(pokemon = \"articuno\") +\r\n theme(\r\n plot.title.position = \"plot\",\r\n plot.title = element_text(size = 24, vjust = 3, family = \"Lora\"),\r\n text = element_text(color = \"white\", face = \"bold\"),\r\n axis.text = element_text(color = \"white\"),\r\n axis.title = element_text(size = 14),\r\n axis.title.x = element_text(vjust = -2),\r\n axis.title.y = element_text(vjust = 4),\r\n plot.margin = margin(1, .5, .7, .7, \"cm\"),\r\n plot.background = element_rect(fill = \"#5C606A\", color = NA),\r\n panel.grid = element_blank(),\r\n )\r\n\r\n\r\n\r\n\r\n\r\n\r\n", "preview": "posts/2020-10-13-designing-guiding-aesthetics/preview.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 8503, "preview_height": 6377 @@ -484,7 +484,7 @@ ], "contents": "\r\n\r\nContents\r\nIntroduction\r\nWhen and why should I use STAT?\r\nInterim Summary #1\r\n\r\nUnderstanding STAT with stat_summary()\r\nInterim Summary #2\r\n\r\nPutting STAT to use\r\n1. Error bars showing 95% confidence interval\r\n2. A color-coded bar plot of medians\r\n3. Pointrange plot with changing size\r\n\r\nConclusion\r\nMain Ideas\r\nSTAT vs. GEOM or STAT and GEOM?\r\n\r\n\r\n\r\n\r\n\r\nUPDATE 10/5/20: This blog post was featured in the rweekly highlights podcast! Thanks to the rweekly team for a flattering review of my tutorial!\r\nIntroduction\r\n(Feel free to skip the intro section if you want to get to the point!)\r\nA powerful concept in the Grammar of Graphics is that variables are mapped onto aesthetics. In {ggplot2}, a class of objects called geom implements this idea. For example, geom_point(mapping = aes(x = mass, y = height)) would give you a plot of points (i.e. a scatter plot), where the x-axis represents the mass variable and the y axis represents the height variable.\r\nBecause geom_*()s1 are so powerful and because aesthetic mappings are easily understandable at an abstract level, you rarely have to think about what happens to the data you feed it. Take this simple histogram for example:\r\n\r\n\r\ndata(\"penguins\", package = \"palmerpenguins\")\r\n\r\nggplot(data = penguins, mapping = aes(x = body_mass_g)) +\r\n geom_histogram()\r\n\r\n\r\n\r\n\r\nWhat’s going on here? You might say that the body_mass_g variable is represented in the x-axis. Sure, that’s not wrong. But a fuller explanation would require you to talk about these extra steps under the hood:\r\nThe variable mapped to x is divided into discrete bins\r\nA count of observations within each bin is calculated\r\nThat new variable is then represented in the y axis\r\nFinally, the provided x variable and the internally calculated y variable is represented by bars that have certain position and height\r\nI don’t mean to say here that you are a total fool if you can’t give a paragraph-long explanation of geom_histogram(). Rather, my intention here is to emphasize that the data-to-aesthetic mapping in GEOM objects is not neutral, although it can often feel very natural, intuitive, and objective (and you should thank the devs for that!). Just think about the many ways in which you can change any of the internal steps above, especially steps 12 and 23, while still having the output look like a histogram.\r\nThis important point rarely crosses our mind, in part because of what we have gotten drilled into our heads when we first started learning ggplot. As beginners we’ve likely experienced the frustration of having all the data we need to plot something, but ggplot just won’t work. You could imagine a beginner today who’s getting frustrated because geom_point(aes(x = mass, y = height)) throws an error with the following data.\r\n\r\n # A tibble: 2 x 4\r\n variable subject1 subject2 subject3\r\n \r\n 1 mass 75 70 55\r\n 2 height 154 172 144\r\n\r\nAnd what would StackOverflow you tell this beginner? You’d probably tell them to put the data in a tidy format4 first.\r\n\r\n # A tibble: 3 x 3\r\n subject mass height\r\n \r\n 1 1 75 154\r\n 2 2 70 172\r\n 3 3 55 144\r\n\r\nNow, that’s something you can tell a beginner for a quick and easy fix. But if you still simply think “the thing that makes ggplot work = tidy data”, it’s important that you unlearn this mantra in order to fully understand the motivation behind stat.\r\nWhen and why should I use STAT?\r\nYou could be using ggplot every day and never even touch any of the two-dozen native stat_*() functions. In fact, because you’ve only used geom_*()s, you may find stat_*()s to be the esoteric and mysterious remnants of the past that only the developers continue to use to maintain law and order in the depths of source code hell.\r\nIf that describes you, you might wonder why you even need to know about all these stat_*() functions.\r\n\r\n\r\n\r\nWell, the main motivation for stat is simply this:\r\n\r\n“Even though the data is tidy it may not represent the values you want to display”5\r\n\r\nThe histogram discussion in the previous section was a good example to this point, but here I’ll introduce another example that I think will hit the point home.\r\nSuppose you have a data simple_data that looks like this:\r\n\r\n\r\nsimple_data <- tibble(group = factor(rep(c(\"A\", \"B\"), each = 15)),\r\n subject = 1:30,\r\n score = c(rnorm(15, 40, 20), rnorm(15, 60, 10)))\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nAnd suppose that you want to draw a bar plot where each bar represents group and the height of the bars corresponds to the mean of score for each group.\r\nIf you’re stuck in the mindset of “the data that I feed in to ggplot() is exactly what gets mapped, so I need to tidy it first and make sure it contains all the aesthetics that each geom needs”, you would need to transform the data before piping it in like this:\r\n\r\n\r\nsimple_data %>%\r\n group_by(group) %>% \r\n summarize(\r\n mean_score = mean(score),\r\n .groups = 'drop' # Remember to ungroup!\r\n ) %>% \r\n ggplot(aes(x = group, y = mean_score)) +\r\n geom_col()\r\n\r\n\r\n\r\n\r\nIt’s a good practice to always ungroup your dataframe before passing into ggplot, as having grouped data can lead to unintended behaviors that are hard to debug.\r\n\r\n\r\n\r\nWhere the data passed in looks like this:\r\n\r\n # A tibble: 2 x 2\r\n group mean_score\r\n \r\n 1 A 43.0\r\n 2 B 57.5\r\n\r\nOk, not really a problem there. But what if we want to add in error bars too? Error bars also plot a summary statistic (the standard error), so we’d need make another summary of the data to pipe into ggplot().\r\nLet’s first plot the error bar by itself, we’re again passing in a transformed data\r\n\r\n\r\nsimple_data %>% \r\n group_by(group) %>% \r\n summarize(\r\n mean_score = mean(score),\r\n se = sqrt(var(score)/length(score)),\r\n .groups = 'drop'\r\n ) %>% \r\n mutate(\r\n lower = mean_score - se,\r\n upper = mean_score + se\r\n ) %>% \r\n ggplot(aes(x = group, y = mean_score, ymin = lower, ymax = upper)) +\r\n geom_errorbar()\r\n\r\n\r\n\r\n\r\nWhere the transformed data looks like this:\r\n\r\n # A tibble: 2 x 5\r\n group mean_score se lower upper\r\n \r\n 1 A 43.0 4.37 38.7 47.4\r\n 2 B 57.5 2.82 54.7 60.4\r\n\r\nOk, now let’s try combining the two. One way to do this is to save the data paseed in for the bar plot and the data passed in for the errorbar plot as two separate variables, and then call each in their respective geoms:\r\n\r\n\r\nsimple_data_bar <- simple_data %>%\r\n group_by(group) %>% \r\n summarize(\r\n mean_score = mean(score),\r\n .groups = 'drop'\r\n )\r\n \r\nsimple_data_errorbar <- simple_data %>% \r\n group_by(group) %>% \r\n summarize(\r\n mean_score = mean(score),\r\n se = sqrt(var(score)/length(score)),\r\n .groups = 'drop'\r\n ) %>% \r\n mutate(\r\n lower = mean_score - se,\r\n upper = mean_score + se\r\n )\r\n\r\nggplot() +\r\n geom_col(\r\n aes(x = group, y = mean_score),\r\n data = simple_data_bar\r\n ) +\r\n geom_errorbar(\r\n aes(x = group, y = mean_score, ymin = lower, ymax = upper),\r\n data = simple_data_errorbar\r\n )\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nYeah… that code is a mouthful. The above approach is not parsimonious because we keep repeating similar processes in different places.6 If you, like myself, don’t like how this looks, then let this be a lesson that this is the consequence of thinking that you must always prepare a tidy data containing values that can be DIRECTLY mapped to geometric objects.\r\nAnd on a more theoretical note, simple_data_bar and simple_data_errorbar aren’t even really “tidy” in the original sense of the term. We need to remind ourselves here that tidy data is about the organization of observations in the data. Under this definition, values like bar height and the top and bottom of whiskers are hardly observations themselves. Rather, they’re abstractions or summaries of the actual observations in our data simple_data which, if you notice, we didn’t even use to make our final plot above!\r\n\r\n\r\n\r\nFigure 1: Tidy data is about the organization of observations.\r\n\r\n\r\n\r\nSo not only is it inefficient to create a transformed dataframe that suits the needs of each geom, this method isn’t even championing the principles of tidy data like we thought.7\r\nWhat we should do instead is to take advantage of the fact that our original data simple_data is the common denominator of simple_data_bar and simple_data_errorbar!\r\nWouldn’t it be nice if you could just pass in the original data containing all observations (simple_data) and have each layer internally transform the data in appropriate ways to suit the needs of the geom for that layer?\r\nOh, so you mean something like this?\r\n\r\n\r\nsimple_data %>% \r\n ggplot(aes(group, score)) +\r\n stat_summary(geom = \"bar\") +\r\n stat_summary(geom = \"errorbar\")\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nInterim Summary #1\r\nIn this section, I built up a tedious walkthrough of making a barplot with error bars using only geom_*()s just to show that two lines of stat_summary() with a single argument can achieve the same without even touching the data through any form of pre-processing.\r\nSo that was a taste of how powerful stat_*()s can be, but how do they work and how can you use them in practice?\r\nUnderstanding STAT with stat_summary()\r\nLet’s analyze stat_summary() as a case study to understand how stat_*()s work more generally. I think that stat_summary() is a good choice because it’s a more primitive version of many other stat_*()s and is likely to be the one that you’d end up using the most for visualizations in data science.\r\nBefore we start, let’s create a toy data to work with. Let’s call this data height_df because it contains data about a group and the height of individuals in that group.\r\n\r\n\r\nheight_df <- tibble(group = \"A\",\r\n height = rnorm(30, 170, 10))\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nWe can visualize the data with a familiar geom, say geom_point():\r\n\r\n\r\nheight_df %>% \r\n ggplot(aes(x = group, y = height)) +\r\n geom_point()\r\n\r\n\r\n\r\n\r\nAs a first step in our investigation, let’s just replace our familiar geom_point() with the scary-looking stat_summary() and see what happens:\r\n\r\n\r\nheight_df %>% \r\n ggplot(aes(x = group, y = height)) +\r\n stat_summary()\r\n\r\n\r\n\r\n\r\nInstead of points, we now see a point and a line through that point. And before you get confused, this is actually one geom, called pointrange, not two separate geoms.8 Now that that’s cleared up, we might ask: what data is being represented by the pointrange?\r\nAnswering this question requires us to zoom out a little bit and ask: what variables does pointrange map as a geom? By looking at the documentation with ?geom_pointrange we can see that geom_pointrange() requires the following aesthetics:\r\nx or y\r\nymin or xmin\r\nymax or xmax\r\nSo now let’s look back at our arguments in aes(). We said that group is mapped to x and that height is mapped to y. But we never said anything about ymin/xmin or ymax/xmax anywhere. So how is stat_summary() drawing a pointrange if we didn’t give it the required aesthetic mappings?\r\nWell, a good guess is that stat_summary() is transforming the data to calculate the necessary values to be mapped to pointrange. Here’s one reason for that guess - I’ve been suppressing message throughout this post but if you run the above code with stat_summary() yourself, you’d actually get this message:\r\n\r\n No summary function supplied, defaulting to `mean_se()`\r\n\r\nHuh, a summary function? That sounds promising. Maybe that’s the key to our mystery!\r\nFirst, we see from the documentation of stat_summary() that this mean_se() thing is the default value for the fun.data argument (we’ll talk more on this later).\r\nNext, let’s call it in the console to see what it is:\r\n\r\n\r\nmean_se\r\n\r\n\r\n function (x, mult = 1) \r\n {\r\n x <- stats::na.omit(x)\r\n se <- mult * sqrt(stats::var(x)/length(x))\r\n mean <- mean(x)\r\n new_data_frame(list(y = mean, ymin = mean - se, ymax = mean + \r\n se), n = 1)\r\n }\r\n \r\n \r\n\r\nOk, so it’s a function that takes some argument x and a second argument mult with the default value 1.\r\nLet’s go over what it does by breaking down the function body line by line:\r\nRemove NA values\r\nCalculate variable se which is the standard error of the values in x using the equation \\(SE = \\sqrt{\\frac{1}{N}\\sum_{i=1}^N(x_i-\\bar{x})^2}\\)\r\nCalculate the variable mean9 which is the mean of x\r\nCreate a new dataframe with one row, with columns y, ymin, and ymax, where y is the mean of x, ymin is one standard error below the mean, and ymax is one standard error above the mean.10\r\nA cool thing about this is that although mean_se() seems to be exclusively used for internal operations, it’s actually available in the global environment from loading {ggplot2}. So let’s pass height_df to mean_se() and see what we get back!\r\n\r\n\r\nmean_se(height_df)\r\n\r\n\r\n Error: Elements must equal the number of rows or 1\r\n\r\n\r\n\r\n\r\nUhhh what?\r\nDo you see what happened just now? This is actually really important: stat_summary() summarizes one dimension of the data.11 mean_se() threw an error when we passed it our whole data because it was expecting just a vector of the variable to be summarized.\r\n\r\nWhenever you’re trying out a new stat_*() function, make sure to check what variables/object types the statistical transformation is being applied to!\r\nOk now that we’ve went over that little mishap, let’s give mean_se() the vector it wants.\r\n\r\n\r\nmean_se(height_df$height)\r\n\r\n\r\n y ymin ymax\r\n 1 171.8 170.3 173.3\r\n\r\nAnd look at that, these look like they’re the same values that were being represented by the mid-point and the end-points of the pointrange plot that we drew with stat_summary() above!\r\nYou know how else we can check that this is the case? With this neat function called layer_data().\r\nWe can pull the data that was used to draw the pointrange by passing our plot object to layer_data() and setting the second argument to 112:\r\n\r\n\r\npointrange_plot <- height_df %>% \r\n ggplot(aes(x = group, y = height)) +\r\n stat_summary()\r\n\r\nlayer_data(pointrange_plot, 1)\r\n\r\n\r\n x group y ymin ymax PANEL flipped_aes colour size linetype shape fill\r\n 1 1 1 171.8 170.3 173.3 1 FALSE black 0.5 1 19 NA\r\n alpha stroke\r\n 1 NA 1\r\n\r\nWould ya look at that! There’s a lot of stuff in there, but it looks like the values for y, ymin, and ymax used for the actual plot match up with the values we calculated with mean_se() above!\r\nWe’ve solved our mystery of how the pointrange was drawn when we didn’t provide all the required mappings!\r\n\r\n\r\n\r\nInterim Summary #2\r\nTo summarize this section (ha!), stat_summary() works in the following order:\r\nThe data that is passed into ggplot() is inherited if one is not provided\r\nThe function passed into the fun.data argument applies transformations to (a part of) that data (defaults to mean_se())\r\nThe result is passed into the geom provided in the geom argument (defaults to pointrange).\r\nIf the data contains all the required mapppings for the geom, the geom will be plotted.\r\nAnd to make things extra clear & to make stat_summary() less mysterious, we can explicitly spell out the two arguments fun.data and geom that we went over in this section.\r\n\r\n\r\nheight_df %>% \r\n ggplot(aes(x = group, y = height)) +\r\n stat_summary(\r\n geom = \"pointrange\",\r\n fun.data = mean_se\r\n )\r\n\r\n\r\n\r\n\r\nYou could also do fun.data = \"mean_se\" but I prefer the unquoted version because it make it extra clear that mean_se is a function, not a special parameter. It also keeps things consistent because if you want to pass in a custom function, they cannot be quoted.\r\n\r\n\r\n\r\nLook, it’s the same plot!\r\nPutting STAT to use\r\nNow we have arrived at the fun part.\r\nHere, I will demonstrate a few ways of modifying stat_summary() to suit particular visualization needs.\r\nFor this section, I will use a modified version of the penguins data that I loaded all the way up in the intro section (I’m just removing NA values here, nothing fancy).\r\n\r\n\r\nmy_penguins <- na.omit(penguins)\r\n\r\n\r\n\r\nAt no point in this section will I be modifying the data being piped into ggplot(). That is the beauty and power of stat.\r\n1. Error bars showing 95% confidence interval\r\nHere, we’re plotting the mean body_mass_g of penguins for each sex, with error bars that show the 95% confidence interval (a range of approx 1.96 standard errors from the mean).\r\n\r\n\r\nmy_penguins %>% \r\n ggplot(aes(sex, body_mass_g)) +\r\n stat_summary(\r\n fun.data = ~mean_se(., mult = 1.96), # Increase `mult` value for bigger interval!\r\n geom = \"errorbar\",\r\n )\r\n\r\n\r\n\r\n\r\nAs of {ggplot2} 3.3.0, you can use {rlang}-style anonymous functions. If you aren’t familiar, ~mean_se(., mult = 1.96) is the same as function(x) {mean_se(x, mult = 1.96)}\r\n\r\n\r\n\r\nThe transformed data used for the errorbar geom inside stat_summary():\r\n\r\n\r\nbind_rows(\r\n mean_se(my_penguins$body_mass_g[my_penguins$sex == \"female\"], mult = 1.96),\r\n mean_se(my_penguins$body_mass_g[my_penguins$sex == \"male\"], mult = 1.96),\r\n)\r\n\r\n\r\n y ymin ymax\r\n 1 3862 3761 3964\r\n 2 4546 4427 4665\r\n\r\n2. A color-coded bar plot of medians\r\nHere, we’re plotting the median bill_length_mm for each penguins species and coloring the groups with median bill_length_mm under 40 in pink.\r\n\r\n\r\ncalc_median_and_color <- function(x, threshold = 40) {\r\n tibble(y = median(x)) %>% \r\n mutate(fill = ifelse(y < threshold, \"pink\", \"grey35\"))\r\n}\r\n\r\nmy_penguins %>% \r\n ggplot(aes(species, bill_length_mm)) +\r\n stat_summary(\r\n fun.data = calc_median_and_color,\r\n geom = \"bar\"\r\n )\r\n\r\n\r\n\r\n\r\nCalculating summaries by group is automatically handled internally when you provide grouping variables (here, the species variable that’s mapped to x), so you don’t have to worry about that in your custom function.\r\n\r\n\r\n\r\nThe transformed data used for the bar geom inside stat_summary():\r\n\r\n\r\ngroup_split(my_penguins, species) %>%\r\n map(~ pull(., bill_length_mm)) %>% \r\n map_dfr(calc_median_and_color)\r\n\r\n\r\n\r\n\r\nThis is a more systematic way of mimicking the internal process of stat_summary(). Run each line incrementally see to what they do!\r\n\r\n # A tibble: 3 x 2\r\n y fill \r\n \r\n 1 38.8 pink \r\n 2 49.6 grey35\r\n 3 47.4 grey35\r\n\r\nNote how you can calculate non-required aesthetics in your custom functions (e.g., fill) and they also be used to make the geom!\r\n3. Pointrange plot with changing size\r\nHere, we’re plotting bill_depth_mm of penguins inhabiting different islands, with the size of each pointrange changing with the number of observations\r\n\r\n\r\nmy_penguins %>% \r\n ggplot(aes(species, bill_depth_mm)) +\r\n stat_summary(\r\n fun.data = function(x) {\r\n \r\n scaled_size <- length(x)/nrow(my_penguins)\r\n \r\n mean_se(x) %>% \r\n mutate(size = scaled_size)\r\n }\r\n )\r\n\r\n\r\n\r\n\r\nIf you don’t want to declare a new function in the environment just for one plot, you can just pass in an anonymous function to the fun.data argument. And of course, if it’s long, you should wrap it in function(x){}.\r\n\r\n\r\n\r\n\r\nLooking back, this is actually a cool plot because you can see how lower number of samples (smaller size) contributes to increased uncertainty (longer range) in the pointrange.\r\nThe transformed data used for the pointrange geom inside stat_summary():\r\n\r\n\r\ngroup_split(my_penguins, species) %>%\r\n map(~ pull(., bill_depth_mm)) %>% \r\n map_dfr(\r\n function(x) {\r\n \r\n scaled_size <- length(x)/nrow(my_penguins)\r\n \r\n mean_se(x) %>% \r\n mutate(size = scaled_size)\r\n }\r\n )\r\n\r\n\r\n y ymin ymax size\r\n 1 18.35 18.25 18.45 0.4384\r\n 2 18.42 18.28 18.56 0.2042\r\n 3 15.00 14.91 15.09 0.3574\r\n\r\nConclusion\r\nMain Ideas\r\nEven though the data is tidy, it may not represent the values you want to display\r\nThe solution is not to transform your already-tidy data so that it contains those values\r\nInstead, you should pass in your original tidy data into ggplot() as is and allow stat_*() functions to apply transformations internally\r\nThese stat_*() functions can be customized for both their geoms and their transformation functions, and works similarly to geom_*() functions in other regards\r\nIf you want to use your own custom function, make sure to check the documentation of that particular stat_*() function to check the variable/data type it requires.\r\nIf you want to use a different geom, make sure that your transformation function calculates all the required aesthetics for that geom\r\nSTAT vs. GEOM or STAT and GEOM?\r\nAlthough I have talked about the limitations of geom_*()s to demonstrate the usefulness of stat_*()s, both have their place. It’s about knowing when to use which; it’s not a question of either-or. In fact, they require each other - just like how stat_summary() had a geom argument, geom_*()s also have a stat argument. At a higher level, stat_*()s and geom_*()s are simply convenient instantiations of the layer() function that builds up the layers of ggplot.\r\nBecause this is important, I’ll wrap up this post with a quote from Hadley explaining this false dichotomy:\r\n\r\nUnfortunately, due to an early design mistake I called these either stat_() or geom_(). A better decision would have been to call them layer_() functions: that’s a more accurate description because every layer involves a stat and a geom.13\r\n\r\n\r\nJust to clarify on notation, I’m using the star symbol * here to say that I’m referencing all the functions that start with geom_ like geom_bar() and geom_point(). This is called the Kleene star and it’s used a lot in regex, if you aren’t familiar.↩︎\r\nYou could have bins of that are not of equal size. Or, you could have bins that bleed into each other to create a rolling window summary.↩︎\r\nYou could calculate the sum of raw values that are in each bin, or calculate proportions instead of counts↩︎\r\nIf you aren’t familiar already, “tidy” is a specific term of art↩︎\r\nThis quote is adapted from Thomas Lin Pedersen’s ggplot2 workshop video↩︎\r\nYes, you can still cut down on the code somewhat, but will it even get as succinct as what I show below with stat_summary()? (9/30 edit) Okay, I was kinda strawmaning, and Hadley(!) has correctly caught me on that. The bar-errorbar plot was not the best choice to demonstrate the benefits of stat_summary(), but I just wanted to get people excited about stat_*()! Sorry for the confusion/irritation!!↩︎\r\nThere’s actually one more argument against transforming data before piping it into ggplot. When you choose the variables to plot, say cyl and mpg in the mtcars dataset, do you call select(cyl, mpg) before piping mtcars into ggplot? No? Well then why would you transform your data beforehand if you can just have that be handled internally instead? It’s the same logic!↩︎\r\nIf you’re still skeptical, save the plot object to a variable like plot and call plot$layers to confirm that geom_pointrange was used to draw the plot.↩︎\r\nI personally don’t agree with this naming choice since mean is also the name of the base function↩︎\r\nThe function new_data_frame() is from {vctrs}. That last line of code in the function body is doing the same thing as data.frame(y = mean, ymin = mean - se, ymax = mean + se), but there’s less room for error the way it’s done in the source code.↩︎\r\nIf you read the documentation, the very first line starts with “stat_summary() operates on unique x or y …” (emphasis mine)↩︎\r\nThis second argument specifies which layer to return. Here, the pointrange layer is the first and only layer in the plot so I actually could have left this argument out.↩︎\r\nEmphasis mine. Source: https://cran.r-project.org/web/packages/ggplot2/vignettes/extending-ggplot2.html↩︎\r\n", "preview": "posts/2020-09-26-demystifying-stat-layers-ggplot2/preview.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 240, "preview_height": 278 @@ -507,7 +507,7 @@ ], "contents": "\r\n\r\nContents\r\nVisualization\r\nThings I learned\r\nThings to improve\r\n\r\nCode\r\n\r\n\r\n\r\n\r\nVisualization\r\n\r\n\r\n\r\nThings I learned\r\nHaving a nice background color for the plot (and generally just working with color)\r\nMargin options of various kinds in theme()\r\nUsing {scales}, pretty_breaks() in particular\r\nUsing {ragg} to draw and save high quality plots\r\nThings to improve\r\nThe subtitle is kinda boring (and the entire plot is a bit underwhelming)\r\nFigure out how to increase spacing between y-axis text and the plot (hjust is relative to each label, so doesn’t work)\r\nCode\r\nAlso available on github\r\n\r\n\r\nlibrary(tidyverse)\r\n\r\n# DATA\r\n\r\ntuesdata <- tidytuesdayR::tt_load(\"2020-09-22\")\r\n\r\nclimb_data <- tuesdata$expeditions %>% \r\n left_join(tuesdata$peaks, by = \"peak_name\") %>% \r\n select(peak = peak_name, year, height = height_metres) %>% \r\n arrange(-height) %>% \r\n mutate(height_group = fct_inorder(case_when(peak == \"Everest\" ~ \"Mt. Everest (8850m)\",\r\n between(height, 8000, 8849) ~ \"> 8000m\",\r\n between(height, 7000, 7999) ~ \"7999m ~ 7000m\",\r\n between(height, 6000, 6999) ~ \"6999m ~ 6000m\",\r\n TRUE ~ \"< 6000m\"))\r\n ) %>% \r\n count(five_years = round(year/5) * 5, height_group) %>% \r\n filter(five_years >= 1920) %>% \r\n complete(five_years, height_group, fill = list(n = 0)) %>% \r\n group_by(five_years) %>% \r\n mutate(prop = n / sum(n)) %>% \r\n ungroup()\r\n\r\n\r\n# PLOT\r\n\r\nmountain_palette <- c(\"#6E86A6\", \"#95A2B3\", \"#5C606A\", \"#44464E\", \"#3D3737\")\r\n\r\nclimb_plot <- climb_data %>% \r\n ggplot(aes(five_years, prop)) +\r\n geom_area(aes(fill = height_group, color = height_group)) +\r\n scale_fill_manual(values = mountain_palette) +\r\n scale_color_manual(values = mountain_palette) +\r\n coord_cartesian(xlim = c(1920, 2020), expand = FALSE) +\r\n scale_x_continuous(breaks = scales::pretty_breaks(11)) +\r\n scale_y_continuous(labels = scales::percent) +\r\n labs(\r\n title = \"Himalayan Peaks Attempted Over Time\",\r\n subtitle = \"Over 1/4th of all expeditions were to Mount Everest\",\r\n x = NULL, y = NULL, fill = NULL, color = NULL,\r\n caption = \"By: @yjunechoe | Source: The Himalayan Database\"\r\n ) +\r\n theme_classic(base_family = \"Futura Hv BT\", base_size = 16) +\r\n theme(\r\n plot.title.position = \"plot\",\r\n plot.title = element_text(size = 28, color = \"white\", family = \"Lora\", face = \"bold\"),\r\n plot.subtitle = element_text(size = 14, color = \"white\", face = \"italic\"),\r\n plot.margin = margin(2, 2.5, 2, 2, 'cm'),\r\n plot.caption = element_text(color = \"white\", family = \"Roboto Mono\", hjust = 1.15, vjust = -13),\r\n legend.position = \"top\",\r\n legend.direction = \"horizontal\",\r\n legend.text = element_text(color = \"white\"),\r\n legend.background = element_rect(fill = NA),\r\n axis.text = element_text(color = \"white\"),\r\n axis.text.y = element_text(vjust = -.1),\r\n axis.text.x = element_text(vjust = -2),\r\n axis.ticks = element_blank(),\r\n axis.line = element_blank(),\r\n panel.background = element_blank(),\r\n plot.background = element_rect(fill = \"#606F84\", color = NA)\r\n )\r\n\r\n\r\n# SAVE\r\n\r\npngfile <- fs::path(getwd(), \"plot.png\")\r\nragg::agg_png(\r\n pngfile,\r\n width = 60,\r\n height = 36,\r\n units = \"cm\",\r\n res = 300,\r\n scaling = 2\r\n)\r\nplot(climb_plot); invisible(dev.off())\r\n\r\n\r\n\r\n\r\n\r\n\r\n", "preview": "posts/2020-09-23-tidytuesday-2020-week-39/preview.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 7086, "preview_height": 4251 @@ -530,7 +530,7 @@ ], "contents": "\r\n\r\nContents\r\nBefore\r\nMy Plan\r\nAfter\r\nPoint-line plot\r\nBar plot\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nThis is the first installment of plot makeover where I take a plot in the wild and make very opinionated modifications to it.\r\nBefore\r\nOur plot-in-the-wild comes from the recent AMLAP 2020 conference, where I presented my thesis research and had the opportunity to talk with and listen to expert psycholinguists around the world. The plot that I’ll be looking at here is Figure 3 from the abstract of a work by E. Matthew Husband and Nikole Patson (Husband and Patson 2020).\r\n\r\n\r\n\r\nFigure 1: Plot from Husband and Patson (2020)\r\n\r\n\r\n\r\nWhat we have is 6 pairs of barplots with error bars, laid out in a 2-by-3 grid. The total of 12 bars are grouped at three levels which are mapped in the following way:\r\nFirst level is mapped to the grid column.\r\nSecond level is mapped to the grid row.\r\nThird level is mapped to the x-axis.\r\nTo get a better sense of what they did, and to make data for the plot makeover, I have recreated the original plot below:1\r\n1. Data\r\n\r\n\r\nlibrary(tidyverse)\r\ndf <- crossing(level_1 = fct_inorder(c(\"Within\", \"Between\")),\r\n level_2 = fct_inorder(c(\"Some\", \"Number\", \"Or\")),\r\n level_3 = factor(c(\"Strong\", \"Weak\")))\r\ndf$barheight <- c(.63, .35, .72, .55, .61, .15, .60, .55, .52, .63, .17, .16)\r\n\r\ndf\r\n\r\n\r\n\r\n\r\nThe numbers for barheight were eyeballed from looking at the original plot, of course.\r\n\r\n # A tibble: 12 x 4\r\n level_1 level_2 level_3 barheight\r\n \r\n 1 Within Some Strong 0.63\r\n 2 Within Some Weak 0.35\r\n 3 Within Number Strong 0.72\r\n 4 Within Number Weak 0.55\r\n 5 Within Or Strong 0.61\r\n 6 Within Or Weak 0.15\r\n 7 Between Some Strong 0.6 \r\n 8 Between Some Weak 0.55\r\n 9 Between Number Strong 0.52\r\n 10 Between Number Weak 0.63\r\n 11 Between Or Strong 0.17\r\n 12 Between Or Weak 0.16\r\n\r\n2. Plot\r\n\r\n\r\ndf %>% \r\n ggplot(aes(level_3, barheight)) +\r\n geom_col(\r\n aes(fill = level_3),\r\n show.legend = FALSE\r\n ) +\r\n geom_errorbar(\r\n aes(ymin = barheight - .05, ymax = barheight + .05),\r\n width = .1) +\r\n facet_grid(level_2 ~ level_1) +\r\n theme_bw() +\r\n scale_fill_manual(values = c('grey40', 'grey80')) +\r\n ylim(0, 1) +\r\n labs(\r\n y = \"Proportion of Strong Responses\",\r\n x = \"Prime Type\") +\r\n theme_bw()\r\n\r\n\r\n\r\n\r\n\r\nThe original plot for comparison:\r\n\r\n\r\n\r\nMy Plan\r\nMajor Changes:\r\nFlatten the grid in some way so that everything is laid out left-to-right and you can make comparisons horizontally.\r\nCap the y axis to make it clear that the values (proportions) can only lie between 0 and 1.\r\nMinor Changes:\r\nRemove grid lines\r\nIncrease space between axis and axis titles.\r\nRemove boxes around strip labels\r\nMake strip (facet) labels larger and more readable.\r\nIncrease letter spacing (probably by changing font)\r\nAfter\r\nI actually couldn’t settle on one final product2 so here are two plots that incorporate the changes that I wanted to make. I think that both look nice and you may prefer one style over the other depending on what relationships/comparisons you want your graph to emphasize.\r\nPoint-line plot\r\nI got a suggestion that the groups could additionally be mapped to shape for greater clarity, so I’ve incorporated that change.3\r\n\r\n\r\n\r\n\r\n\r\ndodge <- position_dodge(width = .5)\r\n\r\ndf %>% \r\n mutate(level_3 = as.numeric(level_3)) %>% \r\n ggplot(aes(x = level_3, y = barheight, group = level_1)) +\r\n geom_errorbar(\r\n aes(ymin = barheight - .05, ymax = barheight + .05),\r\n width = .2,\r\n position = dodge\r\n ) +\r\n geom_line(\r\n aes(linetype = level_1),\r\n position = dodge,\r\n show.legend = FALSE\r\n ) +\r\n geom_point(\r\n aes(shape = level_1, fill = level_1),\r\n size = 1.5,\r\n stroke = .6,\r\n position = dodge\r\n ) + \r\n scale_fill_manual(values = c(\"black\", \"white\")) +\r\n scale_shape_manual(values = c(21, 24)) +\r\n facet_wrap(~ level_2) +\r\n scale_x_continuous(\r\n breaks = 1:2,\r\n labels = levels(df$level_3),\r\n expand = expansion(.2),\r\n ) +\r\n scale_y_continuous(\r\n limits = c(0, 1),\r\n expand = expansion(c(0, .1))\r\n ) +\r\n lemon::coord_capped_cart(left = \"both\") +\r\n guides(\r\n fill = guide_none(),\r\n shape = guide_legend(\r\n title = NULL,\r\n direction = \"horizontal\",\r\n label.theme = element_text(size = 10, family = \"Montserrat\"),\r\n override.aes = list(fill = c(\"black\", \"white\"))\r\n )\r\n ) +\r\n labs(\r\n y = \"Strong Responses\",\r\n x = \"Prime Type\",\r\n linetype = \"Category\"\r\n ) +\r\n ggthemes::theme_clean(base_size = 14) +\r\n theme(\r\n text = element_text(family = \"Montserrat\"),\r\n legend.position = c(.18, .87),\r\n legend.background = element_rect(color = NA, fill = NA),\r\n strip.text = element_text(size = 13),\r\n plot.margin = margin(5, 5, 5, 5, 'mm'),\r\n axis.title.x = element_text(vjust = -3),\r\n axis.title.y = element_text(vjust = 5),\r\n plot.background = element_blank(),\r\n panel.grid.major.y = element_blank()\r\n )\r\n\r\n\r\n\r\nBar plot\r\n\r\n\r\n\r\n\r\n\r\ndodge <- position_dodge(width = .5)\r\n\r\ndf %>% \r\n mutate(level_3 = as.numeric(level_3)) %>% \r\n ggplot(aes(x = level_3, y = barheight, group = level_1)) +\r\n geom_col(position = dodge, width = .5, color = 'white', aes(fill = level_1)) +\r\n scale_fill_manual(values = c(\"grey30\", \"grey60\")) +\r\n geom_errorbar(\r\n aes(ymin = barheight - .05, ymax = barheight + .05),\r\n width = .2,\r\n position = dodge\r\n ) +\r\n facet_wrap(~ level_2) +\r\n scale_x_continuous(\r\n breaks = 1:2,\r\n labels = levels(df$level_3),\r\n expand = expansion(.2),\r\n ) +\r\n ylim(0, 1) +\r\n lemon::coord_capped_cart(left = \"both\") +\r\n labs(\r\n y = \"Strong Responses\",\r\n x = \"Prime Type\",\r\n fill = NULL\r\n ) +\r\n ggthemes::theme_clean(base_size=14) +\r\n theme(\r\n text = element_text(family = \"Montserrat\"),\r\n legend.text = element_text(size = 10),\r\n legend.key.size = unit(5, 'mm'),\r\n legend.direction = \"horizontal\",\r\n legend.position = c(.17, .85),\r\n legend.background = element_blank(),\r\n strip.text = element_text(size = 14),\r\n axis.ticks.x = element_blank(),\r\n axis.title.x = element_text(vjust = -3),\r\n axis.title.y = element_text(vjust = 5),\r\n panel.grid.major.y = element_blank(),\r\n plot.background = element_blank(),\r\n plot.margin = margin(5, 5, 5, 5, 'mm')\r\n )\r\n\r\n\r\n\r\n\r\nIn this second version, I removed guides() and distributed its arguments across labs() and theme(). I kinda like this layout of having a fat theme(). It’s also not too hard to read if you group and sort the arguments.\r\n\r\n\r\n\r\nHusband, E. Matthew, and Nikole Patson. 2020. Priming of Implicatures Within and Between Categories: The Case of or. AMLaP2020. https://amlap2020.github.io/a/272.pdf.\r\n\r\n\r\nBut note that this is likely not how the original plot was generated: the authors were likely feeding ggplot2 with the raw data (involving 1s and 0s in this case), but here I am just grabbing the summary statistic that was mapped to the bar aesthetic (hence my decision to name the y variable barheight).↩︎\r\nI ran the first plot by a friend who has a degree in design, and she recommended several changes that eventually ended up being the second plot. Some major pointers were removing border lines from the legend, removing x-axis tick marks, and applying color/shade.↩︎\r\nThe plot used to look like this: ↩︎\r\n", "preview": "posts/2020-09-20-plot-makeover-1/plot-makeover-1_files/figure-html5/after_bar_plot-1.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 1248, "preview_height": 768 @@ -553,7 +553,7 @@ ], "contents": "\r\n\r\nContents\r\nVisualization\r\nThings I learned\r\nThings to improve\r\n\r\nCode\r\n\r\n\r\n\r\n\r\nVisualization\r\nI had difficulty embedding an HTML table without overriding its styles so the table is also available on its own here.\r\n\r\n\r\n\r\nThings I learned\r\nBasics of working with tables and {gt}1\r\nPutting different font styles together in a nice way\r\nThings to improve\r\nSummarize the data a bit more so the table isn’t huge\r\nAdd conditional formatting (learn how tab_style() and tab_options() work)\r\nFigure out how to save {gt} tables into pdf or png\r\nFigure out how to include an html table without overriding css styles\r\nCode\r\nAlso available on github\r\n\r\n\r\nlibrary(tidyverse)\r\nlibrary(gt)\r\n\r\nkids <- tidytuesdayR::tt_load(\"2020-09-15\")$kids\r\n\r\n\r\n# TABLE DATA\r\n\r\nstate_regions <- setNames(c(as.character(state.region), \"Northeast\"), c(state.name, \"District of Columbia\"))\r\n\r\nkids_tbl_data <- kids %>% \r\n filter(variable == \"PK12ed\") %>%\r\n mutate(region = state_regions[state]) %>% \r\n select(region, state, year, inf_adj_perchild) %>% \r\n pivot_wider(names_from = year, values_from = inf_adj_perchild) %>%\r\n mutate(Trend = NA) \r\n\r\n\r\n# SPARKLINE\r\n\r\nplotter <- function(data){\r\n data %>% \r\n tibble(\r\n year = 1997:2016,\r\n value = data\r\n ) %>% \r\n ggplot(aes(year, value)) +\r\n geom_line(size = 10, show.legend = FALSE) +\r\n theme_void() +\r\n scale_y_continuous(expand = c(0, 0))\r\n}\r\n\r\nspark_plots <- kids_tbl_data %>% \r\n group_split(state) %>% \r\n map(~ flatten_dbl(select(.x, where(is.numeric)))) %>% \r\n map(plotter)\r\n\r\n\r\n# TABLE\r\n\r\nkids_tbl <- kids_tbl_data %>% \r\n gt(\r\n groupname_col = 'region',\r\n rowname_col = 'state'\r\n ) %>% \r\n fmt_number(\r\n columns = 3:22\r\n ) %>% \r\n summary_rows(\r\n groups = TRUE,\r\n columns = 3:22,\r\n fns = list(Average = ~mean(.))\r\n ) %>% \r\n text_transform(\r\n locations = cells_body(vars(Trend)),\r\n fn = function(x){\r\n map(spark_plots, ggplot_image, height = px(15), aspect_ratio = 4)\r\n }\r\n ) %>%\r\n tab_header(\r\n title = md(\"**State-by-State Spending on Primary and Secondary Education over 20 years**\"),\r\n subtitle = md(\"*$1000s per child adjusted for inflation*\")\r\n ) %>% \r\n tab_source_note(\r\n md(\"**By**: @yjunechoe
\r\n **Inspiration**: @thomas_mock
\r\n **Data**: Urban Institute | {tidykids} by Joshua Rosenberg\")\r\n ) %>% \r\n tab_style(\r\n style = list(\r\n cell_text(font = \"Futura MdCn BT\")\r\n ),\r\n locations = list(\r\n cells_title(groups = \"title\")\r\n )\r\n ) %>%\r\n tab_options(\r\n table.width = 50,\r\n heading.align = \"left\",\r\n heading.title.font.size = 72,\r\n heading.subtitle.font.size = 32,\r\n row_group.font.size = 42,\r\n row_group.font.weight = 'bold',\r\n row_group.border.top.color = \"black\",\r\n row_group.border.bottom.color = \"black\",\r\n table.border.top.color = \"black\",\r\n heading.border.bottom.color = \"white\",\r\n heading.border.bottom.width = px(10),\r\n table.font.names = \"Roboto\",\r\n column_labels.font.size = 20,\r\n column_labels.border.bottom.color = \"black\",\r\n column_labels.border.bottom.width= px(3),\r\n summary_row.border.color = \"black\", \r\n summary_row.background.color = \"#c0c5ce\",\r\n table.border.bottom.color = \"black\"\r\n )\r\n\r\n\r\n\r\n\r\nMany thanks to Thomas Mock’s blog posts on {gt} (1) (2), a well as to the developers of {gt} for what I think is one of the most comprehensive vignette I’ve ever seen for a package!↩︎\r\n", "preview": "posts/2020-09-14-tidytuesday-2020-week-38/preview.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 1703, "preview_height": 2203 @@ -575,7 +575,7 @@ ], "contents": "\r\n\r\n\r\n\r\nIn a {reactable} table, you can have a row expand to reveal more details by supplying the details argument with a function returning an image, raw html, another reactable table, etc. There are many examples of this in the package vignette, and they give you a good sense of just how flexible and powerful this feature is.\r\nMy first reaction to this was that it seemed like just about anything that can be displayed on a web page can be embedded in the expandable details. So what about something very unusual like… videos? Can {reactable} handle it? Are there potential usecases of this?\r\nAnnotated #tidytuesday screencasts\r\nWhile entertaining this idea, I remembered coming across a tweet by Alex Cookson with a link to a very detailed spreadsheet containing timestamped notes of David Robinson’s live #tidytuesday screencasts.\r\n\r\n\r\nAnyone other #rstats people find @drob's #TidyTuesday screencasts useful?I made a spreadsheet with timestamps for hundreds of specific tasks he does: https://t.co/HvJbLk1chdUseful if, like me, you keep going back and ask, “Where in the video did he do [this thing] again?”\r\n\r\n— Alex Cookson (@alexcookson) January 13, 2020\r\n\r\nSo I turned the spreadsheet into a {reactable} table with rows that can expand to reveal a Youtube video at the timestamp. I actually think this makes a really cool use case - it’s easier here than in Google Spreadsheet to navigate around the table with pagination and search bar, and you don’t need to constantly open and close Youtube videos in new windows (in fact, you can keep multiple videos open across rows here!).\r\nTry it out for yourself!\r\n\r\n\r\n\r\nCode\r\n\r\n\r\nlibrary(tidyverse)\r\nlibrary(htmltools)\r\nlibrary(reactable)\r\n\r\n# David Robinson's (@drob) #tidytuesday screencast annotations, made by Alex Cookson (@alexcookson)\r\nscreencasts <-\r\n gsheet::gsheet2tbl(\"docs.google.com/spreadsheets/d/1pjj_G9ncJZPGTYPkR1BYwzA6bhJoeTfY2fJeGKSbOKM\") %>% \r\n select(Screencast, Date, Timestamp = `Timestamp (sec)`, Link:Functions) %>% \r\n mutate(Link = str_extract(Link, \"(?<=v=).*(?=&)\"))\r\n\r\n\r\n###############\r\n## The Table ##\r\n###############\r\n\r\nreactable(screencasts,\r\n \r\n # Function to embed Youtube Video \r\n details = function(index){\r\n \r\n # Grab video info from hidden columns\r\n link <- screencasts$Link[index]\r\n time <- screencasts$Timestamp[index]\r\n \r\n # Div container to add grey padding around the video\r\n tags$div(style = \"text-align:center; padding:10px; background:grey\",\r\n \r\n # The actual video\r\n tags$iframe(\r\n height = \"640\", width = \"640\", allow = \"fullscreen\",\r\n src = glue::glue(\"https://www.youtube.com/embed/{link}?start={time}&autoplay=1\")\r\n )\r\n \r\n )\r\n \r\n },\r\n \r\n # Column options\r\n columns = list(\r\n Link = colDef(show = F),\r\n Timestamp = colDef(show = F),\r\n Description = colDef(width = 500)\r\n ),\r\n \r\n # Some theme options\r\n searchable = TRUE,\r\n bordered = TRUE,\r\n fullWidth = TRUE,\r\n theme = reactableTheme(\r\n style = list(fontSize = '14px'),\r\n searchInputStyle = list(width = \"100%\")\r\n ),\r\n \r\n)\r\n\r\n\r\n\r\n\r\n\r\n\r\n", "preview": "posts/2020-09-12-videos-in-reactable/preview.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 808, "preview_height": 617 @@ -597,7 +597,7 @@ ], "contents": "\r\n\r\n\r\n\r\n \r\nFor the last few weeks I’ve been reading about and experimenting with fonts for data visualization in my spare time.1 Out of that, I have found a couple fonts that I really like and wanted to do a small showcase of them here.\r\nThese fonts are all free and available for download at Google Fonts.2 Note that not only are they all large font families that come with many different styles, you can also adjust various theme settings like lineheight in {ggplot2}, so what I’m showing here isn’t the full extent of what you can make with these fonts.\r\n \r\n\r\nArial\r\nIt’s the default. It’s dull. It’s just here for comparison.\r\n\r\nMontserrat\r\nSimple design that can handle long lines of text. I like it for minimal plots.\r\n\r\nRoboto Mono\r\nMonospaced member of the Roboto family. Very easy to read.\r\n\r\nFutura Bk BT\r\nA slender and bold member of the Futura family. Looks nice even in larger sizes.\r\n\r\nBarlow\r\nAlso a slender font like Futura, but this has nicer ’j’s\r\n\r\nAdelle\r\nA serif font that doesn’t go overboard. I use it a lot for short paragraphs.\r\n\r\nMerriweather\r\nSimilar to Adelle, but has a bit more pronounced hooks\r\n\r\n\r\n \r\nMisc.\r\nWhy spend 3 minutes copy-pasting code when you can spend an hour automatizing it?\r\nThis was my first time using dynamic Rmarkdown reporting. The plots above and the text descriptions that went with them were generated in a for loop, which I learned about here.\r\nHere is the single chunk of code that made this possible:\r\n\r\n\r\nlibrary(ggplot2)\r\nlibrary(extrafont)\r\nknitr::opts_chunk$set(fig.width = 7, dpi = 600)\r\n\r\ntheme_set(theme_classic(base_size = 14))\r\n\r\nfavorites <- c(\r\n \"Arial\" = \"It's the default. It's dull. It's just here for comparison.\",\r\n \"Montserrat\" = \"Simple design that can handle long lines of text. I like it for minimal plots.\",\r\n \"Roboto Mono\" = \"Monospaced member of the Roboto family. Very easy to read.\",\r\n \"Futura Bk BT\" = \"A slender and bold member of the Futura family. Looks nice even in larger sizes.\",\r\n \"Barlow\" = \"Also a slender font like Futura, but this has nicer 'j's\",\r\n \"Adelle\" = \"A serif font that doesn't go overboard. I use it a lot for short paragraphs.\",\r\n \"Merriweather\" = \"Similar to Adelle, but has a bit more pronounced hooks\"\r\n)\r\n\r\n\r\nfor (font in names(favorites)) {\r\n cat(\"\\n\\n## \", font, \"\\n\\n\")\r\n cat(\"\", favorites[font], \"\\n\\n\")\r\n plot <- qplot(data = mtcars, mpg, disp, color = factor(cyl)) +\r\n annotate(\"text\", 28, 400, label = paste(letters, collapse = ''), family = font) +\r\n geom_curve(aes(x = 28, y = 380, xend = 22, yend = 260),\r\n color = 'black', curvature = -.3, arrow = arrow(), show.legend = FALSE) +\r\n labs(title = \"This is an interesting plot title\",\r\n subtitle = \"Here's the subtitle 1234567890\",\r\n caption = \"This is the plot caption\") +\r\n theme(text = element_text(family = font),\r\n plot.title.position = 'plot')\r\n print(plot)\r\n}\r\n\r\n\r\n\r\n\r\nI reference a lot this great collection of fonts used in profesional visualization here.↩︎\r\nFor how to import local fonts into R to use for plotting, check out {extrafont} and/or {showtext}.↩︎\r\n", "preview": "posts/2020-09-06-fonts-for-graphs/preview.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 1144, "preview_height": 675 @@ -620,7 +620,7 @@ ], "contents": "\r\n\r\nContents\r\nVisualization\r\nThings I learned\r\nThings to improve\r\n\r\nCode\r\n\r\n\r\n\r\n\r\nVisualization\r\n \r\n\r\n\r\n\r\nThings I learned\r\nReally basic image manipulation with {magick}\r\nThat you can get away with not doing the data part of data visualization for TidyTuesday\r\nThings to improve\r\nShould’ve picked a better color to represent air for Aang.\r\nNot sure why it looks like white grid lines are there. All my attempts at getting rid of them failed, so I’ve just concluded that they’re likely an optical illusion.\r\nCode\r\nAlso available on github\r\n\r\n\r\nlibrary(tidyverse)\r\nlibrary(magick)\r\nlibrary(gganimate)\r\n\r\naang <- image_read(\"https://vignette.wikia.nocookie.net/avatar/images/a/ae/Aang_at_Jasmine_Dragon.png\")\r\niroh <- image_read(\"https://vignette.wikia.nocookie.net/avatar/images/c/c1/Iroh_smiling.png\")\r\nsokka <- image_read(\"https://vignette.wikia.nocookie.net/avatar/images/c/cc/Sokka.png\")\r\ntoph <- image_read(\"https://vignette.wikia.nocookie.net/avatar/images/4/46/Toph_Beifong.png\")\r\n\r\n# Script by Georgios Karamanis adapted and wrapped into a function\r\n# - from https://github.com/gkaramanis/aRt/blob/master/split-bar/points-portraits.R\r\nimg_to_df <- function(img, index) {\r\n \r\n img <- image_convert(img, colorspace = \"gray\")\r\n \r\n img_w <- image_info(img)$width\r\n img_h <- image_info(img)$height\r\n \r\n if (img_w >= img_h) {\r\n img <- image_resize(img, \"120\")\r\n } else {\r\n img <- image_resize(img, (\"x120\"))\r\n }\r\n \r\n img_array <- drop(as.integer(img[[1]]))\r\n rownames(img_array) <- 1:nrow(img_array)\r\n colnames(img_array) <- 1:ncol(img_array)\r\n \r\n as.data.frame.table(img_array) %>% \r\n `colnames<-`(c(\"y\", \"x\", \"b\")) %>% \r\n mutate(\r\n across(everything(), as.numeric),\r\n bf = 1 - b / 255\r\n ) %>% \r\n mutate(character_id = index)\r\n}\r\n\r\nplot_data <- imap_dfr(list(aang, iroh, sokka, toph), ~img_to_df(.x, .y)) %>% \r\n group_by(character_id) %>% \r\n mutate(point_id = 1:n()) %>% \r\n ungroup() %>% \r\n mutate(across(contains(\"id\"), as.factor))\r\n\r\nanim <- ggplot(plot_data) +\r\n geom_point(aes(x = x, y = y, size = bf, group = point_id, color = character_id),\r\n shape = 16, show.legend = FALSE) +\r\n scale_y_reverse() +\r\n scale_size_continuous(range = c(0, 4)) +\r\n scale_color_manual(values = c(\"#C0EDFF\", \"#B33000\", \"#206BA4\", \"#8B4513\")) +\r\n coord_fixed(expand = FALSE) +\r\n theme_void() +\r\n theme(panel.grid = element_blank()) +\r\n transition_states(id)\r\n\r\nanimate(anim, width = 12, height = 9, units = \"in\", res = 120)\r\n\r\n\r\n\r\n\r\n\r\n\r\n", "preview": "posts/2020-08-17-tidytuesday-2020-week-33/preview.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 1289, "preview_height": 964 @@ -643,7 +643,7 @@ ], "contents": "\r\n\r\nContents\r\n1. rename() inside select()\r\n2. rename() inside count()\r\n3. mutate() inside count()\r\n4. transmute() + select()\r\n5. ungroup() inside summarize()\r\n6. arrange() + other features inside slice()\r\n7. count and sum by group with add_count()\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nUsing a cleaned up version of penguins data from {palmerpenguins}:\r\n\r\n\r\ndata(\"penguins\", package = \"palmerpenguins\")\r\n\r\npenguins <- na.omit(penguins)\r\n\r\n\r\n\r\n\r\n\r\n\r\n1. rename() inside select()\r\nYou can rename a column inside select() by assigning a new name on the left hand side:\r\n\r\n\r\n##### Long Form #####\r\n# penguins %>% \r\n# select(species, island) %>% \r\n# rename(penguin_species = species)\r\n\r\npenguins %>% \r\n select(penguin_species = species,\r\n island)\r\n\r\n\r\n # A tibble: 333 x 2\r\n penguin_species island \r\n \r\n 1 Adelie Torgersen\r\n 2 Adelie Torgersen\r\n 3 Adelie Torgersen\r\n 4 Adelie Torgersen\r\n 5 Adelie Torgersen\r\n 6 Adelie Torgersen\r\n 7 Adelie Torgersen\r\n 8 Adelie Torgersen\r\n 9 Adelie Torgersen\r\n 10 Adelie Torgersen\r\n # ... with 323 more rows\r\n\r\nThis also works with {tidyselect} helpers like starts_with(), ends_with(), contains(), and matches():\r\n\r\n\r\n##### Long Form #####\r\n# penguins %>% \r\n# select(species, island) %>% \r\n# rename(penguin_species = species,\r\n# weight = body_weight_g)\r\n\r\npenguins %>% \r\n select(penguin_species = species,\r\n island,\r\n weight = contains(\"mass\"))\r\n\r\n\r\n # A tibble: 333 x 3\r\n penguin_species island weight\r\n \r\n 1 Adelie Torgersen 3750\r\n 2 Adelie Torgersen 3800\r\n 3 Adelie Torgersen 3250\r\n 4 Adelie Torgersen 3450\r\n 5 Adelie Torgersen 3650\r\n 6 Adelie Torgersen 3625\r\n 7 Adelie Torgersen 4675\r\n 8 Adelie Torgersen 3200\r\n 9 Adelie Torgersen 3800\r\n 10 Adelie Torgersen 4400\r\n # ... with 323 more rows\r\n\r\n2. rename() inside count()\r\nYou can rename the new column of counts (n by default) using the name argument:\r\n\r\n\r\n##### Long Form #####\r\n# penguins %>% \r\n# count(species) %>% \r\n# rename(total = n)\r\n\r\npenguins %>% \r\n count(species, name = \"total\")\r\n\r\n\r\n # A tibble: 3 x 2\r\n species total\r\n \r\n 1 Adelie 146\r\n 2 Chinstrap 68\r\n 3 Gentoo 119\r\n\r\nYou can also rename the column(s) that are selected for counting in the same way as shown in the select() examples above:\r\n\r\n\r\n##### Long Form #####\r\n# penguins %>% \r\n# count(species) %>% \r\n# rename(total = n,\r\n# penguin_species = species)\r\n\r\npenguins %>% \r\n count(penguin_species = species, name = \"total\")\r\n\r\n\r\n # A tibble: 3 x 2\r\n penguin_species total\r\n \r\n 1 Adelie 146\r\n 2 Chinstrap 68\r\n 3 Gentoo 119\r\n\r\nNote that the new name passed into the name argument must be quoted, but the new name for selected column needs not to be unquoted:\r\n\r\n\r\nidentical(\r\n # Method 1: new column name UNQUOTED\r\n penguins %>% \r\n count(penguin_species = species, name = \"total\"),\r\n # Method 2: new column name QUOTED\r\n penguins %>% \r\n count(\"penguin_species\" = species, name = \"total\") \r\n)\r\n\r\n\r\n [1] TRUE\r\n\r\nI prefer to unquote the new column names to keep it consistent with the recommended style for rename()\r\nThis feature of select() may seem weird and hackish (and I guess it sort of is in this demonstration) but it’s explicitly documented here if you want to read more on it.\r\n3. mutate() inside count()\r\nYou can also create a new column to count by inside count(). This works very similarly to the above, but I think it’s worth its own mention.\r\nIt’s pretty simple - you just do what you’d do for mutate() inside count():\r\n\r\n\r\n##### Long Form #####\r\n# penguins %>% \r\n# mutate(long_beak = bill_length_mm > 50) %>% \r\n# count(long_beak)\r\n\r\npenguins %>% \r\n count(long_beak = bill_length_mm > 50)\r\n\r\n\r\n # A tibble: 2 x 2\r\n long_beak n\r\n \r\n 1 FALSE 281\r\n 2 TRUE 52\r\n\r\nAnd of course, this also works when specifying multiple variables to count by:\r\n\r\n\r\n##### Long Form #####\r\n# penguins %>% \r\n# mutate(long_beak = bill_length_mm > 50,\r\n# is_adelie = species == \"Adelie\") %>% \r\n# count(is_adelie, long_beak)\r\n\r\npenguins %>% \r\n count(long_beak = bill_length_mm > 50,\r\n is_adelie = species == \"Adelie\")\r\n\r\n\r\n # A tibble: 3 x 3\r\n long_beak is_adelie n\r\n \r\n 1 FALSE FALSE 135\r\n 2 FALSE TRUE 146\r\n 3 TRUE FALSE 52\r\n\r\n4. transmute() + select()\r\ntransmute() is a function that mutates columns and returns only those columns:\r\n\r\n\r\n##### Long Form #####\r\n# penguins %>% \r\n# mutate(body_mass_kg = body_mass_g/1000) %>% \r\n# select(body_mass_kg)\r\n\r\npenguins %>% \r\n transmute(body_mass_kg = body_mass_g/1000)\r\n\r\n\r\n # A tibble: 333 x 1\r\n body_mass_kg\r\n \r\n 1 3.75\r\n 2 3.8 \r\n 3 3.25\r\n 4 3.45\r\n 5 3.65\r\n 6 3.62\r\n 7 4.68\r\n 8 3.2 \r\n 9 3.8 \r\n 10 4.4 \r\n # ... with 323 more rows\r\n\r\nI’ve rarely used transmute() in the past because I thought it could only return modified columns, which would be very limiting (like in the above example, what good is a single column of penguin body mass in kilograms?)\r\nBut actually you can just name the columns you want to include in transmute() like you would in select() to carry over columns that you aren’t modifying. And of course, you can “rename” them as you do it1:\r\n\r\n\r\n##### Long Form #####\r\n# penguins %>% \r\n# mutate(body_mass_kg = body_mass_g/1000) %>% \r\n# select(species, island, body_mass_kg) %>% \r\n# rename(penguin_species = species)\r\n\r\npenguins %>% \r\n transmute(penguin_species = species,\r\n island,\r\n body_mass_kg = body_mass_g/1000)\r\n\r\n\r\n # A tibble: 333 x 3\r\n penguin_species island body_mass_kg\r\n \r\n 1 Adelie Torgersen 3.75\r\n 2 Adelie Torgersen 3.8 \r\n 3 Adelie Torgersen 3.25\r\n 4 Adelie Torgersen 3.45\r\n 5 Adelie Torgersen 3.65\r\n 6 Adelie Torgersen 3.62\r\n 7 Adelie Torgersen 4.68\r\n 8 Adelie Torgersen 3.2 \r\n 9 Adelie Torgersen 3.8 \r\n 10 Adelie Torgersen 4.4 \r\n # ... with 323 more rows\r\n\r\n5. ungroup() inside summarize()\r\nI always found using ungroup() after summarize() to be extremely ugly, but I found myself using it a lot to remove left-over groupings after a summarize call:\r\n\r\n\r\npenguins %>% \r\n group_by(island, species) %>% \r\n summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>% \r\n ungroup()\r\n\r\n\r\n # A tibble: 5 x 3\r\n island species mean_mass\r\n \r\n 1 Biscoe Adelie 3710.\r\n 2 Biscoe Gentoo 5092.\r\n 3 Dream Adelie 3701.\r\n 4 Dream Chinstrap 3733.\r\n 5 Torgersen Adelie 3709.\r\n\r\n… because summarize() only drops the last grouping variable by defaut, meaning that the output is still grouped by the island variable if ungroup() isn’t called:\r\n\r\n\r\n# Without ungroup()\r\npenguins %>% \r\n group_by(island, species) %>% \r\n summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>% \r\n group_vars()\r\n\r\n\r\n [1] \"island\"\r\n\r\n# With ungroup()\r\npenguins %>% \r\n group_by(island, species) %>% \r\n summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>% \r\n ungroup() %>% \r\n group_vars()\r\n\r\n\r\n character(0)\r\n\r\nSince {dplyr} 1.0.0, you can simply set the .groups argument inside summarize() to 'drop' to achieve the same:\r\n\r\n\r\npenguins %>% \r\n group_by(island, species) %>% \r\n summarize(mean_mass = mean(body_mass_g, na.rm = TRUE), .groups = 'drop')\r\n\r\n\r\n # A tibble: 5 x 3\r\n island species mean_mass\r\n \r\n 1 Biscoe Adelie 3710.\r\n 2 Biscoe Gentoo 5092.\r\n 3 Dream Adelie 3701.\r\n 4 Dream Chinstrap 3733.\r\n 5 Torgersen Adelie 3709.\r\n\r\nBut ungroup() still remains relevant as you can now selectively remove grouping variables in {dplyr} 1.0.0.\r\n6. arrange() + other features inside slice()\r\nIn past versions of {dplyr}, if you wanted to grab the top n rows sorted by a column, you’d use top_n(), which provides a simpler way of doing slice() + arrange():\r\n\r\n\r\n##### Long Form #####\r\n# penguins %>% \r\n# arrange(desc(body_mass_g)) %>% \r\n# slice(1:5)\r\n\r\npenguins %>% \r\n top_n(5, wt = body_mass_g)\r\n\r\n\r\n # A tibble: 6 x 8\r\n species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex \r\n \r\n 1 Gentoo Biscoe 49.2 15.2 221 6300 male \r\n 2 Gentoo Biscoe 59.6 17 230 6050 male \r\n 3 Gentoo Biscoe 51.1 16.3 220 6000 male \r\n 4 Gentoo Biscoe 45.2 16.4 223 5950 male \r\n 5 Gentoo Biscoe 49.8 15.9 229 5950 male \r\n 6 Gentoo Biscoe 48.8 16.2 222 6000 male \r\n # ... with 1 more variable: year \r\n\r\nBut the recent {dplyr} 1.0.0 augmented slice() with variants like slice_min() and slice_max() that now supresede top_n():\r\n\r\n\r\n##### Pre-1.0.0 #####\r\n# penguins %>% \r\n# top_n(5, wt = body_mass_g)\r\n\r\npenguins %>% \r\n slice_max(order_by = body_mass_g, n = 5)\r\n\r\n\r\n # A tibble: 6 x 8\r\n species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex \r\n \r\n 1 Gentoo Biscoe 49.2 15.2 221 6300 male \r\n 2 Gentoo Biscoe 59.6 17 230 6050 male \r\n 3 Gentoo Biscoe 51.1 16.3 220 6000 male \r\n 4 Gentoo Biscoe 48.8 16.2 222 6000 male \r\n 5 Gentoo Biscoe 45.2 16.4 223 5950 male \r\n 6 Gentoo Biscoe 49.8 15.9 229 5950 male \r\n # ... with 1 more variable: year \r\n\r\nNote that the order of arguments is different for slice_min/max() - the first argument after piping is where you specify the variable for ordering rather than the number of rows to return, like in top_n().\r\nThis is because slice_min/max() gives you an option to either specify a certain number of rows n or a proportion of rows prop:\r\n\r\n\r\npenguins %>% \r\n slice_max(body_mass_g, prop = .01)\r\n\r\n\r\n # A tibble: 4 x 8\r\n species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex \r\n \r\n 1 Gentoo Biscoe 49.2 15.2 221 6300 male \r\n 2 Gentoo Biscoe 59.6 17 230 6050 male \r\n 3 Gentoo Biscoe 51.1 16.3 220 6000 male \r\n 4 Gentoo Biscoe 48.8 16.2 222 6000 male \r\n # ... with 1 more variable: year \r\n\r\nAnd actually, the most significant change with the new slice_*() functions is from adding appropriate behavior for grouped dataframes.\r\nSo for example, this example below returns the top 5% of penguins by weight for each species:\r\n\r\n\r\npenguins %>% \r\n group_by(species) %>% \r\n slice_max(body_mass_g, prop = .05)\r\n\r\n\r\n # A tibble: 16 x 8\r\n # Groups: species [3]\r\n species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g\r\n \r\n 1 Adelie Biscoe 43.2 19 197 4775\r\n 2 Adelie Biscoe 41 20 203 4725\r\n 3 Adelie Torge~ 42.9 17.6 196 4700\r\n 4 Adelie Torge~ 39.2 19.6 195 4675\r\n 5 Adelie Dream 39.8 19.1 184 4650\r\n 6 Adelie Dream 39.6 18.8 190 4600\r\n 7 Adelie Biscoe 45.6 20.3 191 4600\r\n 8 Chinst~ Dream 52 20.7 210 4800\r\n 9 Chinst~ Dream 52.8 20 205 4550\r\n 10 Chinst~ Dream 53.5 19.9 205 4500\r\n 11 Gentoo Biscoe 49.2 15.2 221 6300\r\n 12 Gentoo Biscoe 59.6 17 230 6050\r\n 13 Gentoo Biscoe 51.1 16.3 220 6000\r\n 14 Gentoo Biscoe 48.8 16.2 222 6000\r\n 15 Gentoo Biscoe 45.2 16.4 223 5950\r\n 16 Gentoo Biscoe 49.8 15.9 229 5950\r\n # ... with 2 more variables: sex , year \r\n\r\nBut note that slice_*() functions do not modify groups in the result if the input is a grouped dataframe, so you need to explicitly add a call to ungroup() if you want to drop groups after slicing.\r\n7. count and sum by group with add_count()\r\nSaving my favorite lesser-known {dplyr} function for last!\r\nadd_count() adds a column with the counts of each group (or combination of groups):\r\n\r\n\r\n##### Long Form #####\r\n# penguins %>% \r\n# group_by(species) %>% \r\n# mutate(count_by_species = n()) %>% \r\n# ungroup()\r\n\r\npenguins %>% \r\n add_count(species, name = \"count_by_species\") %>% \r\n # cutting down some columns to show the new column\r\n select(-contains(\"mm\"))\r\n\r\n\r\n # A tibble: 333 x 6\r\n species island body_mass_g sex year count_by_species\r\n \r\n 1 Adelie Torgersen 3750 male 2007 146\r\n 2 Adelie Torgersen 3800 female 2007 146\r\n 3 Adelie Torgersen 3250 female 2007 146\r\n 4 Adelie Torgersen 3450 female 2007 146\r\n 5 Adelie Torgersen 3650 male 2007 146\r\n 6 Adelie Torgersen 3625 female 2007 146\r\n 7 Adelie Torgersen 4675 male 2007 146\r\n 8 Adelie Torgersen 3200 female 2007 146\r\n 9 Adelie Torgersen 3800 male 2007 146\r\n 10 Adelie Torgersen 4400 male 2007 146\r\n # ... with 323 more rows\r\n\r\nYou can use the wt to effectively get sums by group (perhaps hackish but very very useful):\r\n\r\n\r\n##### Long Form #####\r\n# penguins %>% \r\n# group_by(species) %>% \r\n# mutate(total_weight_by_species = sum(body_mass_g)) %>% \r\n# ungroup()\r\n \r\n\r\npenguins %>% \r\n add_count(species, wt = body_mass_g, name = \"total_weight_by_species\") %>% \r\n # cutting down some columns to show the new column\r\n select(-contains(\"mm\"))\r\n\r\n\r\n # A tibble: 333 x 6\r\n species island body_mass_g sex year total_weight_by_species\r\n \r\n 1 Adelie Torgersen 3750 male 2007 541100\r\n 2 Adelie Torgersen 3800 female 2007 541100\r\n 3 Adelie Torgersen 3250 female 2007 541100\r\n 4 Adelie Torgersen 3450 female 2007 541100\r\n 5 Adelie Torgersen 3650 male 2007 541100\r\n 6 Adelie Torgersen 3625 female 2007 541100\r\n 7 Adelie Torgersen 4675 male 2007 541100\r\n 8 Adelie Torgersen 3200 female 2007 541100\r\n 9 Adelie Torgersen 3800 male 2007 541100\r\n 10 Adelie Torgersen 4400 male 2007 541100\r\n # ... with 323 more rows\r\n\r\nAlso check out its more primitive version add_tally().\r\nBy default, add_tally() adds a count of rows, which you can already do with mutate(n = n()), but it shines when you make use of its wt argument:\r\n\r\n\r\npenguins %>% \r\n add_count(species, wt = body_mass_g, name = \"total_weight_by_species\") %>% \r\n add_tally(wt = body_mass_g, name = \"total_weight_of_all_species\") %>% \r\n select(1:2, last_col(0):last_col(1))\r\n\r\n\r\n # A tibble: 333 x 4\r\n species island total_weight_of_all_species total_weight_by_species\r\n \r\n 1 Adelie Torgersen 1400950 541100\r\n 2 Adelie Torgersen 1400950 541100\r\n 3 Adelie Torgersen 1400950 541100\r\n 4 Adelie Torgersen 1400950 541100\r\n 5 Adelie Torgersen 1400950 541100\r\n 6 Adelie Torgersen 1400950 541100\r\n 7 Adelie Torgersen 1400950 541100\r\n 8 Adelie Torgersen 1400950 541100\r\n 9 Adelie Torgersen 1400950 541100\r\n 10 Adelie Torgersen 1400950 541100\r\n # ... with 323 more rows\r\n\r\n\r\nWhat happens under the hood is actually copying of a sort, so this is probably not the best approach if you care about efficiency. As a case in point, you can’t use {tidyselect} helpers in transmute because you’re creating a new dataframe↩︎\r\n", "preview": "posts/2020-08-07-saving-a-line-of-piping/preview.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 877, "preview_height": 372 @@ -666,7 +666,7 @@ ], "contents": "\r\n\r\nContents\r\nVisualization\r\nThings I learned\r\nThings to improve\r\n\r\nCode\r\n\r\n\r\n\r\n\r\nVisualization\r\n \r\n\r\n\r\n\r\nThings I learned\r\ngeom_dumbbell() from {ggalt}\r\ncoord_capped_cart() and facet_rep_wrap() from {lemon}\r\nUsing the reorder_within() + facet_wrap(scales = \"free_y\") + scale_y_reordered() combo to sort within facets.\r\nUsing override.aes argument to manipulate legend aesthetics after they’re generated by the geom_*()s\r\nUsing slice_max() instead of top_n() to catch up with the new {dplyr} update\r\nThings to improve\r\nFont sizing and image resolution\r\nPlacement and size of legend is sorta awkward.\r\nPlot feels too… empty. I think I treated this too much like a figure for a journal article. Maybe add some background color next time?\r\nCode\r\nAlso available on github\r\n\r\n\r\nlibrary(tidyverse)\r\nlibrary(tidytuesdayR)\r\nlibrary(lemon)\r\nlibrary(ggalt)\r\nlibrary(patchwork)\r\nlibrary(extrafont)\r\n\r\ntheme_set(theme_classic())\r\n\r\n### Data\r\n\r\ntuesdata <- tidytuesdayR::tt_load('2020-08-04')\r\n\r\nenergy_types <- tuesdata$energy_types\r\n\r\nenergy_types_tidy <- energy_types %>% \r\n pivot_longer(where(is.double), names_to = \"Year\", values_to = \"GWh\")\r\n\r\n\r\nplot_data <- energy_types_tidy %>% \r\n add_count(country, Year, wt = GWh, name = \"Total\") %>% \r\n mutate(GWh_prop = GWh/Total) %>% \r\n select(-country_name, -GWh, -Total , -level) %>% \r\n filter(Year %in% c(2016, 2018))\r\n \r\n\r\n### Plotting\r\n\r\np1 <- plot_data %>% \r\n filter(type == \"Conventional thermal\") %>% \r\n pivot_wider(names_from = Year, values_from = GWh_prop) %>% \r\n mutate(country = fct_reorder(country, `2018`, max, .desc = TRUE)) %>% \r\n mutate(increase = (`2018` - `2016`) > 0) %>% \r\n ggplot() +\r\n geom_dumbbell(\r\n aes(y = country, x = `2016`, xend = `2018`, color = increase),\r\n dot_guide = TRUE, dot_guide_size = 0.25,\r\n size = 2, colour_x = \"#babfb6\", colour_xend = \"#5f787b\"\r\n ) +\r\n scale_color_manual(values = c(\"#d69896\", \"#a1cf86\"), labels = c(\"2016\", \"2018\")) +\r\n guides(color = guide_legend(override.aes = list(color = c(\"#babfb6\", \"#5f787b\"), size = 3))) +\r\n labs(title = \"Conventional Thermal Energy\",\r\n y = \"Country Codes\",\r\n color = NULL) +\r\n theme(legend.position = c(.75, .85),\r\n axis.title.y = element_text(size = 12, vjust = 5))\r\n\r\np2 <- plot_data %>% \r\n filter(type != \"Conventional thermal\") %>% \r\n pivot_wider(names_from = Year, values_from = GWh_prop) %>% \r\n mutate(type = fct_lump(type, n = 3, w = `2018`)) %>% \r\n group_by(type, country) %>% \r\n summarize(`2016` = sum(`2016`), `2018` = sum(`2018`)) %>% \r\n slice_max(`2018`, n = 10, with_ties = FALSE) %>% \r\n mutate(country = tidytext::reorder_within(country, `2018`, type)) %>% \r\n mutate(increase = (`2018` - `2016`) > 0) %>% \r\n ggplot() +\r\n geom_dumbbell(\r\n aes(y = country, x = `2016`, xend = `2018`, color = increase),\r\n dot_guide = TRUE, dot_guide_size = .4,\r\n size = 2.5, colour_x = \"#babfb6\", colour_xend = \"#5f787b\",\r\n show.legend = FALSE\r\n ) +\r\n scale_color_manual(values = c(\"#d69896\", \"#a1cf86\")) +\r\n tidytext::scale_y_reordered() +\r\n facet_rep_wrap(~type, scales = \"free_y\") +\r\n labs(title = \"Clean Energy\",\r\n subtitle = \"Leaders in Each Category (Top 10)\",\r\n y = NULL)\r\n\r\npatched <- p1 + p2 &\r\n coord_capped_cart(bottom = \"both\") &\r\n scale_x_continuous(labels = scales::percent) &\r\n labs(x = NULL) &\r\n theme(\r\n plot.title = element_text(hjust = 0.5, size = 16, face = \"bold\"),\r\n text = element_text(family = \"Montserrat\"),\r\n panel.grid.major.y = element_blank(),\r\n plot.margin = unit(c(.4,.2,.2,.4), \"cm\"),\r\n plot.background = element_rect(color = \"transparent\")\r\n )\r\n\r\npatched + plot_annotation(title = \"Electricity Production in Europe\",\r\n subtitle = \"A comparison between 2016 and 2018\",\r\n caption = \"Percent Accounting for the Country's Total Electricity Generated\",\r\n theme = list(plot.title = element_text(size = 22),\r\n plot.subtitle = element_text(face = \"italic\", hjust = .5),\r\n plot.caption = element_text(size = 12, hjust = .5)))\r\n\r\n\r\n\r\n\r\n\r\n\r\n", "preview": "posts/2020-08-04-tidytuesday-2020-week-32/preview.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 1444, "preview_height": 805 @@ -691,7 +691,7 @@ ], "contents": "\r\n\r\nContents\r\nBackground\r\nAnalysis #1 - Size of monthly playlists over time\r\nAnalysis #2 - Audio features\r\nAnalysis #3 - Songs during college\r\nAnalysis #4 - “My Top 100” playlists\r\nConclusion\r\n\r\n\r\n\r\n\r\nBackground\r\nOne of my longest running habits is making monthly playlists. At the start of every month, I create a new playlist for that month and add songs that I like. Some songs are carried over from the previous month’s playlist and others are songs that I newly discover, but they’re all representative of the songs that I’m “into” for that month.\r\nI’ve been doing this for many years, and have the best record of my monthly playlists for the past 6 years, which is how long I’ve been using spotify. So when I saw people talking about {spotifyr} - an R wrapper for Spotify’s web API - on twitter, I decided to take a stab at analyzing my monthly playlists (code here).\r\n\r\n\r\n\r\n\r\nAnalysis #1 - Size of monthly playlists over time\r\nWhen I first pulled information about my Spotify account, I noticed that I had some gaps in my monthly playlists. This was a special case of non-random missing data: when I didn’t make a new playlist for the month, it’s because I didn’t think that there was a substantial change in what I’m jamming to from the previous month. The {zoo} package, which I didn’t know about before, came in very handy here for dealing with this missingness with its na.locf() (Last Observation Carried Forward) function.\r\nAfter some more cleaning steps, I first made a simple plot that counted the number of songs that each monthly playlist had.\r\n\r\nThere are some interesting things I notice, and here’s some context for them.\r\n2014, the year I began using Spotify, has high number of songs per playlist. This makes sense because the transition period involved a lot of transferring-over-in-bulk of my favorite music from iTunes (which I used previously).\r\nThe three consecutive months of missing playlists in 2017 was over a summer camp. I had forgotten about that until I started this analysis, but for that summer I just kept one playlist called “Summer 2017.”\r\nBoth the playlist with the most songs (78) and the playlist with the least songs (1) are in 2017. That was the year when I was exposed to a lot with different music genres from going to parties my friends. That was also my Sophomore year, which was a hot mess, but that’s a story for another time. You can get a better sense of my music taste being all over the place in the next section.\r\nMay 2017 was the month of sad azn vibes. My playlist had a single song, which was American Girl by Mitski and I think that says everythihng.\r\nIn November 2017, the month with the largest playlist, a friend who was known for having a great taste in music shared with me their playlist of 200+ EDM songs. I never really got into EDM previously but that was a gamechanger. I kind of went through an EDM phase for a short while after that, and November 2017 playlist is still the playlist I go back to when I want to listen to some good EDM.\r\nThe missing playlist of March 2020 was at perhaps the busiest time of my life - I was juggling my thesis, finals, grad school visits/decisions, finishing up my graduation requirements, moving out of campus back to Korea, and, of course, dealing with the pandemic. Makes sense why that month is missing its playlist.\r\nAnalysis #2 - Audio features\r\nThe real deal of Spotify API is actually the audio features, which Spotify calculates using their special algorithms. Some of the features are listed in the table below (adopted from the documentation). Of these, I decided to narrow down to acousticness, danceability, energy, and valence because others didn’t really show much variation (e.g., I don’t listen to live-recorded music on Spotify, so liveness is always near zero).\r\n\r\nFeature\r\n Description\r\n acousticness\r\n A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.\r\n danceability\r\n Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.\r\n energy\r\n Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.\r\n instrumentalness\r\n Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.\r\n liveness\r\n Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.\r\n loudness\r\n The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.\r\n speechiness\r\n Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.\r\n tempo\r\n The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.\r\n valence\r\n A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).\r\n \r\n\r\nI was interested in looking at how my music taste changed over time, so for each monthly playlist, I calculated the mean values for these four features and made a line plot:\r\n\r\nSome things that pop out:\r\nOverall, I tend to listen to bright music (high dancibility and energy) but there is still a lot of variation (especially in valence).\r\nAcousticness stays pretty low and doesn’t seem to correlate much with time. This makes sense because I primarily listen to pop and electric songs, which are both very far from sounding acoustic.\r\nThe one dip in May 2017 is, you guessed it, that playlist with just that one Mitski song. That was the only playlist where the mean danceability and energy values are below 0.5.\r\nIt also looks like I listen to brighter music with every passing year, though this trend is subtle. This it’s a bit easier to see in the animation below (especially for energy and valence).\r\n\r\nAnalysis #3 - Songs during college\r\nNext, I wanted to focus on my years in college, from Fall 2016 to Spring 2020. For this analysis, I defined time in terms of school years and quarters. While I was a college student, I often felt like the passage of time was defined in terms of quarters, so this scale felt appropriate.\r\nHere is the same line plot, except the feature values are averaged by quarter instead of month, and the plot is now faceted by school year:\r\n\r\nObservations:\r\nThe trend in increasing positivity can be observed in this plot as well (with my Junior year coming out on top).\r\nI was actually wondering whether I’d see a pattern by quarter, but there doesn’t seem to be any strong ones.\r\nAnalysis #4 - “My Top 100” playlists\r\nIn my last analysis, I move from my monthly playlists to the end-of-the-year playlists that Spotify makes for you every year.\r\nFor this, I grabbed audio features of songs in my yearly top 100 playlists from 2016-2019. In this graph, each line represents a song and the top 10 most listened to song of each year are emphasized in black. The thick red line in each panel represents the average of the songs for that year.\r\n\r\nSome observations on the variation in audio features among my top 100 playlists:\r\nAlthough the monthly averages in the previous graphs showed extremely low acousticness values (never going above 0.5), that actually hid a lot of variation. You can see that a good number of songs with high acousticness make it to my top 100 (and sometimes even top 10) songs, but that number seems to gradually decline over time. The song with highest acousticness, Ripchord by Rilo Kiley comes from my 2016 playlist. Rilo Kiley is an indie rock band that had a strong influence on my music taste during high school, so it’s no surprise that this song made it there.\r\nThere’s also a trend of decreasing overall variability in the audio features of my top 100 songs. Perhaps this means that I’m narrowing in on my true music taste? Or maybe that I’m going through a phase in music taste? It’s kind of hard to tell but interesting nonetheless.\r\nConclusion\r\nI didn’t really dig too deeply into the acoustic profile of the songs I listen to in this post, and I doubt that Spotify’s list of audio features are comprehensive enough to describe my music taste, but this was a cool exercise!\r\nAnd although I ignored several of the audio features because they weren’t very informative for the songs I listen to, I thought I should at least leave a summary table showing the mean values for all features that I gave in the table above!\r\n\r\nFeature\r\n Description\r\n Average\r\n acousticness\r\n A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.\r\n 0.16\r\n danceability\r\n Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.\r\n 0.64\r\n energy\r\n Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.\r\n 0.75\r\n instrumentalness\r\n Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.\r\n 0.02\r\n liveness\r\n Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.\r\n 0.19\r\n loudness\r\n The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.\r\n -4.95\r\n speechiness\r\n Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.\r\n 0.08\r\n tempo\r\n The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.\r\n 120.25\r\n valence\r\n A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).\r\n 0.55\r\n \r\n\r\n\r\n\r\n\r\n", "preview": "posts/2020-07-29-six-years-of-my-spotify-playlists/preview.png", - "last_modified": "2022-11-13T09:16:56-05:00", + "last_modified": "2022-11-13T06:16:56-08:00", "input_file": {}, "preview_width": 3593, "preview_height": 2459 @@ -712,7 +712,7 @@ ], "contents": "\r\n\r\n\r\n\r\nWhen I was an RA in the LEARN lab - a child language development lab at Northwestern - I worked on a shiny app that automates snowball search for meta-analysis research (relevant research poster). Long story short, I worked on it for a couple months, got it working, then stopped working on it for another couple months, and had the chance to revisit it just recently.\r\nWhen I picked the project back up, I realized that my old code was poorly commented, somewhat inefficient, and even hackish at times. So I decided to re-write it from scratch. In this second time around, I learned a lot of useful functions/tricks that really helped streamline my code and I thought I’d document my three favorite ones here for future reference.\r\n1. %||% from {rlang}\r\nBasically, %||% is an infix operator that returns the left-hand side when it is not Null and the right-hand side when it is Null. It’s from the {rlang} package but you can also define the function yourself:\r\n\r\n\r\n\"%||%\" <- function(lhs, rhs) {\r\n if (is.null(lhs)) rhs else lhs\r\n}\r\n\r\na <- 10\r\nb <- NULL\r\n\r\na %||% b\r\n\r\n\r\n [1] 10\r\n\r\nb %||% a\r\n\r\n\r\n [1] 10\r\n\r\n# note how the output is different when b is no longer null\r\nb <- 11\r\nb %||% a\r\n\r\n\r\n [1] 11\r\n\r\nI found this operator to be extremely useful when displaying empty tables as placeholders when using DT::datatable(). It allows me to communicate to the user where a table is expected to appear rather than just not showing anything at all when no data is loaded (which is what happens by default).\r\nFor example, if you want to show an empty column with an empty row when the data (here, the reactive variable mydf) is null, you might do the following:\r\n\r\n\r\nmydf_display <- renderDataTable({\r\n datatable(mydf() %||% tibble(` ` = NA))\r\n})\r\n\r\n\r\n\r\nAnother use-case for %||% is when I’m trying a sequence of function calls until one one of them succeeds and returns a non-null value. For example, say I want to scrape some information online and I have API wrappers for different websites that potentially have that information. I can chain them together using %||% like so:\r\n\r\n\r\nmyinfo <- \r\n scrape_website1() %||%\r\n scrape_website2() %||%\r\n scrape_website3()\r\n\r\n\r\n\r\nThis is much neater than nested if…else statements!\r\n2. purrr::imap() and {shinybusy}\r\nUsing my shiny app involves a lot of waiting (querying online databases), so I looked into ways to show a progress bar similar to the family of *Modal() functions from {shiny}. The extension package {shinybusy} (project site) offers a very satisfying solution to this problem.\r\nBasically, you initialize a progress bar with show_modal_progress_*() and increment its value inside whatever operation you’re doing. Here’s a pseudo code demonstrating how it works:\r\n\r\ninitialize a progress bar\r\n\r\ncreate a new_list of same size to store output\r\n\r\nfor index in seq_along(list):\r\n new_list[index] <- calculations(list[index])\r\n increment progress bar by index\r\n \r\nremove progress bar\r\n\r\nreturn new_list\r\n\r\nBut in my case, my “do stuff” part didn’t involve a big wall of code because I packed it into a single function in a separate file that I source at the beginning. This, coupled with my general aversion to for-loops, drove me to imap() and its variants from {purrr}. imap() is like map() except it also keeps track of the index of the element that you’re operating on (to put it another way, it’s like map2() where .y is the index).\r\nNow, you don’t need an explicit for-loop to increment and the above code can be reduced to this:\r\n\r\ninitialize a progress bar\r\n\r\nnew_list <- imap(list,\r\n ~{\r\n calculations(.x)\r\n increment progress bar by .y\r\n })\r\n \r\nremove progress bar\r\n\r\nreturn new_list\r\n\r\nIn my opinion, this is much cleaner! For a more concrete example, here’s a template using actual code:\r\n\r\n\r\nmy_data <- eventReactive(input$my_button, {\r\n \r\n # initialize a progress bar\r\n show_modal_progress_line()\r\n \r\n # do operation on elements of vector\r\n result <- imap(my_reactive_var(),\r\n ~{\r\n update_modal_progress(value = .y / length(my_reactive_var()))\r\n my_operation_on_element(.x)\r\n })\r\n \r\n # remove progress bar\r\n remove_modal_progress()\r\n \r\n # return output\r\n return(result)\r\n \r\n})\r\n\r\n\r\n\r\n3. User inputs inside modalDialog()\r\nIn {shiny}, you can show the user a pop-up message box by first laying out the content of the message in modalDialog() and then rendering it with showModal(). In the first version of my app, I used this to show simple messages like warnings, but did you know that you can include any *Input widgets too?\r\nFor example, this code renders a pop-up box for a file upload in response to a button click:\r\n\r\n\r\nobserveEvent(input$MyButton, {\r\n showModal(modalDialogue(\r\n title = \"Upload File Here\",\r\n fileInput(inputID = \"UploadedFile\", label = \"Upload\")\r\n ))\r\n})\r\n\r\n\r\n\r\nAnd you can access whatever is uploaded using input$UploadedFile like you would if the file upload widget was in the ui side of the app!\r\nThis took me a bit to get used to because you are defining the modal in the server side where the content of the modal looks like the ui side but can be accessed back at the server side. But this was life-changing and it opened up a lot of potential for my GUI to be less cluttered. Using this neat trick, I was able to move a large feature into a modal that would only be available upon a click of a button (it was a feature designed for a rare case scenario so I thought I’d save the user from having to see the entire interface for that if they don’t ask for it).\r\nEnding note\r\nThe more I learn and use shiny, the less I feel like I know. I’m actually enjoying this stage of my progress because every new thing just absolutely wows me (and I hope to continue sharing what I learn - hence this being the “first set”). And very much looking forward to Hadley Wickham’s new book on shiny!\r\n\r\n\r\n\r\n", "preview": "posts/2020-07-20-shiny-tips-1/preview.png", - "last_modified": "2022-11-13T09:16:55-05:00", + "last_modified": "2022-11-13T06:16:55-08:00", "input_file": {}, "preview_width": 746, "preview_height": 133 @@ -734,7 +734,7 @@ ], "contents": "\r\n\r\n\r\n\r\nRaincloud Plots\r\n\r\n\r\n\r\nGeoms for rainplots (a.k.a. split violin plots) already exist, but you might have a very special case where you have pairs of rainplots and you want to track the change in individual datapoints between the rainplot distributions.\r\nFor example, say you want to track the height of a plant species across two timepoints and you want to communicate three information:\r\nThe change in the distribution of plant heights between timepoints.\r\nThe individual variation in height (“intercept”).\r\nThe individual variation in change of height between timepoints (“slope”).\r\nAnd the data looks like this:\r\n\r\n\r\nset.seed(1234)\r\nplants <- tibble(Species = \"Dwarf\",\r\n Plant = rep(factor(1:100), 2),\r\n Timepoint = rep(c(\"First\", \"Second\"), each = 100),\r\n Height = c(rnorm(100, 10, 5), rnorm(100, 20, 8)))\r\n\r\nplants %>% \r\n group_by(Timepoint) %>% \r\n summarize(across(Height, list(mean = mean, sd = sd), .names = \"{col}_{fn}\"))\r\n\r\n\r\n # A tibble: 2 x 3\r\n Timepoint Height_mean Height_sd\r\n \r\n 1 First 9.22 5.02\r\n 2 Second 20.3 8.26\r\n\r\nYou can use geom_violhalf() from the {see} package to do this:\r\n\r\n\r\nlibrary(see)\r\nggplot(plants, aes(Timepoint, Height, fill = Timepoint)) +\r\n geom_violinhalf() +\r\n geom_point(aes(group = Plant),\r\n position = position_nudge(-.05),\r\n alpha = 0.5, shape = 16) +\r\n geom_line(aes(group = Plant),\r\n position = position_nudge(-.05))\r\n\r\n\r\n\r\n\r\nBut it’d look better if the lines don’t cross over the raincloud for the first timepoint.\r\ngeom_paired_raincloud() automatically flips the first raincloud for you! You do get a warining that there are overlapping points, but that’s because the x-axis is categorical and {ggplot2} thinks that flipping the raincloud intrudes into a different category. AFAIK you don’t lose any data despite this warning, but you should double check to be sure.\r\n\r\n\r\ndevtools::source_url(\"https://raw.githubusercontent.com/yjunechoe/geom_paired_raincloud/master/geom_paired_raincloud.R\")\r\n\r\nggplot(plants, aes(Timepoint, Height, fill = Timepoint)) +\r\n geom_paired_raincloud()\r\n\r\n\r\n Warning: position_dodge requires non-overlapping x intervals\r\n\r\n\r\nWe can add individual points and lines onto this plot in a similar way, except you need to use a 2-length vector for position_dodge().\r\n\r\n\r\nplants %>% \r\n # arrange by individual plant\r\n arrange(Plant) %>% \r\n ggplot(aes(Timepoint, Height, fill = Timepoint)) +\r\n geom_paired_raincloud() +\r\n geom_point(aes(group = Plant),\r\n position = position_nudge(c(.05, -.05)),\r\n alpha = 0.5, shape = 16,\r\n show.legend = FALSE) +\r\n geom_line(aes(group = Plant),\r\n position = position_nudge(c(.05, -.05)))\r\n\r\n\r\n\r\n\r\nNOTE: you need to make sure that the data is arranged by the variable you’re using for the group aesthetic (in this case, Plant) before being passed into ggplot() for position_nudge() in the other geoms to work properly (sorry it’s a bit hacky):\r\ngeom_paired_raincloud works as long as the grouping is of length two (i.e., as long as you’re comparing distribution between two levels).\r\nLet’s modify the plants dataset to include another species of plant:\r\n\r\n\r\nplants2 <- plants %>% \r\n bind_rows(\r\n tibble(Species = \"Giant\",\r\n Plant = rep(factor(101:200), 2),\r\n Timepoint = rep(c(\"First\", \"Second\"), each = 100),\r\n Height = c(rnorm(100, 30, 5), rnorm(100, 50, 8)))\r\n )\r\n\r\nplants2 %>% \r\n group_by(Species, Timepoint) %>% \r\n summarize(across(Height, list(mean = mean, sd = sd), .names = \"{col}_{fn}\"))\r\n\r\n\r\n # A tibble: 4 x 4\r\n # Groups: Species [2]\r\n Species Timepoint Height_mean Height_sd\r\n \r\n 1 Dwarf First 9.22 5.02\r\n 2 Dwarf Second 20.3 8.26\r\n 3 Giant First 30.8 4.80\r\n 4 Giant Second 49.9 8.40\r\n\r\nIn this new plot, I just added facet_wrap(~Species)\r\n\r\n\r\nplants2 %>% \r\n arrange(Plant) %>% \r\n ggplot(aes(Timepoint, Height, fill = Timepoint)) +\r\n geom_paired_raincloud() +\r\n geom_point(aes(group = Plant),\r\n position = position_nudge(c(.05, -.05)),\r\n alpha = 0.5, shape = 16,\r\n show.legend = FALSE) +\r\n geom_line(aes(group = Plant),\r\n position = position_nudge(c(.05, -.05))) +\r\n facet_wrap(~Species)\r\n\r\n\r\n\r\n\r\ngeom_paired_raincloud() isn’t particularly useful for plotting comparisons between more than two levels, so it throws a warning when that’s the case:\r\n\r\n\r\n# Adding a third timepoint\r\nplants3 <- plants %>% \r\n bind_rows(tibble(Species = \"Dwarf\",\r\n Plant = factor(1:100),\r\n Timepoint = \"Third\",\r\n Height = rnorm(100, 40, 10)))\r\n\r\nplants3 %>% \r\n group_by(Timepoint) %>% \r\n summarize(across(Height, list(mean = mean, sd = sd), .names = \"{col}_{fn}\"))\r\n\r\n\r\n # A tibble: 3 x 3\r\n Timepoint Height_mean Height_sd\r\n \r\n 1 First 9.22 5.02\r\n 2 Second 20.3 8.26\r\n 3 Third 39.8 11.2\r\n\r\n\r\n\r\nplants3 %>% \r\n arrange(Plant) %>% \r\n ggplot(aes(Timepoint, Height, fill = Timepoint)) +\r\n geom_paired_raincloud() +\r\n geom_point(aes(group = Plant),\r\n position = position_nudge(c(.05, -.05)),\r\n alpha = 0.5, shape = 16,\r\n show.legend = FALSE) +\r\n geom_line(aes(group = Plant),\r\n position = position_nudge(c(.05, -.05)))\r\n\r\n\r\n\r\n\r\nBut I think geom_paired_raincloud() works great if you have the right data. Here’s an example from my recent work, looking at the variation in how subjects respond to stimuli when they’re presented in one condition (Subject Accent) compared to the other (Verb Accent).\r\n\r\n\r\n\r\nThe above plot is a combination of geom_paired_raincloud(), geom_point(), geom_line(), and geom_boxplot(). I like that you can employ all these aesthetics at once without making the plot too overwhelming. I’ve included the important part of the code here and the full code is available at the github repo for this research project.\r\n\r\n\r\nrainplot_data %>% \r\n ggplot(aes(x = Cond, y = z_RT, fill = Cond)) +\r\n geom_paired_raincloud(alpha = .5) +\r\n geom_point(aes(group = Item),\r\n position = position_nudge(c(.15, -.15)),\r\n alpha = .5, shape = 16) +\r\n geom_line(aes(group = Item),\r\n position = position_nudge(c(.13, -.13)),\r\n linetype = 3) +\r\n geom_boxplot(position = position_nudge(c(.07, -.07)),\r\n alpha = .5, width = .04, outlier.shape = \" \") +\r\n facet_wrap(~Type, scales = \"free_x\") +\r\n ...\r\n\r\n\r\n\r\n\r\nNotice how I use position_nudge() to make sure that the points, lines, and boxplots are side-by-side and not overlapping with each other.\r\n\r\n\r\n\r\n", "preview": "posts/2020-07-13-geom-paired-raincloud/preview.png", - "last_modified": "2022-11-13T09:16:55-05:00", + "last_modified": "2022-11-13T06:16:55-08:00", "input_file": {}, "preview_width": 7086, "preview_height": 4251 @@ -758,7 +758,7 @@ ], "contents": "\r\n\r\n\r\n\r\nTo steal the definition from Wikipedia, a treemap is used for “displaying hierarchical data using nested figures, usually rectangles.” There are lots of ways to make one in R, but I didn’t find any one existing solution appealing.\r\nFor illustration, let’s take the pokemon dataset from {highcharter} and plot a treemap with it using different methods.\r\n\r\n\r\n\r\n\r\n\r\ndata(\"pokemon\", package = \"highcharter\")\r\n\r\n# Cleaning up data for a treemap\r\ndata <- pokemon %>% \r\n select(pokemon, type_1, type_2, color_f) %>%\r\n mutate(type_2 = ifelse(is.na(type_2), paste(\"only\", type_1), type_2)) %>% \r\n group_by(type_1, type_2, color_f) %>% \r\n count(type_1, type_2) %>% \r\n ungroup()\r\n\r\nhead(data, 5)\r\n\r\n\r\n\r\ntype_1\r\n\r\n\r\ntype_2\r\n\r\n\r\ncolor_f\r\n\r\n\r\nn\r\n\r\n\r\nbug\r\n\r\n\r\nelectric\r\n\r\n\r\n#BBBD23\r\n\r\n\r\n2\r\n\r\n\r\nbug\r\n\r\n\r\nfighting\r\n\r\n\r\n#AD9721\r\n\r\n\r\n1\r\n\r\n\r\nbug\r\n\r\n\r\nfire\r\n\r\n\r\n#B9AA23\r\n\r\n\r\n2\r\n\r\n\r\nbug\r\n\r\n\r\nflying\r\n\r\n\r\n#A8AE52\r\n\r\n\r\n13\r\n\r\n\r\nbug\r\n\r\n\r\nghost\r\n\r\n\r\n#9AA03D\r\n\r\n\r\n1\r\n\r\n\r\n \r\n1. {treemap}\r\nHere’s a plot made from the {treemap} package:\r\n\r\n\r\nlibrary(treemap)\r\n\r\ntreemap(dtf = data,\r\n index = c(\"type_1\", \"type_2\"),\r\n vSize = \"n\",\r\n vColor = \"type_1\")\r\n\r\n\r\n\r\n\r\nIt actually doesn’t look too bad, but this package hasn’t been updated for 3 years and there aren’t a lot of options for customization. For the options that do exist, they’re a big list of additional arguments to the main workhorse function, treemap(), which feels a bit restrictive if you’re used to {ggplot}’s modular and layered grammar. So while it’s very simple to use, I’d probably use it only for exploring the data for myself.\r\n \r\n2. {highcharter}\r\nAll the way on the other side of this ease<—>customizability spectrum is {highcharter} which is arguably the most powerful data visualization package in R.\r\nWith highcharter, you can turn the previous graph into the following:\r\n\r\nThis looks much better, and it’s even interactive (although this particular one isn’t because I just copy pasted the image from this blog post from 2018). I’d use {highcharter} except that there isn’t a great documentation on plotting treemaps, and it definitely doesn’t help that {highcharter} has a pretty steep learning curve, even if you have a lot of experience with {ggplot2}.\r\nThe main problem I ran into is that the function hc_add_series_treemap() that was used to create the above graph is now depreciated. It redirects you to use hctreemap() which itself is also depreciated. That finally redirects you to use hctreemap2() which is pretty sparse in documentation and use-cases, and overall not very transparent IMO.\r\n \r\n3. {treemapify}\r\n{treemapify} is a ggplot solution to plotting treemaps.\r\nHere’s a plot of the pokemon dataset, adopting the example code from the vignette. Since it follows the layered grammar of ggplot, I figured I’d show what each of the four layers outlined in the code does:\r\n\r\n\r\nlibrary(treemapify)\r\n\r\nggplot(data, aes(area = n, fill = color_f, label = type_2,\r\n subgroup = type_1)) +\r\n # 1. Draw type_2 borders and fill colors\r\n geom_treemap() +\r\n # 2. Draw type_1 borders\r\n geom_treemap_subgroup_border() +\r\n # 3. Print type_1 text\r\n geom_treemap_subgroup_text(place = \"centre\", grow = T, alpha = 0.5, colour = \"black\",\r\n fontface = \"italic\", min.size = 0) +\r\n # 4. Print type_2 text\r\n geom_treemap_text(colour = \"white\", place = \"topleft\", reflow = T) +\r\n theme(legend.position = 0)\r\n\r\n\r\n\r\ngeom_treemap() draws type_2 borders and fill colors\r\n\r\n\r\n\r\ngeom_treemap_subgroup_border() draws type_1 borders\r\n\r\n\r\n\r\ngeom_treemap_subgroup_text() prints type_1 text\r\n\r\n\r\n\r\ngeom_treemap_text() prints type_2 text\r\n\r\n\r\n\r\nI find this the most appealing out of the three options and I do recommend this package, but I’m personally a bit hesistant to use it for three reasons:\r\nI don’t want to learn a whole ’nother family of geom_*s just to plot treemaps.\r\nSome of the ggplot “add-ons” that I like don’t really transfer over. For example, I can’t use geom_text_repel() from {ggrepel} because I have to use {treemapify}’s own text geoms like geom_treemap_subgroup_text() and geom_treemap_text().\r\nCustomization options are kind of a mouthful, and I’ve yet to see a nice-looking treemap that was plotted using this package. There are a couple example treemaps in the vignette but none of them look particularly good. An independently produced example here doesn’t look super great either.\r\n \r\nA Mixed (Hack-ish?) Solution\r\nBasically, I’m very lazy and I want to avoid learning any new packages or functions as much as possible.\r\nI’ve come up with a very simple solution to my self-created problem, which is to draw treemaps using geom_rect() with a little help from the {treemap} package introduced earlier.\r\nSo apparently, there’s a cool feature in treemap::treemap() where you can extract the plotting data.\r\nYou can do this by pulling the tm object from the plot function side-effect, and the underlying dataframe used for plotting looks like this.1:\r\n\r\n\r\ntm <- treemap(\r\n dtf = data,\r\n index = c(\"type_1\", \"type_2\"),\r\n vSize = \"n\",\r\n vColor = \"color_f\",\r\n type = 'color' # {treemap}'s equivalent of scale_fill_identity()\r\n)\r\n\r\n\r\n\r\nhead(tm$tm)\r\n\r\n\r\n\r\ntype_1\r\n\r\n\r\ntype_2\r\n\r\n\r\nvSize\r\n\r\n\r\nvColor\r\n\r\n\r\nstdErr\r\n\r\n\r\nvColorValue\r\n\r\n\r\nlevel\r\n\r\n\r\nx0\r\n\r\n\r\ny0\r\n\r\n\r\nw\r\n\r\n\r\nh\r\n\r\n\r\ncolor\r\n\r\n\r\nbug\r\n\r\n\r\nelectric\r\n\r\n\r\n2\r\n\r\n\r\n#BBBD23\r\n\r\n\r\n2\r\n\r\n\r\nNA\r\n\r\n\r\n2\r\n\r\n\r\n0.4556639\r\n\r\n\r\n0.3501299\r\n\r\n\r\n0.0319174\r\n\r\n\r\n0.0872727\r\n\r\n\r\n#BBBD23\r\n\r\n\r\nbug\r\n\r\n\r\nfighting\r\n\r\n\r\n1\r\n\r\n\r\n#AD9721\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\n2\r\n\r\n\r\n0.4556639\r\n\r\n\r\n0.3064935\r\n\r\n\r\n0.0319174\r\n\r\n\r\n0.0436364\r\n\r\n\r\n#AD9721\r\n\r\n\r\nbug\r\n\r\n\r\nfire\r\n\r\n\r\n2\r\n\r\n\r\n#B9AA23\r\n\r\n\r\n2\r\n\r\n\r\nNA\r\n\r\n\r\n2\r\n\r\n\r\n0.4875812\r\n\r\n\r\n0.3501299\r\n\r\n\r\n0.0319174\r\n\r\n\r\n0.0872727\r\n\r\n\r\n#B9AA23\r\n\r\n\r\nbug\r\n\r\n\r\nflying\r\n\r\n\r\n13\r\n\r\n\r\n#A8AE52\r\n\r\n\r\n13\r\n\r\n\r\nNA\r\n\r\n\r\n2\r\n\r\n\r\n0.2757660\r\n\r\n\r\n0.2628571\r\n\r\n\r\n0.1160631\r\n\r\n\r\n0.1560000\r\n\r\n\r\n#A8AE52\r\n\r\n\r\nbug\r\n\r\n\r\nghost\r\n\r\n\r\n1\r\n\r\n\r\n#9AA03D\r\n\r\n\r\n1\r\n\r\n\r\nNA\r\n\r\n\r\n2\r\n\r\n\r\n0.4556639\r\n\r\n\r\n0.2628571\r\n\r\n\r\n0.0319174\r\n\r\n\r\n0.0436364\r\n\r\n\r\n#9AA03D\r\n\r\n\r\nbug\r\n\r\n\r\ngrass\r\n\r\n\r\n6\r\n\r\n\r\n#9CBB2B\r\n\r\n\r\n6\r\n\r\n\r\nNA\r\n\r\n\r\n2\r\n\r\n\r\n0.4744388\r\n\r\n\r\n0.4374026\r\n\r\n\r\n0.0450598\r\n\r\n\r\n0.1854545\r\n\r\n\r\n#9CBB2B\r\n\r\n\r\nWe can simply use this data to recreate the treemap that was made with {treemapify} - except this time we have more flexibility!\r\nFirst, we do some data cleaning:\r\n\r\n\r\ntm_plot_data <- tm$tm %>% \r\n # calculate end coordinates with height and width\r\n mutate(x1 = x0 + w,\r\n y1 = y0 + h) %>% \r\n # get center coordinates for labels\r\n mutate(x = (x0+x1)/2,\r\n y = (y0+y1)/2) %>% \r\n # mark primary groupings and set boundary thickness\r\n mutate(primary_group = ifelse(is.na(type_2), 1.2, .5)) %>% \r\n # remove colors from primary groupings (since secondary is already colored)\r\n mutate(color = ifelse(is.na(type_2), NA, color))\r\n\r\n\r\n\r\nThen we plot. It looks like I can recreate a lot of it with a little help from the {ggfittext} package that was in the source code2:\r\n\r\n\r\nggplot(tm_plot_data, aes(xmin = x0, ymin = y0, xmax = x1, ymax = y1)) + \r\n # add fill and borders for groups and subgroups\r\n geom_rect(aes(fill = color, size = primary_group),\r\n show.legend = FALSE, color = \"black\", alpha = .3) +\r\n scale_fill_identity() +\r\n # set thicker lines for group borders\r\n scale_size(range = range(tm_plot_data$primary_group)) +\r\n # add labels\r\n ggfittext::geom_fit_text(aes(label = type_2), min.size = 1) +\r\n # options\r\n scale_x_continuous(expand = c(0, 0)) +\r\n scale_y_continuous(expand = c(0, 0)) +\r\n theme_void()\r\n\r\n\r\n\r\n\r\nNow, I can be a lot more flexible with my customizations.\r\nFor example, let’s say I wanted to isolate and emphasize the secondary types that have unique type-combinations with steel, AND also provide the name of the corresponding pokemon.\r\nI can do this by using geom_text_repel() for a subset of the labels while keeping the same geom_fit_text() setting for the rest of the labels.\r\n\r\n\r\ntm_plot_data %>% \r\n ggplot(aes(xmin = x0, ymin = y0, xmax = x1, ymax = y1)) + \r\n geom_rect(aes(fill = color, size = primary_group),\r\n show.legend = FALSE, color = \"black\", alpha = .3) +\r\n scale_fill_identity() +\r\n scale_size(range = range(tm_plot_data$primary_group)) +\r\n ggfittext::geom_fit_text(data = filter(tm_plot_data, type_1 != \"steel\" | vSize > 1),\r\n aes(label = type_2), min.size = 1) +\r\n # pick out observations of interest and annotate with geom_text_repel\r\n ggrepel::geom_text_repel(\r\n data = filter(tm_plot_data, vSize == 1, type_1 == \"steel\") %>% \r\n inner_join(pokemon, by = c(\"type_1\", \"type_2\")),\r\n aes(x = x, y = y, label = glue::glue(\"{type_2} ({pokemon})\")),\r\n color = \"black\", xlim = c(1.02, NA), size = 4,\r\n direction = \"y\", vjust = .5, force = 3\r\n ) +\r\n # expand x-axis limits to make room for test annotations\r\n scale_x_continuous(limits = c(0, 1.2), expand = c(0, 0)) +\r\n scale_y_continuous(expand = c(0, 0)) +\r\n theme_void()\r\n\r\n\r\n\r\n\r\nAnd that’s our final product! This would’ve been pretty difficult to do with any of the three options I reviewed at the top!\r\ntl;dr - Use treemap() from the {treemap} package to get positions for geom_rect()s and you’re 90% of the way there to plotting a treemap! Apply your favorite styles (especially _text() geoms) from the {ggplot2} ecosystem for finishing touches!\r\n \r\nSession Info\r\n\r\n\r\nsessionInfo()\r\n\r\n\r\n R version 4.0.3 (2020-10-10)\r\n Platform: x86_64-w64-mingw32/x64 (64-bit)\r\n Running under: Windows 10 x64 (build 18363)\r\n \r\n Matrix products: default\r\n \r\n locale:\r\n [1] LC_COLLATE=English_United States.1252 \r\n [2] LC_CTYPE=English_United States.1252 \r\n [3] LC_MONETARY=English_United States.1252\r\n [4] LC_NUMERIC=C \r\n [5] LC_TIME=English_United States.1252 \r\n \r\n attached base packages:\r\n [1] stats graphics grDevices datasets utils methods base \r\n \r\n other attached packages:\r\n [1] treemapify_2.5.3 treemap_2.4-2 printr_0.1 forcats_0.5.0 \r\n [5] stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 \r\n [9] tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0 \r\n \r\n loaded via a namespace (and not attached):\r\n [1] httr_1.4.2 jsonlite_1.7.1 modelr_0.1.8 \r\n [4] shiny_1.5.0 assertthat_0.2.1 highr_0.8 \r\n [7] blob_1.2.1 renv_0.12.0 cellranger_1.1.0 \r\n [10] ggrepel_0.8.2 yaml_2.2.1 pillar_1.4.6 \r\n [13] backports_1.1.10 glue_1.4.2 digest_0.6.26 \r\n [16] RColorBrewer_1.1-2 promises_1.1.1 rvest_0.3.6 \r\n [19] colorspace_1.4-1 htmltools_0.5.0 httpuv_1.5.4 \r\n [22] pkgconfig_2.0.3 broom_0.7.2 haven_2.3.1 \r\n [25] xtable_1.8-4 scales_1.1.1 later_1.1.0.1 \r\n [28] distill_1.0.1 downlit_0.2.0 generics_0.0.2 \r\n [31] farver_2.0.3 ellipsis_0.3.1 withr_2.2.0 \r\n [34] cli_2.1.0 magrittr_1.5.0.9000 crayon_1.3.4 \r\n [37] readxl_1.3.1 mime_0.9 evaluate_0.14 \r\n [40] fs_1.5.0 fansi_0.4.1 xml2_1.3.2 \r\n [43] tools_4.0.3 data.table_1.13.2 hms_0.5.3 \r\n [46] lifecycle_0.2.0 gridBase_0.4-7 munsell_0.5.0 \r\n [49] reprex_0.3.0 compiler_4.0.3 rlang_0.4.8 \r\n [52] grid_4.0.3 gt_0.2.2 rstudioapi_0.11 \r\n [55] igraph_1.2.6 labeling_0.4.2 rmarkdown_2.5 \r\n [58] gtable_0.3.0 DBI_1.1.0 R6_2.4.1 \r\n [61] lubridate_1.7.9 knitr_1.30 fastmap_1.0.1 \r\n [64] prismatic_0.2.0 stringi_1.5.3 Rcpp_1.0.5 \r\n [67] vctrs_0.3.4 ggfittext_0.9.0 dbplyr_1.4.4 \r\n [70] tidyselect_1.1.0 xfun_0.18\r\n\r\n\r\nYou might get a warning referencing something about data.table here. No worries if this happens. The outdated {treemap} source code is built on {data.table} and contains a deprecated argument.↩︎\r\nI highly recommend checking {ggfittext} out! Here’s the github repo. Also, this is more of a note to myself but I had some trouble getting this to work at first because the min.size argument defaults to 4, meaning that all fitted text smaller than size 4 are simply not plotted (so I couldn’t get geom_fit_text() to print anything in my treemap at first). You can compare and see the threshold by looking at the geom_text_repel() texts in my second example which also has a size of 4.↩︎\r\n", "preview": "posts/2020-06-30-treemap-with-ggplot/2020-06-30-treemap-with-ggplot_files/figure-html5/unnamed-chunk-12-1.png", - "last_modified": "2022-11-13T09:16:55-05:00", + "last_modified": "2022-11-13T06:16:55-08:00", "input_file": {}, "preview_width": 1920, "preview_height": 768 @@ -781,7 +781,7 @@ ], "contents": "\r\n\r\n\r\n\r\nThe {spacyr} package is an R wrapper for Python’s spaCy package, powered by {reticulate}. Although it’s been around for over 3 years, it doesn’t seem to have really been picked up by R users.1 I actually think this makes sense since what makes spaCy so great is its object-oriented approach to NLP (which Python is good at). But perhaps more importantly, a good portion of data wrangling in spaCy is reducible to operating on vectors of such tokens, and I think that comes pretty naturally for R users with a functional programming background.2 So my guess is that since spaCy is accessible to R users, {spacyr} isn’t that widely used.\r\nBut with that said, I like to make my workflow as R-centered as possible and I think there’s still value in {spacyr} at least for very simple, exploratory analysis of text. The results being returned in a tidy format is a huge plus, and it doesn’t seem to sacrifice much speed.\r\nThere’s a good guide to using {spacyr} in the CRAN vignette which covers pretty much everything you need to know if you’re already familiar with spaCy (and if you aren’t, there’s a great cheatsheet from DataCamp).\r\nEverything I just said above was just a whole lot of background information. What I really want to do here to contribute to the discussion around {spacyr} by sharing a tip for analyzing dependency relations from the output of spacy_parse(), which is {spacyr}’s main function that combines both the model-loading and text-processing stages of spaCy.\r\n\r\n\r\n\r\nFor illustration, I’ll be using the 8 State of the Union addresses by President Barack Obama from 2009-2016, which comes from the {sotu} package.\r\n\r\n\r\nlibrary(sotu)\r\ndoc <- tail(sotu::sotu_text, 8)\r\n\r\n# First 100 characters of each speech\r\nstrtrim(doc, 100)\r\n\r\n\r\n [1] \"Madam Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States--she's a\"\r\n [2] \"Madam Speaker, Vice President Biden, Members of Congress, distinguished guests, and fellow Americans\"\r\n [3] \"Mr. Speaker, Mr. Vice President, Members of Congress, distinguished guests, and fellow Americans: To\"\r\n [4] \"Mr. Speaker, Mr. Vice President, Members of Congress, distinguished guests, and fellow Americans: La\"\r\n [5] \"Please, everybody, have a seat. Mr. Speaker, Mr. Vice President, Members of Congress, fellow America\"\r\n [6] \"The President. Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans: Today in A\"\r\n [7] \"The President. Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans: We are 15 \"\r\n [8] \"Thank you. Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans: Tonight marks \"\r\n\r\nWe can pass this document to spacy_parse() to get back a dataframe of tokens and their attributes in tidy format, where each row (observation) is a token.3\r\n\r\n\r\nparsed <- spacy_parse(doc, dep = TRUE, entity = FALSE)\r\n\r\nhead(parsed, 10)\r\n\r\n\r\n\r\ndoc_id\r\n\r\n\r\nsentence_id\r\n\r\n\r\ntoken_id\r\n\r\n\r\ntoken\r\n\r\n\r\nlemma\r\n\r\n\r\npos\r\n\r\n\r\nhead_token_id\r\n\r\n\r\ndep_rel\r\n\r\n\r\ntext1\r\n\r\n\r\n1\r\n\r\n\r\n1\r\n\r\n\r\nMadam\r\n\r\n\r\nMadam\r\n\r\n\r\nPROPN\r\n\r\n\r\n2\r\n\r\n\r\ncompound\r\n\r\n\r\ntext1\r\n\r\n\r\n1\r\n\r\n\r\n2\r\n\r\n\r\nSpeaker\r\n\r\n\r\nSpeaker\r\n\r\n\r\nPROPN\r\n\r\n\r\n2\r\n\r\n\r\nROOT\r\n\r\n\r\ntext1\r\n\r\n\r\n1\r\n\r\n\r\n3\r\n\r\n\r\n,\r\n\r\n\r\n,\r\n\r\n\r\nPUNCT\r\n\r\n\r\n2\r\n\r\n\r\npunct\r\n\r\n\r\ntext1\r\n\r\n\r\n1\r\n\r\n\r\n4\r\n\r\n\r\nMr. \r\n\r\n\r\nMr. \r\n\r\n\r\nPROPN\r\n\r\n\r\n6\r\n\r\n\r\ncompound\r\n\r\n\r\ntext1\r\n\r\n\r\n1\r\n\r\n\r\n5\r\n\r\n\r\nVice\r\n\r\n\r\nVice\r\n\r\n\r\nPROPN\r\n\r\n\r\n6\r\n\r\n\r\ncompound\r\n\r\n\r\ntext1\r\n\r\n\r\n1\r\n\r\n\r\n6\r\n\r\n\r\nPresident\r\n\r\n\r\nPresident\r\n\r\n\r\nPROPN\r\n\r\n\r\n2\r\n\r\n\r\nappos\r\n\r\n\r\ntext1\r\n\r\n\r\n1\r\n\r\n\r\n7\r\n\r\n\r\n,\r\n\r\n\r\n,\r\n\r\n\r\nPUNCT\r\n\r\n\r\n6\r\n\r\n\r\npunct\r\n\r\n\r\ntext1\r\n\r\n\r\n1\r\n\r\n\r\n8\r\n\r\n\r\nMembers\r\n\r\n\r\nMembers\r\n\r\n\r\nPROPN\r\n\r\n\r\n6\r\n\r\n\r\nappos\r\n\r\n\r\ntext1\r\n\r\n\r\n1\r\n\r\n\r\n9\r\n\r\n\r\nof\r\n\r\n\r\nof\r\n\r\n\r\nADP\r\n\r\n\r\n8\r\n\r\n\r\nprep\r\n\r\n\r\ntext1\r\n\r\n\r\n1\r\n\r\n\r\n10\r\n\r\n\r\nCongress\r\n\r\n\r\nCongress\r\n\r\n\r\nPROPN\r\n\r\n\r\n9\r\n\r\n\r\npobj\r\n\r\n\r\nThis output format is great for plotting in R with the familiar packages. For example, we can make a bar plot of top adjectives used by Obama in his SOTU addresses with minimal changes to the output.\r\n\r\n\r\n# Load tidytext package for stopwords\r\nlibrary(tidytext)\r\n\r\nparsed %>%\r\n filter(pos == \"ADJ\",\r\n str_detect(lemma, \"^[:alpha:].*[:alpha:]$\"),\r\n !lemma %in% tidytext::stop_words$word) %>%\r\n count(lemma) %>% \r\n mutate(lemma = fct_reorder(str_to_title(lemma), n)) %>%\r\n top_n(15) %>% \r\n ggplot(aes(lemma, n)) +\r\n geom_col() +\r\n coord_flip() +\r\n labs(title = \"Top 15 Adjectives from President Obama's SOTU Addresses\",\r\n x = \"Adjective\", y = \"Count\") +\r\n theme_classic()\r\n\r\n\r\n\r\n\r\nThe Challenge\r\nBut what if we want to dig a little deeper in our analysis of adjectives? What if, for example, we were interested in the adjectives that were used to describe “America”\r\nBecause we set dep = TRUE when we called spacy_parse() earlier, we have information about dependencies in the dep_rel column and the head_token_id column. To be more precise, dep_rel is the .dep_ attribute from spaCy and head_token_id is the row index of the head token (.head attribute from spaCy) that is unique to the spacy_parse() output.\r\nFor example, let’s look at the the 298th sentence from Obama’s third SOTU address:\r\n\r\n\r\nexample_sentence <- parsed %>% \r\n filter(doc_id == \"text3\", sentence_id == 298) %>% \r\n pull(token) %>% \r\n str_c(collapse = \" \") %>% \r\n str_remove_all(\" (?=[:punct:])\")\r\n\r\nexample_sentence\r\n\r\n\r\n [1] \"Now, we've made great strides over the last 2 years in using technology and getting rid of waste.\"\r\n\r\nAnd here’s a visualization of the dependency parse made with displaCy. Sadly, displaCy is not a part of {spacyr}, so I’m just calling Python here using {reticulate}.\r\n\r\n######## Python Code ########\r\nimport spacy\r\nfrom spacy import displacy\r\nnlp = spacy.load('en_core_web_sm')\r\nexample_parsed = nlp(r.example_sentence)\r\n\r\n\r\ndisplacy.render(example_parsed, style = \"dep\")\r\n\r\n\r\n\r\n‘Now,ADVwePRON'veAUXmadeVERBgreatADJstridesNOUNoverADPtheDETlastADJ2NUMyearsNOUNinADPusingVERBtechnologyNOUNandCCONJgettingVERBridVERBofADPwaste.NOUNadvmodnsubjauxamoddobjprepdetamodnummodpobjpreppcompdobjccauxpassconjpreppobj’\r\n\r\n\r\n \r\nBasically, the task here is to find words like “competitive” in the example sentence above where the token is an adjective and its head is the word “America”, but it turns out harder than it seems.\r\nThe output of spacy_parse is set up such that every sentence stands on their own. More specifically speaking, the indices stored in token_id and head_token_id are local indices relative to each sentence.4 So while there are a total of 62791 tokens in parsed, the max token_id is 99, which is the index of the last token in the longest sentence.\r\nA strictly tidyverse approach (which has become a sort of a tunnel-vision for me) would be to split parsed by sentence and map a filter function to each sentence. There are two ways of going about this and both are pretty slow.\r\nThe first way is to explicitly split the dataframe into a list of dataframes at the sentence level then map the filter function, using group_split() then map_df():\r\n\r\n\r\ntic <- Sys.time()\r\n\r\nparsed %>%\r\n group_split(doc_id, sentence_id, .drop = FALSE) %>%\r\n map_df(~filter(., pos == \"ADJ\", slice(.x, head_token_id)$lemma == \"America\"))\r\n\r\n\r\n\r\ndoc_id\r\n\r\n\r\nsentence_id\r\n\r\n\r\ntoken_id\r\n\r\n\r\ntoken\r\n\r\n\r\nlemma\r\n\r\n\r\npos\r\n\r\n\r\nhead_token_id\r\n\r\n\r\ndep_rel\r\n\r\n\r\ntext3\r\n\r\n\r\n302\r\n\r\n\r\n33\r\n\r\n\r\ncompetitive\r\n\r\n\r\ncompetitive\r\n\r\n\r\nADJ\r\n\r\n\r\n34\r\n\r\n\r\namod\r\n\r\n\r\ntext4\r\n\r\n\r\n231\r\n\r\n\r\n33\r\n\r\n\r\nrural\r\n\r\n\r\nrural\r\n\r\n\r\nADJ\r\n\r\n\r\n34\r\n\r\n\r\namod\r\n\r\n\r\ntext5\r\n\r\n\r\n245\r\n\r\n\r\n2\r\n\r\n\r\nstronger\r\n\r\n\r\nstrong\r\n\r\n\r\nADJ\r\n\r\n\r\n3\r\n\r\n\r\namod\r\n\r\n\r\ntext6\r\n\r\n\r\n340\r\n\r\n\r\n18\r\n\r\n\r\nstrong\r\n\r\n\r\nstrong\r\n\r\n\r\nADJ\r\n\r\n\r\n21\r\n\r\n\r\namod\r\n\r\n\r\ntext7\r\n\r\n\r\n317\r\n\r\n\r\n23\r\n\r\n\r\nliberal\r\n\r\n\r\nliberal\r\n\r\n\r\nADJ\r\n\r\n\r\n24\r\n\r\n\r\namod\r\n\r\n\r\ntext7\r\n\r\n\r\n317\r\n\r\n\r\n27\r\n\r\n\r\nconservative\r\n\r\n\r\nconservative\r\n\r\n\r\nADJ\r\n\r\n\r\n28\r\n\r\n\r\namod\r\n\r\n\r\nSys.time() - tic\r\n\r\n\r\n Time difference of 12.0861 secs\r\n\r\nThe second way is to implicitly declare a grouping by sentence and then map the filter function, using group_by() then group_map():\r\n\r\n\r\ntic <- Sys.time()\r\n\r\nparsed %>%\r\n group_by(doc_id, sentence_id) %>%\r\n group_map(~filter(., pos == \"ADJ\", slice(.x, head_token_id)$lemma == \"America\"), .keep = TRUE) %>%\r\n bind_rows()\r\n\r\n\r\n\r\ndoc_id\r\n\r\n\r\nsentence_id\r\n\r\n\r\ntoken_id\r\n\r\n\r\ntoken\r\n\r\n\r\nlemma\r\n\r\n\r\npos\r\n\r\n\r\nhead_token_id\r\n\r\n\r\ndep_rel\r\n\r\n\r\ntext3\r\n\r\n\r\n302\r\n\r\n\r\n33\r\n\r\n\r\ncompetitive\r\n\r\n\r\ncompetitive\r\n\r\n\r\nADJ\r\n\r\n\r\n34\r\n\r\n\r\namod\r\n\r\n\r\ntext4\r\n\r\n\r\n231\r\n\r\n\r\n33\r\n\r\n\r\nrural\r\n\r\n\r\nrural\r\n\r\n\r\nADJ\r\n\r\n\r\n34\r\n\r\n\r\namod\r\n\r\n\r\ntext5\r\n\r\n\r\n245\r\n\r\n\r\n2\r\n\r\n\r\nstronger\r\n\r\n\r\nstrong\r\n\r\n\r\nADJ\r\n\r\n\r\n3\r\n\r\n\r\namod\r\n\r\n\r\ntext6\r\n\r\n\r\n340\r\n\r\n\r\n18\r\n\r\n\r\nstrong\r\n\r\n\r\nstrong\r\n\r\n\r\nADJ\r\n\r\n\r\n21\r\n\r\n\r\namod\r\n\r\n\r\ntext7\r\n\r\n\r\n317\r\n\r\n\r\n23\r\n\r\n\r\nliberal\r\n\r\n\r\nliberal\r\n\r\n\r\nADJ\r\n\r\n\r\n24\r\n\r\n\r\namod\r\n\r\n\r\ntext7\r\n\r\n\r\n317\r\n\r\n\r\n27\r\n\r\n\r\nconservative\r\n\r\n\r\nconservative\r\n\r\n\r\nADJ\r\n\r\n\r\n28\r\n\r\n\r\namod\r\n\r\n\r\nSys.time() - tic\r\n\r\n\r\n Time difference of 11.63191 secs\r\n\r\nBoth ways give us the result we want, but it’s significantly slower than what we could quickly and easily do in Python.\r\n\r\n######## Python Code ########\r\ndoc = nlp(' '.join(r.doc))\r\n\r\nimport time\r\ntic = time.time()\r\n\r\n[token.text for token in doc if token.pos_ == \"ADJ\" and token.head.lemma_ == \"America\"]\r\n ['competitive', 'rural', 'stronger', 'strong', 'liberal', 'conservative']\r\ntime.time() - tic\r\n 0.04711294174194336\r\n\r\nA Work-around\r\nWhat would really help here is if we had global indices for tokens and head tokens, so that we can directly index a head from a token without going through the trouble of figuring out how sentences are organized in the dataframe.\r\nSo here’s my take on doing this:\r\n\r\n\r\n# Calculate global indices from local indices\r\nglobal_index <- parsed %>% \r\n group_by(doc_id, sentence_id) %>% \r\n # add token counts for each sentence\r\n add_count() %>% \r\n ungroup() %>% \r\n select(doc_id, sentence_id, n) %>% \r\n distinct() %>%\r\n # take the cumulative sum and shift 1 to the right (fill first index with 0)\r\n mutate(n = c(0, cumsum(n)[1:n()-1]))\r\n\r\n# Clean the output\r\nparsed2 <- parsed %>% \r\n inner_join(global_index, by = c(\"doc_id\", \"sentence_id\")) %>% \r\n mutate(token_id_global = token_id + n,\r\n head_token_id_global = head_token_id + n) %>% \r\n relocate(token_id_global, .after = token_id) %>% \r\n relocate(head_token_id_global, .after = head_token_id) %>% \r\n select(-n)\r\n\r\n\r\n\r\nThis adds two colums - token_id_global and head_token_id_global - that stores indices that range over the entire dataframe. Here’s a sample of the new dataframe to demonstrate:\r\n\r\n\r\nsample_n(parsed2, 10)\r\n\r\n\r\n\r\ndoc_id\r\n\r\n\r\nsentence_id\r\n\r\n\r\ntoken_id\r\n\r\n\r\ntoken_id_global\r\n\r\n\r\ntoken\r\n\r\n\r\nlemma\r\n\r\n\r\npos\r\n\r\n\r\nhead_token_id\r\n\r\n\r\nhead_token_id_global\r\n\r\n\r\ndep_rel\r\n\r\n\r\ntext7\r\n\r\n\r\n37\r\n\r\n\r\n11\r\n\r\n\r\n48388\r\n\r\n\r\nto\r\n\r\n\r\nto\r\n\r\n\r\nADP\r\n\r\n\r\n10\r\n\r\n\r\n48387\r\n\r\n\r\ndative\r\n\r\n\r\ntext3\r\n\r\n\r\n154\r\n\r\n\r\n3\r\n\r\n\r\n18253\r\n\r\n\r\nthe\r\n\r\n\r\nthe\r\n\r\n\r\nDET\r\n\r\n\r\n5\r\n\r\n\r\n18255\r\n\r\n\r\ndet\r\n\r\n\r\ntext1\r\n\r\n\r\n44\r\n\r\n\r\n2\r\n\r\n\r\n1036\r\n\r\n\r\nof\r\n\r\n\r\nof\r\n\r\n\r\nADP\r\n\r\n\r\n1\r\n\r\n\r\n1035\r\n\r\n\r\npcomp\r\n\r\n\r\ntext8\r\n\r\n\r\n17\r\n\r\n\r\n6\r\n\r\n\r\n56007\r\n\r\n\r\nto\r\n\r\n\r\nto\r\n\r\n\r\nADP\r\n\r\n\r\n5\r\n\r\n\r\n56006\r\n\r\n\r\nprep\r\n\r\n\r\ntext5\r\n\r\n\r\n316\r\n\r\n\r\n3\r\n\r\n\r\n38631\r\n\r\n\r\nthis\r\n\r\n\r\nthis\r\n\r\n\r\nDET\r\n\r\n\r\n4\r\n\r\n\r\n38632\r\n\r\n\r\nnsubj\r\n\r\n\r\ntext6\r\n\r\n\r\n340\r\n\r\n\r\n15\r\n\r\n\r\n46579\r\n\r\n\r\nthen\r\n\r\n\r\nthen\r\n\r\n\r\nADV\r\n\r\n\r\n23\r\n\r\n\r\n46587\r\n\r\n\r\nadvmod\r\n\r\n\r\ntext4\r\n\r\n\r\n162\r\n\r\n\r\n7\r\n\r\n\r\n26488\r\n\r\n\r\n\r\n\r\n\r\n\r\nSPACE\r\n\r\n\r\n6\r\n\r\n\r\n26487\r\n\r\n\r\n\r\n\r\ntext6\r\n\r\n\r\n87\r\n\r\n\r\n13\r\n\r\n\r\n41466\r\n\r\n\r\nto\r\n\r\n\r\nto\r\n\r\n\r\nPART\r\n\r\n\r\n14\r\n\r\n\r\n41467\r\n\r\n\r\naux\r\n\r\n\r\ntext8\r\n\r\n\r\n260\r\n\r\n\r\n17\r\n\r\n\r\n60517\r\n\r\n\r\npositioned\r\n\r\n\r\nposition\r\n\r\n\r\nVERB\r\n\r\n\r\n9\r\n\r\n\r\n60509\r\n\r\n\r\nconj\r\n\r\n\r\ntext8\r\n\r\n\r\n281\r\n\r\n\r\n2\r\n\r\n\r\n60856\r\n\r\n\r\nis\r\n\r\n\r\nbe\r\n\r\n\r\nAUX\r\n\r\n\r\n11\r\n\r\n\r\n60865\r\n\r\n\r\nccomp\r\n\r\n\r\nAnd since this process isn’t destructive, we actually don’t need to assign the output to a new object. This is great because we can flexibly incorporate it into the pipeline workflow.\r\nHere is my solution wrapped in a function:5\r\n\r\n\r\nadd_global_index <- function(spacy_parsed) {\r\n \r\n global_index <- spacy_parsed %>% \r\n group_by(doc_id, sentence_id) %>% \r\n add_count() %>% \r\n ungroup() %>% \r\n select(doc_id, sentence_id, n) %>% \r\n distinct() %>%\r\n mutate(n = c(0, cumsum(n)[1:n()-1]))\r\n \r\n spacy_parsed %>% \r\n inner_join(global_index, by = c(\"doc_id\", \"sentence_id\")) %>% \r\n mutate(token_id_global = token_id + n,\r\n head_token_id_global = head_token_id + n) %>% \r\n relocate(token_id_global, .after = token_id) %>% \r\n relocate(head_token_id_global, .after = head_token_id) %>% \r\n select(-n)\r\n \r\n}\r\n\r\n\r\n\r\nIn action:\r\n\r\n\r\n# Find adjectives describing \"America\"\r\nparsed %>% \r\n add_global_index() %>% \r\n filter(pos == \"ADJ\", slice(., head_token_id_global)$lemma == \"America\")\r\n\r\n\r\n\r\n\r\n\r\ndoc_id\r\n\r\n\r\nsentence_id\r\n\r\n\r\ntoken_id\r\n\r\n\r\ntoken_id_global\r\n\r\n\r\ntoken\r\n\r\n\r\nlemma\r\n\r\n\r\npos\r\n\r\n\r\nhead_token_id\r\n\r\n\r\nhead_token_id_global\r\n\r\n\r\ndep_rel\r\n\r\n\r\ntext3\r\n\r\n\r\n302\r\n\r\n\r\n33\r\n\r\n\r\n21359\r\n\r\n\r\ncompetitive\r\n\r\n\r\ncompetitive\r\n\r\n\r\nADJ\r\n\r\n\r\n34\r\n\r\n\r\n21360\r\n\r\n\r\namod\r\n\r\n\r\ntext4\r\n\r\n\r\n231\r\n\r\n\r\n33\r\n\r\n\r\n27772\r\n\r\n\r\nrural\r\n\r\n\r\nrural\r\n\r\n\r\nADJ\r\n\r\n\r\n34\r\n\r\n\r\n27773\r\n\r\n\r\namod\r\n\r\n\r\ntext5\r\n\r\n\r\n245\r\n\r\n\r\n2\r\n\r\n\r\n36825\r\n\r\n\r\nstronger\r\n\r\n\r\nstrong\r\n\r\n\r\nADJ\r\n\r\n\r\n3\r\n\r\n\r\n36826\r\n\r\n\r\namod\r\n\r\n\r\ntext6\r\n\r\n\r\n340\r\n\r\n\r\n18\r\n\r\n\r\n46582\r\n\r\n\r\nstrong\r\n\r\n\r\nstrong\r\n\r\n\r\nADJ\r\n\r\n\r\n21\r\n\r\n\r\n46585\r\n\r\n\r\namod\r\n\r\n\r\ntext7\r\n\r\n\r\n317\r\n\r\n\r\n23\r\n\r\n\r\n54054\r\n\r\n\r\nliberal\r\n\r\n\r\nliberal\r\n\r\n\r\nADJ\r\n\r\n\r\n24\r\n\r\n\r\n54055\r\n\r\n\r\namod\r\n\r\n\r\ntext7\r\n\r\n\r\n317\r\n\r\n\r\n27\r\n\r\n\r\n54058\r\n\r\n\r\nconservative\r\n\r\n\r\nconservative\r\n\r\n\r\nADJ\r\n\r\n\r\n28\r\n\r\n\r\n54059\r\n\r\n\r\namod\r\n\r\n\r\n\r\n\r\n# Find adjectives describing \"America\" inside a prepositional phrase\r\nparsed %>% \r\n add_global_index() %>% \r\n filter(pos == \"ADJ\", slice(., head_token_id_global)$lemma == \"America\",\r\n slice(., slice(., head_token_id_global)$head_token_id_global)$dep_rel == \"prep\")\r\n\r\n\r\n\r\n\r\n\r\ndoc_id\r\n\r\n\r\nsentence_id\r\n\r\n\r\ntoken_id\r\n\r\n\r\ntoken_id_global\r\n\r\n\r\ntoken\r\n\r\n\r\nlemma\r\n\r\n\r\npos\r\n\r\n\r\nhead_token_id\r\n\r\n\r\nhead_token_id_global\r\n\r\n\r\ndep_rel\r\n\r\n\r\ntext3\r\n\r\n\r\n302\r\n\r\n\r\n33\r\n\r\n\r\n21359\r\n\r\n\r\ncompetitive\r\n\r\n\r\ncompetitive\r\n\r\n\r\nADJ\r\n\r\n\r\n34\r\n\r\n\r\n21360\r\n\r\n\r\namod\r\n\r\n\r\ntext4\r\n\r\n\r\n231\r\n\r\n\r\n33\r\n\r\n\r\n27772\r\n\r\n\r\nrural\r\n\r\n\r\nrural\r\n\r\n\r\nADJ\r\n\r\n\r\n34\r\n\r\n\r\n27773\r\n\r\n\r\namod\r\n\r\n\r\nPerformance:\r\n\r\n\r\ntest <- function(){\r\n parsed %>% \r\n add_global_index() %>% \r\n filter(pos == \"ADJ\", slice(., head_token_id_global)$lemma == \"America\")\r\n}\r\n\r\nprint(microbenchmark::microbenchmark(test(), unit = \"s\"))\r\n\r\n\r\n Unit: seconds\r\n expr min lq mean median uq max neval\r\n test() 0.0909179 0.09831835 0.1109029 0.10356 0.1122529 0.3129214 100\r\n\r\nMuch better!\r\n \r\nSession Info\r\n\r\n R version 4.0.3 (2020-10-10)\r\n Platform: x86_64-w64-mingw32/x64 (64-bit)\r\n Running under: Windows 10 x64 (build 18363)\r\n \r\n Matrix products: default\r\n \r\n locale:\r\n [1] LC_COLLATE=English_United States.1252 \r\n [2] LC_CTYPE=English_United States.1252 \r\n [3] LC_MONETARY=English_United States.1252\r\n [4] LC_NUMERIC=C \r\n [5] LC_TIME=English_United States.1252 \r\n \r\n attached base packages:\r\n [1] stats graphics grDevices datasets utils methods base \r\n \r\n other attached packages:\r\n [1] tidytext_0.2.6 sotu_1.0.2 reticulate_1.18 stringr_1.4.0 \r\n [5] spacyr_1.2.1 forcats_0.5.0 ggplot2_3.3.2 purrr_0.3.4 \r\n [9] dplyr_1.0.2 printr_0.1 \r\n \r\n loaded via a namespace (and not attached):\r\n [1] Rcpp_1.0.5 highr_0.8 pillar_1.4.6 \r\n [4] compiler_4.0.3 tokenizers_0.2.1 tools_4.0.3 \r\n [7] digest_0.6.26 downlit_0.2.0 jsonlite_1.7.1 \r\n [10] lattice_0.20-41 evaluate_0.14 lifecycle_0.2.0 \r\n [13] tibble_3.0.4 gtable_0.3.0 pkgconfig_2.0.3 \r\n [16] rlang_0.4.8 Matrix_1.2-18 rstudioapi_0.11 \r\n [19] microbenchmark_1.4-7 distill_1.0 yaml_2.2.1 \r\n [22] xfun_0.18 janeaustenr_0.1.5 withr_2.2.0 \r\n [25] knitr_1.30 rappdirs_0.3.1 generics_0.0.2 \r\n [28] vctrs_0.3.4 grid_4.0.3 tidyselect_1.1.0 \r\n [31] data.table_1.13.2 glue_1.4.2 R6_2.4.1 \r\n [34] fansi_0.4.1 rmarkdown_2.5 farver_2.0.3 \r\n [37] magrittr_1.5.0.9000 SnowballC_0.7.0 prismatic_0.2.0 \r\n [40] scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0 \r\n [43] gt_0.2.2 colorspace_1.4-1 renv_0.12.0 \r\n [46] labeling_0.4.2 stringi_1.5.3 munsell_0.5.0 \r\n [49] crayon_1.3.4\r\n\r\n\r\nThere are less than 30 posts about it on StackOverflow, for example.↩︎\r\nI personally found it very easy to pick up vector comprehension in Python after working with purrr::map, for example.↩︎\r\nThe argument entity = FALSE is the same as disable = ['ner'] in spacy.load() in Python. I did this to save computation time.↩︎\r\nThis format is shared across other NLP packages in R based on spacCy, like {cleanNLP}↩︎\r\nThis would need to be tweaked a bit if you want to use it for the output of {cleanNLP} because the column for the local index of token heads, tid_source, is 0 when the token is the ROOT, as opposed to its own token index, which is the case in {spacyr}. You could add something like mutate(tid_source = ifelse(tid_source == 0, tid, tid_source) to the beginning of the pipeline to address this.↩︎\r\n", "preview": "posts/2020-06-25-indexing-tip-for-spacyr/preview.png", - "last_modified": "2022-11-13T09:16:55-05:00", + "last_modified": "2022-11-13T06:16:55-08:00", "input_file": {}, "preview_width": 1920, "preview_height": 686 @@ -804,7 +804,7 @@ ], "contents": "\r\n\r\n\r\n\r\nWhat is Corr in the output of mixed-effects models?\r\nWhen fitting mixed effects regression models, especially those that try to keep it “maximal” (as per Barr, Levy, Scheepers, & Tily 2013), the random effects in the output of the model sometimes displays a column named Corr where some rows have numbers that range from -1 to 1.\r\n\r\nIt’s easy to guess that Corr stands for correlation and that the numbers in the column are correlation coefficients. If there are multiple predictors and the random effects structure includes more than one of those terms (e.g., (1 + Effect_1 * Effect_2 | Subject)), we even get another clue for this from the way that the Corr values spread out in the shape of a right triangle, much like in a correlation matrix.\r\n\r\nDespite the fact that we’re bound to have encountered this at some point when working with or reading about mixed effects models, I’ve found that there aren’t many beginner-friendly material explaining what they are - there are one-paragraph StackExchange answers and dense statistics papers, but not much in between in terms of comprehensiveness.\r\nSo here are my compiled notes on correlation parameters in linear mixed effects models that I’ve made for myself (with a basic knowledge of LMEMs).\r\nBefore we get started\r\nOur toy data and model\r\n\r\n\r\n\r\nFor the purposes of this discussion, I have created a toy experiment data (the code used to generate it is attached at the bottom).\r\nThe dataset toydata has 1,920 rows with the following columns:\r\nSubject: The subject ID, which ranges from 1 to 80\r\nItem: The item ID, which ranges from 1 to 24\r\nCondition: The experimental condition, which is either Control or Treatment\r\nResponse: A continuous observed variable\r\n\r\n\r\n\r\n\r\n\r\ntoydata\r\n\r\n\r\n # A tibble: 1,920 x 4\r\n Subject Item Condition Response\r\n \r\n 1 1 1 Control 226.\r\n 2 1 2 Treatment 300.\r\n 3 1 3 Control 239.\r\n 4 1 4 Treatment 262.\r\n 5 1 5 Control 241.\r\n 6 1 6 Treatment 264.\r\n 7 1 7 Control 237.\r\n 8 1 8 Treatment 230.\r\n 9 1 9 Control 229.\r\n 10 1 10 Treatment 283.\r\n # ... with 1,910 more rows\r\n\r\nImagine toydata to be the results from a very simple experiment. In this imaginary experiment, there are 80 subjects and each subject is tested on 24 items, resulting in a total of 1,920 trials/observations. This is a within-partipant design, so each participant sees 12 of the items in the Control condition and the other 12 in the Treatment condition.\r\nLet’s say that with our toy data, we want to know whether Condition has a positive effect on Response. Our goal by using mixed-effects modeling is to isolate the effect of Condition on Response (fixed effect), while controlling for by-item and by-subject variations (random effects). So let’s fit a simple linear mixed-effects model with the maximal random effects structure, with random intercepts and slopes for Condition by Subject and Item:\r\n\r\n\r\nmodel <- lmer(Response ~ Condition + (1+Condition|Subject) + (1+Condition|Item),\r\n REML = FALSE, control = lmerControl('bobyqa'), data = toydata)\r\n\r\n\r\n\r\nAnd let’s really quickly check model assumptions:\r\n\r\n\r\nperformance::check_model(model)\r\n\r\n\r\n\r\n\r\nEverything looks okay, so let’s look at the model output:\r\n\r\n\r\nsummary(model)\r\n\r\n\r\n Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's\r\n method [lmerModLmerTest]\r\n Formula: Response ~ Condition + (1 + Condition | Subject) + (1 + Condition | \r\n Item)\r\n Data: toydata\r\n Control: lmerControl(\"bobyqa\")\r\n \r\n AIC BIC logLik deviance df.resid \r\n 13267 13317 -6624 13249 1911 \r\n \r\n Scaled residuals: \r\n Min 1Q Median 3Q Max \r\n -3.176 -0.627 -0.065 0.576 4.864 \r\n \r\n Random effects:\r\n Groups Name Variance Std.Dev. Corr\r\n Subject (Intercept) 637.1 25.24 \r\n ConditionTreatment 108.1 10.40 0.85\r\n Item (Intercept) 44.4 6.66 \r\n ConditionTreatment 308.2 17.56 0.14\r\n Residual 37.4 6.11 \r\n Number of obs: 1920, groups: Subject, 80; Item, 24\r\n \r\n Fixed effects:\r\n Estimate Std. Error df t value Pr(>|t|) \r\n (Intercept) 209.64 3.14 100.84 66.8 < 2e-16 ***\r\n ConditionTreatment 35.88 3.78 28.52 9.5 2.5e-10 ***\r\n ---\r\n Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\r\n \r\n Correlation of Fixed Effects:\r\n (Intr)\r\n CndtnTrtmnt 0.291\r\n\r\nThe high t-statistic and low p-value of ConditionTreatment in the fixed effects output suggests that our data is extremely unlikely given the null hypothesis that Condition has no effect on Response (and the sign of the estimate further suggests a positive effect of the Treatment condition on Response compared to the Control condition). Therefore, this is strong evidence in support of our hypothesis.\r\nHere’s a nicer-looking summary table made with tab_model() from the {sjPlot} package. I will keep using this format from this point on. Not only is this nicer to look at, the notations used here (like \\(\\tau_{00}\\) and \\(\\rho_{01}\\)) are from Barr et al. (2013), so it’s easier to connect the pieces IMO.\r\n\r\n\r\nsjPlot::tab_model(model)\r\n\r\n\r\n\r\n \r\n\r\n\r\nResponse\r\n\r\n\r\nPredictors\r\n\r\n\r\nEstimates\r\n\r\n\r\nCI\r\n\r\n\r\np\r\n\r\n\r\n(Intercept)\r\n\r\n\r\n209.64\r\n\r\n\r\n203.49 – 215.80\r\n\r\n\r\n<0.001\r\n\r\n\r\nCondition [Treatment]\r\n\r\n\r\n35.88\r\n\r\n\r\n28.48 – 43.28\r\n\r\n\r\n<0.001\r\n\r\n\r\nRandom Effects\r\n\r\n\r\nσ2\r\n\r\n37.37\r\n\r\n\r\nτ00Subject\r\n\r\n637.14\r\n\r\n\r\nτ00Item\r\n\r\n44.39\r\n\r\n\r\nτ11Subject.ConditionTreatment\r\n\r\n108.10\r\n\r\n\r\nτ11Item.ConditionTreatment\r\n\r\n308.21\r\n\r\n\r\nρ01Subject\r\n\r\n0.85\r\n\r\n\r\nρ01Item\r\n\r\n0.14\r\n\r\n\r\nICC\r\n\r\n\r\n0.97\r\n\r\n\r\nN Subject\r\n\r\n80\r\n\r\n\r\nN Item\r\n\r\n24\r\n\r\n\r\nObservations\r\n\r\n\r\n1920\r\n\r\n\r\nMarginal R2 / Conditional R2\r\n\r\n0.216 / 0.975\r\n\r\n\r\nOkay so our finding is great and all but we’re more interested in the random effects here so let’s look at that.\r\nThe random effects (a review)\r\nWe can isolate the random effects from the model using VarCorr():\r\n\r\n\r\nVarCorr(model)\r\n\r\n\r\n Groups Name Std.Dev. Corr\r\n Subject (Intercept) 25.24 \r\n ConditionTreatment 10.40 0.85\r\n Item (Intercept) 6.66 \r\n ConditionTreatment 17.56 0.14\r\n Residual 6.11\r\n\r\nLet’s ignore the Corr column for a moment and talk about Std.Dev first.\r\nThe Std.Dev. values for the Subject random effects group suggest that the variation in the subject intercepts are fitted with a larger standard deviation of 25.24 and that the variation in subject slopes for Condition are fitted with a smaller standard deviation of 10.4.\r\nLet’s plot the by-subject variation:\r\n\r\n\r\n\r\nWe see a clear variation in the intercepts, and a subtler variation in the slopes. This is overall pretty consistent with the stand deviation values for subject random effects that we found earlier.\r\nLet’s plot the by-item variation as well:\r\n\r\n\r\n\r\nHere, we see the opposite: a clear variation in the slopes, and a subtler variation in the intercepts. Again, this is overall pretty consistent given the item random effects output: a larger standard deviation for the item slopes and a smaller standard deviation for item intercepts, as we found earlier.\r\nNow that we’ve reviewed Std.Dev., let’s talk about Corr, which is our main focus.\r\nThe Correlation Parameter\r\nLooking at the Corr column now, we see two numbers: 0.85 within the Subject random effects group and 0.14 within the Item random effects group.\r\n\r\n Groups Name Std.Dev. Corr\r\n Subject (Intercept) 25.24 \r\n ConditionTreatment 10.40 0.85\r\n Item (Intercept) 6.66 \r\n ConditionTreatment 17.56 0.14\r\n Residual 6.11\r\n\r\nAs you might have guessed, they have something to do with the correlation between random effects (intercept and slope) within each group (subject and item).\r\nBut if we extract the subject random effects, for example, and measure the correlation between subject intercepts and subject slopes, we get a slightly different number:\r\n\r\n\r\n# Extract by-subject intercept and slope\r\nranef_subj <- ranef(model)$Subject\r\n\r\nranef_subj_intercepts <- ranef_subj$`(Intercept)`\r\nranef_subj_slopes <- ranef_subj$ConditionTreatment\r\n\r\n# Calculate correlation\r\ncor(ranef_subj_intercepts, ranef_subj_slopes)\r\n\r\n\r\n [1] 0.88\r\n\r\nIn fact, we get slightly different values for the standard deviation of the subject random intercepts and slopes as well:\r\n\r\n\r\n# Calculate the standard deviation of by-subject intercepts and slopes\r\nsummarize_all(ranef_subj, sd)\r\n\r\n\r\n (Intercept) ConditionTreatment\r\n 1 25.3 10.18\r\n\r\nSo it looks like what the model isn’t just taking the random effects in our toydata dataset and calculating their variations and correlations. So then what is it doing?\r\nHow the model estimates random effects\r\nWhat we have to keep in mind when doing mixed effects modeling is that the model is fitting the random effects in the data, rather than just describing them. More specifically, the model is estimating population parameters that generated the sample of random effects that are seen in the data.1\r\nAnd in fact that’s exactly what we want to do. We don’t care about how individual subjects or items behave in our experiment, in the sense that we don’t care how fast John Doe presses a button, for example. We don’t want to predict John Doe’s behavior, but we do want to estimate, using data from John Doe, Jane Doe, Average Joe, and other participants from our experiment, the overall distribution of people’s idiosyncratic tendencies so that we can statistically control for them to get a better estimate for the fixed effects that we care about.\r\nTake the random intercepts by subject for example. The model estimated the distribution of subject intercepts to follow a normal distribution with a standard deviation of 25.24, which is a an estimate of the Population. The variation in subject intercepts in the data itself that we manually calculated above (25.3) is the Sample standard deviation. Of course, if the sample follows a normal distribution and if we also assume the population to be normally distributed, the sample variance should be the best estimate for the population variance. And in fact they do end up being very close!\r\nSo the numbers in the Std.Dev. colum are the model’s fit for the variation within each random effect.\r\nWith this, we now have a better understanding of the numbers in the Corr column: they are the model’s fit for the correlation between random effects.\r\nTo go more in depth with our discussion, let’s plot the intercepts and slopes for our 80 subjects:\r\n\r\n\r\n\r\nRecall that when we manually calculated the correlation between subject intercepts and subject slopes within our sample of 80 subjects in the data, we got 0.88. That is in fact what is shown by the plot above.\r\nAnd as we discussed earlier, the numbers in the Std.Dev. column are the model’s fit for the variation within each random effect. So the model is saying that the variation for subject intercepts follows a normal distribution with mean = 0 and standard deviation = 25.24. Likewise, the model is saying that the variation for subject slopes follows a normal distribution with mean = 0 and standard deviation = 10.4.2\r\nSo the model estimates these two distributions - one for subject slopes and one for subject intercepts - to capture the overall distribution of subject random effects.\r\nThis is illustrated below, where the normal curve at the top is the distribution of subject intercepts estimated by the model, and the normal curve to the right is the distribution of subject slopes estimated by the model. For every subject, their intercepts and slopes are understood to be generated from these two underlying parameters:\r\n\r\n\r\n\r\nBut is specifying each distribution for intercept and item enough to capture the overall distribution of subject random effects?\r\nOne way to test this is to work backwards and generate some observations from the model’s parameters. The idea here is this: if the two normal distributions (one for subject intercept and one for subject slope) can sufficiently capture the distribution of the subject random effects, then sampling from them should yield a distribution that is in the shape of the actual distribution in our data.\r\nLet’s draw 80 samples of subject intercepts and subject slopes from their respective distributions and then plot them together. Here’s one result:\r\n\r\n\r\n\r\nThese points are consistent with what the two distributions predict: there are more points towards the center (the means of the distributions) and less points towards the corners (the tails of the distributions).\r\nBut this doesn’t look like the actual distribution of our subject random effects.\r\nWe can repeat this sampling procedure many times, but none of them look close to the distribution of the subject random effects in our data:\r\nFor clearer comparison, here is the plot of the subject random effects in our data again.\r\n \r\n\r\n\r\n\r\n\r\n\r\n\r\nSo what are we missing here?\r\nWell, what’s missing here is the correlation between the intercepts and slopes that I conveniently left out to demonstrate that just specifying the individual distributions for subject intercepts and subject slopes poorly captures the actual distribution of subject random effects in our data.\r\nIn more technical terms, treating the two random effects as independently sampled from their respective distributions fails to fit the data well because the two random effects are highly correlated. They should instead be treated as being jointly sampled from a bivariate distribution\r\nAnd that’s exactly what adding the correlation parameter does. Let’s break this down.\r\nWhen we say that two variables are independently sampled from two distributions (as we just did above), then their joint distribution looks something like this, where most of the data is expected to fall within the grey shaded ellipse:\r\n\r\n\r\n\r\nThis is clearly a bad fit for the distribution of our subject random effects…\r\n\r\n\r\n\r\n… because the distribution of the subject random effects actually takes the shape of a tilted ellipse instead (dotted outline):\r\n\r\n\r\n\r\nIn fact, we cannot generate any distribution of a tilted shape with just two independent distributions for each variable. We need to factor in covariation to capture the correlation between variables. Barr et al. (2013) illustrates this clearly in the supplementary materials to their paper. You can see from the plot below (originally Figure 1 on the linked page) that without the correlation parameter, you can only capture distributions that are symmetrical with respect to the axes (the darker ellipses). However, once you add in a correlation parameter (\\(\\rho\\)), you can capture distributions that are in the “tilted” shape (the lighter ellipses) like the distribution of our highly correlated subject intercepts and subject slopes.\r\n\r\n\r\n\r\nFigure 1: Figure from Barr et al. (2013)\r\n\r\n\r\n\r\nPutting it all together\r\nHere’s the output of model again:\r\n\r\n\r\n \r\n\r\n\r\nResponse\r\n\r\n\r\nPredictors\r\n\r\n\r\nEstimates\r\n\r\n\r\nCI\r\n\r\n\r\np\r\n\r\n\r\n(Intercept)\r\n\r\n\r\n209.64\r\n\r\n\r\n203.49 – 215.80\r\n\r\n\r\n<0.001\r\n\r\n\r\nCondition [Treatment]\r\n\r\n\r\n35.88\r\n\r\n\r\n28.48 – 43.28\r\n\r\n\r\n<0.001\r\n\r\n\r\nRandom Effects\r\n\r\n\r\nσ2\r\n\r\n37.37\r\n\r\n\r\nτ00Subject\r\n\r\n637.14\r\n\r\n\r\nτ00Item\r\n\r\n44.39\r\n\r\n\r\nτ11Subject.ConditionTreatment\r\n\r\n108.10\r\n\r\n\r\nτ11Item.ConditionTreatment\r\n\r\n308.21\r\n\r\n\r\nρ01Subject\r\n\r\n0.85\r\n\r\n\r\nρ01Item\r\n\r\n0.14\r\n\r\n\r\nICC\r\n\r\n\r\n0.97\r\n\r\n\r\nN Subject\r\n\r\n80\r\n\r\n\r\nN Item\r\n\r\n24\r\n\r\n\r\nObservations\r\n\r\n\r\n1920\r\n\r\n\r\nMarginal R2 / Conditional R2\r\n\r\n0.216 / 0.975\r\n\r\n\r\nAnd let’s keep focusing on the subject random effects for now.\r\nThere are three parameters that the model estimated to capture the by-subject variation:\r\nThe variation (Std.Dev.) for subject intercept\r\nThe variation (Std.Dev.) for subject slope\r\nThe correlation (Corr) between subject intercept and subject slope.\r\nWith these three parameters, the model is defining a bivariate normal distribution, from which subject intercepts (\\(S_{0s}\\)) and subject slopes (\\(S_{1s}\\)) are sampled from (Equation 3 from Barr et al., 2013):\r\n\\[(S_{0s}, S_{1s})\\ \\sim\\ N(0,\\begin{bmatrix}\\tau_{00}^2 & \\rho\\ \\tau_{00}^2 \\tau_{11}^2 \\\\ \\rho\\ \\tau_{00}^2 \\tau_{11}^2 & \\tau_{11}^2 \\end{bmatrix})\\]\r\nFor the variance-covariance matrix, we can substitute the standard deviation for the subject intercept \\(\\tau_{00}^2\\) with 25.24, the standard deviation for the subject slope \\(\\tau_{11}\\) with 10.4, and the correlation \\(\\rho\\) with 0.85 to get the following:\r\n\\[(S_{0s}, S_{1s})\\ \\sim\\ N(0,\\begin{bmatrix}25.24^2 & 0.85\\ \\times\\ 25.24\\ \\times\\ 10.4 \\\\ 0.85\\ \\times\\ 25.24\\ \\times\\ 10.4 & 10.4^2 \\end{bmatrix})\\]\r\nIf subject intercepts and subject slopes are jointly sampled from the above distribution, most observations should fall within this grey area:\r\n\r\n\r\n\r\n\r\nNotice how I added \\(\\rho = 0.85\\) in this top right corner of this plot.\r\nLet’s again repeatedly sample from this new bivariate distribution (which you can do with mvrnorm() from the {MASS} package) to check:\r\n\r\n\r\n\r\nLike we expected, this new distribution generates observations of subject slopes and subject intercepts that are highly correlated. But more importantly, the distribution of subject random effects in our data looks like it could be one of these samples, meaning that this bivariate normal distribution fits our data well.\r\nGood thing that we had the model estimate this parameter by specifying the random effects structure for subjects as (1 + Condition | Subject) in our model formula!\r\nWhat if we leave out this correlation parameter? Would it significantly worsen model fit?\r\nWe can check by building another model without the correlation term between the subject random effects and comparing it with our original model.3\r\nThe no_subj_cor_model below is a depleted model without the correlation parameter between the random intercepts and random slopes by subject. You can see that the subject group is missing a value in the Corr column.4\r\n\r\n\r\nno_subj_cor_model <- lmer(Response ~ Condition + (1+model.matrix(model)[,2]||Subject) + (1+Condition|Item),\r\n REML = FALSE, control = lmerControl('bobyqa'), data = toydata)\r\n\r\ntab_model(no_subj_cor_model)\r\n\r\n\r\n\r\n \r\n\r\n\r\nResponse\r\n\r\n\r\nPredictors\r\n\r\n\r\nEstimates\r\n\r\n\r\nCI\r\n\r\n\r\np\r\n\r\n\r\n(Intercept)\r\n\r\n\r\n209.64\r\n\r\n\r\n203.44 – 215.84\r\n\r\n\r\n<0.001\r\n\r\n\r\nCondition [Treatment]\r\n\r\n\r\n35.88\r\n\r\n\r\n28.43 – 43.33\r\n\r\n\r\n<0.001\r\n\r\n\r\nRandom Effects\r\n\r\n\r\nσ2\r\n\r\n37.34\r\n\r\n\r\nτ00Subject\r\n\r\n649.20\r\n\r\n\r\nτ00Subject.1\r\n\r\n110.96\r\n\r\n\r\nτ00Item\r\n\r\n44.56\r\n\r\n\r\nτ11Item.ConditionTreatment\r\n\r\n311.76\r\n\r\n\r\nρ01Item\r\n\r\n0.14\r\n\r\n\r\nICC\r\n\r\n\r\n0.96\r\n\r\n\r\nN Subject\r\n\r\n80\r\n\r\n\r\nN Item\r\n\r\n24\r\n\r\n\r\nObservations\r\n\r\n\r\n1920\r\n\r\n\r\nMarginal R2 / Conditional R2\r\n\r\n0.263 / 0.970\r\n\r\n\r\nNow let’s perform a likelihood ratio test using anova():\r\n\r\n\r\nanova(no_subj_cor_model, model, test = 'Chisq')\r\n\r\n\r\n Data: toydata\r\n Models:\r\n no_subj_cor_model: Response ~ Condition + (1 + model.matrix(model)[, 2] || Subject) + \r\n no_subj_cor_model: (1 + Condition | Item)\r\n model: Response ~ Condition + (1 + Condition | Subject) + (1 + Condition | \r\n model: Item)\r\n npar AIC BIC logLik deviance Chisq Df Pr(>Chisq) \r\n no_subj_cor_model 8 13352 13396 -6668 13336 \r\n model 9 13267 13317 -6624 13249 87.2 1 <2e-16 ***\r\n ---\r\n Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\r\n\r\nThe first thing to notice is that no_subj_cor_model has one less Df, or degrees of freedom, than model. This is because every correlation between random effects is an additional parameter that the model is estimating. So removing the correlation between random effects for no_subj_cor_model leaves it with 8 parameters, one less than the original, full model. There’s a good discussion of what parameters are specified by different lmer() formulas in this StackExchange thread.\r\nAfter performing this sanity check, the next thing to note is the very low number in Pr(>Chisq), telling us that the models are significantly different from one another. This might come off as weird if you’re only used to performing ANOVA comparisons to check whether a predictor is significant or not. In fact, there are no obvious differences between the output of no_subj_cor_model and the output of our original model other than the presence/absence of the correlation between subject random effects.\r\nBut clearly something major is going on behind the curtains, so we turn to the last term(s) of interest - AIC and BIC, which are scores for model fit. The numbers are hard to interpret on their own, but useful when comparing models. Here, both the AIC and the BIC of no_subj_cor_model are higher than model, suggesting that no_subj_cor_model has a worse fit, and a statistically significant one at that.\r\nMore specifically, we know that the only meaningful difference between no_subj_cor_model and model is the correlation parameter for the subject random effects, so no_subj_cor_model must be capturing the subject random effects relativelty poorly under its assumption that subject intercepts and subject slopes do not correlate with one another (i.e., that they are independent).\r\nSo let’s look at the random effects calculated by no_subj_cor_model and its poor attempt at fitting their distribution.\r\nFirst, let’s plot the subject intercepts by subject slopes like we did for our original model:\r\n\r\n\r\n\r\nYou might notice that the no_subj_cor_model calculates subject random effects that are very similar to those calculated by our original mode. Here’s a side-by-side comparison of the subject random effects from model and no_subj_cor_model:\r\n\r\n\r\n\r\nThis illustrates a very important point. Removing the correlation parameter does not change the calculation of the random effects (barring any serious convergence failures, of course). This shouldn’t be surprising because random effects, like fixed effects, speak to facts (in the frequentist sense) about how the data that we observe is generated. It is literally the case here since I included these random effects explicitly in making toydata. But more importantly, the idea that there are random variations generated from underlying population-level parameters is an assumption that we are making when we use mixed-effects models.\r\nThe only meaningful difference between the two models here, then, is in their fit - e.g., how well the model captures the distribution of the subject random effects. We actually went over this above - we saw that our original model fits subject random effects using a bivariate normal distribution assuming a correlation, while no_subj_cor_model should be fitting subject random effects using two univariate normal distributions, assuming no (i.e., zero) correlation.\r\nHere’s a visual comparison of model fit, with the plot for model at the top and the plot for no_subj_cor_model at the bottom:\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nWhere the term \\(\\rho = 0\\) in the plot for no_subj_cor_model indicates that the subject intercepts and subject slopes are generated from this bivariate distribution:\r\n\\[(S_{0s}, S_{1s})\\ \\sim\\ N(0,\\begin{bmatrix} \\tau_{00}^2 & 0 \\\\ 0 & \\tau_{11}^2 \\end{bmatrix})\\]\r\nWhich is the same as independetly sampling from these two univariate normal distributions:\r\n\\[S_{0s} \\sim N(0, \\tau_{00})\\]\r\n\\[S_{1s} \\sim N(0, \\tau_{11})\\]\r\nNow, looking at the previous pair of plots, no_subj_cor_model (bottom) clearly fits the distribution of the subject random effects poorly compared to our original model model (top), and that appears to be the driving the significant decrease in fit that we found from the likelihood ratio test earlier. It seems to be the case that the inclusion of the correlation parameter between subject random intercepts is necessary to fit the data well.\r\nAre correlation parameters always necessary?\r\nThe question of how “maximal” our models should be is very tricky, and especially so when it concerns the inclusion/exclusion of correlation parameters (see discussions here and here). For example, Bates, Kliegl, Vasishth, & Baayen (2015) and Matuschek, Kliegl, Vasishth, & Baayen (2017) have called for parsimonious models with stricter criteria for including terms in the model, beyond whether they cause the model to fail to converge.5\r\nI’ll demonstrate one case here where it doesn’t seem like including a correlation parameter particularly improves model fit.\r\nLet’s repeat the model comparison process above, except this time taking out the correlation parameter for item.\r\nFor context, here is the random effects output of model again:\r\n\r\n Groups Name Std.Dev. Corr\r\n Subject (Intercept) 25.24 \r\n ConditionTreatment 10.40 0.85\r\n Item (Intercept) 6.66 \r\n ConditionTreatment 17.56 0.14\r\n Residual 6.11\r\n\r\nAgain, there are three parameters that the model estimated to capture the by-item variation:\r\nThe variation (Std.Dev.) for item intercept\r\nThe variation (Std.Dev.) for item slope\r\nThe correlation (Corr) between item intercept and item slope.\r\nAnd here is what the distribution of item random effects from model look like:\r\n\r\n\r\n\r\nOur model fitted a bivariate normal distribution with the standard deviation of item intercepts = 6.66, the standard deviation of item slopes = 10.4, and correlation = 0.14.\r\nWe can again visualize the fit of model to the distribution of the item random effects:\r\n\r\n\r\n\r\nThe model estimates a low correlation of 0.14, which is reflected in the small tilt of the ellipse. It looks like the model is capturing the distribution of the item random effects pretty well. But is the correlation parameter really that necessary here?\r\nLet’s make another depleted model, no_item_cor_model, with the correlation between item random effects removed:\r\n\r\n\r\nno_item_cor_model <- lmer(Response ~ Condition + (1+Condition|Subject) + (1+model.matrix(model)[,2]||Item),\r\n REML = FALSE, control = lmerControl('bobyqa'), data = toydata)\r\n\r\ntab_model(no_item_cor_model)\r\n\r\n\r\n\r\n \r\n\r\n\r\nResponse\r\n\r\n\r\nPredictors\r\n\r\n\r\nEstimates\r\n\r\n\r\nCI\r\n\r\n\r\np\r\n\r\n\r\n(Intercept)\r\n\r\n\r\n209.64\r\n\r\n\r\n203.49 – 215.80\r\n\r\n\r\n<0.001\r\n\r\n\r\nCondition [Treatment]\r\n\r\n\r\n35.88\r\n\r\n\r\n28.45 – 43.31\r\n\r\n\r\n<0.001\r\n\r\n\r\nRandom Effects\r\n\r\n\r\nσ2\r\n\r\n37.37\r\n\r\n\r\nτ00Subject\r\n\r\n636.98\r\n\r\n\r\nτ00Item\r\n\r\n44.55\r\n\r\n\r\nτ00Item.1\r\n\r\n310.22\r\n\r\n\r\nτ11Subject.ConditionTreatment\r\n\r\n108.08\r\n\r\n\r\nρ01Subject\r\n\r\n0.85\r\n\r\n\r\nICC\r\n\r\n\r\n0.96\r\n\r\n\r\nN Subject\r\n\r\n80\r\n\r\n\r\nN Item\r\n\r\n24\r\n\r\n\r\nObservations\r\n\r\n\r\n1920\r\n\r\n\r\nMarginal R2 / Conditional R2\r\n\r\n0.244 / 0.972\r\n\r\n\r\nAgain, the output of the depleted model printed here does not differ that much from the output of model. We can see also get a sense of this by visualizing the fit of no_item_cor_model to the distribution of item random effects:\r\n\r\n\r\n\r\nWe can see this more clearly with a side-by-side comparison of model fit by model (blue) and no_item_cor_model (red):\r\n\r\n\r\n\r\nDoesn’t seem like there are big differences here, but we have to run some statistics to be sure. So let’s perform another log likelihood ratio test:\r\n\r\n\r\nanova(no_item_cor_model, model)\r\n\r\n\r\n Data: toydata\r\n Models:\r\n no_item_cor_model: Response ~ Condition + (1 + Condition | Subject) + (1 + model.matrix(model)[, \r\n no_item_cor_model: 2] || Item)\r\n model: Response ~ Condition + (1 + Condition | Subject) + (1 + Condition | \r\n model: Item)\r\n npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)\r\n no_item_cor_model 8 13265 13310 -6625 13249 \r\n model 9 13267 13317 -6624 13249 0.47 1 0.49\r\n\r\n\r\n\r\n\r\nFirst, for sanity check, we see that no_item_cor_model has one less Df than model, which is what we’d expect if no_item_cor_model indeed lacks the correlation parameter for item random effects. Next, we see that the value of Pr(>Chisq) is very high, at 0.49, suggesting that the models are not significantly different. This is corroborated by the very small differences between the models’ AIC and BIC values. These small differences here are likely reducible to the fact that AIC and BIC penalize more number of parameters (as a way of balancing model fit with model complexity). In fact, the differences in AIC between the models is approximately 2, which is exactly what you’d expect if you added a redundant parameter with no additional explanatory power to the model.\r\nIn sum, we see that when modeling our dataset toydata, estimating a correlation parameter for subject random effects improves model fit, while doing so for item random effects doesn’t as much.\r\nConclusion and implications for model building\r\nMy main goal here was to simply go over what the correlation parameter in mixed-effects models is, so the question of whether we should be including certain correlation parameter(s) in our models are beyond the scope of this discussion (and should be handled on a case-by-case basis). It’s also something that I’m in the process of learning, so I don’t have a good answer to it yet. But my personal view (which may change later with more knowledge and experience) is to keep correlation parameters in the model unless they make the model fail to converge. So in the case of the correlation parameter for item random effects in model that we discussed above, I’d personally keep that in since we didn’t run into any convergence issues fitting the maximal model. In general, if there’s no meaningful difference either way, I err towards leaving the correlation parameter in there. In other words, I try to keep the model as maximal as possible without overparameterizing it (Barr et al., 2013).\r\nI don’t think that the spirit of this message is really controversial, and the more challenging part of this is putting it into practice. We not only need to balance model fit with model complexity, but we also often need to navigate conflicts between important considerations from statistical theory and from whatever domain our research is in (linguistics, psychology, etc.).\r\nTo resolve this, a lot of people have streamlined different methods of reducing the complexity of the random effects structure in a statistically motivated way. One such example is using Principal Components Analysis (PCA) (suggested in Bates et al., 2015). In Section 3 of their paper, Bates and colleagues outline a procedure for iterative model reduction which involves PCA (now available as rePCA() in the lme4 package) to determine how many random effect terms are sufficient to capture the variance in the random effects. This is still not a perfect solution, of course, but it’s a good next step for putting this knowledge into practice. Or you can just do fancy Bayesian analyses and avoid all these problems, so I hear\r\nAnyways, that’s it for my notes. Here’s the code that generated toydata:\r\n\r\n\r\n###########\r\n## Setup ##\r\n###########\r\n\r\n# Load Packages (make sure dplyr::filter() isn't makes by MASS:filter())\r\nlibrary(MASS)\r\nlibrary(tidyverse)\r\nlibrary(lme4)\r\n\r\n# Set seed\r\nset.seed(1234)\r\n\r\n# Set number of participants and items\r\nn_subjects <- 80\r\nn_items <- 24\r\n\r\n#################\r\n## Make trials ##\r\n#################\r\n\r\n# Generate levels\r\nSubject <- gl(n_subjects, n_items)\r\nItem <- rep(gl(n_items, 1), n_subjects)\r\nCondition <- factor(rep(c(rep(c(\"Control\", \"Treatment\"), n_items/2),\r\n rep(c(\"Treatment\", \"Control\"), n_items/2)),\r\n n_subjects/2))\r\n\r\n# Treatment coding\r\nCondition_coded <- ifelse(Condition == \"Control\", 0, 1)\r\n\r\n# Combine into trials\r\nData <- tibble(Subject, Item, Condition, Condition_coded)\r\n\r\n#############################\r\n## Add Intercept and Slope ##\r\n#############################\r\n\r\n# Add intercept\r\nData$Intercept <- 200\r\n\r\n# Add slope\r\nData$Slope <- ifelse(Data$Condition == \"Treatment\", 30, 0)\r\n\r\n########################\r\n## Add Random Effects ##\r\n########################\r\n\r\n# By-subject variation in intercept and slope (sampled from bivariate normal)\r\nsd_subj_intercept <- 25\r\nsd_subj_slope <- 10\r\nsubj_ranef_cor <- 0.8\r\n\r\nsubj_ranef <- mvrnorm(n_subjects,\r\n # means of two normals are both 0\r\n c(\"Intercept\" = 0, \"Slope\" = 0),\r\n # 2x2 variance-covariance matrix\r\n matrix(\r\n c(sd_subj_intercept^2,\r\n subj_ranef_cor*sd_subj_intercept*sd_subj_slope,\r\n subj_ranef_cor*sd_subj_intercept*sd_subj_slope,\r\n sd_subj_slope^2),\r\n ncol = 2)\r\n )\r\n\r\nData$Subj_intercept <- rep(subj_ranef[,\"Intercept\"], each = n_items)\r\nData$Subj_slope <- rep(subj_ranef[,\"Slope\"], each = n_items)\r\n\r\n# By-item variation in intercept and slope (sampled independently)\r\nData$Item_intercept <- rep(rnorm(n_items, sd = 5), times = n_subjects)\r\nData$Item_slope <- rep(rnorm(n_items, sd = 15), times = n_subjects)\r\n\r\n# Random noise\r\nData$Noise <- rnorm(nrow(Data), 0, 5) + rlnorm(nrow(Data), 0.5)\r\n\r\n###########################\r\n## Generate Observations ##\r\n###########################\r\n\r\nData <- Data %>%\r\n mutate(Response =\r\n Intercept +\r\n Slope * Condition_coded +\r\n Subj_intercept +\r\n Subj_slope * Condition_coded +\r\n Item_intercept +\r\n Item_slope * Condition_coded +\r\n Noise)\r\n\r\n#################\r\n## Toy Dataset ##\r\n#################\r\n\r\ntoydata <- Data %>% \r\n select(Subject, Item, Condition, Response)\r\n\r\n\r\n\r\n\r\nThis distinction is also reflected in the fact that the notation for random effect standard deviation is tau (\\(\\tau\\)), which is a Greek symbol. In statistics, Greek symbols (like \\(\\beta\\), which we may be more familiar with) refers to population-level paramters.↩︎\r\nBut what if the distribution of random effects has a mean that is not equal to zero? Well that just shifts the fixed effects estimate, so the distribution of random effects can be fully characterized by just its variance/standard deviation. This is also why you should never remove a term from fixed effects without removing it from random effects like in Response ~ 1 + (1 + Condiiton | Subject) without a good reason, because the model will assume the fixed effect of Condition to be zero.↩︎\r\nRemoving a correlation term in lmer() turns out to be actually sort of tricky if you don’t explicitly numerically code your factors - sometimes just using the double bar syntax (||) doesn’t always work. I won’t go into the details of how to do that here, but there are good discussions of doing this using model.matrix() in this Rpubs post and Section 5.4 (also Appendix E) of Frossard and Renaud (2019). I could have done numeric coding with something like mutate(data, Condition = as.integer(Condition == \"Treatment\")), but I wanted to try this way out for myself↩︎\r\nHere, \\(\\tau_{00\\ Subject.1}\\) is actually the same as the \\(\\tau_{00\\ Subject.ConditionTreatment}\\) term from the maximal model, model. I don’t know how to suppress this name change after dropping a correlation term - if you do, please let me know!↩︎\r\nThe (simplified) argument here is that having the model estimate superfluous variance components can make it more difficult for the model to detect an effect if it actually exists - i.e., can lead to a loss of power↩︎\r\n", "preview": "posts/2020-06-07-correlation-parameter-mem/preview.png", - "last_modified": "2022-11-13T09:16:55-05:00", + "last_modified": "2022-11-13T06:16:55-08:00", "input_file": {}, "preview_width": 1248, "preview_height": 768 diff --git a/docs/research.html b/docs/research.html index c13439d..c375256 100644 --- a/docs/research.html +++ b/docs/research.html @@ -2697,6 +2697,7 @@

Contents

Links: Google Scholar, Github, OSF

Academic research output

Peer-reviewed Papers

+

June Choe, and Anna Papafragou. (Accepted). Distributional signatures of superordinate nouns. Language Acquisition. DOI: https://doi.org/10.1080/10489223.2024.2440348. PDF OSF

June Choe, and Anna Papafragou. (2023). The acquisition of subordinate nouns as pragmatic inference. Journal of Memory and Language, 132, 104432. DOI: https://doi.org/10.1016/j.jml.2023.104432. PDF OSF

June Choe, Yiran Chen, May Pik Yu Chan, Aini Li, Xin Gao, and Nicole Holliday. (2022). Language-specific Effects on Automatic Speech Recognition Errors for World Englishes. In Proceedings of the 29th International Conference on Computational Linguistics, 7177–7186.

May Pik Yu Chan, June Choe, Aini Li, Yiran Chen, Xin Gao, and Nicole Holliday. (2022). Training and typological bias in ASR performance for world Englishes. In Proceedings of Interspeech 2022, 1273-1277. DOI: 10.21437/Interspeech.2022-10869

diff --git a/docs/search.json b/docs/search.json index b5b9998..ac2abaf 100644 --- a/docs/search.json +++ b/docs/search.json @@ -5,28 +5,28 @@ "title": "Blog Posts", "author": [], "contents": "\r\n\r\n\r\n\r\n\r\n", - "last_modified": "2024-10-23T10:53:52-04:00" + "last_modified": "2024-10-23T07:53:52-07:00" }, { "path": "index.html", "title": "June Choe", "author": [], - "contents": "\r\n\r\n\r\n\r\n\r\n\r\n\r\n Education\r\n\r\n\r\nPh.D. Linguistics, University of Pennsylvania (~2025)\r\n\r\n\r\nB.A. Lingusitics, Northwestern University (2020)\r\n\r\n\r\n Interests\r\n\r\n\r\nComputational Psycholinguistics\r\n\r\n\r\nLanguage Acquisition and Development\r\n\r\n\r\nStatistical Computing and Graphics\r\n\r\n\r\n\r\n\r\n\r\nI am a linguist, software developer, and scientific programmer. In my PhD research in Linguistics at the University of Pennsylvania, I use experimental, corpus, and computational approaches to studying linguistic meaning, of various flavors. My dissertation investigates the learning problem of acquiring hierarchical noun meanings in early childhood, from the angles of semantics/pragmatics, distributional learning, and conceptual development.\r\nI am also active in the R programming community as a mentor and open-source developer. I maintain and collaborate on several open-source software in statistical computing (jlmerclusterperm), data visualization (ggtrace), data quality assurance (pointblank), and interfaces to data APIs (openalexR). My work on graphics received an award from the American Statistical Association in 2023. I enjoy writing in my free time and have been maintaining a technical blog over the past 5 years, covering topics including R language internals, software design principles, and practical tutorials for everyday data analysis.\r\n\r\n\r\ncontact me: yjchoe@sas.upenn.edu\r\n\r\n\r\n\r\n\r\n\r\n", - "last_modified": "2024-10-23T10:53:53-04:00" + "contents": "\r\n\r\n\r\n\r\n\r\n\r\n\r\n Education\r\n\r\n\r\nPh.D. Linguistics, University of Pennsylvania (~2025)\r\n\r\n\r\nB.A. Lingusitics, Northwestern University (2020)\r\n\r\n\r\n Interests\r\n\r\n\r\nComputational Psycholinguistics\r\n\r\n\r\nLanguage Acquisition and Development\r\n\r\n\r\nStatistical Computing and Graphics\r\n\r\n\r\n\r\n\r\n\r\nI am a fifth-year PhD candidate in Linguistics at the University of Pennsylvania. I use experimental, corpus, and computational approaches to studying linguistic meaning, of various flavors. My dissertation investigates the learning problem of acquiring hierarchical noun meanings in early childhood, from the angles of semantics/pragmatics, distributional learning, and conceptual development.\r\nI am also active in the R programming community as a mentor and open-source developer. I maintain and collaborate on several open-source software in statistical computing (jlmerclusterperm), data visualization (ggtrace), data quality assurance (pointblank), and interfaces to data APIs (openalexR). My work on graphics received an award from the American Statistical Association in 2023. I enjoy writing in my free time and have been maintaining a technical blog over the past 5 years, covering topics including R language internals, software design principles, and practical tutorials for everyday data analysis.\r\n\r\n\r\ncontact me: yjchoe@sas.upenn.edu\r\n\r\n\r\n\r\n\r\n\r\n", + "last_modified": "2024-12-28T19:32:28-08:00" }, { "path": "news.html", "title": "News", "author": [], "contents": "\r\n\r\n\r\nFor more of my personal news external/tangential to research\r\n2023\r\nAugust\r\nI was unfortunately not able to make it in person to JSM 2023 but have my pre-recorded talk has been uploaded!\r\nJune\r\nMy package jlmerclusterperm was published on CRAN!\r\nApril\r\nI was accepted to SMLP (Summer School on Statistical Methods for Linguistics and Psychology), to be held in September at the University of Potsdam, Germany! I will be joining the “Advanced methods in frequentist statistics with Julia” stream. Huge thanks to MindCORE for funding my travels to attend!\r\nJanuary\r\nI received the ASA Statistical Computing and Graphics student award for my paper Sublayer modularity in the Grammar of Graphics! I will be presenting my work at the 2023 Joint Statistical Meetings in Toronto in August.\r\n2022\r\nSeptember\r\nI was invited to a Korean data science podcast dataholic (데이터홀릭) to talk about my experience presenting at the RStudio and useR conferences! Part 1, Part 2\r\nAugust\r\nI led a workshop on IBEX and PCIbex with Nayoun Kim at the Seoul International Conference on Linguistics (SICOL 2022).\r\nJuly\r\nI attended my first in-person R conference at rstudio::conf(2022) and gave a talk on ggplot internals.\r\nJune\r\nI gave a talk on my package {ggtrace} at the useR! 2022 conference. I was awarded the diversity scholarship which covered my registration and workshop fees. My reflections\r\nI gave a talk at RLadies philly on using dplyr’s slice() function for row-relational operations.\r\n2021\r\nJuly\r\nMy tutorial on custom fonts in R was featured as a highlight on the R Weekly podcast!\r\nJune\r\nI gave a talk at RLadies philly on using icon fonts for data viz! I also wrote a follow-up blog post that goes deeper into font rendering in R.\r\nMay\r\nSnowGlobe, a project started in my undergrad, was featured in an article by the Northwestern University Library. We also had a workshop for SnowGlobe which drew participants from over a hundred universities!\r\nJanuary\r\nI joined Nayoun Kim for a workshop on experimental syntax conducted in Korean and held at Sungkyunkwan University (Korea). I helped design materials for a session on scripting online experiments with IBEX, including interactive slides made with R!\r\n2020\r\nNovember\r\nI joined designer Will Chase on his stream to talk about the psycholinguistics of speech production for a data viz project on Michael’s speech errors in The Office. It was a very cool and unique opportunity to bring my two interests together!\r\nOctober\r\nMy tutorial on {ggplot2} stat_*() functions was featured as a highlight on the R Weekly podcast, which curates weekly updates from the R community.\r\nI became a data science tutor at MindCORE to help researchers at Penn with data visualization and R programming.\r\nSeptember\r\nI have moved to Philadelphia to start my PhD in Linguistics at the University of Pennsylvania!\r\nJune\r\nI graduated from Northwestern University with a B.A. in Linguistics (with honors)! I was also elected into Phi Beta Kappa and appointed as the Senior Marshal for Linguistics.\r\n\r\n\r\n\r\n", - "last_modified": "2024-10-23T10:53:54-04:00" + "last_modified": "2024-10-23T07:53:54-07:00" }, { "path": "research.html", "title": "Research and activities", "author": [], - "contents": "\r\n\r\nContents\r\nAcademic research output\r\nPeer-reviewed Papers\r\nConference Talks\r\nConference Presentations\r\n\r\nResearch activities in FOSS\r\nPapers\r\nTalks\r\nSoftware\r\n\r\nTeaching\r\nPositions held\r\nWorkshops led\r\nGuest lectures\r\n\r\nProfessional activities\r\nEditor\r\nReviewer\r\nMembership\r\n\r\n\r\nLinks: Google Scholar, Github, OSF\r\nAcademic research output\r\nPeer-reviewed Papers\r\nJune Choe, and Anna Papafragou. (2023). The acquisition of subordinate nouns as pragmatic inference. Journal of Memory and Language, 132, 104432. DOI: https://doi.org/10.1016/j.jml.2023.104432. PDF OSF\r\nJune Choe, Yiran Chen, May Pik Yu Chan, Aini Li, Xin Gao, and Nicole Holliday. (2022). Language-specific Effects on Automatic Speech Recognition Errors for World Englishes. In Proceedings of the 29th International Conference on Computational Linguistics, 7177–7186.\r\nMay Pik Yu Chan, June Choe, Aini Li, Yiran Chen, Xin Gao, and Nicole Holliday. (2022). Training and typological bias in ASR performance for world Englishes. In Proceedings of Interspeech 2022, 1273-1277. DOI: 10.21437/Interspeech.2022-10869\r\nJune Choe, Masaya Yoshida, and Jennifer Cole. (2022). The role of prosodic focus in the reanalysis of garden path sentences: Depth of semantic processing impedes the revision of an erroneous local analysis. Glossa Psycholinguistics, 1(1). DOI: 10.5070/G601136\r\nJune Choe, and Anna Papafragou. (2022). The acquisition of subordinate nouns as pragmatic inference: Semantic alternatives modulate subordinate meanings. In Proceedings of the Annual Meeting of the Cognitive Science Society, 44, 2745-2752.\r\nSean McWeeny, Jinnie S. Choi, June Choe, Alexander LaTourette, Megan Y. Roberts, and Elizabeth S. Norton. (2022). Rapid automatized naming (RAN) as a kindergarten predictor of future reading in English: A systematic review and meta-analysis. Reading Research Quarterly, 57(4), 1187–1211. DOI: 10.1002/rrq.467\r\nConference Talks\r\nJune Choe, and Anna Papafragou. Children’s sensitivity to informativeness in naming: basic-level vs. superordinate nouns. Talk at the 101st Linguistic Society of America (LSA) conference. 9-12 January 2025. Philadelphia.\r\nJune Choe. Distributional signatures of superordinate nouns. Talk at the 10th MACSIM conference. 6 April 2024. University of Maryland, College Park, MD.\r\nJune Choe. Sub-layer modularity in the Grammar of Graphics. Talk at the 2023 Joint Statistical Meetings, 5-10 August 2023. Toronto, Canada. American Statistical Association (ASA) student paper award in Statistical Computing and Graphics. Paper\r\nJune Choe. Persona-based social expectations in sentence processing and comprehension. Talk at the Language, Stereotypes & Social Cognition workshop, 22-23 May, 2023. University of Pennsylvania, PA.\r\nJune Choe, and Anna Papafragou. Lexical alternatives and the acquisition of subordinate nouns. Talk at the 47th Boston University Conference on Language Development (BUCLD), 3-6 November, 2022. Boston University, Boston, MA. Slides\r\nJune Choe, Yiran Chen, May Pik Yu Chan, Aini Li, Xin Gao and Nicole Holliday. (2022). Language-specific Effects on Automatic Speech Recognition Errors in American English. Talk at the 28th International Conference on Computational Linguistics (CoLing), 12-17 October, 2022. Gyeongju, South Korea. Slides\r\nMay Pik Yu Chan, June Choe, Aini Li, Yiran Chen, Xin Gao and Nicole Holliday. (2022). Training and typological bias in ASR performance for world Englishes. Talk at the 23rd Conference of the International Speech Communication Association (INTERSPEECH), 18-22 September, 2022. Incheon, South Korea.\r\nConference Presentations\r\nJune Choe, and Anna Papafragou. Distributional signatures of superordinate nouns. Poster presented at the 48th Boston University Conference on Language Development (BUCLD), 2-5 November, 2023. Boston University, Boston, MA. Paula Menyuk Travel Award. Abstract Poster\r\nJune Choe, and Anna Papafragou. Pragmatic underpinnings of the basic-level bias. Poster presented at the 48th Boston University Conference on Language Development (BUCLD), 2-5 November, 2023. Boston University, Boston, MA. Paula Menyuk Travel Award. Abstract Poster\r\nJune Choe and Anna Papafragou. Discourse effects on the acquisition of subordinate nouns. Poster presented at the 9th Mid-Atlantic Colloquium of Studies in Meaning (MACSIM), 15 April 2023. University of Pennsylvania, PA.\r\nJune Choe and Anna Papafragou. Discourse effects on the acquisition of subordinate nouns. Poster presented at the 36th Annual Conference on Human Sentence Processing, 9-11 March 2022. University of Pittsburg, PA. Abstract Poster\r\nJune Choe, and Anna Papafragou. Acquisition of subordinate nouns as pragmatic inference: Semantic alternatives modulate subordinate meanings. Poster at the 2nd Experiments in Linguistic Meaning (ELM) conference, 18-20 May 2022. University of Pennsylvania, Philadelphia, PA.\r\nJune Choe, and Anna Papafragou. Beyond the basic level: Levels of informativeness and the acquisition of subordinate nouns. Poster at the 35th Annual Conference on Human Sentence Processing (HSP), 24-26 March 2022. University of California, Santa Cruz, CA.\r\nJune Choe, Jennifer Cole, and Masaya Yoshida. Prosodic Focus Strengthens Semantic Persistence. Poster at The 26th Architectures and Mechanisms for Language Processing (AMLaP), 3-5 September 2020. Potsdam, Germany. Abstract Video Slides\r\nJune Choe. Computer-assisted snowball search for meta-analysis research. Poster at The 2020 Undergraduate Research & Arts Exposition. 27-28 May 2020. Northwestern University, Evanston, IL. 2nd Place Poster Award. Abstract\r\nJune Choe. Social Information in Sentence Processing. Talk at The 2019 Undergraduate Research & Arts Exposition. 29 May 2019. Northwestern University, Evanston, IL. Abstract\r\nJune Choe, Shayne Sloggett, Masaya Yoshida and Annette D’Onofrio. Personae in syntactic processing: Socially-specific agents bias expectations of verb transitivity. Poster at The 32nd CUNY Conference on Human Sentence Processing. 29-31 March 2019. University of Colorado, Boulder, CO.\r\nD’Onofrio, Annette, June Choe and Masaya Yoshida. Personae in syntactic processing: Socially-specific agents bias expectations of verb transitivity. Poster at The 93rd Annual Meeting of the Linguistics Society of America. 3-6 January 2019. New York City, NY.\r\nResearch activities in FOSS\r\nPapers\r\nMassimo Aria, Trang Le, Corrado Cuccurullo, Alessandra Belfiore, and June Choe. (2024). openalexR: An R-tool for collecting bibliometric data from OpenAlex. The R Journal, 15(4), 166-179. Paper, Github\r\nJune Choe. (2022). Sub-layer modularity in the Grammar of Graphics. American Statistical Association (ASA) student paper award in Statistical Computing and Graphics. Paper, Github\r\nTalks\r\nJune Choe. Sub-layer modularity in the Grammar of Graphics. Talk at the 2023 Joint Statistical Meetings, 5-10 August 2023. Toronto, Canada.\r\nJune Choe. Fast cluster-based permutation test using mixed-effects models. Talk at the Integrated Language Science and Technology (ILST) seminar, 21 April 2023. University of Pennsylvania, PA.\r\nJune Choe. Cracking open ggplot internals with {ggtrace}. Talk at the 2022 RStudio Conference, 25-28 July 2022. Washington D.C. https://github.com/yjunechoe/ggtrace-rstudioconf2022.\r\nJune Choe. Stepping into {ggplot2} internals with {ggtrace}. Talk at the 2022 useR! Conference, 20-23 June 2022. Vanderbilt University, TN. Diversity scholarship award. https://github.com/yjunechoe/ggtrace-user2022\r\nSoftware\r\nJune Choe. (2024). jlmerclusterperm: Cluster-Based Permutation Analysis for Densely Sampled Time Data. R package version 1.1.3. https://cran.r-project.org/package=jlmerclusterperm. Github\r\nRich Iannone, June Choe, Mauricio Vargas Sepulveda. (2024). pointblank: Data Validation and Organization of Metadata for Local and Remote Tables. R package version 0.12.1. https://CRAN.R-project.org/package=pointblank. Github\r\nMassimo Aria, Corrado Cuccurullo, Trang Le, June Choe. (2024). openalexR: Getting Bibliographic Records from ‘OpenAlex’ Database Using ‘DSL’ API. R package version 1.4.0. https://CRAN.R-project.org/package=openalexR. Github\r\nJune Choe. (2024). jlme: Regression Modelling with ‘GLM.jl’ and ‘MixedModels.jl’ in ‘Julia’. R package version 0.3.0. https://cran.r-project.org/package=jlme. Github\r\nSean McWeeny, June Choe, & Elizabeth S. Norton. (2021). SnowGlobe: An Iterative Search Tool for Systematic Reviews and Meta-Analyses [Computer Software]. OSF\r\nTeaching\r\nPositions held\r\nTeaching assistant for “Introduction to Linguistics”. Instructor: Aletheia Cui. Spring 2024. University of Pennsylvania.\r\nTeaching assistant for “Data science for language and the mind”. Instructor: Katie Schuler. Fall 2021, Spring, 2023, and Fall 2023. University of Pennsylvania.\r\nWorkshops led\r\nIntroduction to mixed-effects models in Julia. Workshop at Penn MindCORE. 1 December 2023. Philadelphia, PA. Github Colab notebook\r\nExperimental syntax using IBEX/PCIBEX with Dr. Nayoun Kim. Workshop at the 2022 Seoul International Conference on Linguistics. 11-12 August 2022. Seoul, South Korea. PDF\r\nExperimental syntax using IBEX: a walkthrough with Dr. Nayoun Kim. 2021 BK Winter School-Workshop on Experimental Linguistics/Syntax at Sungkyunkwan University, 19-22 January 2021. Seoul, South Korea. PDF\r\nGuest lectures\r\nHard words and (syntactic) bootstrapping. LING 5750 “The Acquisition of Meaning”. Instructor: Dr. Anna Papafragou. Spring 2024. University of Pennsylvania.\r\nIntroduction to R for psychology research. PSYC 4997 “Senior Honors Seminar in Psychology”. Instructor: Dr. Coren Apicella. Spring 2024. University of Pennsylvania. Colab notebook\r\nModel fitting and diagnosis with MixedModels.jl in Julia. LING 5670 “Quantitative Study of Linguistic Variation”. Instructor: Dr. Meredith Tamminga. Fall 2023. University of Pennsylvania.\r\nSimulation-based power analysis for mixed-effects models. LING 5670 “Quantitative Study of Linguistic Variation”. Instructor: Dr. Meredith Tamminga. Spring 2023. University of Pennsylvania.\r\nProfessional activities\r\nEditor\r\nPenn Working Papers in Linguistics (PWPL), Volumne 30, Issue 1.\r\nReviewer\r\nCognition\r\nLanguage Learning and Development\r\nJournal of Open Source Software\r\nProceedings of the Annual Meeting of the Cognitive Science Society\r\nMembership\r\nLinguistics Society of America\r\nAmerican Statistical Association\r\n\r\n\r\n\r\n", - "last_modified": "2024-10-23T10:53:56-04:00" + "contents": "\r\n\r\nContents\r\nAcademic research output\r\nPeer-reviewed Papers\r\nConference Talks\r\nConference Presentations\r\n\r\nResearch activities in FOSS\r\nPapers\r\nTalks\r\nSoftware\r\n\r\nTeaching\r\nPositions held\r\nWorkshops led\r\nGuest lectures\r\n\r\nProfessional activities\r\nEditor\r\nReviewer\r\nMembership\r\n\r\n\r\nLinks: Google Scholar, Github, OSF\r\nAcademic research output\r\nPeer-reviewed Papers\r\nJune Choe, and Anna Papafragou. (Accepted). Distributional signatures of superordinate nouns. Language Acquisition. DOI: https://doi.org/10.1080/10489223.2024.2440348. PDF OSF\r\nJune Choe, and Anna Papafragou. (2023). The acquisition of subordinate nouns as pragmatic inference. Journal of Memory and Language, 132, 104432. DOI: https://doi.org/10.1016/j.jml.2023.104432. PDF OSF\r\nJune Choe, Yiran Chen, May Pik Yu Chan, Aini Li, Xin Gao, and Nicole Holliday. (2022). Language-specific Effects on Automatic Speech Recognition Errors for World Englishes. In Proceedings of the 29th International Conference on Computational Linguistics, 7177–7186.\r\nMay Pik Yu Chan, June Choe, Aini Li, Yiran Chen, Xin Gao, and Nicole Holliday. (2022). Training and typological bias in ASR performance for world Englishes. In Proceedings of Interspeech 2022, 1273-1277. DOI: 10.21437/Interspeech.2022-10869\r\nJune Choe, Masaya Yoshida, and Jennifer Cole. (2022). The role of prosodic focus in the reanalysis of garden path sentences: Depth of semantic processing impedes the revision of an erroneous local analysis. Glossa Psycholinguistics, 1(1). DOI: 10.5070/G601136\r\nJune Choe, and Anna Papafragou. (2022). The acquisition of subordinate nouns as pragmatic inference: Semantic alternatives modulate subordinate meanings. In Proceedings of the Annual Meeting of the Cognitive Science Society, 44, 2745-2752.\r\nSean McWeeny, Jinnie S. Choi, June Choe, Alexander LaTourette, Megan Y. Roberts, and Elizabeth S. Norton. (2022). Rapid automatized naming (RAN) as a kindergarten predictor of future reading in English: A systematic review and meta-analysis. Reading Research Quarterly, 57(4), 1187–1211. DOI: 10.1002/rrq.467\r\nConference Talks\r\nJune Choe, and Anna Papafragou. Children’s sensitivity to informativeness in naming: basic-level vs. superordinate nouns. Talk at the 101st Linguistic Society of America (LSA) conference. 9-12 January 2025. Philadelphia.\r\nJune Choe. Distributional signatures of superordinate nouns. Talk at the 10th MACSIM conference. 6 April 2024. University of Maryland, College Park, MD.\r\nJune Choe. Sub-layer modularity in the Grammar of Graphics. Talk at the 2023 Joint Statistical Meetings, 5-10 August 2023. Toronto, Canada. American Statistical Association (ASA) student paper award in Statistical Computing and Graphics. Paper\r\nJune Choe. Persona-based social expectations in sentence processing and comprehension. Talk at the Language, Stereotypes & Social Cognition workshop, 22-23 May, 2023. University of Pennsylvania, PA.\r\nJune Choe, and Anna Papafragou. Lexical alternatives and the acquisition of subordinate nouns. Talk at the 47th Boston University Conference on Language Development (BUCLD), 3-6 November, 2022. Boston University, Boston, MA. Slides\r\nJune Choe, Yiran Chen, May Pik Yu Chan, Aini Li, Xin Gao and Nicole Holliday. (2022). Language-specific Effects on Automatic Speech Recognition Errors in American English. Talk at the 28th International Conference on Computational Linguistics (CoLing), 12-17 October, 2022. Gyeongju, South Korea. Slides\r\nMay Pik Yu Chan, June Choe, Aini Li, Yiran Chen, Xin Gao and Nicole Holliday. (2022). Training and typological bias in ASR performance for world Englishes. Talk at the 23rd Conference of the International Speech Communication Association (INTERSPEECH), 18-22 September, 2022. Incheon, South Korea.\r\nConference Presentations\r\nJune Choe, and Anna Papafragou. Distributional signatures of superordinate nouns. Poster presented at the 48th Boston University Conference on Language Development (BUCLD), 2-5 November, 2023. Boston University, Boston, MA. Paula Menyuk Travel Award. Abstract Poster\r\nJune Choe, and Anna Papafragou. Pragmatic underpinnings of the basic-level bias. Poster presented at the 48th Boston University Conference on Language Development (BUCLD), 2-5 November, 2023. Boston University, Boston, MA. Paula Menyuk Travel Award. Abstract Poster\r\nJune Choe and Anna Papafragou. Discourse effects on the acquisition of subordinate nouns. Poster presented at the 9th Mid-Atlantic Colloquium of Studies in Meaning (MACSIM), 15 April 2023. University of Pennsylvania, PA.\r\nJune Choe and Anna Papafragou. Discourse effects on the acquisition of subordinate nouns. Poster presented at the 36th Annual Conference on Human Sentence Processing, 9-11 March 2022. University of Pittsburg, PA. Abstract Poster\r\nJune Choe, and Anna Papafragou. Acquisition of subordinate nouns as pragmatic inference: Semantic alternatives modulate subordinate meanings. Poster at the 2nd Experiments in Linguistic Meaning (ELM) conference, 18-20 May 2022. University of Pennsylvania, Philadelphia, PA.\r\nJune Choe, and Anna Papafragou. Beyond the basic level: Levels of informativeness and the acquisition of subordinate nouns. Poster at the 35th Annual Conference on Human Sentence Processing (HSP), 24-26 March 2022. University of California, Santa Cruz, CA.\r\nJune Choe, Jennifer Cole, and Masaya Yoshida. Prosodic Focus Strengthens Semantic Persistence. Poster at The 26th Architectures and Mechanisms for Language Processing (AMLaP), 3-5 September 2020. Potsdam, Germany. Abstract Video Slides\r\nJune Choe. Computer-assisted snowball search for meta-analysis research. Poster at The 2020 Undergraduate Research & Arts Exposition. 27-28 May 2020. Northwestern University, Evanston, IL. 2nd Place Poster Award. Abstract\r\nJune Choe. Social Information in Sentence Processing. Talk at The 2019 Undergraduate Research & Arts Exposition. 29 May 2019. Northwestern University, Evanston, IL. Abstract\r\nJune Choe, Shayne Sloggett, Masaya Yoshida and Annette D’Onofrio. Personae in syntactic processing: Socially-specific agents bias expectations of verb transitivity. Poster at The 32nd CUNY Conference on Human Sentence Processing. 29-31 March 2019. University of Colorado, Boulder, CO.\r\nD’Onofrio, Annette, June Choe and Masaya Yoshida. Personae in syntactic processing: Socially-specific agents bias expectations of verb transitivity. Poster at The 93rd Annual Meeting of the Linguistics Society of America. 3-6 January 2019. New York City, NY.\r\nResearch activities in FOSS\r\nPapers\r\nMassimo Aria, Trang Le, Corrado Cuccurullo, Alessandra Belfiore, and June Choe. (2024). openalexR: An R-tool for collecting bibliometric data from OpenAlex. The R Journal, 15(4), 166-179. Paper, Github\r\nJune Choe. (2022). Sub-layer modularity in the Grammar of Graphics. American Statistical Association (ASA) student paper award in Statistical Computing and Graphics. Paper, Github\r\nTalks\r\nJune Choe. Sub-layer modularity in the Grammar of Graphics. Talk at the 2023 Joint Statistical Meetings, 5-10 August 2023. Toronto, Canada.\r\nJune Choe. Fast cluster-based permutation test using mixed-effects models. Talk at the Integrated Language Science and Technology (ILST) seminar, 21 April 2023. University of Pennsylvania, PA.\r\nJune Choe. Cracking open ggplot internals with {ggtrace}. Talk at the 2022 RStudio Conference, 25-28 July 2022. Washington D.C. https://github.com/yjunechoe/ggtrace-rstudioconf2022.\r\nJune Choe. Stepping into {ggplot2} internals with {ggtrace}. Talk at the 2022 useR! Conference, 20-23 June 2022. Vanderbilt University, TN. Diversity scholarship award. https://github.com/yjunechoe/ggtrace-user2022\r\nSoftware\r\nJune Choe. (2024). jlmerclusterperm: Cluster-Based Permutation Analysis for Densely Sampled Time Data. R package version 1.1.3. https://cran.r-project.org/package=jlmerclusterperm. Github\r\nRich Iannone, June Choe, Mauricio Vargas Sepulveda. (2024). pointblank: Data Validation and Organization of Metadata for Local and Remote Tables. R package version 0.12.1. https://CRAN.R-project.org/package=pointblank. Github\r\nMassimo Aria, Corrado Cuccurullo, Trang Le, June Choe. (2024). openalexR: Getting Bibliographic Records from ‘OpenAlex’ Database Using ‘DSL’ API. R package version 1.4.0. https://CRAN.R-project.org/package=openalexR. Github\r\nJune Choe. (2024). jlme: Regression Modelling with ‘GLM.jl’ and ‘MixedModels.jl’ in ‘Julia’. R package version 0.3.0. https://cran.r-project.org/package=jlme. Github\r\nSean McWeeny, June Choe, & Elizabeth S. Norton. (2021). SnowGlobe: An Iterative Search Tool for Systematic Reviews and Meta-Analyses [Computer Software]. OSF\r\nTeaching\r\nPositions held\r\nTeaching assistant for “Introduction to Linguistics”. Instructor: Aletheia Cui. Spring 2024. University of Pennsylvania.\r\nTeaching assistant for “Data science for language and the mind”. Instructor: Katie Schuler. Fall 2021, Spring, 2023, and Fall 2023. University of Pennsylvania.\r\nWorkshops led\r\nIntroduction to mixed-effects models in Julia. Workshop at Penn MindCORE. 1 December 2023. Philadelphia, PA. Github Colab notebook\r\nExperimental syntax using IBEX/PCIBEX with Dr. Nayoun Kim. Workshop at the 2022 Seoul International Conference on Linguistics. 11-12 August 2022. Seoul, South Korea. PDF\r\nExperimental syntax using IBEX: a walkthrough with Dr. Nayoun Kim. 2021 BK Winter School-Workshop on Experimental Linguistics/Syntax at Sungkyunkwan University, 19-22 January 2021. Seoul, South Korea. PDF\r\nGuest lectures\r\nHard words and (syntactic) bootstrapping. LING 5750 “The Acquisition of Meaning”. Instructor: Dr. Anna Papafragou. Spring 2024. University of Pennsylvania.\r\nIntroduction to R for psychology research. PSYC 4997 “Senior Honors Seminar in Psychology”. Instructor: Dr. Coren Apicella. Spring 2024. University of Pennsylvania. Colab notebook\r\nModel fitting and diagnosis with MixedModels.jl in Julia. LING 5670 “Quantitative Study of Linguistic Variation”. Instructor: Dr. Meredith Tamminga. Fall 2023. University of Pennsylvania.\r\nSimulation-based power analysis for mixed-effects models. LING 5670 “Quantitative Study of Linguistic Variation”. Instructor: Dr. Meredith Tamminga. Spring 2023. University of Pennsylvania.\r\nProfessional activities\r\nEditor\r\nPenn Working Papers in Linguistics (PWPL), Volumne 30, Issue 1.\r\nReviewer\r\nCognition\r\nLanguage Learning and Development\r\nJournal of Open Source Software\r\nProceedings of the Annual Meeting of the Cognitive Science Society\r\nMembership\r\nLinguistics Society of America\r\nAmerican Statistical Association\r\n\r\n\r\n\r\n", + "last_modified": "2024-12-28T19:35:24-08:00" }, { "path": "resources.html", @@ -34,14 +34,14 @@ "description": "Mostly for R and data visualization\n", "author": [], "contents": "\r\n\r\nContents\r\nLinguistics\r\nData Visualization\r\nPackages and software\r\nTutorial Blog Posts\r\nBy others\r\n\r\nLinguistics\r\nScripting online experiments with IBEX (workshop slides & materials with Nayoun Kim)\r\nData Visualization\r\n{ggplot2} style guide and showcase - most recent version (2/10/2021)\r\nCracking open the internals of ggplot: A {ggtrace} showcase - slides\r\nPackages and software\r\n{ggtrace}: R package for exploring, debugging, and manipulating ggplot internals by exposing the underlying object-oriented system in functional programming terms.\r\n{penngradlings}: R package for the University of Pennsylvania Graduate Linguistics Society.\r\n{LingWER}: R package for linguistic analysis of Word Error Rate for evaluating transcriptions and other speech-to-text output, using a deterministic matrix-based search algorithm optimized for R.\r\n{gridAnnotate}: R package for interactively annotating figures from the plot pane, using {grid} graphical objects.\r\nSnowGlobe: A tool for meta-analysis research. Developed with Jinnie Choi, Sean McWeeny, and Elizabeth Norton, with funding from the Northwestern University Library. Currently under development but basic features are functional. Validation experiments and guides at OSF repo.\r\nTutorial Blog Posts\r\n{ggplot2} stat_*() functions [post]\r\nCustom fonts in R [post]\r\n{purrr} reduce() family [post1, post2]\r\nThe correlation parameter in {lme4} mixed effects models [post]\r\nShortcuts for common chain of {dplyr} functions [post]\r\nPlotting highly-customizable treemaps with {treemap} and {ggplot2} [post]\r\nBy others\r\nTutorials:\r\nA ggplot2 Tutorial for Beautiful Plotting in R by Cédric Scherer\r\nggplot2 Wizardry Hands-On by Cédric Scherer\r\nggplot2 workshop by Thomas Lin Pedersen\r\nBooks:\r\nR for Data Science by Hadley Wickham and Garrett Grolemund\r\nR Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, and Garrett Grolemund\r\nggplot2: elegant graphics for data analysis by Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen\r\nFundamentals of Data Visualization by Claus O. Wilke\r\nEfficient R Programming by Colin Gillespie and Robin Lovelace\r\nAdvanced R by Hadley Wickham\r\n\r\n\r\n\r\n", - "last_modified": "2024-10-23T10:53:57-04:00" + "last_modified": "2024-10-23T07:53:57-07:00" }, { "path": "software.html", "title": "Software", "author": [], "contents": "\r\n\r\nContents\r\nggtrace\r\njlmerclusterperm\r\npointblank\r\nopenalexR\r\njlme\r\nggcolormeter\r\nSnowglobe (retired)\r\nOther contributions\r\n\r\nMain: Github profile, R-universe profile\r\nggtrace\r\n\r\n\r\n\r\nRole: Maintainer, Author\r\nLanguage: R\r\nLinks: Github, website, talks (useR! 2022, rstudio::conf 2022), paper\r\n\r\nProgrammatically explore, debug, and manipulate ggplot internals. Package {ggtrace} offers a low-level interface that extends base R capabilities of trace, as well as a family of workflow functions that make interactions with ggplot internals more accessible.\r\n\r\njlmerclusterperm\r\n\r\n\r\n\r\nRole: Maintainer, Author\r\nLanguage: R, Julia\r\nLinks: CRAN, Github, website\r\n\r\nAn implementation of fast cluster-based permutation analysis (CPA) for densely-sampled time data developed in Maris & Oostenveld (2007). Supports (generalized, mixed-effects) regression models for the calculation of timewise statistics. Provides both a wholesale and a piecemeal interface to the CPA procedure with an emphasis on interpretability and diagnostics. Integrates Julia libraries MixedModels.jl and GLM.jl for performance improvements, with additional functionalities for interfacing with Julia from ‘R’ powered by the JuliaConnectoR package.\r\n\r\npointblank\r\n\r\n\r\n\r\nRole: Author\r\nLanguage: R, HTML/CSS, Javascript\r\nLinks: Github, website\r\n\r\nData quality assessment and metadata reporting for data frames and database tables\r\n\r\nopenalexR\r\n\r\n\r\n\r\nRole: Author\r\nLanguage: R\r\nLinks: Github, website\r\n\r\nA set of tools to extract bibliographic content from the OpenAlex database using API https://docs.openalex.org.\r\n\r\njlme\r\nRole: Author, Maintainer\r\nLanguage: R, Julia\r\nLinks: Github, website\r\n\r\nJulia (mixed-effects) regression modelling from R. Powered by the {JuliaConnectoR} R package and Julia libraries GLM, StatsModels, and MixedModels.\r\n\r\nggcolormeter\r\nRole: Maintainer, Author\r\nLanguage: R\r\nLinks: Github\r\n\r\n{ggcolormeter} adds guide_colormeter(), a {ggplot2} color/fill legend guide extension in the style of a dashboard meter.\r\n\r\nSnowglobe (retired)\r\nRole: Maintainer, Author\r\nLanguage: R, SQL\r\nLinks: Github, OSF, poster\r\n\r\nAn iterative search tool for systematic reviews and meta-analyses, implemented as a Shiny app. Retired due to the discontinuation of the Microsoft Academic Graph service in 2021. I now contribute to {openalexR}.\r\n\r\nOther contributions\r\nI am listed as a contributor in:\r\nJuliaConnectoR\r\nreactablefmtr\r\nddplot\r\nI have also contributed via PRs to:\r\nggplot2\r\nfda\r\n\r\n\r\n\r\n", - "last_modified": "2024-10-23T10:53:59-04:00" + "last_modified": "2024-10-23T07:53:59-07:00" }, { "path": "visualizations.html", @@ -49,7 +49,7 @@ "description": "Select data visualizations", "author": [], "contents": "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n", - "last_modified": "2024-10-23T10:54:02-04:00" + "last_modified": "2024-10-23T07:54:02-07:00" } ], "collections": ["posts/posts.json"] diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 8ba48a5..78b48ff 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -2,174 +2,174 @@ https://yjunechoe.github.io/blog.html - 2022-11-13T09:16:58-05:00 + 2022-11-13T06:16:58-08:00 https://yjunechoe.github.io/ - 2024-10-17T21:24:32-04:00 + 2024-12-28T19:32:03-08:00 https://yjunechoe.github.io/news.html - 2023-09-06T15:55:04-04:00 + 2023-09-06T12:55:04-07:00 https://yjunechoe.github.io/research.html - 2024-10-14T08:05:22-04:00 + 2024-12-28T19:34:54-08:00 https://yjunechoe.github.io/resources.html - 2022-11-13T09:17:00-05:00 + 2022-11-13T06:17:00-08:00 https://yjunechoe.github.io/software.html - 2024-10-14T08:16:22-04:00 + 2024-10-14T05:16:22-07:00 https://yjunechoe.github.io/visualizations.html - 2022-11-13T09:17:01-05:00 + 2022-11-13T06:17:01-08:00 https://yjunechoe.github.io/posts/2024-09-22-fetch-files-web/ - 2024-09-22T18:49:08-04:00 + 2024-09-22T15:49:08-07:00 https://yjunechoe.github.io/posts/2024-07-21-enumerate-possible-options/ - 2024-09-01T17:53:55-04:00 + 2024-09-01T14:53:55-07:00 https://yjunechoe.github.io/posts/2024-06-09-ave-for-the-average/ - 2024-06-23T00:35:29-04:00 + 2024-06-22T21:35:29-07:00 https://yjunechoe.github.io/posts/2024-03-04-args-args-args-args/ - 2024-03-05T17:04:54-05:00 + 2024-03-05T14:04:54-08:00 https://yjunechoe.github.io/posts/2024-02-20-helloworld-print/ - 2024-02-20T15:25:44-05:00 + 2024-02-20T12:25:44-08:00 https://yjunechoe.github.io/posts/2023-12-31-2023-year-in-review/ - 2024-01-01T15:43:40-05:00 + 2024-01-01T12:43:40-08:00 https://yjunechoe.github.io/posts/2023-12-03-untidy-select/ - 2023-12-04T10:11:22-05:00 + 2023-12-04T07:11:22-08:00 https://yjunechoe.github.io/posts/2023-07-09-x-y-problem/ - 2023-07-10T04:24:43-04:00 + 2023-07-10T01:24:43-07:00 https://yjunechoe.github.io/posts/2023-06-11-row-relational-operations/ - 2024-09-23T10:29:43-04:00 + 2024-09-23T07:29:43-07:00 https://yjunechoe.github.io/posts/2022-11-13-dataframes-jl-and-accessories/ - 2022-11-15T09:59:05-05:00 + 2022-11-15T06:59:05-08:00 https://yjunechoe.github.io/posts/2022-07-30-user2022/ - 2022-11-13T09:16:58-05:00 + 2022-11-13T06:16:58-08:00 https://yjunechoe.github.io/posts/2022-07-06-ggplot2-delayed-aes-2/ - 2022-11-13T09:16:57-05:00 + 2022-11-13T06:16:57-08:00 https://yjunechoe.github.io/posts/2022-03-10-ggplot2-delayed-aes-1/ - 2024-06-09T03:55:07-04:00 + 2024-06-09T00:55:07-07:00 https://yjunechoe.github.io/posts/2021-06-24-setting-up-and-debugging-custom-fonts/ - 2024-06-09T03:56:06-04:00 + 2024-06-09T00:56:06-07:00 https://yjunechoe.github.io/posts/2021-01-17-random-sampling-a-table-animation/ - 2022-11-13T09:16:57-05:00 + 2022-11-13T06:16:57-08:00 https://yjunechoe.github.io/posts/2020-12-13-collapse-repetitive-piping-with-reduce/ - 2022-11-13T09:16:57-05:00 + 2022-11-13T06:16:57-08:00 https://yjunechoe.github.io/posts/2020-11-08-plot-makeover-2/ - 2022-11-13T09:16:57-05:00 + 2022-11-13T06:16:57-08:00 https://yjunechoe.github.io/posts/2020-11-03-tidytuesday-2020-week-45/ - 2022-11-13T09:16:57-05:00 + 2022-11-13T06:16:57-08:00 https://yjunechoe.github.io/posts/2020-10-28-tidytuesday-2020-week-44/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-10-22-analysis-of-everycolorbots-tweets/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-10-13-designing-guiding-aesthetics/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-09-26-demystifying-stat-layers-ggplot2/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-09-23-tidytuesday-2020-week-39/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-09-20-plot-makeover-1/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-09-14-tidytuesday-2020-week-38/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-09-12-videos-in-reactable/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-09-06-fonts-for-graphs/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-08-17-tidytuesday-2020-week-33/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-08-07-saving-a-line-of-piping/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-08-04-tidytuesday-2020-week-32/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-07-29-six-years-of-my-spotify-playlists/ - 2022-11-13T09:16:56-05:00 + 2022-11-13T06:16:56-08:00 https://yjunechoe.github.io/posts/2020-07-20-shiny-tips-1/ - 2022-11-13T09:16:55-05:00 + 2022-11-13T06:16:55-08:00 https://yjunechoe.github.io/posts/2020-07-13-geom-paired-raincloud/ - 2022-11-13T09:16:55-05:00 + 2022-11-13T06:16:55-08:00 https://yjunechoe.github.io/posts/2020-06-30-treemap-with-ggplot/ - 2022-11-13T09:16:55-05:00 + 2022-11-13T06:16:55-08:00 https://yjunechoe.github.io/posts/2020-06-25-indexing-tip-for-spacyr/ - 2022-11-13T09:16:55-05:00 + 2022-11-13T06:16:55-08:00 https://yjunechoe.github.io/posts/2020-06-07-correlation-parameter-mem/ - 2022-11-13T09:16:55-05:00 + 2022-11-13T06:16:55-08:00 diff --git a/index.Rmd b/index.Rmd index 3ccecbf..6c5a599 100644 --- a/index.Rmd +++ b/index.Rmd @@ -85,7 +85,7 @@ d-title p {
-I am a linguist, software developer, and scientific programmer. In my PhD research in [Linguistics at the University of Pennsylvania](https://live-sas-www-ling.pantheon.sas.upenn.edu/), I use experimental, corpus, and computational approaches to studying linguistic meaning, of various flavors. My dissertation investigates the learning problem of acquiring hierarchical noun meanings in early childhood, from the angles of semantics/pragmatics, distributional learning, and conceptual development. +I am a fifth-year PhD candidate in [Linguistics at the University of Pennsylvania](https://live-sas-www-ling.pantheon.sas.upenn.edu/). I use experimental, corpus, and computational approaches to studying linguistic meaning, of various flavors. My dissertation investigates the learning problem of acquiring hierarchical noun meanings in early childhood, from the angles of semantics/pragmatics, distributional learning, and conceptual development. I am also active in the R programming community as a mentor and open-source developer. I maintain and collaborate on several [open-source software]((/software.html)) in statistical computing ([jlmerclusterperm](https://yjunechoe.github.io/jlmerclusterperm)), data visualization ([ggtrace](https://yjunechoe.github.io/ggtrace/)), data quality assurance ([pointblank](https://github.com/rstudio/pointblank)), and interfaces to data APIs ([openalexR](https://github.com/ropensci/openalexR)). My work on graphics received an [award from the American Statistical Association in 2023](https://community.amstat.org/jointscsg-section/awards/student-paper-competition). I enjoy writing in my free time and have been maintaining a [technical blog](https://yjunechoe.github.io/blog.html) over the past 5 years, covering topics including R language internals, software design principles, and practical tutorials for everyday data analysis. diff --git a/research.Rmd b/research.Rmd index de5f616..9ece3c6 100644 --- a/research.Rmd +++ b/research.Rmd @@ -16,6 +16,8 @@ Links: [Google Scholar](https://scholar.google.com/citations?user=dFdQTmYAAAAJ), ### Peer-reviewed Papers +**June Choe**, and Anna Papafragou. (Accepted). Distributional signatures of superordinate nouns. _Language Acquisition_. DOI: https://doi.org/10.1080/10489223.2024.2440348. [PDF]() [OSF](https://osf.io/uhpr2/) + **June Choe**, and Anna Papafragou. (2023). The acquisition of subordinate nouns as pragmatic inference. _Journal of Memory and Language, 132_, 104432. DOI: https://doi.org/10.1016/j.jml.2023.104432. [PDF](https://www.langcoglab.com/_files/ugd/7cb05b_746c6f11a2214857a94c3ebbc7afa718.pdf) [OSF](https://osf.io/x8cuf/?view_only=14a975d3618e4838b6cc7681b2ecf692) **June Choe**, Yiran Chen, May Pik Yu Chan, Aini Li, Xin Gao, and Nicole Holliday. (2022). [Language-specific Effects on Automatic Speech Recognition Errors for World Englishes](https://aclanthology.org/2022.coling-1.628/). In _Proceedings of the 29th International Conference on Computational Linguistics_, 7177–7186.