diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web.Rmd b/_posts/2024-09-22-fetch-files-web/fetch-files-web.Rmd similarity index 89% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web.Rmd rename to _posts/2024-09-22-fetch-files-web/fetch-files-web.Rmd index 3c3ed003..0163f76b 100644 --- a/_posts/2024-09-01-fetch-files-web/fetch-files-web.Rmd +++ b/_posts/2024-09-22-fetch-files-web/fetch-files-web.Rmd @@ -1,7 +1,7 @@ --- title: 'Read files on the web into R' description: | - Mostly a compilation of some code-snippets for my own use + For the download-button-averse of us categories: - tutorial base_url: https://yjunechoe.github.io @@ -10,7 +10,7 @@ author: affiliation: University of Pennsylvania Linguistics affiliation_url: https://live-sas-www-ling.pantheon.sas.upenn.edu/ orcid_id: 0000-0002-0701-921X -date: 09-01-2024 +date: 09-22-2024 output: distill::distill_article: include-after-body: "highlighting.html" @@ -20,7 +20,6 @@ output: editor_options: chunk_output_type: console preview: github-dplyr-starwars.jpg -draft: true --- ```{r setup, include=FALSE} @@ -36,7 +35,7 @@ knitr::opts_chunk$set( Every so often I'll have a link to some file on hand and want to read it in R without going out of my way to browse the web page, find a download link, download it somewhere onto my computer, grab the path to it, and then finally read it into R. -Over the years I've accumulated some tricks to get data into R "straight from a url", even if the url does not point to the raw file contents itself. The method varies between data sources though, and I have a hard time keeping track of them in my head, so I thought I'd write some of these down for my own reference. This is not meant to be comprehensive though - keep in mind that I'm someone who primarily works with tabular data and use GitHub and OSF as data repositories. +Over the years I've accumulated some tricks to get data into R "straight from a url", even if the url does not point to the raw file contents itself. The method varies between data sources though, and I have a hard time keeping track of them in my head, so I thought I'd write some of these down for my own reference. This is not meant to be comprehensive though - keep in mind that I'm someone who primarily works with tabular data and interface with GitHub and OSF as data repositories. ## GitHub (public repos) @@ -91,9 +90,9 @@ emphatic::hl_diff( ## GitHub (gists) -It's a similar idea with GitHub Gists (sometimes I like to store small datasets for demos as gists). For example, here's a link to a simulated data for a [Stroop experiment](https://en.wikipedia.org/wiki/Stroop_effect) `stroop.csv`: . +It's a similar idea with GitHub Gists, where I sometimes like to store small toy datasets for use in demos. For example, here's a link to a simulated data for a [Stroop experiment](https://en.wikipedia.org/wiki/Stroop_effect) `stroop.csv`: . -But that's a full on webpage. The url which actually hosts the csv contents is , which you can again get to by clicking the **Raw** button at the top-right corner of the gist +But that's again a full-on webpage. 
The url which actually hosts the csv contents is , which you can again get to by clicking the **Raw** button at the top-right corner of the gist ```{r, echo=FALSE, fig.align='center', out.width="100%", out.extra="class=external"} knitr::include_graphics("github-gist-stroop.jpg", error = FALSE) @@ -121,7 +120,7 @@ We now turn to the harder problem of accessing a file in a private GitHub reposi Except this time, when you open the file at that url (assuming it can display in plain text), you'll see the url come with a "token" attached at the end (I'll show an example further down). This token is necessary to remotely access the data in a private repo. Once a token is generated, the file can be accessed using that token from anywhere, but note that it *will expire* at some point as GitHub refreshes tokens periodically (so treat them as if they're for single use). -For a more robust approach, you can use the [GitHub Contents API](https://docs.github.com/en/rest/repos/contents). If you have your credentials set up in [`{gh}`](https://gh.r-lib.org/) (which you can check with `gh::gh_whoami()`), you can request a token-tagged url to the private file using the syntax:[^Thanks [@tanho](https://fosstodon.org/@tanho) for pointing me to this at the [R4DS/DSLC](https://fosstodon.org/@DSLC) slack.] +For a more robust approach, you can use the [GitHub Contents API](https://docs.github.com/en/rest/repos/contents). If you have your credentials set up in [`{gh}`](https://gh.r-lib.org/) (which you can check with `gh::gh_whoami()`), you can request a token-tagged url to the private file using the syntax:^[Thanks [@tanho](https://fosstodon.org/@tanho) for pointing me to this at the [R4DS/DSLC](https://fosstodon.org/@DSLC) slack.] ```{r, eval=FALSE} gh::gh("/repos/{user}/{repo}/contents/{path}")$download_url @@ -173,7 +172,7 @@ arrow::read_feather("https://osf.io/download/9vztj/") |> You might have already caught on to this, but the pattern is to simply point to `osf.io/download/` instead of `osf.io/`. -This method also works for view-only links to anonymized OSF projects as well. For example, this is an anonymized link to a csv file from one of my projects . Navigating to this link will show a web preview of the csv file contents, just like in the GitHub example with `dplyr::starwars`. +This method also works for view-only links to anonymized OSF projects as well. For example, this is an anonymized link to a csv file from one of my projects . Navigating to this link will show a web preview of the csv file contents. By inserting `/download` into this url, we can read the csv file contents directly: @@ -186,9 +185,9 @@ See also the [`{osfr}`](https://docs.ropensci.org/osfr/reference/osfr-package.ht ## Aside: Can't go wrong with a copy-paste! -Reading remote files aside, I think it's severly under-rated how base R has a `readClipboard()` function and a collection of `read.*()` functions which can also read directly from a `"clipboard"` connection.^[The special value `"clipboard"` works for most base-R read functions that take a `file` or `con` argument.] +Reading remote files aside, I think it's severely underrated how base R has a `readClipboard()` function and a collection of `read.*()` functions which can also read directly from a `"clipboard"` connection.^[The special value `"clipboard"` works for most base-R read functions that take a `file` or `con` argument.] 
-I sometimes do this for html/markdown summary tables that a website might display, or sometimes even for entire excel/googlesheets tables after doing a select-all + copy. For such relatively small chunks of data that you just want to quickly get into R, you can also lean on base R's clipboard functionalities. +I sometimes do this for html/markdown summary tables that a website might display, or sometimes even for entire excel/googlesheets tables after doing a select-all + copy. For such relatively small chunks of data that you just want to quickly get into R, you can lean on base R's clipboard functionalities. For example, given this markdown table: @@ -197,7 +196,7 @@ aggregate(mtcars, mpg ~ cyl, mean) |> knitr::kable() ``` -You can copy it and run the following code to get that data back as an R data frame: +You can copy its contents and run the following code to get that data back as an R data frame: ```{r, eval=FALSE} read.delim("clipboard") @@ -257,9 +256,13 @@ For this example I will use a [parquet file](https://duckdb.org/docs/data/parque ```{r} # A parquet file of tokens from a sample of child-directed speech file <- "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet" + +# For comparison, reading its contents with {arrow} +arrow::read_parquet(file) |> + head(5) ``` -In duckdb, the `httpfs` extension allows `PARQUET_SCAN`^[Or `READ_PARQUET` - [same thing](https://duckdb.org/docs/data/parquet/overview.html#read_parquet-function).] to read a remote parquet file. +In duckdb, the `httpfs` extension we loaded above allows `PARQUET_SCAN`^[Or `READ_PARQUET` - [same thing](https://duckdb.org/docs/data/parquet/overview.html#read_parquet-function).] to read a remote parquet file. ```{r} query1 <- glue::glue_sql(" @@ -310,11 +313,11 @@ To get the file tree of the repo on the master branch, we use: files <- gh::gh("/repos/yjunechoe/repetition_events/git/trees/master?recursive=true")$tree ``` -With `recursive=true`, this returns all files in the repo. We can filter for just the parquet files we want with a little regex: +With `recursive=true`, this returns all files in the repo. Then, we can filter for just the parquet files we want with a little regex: ```{r} parquet_files <- sapply(files, `[[`, "path") |> - grep(x = _, pattern = ".*data/tokens_data/.*parquet$", value = TRUE) + grep(x = _, pattern = ".*/tokens_data/.*parquet$", value = TRUE) length(parquet_files) head(parquet_files) ``` @@ -423,21 +426,21 @@ Lastly, I inadvertently(?) started some discussion around remotely accessing spa I also have some random tricks that are more situational. Unfortunately, I can only recall like 20% of them at any given moment, so I'll be updating this space as more come back to me: -- When reading remote `.rda` or `.RData` files with `load()`, you need to wrap the link in `url()` first (ref: [stackoverflow](https://stackoverflow.com/questions/26108575/loading-rdata-files-from-url)). +- When reading remote `.rda` or `.RData` files with `load()`, you may need to wrap the link in `url()` first (ref: [stackoverflow](https://stackoverflow.com/questions/26108575/loading-rdata-files-from-url)). - [`{vroom}`](https://vroom.r-lib.org/) can [remotely read gzipped files](https://vroom.r-lib.org/articles/vroom.html#reading-remote-files), without having to `download.file()` and `unzip()` first. - [`{curl}`](https://jeroen.cran.dev/curl/), of course, will always have the most comprehensive set of low-level tools you need to read any arbitrary data remotely. 
For example, using `curl::curl_fetch_memory()` to read the `dplyr::storms` data again from the GitHub raw contents link: - ```{r} - fetched <- curl::curl_fetch_memory( - "https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv" +```{r} +fetched <- curl::curl_fetch_memory( + "https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv" ) read.csv(text = rawToChar(fetched$content)) |> - dplyr::glimpse() - ``` + dplyr::glimpse() +``` - And even if you're going the route of downloading the file first, `curl::multi_download()` can offer big performance improvements over `download.file()`.^[See an example implemented for [`{openalexR}`](https://github.com/ropensci/openalexR/pull/63), an API package.] Many `{curl}` functions can also handle [retries and stop/resumes](https://fosstodon.org/@eliocamp@mastodon.social/111885424355264237) which is cool too. +- Even if you're going the route of downloading the file first, `curl::multi_download()` can offer big performance improvements over `download.file()`.^[See an example implemented for [`{openalexR}`](https://github.com/ropensci/openalexR/pull/63), an API package.] Many `{curl}` functions can also handle [retries and stop/resumes](https://fosstodon.org/@eliocamp@mastodon.social/111885424355264237) which is cool too. - [`{httr2}`](https://httr2.r-lib.org/) can capture a *continuous data stream* with `httr2::req_perform_stream()` up to a set time or size. diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web.html b/_posts/2024-09-22-fetch-files-web/fetch-files-web.html similarity index 91% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web.html rename to _posts/2024-09-22-fetch-files-web/fetch-files-web.html index 06589d7f..b0083a46 100644 --- a/_posts/2024-09-01-fetch-files-web/fetch-files-web.html +++ b/_posts/2024-09-22-fetch-files-web/fetch-files-web.html @@ -32,7 +32,7 @@ } @media print { pre > code.sourceCode { white-space: pre-wrap; } -pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } +pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -90,32 +90,32 @@ Read files on the web into R - + - - + + - + - + @@ -1524,7 +1524,7 @@ @@ -1541,13 +1541,13 @@

Read files on the web into R

tutorial
-

Mostly a compilation of some code-snippets for my own use

+

For the download-button-averse of us

@@ -1571,7 +1571,7 @@

Contents

Every so often I’ll have a link to some file on hand and want to read it in R without going out of my way to browse the web page, find a download link, download it somewhere onto my computer, grab the path to it, and then finally read it into R.

-

Over the years I’ve accumulated some tricks to get data into R “straight from a url”, even if the url does not point to the raw file contents itself. The method varies between data sources though, and I have a hard time keeping track of them in my head, so I thought I’d write some of these down for my own reference. This is not meant to be comprehensive though - keep in mind that I’m someone who primarily works with tabular data and use GitHub and OSF as data repositories.

+

Over the years I’ve accumulated some tricks to get data into R “straight from a url”, even if the url does not point to the raw file contents itself. The method varies between data sources though, and I have a hard time keeping track of them in my head, so I thought I’d write some of these down for my own reference. This is not meant to be comprehensive though - keep in mind that I’m someone who primarily works with tabular data and interface with GitHub and OSF as data repositories.

GitHub (public repos)

GitHub has a nice point-and-click interface for browsing repositories and previewing files. For example, you can navigate to the dplyr::starwars dataset from tidyverse/dplyr, at https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv:

@@ -1632,8 +1632,8 @@

GitHub (public repos)

GitHub (gists)

-

It’s a similar idea with GitHub Gists (sometimes I like to store small datasets for demos as gists). For example, here’s a link to a simulated data for a Stroop experiment stroop.csv: https://gist.github.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6.

-

But that’s a full on webpage. The url which actually hosts the csv contents is https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv, which you can again get to by clicking the Raw button at the top-right corner of the gist

+

It’s a similar idea with GitHub Gists, where I sometimes like to store small toy datasets for use in demos. For example, here’s a link to a simulated data for a Stroop experiment stroop.csv: https://gist.github.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6.

+

But that’s again a full-on webpage. The url which actually hosts the csv contents is https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv, which you can again get to by clicking the Raw button at the top-right corner of the gist

@@ -1666,7 +1666,7 @@

GitHub (gists)

GitHub (private repos)

We now turn to the harder problem of accessing a file in a private GitHub repository. If you already have the GitHub webpage open and you’re signed in, you can follow the same step of copying the link that the Raw button redirects to.

Except this time, when you open the file at that url (assuming it can display in plain text), you’ll see the url come with a “token” attached at the end (I’ll show an example further down). This token is necessary to remotely access the data in a private repo. Once a token is generated, the file can be accessed using that token from anywhere, but note that it will expire at some point as GitHub refreshes tokens periodically (so treat them as if they’re for single use).

-

For a more robust approach, you can use the GitHub Contents API. If you have your credentials set up in {gh} (which you can check with gh::gh_whoami()), you can request a token-tagged url to the private file using the syntax:

+

For a more robust approach, you can use the GitHub Contents API. If you have your credentials set up in {gh} (which you can check with gh::gh_whoami()), you can request a token-tagged url to the private file using the syntax:1

gh::gh("/repos/{user}/{repo}/contents/{path}")$download_url
@@ -1686,9 +1686,9 @@

GitHub (private repos)

# truncating gsub(x = _, "^(.{100}).*", "\\1...")
-
  [1] "https://raw.githubusercontent.com/yjunechoe/my-super-secret-repo/main/README.md?token=AMTCUR6BQGEERA..."
+
  [1] "https://raw.githubusercontent.com/yjunechoe/my-super-secret-repo/main/README.md?token=AMTCUR2JPXCIX5..."
-

I can then use this url to read the private file:1

+

I can then use this url to read the private file:2

gh::gh("/repos/yjunechoe/my-super-secret-repo/contents/README.md")$download_url |> 
@@ -1718,7 +1718,7 @@ 

OSF

$ yield <int> 1545, 1440, 1440, 1520, 1580, 1540, 1555, 1490, 1560, 1495, 1595…

You might have already caught on to this, but the pattern is to simply point to osf.io/download/ instead of osf.io/.

-

This method also works for view-only links to anonymized OSF projects as well. For example, this is an anonymized link to a csv file from one of my projects https://osf.io/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad. Navigating to this link will show a web preview of the csv file contents, just like in the GitHub example with dplyr::starwars.

+

This method also works for view-only links to anonymized OSF projects as well. For example, this is an anonymized link to a csv file from one of my projects https://osf.io/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad. Navigating to this link will show a web preview of the csv file contents.

By inserting /download into this url, we can read the csv file contents directly:

@@ -1735,8 +1735,8 @@

OSF

See also the {osfr} package for a more principled interface to OSF.

Aside: Can’t go wrong with a copy-paste!

-

Reading remote files aside, I think it’s severly under-rated how base R has a readClipboard() function and a collection of read.*() functions which can also read directly from a "clipboard" connection.2

-

I sometimes do this for html/markdown summary tables that a website might display, or sometimes even for entire excel/googlesheets tables after doing a select-all + copy. For such relatively small chunks of data that you just want to quickly get into R, you can also lean on base R’s clipboard functionalities.

+

Reading remote files aside, I think it’s severely underrated how base R has a readClipboard() function and a collection of read.*() functions which can also read directly from a "clipboard" connection.3

+

I sometimes do this for html/markdown summary tables that a website might display, or sometimes even for entire excel/googlesheets tables after doing a select-all + copy. For such relatively small chunks of data that you just want to quickly get into R, you can lean on base R’s clipboard functionalities.

For example, given this markdown table:

@@ -1745,28 +1745,28 @@

Aside: Can’t go wrong with a co

  cyl       mpg
    4  26.66364
    6  19.74286
    8  15.10000
-

You can copy it and run the following code to get that data back as an R data frame:

+

You can copy its contents and run the following code to get that data back as an R data frame:

read.delim("clipboard")
@@ -1779,7 +1779,7 @@ 

Aside: Can’t go wrong with a co 2 6 19.74286 3 8 15.10000

-

If you’re instead copying something flat like a list of numbers or strings, you can also use scan() and specify the appropriate sep to get that data back as a vector:3

+

If you’re instead copying something flat like a list of numbers or strings, you can also use scan() and specify the appropriate sep to get that data back as a vector:4

paste(1:10, collapse = ", ") |> 
@@ -1816,10 +1816,22 @@ 

Streaming with {duckdb}

# A parquet file of tokens from a sample of child-directed speech
-file <- "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet"
-
-
-

In duckdb, the httpfs extension allows PARQUET_SCAN4 to read a remote parquet file.

+file <- "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet"
+
+# For comparison, reading its contents with {arrow}
+arrow::read_parquet(file) |> 
+  head(5)
+
+
  # A tibble: 5 × 3
+    utterance_id gloss   part_of_speech
+           <int> <chr>   <chr>         
+  1            1 www     ""            
+  2            2 bye     "co"          
+  3            3 mhm     "co"          
+  4            4 Mommy's "n:prop"      
+  5            4 here    "adv"
+
+

In duckdb, the httpfs extension we loaded above allows PARQUET_SCAN5 to read a remote parquet file.

query1 <- glue::glue_sql("
@@ -1885,7 +1897,7 @@ 

Streaming with {duckdb}

4 4 Mommy's n:prop 1 5 4 here adv 1
-

To do this more programmatically over all (parquet) files under /tokens_data in the repository, we need to transition to using the GitHub Trees API. The idea is similar to using the Contents API but now we are requesting a list of all files using the following syntax:

+

To do this more programmatically over all parquet files under /tokens_data in the repository, we need to transition to using the GitHub Trees API. The idea is similar to using the Contents API but now we are requesting a list of all files using the following syntax:

gh::gh("/repos/{user}/{repo}/git/trees/{branch/tag/commitSHA}?recursive=true")$tree
@@ -1897,11 +1909,11 @@

Streaming with {duckdb}

files <- gh::gh("/repos/yjunechoe/repetition_events/git/trees/master?recursive=true")$tree
-

With recursive=true, this returns all files in the repo. We can filter for just the parquet files we want with a little regex:

+

With recursive=true, this returns all files in the repo. Then, we can filter for just the parquet files we want with a little regex:

parquet_files <- sapply(files, `[[`, "path") |> 
-  grep(x = _, pattern = ".*data/tokens_data/.*parquet$", value = TRUE)
+  grep(x = _, pattern = ".*/tokens_data/.*parquet$", value = TRUE)
 length(parquet_files)
  [1] 70
@@ -1931,7 +1943,7 @@

Streaming with {duckdb}

[5] "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=13/part-1.parquet" [6] "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=14/part-2.parquet"
-

Back on duckdb, we can use PARQUET_SCAN to read multiple files by supplying a vector ['file1.parquet', 'file2.parquet', ...].5 This time, we also ask for a quick computation to count the number of distinct childIDs:

+

Back on duckdb, we can use PARQUET_SCAN to read multiple files by supplying a vector ['file1.parquet', 'file2.parquet', ...].6 This time, we also ask for a quick computation to count the number of distinct childIDs:

query3 <- glue::glue_sql("
@@ -1955,7 +1967,7 @@ 

Streaming with {duckdb}

1 70

This returns 70 which matches the length of the parquet_files vector listing the files that had been partitioned by childID.

-

For further analyses, we can CREATE TABLE6 our data in our in-memory database con:

+

For further analyses, we can CREATE TABLE7 our data in our in-memory database con:

query4 <- glue::glue_sql("
@@ -2046,21 +2058,22 @@ 

Streaming with {duckdb}

Other sources for data

In writing this blog post, I’m indebted to all the knowledgeable folks on Mastodon who suggested their own recommended tools and workflows for various kinds of remote data. Unfortunately, I’m not familiar enough with most of them enough to do them justice, but I still wanted to record the suggestions I got from there for posterity.

First, a post about reading remote files would not be complete without a mention of the wonderful {googlesheets4} package for reading from Google Sheets. I debated whether I should include a larger discussion of {googlesheets4}, and despite using it quite often myself I ultimately decided to omit it for the sake of space and because the package website is already very comprehensive. I would suggest starting from the Get Started vignette if you are new and interested.

-

Second, along the lines of {osfr}, there are other similar rOpensci packages for retrieving data from the kinds of data sources that may be of interest to academics, such as {deposits} for zenodo and figshare, and {piggyback} for GitHub release assets (Maëlle Salmon’s comment pointed me to the first two; I responded with some of my experiences). I was also reminded that {pins} exists - I’m not familiar with it myself so I thought I wouldn’t write anything for it here BUT Isabella Velásquez came in clutch with a whole talk on dynamically loading up-to-date data with {pins} which is a great usecase demo of the unique strength of {pins}.

+

Second, along the lines of {osfr}, there are other similar rOpensci packages for retrieving data from the kinds of data sources that may be of interest to academics, such as {deposits} for zenodo and figshare, and {piggyback} for GitHub release assets (Maëlle Salmon’s comment pointed me to the first two; I responded with some of my experiences). I was also reminded that {pins} exists - I’m not familiar with it myself so I thought I wouldn’t write anything for it here BUT Isabella Velásquez came in clutch sharing a recent talk on dynamically loading up-to-date data with {pins} which is a great demo of the unique strengths of {pins}.

Lastly, I inadvertently(?) started some discussion around remotely accessing spatial files. I don’t work with spatial data at all but I can totally imagine how the hassle of the traditional click-download-find-load workflow would be even more pronounced for spatial data which are presumably much larger in size and more difficult to preview. On this note, I’ll just link to Carl Boettiger’s comment about the fact that GDAL has a virtual file system that you can interface with from R packages wrapping this API (ex: {gdalraster}), and to Michael Sumner’s comment/gist + Chris Toney’s comment on the fact that you can even use this feature to stream non-spatial data!

Miscellaneous tips and tricks

I also have some random tricks that are more situational. Unfortunately, I can only recall like 20% of them at any given moment, so I’ll be updating this space as more come back to me:

-  • When reading remote .rda or .RData files with load(), you need to wrap the link in url() first (ref: stackoverflow).
+  • When reading remote .rda or .RData files with load(), you may need to wrap the link in url() first (ref: stackoverflow).
   • {vroom} can remotely read gzipped files, without having to download.file() and unzip() first.
   • {curl}, of course, will always have the most comprehensive set of low-level tools you need to read any arbitrary data remotely. For example, using curl::curl_fetch_memory() to read the dplyr::storms data again from the GitHub raw contents link:
fetched <- curl::curl_fetch_memory(
-  "https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv"
-    )
-    read.csv(text = rawToChar(fetched$content)) |> 
-  dplyr::glimpse()
+ "https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv" +) +read.csv(text = rawToChar(fetched$content)) |> + dplyr::glimpse()
  Rows: 87
   Columns: 14
@@ -2079,7 +2092,8 @@ 

Miscellaneous tips and tricks

$ vehicles <chr> "Snowspeeder, Imperial Speeder Bike", "", "", "", "Imperial… $ starships <chr> "X-wing, Imperial shuttle", "", "", "TIE Advanced x1", "", …
-

And even if you’re going the route of downloading the file first, curl::multi_download() can offer big performance improvements over download.file().[^See an example implemented for {openalexR}, an API package.] Many {curl} functions also take a retry parameter in some form which is cool too.

+
    +
  • Even if you’re going the route of downloading the file first, curl::multi_download() can offer big performance improvements over download.file().8 Many {curl} functions can also handle retries and stop/resumes which is cool too.

  • {httr2} can capture a continuous data stream with httr2::req_perform_stream() up to a set time or size.
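As a rough sketch of what that can look like (assuming the callback-plus-`timeout_sec` interface of `httr2::req_perform_stream()`, and reusing the public starwars csv url from earlier; not code from the post itself):

```{r, eval=FALSE}
library(httr2)

# Accumulate raw bytes as they stream in, for at most ~2 seconds
chunks <- raw()
request("https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv") |> 
  req_perform_stream(
    callback = function(bytes) {
      chunks <<- c(chunks, bytes)
      TRUE  # returning TRUE keeps the stream alive
    },
    timeout_sec = 2
  )

# Parse whatever arrived within the time budget
read.csv(text = rawToChar(chunks)) |> 
  dplyr::glimpse()
```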

sessionInfo()

@@ -2131,16 +2145,18 @@

sessionInfo()

-  1. Note that the API will actually generate a new token every time you send a request (and again, these tokens will expire with time).↩︎
-  2. The special value "clipboard" works for most base-R read functions that take a file or con argument.↩︎
-  3. Thanks @coolbutuseless for pointing me to textConnection()!↩︎
-  4. Or READ_PARQUET - same thing.↩︎
-  5. We can also get this formatting with a combination of shQuote() and toString().↩︎
-  6. Whereas CREATE TABLE results in a physical copy of the data in memory, CREATE VIEW will dynamically fetch the data from the source every time you query the table. If the data fits into memory (as in this case), I prefer CREATE as queries will be much faster (though you pay up-front for the time copying the data). If the data is larger than memory, CREATE VIEW will be your only option.↩︎
+  1. Thanks @tanho for pointing me to this at the R4DS/DSLC slack.↩︎
+  2. Note that the API will actually generate a new token every time you send a request (and again, these tokens will expire with time).↩︎
+  3. The special value "clipboard" works for most base-R read functions that take a file or con argument.↩︎
+  4. Thanks @coolbutuseless for pointing me to textConnection()!↩︎
+  5. Or READ_PARQUET - same thing.↩︎
+  6. We can also get this formatting with a combination of shQuote() and toString().↩︎
+  7. Whereas CREATE TABLE results in a physical copy of the data in memory, CREATE VIEW will dynamically fetch the data from the source every time you query the table. If the data fits into memory (as in this case), I prefer CREATE as queries will be much faster (though you pay up-front for the time copying the data). If the data is larger than memory, CREATE VIEW will be your only option.↩︎
+  8. See an example implemented for {openalexR}, an API package.↩︎

diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/anchor-4.2.2/anchor.min.js b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/anchor-4.2.2/anchor.min.js similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/anchor-4.2.2/anchor.min.js rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/anchor-4.2.2/anchor.min.js diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/bowser-1.9.3/bowser.min.js b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/bowser-1.9.3/bowser.min.js similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/bowser-1.9.3/bowser.min.js rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/bowser-1.9.3/bowser.min.js diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/distill-2.2.21/template.v2.js b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/distill-2.2.21/template.v2.js similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/distill-2.2.21/template.v2.js rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/distill-2.2.21/template.v2.js diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/header-attrs-2.27/header-attrs.js b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/header-attrs-2.27/header-attrs.js similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/header-attrs-2.27/header-attrs.js rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/header-attrs-2.27/header-attrs.js diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/jquery-3.6.0/jquery-3.6.0.js b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/jquery-3.6.0/jquery-3.6.0.js similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/jquery-3.6.0/jquery-3.6.0.js rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/jquery-3.6.0/jquery-3.6.0.js diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/jquery-3.6.0/jquery-3.6.0.min.js b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/jquery-3.6.0/jquery-3.6.0.min.js similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/jquery-3.6.0/jquery-3.6.0.min.js rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/jquery-3.6.0/jquery-3.6.0.min.js diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/jquery-3.6.0/jquery-3.6.0.min.map b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/jquery-3.6.0/jquery-3.6.0.min.map similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/jquery-3.6.0/jquery-3.6.0.min.map rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/jquery-3.6.0/jquery-3.6.0.min.map diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/popper-2.6.0/popper.min.js b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/popper-2.6.0/popper.min.js similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/popper-2.6.0/popper.min.js rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/popper-2.6.0/popper.min.js diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy-bundle.umd.min.js b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy-bundle.umd.min.js similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy-bundle.umd.min.js 
rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy-bundle.umd.min.js diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy-light-border.css b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy-light-border.css similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy-light-border.css rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy-light-border.css diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy.css b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy.css similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy.css rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy.css diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy.umd.min.js b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy.umd.min.js similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy.umd.min.js rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/tippy-6.2.7/tippy.umd.min.js diff --git a/_posts/2024-09-01-fetch-files-web/fetch-files-web_files/webcomponents-2.0.0/webcomponents.js b/_posts/2024-09-22-fetch-files-web/fetch-files-web_files/webcomponents-2.0.0/webcomponents.js similarity index 100% rename from _posts/2024-09-01-fetch-files-web/fetch-files-web_files/webcomponents-2.0.0/webcomponents.js rename to _posts/2024-09-22-fetch-files-web/fetch-files-web_files/webcomponents-2.0.0/webcomponents.js diff --git a/_posts/2024-09-01-fetch-files-web/github-dplyr-starwars-csv.jpg b/_posts/2024-09-22-fetch-files-web/github-dplyr-starwars-csv.jpg similarity index 100% rename from _posts/2024-09-01-fetch-files-web/github-dplyr-starwars-csv.jpg rename to _posts/2024-09-22-fetch-files-web/github-dplyr-starwars-csv.jpg diff --git a/_posts/2024-09-01-fetch-files-web/github-dplyr-starwars-raw.jpg b/_posts/2024-09-22-fetch-files-web/github-dplyr-starwars-raw.jpg similarity index 100% rename from _posts/2024-09-01-fetch-files-web/github-dplyr-starwars-raw.jpg rename to _posts/2024-09-22-fetch-files-web/github-dplyr-starwars-raw.jpg diff --git a/_posts/2024-09-01-fetch-files-web/github-dplyr-starwars.jpg b/_posts/2024-09-22-fetch-files-web/github-dplyr-starwars.jpg similarity index 100% rename from _posts/2024-09-01-fetch-files-web/github-dplyr-starwars.jpg rename to _posts/2024-09-22-fetch-files-web/github-dplyr-starwars.jpg diff --git a/_posts/2024-09-01-fetch-files-web/github-gist-stroop.jpg b/_posts/2024-09-22-fetch-files-web/github-gist-stroop.jpg similarity index 100% rename from _posts/2024-09-01-fetch-files-web/github-gist-stroop.jpg rename to _posts/2024-09-22-fetch-files-web/github-gist-stroop.jpg diff --git a/_posts/2024-09-01-fetch-files-web/osf-MixedModels-dyestuff-download.jpg b/_posts/2024-09-22-fetch-files-web/osf-MixedModels-dyestuff-download.jpg similarity index 100% rename from _posts/2024-09-01-fetch-files-web/osf-MixedModels-dyestuff-download.jpg rename to _posts/2024-09-22-fetch-files-web/osf-MixedModels-dyestuff-download.jpg diff --git a/_posts/2024-09-01-fetch-files-web/osf-MixedModels-dyestuff.jpg b/_posts/2024-09-22-fetch-files-web/osf-MixedModels-dyestuff.jpg similarity index 100% rename from _posts/2024-09-01-fetch-files-web/osf-MixedModels-dyestuff.jpg rename to 
_posts/2024-09-22-fetch-files-web/osf-MixedModels-dyestuff.jpg diff --git a/docs/blog.html b/docs/blog.html index e61d3dc8..3daf1b4e 100644 --- a/docs/blog.html +++ b/docs/blog.html @@ -2784,6 +2784,22 @@

${suggestion.title}

Blog Posts

+ + + +
+ +
+
+

Read files on the web into R

+
+
tutorial
+
+

For the download-button-averse of us

+
+
- +

2023 Year in Review

@@ -3411,7 +3427,7 @@

Categories

  • Articles -(35) +(36)
  • args @@ -3531,7 +3547,7 @@

    Categories

  • tutorial -(8) +(9)
  • typography diff --git a/docs/blog.xml b/docs/blog.xml index b20bc019..c5ad6754 100644 --- a/docs/blog.xml +++ b/docs/blog.xml @@ -12,7 +12,17 @@ https://yjunechoe.github.io Distill - Sun, 21 Jul 2024 00:00:00 +0000 + Sun, 22 Sep 2024 00:00:00 +0000 + + Read files on the web into R + June Choe + https://yjunechoe.github.io/posts/2024-09-22-fetch-files-web + For the download-button-averse of us + tutorial + https://yjunechoe.github.io/posts/2024-09-22-fetch-files-web + Sun, 22 Sep 2024 00:00:00 +0000 + + Naming patterns for boolean enums June Choe diff --git a/docs/posts/2024-09-22-fetch-files-web/github-dplyr-starwars-csv.jpg b/docs/posts/2024-09-22-fetch-files-web/github-dplyr-starwars-csv.jpg new file mode 100644 index 00000000..cff11217 Binary files /dev/null and b/docs/posts/2024-09-22-fetch-files-web/github-dplyr-starwars-csv.jpg differ diff --git a/docs/posts/2024-09-22-fetch-files-web/github-dplyr-starwars-raw.jpg b/docs/posts/2024-09-22-fetch-files-web/github-dplyr-starwars-raw.jpg new file mode 100644 index 00000000..1b12043d Binary files /dev/null and b/docs/posts/2024-09-22-fetch-files-web/github-dplyr-starwars-raw.jpg differ diff --git a/docs/posts/2024-09-22-fetch-files-web/github-dplyr-starwars.jpg b/docs/posts/2024-09-22-fetch-files-web/github-dplyr-starwars.jpg new file mode 100644 index 00000000..f455dcd0 Binary files /dev/null and b/docs/posts/2024-09-22-fetch-files-web/github-dplyr-starwars.jpg differ diff --git a/docs/posts/2024-09-22-fetch-files-web/github-gist-stroop.jpg b/docs/posts/2024-09-22-fetch-files-web/github-gist-stroop.jpg new file mode 100644 index 00000000..80fec00c Binary files /dev/null and b/docs/posts/2024-09-22-fetch-files-web/github-gist-stroop.jpg differ diff --git a/docs/posts/2024-09-22-fetch-files-web/index.html b/docs/posts/2024-09-22-fetch-files-web/index.html new file mode 100644 index 00000000..68928f2b --- /dev/null +++ b/docs/posts/2024-09-22-fetch-files-web/index.html @@ -0,0 +1,3332 @@ + + + + + + + + + + + + + + + + + + + + +June Choe: Read files on the web into R + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +

    Read files on the web into R

    + + + + +

    For the download-button-averse of us

    +
    + + + +
    + +

    Every so often I’ll have a link to some file on hand and want to read it in R without going out of my way to browse the web page, find a download link, download it somewhere onto my computer, grab the path to it, and then finally read it into R.

    +

    Over the years I’ve accumulated some tricks to get data into R “straight from a url”, even if the url does not point to the raw file contents itself. The method varies between data sources though, and I have a hard time keeping track of them in my head, so I thought I’d write some of these down for my own reference. This is not meant to be comprehensive though - keep in mind that I’m someone who primarily works with tabular data and interface with GitHub and OSF as data repositories.

    +

    GitHub (public repos)

    +

GitHub has a nice point-and-click interface for browsing repositories and previewing files. For example, you can navigate to the dplyr::starwars dataset from tidyverse/dplyr, at https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv:

    +
    +

    +
    +

    That url, despite ending in a .csv, does not point to the raw data - instead, the contents of the page is a full html document:

    +
    +
    +
    rvest::read_html("https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv")
    +
    +
    +
      {html_document}
    +  <html lang="en" data-color-mode="auto" data-light-theme="light" ...
    +  [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
    +  [2] <body class="logged-out env-production page-responsive" style="word-wrap: ...
    +

    To actually point to the csv contents, we want to click on the Raw button to the top-right corner of the preview:

    +
    +

    +
    +

    That gets us to the comma separated values we want, which is at a new url https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv:

    +
    +

    +
    +

    We can then read from that URL at “raw.githubusercontent.com/…” using read.csv():

    +
    +
    +
    read.csv("https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv") |> 
    +  dplyr::glimpse()
    +
    +
      Rows: 87
    +  Columns: 14
    +  $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
    +  $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
    +  $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
    +  $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
    +  $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
    +  $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
    +  $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
    +  $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
    +  $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
    +  $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
    +  $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
    +  $ films      <chr> "A New Hope, The Empire Strikes Back, Return of the Jedi, R…
    +  $ vehicles   <chr> "Snowspeeder, Imperial Speeder Bike", "", "", "", "Imperial…
    +  $ starships  <chr> "X-wing, Imperial shuttle", "", "", "TIE Advanced x1", "", …
    +
    +

    But note that this method of “click the Raw button to get the corresponding raw.githubusercontent.com/… url to the file contents” will not work for file formats that cannot be displayed in plain text (clicking the button will instead download the file via your browser). So sometimes (especially when you have a binary file) you have to construct this “remote-readable” url to the file manually.

    +

    Fortunately, going from one link to the other is pretty formulaic. To demonstrate the difference with the url for the starwars dataset again:

    +
    +
    +
    emphatic::hl_diff(
    +  "https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv",
    +  "https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv"
    +)
    +
    +
    +[1] "https://    github           .com/tidyverse/dplyr/blob/main/data-raw/starwars.csv"
    [1] "https://raw.githubusercontent.com/tidyverse/dplyr /main/data-raw/starwars.csv" +
    +
    +
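For a binary file, where clicking Raw would just trigger a download, a minimal sketch of building the raw url by hand (this borrows the parquet file that reappears in the duckdb section below; the `paste()` template is the only real point here):

```{r, eval=FALSE}
# Pieces of the "remote-readable" url
user   <- "yjunechoe"
repo   <- "repetition_events"
branch <- "master"
path   <- "data/tokens_data/childID%3D1/part-7.parquet"  # a binary (parquet) file

raw_url <- paste("https://raw.githubusercontent.com", user, repo, branch, path, sep = "/")

# Hand the constructed url to whatever reader understands the format
arrow::read_parquet(raw_url) |> 
  head(5)
```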

    GitHub (gists)

    +

    It’s a similar idea with GitHub Gists, where I sometimes like to store small toy datasets for use in demos. For example, here’s a link to a simulated data for a Stroop experiment stroop.csv: https://gist.github.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6.

    +

    But that’s again a full-on webpage. The url which actually hosts the csv contents is https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv, which you can again get to by clicking the Raw button at the top-right corner of the gist

    +
    +

    +
    +

    But actually, that long link you get by default points to the current commit, specifically. If you instead want the link to be kept up to date with the most recent commit, you can omit the second hash that comes after raw/:

    +
    +
    +
    emphatic::hl_diff(
    +  "https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv",
    +  "https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/stroop.csv"
    +)
    +
    +
    +[1] "https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv"
    [1] "https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw /stroop.csv" +
    +
    +

    In practice, I don’t use gists to store replicability-sensitive data, so I prefer to just use the shorter link that’s not tied to a specific commit.

    +
    +
    +
    read.csv("https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/stroop.csv") |> 
    +  dplyr::glimpse()
    +
    +
      Rows: 240
    +  Columns: 5
    +  $ subj      <chr> "S01", "S01", "S01", "S01", "S01", "S01", "S01", "S01", "S02…
    +  $ word      <chr> "blue", "blue", "green", "green", "red", "red", "yellow", "y…
    +  $ condition <chr> "match", "mismatch", "match", "mismatch", "match", "mismatch…
    +  $ accuracy  <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
    +  $ RT        <int> 400, 549, 576, 406, 296, 231, 433, 1548, 561, 1751, 286, 710…
    +
    +

    GitHub (private repos)

    +

    We now turn to the harder problem of accessing a file in a private GitHub repository. If you already have the GitHub webpage open and you’re signed in, you can follow the same step of copying the link that the Raw button redirects to.

    +

    Except this time, when you open the file at that url (assuming it can display in plain text), you’ll see the url come with a “token” attached at the end (I’ll show an example further down). This token is necessary to remotely access the data in a private repo. Once a token is generated, the file can be accessed using that token from anywhere, but note that it will expire at some point as GitHub refreshes tokens periodically (so treat them as if they’re for single use).

    +

    For a more robust approach, you can use the GitHub Contents API. If you have your credentials set up in {gh} (which you can check with gh::gh_whoami()), you can request a token-tagged url to the private file using the syntax:1

    +
    +
    +
    gh::gh("/repos/{user}/{repo}/contents/{path}")$download_url
    +
    +
    +

    Note that this is actually also a general solution to getting a url to GitHub file contents. So for example, even without any credentials set up you can point to dplyr’s starwars.csv since that’s publicly accessible. This method produces the same “raw.githubusercontent.com/…” url we saw earlier:

    +
    +
    +
    gh::gh("/repos/tidyverse/dplyr/contents/data-raw/starwars.csv")$download_url
    +
    +
      [1] "https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv"
    +
    +

    Now for a demonstration with a private repo, here is one of mine that you cannot access https://github.com/yjunechoe/my-super-secret-repo. But because I set up my credentials in {gh}, I can generate a link to a content within that repo with the access token attached (“?token=…”):

    +
    +
    +
    gh::gh("/repos/yjunechoe/my-super-secret-repo/contents/README.md")$download_url |> 
    +  # truncating
    +  gsub(x = _, "^(.{100}).*", "\\1...")
    +
    +
      [1] "https://raw.githubusercontent.com/yjunechoe/my-super-secret-repo/main/README.md?token=AMTCUR2JPXCIX5..."
    +
    +

    I can then use this url to read the private file:2

    +
    +
    +
    gh::gh("/repos/yjunechoe/my-super-secret-repo/contents/README.md")$download_url |> 
    +  readLines()
    +
    +
      [1] "Surprise!"
    +
    +
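If you do this a lot, the two-step pattern of requesting the `download_url` and then reading from it can be wrapped in a small helper. A sketch, assuming `gh::gh()` fills the `{user}`/`{repo}`/`{path}` placeholders from named arguments and that the target file is a csv (the function name here is made up):

```{r, eval=FALSE}
read_github_csv <- function(user, repo, path) {
  tagged_url <- gh::gh(
    "/repos/{user}/{repo}/contents/{path}",
    user = user, repo = repo, path = path
  )$download_url
  read.csv(tagged_url)
}

# Works for public files too, e.g. the starwars csv from earlier
read_github_csv("tidyverse", "dplyr", "data-raw/starwars.csv") |> 
  dplyr::glimpse()
```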

    OSF

    +

    OSF (the Open Science Framework) is another data repository that I interact with a lot, and reading files off of OSF follows a similar strategy to fetching public files on GitHub.

    +

    Consider, for example, the dyestuff.arrow file in the OSF repository for MixedModels.jl. Browsing the repository through the point-and-click interface can get you to the page for the file at https://osf.io/9vztj/, where it shows:

    +
    +

    +
    +

    The download button can be found inside the dropdown menubar to the right:

    +
    +

    +
    +

    But instead of clicking on the icon (which will start a download via the browser), we can grab the embedded link address: https://osf.io/download/9vztj/. That url can then be passed directly into a read function:

    +
    +
    +
    arrow::read_feather("https://osf.io/download/9vztj/") |> 
    +  dplyr::glimpse()
    +
    +
      Rows: 30
    +  Columns: 2
    +  $ batch <fct> A, A, A, A, A, B, B, B, B, B, C, C, C, C, C, D, D, D, D, D, E, E…
    +  $ yield <int> 1545, 1440, 1440, 1520, 1580, 1540, 1555, 1490, 1560, 1495, 1595…
    +
    +

    You might have already caught on to this, but the pattern is to simply point to osf.io/download/ instead of osf.io/.

    +
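Since that rewrite is purely mechanical, it can live in a tiny helper. A sketch (the function is hypothetical, demonstrated on the dyestuff.arrow link from above):

```{r, eval=FALSE}
# Rewrite "https://osf.io/<id>..." into "https://osf.io/download/<id>..."
osf_download_url <- function(url) {
  sub("^https://osf\\.io/", "https://osf.io/download/", url)
}

osf_download_url("https://osf.io/9vztj/")
#> [1] "https://osf.io/download/9vztj/"

arrow::read_feather(osf_download_url("https://osf.io/9vztj/")) |> 
  dplyr::glimpse()
```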

    This method also works for view-only links to anonymized OSF projects as well. For example, this is an anonymized link to a csv file from one of my projects https://osf.io/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad. Navigating to this link will show a web preview of the csv file contents.

    +

    By inserting /download into this url, we can read the csv file contents directly:

    +
    +
    +
    read.csv("https://osf.io/download/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad") |> 
    +  head()
    +
    +
            Item  plaus_bias trans_bias
    +  1 Awakened -0.29631221 -1.2200901
    +  2   Calmed  0.09877074 -0.4102332
    +  3   Choked  1.28401957 -1.4284905
    +  4  Dressed -0.59262442 -1.2087228
    +  5   Failed -0.98770736  0.1098839
    +  6  Groomed -1.08647810  0.9889550
    +
    +

    See also the {osfr} package for a more principled interface to OSF.

    +

    Aside: Can’t go wrong with a copy-paste!

    +

    Reading remote files aside, I think it’s severely underrated how base R has a readClipboard() function and a collection of read.*() functions which can also read directly from a "clipboard" connection.3

    +

    I sometimes do this for html/markdown summary tables that a website might display, or sometimes even for entire excel/googlesheets tables after doing a select-all + copy. For such relatively small chunks of data that you just want to quickly get into R, you can lean on base R’s clipboard functionalities.

    +

    For example, given this markdown table:

    +
    +
    +
    aggregate(mtcars, mpg ~ cyl, mean) |> 
    +  knitr::kable()
    +
      cyl       mpg
        4  26.66364
        6  19.74286
        8  15.10000
    +
    +

    You can copy its contents and run the following code to get that data back as an R data frame:

    +
    +
    +
    read.delim("clipboard")
    +# Or, `read.delim(text = readClipboard())`
    +
    +
    +
    +
        cyl      mpg
    +  1   4 26.66364
    +  2   6 19.74286
    +  3   8 15.10000
    +
    +

    If you’re instead copying something flat like a list of numbers or strings, you can also use scan() and specify the appropriate sep to get that data back as a vector:4

    +
    +
    +
    paste(1:10, collapse = ", ") |> 
    +  cat()
    +
    +
      1, 2, 3, 4, 5, 6, 7, 8, 9, 10
    +
    +
    +
    +
    scan("clipboard", sep = ",")
    +# Or, `scan(textConnection(readClipboard()), sep = ",")`
    +
    +
    +
    +
       [1]  1  2  3  4  5  6  7  8  9 10
    +
    +

    It should be noted though that parsing clipboard contents is not a robust feature in base R. If you want a more principled approach to reading data from clipboard, you should use {datapasta}. And for printing data for others to copy-paste into R, use {constructive}. See also {clipr} which extends clipboard read/write functionalities.

    +

    Other goodies

    +

⚠️ What lies ahead is denser than the kinds of “low-tech” advice I wrote about above.

    +

    Streaming with {duckdb}

    +

    One caveat to all the “read from web” approaches I covered above is that it often does not actually circumvent the action of downloading the file onto your computer. For example, when you read a file from “raw.githubusercontent.com/…” with read.csv(), there is an implicit download.file() of the data into the current R session’s tempdir().

    +
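In other words, something roughly like the following happens behind the scenes (a simplified sketch in base R, not the actual internals of any particular reader):

```{r, eval=FALSE}
url <- "https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv"

# "Reading from a url" typically amounts to: fetch to a temporary file, then read that
tmp <- tempfile(fileext = ".csv")
download.file(url, tmp, quiet = TRUE)
read.csv(tmp) |> 
  dplyr::glimpse()
```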

An alternative that actually reads the data straight into memory is streaming. Streaming is more a feature of database languages, but there’s good integration of such tools with R, so this option is available from within R as well.

    +

    Here, I briefly outline what I learned from (mostly) reading a blog post by François Michonneau, which covers how to stream remote files using {duckdb}. It’s pretty comprehensive but I wanted to make a template for just one method that I prefer.

    +

    We start by loading the {duckdb} package, creating a connection to an in-memory database, installing the httpfs extension (if not installed already), and loading httpfs for the database.

    +
    +
    +
    library(duckdb)
    +con <- dbConnect(duckdb())
    +# dbExecute(con, "INSTALL httpfs;") # You may also need to "INSTALL parquet;"
    +invisible(dbExecute(con, "LOAD httpfs;"))
    +
    +
    +

    For this example I will use a parquet file from one of my projects which is hosted on GitHub: https://github.com/yjunechoe/repetition_events. The data I want to read is at the relative path /data/tokens_data/childID=1/part-7.parquet. I went ahead and converted that into the “raw contents” url shown below:

    +
    +
    +
    # A parquet file of tokens from a sample of child-directed speech
    +file <- "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet"
    +
    +# For comparison, reading its contents with {arrow}
    +arrow::read_parquet(file) |> 
    +  head(5)
    +
    +
      # A tibble: 5 × 3
    +    utterance_id gloss   part_of_speech
    +           <int> <chr>   <chr>         
    +  1            1 www     ""            
    +  2            2 bye     "co"          
    +  3            3 mhm     "co"          
    +  4            4 Mommy's "n:prop"      
    +  5            4 here    "adv"
    +
    +

    In duckdb, the httpfs extension we loaded above allows PARQUET_SCAN5 to read a remote parquet file.


query1 <- glue::glue_sql("
  SELECT *
  FROM PARQUET_SCAN({`file`})
  LIMIT 5;
", .con = con)
cat(query1)

  SELECT *
  FROM PARQUET_SCAN("https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet")
  LIMIT 5;

dbGetQuery(con, query1)

    utterance_id   gloss part_of_speech
  1            1     www               
  2            2     bye             co
  3            3     mhm             co
  4            4 Mommy's         n:prop
  5            4    here            adv

    And actually, in my case, the parquet file represents one of many files that had been previously split up via hive partitioning. To preserve this metadata even as I read in just a single file, I need to do two things:


1. Specify hive_partitioning=true when calling PARQUET_SCAN.
2. Ensure that the hive-partitioning syntax is represented in the url with URLdecode() (since the = character can sometimes be escaped, as in this case).

emphatic::hl_diff(file, URLdecode(file))

[1] "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet"
[1] "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID= 1/part-7.parquet"

    With that, the data now shows that the observations are from child #1 in the sample.


file <- URLdecode(file)
query2 <- glue::glue_sql("
  SELECT *
  FROM PARQUET_SCAN(
    {`file`},
    hive_partitioning=true
  )
  LIMIT 5;
", .con = con)
cat(query2)

  SELECT *
  FROM PARQUET_SCAN(
    "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=1/part-7.parquet",
    hive_partitioning=true
  )
  LIMIT 5;

dbGetQuery(con, query2)

    utterance_id   gloss part_of_speech childID
  1            1     www                      1
  2            2     bye             co       1
  3            3     mhm             co       1
  4            4 Mommy's         n:prop       1
  5            4    here            adv       1

    To do this more programmatically over all parquet files under /tokens_data in the repository, we need to transition to using the GitHub Trees API. The idea is similar to using the Contents API but now we are requesting a list of all files using the following syntax:


gh::gh("/repos/{user}/{repo}/git/trees/{branch/tag/commitSHA}?recursive=true")$tree

    To get the file tree of the repo on the master branch, we use:


files <- gh::gh("/repos/yjunechoe/repetition_events/git/trees/master?recursive=true")$tree

    With recursive=true, this returns all files in the repo. Then, we can filter for just the parquet files we want with a little regex:


parquet_files <- sapply(files, `[[`, "path") |> 
  grep(x = _, pattern = ".*/tokens_data/.*parquet$", value = TRUE)
length(parquet_files)

  [1] 70

head(parquet_files)

  [1] "data/tokens_data/childID=1/part-7.parquet" 
  [2] "data/tokens_data/childID=10/part-0.parquet"
  [3] "data/tokens_data/childID=11/part-6.parquet"
  [4] "data/tokens_data/childID=12/part-3.parquet"
  [5] "data/tokens_data/childID=13/part-1.parquet"
  [6] "data/tokens_data/childID=14/part-2.parquet"

    Finally, we complete the path using the “https://raw.githubusercontent.com/…” url:


parquet_files <- paste0(
  "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/",
  parquet_files
)
head(parquet_files)

  [1] "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=1/part-7.parquet" 
  [2] "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=10/part-0.parquet"
  [3] "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=11/part-6.parquet"
  [4] "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=12/part-3.parquet"
  [5] "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=13/part-1.parquet"
  [6] "https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=14/part-2.parquet"

Back on duckdb, we can use PARQUET_SCAN to read multiple files by supplying a vector ['file1.parquet', 'file2.parquet', ...].^6 This time, we also ask for a quick computation to count the number of distinct childIDs:


query3 <- glue::glue_sql("
  SELECT count(DISTINCT childID)
  FROM PARQUET_SCAN(
    [{parquet_files*}],
    hive_partitioning=true
  )
", .con = con)
cat(gsub("^(.{80}).*(.{60})$", "\\1 ... \\2", query3))

  SELECT count(DISTINCT childID)
  FROM PARQUET_SCAN(
    ['https://raw.githubusercont ... data/childID=9/part-64.parquet'],
    hive_partitioning=true
  )

dbGetQuery(con, query3)

    count(DISTINCT childID)
  1                      70

This returns 70, which matches the length of the parquet_files vector listing the files that had been partitioned by childID.


For further analyses, we can CREATE TABLE^7 the data in our in-memory database con:


query4 <- glue::glue_sql("
  CREATE TABLE tokens_data AS
  SELECT *
  FROM PARQUET_SCAN([{parquet_files*}], hive_partitioning=true)
", .con = con)
invisible(dbExecute(con, query4))
dbListTables(con)

  [1] "tokens_data"

    That lets us reference the table via dplyr::tbl(), at which point we can switch over to another high-level interface like {dplyr} to query it using its familiar functions:


library(dplyr)
tokens_data <- tbl(con, "tokens_data")

# Q: What are the most common verbs spoken to children in this sample?
tokens_data |> 
  filter(part_of_speech == "v") |> 
  count(gloss, sort = TRUE) |> 
  head() |> 
  collect()

  # A tibble: 6 × 2
    gloss     n
    <chr> <dbl>
  1 go    13614
  2 see   13114
  3 do    11829
  4 have  10794
  5 want  10560
  6 put    9190

    Combined, here’s one (hastily put together) attempt at wrapping this workflow into a function:


load_dataset_from_gh <- function(con, tblname, user, repo, branch, regex,
                                 partition = TRUE, lazy = TRUE) {
  
  allfiles <- gh::gh(glue::glue("/repos/{user}/{repo}/git/trees/{branch}?recursive=true"))$tree
  files_relpath <- grep(regex, sapply(allfiles, `[[`, "path"), value = TRUE)
  # Use the actual Contents API here instead, if the repo is private
  files <- glue::glue("https://raw.githubusercontent.com/{user}/{repo}/{branch}/{files_relpath}")
  
  type <- if (lazy) quote(VIEW) else quote(TABLE)
  partition <- as.integer(partition)
  
  dbExecute(con, "LOAD httpfs;")
  dbExecute(con, glue::glue_sql("
    CREATE {type} {`tblname`} AS
    SELECT *
    FROM PARQUET_SCAN([{files*}], hive_partitioning={partition})
  ", .con = con)) # uses the `files` urls built above from the Trees API
  
  invisible(TRUE)

}

con2 <- dbConnect(duckdb())
load_dataset_from_gh(
  con = con2,
  tblname = "tokens_data",
  user = "yjunechoe",
  repo = "repetition_events",
  branch = "master",
  regex = ".*data/tokens_data/.*parquet$"
)
tbl(con2, "tokens_data")

  # Source:   table<tokens_data> [?? x 4]
  # Database: DuckDB v1.0.0 [jchoe@Windows 10 x64:R 4.4.1/:memory:]
     utterance_id gloss   part_of_speech childID
            <int> <chr>   <chr>            <dbl>
   1            1 www     ""                   1
   2            2 bye     "co"                 1
   3            3 mhm     "co"                 1
   4            4 Mommy's "n:prop"             1
   5            4 here    "adv"                1
   6            5 wanna   "mod:aux"            1
   7            5 sit     "v"                  1
   8            5 down    "adv"                1
   9            6 there   "adv"                1
  10            7 let's   "v"                  1
  # ℹ more rows

    Other sources for data


In writing this blog post, I’m indebted to all the knowledgeable folks on Mastodon who suggested their own recommended tools and workflows for various kinds of remote data. Unfortunately, I’m not familiar enough with most of them to do them justice, but I still wanted to record the suggestions I got from there for posterity.


    First, a post about reading remote files would not be complete without a mention of the wonderful {googlesheets4} package for reading from Google Sheets. I debated whether I should include a larger discussion of {googlesheets4}, and despite using it quite often myself I ultimately decided to omit it for the sake of space and because the package website is already very comprehensive. I would suggest starting from the Get Started vignette if you are new and interested.
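
For a taste of it, reading a public, link-shared sheet usually takes just a couple of lines (a minimal sketch — the sheet url here is a placeholder, and gs4_deauth() is only appropriate for sheets that don't require sign-in):

library(googlesheets4)
gs4_deauth() # skip authentication for publicly shared sheets
# Placeholder url: substitute a link to a real Google Sheet
read_sheet("https://docs.google.com/spreadsheets/d/<sheet-id>/edit")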


    Second, along the lines of {osfr}, there are other similar rOpensci packages for retrieving data from the kinds of data sources that may be of interest to academics, such as {deposits} for zenodo and figshare, and {piggyback} for GitHub release assets (Maëlle Salmon’s comment pointed me to the first two; I responded with some of my experiences). I was also reminded that {pins} exists - I’m not familiar with it myself so I thought I wouldn’t write anything for it here BUT Isabella Velásquez came in clutch sharing a recent talk on dynamically loading up-to-date data with {pins} which is a great demo of the unique strengths of {pins}.
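
I haven't used these enough to vouch for the details, but to sketch the kind of usage they enable (the repo, url, and pin name below are placeholders, and the calls reflect my best understanding of the respective APIs):

# {piggyback}: fetch a file attached to a GitHub release
piggyback::pb_download("data.csv", repo = "user/repo", tag = "v1.0.0")

# {pins}: read a pin that someone has published at a url
board <- pins::board_url(c(my_data = "https://example.com/pins/my_data/"))
pins::pin_read(board, "my_data")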


    Lastly, I inadvertently(?) started some discussion around remotely accessing spatial files. I don’t work with spatial data at all but I can totally imagine how the hassle of the traditional click-download-find-load workflow would be even more pronounced for spatial data which are presumably much larger in size and more difficult to preview. On this note, I’ll just link to Carl Boettiger’s comment about the fact that GDAL has a virtual file system that you can interface with from R packages wrapping this API (ex: {gdalraster}), and to Michael Sumner’s comment/gist + Chris Toney’s comment on the fact that you can even use this feature to stream non-spatial data!
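
For the curious, the virtual file system trick boils down to prefixing a url with GDAL's /vsicurl/ handler (a hedged sketch with a placeholder file — see the linked comments for real workflows):

# Read a remote, GDAL-readable file by streaming it over http
sf::read_sf("/vsicurl/https://example.com/path/to/file.gpkg")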


    Miscellaneous tips and tricks


    I also have some random tricks that are more situational. Unfortunately, I can only recall like 20% of them at any given moment, so I’ll be updating this space as more come back to me:


• When reading remote .rda or .RData files with load(), you may need to wrap the link in url() first (ref: stackoverflow).

• {vroom} can remotely read gzipped files, without having to download.file() and unzip() first.

• {curl}, of course, will always have the most comprehensive set of low-level tools you need to read any arbitrary data remotely. For example, using curl::curl_fetch_memory() to read the dplyr::starwars data again from the GitHub raw contents link:

fetched <- curl::curl_fetch_memory(
  "https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv"
)
read.csv(text = rawToChar(fetched$content)) |> 
  dplyr::glimpse()

  Rows: 87
  Columns: 14
  $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
  $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
  $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
  $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
  $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
  $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
  $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
  $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
  $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
  $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
  $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
  $ films      <chr> "A New Hope, The Empire Strikes Back, Return of the Jedi, R…
  $ vehicles   <chr> "Snowspeeder, Imperial Speeder Bike", "", "", "", "Imperial…
  $ starships  <chr> "X-wing, Imperial shuttle", "", "", "TIE Advanced x1", "", …

• Even if you’re going the route of downloading the file first, curl::multi_download() can offer big performance improvements over download.file().^8 Many {curl} functions can also handle retries and stop/resumes, which is cool too.

• {httr2} can capture a continuous data stream with httr2::req_perform_stream() up to a set time or size (see the sketch below).

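A rough sketch of those last two, with placeholder urls (and a callback that just counts bytes):

# {curl}: download many files in parallel
urls <- c("https://example.com/a.csv", "https://example.com/b.csv")
curl::multi_download(urls, destfiles = file.path(tempdir(), basename(urls)))

# {httr2}: process a continuous stream in chunks, for up to 5 seconds
httr2::request("https://example.com/stream") |> 
  httr2::req_perform_stream(
    callback = function(x) { cat(length(x), "bytes\n"); TRUE }, # TRUE = keep streaming
    timeout_sec = 5
  )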

    sessionInfo()


sessionInfo()

  R version 4.4.1 (2024-06-14 ucrt)
  Platform: x86_64-w64-mingw32/x64
  Running under: Windows 11 x64 (build 22631)
  
  Matrix products: default
  
  
  locale:
  [1] LC_COLLATE=English_United States.utf8 
  [2] LC_CTYPE=English_United States.utf8   
  [3] LC_MONETARY=English_United States.utf8
  [4] LC_NUMERIC=C                          
  [5] LC_TIME=English_United States.utf8    
  
  time zone: America/New_York
  tzcode source: internal
  
  attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     
  
  other attached packages:
  [1] dplyr_1.1.4        duckdb_1.0.0       DBI_1.2.3          ggplot2_3.5.1.9000
  
  loaded via a namespace (and not attached):
   [1] rappdirs_0.3.3    sass_0.4.9        utf8_1.2.4        generics_0.1.3   
   [5] xml2_1.3.6        distill_1.6       digest_0.6.35     magrittr_2.0.3   
   [9] evaluate_0.24.0   grid_4.4.1        blob_1.2.4        fastmap_1.1.1    
  [13] jsonlite_1.8.8    processx_3.8.4    chromote_0.3.1    ps_1.7.5         
  [17] promises_1.3.0    httr_1.4.7        rvest_1.0.4       purrr_1.0.2      
  [21] fansi_1.0.6       scales_1.3.0      httr2_1.0.3.9000  jquerylib_0.1.4  
  [25] cli_3.6.2         rlang_1.1.4       dbplyr_2.5.0      gitcreds_0.1.2   
  [29] bit64_4.0.5       munsell_0.5.1     withr_3.0.1       cachem_1.0.8     
  [33] yaml_2.3.8        tools_4.4.1       tzdb_0.4.0        memoise_2.0.1    
  [37] colorspace_2.1-1  assertthat_0.2.1  curl_5.2.1        vctrs_0.6.5      
  [41] R6_2.5.1          lifecycle_1.0.4   emphatic_0.1.8    bit_4.0.5        
  [45] arrow_16.1.0      pkgconfig_2.0.3   pillar_1.9.0      bslib_0.7.0      
  [49] later_1.3.2       gtable_0.3.5      glue_1.7.0        gh_1.4.0         
  [53] Rcpp_1.0.12       xfun_0.47         tibble_3.2.1      tidyselect_1.2.1 
  [57] highr_0.11        rstudioapi_0.16.0 knitr_1.47        htmltools_0.5.8.1
  [61] websocket_1.4.1   rmarkdown_2.27    compiler_4.4.1    downlit_0.4.4

1. Thanks @tanho for pointing me to this at the R4DS/DSLC slack.↩︎

2. Note that the API will actually generate a new token every time you send a request (and again, these tokens will expire with time).↩︎

3. The special value "clipboard" works for most base-R read functions that take a file or con argument.↩︎

4. Thanks @coolbutuseless for pointing me to textConnection()!↩︎

5. Or READ_PARQUET - same thing.↩︎

6. We can also get this formatting with a combination of shQuote() and toString().↩︎

7. Whereas CREATE TABLE results in a physical copy of the data in memory, CREATE VIEW will dynamically fetch the data from the source every time you query the table. If the data fits into memory (as in this case), I prefer CREATE TABLE as queries will be much faster (though you pay up-front for the time copying the data). If the data is larger than memory, CREATE VIEW will be your only option.↩︎

8. See an example implemented for {openalexR}, an API package.↩︎
    +
    + + + +
    + +
    +
    + + + + + +
    + + + + + + + + + + + diff --git a/docs/posts/2024-09-22-fetch-files-web/osf-MixedModels-dyestuff-download.jpg b/docs/posts/2024-09-22-fetch-files-web/osf-MixedModels-dyestuff-download.jpg new file mode 100644 index 00000000..33565d0f Binary files /dev/null and b/docs/posts/2024-09-22-fetch-files-web/osf-MixedModels-dyestuff-download.jpg differ diff --git a/docs/posts/2024-09-22-fetch-files-web/osf-MixedModels-dyestuff.jpg b/docs/posts/2024-09-22-fetch-files-web/osf-MixedModels-dyestuff.jpg new file mode 100644 index 00000000..e35df02b Binary files /dev/null and b/docs/posts/2024-09-22-fetch-files-web/osf-MixedModels-dyestuff.jpg differ diff --git a/docs/posts/posts.json b/docs/posts/posts.json index a5832764..cdeebe72 100644 --- a/docs/posts/posts.json +++ b/docs/posts/posts.json @@ -1,4 +1,23 @@ [ + { + "path": "posts/2024-09-22-fetch-files-web/", + "title": "Read files on the web into R", + "description": "For the download-button-averse of us", + "author": [ + { + "name": "June Choe", + "url": {} + } + ], + "date": "2024-09-22", + "categories": [ + "tutorial" + ], + "contents": "\r\n\r\nContents\r\nGitHub (public repos)\r\nGitHub (gists)\r\nGitHub (private repos)\r\nOSF\r\nAside: Can’t go wrong with a copy-paste!\r\nOther goodies\r\nStreaming with {duckdb}\r\nOther sources for data\r\nMiscellaneous tips and tricks\r\n\r\nsessionInfo()\r\n\r\nEvery so often I’ll have a link to some file on hand and want to read it in R without going out of my way to browse the web page, find a download link, download it somewhere onto my computer, grab the path to it, and then finally read it into R.\r\nOver the years I’ve accumulated some tricks to get data into R “straight from a url”, even if the url does not point to the raw file contents itself. The method varies between data sources though, and I have a hard time keeping track of them in my head, so I thought I’d write some of these down for my own reference. This is not meant to be comprehensive though - keep in mind that I’m someone who primarily works with tabular data and interface with GitHub and OSF as data repositories.\r\nGitHub (public repos)\r\nGitHub has nice a point-and-click interface for browsing repositories and previewing files. 
For example, you can navigate to the dplyr::starwars dataset from tidyverse/dplyr, at https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv:\r\n\r\n\r\n\r\nThat url, despite ending in a .csv, does not point to the raw data - instead, the contents of the page is a full html document:\r\n\r\n\r\nrvest::read_html(\"https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv\")\r\n\r\n\r\n {html_document}\r\n \\n \r\n dplyr::glimpse()\r\n\r\n Rows: 87\r\n Columns: 14\r\n $ name \"Luke Skywalker\", \"C-3PO\", \"R2-D2\", \"Darth Vader\", \"Leia Or…\r\n $ height 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…\r\n $ mass 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…\r\n $ hair_color \"blond\", NA, NA, \"none\", \"brown\", \"brown, grey\", \"brown\", N…\r\n $ skin_color \"fair\", \"gold\", \"white, blue\", \"white\", \"light\", \"light\", \"…\r\n $ eye_color \"blue\", \"yellow\", \"red\", \"yellow\", \"brown\", \"blue\", \"blue\",…\r\n $ birth_year 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …\r\n $ sex \"male\", \"none\", \"none\", \"male\", \"female\", \"male\", \"female\",…\r\n $ gender \"masculine\", \"masculine\", \"masculine\", \"masculine\", \"femini…\r\n $ homeworld \"Tatooine\", \"Tatooine\", \"Naboo\", \"Tatooine\", \"Alderaan\", \"T…\r\n $ species \"Human\", \"Droid\", \"Droid\", \"Human\", \"Human\", \"Human\", \"Huma…\r\n $ films \"A New Hope, The Empire Strikes Back, Return of the Jedi, R…\r\n $ vehicles \"Snowspeeder, Imperial Speeder Bike\", \"\", \"\", \"\", \"Imperial…\r\n $ starships \"X-wing, Imperial shuttle\", \"\", \"\", \"TIE Advanced x1\", \"\", …\r\n\r\nBut note that this method of “click the Raw button to get the corresponding raw.githubusercontent.com/… url to the file contents” will not work for file formats that cannot be displayed in plain text (clicking the button will instead download the file via your browser). So sometimes (especially when you have a binary file) you have to construct this “remote-readable” url to the file manually.\r\nFortunately, going from one link to the other is pretty formulaic. To demonstrate the difference with the url for the starwars dataset again:\r\n\r\n\r\nemphatic::hl_diff(\r\n \"https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv\",\r\n \"https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv\"\r\n)\r\n\r\n\r\n[1] \"https:// github .com/tidyverse/dplyr/blob/main/data-raw/starwars.csv\"[1] \"https://raw.githubusercontent.com/tidyverse/dplyr /main/data-raw/starwars.csv\"\r\n\r\n\r\nGitHub (gists)\r\nIt’s a similar idea with GitHub Gists, where I sometimes like to store small toy datasets for use in demos. For example, here’s a link to a simulated data for a Stroop experiment stroop.csv: https://gist.github.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6.\r\nBut that’s again a full-on webpage. The url which actually hosts the csv contents is https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv, which you can again get to by clicking the Raw button at the top-right corner of the gist\r\n\r\n\r\n\r\nBut actually, that long link you get by default points to the current commit, specifically. 
If you instead want the link to be kept up to date with the most recent commit, you can omit the second hash that comes after raw/:\r\n\r\n\r\nemphatic::hl_diff(\r\n \"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv\",\r\n \"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/stroop.csv\"\r\n)\r\n\r\n\r\n[1] \"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv\"[1] \"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw /stroop.csv\"\r\n\r\n\r\nIn practice, I don’t use gists to store replicability-sensitive data, so I prefer to just use the shorter link that’s not tied to a specific commit.\r\n\r\n\r\nread.csv(\"https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/stroop.csv\") |> \r\n dplyr::glimpse()\r\n\r\n Rows: 240\r\n Columns: 5\r\n $ subj \"S01\", \"S01\", \"S01\", \"S01\", \"S01\", \"S01\", \"S01\", \"S01\", \"S02…\r\n $ word \"blue\", \"blue\", \"green\", \"green\", \"red\", \"red\", \"yellow\", \"y…\r\n $ condition \"match\", \"mismatch\", \"match\", \"mismatch\", \"match\", \"mismatch…\r\n $ accuracy 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …\r\n $ RT 400, 549, 576, 406, 296, 231, 433, 1548, 561, 1751, 286, 710…\r\n\r\nGitHub (private repos)\r\nWe now turn to the harder problem of accessing a file in a private GitHub repository. If you already have the GitHub webpage open and you’re signed in, you can follow the same step of copying the link that the Raw button redirects to.\r\nExcept this time, when you open the file at that url (assuming it can display in plain text), you’ll see the url come with a “token” attached at the end (I’ll show an example further down). This token is necessary to remotely access the data in a private repo. Once a token is generated, the file can be accessed using that token from anywhere, but note that it will expire at some point as GitHub refreshes tokens periodically (so treat them as if they’re for single use).\r\nFor a more robust approach, you can use the GitHub Contents API. If you have your credentials set up in {gh} (which you can check with gh::gh_whoami()), you can request a token-tagged url to the private file using the syntax:1\r\n\r\n\r\ngh::gh(\"/repos/{user}/{repo}/contents/{path}\")$download_url\r\n\r\n\r\nNote that this is actually also a general solution to getting a url to GitHub file contents. So for example, even without any credentials set up you can point to dplyr’s starwars.csv since that’s publicly accessible. This method produces the same “raw.githubusercontent.com/…” url we saw earlier:\r\n\r\n\r\ngh::gh(\"/repos/tidyverse/dplyr/contents/data-raw/starwars.csv\")$download_url\r\n\r\n [1] \"https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv\"\r\n\r\nNow for a demonstration with a private repo, here is one of mine that you cannot access https://github.com/yjunechoe/my-super-secret-repo. 
But because I set up my credentials in {gh}, I can generate a link to a content within that repo with the access token attached (“?token=…”):\r\n\r\n\r\ngh::gh(\"/repos/yjunechoe/my-super-secret-repo/contents/README.md\")$download_url |> \r\n # truncating\r\n gsub(x = _, \"^(.{100}).*\", \"\\\\1...\")\r\n\r\n [1] \"https://raw.githubusercontent.com/yjunechoe/my-super-secret-repo/main/README.md?token=AMTCUR2JPXCIX5...\"\r\n\r\nI can then use this url to read the private file:2\r\n\r\n\r\ngh::gh(\"/repos/yjunechoe/my-super-secret-repo/contents/README.md\")$download_url |> \r\n readLines()\r\n\r\n [1] \"Surprise!\"\r\n\r\nOSF\r\nOSF (the Open Science Framework) is another data repository that I interact with a lot, and reading files off of OSF follows a similar strategy to fetching public files on GitHub.\r\nConsider, for example, the dyestuff.arrow file in the OSF repository for MixedModels.jl. Browsing the repository through the point-and-click interface can get you to the page for the file at https://osf.io/9vztj/, where it shows:\r\n\r\n\r\n\r\nThe download button can be found inside the dropdown menubar to the right:\r\n\r\n\r\n\r\nBut instead of clicking on the icon (which will start a download via the browser), we can grab the embedded link address: https://osf.io/download/9vztj/. That url can then be passed directly into a read function:\r\n\r\n\r\narrow::read_feather(\"https://osf.io/download/9vztj/\") |> \r\n dplyr::glimpse()\r\n\r\n Rows: 30\r\n Columns: 2\r\n $ batch A, A, A, A, A, B, B, B, B, B, C, C, C, C, C, D, D, D, D, D, E, E…\r\n $ yield 1545, 1440, 1440, 1520, 1580, 1540, 1555, 1490, 1560, 1495, 1595…\r\n\r\nYou might have already caught on to this, but the pattern is to simply point to osf.io/download/ instead of osf.io/.\r\nThis method also works for view-only links to anonymized OSF projects as well. For example, this is an anonymized link to a csv file from one of my projects https://osf.io/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad. Navigating to this link will show a web preview of the csv file contents.\r\nBy inserting /download into this url, we can read the csv file contents directly:\r\n\r\n\r\nread.csv(\"https://osf.io/download/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad\") |> \r\n head()\r\n\r\n Item plaus_bias trans_bias\r\n 1 Awakened -0.29631221 -1.2200901\r\n 2 Calmed 0.09877074 -0.4102332\r\n 3 Choked 1.28401957 -1.4284905\r\n 4 Dressed -0.59262442 -1.2087228\r\n 5 Failed -0.98770736 0.1098839\r\n 6 Groomed -1.08647810 0.9889550\r\n\r\nSee also the {osfr} package for a more principled interface to OSF.\r\nAside: Can’t go wrong with a copy-paste!\r\nReading remote files aside, I think it’s severely underrated how base R has a readClipboard() function and a collection of read.*() functions which can also read directly from a \"clipboard\" connection.3\r\nI sometimes do this for html/markdown summary tables that a website might display, or sometimes even for entire excel/googlesheets tables after doing a select-all + copy. 
For such relatively small chunks of data that you just want to quickly get into R, you can lean on base R’s clipboard functionalities.\r\nFor example, given this markdown table:\r\n\r\n\r\naggregate(mtcars, mpg ~ cyl, mean) |> \r\n knitr::kable()\r\n\r\ncyl\r\nmpg\r\n4\r\n26.66364\r\n6\r\n19.74286\r\n8\r\n15.10000\r\n\r\nYou can copy its contents and run the following code to get that data back as an R data frame:\r\n\r\n\r\nread.delim(\"clipboard\")\r\n# Or, `read.delim(text = readClipboard())`\r\n\r\n\r\n\r\n cyl mpg\r\n 1 4 26.66364\r\n 2 6 19.74286\r\n 3 8 15.10000\r\n\r\nIf you’re instead copying something flat like a list of numbers or strings, you can also use scan() and specify the appropriate sep to get that data back as a vector:4\r\n\r\n\r\npaste(1:10, collapse = \", \") |> \r\n cat()\r\n\r\n 1, 2, 3, 4, 5, 6, 7, 8, 9, 10\r\n\r\n\r\n\r\nscan(\"clipboard\", sep = \",\")\r\n# Or, `scan(textConnection(readClipboard()), sep = \",\")`\r\n\r\n\r\n\r\n [1] 1 2 3 4 5 6 7 8 9 10\r\n\r\nIt should be noted though that parsing clipboard contents is not a robust feature in base R. If you want a more principled approach to reading data from clipboard, you should use {datapasta}. And for printing data for others to copy-paste into R, use {constructive}. See also {clipr} which extends clipboard read/write functionalities.\r\nOther goodies\r\n⚠️ What lies ahead are denser than the kinds of “low-tech” advice I wrote about above.\r\nStreaming with {duckdb}\r\nOne caveat to all the “read from web” approaches I covered above is that it often does not actually circumvent the action of downloading the file onto your computer. For example, when you read a file from “raw.githubusercontent.com/…” with read.csv(), there is an implicit download.file() of the data into the current R session’s tempdir().\r\nAn alternative that actually reads the data straight into memory is streaming. Streaming is moreso a feature of database languages, but there’s good integration of such tools with R, so this option is available from within R as well.\r\nHere, I briefly outline what I learned from (mostly) reading a blog post by François Michonneau, which covers how to stream remote files using {duckdb}. It’s pretty comprehensive but I wanted to make a template for just one method that I prefer.\r\nWe start by loading the {duckdb} package, creating a connection to an in-memory database, installing the httpfs extension (if not installed already), and loading httpfs for the database.\r\n\r\n\r\nlibrary(duckdb)\r\ncon <- dbConnect(duckdb())\r\n# dbExecute(con, \"INSTALL httpfs;\") # You may also need to \"INSTALL parquet;\"\r\ninvisible(dbExecute(con, \"LOAD httpfs;\"))\r\n\r\n\r\nFor this example I will use a parquet file from one of my projects which is hosted on GitHub: https://github.com/yjunechoe/repetition_events. The data I want to read is at the relative path /data/tokens_data/childID=1/part-7.parquet. 
I went ahead and converted that into the “raw contents” url shown below:\r\n\r\n\r\n# A parquet file of tokens from a sample of child-directed speech\r\nfile <- \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet\"\r\n\r\n# For comparison, reading its contents with {arrow}\r\narrow::read_parquet(file) |> \r\n head(5)\r\n\r\n # A tibble: 5 × 3\r\n utterance_id gloss part_of_speech\r\n \r\n 1 1 www \"\" \r\n 2 2 bye \"co\" \r\n 3 3 mhm \"co\" \r\n 4 4 Mommy's \"n:prop\" \r\n 5 4 here \"adv\"\r\n\r\nIn duckdb, the httpfs extension we loaded above allows PARQUET_SCAN5 to read a remote parquet file.\r\n\r\n\r\nquery1 <- glue::glue_sql(\"\r\n SELECT *\r\n FROM PARQUET_SCAN({`file`})\r\n LIMIT 5;\r\n\", .con = con)\r\ncat(query1)\r\n\r\n SELECT *\r\n FROM PARQUET_SCAN(\"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet\")\r\n LIMIT 5;\r\n\r\ndbGetQuery(con, query1)\r\n\r\n utterance_id gloss part_of_speech\r\n 1 1 www \r\n 2 2 bye co\r\n 3 3 mhm co\r\n 4 4 Mommy's n:prop\r\n 5 4 here adv\r\n\r\nAnd actually, in my case, the parquet file represents one of many files that had been previously split up via hive partitioning. To preserve this metadata even as I read in just a single file, I need to do two things:\r\nSpecify hive_partitioning=true when calling PARQUET_SCAN.\r\nEnsure that the hive-partitioning syntax is represented in the url with URLdecode() (since the = character can sometimes be escaped, as in this case).\r\n\r\n\r\nemphatic::hl_diff(file, URLdecode(file))\r\n\r\n\r\n[1] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID%3D1/part-7.parquet\"[1] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID= 1/part-7.parquet\"\r\n\r\n\r\nWith that, the data now shows that the observations are from child #1 in the sample.\r\n\r\n\r\nfile <- URLdecode(file)\r\nquery2 <- glue::glue_sql(\"\r\n SELECT *\r\n FROM PARQUET_SCAN(\r\n {`file`},\r\n hive_partitioning=true\r\n )\r\n LIMIT 5;\r\n\", .con = con)\r\ncat(query2)\r\n\r\n SELECT *\r\n FROM PARQUET_SCAN(\r\n \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=1/part-7.parquet\",\r\n hive_partitioning=true\r\n )\r\n LIMIT 5;\r\n\r\ndbGetQuery(con, query2)\r\n\r\n utterance_id gloss part_of_speech childID\r\n 1 1 www 1\r\n 2 2 bye co 1\r\n 3 3 mhm co 1\r\n 4 4 Mommy's n:prop 1\r\n 5 4 here adv 1\r\n\r\nTo do this more programmatically over all parquet files under /tokens_data in the repository, we need to transition to using the GitHub Trees API. The idea is similar to using the Contents API but now we are requesting a list of all files using the following syntax:\r\n\r\n\r\ngh::gh(\"/repos/{user}/{repo}/git/trees/{branch/tag/commitSHA}?recursive=true\")$tree\r\n\r\n\r\nTo get the file tree of the repo on the master branch, we use:\r\n\r\n\r\nfiles <- gh::gh(\"/repos/yjunechoe/repetition_events/git/trees/master?recursive=true\")$tree\r\n\r\n\r\nWith recursive=true, this returns all files in the repo. 
Then, we can filter for just the parquet files we want with a little regex:\r\n\r\n\r\nparquet_files <- sapply(files, `[[`, \"path\") |> \r\n grep(x = _, pattern = \".*/tokens_data/.*parquet$\", value = TRUE)\r\nlength(parquet_files)\r\n\r\n [1] 70\r\n\r\nhead(parquet_files)\r\n\r\n [1] \"data/tokens_data/childID=1/part-7.parquet\" \r\n [2] \"data/tokens_data/childID=10/part-0.parquet\"\r\n [3] \"data/tokens_data/childID=11/part-6.parquet\"\r\n [4] \"data/tokens_data/childID=12/part-3.parquet\"\r\n [5] \"data/tokens_data/childID=13/part-1.parquet\"\r\n [6] \"data/tokens_data/childID=14/part-2.parquet\"\r\n\r\nFinally, we complete the path using the “https://raw.githubusercontent.com/…” url:\r\n\r\n\r\nparquet_files <- paste0(\r\n \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/\",\r\n parquet_files\r\n)\r\nhead(parquet_files)\r\n\r\n [1] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=1/part-7.parquet\" \r\n [2] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=10/part-0.parquet\"\r\n [3] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=11/part-6.parquet\"\r\n [4] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=12/part-3.parquet\"\r\n [5] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=13/part-1.parquet\"\r\n [6] \"https://raw.githubusercontent.com/yjunechoe/repetition_events/master/data/tokens_data/childID=14/part-2.parquet\"\r\n\r\nBack on duckdb, we can use PARQUET_SCAN to read multiple files by supplying a vector ['file1.parquet', 'file2.parquet', ...].6 This time, we also ask for a quick computation to count the number of distinct childIDs:\r\n\r\n\r\nquery3 <- glue::glue_sql(\"\r\n SELECT count(DISTINCT childID)\r\n FROM PARQUET_SCAN(\r\n [{parquet_files*}],\r\n hive_partitioning=true\r\n )\r\n\", .con = con)\r\ncat(gsub(\"^(.{80}).*(.{60})$\", \"\\\\1 ... \\\\2\", query3))\r\n\r\n SELECT count(DISTINCT childID)\r\n FROM PARQUET_SCAN(\r\n ['https://raw.githubusercont ... 
data/childID=9/part-64.parquet'],\r\n hive_partitioning=true\r\n )\r\n\r\ndbGetQuery(con, query3)\r\n\r\n count(DISTINCT childID)\r\n 1 70\r\n\r\nThis returns 70 which matches the length of the parquet_files vector listing the files that had been partitioned by childID.\r\nFor further analyses, we can CREATE TABLE7 our data in our in-memory database con:\r\n\r\n\r\nquery4 <- glue::glue_sql(\"\r\n CREATE TABLE tokens_data AS\r\n SELECT *\r\n FROM PARQUET_SCAN([{parquet_files*}], hive_partitioning=true)\r\n\", .con = con)\r\ninvisible(dbExecute(con, query4))\r\ndbListTables(con)\r\n\r\n [1] \"tokens_data\"\r\n\r\nThat lets us reference the table via dplyr::tbl(), at which point we can switch over to another high-level interface like {dplyr} to query it using its familiar functions:\r\n\r\n\r\nlibrary(dplyr)\r\ntokens_data <- tbl(con, \"tokens_data\")\r\n\r\n# Q: What are the most common verbs spoken to children in this sample?\r\ntokens_data |> \r\n filter(part_of_speech == \"v\") |> \r\n count(gloss, sort = TRUE) |> \r\n head() |> \r\n collect()\r\n\r\n # A tibble: 6 × 2\r\n gloss n\r\n \r\n 1 go 13614\r\n 2 see 13114\r\n 3 do 11829\r\n 4 have 10794\r\n 5 want 10560\r\n 6 put 9190\r\n\r\nCombined, here’s one (hastily put together) attempt at wrapping this workflow into a function:\r\n\r\n\r\nload_dataset_from_gh <- function(con, tblname, user, repo, branch, regex,\r\n partition = TRUE, lazy = TRUE) {\r\n \r\n allfiles <- gh::gh(glue::glue(\"/repos/{user}/{repo}/git/trees/{branch}?recursive=true\"))$tree\r\n files_relpath <- grep(regex, sapply(allfiles, `[[`, \"path\"), value = TRUE)\r\n # Use the actual Contents API here instead, if the repo is private\r\n files <- glue::glue(\"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{files_relpath}\")\r\n \r\n type <- if (lazy) quote(VIEW) else quote(TABLE)\r\n partition <- as.integer(partition)\r\n \r\n dbExecute(con, \"LOAD httpfs;\")\r\n dbExecute(con, glue::glue_sql(\"\r\n CREATE {type} {`tblname`} AS\r\n SELECT *\r\n FROM PARQUET_SCAN([{parquet_files*}], hive_partitioning={partition})\r\n \", .con = con))\r\n \r\n invisible(TRUE)\r\n\r\n}\r\n\r\ncon2 <- dbConnect(duckdb())\r\nload_dataset_from_gh(\r\n con = con2,\r\n tblname = \"tokens_data\",\r\n user = \"yjunechoe\",\r\n repo = \"repetition_events\",\r\n branch = \"master\",\r\n regex = \".*data/tokens_data/.*parquet$\"\r\n)\r\ntbl(con2, \"tokens_data\")\r\n\r\n # Source: table [?? x 4]\r\n # Database: DuckDB v1.0.0 [jchoe@Windows 10 x64:R 4.4.1/:memory:]\r\n utterance_id gloss part_of_speech childID\r\n \r\n 1 1 www \"\" 1\r\n 2 2 bye \"co\" 1\r\n 3 3 mhm \"co\" 1\r\n 4 4 Mommy's \"n:prop\" 1\r\n 5 4 here \"adv\" 1\r\n 6 5 wanna \"mod:aux\" 1\r\n 7 5 sit \"v\" 1\r\n 8 5 down \"adv\" 1\r\n 9 6 there \"adv\" 1\r\n 10 7 let's \"v\" 1\r\n # ℹ more rows\r\n\r\nOther sources for data\r\nIn writing this blog post, I’m indebted to all the knowledgeable folks on Mastodon who suggested their own recommended tools and workflows for various kinds of remote data. Unfortunately, I’m not familiar enough with most of them enough to do them justice, but I still wanted to record the suggestions I got from there for posterity.\r\nFirst, a post about reading remote files would not be complete without a mention of the wonderful {googlesheets4} package for reading from Google Sheets. 
I debated whether I should include a larger discussion of {googlesheets4}, and despite using it quite often myself I ultimately decided to omit it for the sake of space and because the package website is already very comprehensive. I would suggest starting from the Get Started vignette if you are new and interested.\r\nSecond, along the lines of {osfr}, there are other similar rOpensci packages for retrieving data from the kinds of data sources that may be of interest to academics, such as {deposits} for zenodo and figshare, and {piggyback} for GitHub release assets (Maëlle Salmon’s comment pointed me to the first two; I responded with some of my experiences). I was also reminded that {pins} exists - I’m not familiar with it myself so I thought I wouldn’t write anything for it here BUT Isabella Velásquez came in clutch sharing a recent talk on dynamically loading up-to-date data with {pins} which is a great demo of the unique strengths of {pins}.\r\nLastly, I inadvertently(?) started some discussion around remotely accessing spatial files. I don’t work with spatial data at all but I can totally imagine how the hassle of the traditional click-download-find-load workflow would be even more pronounced for spatial data which are presumably much larger in size and more difficult to preview. On this note, I’ll just link to Carl Boettiger’s comment about the fact that GDAL has a virtual file system that you can interface with from R packages wrapping this API (ex: {gdalraster}), and to Michael Sumner’s comment/gist + Chris Toney’s comment on the fact that you can even use this feature to stream non-spatial data!\r\nMiscellaneous tips and tricks\r\nI also have some random tricks that are more situational. Unfortunately, I can only recall like 20% of them at any given moment, so I’ll be updating this space as more come back to me:\r\nWhen reading remote .rda or .RData files with load(), you may need to wrap the link in url() first (ref: stackoverflow).\r\n{vroom} can remotely read gzipped files, without having to download.file() and unzip() first.\r\n{curl}, of course, will always have the most comprehensive set of low-level tools you need to read any arbitrary data remotely. 
For example, using curl::curl_fetch_memory() to read the dplyr::storms data again from the GitHub raw contents link:\r\n\r\n\r\nfetched <- curl::curl_fetch_memory(\r\n \"https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv\"\r\n)\r\nread.csv(text = rawToChar(fetched$content)) |> \r\n dplyr::glimpse()\r\n\r\n Rows: 87\r\n Columns: 14\r\n $ name \"Luke Skywalker\", \"C-3PO\", \"R2-D2\", \"Darth Vader\", \"Leia Or…\r\n $ height 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…\r\n $ mass 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…\r\n $ hair_color \"blond\", NA, NA, \"none\", \"brown\", \"brown, grey\", \"brown\", N…\r\n $ skin_color \"fair\", \"gold\", \"white, blue\", \"white\", \"light\", \"light\", \"…\r\n $ eye_color \"blue\", \"yellow\", \"red\", \"yellow\", \"brown\", \"blue\", \"blue\",…\r\n $ birth_year 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …\r\n $ sex \"male\", \"none\", \"none\", \"male\", \"female\", \"male\", \"female\",…\r\n $ gender \"masculine\", \"masculine\", \"masculine\", \"masculine\", \"femini…\r\n $ homeworld \"Tatooine\", \"Tatooine\", \"Naboo\", \"Tatooine\", \"Alderaan\", \"T…\r\n $ species \"Human\", \"Droid\", \"Droid\", \"Human\", \"Human\", \"Human\", \"Huma…\r\n $ films \"A New Hope, The Empire Strikes Back, Return of the Jedi, R…\r\n $ vehicles \"Snowspeeder, Imperial Speeder Bike\", \"\", \"\", \"\", \"Imperial…\r\n $ starships \"X-wing, Imperial shuttle\", \"\", \"\", \"TIE Advanced x1\", \"\", …\r\n\r\nEven if you’re going the route of downloading the file first, curl::multi_download() can offer big performance improvements over download.file().8 Many {curl} functions can also handle retries and stop/resumes which is cool too.\r\n{httr2} can capture a continuous data stream with httr2::req_perform_stream() up to a set time or size.\r\nsessionInfo()\r\n\r\n\r\nsessionInfo()\r\n\r\n R version 4.4.1 (2024-06-14 ucrt)\r\n Platform: x86_64-w64-mingw32/x64\r\n Running under: Windows 11 x64 (build 22631)\r\n \r\n Matrix products: default\r\n \r\n \r\n locale:\r\n [1] LC_COLLATE=English_United States.utf8 \r\n [2] LC_CTYPE=English_United States.utf8 \r\n [3] LC_MONETARY=English_United States.utf8\r\n [4] LC_NUMERIC=C \r\n [5] LC_TIME=English_United States.utf8 \r\n \r\n time zone: America/New_York\r\n tzcode source: internal\r\n \r\n attached base packages:\r\n [1] stats graphics grDevices utils datasets methods base \r\n \r\n other attached packages:\r\n [1] dplyr_1.1.4 duckdb_1.0.0 DBI_1.2.3 ggplot2_3.5.1.9000\r\n \r\n loaded via a namespace (and not attached):\r\n [1] rappdirs_0.3.3 sass_0.4.9 utf8_1.2.4 generics_0.1.3 \r\n [5] xml2_1.3.6 distill_1.6 digest_0.6.35 magrittr_2.0.3 \r\n [9] evaluate_0.24.0 grid_4.4.1 blob_1.2.4 fastmap_1.1.1 \r\n [13] jsonlite_1.8.8 processx_3.8.4 chromote_0.3.1 ps_1.7.5 \r\n [17] promises_1.3.0 httr_1.4.7 rvest_1.0.4 purrr_1.0.2 \r\n [21] fansi_1.0.6 scales_1.3.0 httr2_1.0.3.9000 jquerylib_0.1.4 \r\n [25] cli_3.6.2 rlang_1.1.4 dbplyr_2.5.0 gitcreds_0.1.2 \r\n [29] bit64_4.0.5 munsell_0.5.1 withr_3.0.1 cachem_1.0.8 \r\n [33] yaml_2.3.8 tools_4.4.1 tzdb_0.4.0 memoise_2.0.1 \r\n [37] colorspace_2.1-1 assertthat_0.2.1 curl_5.2.1 vctrs_0.6.5 \r\n [41] R6_2.5.1 lifecycle_1.0.4 emphatic_0.1.8 bit_4.0.5 \r\n [45] arrow_16.1.0 pkgconfig_2.0.3 pillar_1.9.0 bslib_0.7.0 \r\n [49] later_1.3.2 gtable_0.3.5 glue_1.7.0 gh_1.4.0 \r\n [53] Rcpp_1.0.12 xfun_0.47 tibble_3.2.1 tidyselect_1.2.1 \r\n [57] highr_0.11 rstudioapi_0.16.0 knitr_1.47 htmltools_0.5.8.1\r\n [61] 
websocket_1.4.1 rmarkdown_2.27 compiler_4.4.1 downlit_0.4.4\r\n\r\n\r\n\r\n\r\n\r\nThanks @tanho for pointing me to this at the R4DS/DSLC slack.↩︎\r\nNote that the API will actually generate a new token every time you send a request (and again, these tokens will expire with time).↩︎\r\nThe special value \"clipboard\" works for most base-R read functions that take a file or con argument.↩︎\r\nThanks @coolbutuseless for pointing me to textConnection()!↩︎\r\nOr READ_PARQUET - same thing.↩︎\r\nWe can also get this formatting with a combination of shQuote() and toString().↩︎\r\nWhereas CREATE TABLE results in a physical copy of the data in memory, CREATE VIEW will dynamically fetch the data from the source every time you query the table. If the data fits into memory (as in this case), I prefer CREATE as queries will be much faster (though you pay up-front for the time copying the data). If the data is larger than memory, CREATE VIEW will be your only option.↩︎\r\nSee an example implemented for {openalexR}, an API package.↩︎\r\n", + "preview": "posts/2024-09-22-fetch-files-web/github-dplyr-starwars.jpg", + "last_modified": "2024-09-22T18:49:08-04:00", + "input_file": {} + }, { "path": "posts/2024-07-21-enumerate-possible-options/", "title": "Naming patterns for boolean enums", diff --git a/docs/search.json b/docs/search.json index 4be93676..9042ca1e 100644 --- a/docs/search.json +++ b/docs/search.json @@ -5,7 +5,7 @@ "title": "Blog Posts", "author": [], "contents": "\r\n\r\n\r\n\r\n\r\n", - "last_modified": "2024-09-20T10:56:03-04:00" + "last_modified": "2024-09-22T18:50:13-04:00" }, { "path": "index.html", @@ -13,21 +13,21 @@ "description": "Ph.D. Candidate in Linguistics", "author": [], "contents": "\r\n\r\n\r\n\r\n\r\n\r\n\r\n Education\r\n\r\n\r\nB.A. (hons.) Northwestern University (2016–20)\r\n\r\n\r\nPh.D. University of Pennsylvania (2020 ~)\r\n\r\n\r\n Interests\r\n\r\n\r\n(Computational) Psycholinguistics\r\n\r\n\r\nLanguage Acquisition\r\n\r\n\r\nSentence Processing\r\n\r\n\r\nProsody\r\n\r\n\r\nQuantitative Methods\r\n\r\n\r\n\r\n\r\n\r\n Methods:\r\n\r\nWeb-based experiments, eye-tracking, self-paced reading, corpus analysis\r\n\r\n\r\n\r\n Programming:\r\n\r\nR (fluent) | HTML/CSS, Javascript, Julia (proficient) | Python (coursework)\r\n\r\n\r\n\r\n\r\n\r\nI am a PhD candidate in Linguistics at the University of Pennsylvania, and a student affiliate of Penn MindCORE and the Language and Communication Sciences program. I am a psycholinguist broadly interested in experimental approaches to studying meaning, of various flavors. My advisor is Anna Papafragou and I am a member of the Language & Cognition Lab.\r\nI received my B.A. in Linguistics from Northwestern University, where I worked with Jennifer Cole, Masaya Yoshida, and Annette D’Onofrio. I also worked as a research assistant for the Language, Education, and Reading Neuroscience Lab. My thesis explored the role of prosodic focus in garden-path reanalysis.\r\nBeyond linguistics research, I have interests in data visualization, science communication, and the R programming language. I author packages in statistical computing and graphics (ex: ggtrace, jlmerclusterperm) and collaborate on other open-source software (ex: openalexR, pointblank). 
I also maintain a technical blog as a hobby and occasionally take on small statistical consulting projects.\r\n\r\n\r\n\r\n\r\ncontact me: yjchoe@sas.upenn.edu\r\n\r\n\r\n\r\n\r\n\r\n\r\n", - "last_modified": "2024-09-20T11:03:58-04:00" + "last_modified": "2024-09-22T18:50:15-04:00" }, { "path": "news.html", "title": "News", "author": [], "contents": "\r\n\r\n\r\nFor more of my personal news external/tangential to research\r\n2023\r\nAugust\r\nI was unfortunately not able to make it in person to JSM 2023 but have my pre-recorded talk has been uploaded!\r\nJune\r\nMy package jlmerclusterperm was published on CRAN!\r\nApril\r\nI was accepted to SMLP (Summer School on Statistical Methods for Linguistics and Psychology), to be held in September at the University of Potsdam, Germany! I will be joining the “Advanced methods in frequentist statistics with Julia” stream. Huge thanks to MindCORE for funding my travels to attend!\r\nJanuary\r\nI received the ASA Statistical Computing and Graphics student award for my paper Sublayer modularity in the Grammar of Graphics! I will be presenting my work at the 2023 Joint Statistical Meetings in Toronto in August.\r\n2022\r\nSeptember\r\nI was invited to a Korean data science podcast dataholic (데이터홀릭) to talk about my experience presenting at the RStudio and useR conferences! Part 1, Part 2\r\nAugust\r\nI led a workshop on IBEX and PCIbex with Nayoun Kim at the Seoul International Conference on Linguistics (SICOL 2022).\r\nJuly\r\nI attended my first in-person R conference at rstudio::conf(2022) and gave a talk on ggplot internals.\r\nJune\r\nI gave a talk on my package {ggtrace} at the useR! 2022 conference. I was awarded the diversity scholarship which covered my registration and workshop fees. My reflections\r\nI gave a talk at RLadies philly on using dplyr’s slice() function for row-relational operations.\r\n2021\r\nJuly\r\nMy tutorial on custom fonts in R was featured as a highlight on the R Weekly podcast!\r\nJune\r\nI gave a talk at RLadies philly on using icon fonts for data viz! I also wrote a follow-up blog post that goes deeper into font rendering in R.\r\nMay\r\nSnowGlobe, a project started in my undergrad, was featured in an article by the Northwestern University Library. We also had a workshop for SnowGlobe which drew participants from over a hundred universities!\r\nJanuary\r\nI joined Nayoun Kim for a workshop on experimental syntax conducted in Korean and held at Sungkyunkwan University (Korea). I helped design materials for a session on scripting online experiments with IBEX, including interactive slides made with R!\r\n2020\r\nNovember\r\nI joined designer Will Chase on his stream to talk about the psycholinguistics of speech production for a data viz project on Michael’s speech errors in The Office. It was a very cool and unique opportunity to bring my two interests together!\r\nOctober\r\nMy tutorial on {ggplot2} stat_*() functions was featured as a highlight on the R Weekly podcast, which curates weekly updates from the R community.\r\nI became a data science tutor at MindCORE to help researchers at Penn with data visualization and R programming.\r\nSeptember\r\nI have moved to Philadelphia to start my PhD in Linguistics at the University of Pennsylvania!\r\nJune\r\nI graduated from Northwestern University with a B.A. in Linguistics (with honors)! 
I was also elected into Phi Beta Kappa and appointed as the Senior Marshal for Linguistics.\r\n\r\n\r\n\r\n", - "last_modified": "2024-09-20T10:56:06-04:00" + "last_modified": "2024-09-22T18:50:19-04:00" }, { "path": "research.html", "title": "Research and activities", "author": [], "contents": "\r\n\r\nContents\r\nAcademic research output\r\nPeer-reviewed Papers\r\nConference Talks\r\nConference Presentations\r\n\r\nResearch activities in FOSS\r\nPapers\r\nTalks\r\nSoftware\r\n\r\nTeaching\r\nPositions held\r\nWorkshops led\r\nGuest lectures\r\n\r\nProfessional activities\r\nEditor\r\nReviewer\r\nMembership\r\n\r\n\r\nLinks: Google Scholar, Github, OSF\r\nAcademic research output\r\nPeer-reviewed Papers\r\nJune Choe, and Anna Papafragou. (2023). The acquisition of subordinate nouns as pragmatic inference. Journal of Memory and Language, 132, 104432. DOI: https://doi.org/10.1016/j.jml.2023.104432. PDF OSF\r\nJune Choe, Yiran Chen, May Pik Yu Chan, Aini Li, Xin Gao, and Nicole Holliday. (2022). Language-specific Effects on Automatic Speech Recognition Errors for World Englishes. In Proceedings of the 29th International Conference on Computational Linguistics, 7177–7186.\r\nMay Pik Yu Chan, June Choe, Aini Li, Yiran Chen, Xin Gao, and Nicole Holliday. (2022). Training and typological bias in ASR performance for world Englishes. In Proceedings of Interspeech 2022, 1273-1277. DOI: 10.21437/Interspeech.2022-10869\r\nJune Choe, Masaya Yoshida, and Jennifer Cole. (2022). The role of prosodic focus in the reanalysis of garden path sentences: Depth of semantic processing impedes the revision of an erroneous local analysis. Glossa Psycholinguistics, 1(1). DOI: 10.5070/G601136\r\nJune Choe, and Anna Papafragou. (2022). The acquisition of subordinate nouns as pragmatic inference: Semantic alternatives modulate subordinate meanings. In Proceedings of the Annual Meeting of the Cognitive Science Society, 44, 2745-2752.\r\nSean McWeeny, Jinnie S. Choi, June Choe, Alexander LaTourette, Megan Y. Roberts, and Elizabeth S. Norton. (2022). Rapid automatized naming (RAN) as a kindergarten predictor of future reading in English: A systematic review and meta-analysis. Reading Research Quarterly, 57(4), 1187–1211. DOI: 10.1002/rrq.467\r\nConference Talks\r\nJune Choe, and Anna Papafragou. Children’s sensitivity to informativeness in naming: basic-level vs. superordinate nouns. Talk at the 101st Linguistic Society of America (LSA) conference. 9-12 January 2025. Philadelphia.\r\nJune Choe. Distributional signatures of superordinate nouns. Talk at the 10th MACSIM conference. 6 April 2024. University of Maryland, College Park, MD.\r\nJune Choe. Sub-layer modularity in the Grammar of Graphics. Talk at the 2023 Joint Statistical Meetings, 5-10 August 2023. Toronto, Canada. American Statistical Association (ASA) student paper award in Statistical Computing and Graphics. Paper\r\nJune Choe. Persona-based social expectations in sentence processing and comprehension. Talk at the Language, Stereotypes & Social Cognition workshop, 22-23 May, 2023. University of Pennsylvania, PA.\r\nJune Choe, and Anna Papafragou. Lexical alternatives and the acquisition of subordinate nouns. Talk at the 47th Boston University Conference on Language Development (BUCLD), 3-6 November, 2022. Boston University, Boston, MA. Slides\r\nJune Choe, Yiran Chen, May Pik Yu Chan, Aini Li, Xin Gao and Nicole Holliday. (2022). Language-specific Effects on Automatic Speech Recognition Errors in American English. 
Talk at the 28th International Conference on Computational Linguistics (CoLing), 12-17 October, 2022. Gyeongju, South Korea. Slides\r\nMay Pik Yu Chan, June Choe, Aini Li, Yiran Chen, Xin Gao and Nicole Holliday. (2022). Training and typological bias in ASR performance for world Englishes. Talk at the 23rd Conference of the International Speech Communication Association (INTERSPEECH), 18-22 September, 2022. Incheon, South Korea.\r\nConference Presentations\r\nJune Choe, and Anna Papafragou. Distributional signatures of superordinate nouns. Poster presented at the 48th Boston University Conference on Language Development (BUCLD), 2-5 November, 2023. Boston University, Boston, MA. Abstract Poster\r\nJune Choe, and Anna Papafragou. Pragmatic underpinnings of the basic-level bias. Poster presented at the 48th Boston University Conference on Language Development (BUCLD), 2-5 November, 2023. Boston University, Boston, MA. Abstract Poster\r\nJune Choe and Anna Papafragou. Discourse effects on the acquisition of subordinate nouns. Poster presented at the 9th Mid-Atlantic Colloquium of Studies in Meaning (MACSIM), 15 April 2023. University of Pennsylvania, PA.\r\nJune Choe and Anna Papafragou. Discourse effects on the acquisition of subordinate nouns. Poster presented at the 36th Annual Conference on Human Sentence Processing, 9-11 March 2022. University of Pittsburg, PA. Abstract Poster\r\nJune Choe, and Anna Papafragou. Acquisition of subordinate nouns as pragmatic inference: Semantic alternatives modulate subordinate meanings. Poster at the 2nd Experiments in Linguistic Meaning (ELM) conference, 18-20 May 2022. University of Pennsylvania, Philadelphia, PA.\r\nJune Choe, and Anna Papafragou. Beyond the basic level: Levels of informativeness and the acquisition of subordinate nouns. Poster at the 35th Annual Conference on Human Sentence Processing (HSP), 24-26 March 2022. University of California, Santa Cruz, CA.\r\nJune Choe, Jennifer Cole, and Masaya Yoshida. Prosodic Focus Strengthens Semantic Persistence. Poster at The 26th Architectures and Mechanisms for Language Processing (AMLaP), 3-5 September 2020. Potsdam, Germany. Abstract Video Slides\r\nJune Choe. Computer-assisted snowball search for meta-analysis research. Poster at The 2020 Undergraduate Research & Arts Exposition. 27-28 May 2020. Northwestern University, Evanston, IL. 2nd Place Poster Award. Abstract\r\nJune Choe. Social Information in Sentence Processing. Talk at The 2019 Undergraduate Research & Arts Exposition. 29 May 2019. Northwestern University, Evanston, IL. Abstract\r\nJune Choe, Shayne Sloggett, Masaya Yoshida and Annette D’Onofrio. Personae in syntactic processing: Socially-specific agents bias expectations of verb transitivity. Poster at The 32nd CUNY Conference on Human Sentence Processing. 29-31 March 2019. University of Colorado, Boulder, CO.\r\nD’Onofrio, Annette, June Choe and Masaya Yoshida. Personae in syntactic processing: Socially-specific agents bias expectations of verb transitivity. Poster at The 93rd Annual Meeting of the Linguistics Society of America. 3-6 January 2019. New York City, NY.\r\nResearch activities in FOSS\r\nPapers\r\nMassimo Aria, Trang Le, Corrado Cuccurullo, Alessandra Belfiore, and June Choe. (2024). openalexR: An R-tool for collecting bibliometric data from OpenAlex. The R Journal, 15(4), 166-179. Paper, Github\r\nJune Choe. (2022). Sub-layer modularity in the Grammar of Graphics. American Statistical Association (ASA) student paper award in Statistical Computing and Graphics. 
Paper, Github\r\nTalks\r\nJune Choe. Sub-layer modularity in the Grammar of Graphics. Talk at the 2023 Joint Statistical Meetings, 5-10 August 2023. Toronto, Canada.\r\nJune Choe. Fast cluster-based permutation test using mixed-effects models. Talk at the Integrated Language Science and Technology (ILST) seminar, 21 April 2023. University of Pennsylvania, PA.\r\nJune Choe. Cracking open ggplot internals with {ggtrace}. Talk at the 2022 RStudio Conference, 25-28 July 2022. Washington D.C. https://github.com/yjunechoe/ggtrace-rstudioconf2022\r\nJune Choe. Stepping into {ggplot2} internals with {ggtrace}. Talk at the 2022 useR! Conference, 20-23 June 2022. Vanderbilt University, TN. https://github.com/yjunechoe/ggtrace-user2022\r\nSoftware\r\nJune Choe. (2024). jlmerclusterperm: Cluster-Based Permutation Analysis for Densely Sampled Time Data. R package version 1.1.3. https://cran.r-project.org/package=jlmerclusterperm. Github\r\nRich Iannone, June Choe, Mauricio Vargas Sepulveda. (2024). pointblank: Data Validation and Organization of Metadata for Local and Remote Tables. R package version 0.12.1. https://CRAN.R-project.org/package=pointblank. Github\r\nMassimo Aria, Corrado Cuccurullo, Trang Le, June Choe. (2024). openalexR: Getting Bibliographic Records from ‘OpenAlex’ Database Using ‘DSL’ API. R package version 1.4.0. https://CRAN.R-project.org/package=openalexR. Github\r\nJune Choe. (2024). jlme: Regression Modelling with ‘GLM.jl’ and ‘MixedModels.jl’ in ‘Julia’. R package version 0.3.0. https://cran.r-project.org/package=jlme. Github\r\nSean McWeeny, June Choe, & Elizabeth S. Norton. (2021). SnowGlobe: An Iterative Search Tool for Systematic Reviews and Meta-Analyses [Computer Software]. OSF\r\nTeaching\r\nPositions held\r\nTeaching assistant for “Introduction to Linguistics”. Instructor: Aletheia Cui. Spring 2024. University of Pennsylvania.\r\nTeaching assistant for “Data science for language and the mind”. Instructor: Katie Schuler. Fall 2021, Spring 2023, and Fall 2023. University of Pennsylvania.\r\nWorkshops led\r\nIntroduction to mixed-effects models in Julia. Workshop at Penn MindCORE. 1 December 2023. Philadelphia, PA. Github Colab notebook\r\nExperimental syntax using IBEX/PCIBEX with Dr. Nayoun Kim. Workshop at the 2022 Seoul International Conference on Linguistics. 11-12 August 2022. Seoul, South Korea. PDF\r\nExperimental syntax using IBEX: a walkthrough with Dr. Nayoun Kim. 2021 BK Winter School-Workshop on Experimental Linguistics/Syntax at Sungkyunkwan University, 19-22 January 2021. Seoul, South Korea. PDF\r\nGuest lectures\r\nHard words and (syntactic) bootstrapping. LING 5750 “The Acquisition of Meaning”. Instructor: Dr. Anna Papafragou. Spring 2024. University of Pennsylvania.\r\nIntroduction to R for psychology research. PSYC 4997 “Senior Honors Seminar in Psychology”. Instructor: Dr. Coren Apicella. Spring 2024. University of Pennsylvania. Colab notebook\r\nModel fitting and diagnosis with MixedModels.jl in Julia. LING 5670 “Quantitative Study of Linguistic Variation”. Instructor: Dr. Meredith Tamminga. Fall 2023. University of Pennsylvania.\r\nSimulation-based power analysis for mixed-effects models. LING 5670 “Quantitative Study of Linguistic Variation”. Instructor: Dr. Meredith Tamminga. Spring 2023. 
University of Pennsylvania.\r\nProfessional activities\r\nEditor\r\nPenn Working Papers in Linguistics (PWPL), Volume 30, Issue 1.\r\nReviewer\r\nCognition\r\nLanguage Learning and Development\r\nJournal of Open Source Software\r\nProceedings of the Annual Meeting of the Cognitive Science Society\r\nMembership\r\nLinguistic Society of America\r\nAmerican Statistical Association\r\n\r\n\r\n\r\n",
-    "last_modified": "2024-09-20T10:56:08-04:00"
+    "last_modified": "2024-09-22T18:50:22-04:00"
  },
  {
    "path": "resources.html",
@@ -35,14 +35,14 @@
    "description": "Mostly for R and data visualization\n",
    "author": [],
    "contents": "\r\n\r\nContents\r\nLinguistics\r\nData Visualization\r\nPackages and software\r\nTutorial Blog Posts\r\nBy others\r\n\r\nLinguistics\r\nScripting online experiments with IBEX (workshop slides & materials with Nayoun Kim)\r\nData Visualization\r\n{ggplot2} style guide and showcase - most recent version (2/10/2021)\r\nCracking open the internals of ggplot: A {ggtrace} showcase - slides\r\nPackages and software\r\n{ggtrace}: R package for exploring, debugging, and manipulating ggplot internals by exposing the underlying object-oriented system in functional programming terms.\r\n{penngradlings}: R package for the University of Pennsylvania Graduate Linguistics Society.\r\n{LingWER}: R package for linguistic analysis of Word Error Rate for evaluating transcriptions and other speech-to-text output, using a deterministic matrix-based search algorithm optimized for R.\r\n{gridAnnotate}: R package for interactively annotating figures from the plot pane, using {grid} graphical objects.\r\nSnowGlobe: A tool for meta-analysis research. Developed with Jinnie Choi, Sean McWeeny, and Elizabeth Norton, with funding from the Northwestern University Library. Currently under development but basic features are functional. Validation experiments and guides at OSF repo.\r\nTutorial Blog Posts\r\n{ggplot2} stat_*() functions [post]\r\nCustom fonts in R [post]\r\n{purrr} reduce() family [post1, post2]\r\nThe correlation parameter in {lme4} mixed effects models [post]\r\nShortcuts for common chain of {dplyr} functions [post]\r\nPlotting highly-customizable treemaps with {treemap} and {ggplot2} [post]\r\nBy others\r\nTutorials:\r\nA ggplot2 Tutorial for Beautiful Plotting in R by Cédric Scherer\r\nggplot2 Wizardry Hands-On by Cédric Scherer\r\nggplot2 workshop by Thomas Lin Pedersen\r\nBooks:\r\nR for Data Science by Hadley Wickham and Garrett Grolemund\r\nR Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, and Garrett Grolemund\r\nggplot2: elegant graphics for data analysis by Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen\r\nFundamentals of Data Visualization by Claus O. Wilke\r\nEfficient R Programming by Colin Gillespie and Robin Lovelace\r\nAdvanced R by Hadley Wickham\r\n\r\n\r\n\r\n",
-    "last_modified": "2024-09-20T10:56:09-04:00"
+    "last_modified": "2024-09-22T18:50:24-04:00"
  },
  {
    "path": "software.html",
    "title": "Software",
    "author": [],
    "contents": "\r\n\r\nContents\r\nggtrace\r\njlmerclusterperm\r\npointblank\r\nopenalexR\r\nggcolormeter\r\nddplot\r\nSnowglobe (retired)\r\n\r\nMain: Github profile, R-universe profile\r\nggtrace\r\n\r\n\r\n\r\nRole: Author\r\nLanguage: R\r\nLinks: Github, website, talks (useR! 2022, rstudio::conf 2022), paper\r\n\r\nProgrammatically explore, debug, and manipulate ggplot internals. 
Package {ggtrace} offers a low-level interface that extends base R capabilities of trace, as well as a family of workflow functions that make interactions with ggplot internals more accessible.\r\n\r\njlmerclusterperm\r\n\r\n\r\n\r\nRole: Author\r\nLanguage: R, Julia\r\nLinks: CRAN, Github, website\r\n\r\nAn implementation of fast cluster-based permutation analysis (CPA) for densely-sampled time data developed in Maris & Oostenveld (2007). Supports (generalized, mixed-effects) regression models for the calculation of timewise statistics. Provides both a wholesale and a piecemeal interface to the CPA procedure with an emphasis on interpretability and diagnostics. Integrates Julia libraries MixedModels.jl and GLM.jl for performance improvements, with additional functionalities for interfacing with Julia from ‘R’ powered by the JuliaConnectoR package.\r\n\r\npointblank\r\n\r\n\r\n\r\nRole: Author\r\nLanguage: R, HTML/CSS, Javascript\r\nLinks: Github, website\r\n\r\nData quality assessment and metadata reporting for data frames and database tables\r\n\r\nopenalexR\r\n\r\n\r\n\r\nRole: Author\r\nLanguage: R\r\nLinks: Github, website\r\n\r\nA set of tools to extract bibliographic content from the OpenAlex database using API https://docs.openalex.org.\r\n\r\nggcolormeter\r\nRole: Author\r\nLanguage: R\r\nLinks: Github\r\n\r\n{ggcolormeter} adds guide_colormeter(), a {ggplot2} color/fill legend guide extension in the style of a dashboard meter.\r\n\r\nddplot\r\nRole: Contributor\r\nLanguage: R, JavaScript\r\nLinks: Github, website\r\n\r\nCreate ‘D3’ based ‘SVG’ (‘Scalable Vector Graphics’) graphics using a simple ‘R’ API. The package aims to simplify the creation of many ‘SVG’ plot types using a straightforward ‘R’ API. The package relies on the ‘r2d3’ ‘R’ package and the ‘D3’ ‘JavaScript’ library. See https://rstudio.github.io/r2d3/ and https://d3js.org/ respectively.\r\n\r\nSnowglobe (retired)\r\nRole: Author\r\nLanguage: R, SQL\r\nLinks: Github, OSF, poster\r\n\r\nAn iterative search tool for systematic reviews and meta-analyses, implemented as a Shiny app. Retired due to the discontinuation of the Microsoft Academic Graph service in 2021. I now contribute to {openalexR}.\r\n\r\n\r\n\r\n\r\n", - "last_modified": "2024-09-20T10:56:11-04:00" + "last_modified": "2024-09-22T18:50:25-04:00" }, { "path": "visualizations.html", @@ -50,7 +50,7 @@ "description": "Select data visualizations", "author": [], "contents": "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n", - "last_modified": "2024-09-20T10:56:14-04:00" + "last_modified": "2024-09-22T18:50:28-04:00" } ], "collections": ["posts/posts.json"] diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 6a76546b..aac761f7 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -28,6 +28,10 @@ https://yjunechoe.github.io/visualizations.html 2022-11-13T09:17:01-05:00 + + https://yjunechoe.github.io/posts/2024-09-22-fetch-files-web/ + 2024-09-22T18:49:08-04:00 + https://yjunechoe.github.io/posts/2024-07-21-enumerate-possible-options/ 2024-09-01T17:53:55-04:00