Commit

fix date and start draft
yjunechoe committed Sep 1, 2024
1 parent 8843f80 commit 13bf477
Showing 34 changed files with 17,565 additions and 7,080 deletions.
@@ -10,7 +10,7 @@ author:
affiliation: University of Pennsylvania Linguistics
affiliation_url: https://live-sas-www-ling.pantheon.sas.upenn.edu/
orcid_id: 0000-0002-0701-921X
date: "`r Sys.Date()`"
date: 07-21-2024
output:
distill::distill_article:
include-after-body: "highlighting.html"
@@ -94,8 +94,8 @@


<!-- https://schema.org/Article -->
<meta property="article:published" itemprop="datePublished" content="2024-09-01"/>
<meta property="article:created" itemprop="dateCreated" content="2024-09-01"/>
<meta property="article:published" itemprop="datePublished" content="2024-07-21"/>
<meta property="article:created" itemprop="dateCreated" content="2024-07-21"/>
<meta name="article:author" content="June Choe"/>

<!-- https://developers.facebook.com/docs/sharing/webmasters#markup -->
@@ -115,7 +115,7 @@
<!--radix_placeholder_rmarkdown_metadata-->

<script type="text/json" id="radix-rmarkdown-metadata">
{"type":"list","attributes":{"names":{"type":"character","attributes":{},"value":["title","description","categories","base_url","author","date","output","editor_options","preview"]}},"value":[{"type":"character","attributes":{},"value":["Naming patterns for boolean enums"]},{"type":"character","attributes":{},"value":["Some thoughts on the principle of enumerating possible options, even for booleans\n"]},{"type":"character","attributes":{},"value":["design"]},{"type":"character","attributes":{},"value":["https://yjunechoe.github.io"]},{"type":"list","attributes":{},"value":[{"type":"list","attributes":{"names":{"type":"character","attributes":{},"value":["name","affiliation","affiliation_url","orcid_id"]}},"value":[{"type":"character","attributes":{},"value":["June Choe"]},{"type":"character","attributes":{},"value":["University of Pennsylvania Linguistics"]},{"type":"character","attributes":{},"value":["https://live-sas-www-ling.pantheon.sas.upenn.edu/"]},{"type":"character","attributes":{},"value":["0000-0002-0701-921X"]}]}]},{"type":"character","attributes":{},"value":["2024-09-01"]},{"type":"list","attributes":{"names":{"type":"character","attributes":{},"value":["distill::distill_article"]}},"value":[{"type":"list","attributes":{"names":{"type":"character","attributes":{},"value":["include-after-body","toc","self_contained","css"]}},"value":[{"type":"character","attributes":{},"value":["highlighting.html"]},{"type":"logical","attributes":{},"value":[true]},{"type":"logical","attributes":{},"value":[false]},{"type":"character","attributes":{},"value":["../../styles.css"]}]}]},{"type":"list","attributes":{"names":{"type":"character","attributes":{},"value":["chunk_output_type"]}},"value":[{"type":"character","attributes":{},"value":["console"]}]},{"type":"character","attributes":{},"value":["preview.jpg"]}]}
{"type":"list","attributes":{"names":{"type":"character","attributes":{},"value":["title","description","categories","base_url","author","date","output","editor_options","preview"]}},"value":[{"type":"character","attributes":{},"value":["Naming patterns for boolean enums"]},{"type":"character","attributes":{},"value":["Some thoughts on the principle of enumerating possible options, even for booleans\n"]},{"type":"character","attributes":{},"value":["design"]},{"type":"character","attributes":{},"value":["https://yjunechoe.github.io"]},{"type":"list","attributes":{},"value":[{"type":"list","attributes":{"names":{"type":"character","attributes":{},"value":["name","affiliation","affiliation_url","orcid_id"]}},"value":[{"type":"character","attributes":{},"value":["June Choe"]},{"type":"character","attributes":{},"value":["University of Pennsylvania Linguistics"]},{"type":"character","attributes":{},"value":["https://live-sas-www-ling.pantheon.sas.upenn.edu/"]},{"type":"character","attributes":{},"value":["0000-0002-0701-921X"]}]}]},{"type":"character","attributes":{},"value":["07-21-2024"]},{"type":"list","attributes":{"names":{"type":"character","attributes":{},"value":["distill::distill_article"]}},"value":[{"type":"list","attributes":{"names":{"type":"character","attributes":{},"value":["include-after-body","toc","self_contained","css"]}},"value":[{"type":"character","attributes":{},"value":["highlighting.html"]},{"type":"logical","attributes":{},"value":[true]},{"type":"logical","attributes":{},"value":[false]},{"type":"character","attributes":{},"value":["../../styles.css"]}]}]},{"type":"list","attributes":{"names":{"type":"character","attributes":{},"value":["chunk_output_type"]}},"value":[{"type":"character","attributes":{},"value":["console"]}]},{"type":"character","attributes":{},"value":["preview.jpg"]}]}
</script>
<!--/radix_placeholder_rmarkdown_metadata-->

@@ -1524,7 +1524,7 @@
<!--radix_placeholder_front_matter-->

<script id="distill-front-matter" type="text/json">
{"title":"Naming patterns for boolean enums","description":"Some thoughts on the principle of enumerating possible options, even for booleans","authors":[{"author":"June Choe","authorURL":"#","affiliation":"University of Pennsylvania Linguistics","affiliationURL":"https://live-sas-www-ling.pantheon.sas.upenn.edu/","orcidID":"0000-0002-0701-921X"}],"publishedDate":"2024-09-01T00:00:00.000-04:00","citationText":"Choe, 2024"}
{"title":"Naming patterns for boolean enums","description":"Some thoughts on the principle of enumerating possible options, even for booleans","authors":[{"author":"June Choe","authorURL":"#","affiliation":"University of Pennsylvania Linguistics","affiliationURL":"https://live-sas-www-ling.pantheon.sas.upenn.edu/","orcidID":"0000-0002-0701-921X"}],"publishedDate":"2024-07-21T00:00:00.000-04:00","citationText":"Choe, 2024"}
</script>

<!--/radix_placeholder_front_matter-->
@@ -1547,7 +1547,7 @@ <h1>Naming patterns for boolean enums</h1>
<div class="d-byline">
June Choe (University of Pennsylvania Linguistics)<a href="https://live-sas-www-ling.pantheon.sas.upenn.edu/" class="uri">https://live-sas-www-ling.pantheon.sas.upenn.edu/</a>

<br/>2024-09-01
<br/>07-21-2024
</div>

<div class="d-article">
229 changes: 229 additions & 0 deletions _posts/2024-09-01-fetch-files-web/fetch-files-web.Rmd
@@ -0,0 +1,229 @@
---
title: 'Read files on the web from R'
description: |
  Compilation of some code-snippets, mostly for my own use
categories:
  - data
base_url: https://yjunechoe.github.io
author:
  - name: June Choe
    affiliation: University of Pennsylvania Linguistics
    affiliation_url: https://live-sas-www-ling.pantheon.sas.upenn.edu/
    orcid_id: 0000-0002-0701-921X
date: 09-01-2024
output:
  distill::distill_article:
    include-after-body: "highlighting.html"
    toc: true
    self_contained: false
    css: "../../styles.css"
editor_options:
  chunk_output_type: console
preview: github-dplyr-starwars.jpg
draft: true
---

```{r setup, include=FALSE}
library(ggplot2)
knitr::opts_chunk$set(
  comment = " ",
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  R.options = list(width = 80)
)
```

Every so often I'll have a link to some file on hand and want to read it in R without going out of my way to browse the web page, find a download link, download it somewhere onto my computer, grab the path to it, and then finally read it into R.

Over the years I've accumulated some tricks to get data into R "straight from a url", even if the url does not point to the raw file itself. The method varies between data sources though, and I have a hard time keeping track of them in my head, so I thought I'd write some of these down for my own reference.

## GitHub (public repos)

GitHub has a nice point-and-click interface for browsing repositories and previewing files. For example, you can navigate to the `dplyr::starwars` dataset from [tidyverse/dplyr](https://github.com/tidyverse/dplyr/), at <https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv>:

```{r, echo=FALSE, fig.align='center', out.width="500px", out.extra="class=external"}
knitr::include_graphics("github-dplyr-starwars.jpg", error = FALSE)
```

That url, despite ending in a `.csv`, does not point to the raw data - instead, it's a full html webpage:

```{r, eval=FALSE}
rvest::read_html("https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv")
```

```
{html_document}
<html lang="en" data-color-mode="auto" data-light-theme="light" ...
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="logged-out env-production page-responsive" style="word-wrap: ...
```

To actually point to the raw file, you want to click on the **Raw** button at the top-right corner of the preview:

```{r, echo=FALSE, fig.align='center', out.width="300px", out.extra="class=external"}
knitr::include_graphics("github-dplyr-starwars-raw.jpg", error = FALSE)
```

That gets you to the actual contents of the comma separated values, at <https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv>:

```{r, echo=FALSE, fig.align='center', out.width="100%", out.extra="class=external"}
knitr::include_graphics("github-dplyr-starwars-csv.jpg", error = FALSE)
```

You can then read that URL starting with "raw.githubusercontent.com/..." with `read.csv()`:

```{r}
read.csv("https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv") |>
  dplyr::glimpse()
```

But note that this method of "click the **Raw** button to get the corresponding *raw.githubusercontent.com/...* url to the file contents" will not work for file formats that cannot be displayed in plain text (clicking the button will instead download the file via your browser). So sometimes (especially when you have a binary file) you have to construct this "remote-readable" url to the file manually.

Fortunately, going from one link to the other is pretty formulaic. To use the starwars dataset example again:

```{r}
emphatic::hl_diff(
  "https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv",
  "https://raw.githubusercontent.com/tidyverse/dplyr/main/data-raw/starwars.csv"
)
```
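
That substitution can be sketched as a small helper. Note that `github_raw_url()` is a hypothetical name of my own (it's not from any package), and the sketch assumes a url of the form `github.com/{user}/{repo}/blob/{ref}/{path}`:

```r
# Hypothetical helper (not from any package): rewrite a github.com "blob" url
# into the corresponding raw.githubusercontent.com url.
# Assumes the url looks like https://github.com/{user}/{repo}/blob/{ref}/{path}
github_raw_url <- function(url) {
  sub(
    "^https://github\\.com/([^/]+)/([^/]+)/blob/",
    "https://raw.githubusercontent.com/\\1/\\2/",
    url
  )
}

raw_url <- github_raw_url(
  "https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv"
)
raw_url
```

Urls that don't match the assumed pattern are returned unchanged, so it should be safe to apply the helper to a url that is already "raw".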

## GitHub (gists)

It's a similar idea with GitHub Gists (sometimes I like to store small datasets for demos as gists). For example, here's a link to simulated data for a [Stroop experiment](https://en.wikipedia.org/wiki/Stroop_effect), `stroop.csv`: <https://gist.github.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6>

The modified url from which you can read the csv contents is <https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv>, which you can again get to by clicking the **Raw** button at the top-right corner of the gist:

```{r, echo=FALSE, fig.align='center', out.width="100%", out.extra="class=external"}
knitr::include_graphics("github-gist-stroop.jpg", error = FALSE)
```

But actually, that long link you get by default points specifically to the current commit. If you instead want to keep the link up to date with the most recent commit, you can remove the second hash that comes after `raw/`:

```{r}
emphatic::hl_diff(
  "https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv",
  "https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/stroop.csv"
)
```
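
If you find yourself doing this often, the edit can be sketched as a one-line helper. `gist_latest_url()` is a hypothetical name, and the sketch assumes that the commit hash is the usual 40-character hexadecimal string right after `raw/`:

```r
# Hypothetical helper: drop the commit hash from a gist "raw" url so that it
# follows the most recent commit.
# Assumes the hash is a 40-character hex string directly after "/raw/".
gist_latest_url <- function(url) {
  sub("/raw/[0-9a-f]{40}/", "/raw/", url)
}

latest_url <- gist_latest_url(
  "https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/c643b9760126d92b8ac100860ac5b50ba492f316/stroop.csv"
)
latest_url
```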

In practice, I don't use gists to store replicability-sensitive data, so I prefer to just use the shorter link that's not tied to a specific commit.

```{r}
read.csv("https://gist.githubusercontent.com/yjunechoe/17b3787fb7aec108c19b33d71bc19bc6/raw/stroop.csv") |>
  dplyr::glimpse()
```

## GitHub (private repos)

We now turn to the harder problem of accessing a file in a private GitHub repository. If you already have the GitHub webpage open and you're signed in, you can follow the same step of copying the link that the **Raw** button redirects to.

Except this time, you'll see the url come with a "token". This token is necessary to remotely access the data in a private repo. Once a token is generated, the file can be accessed using that token from anywhere, but it *will expire* at some point because GitHub refreshes these tokens periodically (so treat them as if they're for single use).

For a more robust approach, you can use the [GitHub Contents API](https://docs.github.com/en/rest/repos/contents). If you have your credentials set up in [`{gh}`](https://gh.r-lib.org/), you can request a token-tagged url to the private file using the syntax:

```{r, eval=FALSE}
gh::gh("/repos/{user}/{repo}/contents/{path}")$download_url
```

This is a general solution to getting a url to file contents. So for example, even without any credentials set up you can point to dplyr's `starwars.csv` since that's publicly accessible. This produces the same "raw.githubusercontent.com/..." url we saw above:

```{r}
gh::gh("/repos/tidyverse/dplyr/contents/data-raw/starwars.csv")$download_url
```
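
If all you have on hand is the regular github.com link, the endpoint string can be assembled from its parts. Here's a minimal sketch: `github_contents_endpoint()` is a hypothetical helper of my own, and it assumes that the branch or tag name contains no `/` (otherwise there's no way to tell where the ref ends and the file path begins):

```r
# Hypothetical helper: build a Contents API endpoint from a github.com "blob" url.
# Assumes https://github.com/{user}/{repo}/blob/{ref}/{path}, where {ref}
# (the branch or tag name) contains no "/".
github_contents_endpoint <- function(url) {
  pattern <- "^https://github\\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)$"
  m <- regmatches(url, regexec(pattern, url))[[1]]
  if (length(m) == 0) stop("Not a github.com blob url: ", url)
  # m[1] is the full match; m[2:5] are user, repo, ref, and path
  list(
    endpoint = sprintf("/repos/%s/%s/contents/%s", m[2], m[3], m[5]),
    ref = m[4]
  )
}

x <- github_contents_endpoint(
  "https://github.com/tidyverse/dplyr/blob/main/data-raw/starwars.csv"
)
x$endpoint
# The request itself needs network access, so it's left as a comment here:
# gh::gh(x$endpoint, ref = x$ref)$download_url
```

Passing `ref` along is optional: as far as I can tell from the API docs, the Contents API falls back to the repository's default branch when it's omitted.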

For demonstration with a private repo, here is one of mine that you cannot access: <https://github.com/yjunechoe/my-super-secret-repo>. But because I set up my credentials in `{gh}`, I can get a link to content within that repo with the access token attached in the url ("?token=..."):

```{r}
gh::gh("/repos/yjunechoe/my-super-secret-repo/contents/README.md")$download_url |>
  # truncating...
  substr(1, 100) |>
  paste0("...")
```

I can then use this url to read the private file:^[Note that the API will actually generate a new token every time you send a request (and the tokens will expire with time).]

```{r}
gh::gh("/repos/yjunechoe/my-super-secret-repo/contents/README.md")$download_url |>
  readLines()
```

## OSF

Reading files off of OSF follows a similar strategy to fetching public files on GitHub. Consider, for example, the `dyestuff.arrow` file in the [OSF repository for MixedModels.jl](https://osf.io/a94tr/). Browsing the repository through the point-and-click interface can get you to the page for the file at <https://osf.io/9vztj/>, where it shows:

```{r, echo=FALSE, fig.align='center', out.width="100%", out.extra="class=external"}
knitr::include_graphics("osf-MixedModels-dyestuff.jpg", error = FALSE)
```

The download button can be found inside the dropdown menubar:

```{r, echo=FALSE, fig.align='center', out.width="50%", out.extra="class=external"}
knitr::include_graphics("osf-MixedModels-dyestuff-download.jpg", error = FALSE)
```

But instead of clicking on it (which will start a download via the browser), we can grab the link address that it redirects to, which is <https://osf.io/download/9vztj/>. That url can then be passed directly into a read function:

```{r}
arrow::read_feather("https://osf.io/download/9vztj/") |>
  dplyr::glimpse()
```

You might have already caught on to this, but the pattern is simply to point to `osf.io/download/` instead of `osf.io/`.

This method also works for view-only links to anonymized OSF projects. For example, this is an anonymized link to a csv file from one of my projects: <https://osf.io/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad>. Navigating to this link will show a web preview of the csv file contents, just like in the GitHub example with `dplyr::starwars`.

By inserting `/download` into this url, we can read the csv file contents directly:

```{r}
read.csv("https://osf.io/download/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad") |>
  head()
```
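
Both cases follow the same rewrite, which can be sketched as a helper. `osf_download_url()` is a hypothetical name, and the sketch assumes a url that starts with `https://osf.io/` followed by the file id:

```r
# Hypothetical helper: insert "download/" right after the osf.io host.
# The negative lookahead (hence perl = TRUE) leaves urls that already point
# to osf.io/download/ untouched.
osf_download_url <- function(url) {
  sub("^https://osf\\.io/(?!download/)", "https://osf.io/download/", url, perl = TRUE)
}

osf_urls <- osf_download_url(c(
  "https://osf.io/9vztj/",
  "https://osf.io/tr8qm?view_only=998ad87d86cc4049af4ec6c96a91d9ad"
))
osf_urls
```

Any query string (like the `view_only` key) is carried over as-is, since only the start of the url is modified.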

## Aside: Can't go wrong with a copy-paste!

I think it's severely underrated how base R has a `readClipboard()` function (Windows only) and a collection of `read.*()` functions which can also read directly from a `"clipboard"` connection.^[The special value `"clipboard"` works for most base-R read functions that take a `file` or `con` argument.]

I often do this for html/markdown summary tables that a website might display, or sometimes even for entire excel/googlesheets tables after doing a select-all. For such relatively small chunks of data that you just want to quickly get into R, you can lean on base R's clipboard functionalities.

For example, given this markdown table:

```{r, results="asis"}
aggregate(mtcars, mpg ~ cyl, mean) |>
  knitr::kable()
```

You can copy it and run the following code to get that back as an R data frame:

```{r, eval=FALSE}
read.delim("clipboard")
# Or, `read.delim(text = readClipboard())`
```

```{r, echo = FALSE}
read.table(header = TRUE, text = "
cyl mpg
4 26.66364
6 19.74286
8 15.10000
")
```

If you're instead copying something flat like a list of numbers or strings, you can use `scan()` and specify the appropriate `sep` to get that back as a vector:^[Thanks [@coolbutuseless](https://fosstodon.org/@coolbutuseless/113042231377588589) for pointing me to `textConnection()`!]

```{r}
paste(1:10, collapse = ", ") |>
  cat()
```

```{r, eval=FALSE}
scan("clipboard", sep = ",")
# Or, `scan(textConnection(readClipboard()), sep = ",")`
```

```{r, echo = FALSE}
1:10
```
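
If you want to check how some copied text would get parsed without involving the clipboard at all, both `read.delim()` and `scan()` accept a `text` argument. In this sketch, the `copied_*` strings are made-up stand-ins for clipboard contents, and I'm assuming that the copied table arrives as tab-separated text (which is typically what you get from browsers and spreadsheets):

```r
# Made-up stand-in for a copied table (assumed to be tab-separated)
copied_table <- "cyl\tmpg\n4\t26.66364\n6\t19.74286\n8\t15.10000"
df <- read.delim(text = copied_table)
df

# Made-up stand-in for a copied, comma-separated list of numbers
copied_numbers <- "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
nums <- scan(text = copied_numbers, sep = ",", quiet = TRUE)
nums
```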

It should be noted though that parsing clipboard contents is not a robust feature in base R. If you want a more principled approach to reading data from the clipboard, you should use [`{datapasta}`](https://milesmcbain.github.io/datapasta/). And for printing data for others to copy-paste into R, use [`{constructive}`](https://cynkra.github.io/constructive/). See also [`{clipr}`](https://matthewlincoln.net/clipr/), which extends clipboard read/write functionalities.