---
title: "R for Data Analysis"
output: github_document
---
[Next >>>](02_isolating-data.md)
## Welcome
The layout and much of the content in this tutorial borrows from [RStudio Primers](https://rstudio.cloud/learn/primers). I would highly recommend exploring [RStudio Primers](https://rstudio.cloud/learn/primers) as they are a great resource for learning basic and intermediate R concepts!
In this case study, we will explore how the musical tastes of the American public have changed in relation to the COVID-19 pandemic and other factors. We're working with a fun data set comprising the musical attributes of weekly [Billboard Hot 100](https://www.billboard.com/charts/billboard-200) songs from February 2019 to February 2021. The musical attributes were mined using the [spotifyr](https://www.rcharlie.com/spotifyr/) interface to the [Spotify Web API](https://developer.spotify.com/documentation/web-api/)*. Along the way, you will master some of the core functions for isolating and summarizing variables, cases, and values within a data frame:
* `select()` and `filter()`, which let you extract columns and rows from a data frame
* `arrange()`, which lets you reorder the rows in your data
* `%>%`, which organizes your code into reader-friendly "pipes"
* `group_by()`, which lets you group your data by a factor
* `summarize()`, which lets you perform a function on your grouped data
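As a preview, here is a minimal sketch of how these functions chain together. The `toy_tracks` tibble and its values are made up for illustration and are not part of the `spotify` data.

```{r verbs-sketch, message = FALSE}
library(dplyr)

# Hypothetical toy data: tracks and values are invented for illustration
toy_tracks <- tibble(
  track        = c("A", "B", "C", "D"),
  covid_period = c("pre_covid", "pre_covid", "post_covid", "post_covid"),
  energy       = c(0.75, 0.50, 0.25, 1.00),
  tempo        = c(120, 95, 140, 110)
)

toy_summary <- toy_tracks %>%
  select(track, covid_period, energy) %>%  # keep three of the four columns
  filter(energy > 0.4) %>%                 # keep rows with energy above 0.4
  arrange(desc(energy)) %>%                # reorder rows, highest energy first
  group_by(covid_period) %>%               # one group per period
  summarize(mean_energy = mean(energy),    # one summary row per group
            n_tracks = n())

toy_summary
```

Each verb takes a data frame and returns a data frame, which is what lets `%>%` pass the result of one step straight into the next.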
In addition to "wrangling" data, you will learn how to create visually appealing and effective data visualizations using the `ggplot2` package. Mastering these two skills will help you efficiently explore your data and create publishable figures.
This tutorial uses the [core tidyverse packages](http://tidyverse.org/), including `ggplot2`, `tibble`, `readr`, and `dplyr`.
\* For a PDF with code detailing how I mined the data, <a id="raw-url" href="https://github.com/GC-DRI/r_data_analysis_2021/raw/main/00_acquire_spotify_data.pdf">click here</a>.
## Music
First, let's load the `tidyverse` suite of packages and the `here` package. The `here` package is a simple package that makes reproducing file paths much easier. We'll see it in action in the next code chunk.
```{r setup, echo = TRUE, message = FALSE, warning = FALSE}
library(tidyverse)
library(here)
```
### Spotify data
Now let's read in the Spotify data. We'll use the `read_csv()` function for this. Notice our use of the `here()` function as well. Rather than typing in a full path, we pass each part of the path as a separate argument. `here()` does the rest: it builds the path relative to the project's top-level directory and inserts the separators for us. This sidesteps the "forward-slash, back-slash" incompatibility between Windows and UNIX systems.
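To make the path-building step concrete, here is a small sketch. It only constructs the path string and does not read any file; the exact root directory printed will depend on where your project lives.

```{r here-sketch, message = FALSE}
library(here)

# here() joins the pieces onto the project root and returns a single path string
spotify_path <- here("data", "spotify.csv")
spotify_path

# Base R's file.path() does the joining step too, but only relative to
# the current working directory, which can change between sessions
file.path("data", "spotify.csv")
```

Because the root is detected from the project rather than the working directory, the same call should resolve to the same file for anyone who opens the project.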
`read_csv()` also handily prints the column names and the types it inferred from the CSV file. This gives you a quick check that the columns are being read in and interpreted correctly. Which columns are numeric and which are character?
```{r spotify}
spotify <- read_csv(here("data", "spotify.csv"))
```
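If you want to look at the inferred column types again later, or state them yourself, the sketch below shows one way. It uses a tiny made-up CSV written to a temporary file as a stand-in for `spotify.csv`; the column names `track_name` and `energy` are assumptions for illustration.

```{r col-types-sketch, message = FALSE}
library(readr)

# Hypothetical two-row CSV in a temporary file, standing in for the real data
tmp_csv <- tempfile(fileext = ".csv")
write_lines(c("track_name,energy", "Song A,0.81", "Song B,0.42"), tmp_csv)

# Let read_csv() guess the column types, then inspect the guess
guessed <- read_csv(tmp_csv)
spec(guessed)

# Or declare the expected types up front instead of relying on the guess
declared <- read_csv(
  tmp_csv,
  col_types = cols(track_name = col_character(), energy = col_double())
)
```

Declaring `col_types` is optional, but it documents what you expect each column to contain.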
Let's take a look at the data set. I will demonstrate two ways to do this, although there are many functions available to view your data. The first is just to print the data set to the screen. Notice the `<dbl>` and `<chr>` notes below each column name. These tell you what type of data each column contains. You should also see how many rows and columns are in the data. It's always useful to check both of these characteristics throughout an analysis to make sure you haven't dropped data that you need and are keeping only the data that you want.
```{r spotify-print, echo = TRUE}
spotify
```
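As a side note, the row and column counts can also be checked directly rather than read off the printout. The sketch below uses a made-up `toy_view` tibble, not the real `spotify` data.

```{r inspect-sketch, message = FALSE}
library(dplyr)

# Hypothetical toy tibble; the real spotify data has many more rows and columns
toy_view <- tibble(
  track_name = c("Song A", "Song B", "Song C"),
  energy     = c(0.81, 0.42, 0.65)
)

dim(toy_view)      # number of rows, then number of columns
glimpse(toy_view)  # one line per column: name, type, and first few values

# Re-check the row count after a step that removes rows
kept <- filter(toy_view, energy > 0.5)
nrow(kept)
```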
To display the data in a spreadsheet-like format, try out the `View()` function from base R. It offers some basic sorting functionality that may be familiar to you. I like to use this function when I want to quickly scroll through more than the first 10 or so observations, or when I have a large number of variables to page through and would rather not view them in the console. Let's try it out!
```{r spotify-view, eval=FALSE}
View(spotify)
```
### Billboard Hot 100 trends
The `spotify` data set we're using today has a lot of potential for interesting analyses. For instance, do Billboard Hot 100 hits tend to be more vocal or more instrumental? It looks like they're predominantly vocal!
```{r inst-hist, echo = FALSE, message = FALSE, warning = FALSE, out.width = "70%"}
spotify %>%
  ggplot(aes(x = instrumentalness)) +
  geom_histogram(bins = 50, fill = "darkblue") +
  theme_minimal() +
  labs(x = "Instrumentalness",
       y = "Frequency",
       title = "Distribution of instrumentalness across Billboard Hot 100 tracks")
```
Does this pattern change across the time periods in which we sampled the Hot 100 chart? It doesn't look like it. Top 40 radio doesn't really appreciate instrumental music...
```{r inst-hist-facet, echo = FALSE, message = FALSE, warning = FALSE, out.width = "70%"}
spotify %>%
  mutate(time_long = case_when(
    covid_period == "pre_covid" ~ "Pre-COVID",
    covid_period == "post_covid" ~ "Post-COVID"
  )) %>%
  mutate(time_long = fct_relevel(time_long,
                                 "Pre-COVID",
                                 "Post-COVID")) %>%
  ggplot(aes(x = instrumentalness)) +
  geom_histogram(bins = 50, fill = "darkblue") +
  facet_wrap(~time_long) +
  theme_minimal() +
  labs(x = "Instrumentalness",
       y = "Frequency",
       title = "Distribution of instrumentalness across Billboard Hot 100 tracks")
```
While this makes sense and is somewhat interesting, I think we can uncover some more interesting patterns in the data with a little digging.
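As an aside on the faceted histogram: its code relabels `covid_period` with `case_when()` and then reorders the labels with `fct_relevel()`. The sketch below shows, on a made-up vector of period codes, why the reordering step matters for the panel order.

```{r relevel-sketch, message = FALSE}
library(dplyr)
library(forcats)

# Hypothetical vector of period codes, using the same values as the plot code
period <- c("post_covid", "pre_covid", "pre_covid", "post_covid")

# Map each code to a display label
period_label <- case_when(
  period == "pre_covid"  ~ "Pre-COVID",
  period == "post_covid" ~ "Post-COVID"
)

# Default factor levels are alphabetical, so "Post-COVID" would come first
default_levels <- levels(factor(period_label))
default_levels

# fct_relevel() moves the named levels to the front, in the order given
releveled <- levels(fct_relevel(period_label, "Pre-COVID", "Post-COVID"))
releveled
```

`facet_wrap()` lays out panels in factor-level order, so releveling is what puts the "Pre-COVID" panel on the left.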
### Sections
Your goal in this workshop is to uncover more (interesting) patterns in the `spotify` data. Along the way, you will gain a data wrangling and visualization tool set that will help you explore and present data in any context.
1) [Isolating data and visualizing distributions](02_isolating-data.md)
2) [Piping and adding color](03_piping.md)
3) [Summarizing data and visualizing trends](04_summarizing-data.md)
-----
[Next >>>](02_isolating-data.md)