This type of task occurs often in Data Science: you need to extract data
from a table before you can use it. You can do this task quickly with
two functions that come in the dplyr
package:
- select() - which extracts columns from a data frame
- filter() - which extracts rows from a data frame
Each function takes a data frame or tibble as it’s first argument and returns a new data frame or tibble as its output.
select()
extracts columns of a data frame and returns the columns as a
new data frame. To use select()
, pass it the name of a data frame to
extract columns from, and then the names of the columns to extract. The
column names do not need to appear in quotation marks or be prefixed
with a $
; select()
knows to find them in the data frame that you
supply.
A typical select()
call looks like this:
select(<data>, <variables>)
If you want to select the loudness
variable, you run this:
select(spotify, loudness)
## # A tibble: 10,500 x 1
## loudness
## <dbl>
## 1 -10.7
## 2 -2.75
## 3 -7.05
## 4 -5.57
## 5 -3.71
## 6 -2.73
## 7 -5.63
## 8 -11.7
## 9 -7.36
## 10 -7.05
## # … with 10,490 more rows
Notice how it returned a data frame of the loudness
variable, rather
than a list or vector!
If you want to extract more than one variable you can include them,
separated by commas, as arguments to the function:
select(spotify, loudness, instrumentalness, key)
Or, you can provide the variables as a vector! They will return the same thing.
select(spotify, c(loudness, instrumentalness, key))
## # A tibble: 10,500 x 3
## loudness instrumentalness key
## <dbl> <dbl> <dbl>
## 1 -10.7 0 0.317
## 2 -2.75 0 0.792
## 3 -7.05 0.00000911 0.488
## 4 -5.57 0 0.479
## 5 -3.71 0 0.73
## 6 -2.73 0 0.904
## 7 -5.63 0 0.653
## 8 -11.7 0 0.364
## 9 -7.36 0.00000178 0.539
## 10 -7.05 0.00000101 0.462
## # … with 10,490 more rows
A third option, in case you have many variables or aren’t sure of the
variable’s exact names, is to use select helpers
.
You can also use a series of helpers with select()
. For example, if
you place a minus sign before a column name, select()
will return
every column but that column. The code below will return every column
except lyrics.
select(spotify, -lyrics)
The table below summarizes most of the other select()
helpers that are
available in dplyr
. This is here for your reference and to demonstrate
the flexibility of select()
Helper Function | Use | Example |
---|---|---|
- | Columns except | select(spotify, -loudness) |
: | Columns between (inclusive) | select(spotify, loudness:danceability) |
contains() | Columns that contains a string | select(spotify, contains("d")) |
ends_with() | Columns that ends with a string | select(spotify, ends_with("y")) |
matches() | Columns that matches a regex | select(spotify, matches("n")) |
num_range() | Columns with a numerical suffix in the range | Not applicable with spotify |
one_of() | Columns whose name appear in the given set | select(spotify, one_of(c("energy", "tempo"))) |
starts_with() | Columns that starts with a string | select(spotify, starts_with("e")) |
Exercise 1
Use the example below to get a feel for select()
. Can you extract just
the artist
column? How about the artist
and energy
columns?
select(spotify, loudness)
## # A tibble: 10,500 x 1
## loudness
## <dbl>
## 1 -10.7
## 2 -2.75
## 3 -7.05
## 4 -5.57
## 5 -3.71
## 6 -2.73
## 7 -5.63
## 8 -11.7
## 9 -7.36
## 10 -7.05
## # … with 10,490 more rows
Exercise 2
Use the contains()
helper function to select the columns that contain
the letter “n”. The structure should look like this: select(spotify, contains("d"))
. What variables are returned?
The tidyr
package allows you to reshape your entire data frame, rather
than perform simple operations with columns. It’s super helpful if you
need to, for instance, convert your wide data frame to a long
(i.e. tidy) data frame, or convert your nested JSON data to a tidy
data frame. For a short introduction, check this
out
filter()
extracts rows from a data frame and returns them as a new
data frame. As with select()
, the first argument of filter()
should
be a data frame to extract rows from. The arguments that follow should
be logical tests; filter()
will return every row for which the tests
return TRUE
.
For example, the code chunk below returns every track with a 5/4 time
signature in spotify
. 7 has been on the charts for a while! Also,
I’m happy to see my man John Prine switching up his time signatures too.
filter(spotify, time_signature == 5)
## # A tibble: 146 x 23
## rank track artist week month day year season time_since_covid
## <dbl> <chr> <chr> <date> <chr> <dbl> <dbl> <chr> <dbl>
## 1 27 Robb… Juice… 2019-03-02 Mar 2 2019 Winter -324
## 2 87 Wort… YK Os… 2019-03-02 Mar 2 2019 Winter -324
## 3 28 Robb… Juice… 2019-03-09 Mar 9 2019 Winter -317
## 4 73 Wort… YK Os… 2019-03-09 Mar 9 2019 Winter -317
## 5 34 Robb… Juice… 2019-03-16 Mar 16 2019 Spring -310
## 6 65 Wort… YK Os… 2019-03-16 Mar 16 2019 Spring -310
## 7 27 Robb… Juice… 2019-03-23 Mar 23 2019 Spring -303
## 8 62 Wort… YK Os… 2019-03-23 Mar 23 2019 Spring -303
## 9 35 Robb… Juice… 2019-03-30 Mar 30 2019 Spring -296
## 10 56 Wort… YK Os… 2019-03-30 Mar 30 2019 Spring -296
## # … with 136 more rows, and 14 more variables: covid_period <chr>,
## # danceability <dbl>, energy <dbl>, key <dbl>, loudness <dbl>, mode <chr>,
## # speechiness <dbl>, acousticness <dbl>, instrumentalness <dbl>,
## # liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>,
## # time_signature <dbl>
To get the most from filter, you will need to know how to use R’s logical test operators, which are summarized below.
Logical operator | tests | Example |
---|---|---|
> | Is x greater than y ? |
x > y |
>= | Is x greater than or equal to y ? |
x >= y |
< | Is x less than y ? |
x < y |
<= | Is x less than or equal to y ? |
x <= y |
== | Is x equal to y ? |
x == y |
!= | Is x not equal to y ? |
x != y |
is.na() | Is x an NA ? |
is.na(x) |
!is.na() | Is x not an NA ? |
!is.na(x) |
Let’s look at another example before moving on to exercises. Tracks with
an instrumentalness
value greater than 0.5 are considered to likely be
instrumental, according to the metadata. Let’s see
which tracks are instrumental!
filter(spotify, instrumentalness > 0.5)
## # A tibble: 43 x 23
## rank track artist week month day year season time_since_covid
## <dbl> <chr> <chr> <date> <chr> <dbl> <dbl> <chr> <dbl>
## 1 65 How … Offse… 2019-03-09 Mar 9 2019 Winter -317
## 2 60 How … Drake 2019-08-17 Aug 17 2019 Summer -156
## 3 93 Fear… Tool 2019-08-17 Aug 17 2019 Summer -156
## 4 85 In M… Frank… 2019-11-16 Nov 16 2019 Fall -65
## 5 74 Ever… Billi… 2019-11-23 Nov 23 2019 Fall -58
## 6 8 Ever… Billi… 2019-11-30 Nov 30 2019 Fall -51
## 7 26 Ever… Billi… 2019-12-07 Dec 7 2019 Fall -44
## 8 25 Ever… Billi… 2019-12-14 Dec 14 2019 Fall -37
## 9 30 Ever… Billi… 2019-12-21 Dec 21 2019 Winter -30
## 10 34 Ever… Billi… 2019-12-28 Dec 28 2019 Winter -23
## # … with 33 more rows, and 14 more variables: covid_period <chr>,
## # danceability <dbl>, energy <dbl>, key <dbl>, loudness <dbl>, mode <chr>,
## # speechiness <dbl>, acousticness <dbl>, instrumentalness <dbl>,
## # liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>,
## # time_signature <dbl>
Huh, it seems like this isn’t a strict enough filter. There are quite a few tracks that definitely have words in them here. Let’s increase the filter to 0.8 and see where that gets us.
filter(spotify, instrumentalness > 0.8)
## # A tibble: 4 x 23
## rank track artist week month day year season time_since_covid
## <dbl> <chr> <chr> <date> <chr> <dbl> <dbl> <chr> <dbl>
## 1 50 Can I Kehla… 2020-05-23 May 23 2020 Spring 124
## 2 47 Deck… Nat K… 2020-12-26 Dec 26 2020 Winter 341
## 3 100 Beau… Kid C… 2020-12-26 Dec 26 2020 Winter 341
## 4 43 Deck… Nat K… 2021-01-02 Jan 2 2021 Winter 348
## # … with 14 more variables: covid_period <chr>, danceability <dbl>,
## # energy <dbl>, key <dbl>, loudness <dbl>, mode <chr>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, duration_ms <dbl>, time_signature <dbl>
Jeez Spotify, I don’t think instrumentalness
is a very accurate
metric! Or something may have gone wrong when I was mining the data…
This goes to show- exploring your data beforehand will help you catch
errors early so they don’t make it into your analyses!
Exercise 3
See if you can use the logical operators to manipulate our code below to show:
- All of the songs in the “post_covid”
covid_period
- All of the songs where
tempo
is less than or equal to 80 - Stretch challenge. All songs that have more than 20 characters in
their title. Hint: you will need to use the
nchar()
function.
filter(spotify, covid_period == "pre_covid")
When you use logical tests, be sure to look out for two common mistakes. One appears in each code chunk below. Can you find them? When you spot a mistake, fix it and then run the chunk to confirm that it works.
filter(spotify, artist = "DaBaby")
filter(spotify, artist == DaBaby)
Answer
filter(spotify, artist == "DaBaby")
## # A tibble: 114 x 23
## rank track artist week month day year season time_since_covid
## <dbl> <chr> <chr> <date> <chr> <dbl> <dbl> <chr> <dbl>
## 1 87 Suge DaBaby 2019-04-13 Apr 13 2019 Spring -282
## 2 63 Suge DaBaby 2019-04-20 Apr 20 2019 Spring -275
## 3 51 Suge DaBaby 2019-04-27 Apr 27 2019 Spring -268
## 4 46 Suge DaBaby 2019-05-04 May 4 2019 Spring -261
## 5 27 Suge DaBaby 2019-05-11 May 11 2019 Spring -254
## 6 22 Suge DaBaby 2019-05-18 May 18 2019 Spring -247
## 7 16 Suge DaBaby 2019-05-25 May 25 2019 Spring -240
## 8 14 Suge DaBaby 2019-06-01 Jun 1 2019 Spring -233
## 9 9 Suge DaBaby 2019-06-08 Jun 8 2019 Spring -226
## 10 8 Suge DaBaby 2019-06-15 Jun 15 2019 Summer -219
## # … with 104 more rows, and 14 more variables: covid_period <chr>,
## # danceability <dbl>, energy <dbl>, key <dbl>, loudness <dbl>, mode <chr>,
## # speechiness <dbl>, acousticness <dbl>, instrumentalness <dbl>,
## # liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>,
## # time_signature <dbl>
When you use logical tests, be sure to look out for these two common mistakes:
- using
=
instead of==
to test for equality. - forgetting to use quotation marks when comparing strings,
e.g.
artist == DaBaby
, instead ofgenre == "DaBaby"
You can find a complete list of base R’s boolean operators in the table below. This is here for your reference.
Boolean operator | represents | Example |
---|---|---|
& | Are both A and B true? |
A & B |
| |
Are one or both of A and B true? |
A | B |
! | Is A not true? |
!A |
xor() | Is one and only one of A and B true? |
xor(A, B) |
%in% | Is x in the set of a , b , and c ? |
x %in% c(a, b, c) |
any() | Are any of A , B , or C true? |
any(A, B, C) |
all() | Are all of A , B , or C true? |
all(A, B, C) |
These allow you to customize filtered searches according to a variety of
criteria. We’re not going to go into depth about all of these here. The
basic Boolean operators we will focus on today are & and |
(or).
If you provide more than one test to filter()
, filter()
will combine
the tests with an and statement (&
): it will only return the rows
that satisfy all of the tests.
For instance,
filter(spotify, mode == "minor", tempo > 80)
will only return songs that are in a minor key and are faster than 80 beats per minute. Run this code and see how many observations are returned.
Now, try this.
filter(spotify, mode == "minor" | tempo > 80)
Many more observations are returned because the function is filtering for songs that are in a minor key or have a tempo greater than 80 beats per minute. A song only has to satisfy one of the criteria to pass the filter.
The stringr
package is very helpful for manipulating strings in R.
It’s part of the core tidyverse set of packages, so it is easy to
integrate into your data wrangling pipeline. It takes some of the pain
away from forming your own regex
expressions, but is flexible enough
to accomodate complex cases. I use stringr
regularly within filter()
and other data manipulation functions. If you’re interested, check out
this light introduction.
We’re going to kick off the data visualization segment with an
introduction to the ggplot2
package and a simple, but important task-
visualizing distributions.
ggplot2
is a plotting package that makes it simple to create complex
plots from data in a data frame. It stands for “grammar of graphics
plot” and the 2 comes from its total redesign after the creator
Hadley Wickham scrapped the original package. The
grammar of
graphics is a
useful framework to describe the components of a graphic. It provides a
more programmatic interface for specifying what variables to plot, how
they are displayed, and general visual properties. Therefore, we only
need minimal changes if the underlying data change or if we decide to
change from a bar plot to a scatter plot. This gives us a lot of
customization flexibility with reasonable defaults.
In practical terms,
-
ggplot2
functions like tidy data. Well-structured data will save you lots of time when making figures. -
ggplot2
graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
To build a ggplot, we will use the following basic template that can be used for different types of plots:
ggplot(data = < DATA > , aes( < MAPPINGS > )) + geom_FUNCTION()
data Bind the plot to a specific data frame using the data argument
aes() Use the aes() (aesthetics) function to select the variables to be plotted and specify how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.
geom Add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). Here are some examples:
geom_point()
for scatter plots, dot plots, etc. geom_smooth()
for
trendlines geom_boxplot()
for boxplots geom_histogram()
for
histograms
Plot components are tied together using the +
operator. You can add
geom_*
s, for instance plotting both a scatterplot and a trendline:
ggplot() + geom_point() + geom_smooth()
,
or add components that customize the plot- we will get to this part later.
Let’s build a plot, piece by piece!
Visualizing the distribution of your data is one of the most fundamental
methods of data exploration. We’ll learn the fundamentals of ggplot2
while making a fundamental plot!
Every plot begins with a call to ggplot()
(no 2! That’s only in
the package name). First, we have to specify the data with the data =
argument. In addition, we need to specify what variables we’re plotting
and which axes we’re plotting them on. These are considered plot
aesthetics and are specified within the aes()
function.
Let’s visualize the distribution of valence
! It is Spotify’s derived
metric that describes the “happiness” and “positivity” of a song. Since
we’re visualizing a single variable, we only provide an aesthetic for
the x axis.
ggplot(data = spotify, aes(x = valence))
As you can see, we get a mostly blank plot when we run the ggplot
function! That’s because we haven’t told ggplot
what type of geom_
we want. ggplot
sets the framework that the rest of the components
inherit. Every plot component that follows will use spotify
as the
data set and valence
as the x-axis data, unless otherwise specified.
Let’s make something classic- a histogram. It’s easy- just specify the geom!
ggplot(data = spotify, aes(x = valence)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Huh, you got a funny message-
“stat_bin()
using bins = 30
. Pick better value with binwidth
.”
ggplot
defaults to using 30 bins, but that may not always be the best
choice. Different values can highlight aspects about your distribution
that may be useful or obscure important patterns. We can play with the
binwidth
argument to adjust how wide each bin is in the units of our
data. Let’s make them extra narrow. Notice how we specify the binwidth
argument within the call to geom_histogram
- that’s because it is an
argument specific to this geom!
ggplot(data = spotify, aes(x = valence)) +
geom_histogram(binwidth = 0.01)
I don’t think this uncovers patterns and just adds noise to the distribution. Let’s widen those binwidths.
ggplot(data = spotify, aes(x = valence)) +
geom_histogram(binwidth = 0.1)
I like this a bit better- the distribution looks to be mostly normal with a bit of a right-skew. Most songs appear to be middle of the road in their positivity.
I’m curious- the COVID pandemic has been extremely tough on the psyche
of, I imagine, most people. Does this show up in our choice of pop
music? Let’s use our data isolation skillset to see if music from a time
period before the pandemic (a random week in March 2019) was happier
than the long-term average. We need to filter
our data set for
pre-COVID songs and then visualize the valence of those songs. The
variable that contains this information is the covid_period
variable,
where pre_covid
indicates the pre-COVID time frame.
# first filter for pre-covid songs
pre_covid <- filter(spotify, covid_period == "pre_covid")
# now, visualize a histogram of the valence for these songs
ggplot(data = pre_covid, aes(x = valence)) +
geom_histogram(binwidth = 0.1)
Hmm, there doesn’t seem to be a large change (if any). I think it would be cool to visualize the different periods of the COVID pandemic side-by-side to better visualize differences. We’ll learn this in the next data visualization segment!
While these plots aren’t too pretty, they get the message across. We will learn how to spruce up the plots in a bit!
Exercise 4
Try to make your own ggplot! Visualize a histogram of the loudness
variable. Play around with the binwidths to get a feel of how this
variable is distributed! If you’re feeling a bit ambitious, see if you
can filter the data by another variable (like covid_period
perhaps?)
before you create the visualization to see if there is an interesting
pattern not obvious in the full dataset.
Although this isn’t an R-specific book, Claus Wilke’s (free) Fundamentals of Data Visualization is a must-read for those trying to up their data-viz game. He’s a contributor to the ggplot2 package, and even made a couple of packages that complement ggplot2, like cowplot for stitching together plots and annotating them, ggridges for creating partially overlapping line plots that look like mountain ranges (or that one Joy Division album cover), and ggtext, which allows you to add Markdown and HTML formatted text to your ggplots. Super cool stuff! The book doesn’t contain R code and isn’t a “this is how you make this plot in R” book, but that type of info is available in many other places, like From data to viz.
Exercise 1
select(spotify, artist)
select(spotify, artist, energy)
Exercise 2
spotify %>% select(contains("n"))
## # A tibble: 10,500 x 14
## rank month season time_since_covid danceability energy loudness speechiness
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Feb Winter -338 0.778 0.317 -10.7 0.334
## 2 2 Feb Winter -338 0.687 0.792 -2.75 0.0452
## 3 3 Feb Winter -338 0.752 0.488 -7.05 0.0705
## 4 4 Feb Winter -338 0.76 0.479 -5.57 0.0466
## 5 5 Feb Winter -338 0.834 0.73 -3.71 0.222
## 6 6 Feb Winter -338 0.579 0.904 -2.73 0.0618
## 7 7 Feb Winter -338 0.717 0.653 -5.63 0.0658
## 8 8 Feb Winter -338 0.837 0.364 -11.7 0.276
## 9 9 Feb Winter -338 0.829 0.539 -7.36 0.208
## 10 10 Feb Winter -338 0.611 0.462 -7.05 0.0646
## # … with 10,490 more rows, and 6 more variables: acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, duration_ms <dbl>,
## # time_signature <dbl>
Exercise 3
filter(spotify, covid_period == "post_covid")
filter(spotify, tempo <= 80)
filter(spotify, nchar(track) > 20)
Exercise 4
ggplot(data = spotify, aes(x = loudness)) +
geom_histogram(binwidth = 1)
I thought it would be cool to see if top 40 songs are louder! I filtered for all songs with a rank equal to or below 40 before visualizing loudness. It looks to be the case! This filter at least got rid of some of the quieter outliers. I took a look at the extra quiet song, and it looks like DaBaby’s song Baby on Baby wanted to keep it quiet for the babies.
top40_songs <- filter(spotify, rank <= 40)
ggplot(data = top40_songs, aes(x = loudness)) +
geom_histogram(binwidth = 1)