-
Notifications
You must be signed in to change notification settings - Fork 1
/
2-using-R.Rmd
329 lines (214 loc) · 18.9 KB
/
2-using-R.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
# How to work with R
```{r setup, include = FALSE}
library(tidyverse)
library(tidytext)
beer_data <- read_csv("data/ba_2002.csv")
```
Now that we've all got R up and running, we're going to quickly go over the basic functionality of `R` to make sure everyone is on the same page. If you have ever used `R`, this might include some review, but my hope is that this will be helpful to everyone and get us all on the same page before we launch into some more advanced applications.
We're going to speed through the basics of the console (working interactively with `R`) and then some of the "programming"/"coding" capabilities in `R`.
## Doing math and creating objects in `R`
At it's most basic, `R` can be a calculator. If you type math into the `Console` and hit `return` it will do math for you!
```{r calculator 1}
2 + 5
```
```{r calculator 2}
1000 / 3.5
```
To get the most out of R, though, we are going to want to use its abilities as a programming language. Among some other topics that I won't explicitly cover today, this means using `R` to store values (and later objects that contain multiple lines of values, like you'd get in an Excel spreadsheet).
This set of characters is the **assignment operator**: `<-`. It works like this:
```{r assignment demo}
x <- 100
hi <- "hello world"
data_set <- rnorm(n = 100, mean = 0, sd = 1)
```
... but that didn't do anything! Where's the output? Well, we can do two things. First, look at the "Environment" tab in your RStudio after you run the above code chunk. You'll notice that there are 3 new things there: `x`, `hi`, and `data_set`. In general I am going to call those **objects**--they are stored variables, which R now knows about by name. How did it learn about them? You guessed it: the assignment operator: `<-`.
To be explicit: `x <- 100` can be read in English as "x gets 100" (what a lot of programmers like to say) or, in a clearer but longer way, "assign 100 to a variable called x".
**NB**: R also allows you to use `=` as an assignment operator. **DO NOT DO THIS!**. There are two good reasons.
1. It is ambiguous, because it is not directional (how are you sure what is getting assigned where?)
2. This makes your code super confusing because `=` is the *only* assignment operator for *arguments* in functions (as in `print(quote = FALSE)` see below for more on this)
3. Anyone who has used R for a while who sees you do this will roll their eyes and kind of make fun of you a little bit
**NB2**: Because it is directional, it is actually possible to use `->` as an assignment operator as well. What do you think it does? Check and find out.
If you find typing `<-` a pain (reasonably), RStudio provides a keyboard shortcut: either `option + -` on Mac or `Alt + -` on PC.
I am going to use the terms "object" and "variable" interchangeably in this workshop--in other programming languages there are hard distinctions between these concepts, and even in some `R` packages this can be important, but for our purposes I mean the same thing if I am sloppy.
The advantage of storing values into objects is that we can do math with them, change them, and duplicate and store them elsewhere.
```{r algebra 1}
x / 10
```
```{r algebra 2}
x + 1
```
Note that doing math with a variable doesn't change the variable itself. To do that, you have to use the assignment `<-` to change the value of the variable again:
```{r updating a variable}
x
x <- 100 + 1
x
```
## Functions and their arguments in `R`
Obviously if we're using `R` as a calculator we might want to do more than basic arithmetic. What about taking the square root of `x`?
```{r first function}
sqrt(x)
```
In fact, we can ask R to do lots of neat stuff, like generate random numbers for us. For example, here are 100 random numbers from a normal distribution with mean of 0 and standard deviation of 1.
```{r a more complex function}
rnorm(n = 100, mean = 0, sd = 1)
```
These forms are called a **function** in R. Functions lie at the heart of `R`'s power: they are pre-written scripts that are included with base `R` or added in packages, like the ones we installed. In general, an `R` function will have a form like `<name>(<argument>, <argument>, ...)`. In other words, the function will have a name (that lets `R` know what you're trying to do) followed by an open parenthesis, and inside that a list of arguments, which are variables, objects, values, etc that you "pass" to the function, finally followed by a close parenthesis. In the case of our `sqrt()` function, there is only a single argument: a variable to which the square-root operation will be applied. In the case of the `rnorm()` function there are 3 arguments: the number of values we want, `n`, and the `mean` and standard deviation `sd` of the normal distribution we wish to sample from.
Functions are the `R` tools for which you can make the most use of the help operator: `?`. Try typing `?rnorm` into your console, and when you hit `return` you'll see the help page for the function.
Notice that in the `rnorm()` example we *named* the arguments--we told `R` which was the `n`, `mean`, and `sd`. This is because if we name arguments, we can give them in any order. Otherwise, `R` will try to match the provided values to the arguments **in the order in which they are given in the help file**. This is a major source of errors for newer `R` users!
```{r argument order matters in functions}
# This will give us 1 value from the normal distribution with mean = 0 and sd = 100
rnorm(1, 0, 100)
# But we can also use named arguments to provide them in any order we wish!
rnorm(sd = 1, n = 100, mean = 0)
```
Programming languages like `R` are very literal, and we need to be as literal as we can to make them work the way we want them to.
## Reading data into `R`
So far we've only done very basic things with `R`, which probably haven't sold you on its power and utility. Let's try doing some things that will hopefully get us a little further towards actual, useful applications.
First off, make sure you have the `tidyverse` package loaded by using the `library()` function.
```{r loading packages 1}
library(tidyverse)
```
Now, we're going to read in the data we're going to use for this workshop using the function `read_csv()`. If you want to learn about this function, use the `?read_csv` command to get some details.
In the workshop archive you downloaded, the `data/` directory has a file called `ba_2002.csv`. If you double-click this in your file browser, it will (most likely) open in Excel, and you can take a look at it. It is 20,000 reviews of beer, posted in the year 2002. This is an edited (not original) and very delimited version of the dataset (which had >1 million reviews), as the original one has been removed from the web at the original website owner's request. It will be the training dataset we use for today.
To get this data into `R`, we have to tell `R` where it lives on your computer, and what kind of data it is.
### Where the data lives
We touched on **working directories** because this is how `R` "sees" your computer. It will look first in the working directory, and then you will have to tell it where the file is *relative* to that directory. If you have been following along and opened up the `.Rproj` file in the downloaded archive, your working directory should be the archive's top level, which will mean that we only need to point `R` towards the `data/` folder and then the `ba_2002.csv` file. We can check the working directory with the `getwd()` function.
```{r getting the working directory}
getwd()
```
Therefore, **relative to the working directory**, the file path to this data is `data/ba_2002.csv`. Please note that this is the UNIX convention for file paths: in Windows, the backslash `\` is used to separate directories. Happily, RStudio will translate between the two conventions, so you can just follow along with the macOS/UNIX convention in this workshop.
### What kind of file are we importing?
The first step is to notice this is a `.csv` file, which stands for **c**omma-**s**eparated **v**alue. This means our data, in raw format, looks something like this:
```
# Comma-separated data
cat_acquisition_order,name,weight\n
1,Nick,9\n
2,Margot,10\n
3,Little Guy,13\n
```
Each line represents a row of data, and each field is separated by a comma (`,`). We can read this kind of data into `R` by using the `read_csv()` function.
```{r reading in data}
read_csv(file = "data/ba_2002.csv")
```
Suddenly, we have tabular data (i.e., data in rows and columns), like we'd have in Excel! Now we're getting somewhere. However, before we go forward we'll have to store this data somewhere--right now we're just reading it and throwing it away.
```{r storing data when loaded}
beer_data <- read_csv(file = "data/ba_2002.csv")
```
As a note, in many countries the separator (delimiter) will be the semi-colon (`;`), since the comma is used as the decimal marker. To read files formatted this way, you can use the `read_csv2()` function. If you encounter tab-separated values files (`.tsv`) you can use the `read_tsv()` function. If you have more non-standard delimiters, you can use the `read_delim()` function, which will allow you to specify your own delimiter characters. You can also read many other formats of tabular data using the `rio` package ("read input/output"), which can be installed from CRAN.
## Data in `R`
Let's take a look at the `Environment` tab. Among some other objects you may have created (like `x`), you should see `beer_data` listed. This is a type of data called a `data.frame` in `R`, and it is going to be, for the most part, the kind of data you interact with most. Let's learn about how these types of objects work by doing a quick review of the basics.
We started by creating an object called `x` and storing a number (`100`) into it. What kind of thing is this?
```{r object properties 1}
x <- 100
class(x)
typeof(x)
```
`R` has a bunch of basic data types, including the above "numeric" data type, which is a "real number" (in computer terms, a floating-point double as opposed to an integer). It can also store logical values (`TRUE` and `FALSE`), integers, characters/strings (which are what we're really here to deal with) and some more exotic data types you won't encounter very much. What `R` does that makes it good for data analysis is that it stores these all as **vectors**: 1-dimensional arrays of the same type of data. So, in fact, `x` is a length-1 vector of numeric data:
```{r object properties 2}
length(x)
```
The operator to explicitly make a vector in `R` is the `c()` function, which stands for "combine". So if we want to make a vector of a few values, we use this function as so:
```{r the c()ombine function}
y <- c(1, 2, 3, 10, 50)
y
```
We can also use `c()` to combine pre-existing objects:
```{r combining variables with c()}
c(x, y)
```
You can have vectors of other types of objects:
```{r vector types}
animals <- c("fox", "bat", "rat", "cat")
class(animals)
```
If we try to combine vectors of 2 types of data, `R` will "coerce" the data types to match to the less restrictive type, in the following order: `logical > integer > numeric > character`. So if we combine `y` and `animals`, we'll turn the numbers into their character representations. I mention this because it can be a source of error and confusion when we are working with large datasets, as we may see.
```{r type coercion}
c(y, animals)
```
For example, we can divide all the numbers in `y` by `2`, but if we try to divide `c(y, animals)` by `2` we will get an error:
```{r math with characters, error = TRUE}
c(y, animals) / 2
```
For vectors (and more complex objects), we can use the `str()` ("structure") function to get some details about their nature and what they contain:
```{r examining the str()ucture of objects}
str(y)
str(animals)
```
This `str()` function is especially useful when we have big, complicated datasets, like `beer_data`:
```{r examining our beer data}
str(beer_data)
```
### Subsetting data
Vectors, by nature, are ordered arrays of data (in this case, 1-dimensional arrays). That means they have a first element, a second element, and so on. Our `y` vector has 5 total elements, and our `animals` vector has 4 elements. In `R`, the way to **subset** vectors is to use the `[]` (square brackets) operator. For a 1-dimensional vector, we use this to select one or more elements:
```{r subsetting 1}
y[1]
animals[4]
```
We can also select multiple elements by using a vector of indices
```{r subsetting 2}
animals[c(1, 2)]
```
A shortcut for a sequence of numbers in `R` is the `:` (colon) operator, so this is often used for indexing:
```{r the ":" sequence shortcut}
1:3
animals[1:3]
```
We often want to use programmatic (or "conditional") logic to subset vectors and more complex datasets. For example, we might want to only select elements of `y` that are less than `10`. To do that, we can use one of `R`'s conditional logic operators: `<`, `>`, `<=`, `>=`, `==`, or `!=`. These, in order, stand for "less than, "greater than,", "less than or equal to," "greater than or equal to," "equal to," and "not equal to."
```{r logical operators}
y < 10
```
We can then use this same set of logical values to select only the elements of `y` for which the condition is `TRUE`:
```{r subsetting with logical operators}
y[y < 10]
```
This is useful if we have a long vector (frequently) and do not want to list or are not able to list all of the actual indices that we want to select.
### Complex vectors/lists: `data.frame` and `tibble`
Now that we have the basics of vectors, we can move on to the complex data object we're really interested in: `beer_data`. This is a type of object called a `tibble`, which is a cute/fancy version of the more basic `R` object called a `data.frame`. These are `R`'s version of the `.csv` file or your typical Excel file: a rectangular matrix of data, with (usually) columns representing some variable and rows representing some kind of observation. Each row will have a value in each column or will be `NA`, which is `R`'s specific value to represent **missing data**.
In a `data.frame`, every column has to have only a single data type: a column might be `logical` or `integer` or `character`, but it cannot be a mix. However, each column can be a different type. For example, the first column in our `beer_data`, called `reviews`, is a `character` vector, but the third column, `beer_id`, is a `numeric` column.
We have now moved from 1-dimensional vectors to 2-dimensional data tables, which means we're going to have some new properties to investigate. First off, we might want to know how many rows and columns our data table has:
```{r data frame properties}
nrow(beer_data)
ncol(beer_data)
length(beer_data) # Note that length() of a data.frame is the same as ncol()
```
We already tried running `str(beer_data)`, which gives us the data types of each column and an example. Some other ways to examine the data include the following:
```{r examining what a data frame contains, eval = FALSE}
beer_data # simply printing the object
head(beer_data) # show the first few rows
tail(beer_data) # show the last few rows
glimpse(beer_data) # a more compact version of str()
names(beer_data) # get the variable/column names
```
Note that some of these functions (for example, `glimpse()`) come from the `tidyverse` package, so if you are having trouble running a command, first make sure you have run `library(tidyverse)`.
## Subsetting and wrangling data tables
Since we now have 2 dimensions, our old strategy of using a single number to select a value from a vector won't work! But the `[]` operator still works on data frames and tibbles. We just have to specify coordinates, as `[<row>, <column>]`.
```{r subsetting data tables 1}
beer_data[5, 1] # get the 5th row, 1st column value
```
We can continue to use ranges or vectors of indices to select larger parts of the table
```{r subsetting data tables 2}
beer_data[1:5, 1] # get the first 5 rows of the 1st column value
```
If we only want to subset on a specific dimension and get everything from the other dimension, we just leave it blank.
```{r subsetting data tables 3}
beer_data[, 2] # get all rows of the 2nd column
```
We can also use logical subsetting, just like in vectors. This is very powerful but a bit complicated, so we are going to introduce some `tidyverse` based operators to do this that will make it a lot easier. I will just give an example:
```{r logical subsetting for data tables}
beer_data[beer_data$rating > 4.5, ] # get all rows for which rating > 3
```
In this last example I also introduced the final bit of `tibble` and `data.frame` wrangling we will cover here: the `$` operator. This is the operator to select a single column from a `data.frame` or `tibble`. It gives you back the vector that makes up that column:
```{r getting variables out of a data frame, eval = FALSE}
beer_data$style # not printed because it is too long!
```
One of the nice things about RStudio is that it provides tab-completion for `$`. Go to the console, type "bee" and hit `tab`. You'll see a list of possible matches, with `beer_data` at the top. Hit enter, and this will fill out the typing for you! Now, type "$" and hit `tab` again. You'll see a list of the columns in `beer_data`! This can save a huge amount of typing and memorizing the names of variables and objects.
Now that we've gone over the basics of creating and manipulating objects in `R`, we're going to run through the basics of data manipulation and visualization with the `tidyverse`. Before we move on to that topic, let's address **any questions**.
## PSA: not-knowing is normal!
Above, I mentioned "help files". How do we get help when we (inevitably) run into problems in R? There are a couple steps you will find helpful in the future:
1. Look up the help file for whatever you're doing. Do this by using the syntax `?<search item>` (for example `?c` gets help on the vector command) as a shortcut on the console.
2. Search the help files for a term you think is related. Can't remember the command for making a sequence of integers? Go to the "Help" pane in RStudio and search in the search box for "sequence". See if some of the top results get you what you need.
3. The internet. Seriously. I am not kidding even a little bit. R has one of the most active and (surprisingly) helpful user communities I've ever encountered. Try going to google and searching for "How do I make a sequence of numbers in R?" You will find quite a bit of useful help. I find the following sites particularly helpful
1. [Stack Overflow](https://stackoverflow.com/questions/tagged/r)
2. [Cross Validated/Stack Exchange](https://stats.stackexchange.com/questions/tagged/r)
3. Seriously, [Google will get you most of the way to helpful answers](https://is.gd/80V5zF) for many basic R questions.
We may come back to this, but I want to emphasize that **looking up help is normal**. I do it all the time. Learning to ask questions in helpful ways, how to quickly parse the information you find, and how to slightly alter the answers to suit your particular situation are key skills.