```{r, echo=FALSE}
knitr::opts_chunk$set(results = 'hide')
```
# Data Carpentry R materials
--------------------------------------------------
* It's really important that you know what you did. More journals, granting agencies, and others increasingly want to know what you did, too.
* A lot of scientific code is NOT reproducible.
* If we keep a careful lab notebook, why are we not as careful with our code?
* We edit each other's manuscripts, but we don't edit each other's code.
* If you write your code with "future you" in mind, you will save yourself and others a lot of time.
# Very basics of R
R is a versatile, open source programming/scripting language that's useful for both statistics and data science. It was inspired by the programming language S.
* Open source software under GPL.
* Comparable, if not superior, to commercial alternatives. R has over 5,000 user-contributed packages at this time. It's widely used in both academia and industry.
* Available on all platforms.
* Not just for statistics, but also general purpose programming.
* Is object oriented and functional.
* Large and growing community of peers.
__Commenting__
Use # signs to comment. Comment liberally in your R scripts. Anything to the right of a # is ignored by R.
__Assignment operator__
`<-` is the assignment operator. It assigns the value on the right to the object on the left. At the top level, `=` usually does the same thing, but inside a function call `=` names an argument rather than assigning to an object. Learn to use `<-` for assignment; it is the common convention in R and avoids that ambiguity.
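To see where `<-` and `=` part ways, here is a minimal sketch. The helper `square()` and the name `side_length` are made up for this illustration; they are not part of the lesson materials.

```{r}
# Hypothetical helper, defined only for this illustration
square <- function(side_length) side_length^2

square(side_length = 4)   # `=` matches the argument by name; returns 16
exists("side_length")     # FALSE in a fresh session: no object was created

square(side_length <- 4)  # `<-` first creates an object, then passes it; returns 16
exists("side_length")     # TRUE: side_length now lives in your workspace
```

The second form is rarely what you want inside a call, which is one reason to keep `<-` for assignment and `=` for naming arguments.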
__Package management__
`install.packages("package-name")` will download a package from one of the CRAN mirrors assuming that a binary is available for your operating system. If you have not set a preferred CRAN mirror in your options(), then a menu will pop up asking you to choose a location.
Use `old.packages()` to list all your locally installed packages that are now out of date. `update.packages()` will update all packages in the known libraries interactively. This can take a while if you haven't done it recently. To update everything without any user intervention, use the `ask = FALSE` argument.
In RStudio, you can also do package management through Tools -> Install Packages.
Updating packages can sometimes change how existing code behaves, so if you already have a lot of code in R, don't run this now. Otherwise, let's go ahead and update our packages so things are up to date.
```{r, eval=FALSE}
update.packages(ask = FALSE)
```
## Introduction to R and RStudio
Let's start by learning about our tool.
_Point out the different windows in R._
* Console, Scripts, Environments, Plots
* Avoid using shortcuts.
* Code and workflows are more reproducible if we can document everything that we do.
* Our end goal is not just to "do stuff" but to do it in a way that anyone can easily and exactly replicate our workflow and results.
You can get output from R simply by typing in math:
```{r}
3 + 5
12/7
```
or by typing words, with the command `print`
```{r}
print("hello world")
```
We can annotate our code (take notes) by typing `#`. Everything to the right of `#` is ignored by R.
Try it with and without the `#`:
```{r}
# Print out hello world
print("hello world")
```
"hello world"
```{r, eval=FALSE}
Print out hello world
print("hello world")
```
Error: unexpected symbol in "Print out"
We can save our results to an object, if we give it a name
```{r}
a <- 60 * 60
hours <- 365 * 24
```
Now what are `a` and `hours`?
```{r}
a
hours
```
## Data types and structures
__Understanding basic data types in R__
To make the best of the R language, you'll need a strong understanding of the basic data types and data structures and how to operate on them.
This is very important to understand because these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.
Everything in R is an object.
R has 6 atomic classes (we will not discuss the raw class in this workshop).
* character
* numeric (real or decimal)
* integer
* logical
* complex

| Example | Type |
| ------- | ---- |
| "a", "swc" | character |
| 2, 15.5 | numeric |
| 2L (add an `L` at the end to denote an integer) | integer |
| TRUE, FALSE | logical |
| 1+4i | complex |

`typeof()` - what is it?
`length()` - how long is it? What about two dimensional objects?
`attributes()` - does it have any metadata?
```{r}
# Example
x <- "dataset"
typeof(x)
attributes(x)
y <- 1:10
typeof(y)
length(y)
attributes(y)
z <- c(1L, 2L, 3L)
typeof(z)
```
R has many __data structures__. These include
* atomic vector
* list
* matrix
* data frame
* factors
* tables
### Vectors
A vector is the most common and basic data structure in `R` and is pretty much the workhorse of R. Technically, vectors can be one of two types:
* atomic vectors
* lists
although the term "vector" most commonly refers to the atomic type, not to lists.
**Atomic Vectors**
A vector can be a vector of elements that are most commonly `character`, `logical`, `integer` or `numeric`.
You can create an empty vector with `vector()` (By default the mode is `logical`. You can be more explicit as shown in the examples below.) It is more common to use direct constructors such as `character()`, `numeric()`, etc.
```{r}
x <- vector()
# with a length and type
vector("character", length = 10)
character(5) ## character vector of length 5
numeric(5)
logical(5)
```
Various examples:
```{r}
x <- c(1, 2, 3)
x
length(x)
```
`x` is a numeric vector. These are the most common kind. They are numeric objects and are treated as double precision real numbers. To explicitly create integers, add an `L` at the end.
```{r}
x1 <- c(1L, 2L, 3L)
```
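To check the difference the `L` suffix makes, you can compare the two kinds of vector directly. This is a small sketch; the names `dbl_vec` and `int_vec` are chosen just for the example.

```{r}
dbl_vec <- c(1, 2, 3)        # no suffix: stored as double
int_vec <- c(1L, 2L, 3L)     # L suffix: stored as integer

typeof(dbl_vec)              # "double"
typeof(int_vec)              # "integer"
is.numeric(int_vec)          # TRUE: integers still count as numeric
dbl_vec == int_vec           # TRUE TRUE TRUE: the values compare equal
identical(dbl_vec, int_vec)  # FALSE: same values, different storage type
```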
You can also have logical vectors.
```{r}
y <- c(TRUE, TRUE, FALSE, FALSE)
```
Finally you can have character vectors:
```{r}
z <- c("Sarah", "Tracy", "Jon")
```
**Examine your vector**
```{r}
typeof(z)
length(z)
class(z)
str(z)
```
Question: Do you see a property that's common to all these vectors above?
**Add elements**
```{r}
z <- c(z, "Annette")
z
```
More examples of vectors
```{r}
x <- c(0.5, 0.7)
x <- c(TRUE, FALSE)
x <- c("a", "b", "c", "d", "e")
x <- 9:100
x <- c(1+0i, 2+4i)
```
You can also create vectors as a sequence of numbers
```{r}
series <- 1:10
seq(10)
seq(1, 10, by = 0.1)
```
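A few more ways to build sequences, shown as a short sketch. These all use base R functions; which one you reach for depends on whether you know the step size or the number of values you want.

```{r}
seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00: five evenly spaced values
seq(10, 1, by = -3)          # 10 7 4 1: a negative step counts down
seq_len(4)                   # 1 2 3 4
seq_along(c("a", "b", "c"))  # 1 2 3: one index per element
```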
`Inf` is infinity. You can have either positive or negative infinity.
```{r}
1/0
```
`NaN` means Not a number. It's an undefined value.
```{r}
0/0
```
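R has helper functions for testing these special values. The sketch below also touches on `NA` (a missing value), which is not covered above, so treat that part as a preview.

```{r}
-1/0             # -Inf
is.finite(1/0)   # FALSE
Inf - Inf        # NaN
is.nan(0/0)      # TRUE
is.na(0/0)       # TRUE: NaN also counts as missing
is.nan(NA)       # FALSE: NA is missing, but it is not NaN
```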
Each object can have __attributes__. Attributes are part of an object in R. These include:
* names
* dimnames
* dim
* class
* other attributes (contain metadata)
You can also glean other attribute-like information such as length (works on vectors and lists) or number of characters (for character strings).
```{r}
length(1:10)
nchar("Software Carpentry")
```
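Here is a small sketch of attributes in action. The vector `scores` and the `"units"` attribute are invented for the example; `names()`, `attr()`, and `attributes()` are base R functions.

```{r}
scores <- c(90, 75, 82)
attributes(scores)                       # NULL: a plain vector has no attributes

names(scores) <- c("ann", "bob", "cat")  # add a names attribute
attributes(scores)                       # now a list containing $names
scores["bob"]                            # elements can be looked up by name

attr(scores, "units") <- "points"        # attach custom metadata (made-up attribute)
attributes(scores)                       # $names and $units
```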
What happens when you mix types?
R will create a resulting vector of the one type that can represent every element. Values are __coerced__ toward the more flexible type, in the order logical, integer, numeric, complex, character.
Guess what the following do without running them first:
```{r}
xx <- c(1.7, "a")
xx <- c(TRUE, 2)
xx <- c("a", TRUE)
```
This is called implicit coercion. You can also coerce vectors explicitly using the `as.<class_name>` functions. For example:
```{r}
as.numeric(c("1", "2", "3"))   # character to numeric
as.character(1:3)              # integer to character
```
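Explicit coercion does not always succeed. As a rough sketch (the vector `mixed` is made up), values that cannot be converted become `NA`, and R normally prints a warning that NAs were introduced by coercion; the warning is suppressed here only to keep the output tidy.

```{r}
mixed <- c("1", "2", "three")
nums <- suppressWarnings(as.numeric(mixed))  # "three" cannot be converted
nums                                         # 1 2 NA

as.logical(c("TRUE", "false", "yes"))        # TRUE FALSE NA
as.integer(3.9)                              # 3: the decimal part is dropped, not rounded
```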
### Matrix
A matrix is a special kind of vector in R. It is not a separate type of object but simply an atomic vector with dimensions added to it. Matrices have rows and columns.
```{r}
m <- matrix(nrow = 2, ncol = 2)
m
dim(m)
```
Matrices are filled column-wise.
```{r}
m <- matrix(1:6, nrow = 2, ncol = 3)
```
Other ways to construct a matrix
```{r}
m <- 1:10
dim(m) <- c(2, 5)
```
This takes a vector and transforms it into a matrix with 2 rows and 5 columns.
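Because a matrix is just a vector with a `dim` attribute, you can still treat it as one. The sketch below uses a new object, `mat`, so it does not disturb `m`.

```{r}
mat <- 1:10
dim(mat) <- c(2, 5)

attributes(mat)   # $dim is 2 5: the only thing that makes this a matrix
length(mat)       # 10: still ten elements underneath
mat[1, 2]         # 3: row 1, column 2
mat[3]            # 3: the same element, by its position in the underlying vector
```

Element 3 lands in row 1, column 2 because the values fill the first column before moving on to the second.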
Another way is to bind columns or rows using `cbind()` and `rbind()`.
```{r}
x <- 1:3
y <- 10:12
cbind(x, y)
rbind(x, y)
```
You can also use the byrow argument to specify how the matrix is filled. From R's own documentation:
```{r}
mdat <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE,
               dimnames = list(c("row1", "row2"),
                               c("C.1", "C.2", "C.3")))
mdat
```
### List
In R, lists act as containers. Unlike atomic vectors, the contents of a list are not restricted to a single mode and can encompass any mixture of data types. Lists are sometimes called recursive vectors, because a list can contain other lists. This makes them fundamentally different from atomic vectors.
A list is a special type of vector. Each element can be a different type.
Create lists using `list()` or coerce other objects using `as.list()`
```{r}
x <- list(1, "a", TRUE, 1+4i)
x
x <- 1:10
x <- as.list(x)
length(x)
```
1. What is the class of `x[1]`?
2. How about `x[[1]]`?
```{r}
xlist <- list(a = "Karthik Ram", b = 1:10, data = head(iris))
xlist
```
1. What is the length of this object? What about its structure?
Lists can be extremely useful inside functions. You can “staple” together lots of different kinds of results into a single object that a function can return.
A list does not print to the console like a vector. Instead, each element of the list starts on a new line.
Elements are indexed by double brackets. Single brackets will still return a(nother) list.
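The two points above, returning a list from a function and indexing into it, can be sketched together. The function `describe()` and its element names are hypothetical, written only for this example.

```{r}
# Hypothetical helper that bundles several results into one list
describe <- function(values) {
  list(n = length(values), average = mean(values), input = values)
}

res <- describe(c(2, 4, 6))
res[["average"]]    # 4: double brackets return the element itself
res$n               # 3: $ is shorthand for [[ ]] with a name
class(res["n"])     # "list": single brackets return a smaller list
class(res[["n"]])   # "integer"
```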
### Factors
Factors are special vectors that represent categorical data. Factors can be ordered or unordered and are important for modelling functions such as `lm()` and `glm()` and also in `plot` methods.
Factors can only contain pre-defined values.
Factors are pretty much integers that have labels on them. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings. Some string methods will coerce factors to strings, while others will throw an error.
Sometimes factors can be left unordered. Example: male, female.
Other times you might want factors to be ordered (or ranked). Example: low, medium, high.
Underneath, the levels are represented by the numbers 1, 2, 3.
Factors are better than simple integer labels because they are self-describing: male and female are more descriptive than 1s and 2s. This is helpful when there is no additional metadata.
Which is male, 1 or 2? You wouldn't be able to tell with just integer data. Factors have this information built in.
Factors can be created with `factor()`. Input is generally a character vector.
```{r}
x <- factor(c("yes", "no", "no", "yes", "yes"))
x
```
`table(x)` will return a frequency table.
If you need to convert a factor to a character vector, simply use
```{r}
as.character(x)
```
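The "integers under the hood" point matters most when the labels look like numbers. In this sketch (the factor `sizes` is made up), converting straight to numeric returns the internal codes, not the values you see printed.

```{r}
sizes <- factor(c("10", "20", "10", "5"))

levels(sizes)                     # "10" "20" "5": sorted as text, not as numbers
as.integer(sizes)                 # 1 2 1 3: the internal codes
as.numeric(as.character(sizes))   # 10 20 10 5: the actual values
```

Going through `as.character()` first is a common way to recover the original numbers.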
In modeling functions, it is important to know what the baseline level is. The baseline is the first level, and by default the levels are ordered alphabetically. You can change this by specifying the levels (another option is to use the function `relevel()`).
```{r}
x <- factor(c("yes", "no", "yes"), levels = c("yes", "no"))
x
```
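Ordered factors and `relevel()` were mentioned above without an example, so here is a minimal sketch. The objects `sat` and `answers` are invented for the illustration.

```{r}
# An ordered factor: the levels have a meaningful ranking
sat <- factor(c("low", "high", "medium", "low"),
              levels = c("low", "medium", "high"), ordered = TRUE)
sat                  # Levels: low < medium < high
sat < "high"         # TRUE FALSE TRUE TRUE: comparisons respect the ranking

# Changing the baseline level of an unordered factor
answers <- factor(c("yes", "no", "yes"))
levels(answers)                        # "no" "yes": alphabetical by default
levels(relevel(answers, ref = "yes"))  # "yes" "no": yes is now the baseline
```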
### Data frame
A data frame is a very important data type in R. It's pretty much the de facto data structure for most tabular data and what we use for statistics.
Data frames can have additional attributes such as `rownames()`, which can be useful for annotating data, like subject_id or sample_id. But most of the time they are not used.
Some additional information on data frames:
* Usually created by `read.csv()` and `read.table()`.
* Can convert to matrix with `data.matrix()`
* Coercion will be forced and not always what you expect.
* Can also create with `data.frame()` function.
* Find the number of rows and columns with `nrow(dat)` and `ncol(dat)`, respectively.
* Rownames are usually 1..n.
## __Combining data frames__
```{r}
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
dat
```
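Data frames can be combined by rows with `rbind()` or by columns with `cbind()`. The sketch below uses two small made-up data frames, `first` and `second`, rather than `dat`.

```{r}
first  <- data.frame(id = c("a", "b"), x = 1:2)
second <- data.frame(id = c("c", "d"), x = 3:4)

stacked <- rbind(first, second)     # add rows; the column names must match
dim(stacked)                        # 4 2

wider <- cbind(stacked, y = 11:14)  # add a column; the row counts must match
names(wider)                        # "id" "x" "y"
```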
### __Useful functions__
* `head()` - see first 6 rows
* `tail()` - see last 6 rows
* `dim()` - see dimensions
* `nrow()` - number of rows
* `ncol()` - number of columns
* `str()` - structure of each column
* `names()` - will list the names attribute for a data frame (or any object really), which gives the column names.
* A data frame is a special type of list where every element of the list has the same length.
See that it is actually a special list:
```{r}
is.list(iris)
class(iris)
```
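Since a data frame is a list of equal-length columns, the list tools work on it too. A short sketch with a made-up data frame, `samples`:

```{r}
samples <- data.frame(id = c("a", "b", "c"), x = 1:3)

length(samples)          # 2: the number of columns (list elements), not rows
samples$x                # 1 2 3: extract a column like a list element
samples[["x"]]           # the same column, using double brackets
sapply(samples, length)  # every column has length 3
```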
| Dimensions | Homogeneous | Heterogeneous |
| ------- | ---- | ---- |
| 1-D | atomic vector | list |
| 2-D | matrix | data frame |
### __Indexing__
Vectors have positions. These positions are ordered, and an element can be retrieved using `name_vector[index]`:
```{r}
letters[2]
```
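An index does not have to be a single number. As a quick sketch, here are a few other common forms (the vector `vals` is made up):

```{r}
letters[c(1, 3, 5)]   # "a" "c" "e": several positions at once
letters[24:26]        # "x" "y" "z": a range of positions
letters[-(1:23)]      # "x" "y" "z": negative positions are dropped

vals <- c(5, 12, 8)
vals[vals > 6]        # 12 8: keep only the elements where the condition is TRUE
```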
### __Functions__
A function is a saved object that takes inputs and performs a task.
Functions take in information and return the desired output:
`output <- name_of_function(inputs)`
```{r}
x <- 1:10
y <- sum(x)
```
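You can also write your own functions. The example below is only a sketch; `fahr_to_celsius()` and its argument `temp_f` are names invented here, not functions that ship with R.

```{r}
# Hypothetical function: convert degrees Fahrenheit to degrees Celsius
fahr_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

fahr_to_celsius(212)         # 100
fahr_to_celsius(c(32, 50))   # 0 10: works on whole vectors, too
```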
### __Help__
All functions come with a help screen.
It is critical that you learn to read the help screens, since they provide important information on what the function does and how it works, and usually include examples at the very bottom.
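A few ways to reach the help screens from the console, shown as a sketch using `sum` as the example function. The chunk is not evaluated when knitting, because the help calls open a viewer.

```{r, eval=FALSE}
?sum                                # open the help page for sum()
help("sum")                         # the same thing, written as a function call
read_hits <- apropos("^read\\.csv") # find objects whose names start with read.csv
read_hits                           # includes "read.csv"
```

`apropos()` is handy when you remember part of a function's name but not all of it.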
### __Install new functions__
To install a new package, use `install.packages()`, for example `install.packages('ggplot2')`.
You can't ever learn all of R, but you can learn how to build a program and how to find help
to do the things that you want to do. Let's get hands-on.