---
title: "Intro To dplyr"
author: "Steve Pittard"
date: "`r Sys.Date()`"
bibliography:
- book.bib
- packages.bib
description: An Intro to data manipulation with dplyr
documentclass: book
link-citations: yes
site: bookdown::bookdown_site
output: bookdown::gitbook
biblio-style: apalike
---
# R Data Structures
There are a number of data structures in R such as **vectors, lists** and **matrices**. The **vector** structure winds up being helpful in understanding how to work with data frames.
## Vectors
A vector is a container for a series of related data of the same type: height measurements of students, whether a group of people smoke or not, their blood pressure. The only rule here is that a vector can contain only one data type at a time.
```{r}
names <- c("P1","P2","P3","P4","P5")
temp <- c(98.2,101.3,97.2,100.2,98.5)
pulse <- c(66,72,83,85,90)
gender <- c("M","F","M","M","F")
```
Accessing elements, or ranges of elements, within a vector involves the "bracket" notation:
```{r}
# Get the first element of temp
temp[1]
# Get elements 3,4, and 5 from pulse
pulse[3:5]
```
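A couple of other indexing tricks are worth knowing. This sketch shows negative indices and named elements (the labels here are just the hypothetical patient IDs from above):

```{r}
temp <- c(98.2, 101.3, 97.2, 100.2, 98.5)
ids  <- c("P1", "P2", "P3", "P4", "P5")
# A negative index drops elements: everything except the first
temp[-1]
# Vectors can carry names, which then work as indices
names(temp) <- ids
temp["P3"]
```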
Base R likes to use the composite function style. Not surprising since the language was written by statisticians. Create some vectors:
```{r}
# create a sequence of numbers
# looks like f(x)
seq(0,16,4)
# looks like f(g(x))
str(seq(0,16,4))
# create some numbers from a Normal distributions
rnorm(10)
# Looks like f(g(x))
hist(rnorm(100000))
# You kind of know what you want to do before you type it
# What about this ? (the %>% pipe comes from the magrittr package)
library(magrittr)
rnorm(100000) %>% hist()
```
We can also use logical expressions to find elements that satisfy some condition. This is a very powerful capability in R.
```{r}
temp < 98
```
Whoa. What was that ? Well we get back a T/F logical vector that tells us what elements satisfy the specified condition. We can then use this info to get the elements of interest.
```{r}
temp[temp < 98]
```
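As a related sketch, the **which()** function returns the positions of the matches rather than the values themselves, and multiple conditions can be combined with & (and) and | (or):

```{r}
temp  <- c(98.2, 101.3, 97.2, 100.2, 98.5)
pulse <- c(66, 72, 83, 85, 90)
# Positions of the matches, not the values
which(temp < 98)
# Conditions combine element-wise with & and |
temp[temp < 98 | pulse > 84]
```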
Working with individual vectors is fine but a more general way of working with them is in data frames, which provide more flexibility:
```{r}
(my_df <- data.frame(names,temp,pulse,gender))
```
Looking at each column, we see that they are the vectors we were just working with. If we need to access them from the data frame it's easy.
```{r}
# Get the temp column
my_df$temp
# What's the mean of the temp column
mean(my_df$temp)
```
## Data Frames
![](./figures/dframe.png){width=300}
But we are getting ahead of ourselves. Just know that the **premier data structure** in R is the **data.frame**. This structure can be described as follows:
- A data frame is a special type of list that contains data in a format that allows for easier manipulation, reshaping, and open-ended analysis
- Data frames are tightly coupled collections of variables. It is one of the more important constructs you will encounter when using R so learn all you can about it
- A data frame is an analogue to the Excel spreadsheet but is much more flexible for storing, manipulating, and analyzing data
- Data frames can be constructed from existing vectors, lists, or matrices. Many times they are created by reading in comma-delimited (CSV) files using the read.table or read.csv command
Once you become accustomed to working with data frames, R becomes much easier to use. In fact, it could well be argued that until you wrap your head around the data frame concept, you cannot be productive in R. This is mostly true, in my experience.
R comes with a variety of built-in data sets that are very useful for getting used to data sets and how to manipulate them.
```{r eval=FALSE}
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different diets
DNase Elisa assay of DNase
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
Indometh Pharmacokinetics of Indomethacin
InsectSprays Effectiveness of Insect Sprays
JohnsonJohnson Quarterly Earnings per Johnson & Johnson Share
LakeHuron Level of Lake Huron 1875-1972
LifeCycleSavings Intercountry Life-Cycle Savings Data
Loblolly Growth of Loblolly pine trees
Nile Flow of the River Nile
Orange Growth of Orange Trees
OrchardSprays Potency of Orchard Sprays
PlantGrowth Results from an Experiment on Plant Growth
Puromycin Reaction Velocity of an Enzymatic Reaction
Theoph Pharmacokinetics of Theophylline
```
## A Reference Data Frame
We will use a well-known data frame, at least in R circles, called **mtcars** which is part of any default installation of R. It is a simple data set relating to, well, automobiles. This data frame has the distinction of being the most (ab)used data frame in R education.
```{r eval=FALSE}
The data was extracted from the 1974 Motor Trend US
magazine, and comprises fuel consumption and 10 aspects
of automobile design and performance for 32 automobiles
(1973–74 models).
A data frame with 32 observations on 11 (numeric)
variables.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
```
## Relation to dplyr
What you will discover is that the **dplyr** package, which itself is part of the much larger **tidyverse** package set, extends the idea of the basic R data frame in a way that some feel is superior. That depends on your point of view, though the **tidyverse** has a lot of what I call a philosophic consistency in it which makes it **very** useful once you get some concepts in mind.
While you could start exclusively with **dplyr** and the **tidyverse** the world is still full of older code. Plus, many of the advantages of **dplyr** only become quite apparent when compared to the "older way" of doing things. So my recommendation is to know how to deal with data frames in base R while also spending time to learn the **dplyr** way of doing things.
```{r include=FALSE}
# automatically create a bib database for R packages
knitr::write_bib(c(
.packages(), 'bookdown', 'knitr', 'rmarkdown'
), 'packages.bib')
```
<!--chapter:end:index.Rmd-->
# Digging In
Data frames look like an Excel Spreadsheet. The rows are observations and the columns are variables or "features" that represent some measurement or character-based description of a given observation. When viewed from the row point of view, the data can be heterogeneous. When viewed as a column, the data is homogeneous.
```{r}
data(mtcars)
mtcars
```
We can do things with this data such as making plots or creating models:
```{r cars}
plot(mpg ~ wt, data=mtcars)
```
Let's create a regression model. It doesn't take long to realize that most functions in R will use a data frame as input. This means that you will spend a lot of time working with data frames to get them into shape for use with modeling and visualization tools. In fact you will spend most of your time **importing, transforming, and cleaning**.
```{r}
(mylm <- lm(mpg ~ ., data = mtcars))
```
There are some useful functions that help you understand the structure of a data frame. One of the most important is the **str()** function, which is shorthand for **structure**.
## Structure
```{r}
str(mtcars)
```
This gives you some idea about the number of rows and columns of the data frame along with a description of the variable types and their values. I use this function frequently. Other functions that will help you include the following.
## Meta Information
```{r}
# how many rows
nrow(mtcars)
# how many columns
ncol(mtcars)
# Column names
names(mtcars)
```
## Printing
Some data frames, such as mtcars, don't have many rows but others might have hundreds, thousands or even more than that ! Imagine trying to view one of those data frames. It is for this reason that the **head()** and **tail()** functions exist.
```{r}
head(mtcars,5) # First 5 rows
tail(mtcars,3) # Last 3 rows
```
## Accessing Rows And Columns
There are various ways to select, remove, or exclude rows and columns of a data frame. We use the **bracket** notation to do this. This is very powerful. Keep in mind that data frames have rows and columns so it would make sense that you need a way to specify what rows and columns you want to access.
```{r}
mtcars[1,] # First row, all columns
```
```{r}
mtcars[1:3,] # First three rows, all columns
```
```{r}
# All rows, and first 4 columns
mtcars[,1:4]
```
```{r}
# Rows 1-4 and columns 1,2 and 8-10
mtcars[1:4,c(1:2,8:10)]
```
```{r}
# Rows 1-4 and columns by name
mtcars[1:4,c("mpg","wt","drat")]
```
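One base R quirk worth knowing: selecting a single column with the bracket notation simplifies the result to a plain vector unless you ask otherwise. A quick sketch:

```{r}
data(mtcars)
# A single column selected with [rows, col] drops to a vector by default
mtcars[1:4, "mpg"]
# drop = FALSE keeps the result as a one-column data frame
mtcars[1:4, "mpg", drop = FALSE]
```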
## Interrogation
Many times you will wish to find rows that satisfy certain conditions. For example, what rows have an mpg > 11 and a wt < 2.0 ? We use the bracket notation to help us. We can pass logical conditions into the brackets. Note the following:
```{r}
mtcars$mpg > 11 & mtcars$wt < 2.0
```
There are 32 elements in this logical vector each with a value of either TRUE or FALSE. When passed into the row index of the bracket notation, it will print that row if the corresponding value is TRUE. If FALSE, the row will not be printed.
```{r}
mtcars[mtcars$mpg > 11 & mtcars$wt < 2.0,]
```
What if we just want to know how many cars satisfy this condition ?
```{r}
nrow(mtcars[mtcars$mpg > 11 & mtcars$wt < 2.0,])
```
Find all rows that correspond to cars with 4 cylinders
```{r}
mtcars[mtcars$cyl == 4,]
```
We can even use other R functions in the bracket notation. Extract all rows whose MPG value exceeds the mean MPG for the entire data frame.
```{r}
mtcars[mtcars$mpg > mean(mtcars$mpg),]
```
Now find the cars for which the MPG exceeds the 75% percentile value for MPG
```{r}
mtcars[mtcars$mpg > quantile(mtcars$mpg)[4],]
```
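As an aside, base R also has the **subset()** function, which reads a bit more like a sentence than bracket interrogation and avoids repeating mtcars$ for each condition. A minimal sketch:

```{r}
data(mtcars)
# Same interrogation as the bracket version, plus a column selection
subset(mtcars, mpg > 11 & wt < 2.0, select = c(mpg, wt))
```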
## Missing values
This is a big deal. Most "real" data has rows that do not contain values for all columns. This is the so-called "missing value" problem. Here is an example. The following code will read in a version of the mtcars data frame that has some missing values:
```{r}
url <- "https://raw.githubusercontent.com/steviep42/utilities/master/data/mtcars_na.csv"
(mtcars_na <- read.csv(url, stringsAsFactors = FALSE))
```
If you look, you can see the missing values "NA" present in certain columns. This is R's way of indicating what is missing. There are functions that can help you find these. This is important because, for example, if you wanted to find the average value of a column, say the **wt** column, there would be a problem as it contains a missing value:
```{r}
mean(mtcars_na$wt)
```
We have to tell the function to remove missing values from consideration.
```{r}
mean(mtcars_na$wt, na.rm=TRUE)
```
A more general approach would involve the following functions.
```{r}
complete.cases(mtcars_na)
```
```{r}
# How many rows in the df do not contain any NAs ?
sum(complete.cases(mtcars_na))
# How many rows in the df do contain at least one NA ?
sum(!complete.cases(mtcars_na))
```
How would we find those rows and print them ?
```{r}
mtcars_na[complete.cases(mtcars_na),]
```
And here are the ones that do contain missing values:
```{r}
mtcars_na[!complete.cases(mtcars_na),]
```
One quick way to omit rows with missing values is:
```{r}
na.omit(mtcars_na)
```
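A quick way to see where the gaps are is to count the NAs per column with **is.na()**. Here is a sketch using a small made-up data frame:

```{r}
# A tiny data frame with deliberate gaps (the values are made up)
d <- data.frame(wt = c(2.6, NA, 3.2), mpg = c(21.0, 22.8, NA))
# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
colSums(is.na(d))
# na.rm = TRUE averages only the observed values
mean(d$wt, na.rm = TRUE)
```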
## Continuous vs Factors
One **recipe** that I use frequently is given below. This tells me how many unique values are assumed by each column which then helps to identify continuous quantities and categories. If a column assumes only a small number of unique values then perhaps it should be classified as a factor. Don't let the code here scare you. If you are new to R and don't yet understand what is going on then just use this as a "recipe" for now.
```{r}
sapply(mtcars, function(x) length(unique(x)))
```
So it looks to me, for example, that **cyl, vs, am, gear, and carb** are actually categories rather than measured quantities. If you look at the help page for mtcars you will see that **am** is a 0 or 1 which corresponds to, respectively, a car with an automatic transmission (0) or a manual transmission (1). If you use the **summary** function it will treat this variable as a numeric, continuous quantity.
Is it actually possible to have a transmission value of 0.4062 ?
```{r}
summary(mtcars$am)
```
I might then use some code to transform these columns into factors so that when they are used with various modeling functions they will be recognized as such.
Let's turn **am** into a factor
```{r}
mtcars$am <- factor(mtcars$am,
levels = c(0,1),
labels = c("Auto","Man") )
```
Now the summary will make more sense. This is also useful because graphics packages such as ggplot2 will know how to handle factors.
```{r}
summary(mtcars$am)
```
```{r echo=FALSE}
suppressMessages(library(tidyverse))
```
```{r}
ggplot(mtcars,aes(x=wt,y=mpg)) +
geom_point() +
facet_wrap(~am)
```
## Sorting data
Sorting rows in a data frame is a common activity. However, in Base R this is called "ordering" because of the function used to "order" the data. Let's say we want to sort or "order" the mtcars data frame such that the row with the lowest mpg value is listed first and the row with the highest mpg value is listed last. First, look at the **order** function's output. What are those numbers ?
```{r}
order(mtcars$mpg)
```
Oh, so they are row numbers corresponding to rows in mtcars. Row 15 has the car with the lowest mpg. Row 16 corresponds to the car with the next lowest mpg and so on. So we can use this information to order our dataframe accordingly:
```{r}
mtcars[order(mtcars$mpg),]
```
To invert the sense of the order, use the **rev** function. We'll also use the head function to list only the first few rows of the result. Note that in base R, using composite functions is welcomed, although you will find that this is not the preferred style in the tidyverse. For math people, using a composite function is natural which, in large part, is why R embraced that approach early on.
```{r}
head(mtcars[rev(order(mtcars$mpg)),])
```
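As an alternative sketch, **order()** accepts a decreasing argument directly, which avoids wrapping the call in rev():

```{r}
data(mtcars)
# Sort from highest to lowest mpg without rev()
head(mtcars[order(mtcars$mpg, decreasing = TRUE), ])
```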
## Reading CSV Files
Many times data will be read in from a comma delimited file exported from Excel. These are known as Comma Separated Value files - generally abbreviated as **CSV**. The file can be read from a local drive or even from the Web as long as you know the URL associated with the file. In this example, there is a file on the Internet relating to some testing data involving students and various subjects.
```{r}
url <- "https://raw.githubusercontent.com/pittardsp/bios545r_spring_2018/master/SUPPORT/hsb2.csv"
data1 <- read.csv(url,header=T,sep=",")
head(data1)
```
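The **read.csv** function has useful arguments beyond header and sep. For example, na.strings controls which tokens are treated as missing on the way in. A sketch using a small inline CSV (the data here is made up):

```{r}
# A two-column CSV with one empty field and one literal "NA"
csv_text <- "id,score\n1,88\n2,\n3,NA"
# Treat both the empty string and "NA" as missing values
d <- read.csv(textConnection(csv_text), na.strings = c("", "NA"))
d
sum(is.na(d$score))
```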
<!--chapter:end:01-intro.Rmd-->
# The tidyverse
![](./figures/tidyverse.png){width=350}
The **dplyr** package is part of the larger **tidyverse** package set which has expanded considerably in recent years and continues to grow in size and utility such that many people never learn the "older way" of doing things in R. But we've already been through that in the previous section. The tidyverse has the following packages. The descriptions have been lifted from the [tidyverse home page](https://www.tidyverse.org/packages/).
**ggplot2** - ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
**dplyr** - dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.
**tidyr** - tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.
**readr** - readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
**tibble** - tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code.
**stringr** - stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.
**lubridate** - Date-time data can be frustrating to work with in R. R commands for date-times are generally unintuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight savings times, and other time related quirks. Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not.
## Installing
You will probably use a number of functions from several of these packages so it's best to go ahead and install the entire **tidyverse** in one go. To install it, do one of the following:
1) At the R Console from within RStudio, type:
```{r eval=FALSE}
install.packages("tidyverse")
```
2) Use the Tools -> Install Packages menu item in RStudio:
![](./figures/inst.png){width=400}
After you have installed the package you may load it by doing:
```{r}
suppressMessages(library(tidyverse))
```
Note that the **cheatsheet** for **dplyr** can be [found here](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)
![](./figures/cheat.png)
## dplyr Basics
**dplyr** is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. In fact, if you were paying attention during the opening section on data frames, you will have noticed that most of the activities we performed relate to the following. In dplyr-speak, these are the **verbs** that help us get work done.
- **mutate()** - adds new variables that are functions of existing variables
- **select()** - picks variables based on their names
- **filter()** - picks cases based on their values
- **summarise()** - reduces multiple values down to a single summary
- **arrange()** - changes the ordering of the rows
## First Steps
Note that this material references ["Becoming a data ninja with dplyr"](https://speakerdeck.com/dpastoor/becoming-a-data-ninja-with-dplyr) as well as [this dplyr tutorial](http://genomicsclass.github.io/book/pages/dplyr_tutorial.html)
We'll go back to the basics here by using a very small data frame which will make it clear how the various **dplyr** verbs actually work:
```{r}
df <- data.frame(id = 1:5,
gender = c("MALE","MALE","FEMALE","MALE","FEMALE"),
age = c(70,76,60,64,68))
```
![](./figures/df1.png){width=400}
### filter()
The **filter()** function allows us to sift through the data frame to find rows that satisfy some logical condition. (With the older approach we would be using the bracket notation). The following example allows us to find only the observations relating to a declared gender of **female**.
```{r}
filter(df,gender == "FEMALE")
# Given this data frame, the following is equivalent
filter(df, gender != "MALE")
```
![](./figures/df2.png)
So, now find only the **ids** that relate to rows 1,3, or 5. This is a highly specialized search but it is helpful to show that you can use a wide variety of logical constructs.
```{r}
filter(df, id %in% c(1,3,5))
```
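dplyr also ships helper functions that work inside **filter()**. For example, **between()** is an inclusive range check. A sketch using the same small data frame:

```{r}
suppressMessages(library(dplyr))
df <- data.frame(id = 1:5,
                 gender = c("MALE", "MALE", "FEMALE", "MALE", "FEMALE"),
                 age = c(70, 76, 60, 64, 68))
# Keep rows where age is between 64 and 70, endpoints included
filter(df, between(age, 64, 70))
```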
![](./figures/df3.png)
### mutate()
Mutate is used to add new columns to a data frame or modify existing ones. Let's create a new column in the data frame that contains the mean value of the age column.
```{r}
mutate(df,meanage = mean(age))
```
![](./figures/meanage.png)
Next we will create a new column designed to tell us if a given observation has an age that is greater than or equal to the average age. Specifically, create a variable called **old_young** and assign a value of "Y" if the observed age for that row is above the mean age and a value of "N" if it is not.
```{r}
mutate(df, old_young = ifelse(age >= mean(age), "Y", "N"))
```
One way we could use something like this is in making a plot where the observations exhibiting an age value above the mean are plotted in a certain color and those below the mean are in another color.
```{r}
tmp <- mutate(df, color = ifelse(age > mean(age),"red","blue"))
plot(tmp$age,col=tmp$color, type="p",
pch=19,main="Ages",ylab="Age")
grid()
abline(h=mean(tmp$age),lty=2)
legend("topright",
c("Above Avg","Below Avg"),col=c("red","blue"),pch=19)
```
### arrange()
Use arrange for sorting the data frame by one or more columns. When using the basic data frame structure from R we had to use the **order()** function to help us generate a vector that has the row numbers of the data frame that correspond to the desired order of display (lowest to highest, etc).
Let's sort the data frame **df** by age from oldest to youngest.
First we'll use the older approach. While this works, it is not very intuitive.
```{r}
df[rev(order(df$age)),]
```
**dplyr** makes this process simpler - at least in my opinion
```{r}
arrange(df,desc(age))
```
Next, let's sort **df** by gender (alphabetically) and then by age
from oldest to youngest. The rows relating to a gender of **female** are going to be listed first because, alphabetically speaking, the letter "F" comes before the letter "M". Then within those categories we have the ages sorted from oldest to youngest.
```{r}
arrange(df, gender,desc(age))
```
If we used the older approach it would look like the following. Ugh !
```{r}
df[order(df$gender,-df$age),]
```
### select()
The select() function allows us to select one or more columns from a data frame.
```{r}
# Reorder the columns
select(df,gender,id,age)
```
```{r}
# Select all but the age column
select(df,-age)
```
```{r}
# Can use the ":" operator to select a range
select(df,id:age)
```
The select() function also provides the ability to select by "regular expressions" or numeric patterns:
```{r}
# Select all columns that start with an "a"
select(df,starts_with("a"))
```
```{r}
names(mtcars)
# Get only columns that start with "c"
select(mtcars,starts_with("c"))
```
This example is more realistic in that data frames can have a large number of columns named according to some convention. For example, the measurements on a patient might not be labelled specifically - they might have a common prefix such as "m_" followed by some sequential number (or not).
```{r}
testdf <- expand.grid(m_1=seq(60,70,10),
age=c(25,32),
m_2=seq(50,60,10),
m_3=seq(60,70,10))
```
```{r}
testdf
```
Find all the columns that include a "_" character
```{r}
select(testdf,matches("_"))
```
This will select columns beginning with "m_" but only those with a suffix of 1 or 2.
```{r}
select(testdf,num_range("m_",1:2))
```
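The other selection helpers follow the same pattern. For instance, **contains()** matches a fixed string anywhere in a column name while **ends_with()** matches a suffix. A sketch using the same testdf:

```{r}
suppressMessages(library(dplyr))
testdf <- expand.grid(m_1 = seq(60, 70, 10), age = c(25, 32),
                      m_2 = seq(50, 60, 10), m_3 = seq(60, 70, 10))
# contains() matches a literal substring anywhere in the name
names(select(testdf, contains("_")))
# ends_with() matches only a suffix, so just m_2 here
names(select(testdf, ends_with("2")))
```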
### group_by()
The **group_by()** function lets you organize a data frame by some factor or grouping variable. This is a very powerful function that is typically used in conjunction with a function called **summarize**. Here is what it looks like by itself. It's somewhat underwhelming. It does seem to create a table of some kind but it doesn't do much else.
```{r}
df
# Hmm. the following doesn't do anything - or so it seems
group_by(df)
```
So as mentioned, the **group_by** function is usually paired with the **summarize** function. What this does is first group the data frame by the **gender** column and then **count** the number of occurrences therein. So this is a form of aggregation.
```{r}
summarize(group_by(df,gender),total=n())
```
![](./figures/grp.png)
Let's group the data frame by gender and then compute the average age for each group.
```{r}
summarize(group_by(df,gender),av_age=mean(age))
```
![](./figures/grpa.png)
Let's group by gender and then compute the total number of observations in each gender group and then compute the mean age.
```{r}
summarize(group_by(df,gender),av_age=mean(age),total=n())
```
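There is no limit to how many statistics you can compute in one **summarize()** call. A sketch with a few more per group:

```{r}
suppressMessages(library(dplyr))
df <- data.frame(id = 1:5,
                 gender = c("MALE", "MALE", "FEMALE", "MALE", "FEMALE"),
                 age = c(70, 76, 60, 64, 68))
# One row per gender group, with several statistics at once
summarize(group_by(df, gender),
          av_age = mean(age), oldest = max(age),
          youngest = min(age), total = n())
```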
# Split Apply Combine
This pattern of using **group_by()** followed by **summarize()**
is called **Split Apply Combine**. The idea is that we
1) Split up the data frame by gender group
2) Then for each group, apply the average function
3) Then combine the average results for each group
```{r}
summarize(group_by(df,gender),av_age=mean(age))
```
![](./figures/morpheus.png){width=600}
## What Are Pipes ?
Before moving forward, let us consider the "pipe" operator that is loaded with **dplyr** (it actually comes from the **magrittr** package). It makes it possible to "pipe" the results of one command into another command and so on.
The inspiration for this comes from the UNIX/LINUX operating system where pipes are used all the time. So in effect using "pipes" is nothing new in the world of research computation.
![](./figures/unix_pipe.png)
**Warning:** Once you get used to pipes it is hard to go back to not using them.
Let’s use the mtcars data frame to illustrate the basics of the piping mechanism as used by dplyr. Here we will select the **mpg** and **am** columns from mtcars and view the top rows.
```{r}
head(select(mtcars, mpg, am))
```
Okay, how would we do this using pipes ? Whoa ! Note that each command is "its own thing" independently of the pipe character.
So the:
- output of mtcars goes into the
- input of the **select** function whose output goes into the
- input of the **head** function
```{r}
mtcars %>% select(mpg, am) %>% head
```
Break this down:
```{r}
mtcars %>% select(mpg, am)
```
The key to understanding how this works is to **read** this from left to right. It bears repeating that each command is "its own thing" independently of the pipe character. So the:
- output of mtcars goes into the
- input of the **select** function whose output goes into the
- input of the **head** function
Let's use our newfound knowledge to re-imagine our use of the **group_by** and **summarize** functions that we have been using in **composite** form up until now.
## Using Pipes To Do Split-Apply-Combine
```{r}
df %>% group_by(gender) %>% summarize(avg=mean(age))
# Same as the following but the pipes don't require you to "commit"
# With the following, you have to know in advance what you want to do
summarize(group_by(df,gender), avg=mean(age))
```
This approach allows us to build a "pipeline" containing commands. We don't have to commit to a specific sequence of functions. This enables a free-form type of exploration.
```{r}
df %>%
group_by(gender) %>%
summarize(avg=mean(age),total=n())
```
What is the median age of all males ?
```{r}
df %>%
filter(gender == "MALE") %>%
summarize(med_age=median(age))
```
![](./figures/chrt.png)
### Saving Results
It should be observed that if you want to save the results of some sequence of commands, you will need to use the "<-" operator. Using the previous example, we could do the following to save our result.
```{r}
results <- df %>%
filter(gender == "MALE") %>%
summarize(med_age=median(age))
```
## An Example
Using the built in mtcars dataframe, do the following:
1) **filter** for records where the wt is greater than 3.3 (wt is measured in units of 1,000 lbs, so 3,300 lbs).
2) Then use the **mutate** function to create a column called ab_be (Y or N) that indicates whether that observation’s mpg is greater (or not) than the average mpg for the filtered set.
3) Then present the average mpg for each group.
This is easy using pipes and dplyr verbs.
```{r}
mtcars %>% filter(wt > 3.3) %>%
mutate(ab_be=ifelse(mpg > mean(mpg),"Y","N")) %>%
group_by(ab_be) %>%
summarize(mean_mpg=mean(mpg))
```
This could be then "piped" into the input of the **ggplot** command to plot a corresponding bar chart. If you don't yet know ggplot then it's okay as this will nudge you in that direction. Both **ggplot** and **dplyr** are part of the **tidyverse** which means that the two packages "talk" to each other well.
```{r}
mtcars %>% filter(wt > 3.3) %>%
mutate(ab_be=ifelse(mpg > mean(mpg),"Y","N") ) %>%
group_by(ab_be) %>% summarize(mean_mpg=mean(mpg)) %>%
ggplot(aes(x=ab_be,y=mean_mpg)) +
geom_bar(stat="identity") +
ggtitle("Mean MPG") + labs(x = "ab_be", y = "Mean MPG")
```
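As an aside, newer versions of ggplot2 provide **geom_col**, which is shorthand for geom_bar(stat="identity"), so the bar chart above could also be written as:

```{r}
# geom_col is equivalent to geom_bar(stat = "identity")
mtcars %>% filter(wt > 3.3) %>%
   mutate(ab_be = ifelse(mpg > mean(mpg), "Y", "N")) %>%
   group_by(ab_be) %>% summarize(mean_mpg = mean(mpg)) %>%
   ggplot(aes(x = ab_be, y = mean_mpg)) +
   geom_col() +
   ggtitle("Mean MPG") + labs(x = "ab_be", y = "Mean MPG")
```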
## Working With Flowers
Let's work with the built-in **iris** data frame to explore the world of flowers. This famous (Fisher's or Anderson's) **iris** data set gives the measurements, in centimeters, of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
I'll use dplyr idioms here as opposed to the typical Base R approach, which might involve the use of composite functions. First, load up the iris data:
```{r}
data(iris)
```
### Structure of The Data frame
It's always helpful to look at what types of data you have in a data frame. As you already know, base R has a useful function called **str**. The tidyverse equivalent is **glimpse**, although the two commands basically provide the same types of information. Personally, I still prefer the "old" **str** function, if only because I've been using it for so long.
```{r}
str(iris)
```
```{r}
glimpse(iris)
```
### More Practice
1) Get all the rows where the Species is "setosa" and Sepal.Length is > 4.7
   but < 5.0. Use the pipes feature to help you with this.
```{r}
iris %>%
filter(Sepal.Length > 4.7 & Sepal.Length < 5.0 & Species=="setosa" )
```
2) Select out only the columns relating to Sepal measurements. List only the top five rows:
```{r}
iris %>% select(c(Sepal.Length,Sepal.Width)) %>% head(5)
# Or use a helper function to process strings
iris %>% select(starts_with("Sepal")) %>% head(5)
# Or if you know which column numbers you want
iris %>% select(c(1:2)) %>% head(5)
```
3) Sort the data frame by Sepal.Width such that the row with the largest Sepal.Width is listed first. Print only the first 5 rows
```{r}
iris %>% arrange(desc(Sepal.Width)) %>% head(5)
```
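If your version of dplyr is recent (1.0 or later), **slice_max** offers a more direct way to grab the rows with the largest values of a variable:

```{r}
# slice_max (dplyr >= 1.0) returns the n rows with the largest values.
# Note that by default it keeps ties, so it can return more than n rows.
iris %>% slice_max(Sepal.Width, n = 5)
```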
4) How many observations are there for each Species group?
```{r}
iris %>% group_by(Species) %>% count()
# Or
iris %>% group_by(Species) %>% summarize(total=n())
```
5) Select all columns that do NOT relate to Length. Limit output to 5 rows
```{r}
iris %>% select(-ends_with("Length")) %>% head(5)
```
6) Select all columns that do NOT relate to Length or Species. Limit output to 5 rows
```{r}
iris %>% select(-c(ends_with("Length"),"Species")) %>% head(5)
# Or
iris %>% select(-ends_with("Length")) %>% select(-"Species") %>% head(5)
```
7) Select all columns that do NOT relate to Length or Species, but only for observations where Sepal.Width is > 3.9. There are multiple ways to attack this problem.
```{r}
iris %>% filter(Sepal.Width > 3.9) %>%
select(-ends_with("Length")) %>%
select(-"Species")
# Or
iris %>% select(-ends_with("Length")) %>%
select(-"Species") %>%
filter(Sepal.Width > 3.9)
```
8) Determine the mean, standard deviation, max, and min for Sepal.Length
```{r}
iris %>% summarize(mean=mean(Sepal.Length),
sd=sd(Sepal.Length),
max=max(Sepal.Length),
min=min(Sepal.Length))
```
9) For each Species type, determine the mean, standard deviation, max, and min for Sepal.Length
```{r}
iris %>% group_by(Species) %>% summarize(mean=mean(Sepal.Length),
sd=sd(Sepal.Length),
max=max(Sepal.Length),
min=min(Sepal.Length))
```
# Merging Data Frames
This section describes the process of joining data frames, which is an important skill to have. R has a command called **merge** which can handle some of these tasks, but if you look in the help pages for this command, it will in turn make references to "join operations", which is a more general way to describe merging activity. In reality, joining tables is quite common in SQL (Structured Query Language), so developing facility with joins will pay dividends in the future. Let's start with a very basic example. The data is simple.
```{r}
inventory <- data.frame(part_num=c("001","002","003"),
description=c("Indispensable Widget",
"Flux Capacitor",
"Radiator"),
price=c(20,25,15),stringsAsFactors = FALSE)
sales <- data.frame(part_num=c("001","001","001","003","110"),
quantity_sold=c(23,100,44,98,98),
sales_regions=c("east","west","north","north","south"), stringsAsFactors = FALSE)
```
## Using keys
So the first data frame represents an inventory that lists part numbers, their descriptions, and their prices. Basically, one line for each part.
```{r}
inventory
```
The second data frame, sales, represents the sales of various parts within one or more regions. Notice that part number 002 was not sold at all, that part number 001 was sold in three different regions, and that part number 110 was sold but does not appear in the inventory.
```{r}
sales
```
What if we wanted to *merge* or *join* these two data frames in various ways? Let's explore this. The **part_num** column in each data frame appears to relate to the same thing, so this will be our **key** by which to merge the two data sources.
To visualize what various joins look like here is a diagram from [R for Data Science](https://r4ds.had.co.nz/)
![](./figures/joining.png)
## full_join()
We will start with a **full_join** which will seek to involve all rows in both data frames based on a matching **key** which, in this case, is the **part_num** column.
Let's use the two data frames, along with a join command, to create a single table that contains a sales report for **all** part numbers even if there were no sales of that part number or it is not in the current inventory.
```{r}
full_join(inventory,sales)
```
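By default, **full_join** guesses the key from the columns the two data frames have in common and prints a message telling you what it joined by. You can silence the message, and make your intent explicit, by naming the key yourself. Base R's **merge** with all = TRUE produces an equivalent result:

```{r}
# Explicitly naming the key column
full_join(inventory, sales, by = "part_num")
# The base R equivalent of a full join
merge(inventory, sales, by = "part_num", all = TRUE)
```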