forked from steviep42/nursing_741
-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-literature.Rmd
1145 lines (741 loc) · 41.4 KB
/
02-literature.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# The tidyverse
![](./figures/tidyverse.png){width=350}
The **dplyr** package is part of the larger **tidyverse** package set which has expanded considerably in recent years and continues to grow in size and utility such that many people never learn the "older way" of doing things in R. But we've already been through that in the previous section. The tidyverse has the following packages. The descriptions have been lifted from the [tidyverse home page](https://www.tidyverse.org/packages/).
**ggplot2** - ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
**dplyr** - dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.
**tidyr** - tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.
**readr** - readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
**tibble** - tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code.
**stringr** - stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.
**lubdriate** - Date-time data can be frustrating to work with in R. R commands for date-times are generally unintuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight savings times, and other time related quirks. Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not.
## Installing
You will probably use a number of functions from several of these packages so it's best to go ahead and install the entire **tidyverse** in one go. To install it, do one of the following:
1) At the R Console from within RStudio, type:
```{r eval=FALSE}
install.packages("tidyverse")
```
2) Use the Tools -> Install Packages menu item in RStudio:
![](./figures/inst.png){width=400}
After you have installed the package you may load it by doing:
```{r}
suppressMessages(library(tidyverse))
```
Note that the **cheatsheet** for **dplyr** can be [found here](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)
![](./figures/cheat.png)
## dplyr Basics
**dplyr** is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. In fact if you were paying attention during the opening section on data frames you will have noticed that most of the activities we performed related to the following activities. In dplyr-speak there are the **verbs** that help us get work done.
mutate() - adds new variables that are functions of existing variables
select() - picks variables based on their names.
filter() - picks cases based on their values.
summarise() - reduces multiple values down to a single summary.
arrange() - changes the ordering of the rows.
## First Steps
Note that this material references ["Becoming a data ninja with dplyr"](https: //speakerdeck.com/dpastoor/becoming-a-data-ninja-with-dplyr) as well as [this dplyr tutorial]( http://genomicsclass.github.io/book/pages/dplyr_tutorial.html)
We'll go back to the basics here by using a very small data frame which will make it clear how the various **dplyr** verbs actually work:
```{r}
df <- data.frame(id = 1:5,
gender = c("MALE","MALE","FEMALE","MALE","FEMALE"),
age = c(70,76,60,64,68))
```
![](./figures/df1.png){width=400}
### filter()
The **filter()** function allows us to sift through the data frame to find rows that satisfy some logical condition. (With the older approach we would be using the bracket notation). The following example allows is to find only the observations relating to a declared gender of **female**.
```{r}
filter(df,gender == "FEMALE")
# Given this data frame, the following is equivalent
filter(df, gender != "MALE")
```
![](./figures/df2.png)
So, now find only the **ids** that relate to rows 1,3, or 5. This is a highly specialized search but it is helpful to show that you can use a wide variety of logical constructs.
```{r}
filter(df, id %in% c(1,3,5))
```
![](./figures/df3.png)
### mutate()
Mutate is used to add or remove columns in a data frame. Let's create a new column in the data frame that contains the mean value of the age column.
```{r}
mutate(df,meanage = mean(age))
```
![](./figures/meanage.png)
Next we will create a new column designed to tell us if a given observation has an age that is greater than or equal to the average age. Specifically, create a variable called **old_young** and assign a value of "Y" if the observed age for that row is above the mean age and a value of "N" if it is not.
```{r}
mutate(df,old_young=ifelse(df$age>=mean(df$age),"Y","N"))
```
One way we could use something like this is in making a plot where the observations exhibiting an age value above the mean are plotted in a certain color and those below the mean are in another color.
```{r}
tmp <- mutate(df, color = ifelse(age > mean(age),"red","blue"))
plot(tmp$age,col=tmp$color, type="p",
pch=19,main="Ages",ylab="Age")
grid()
abline(h=mean(tmp$age),lty=2)
legend("topright",
c("Above Avg","Below Avg"),col=c("red","blue"),pch=19)
```
### arrange()
Use arrange for sorting the data frame by one or more columns. When using the basic data frame structure from R we had to use the **order()** function to help us generate a vector that has the row numbers of the data frame that correspond to the desired order of display (lowest to highest, etc).
Let's sort the data frame **dff** by age from oldest to youngest.
First we'll use the older approach. While this will work, it is not exactly very intuitive.
```{r}
df[rev(order(df$age)),]
```
**dplyr** makes this process more simple - at least in my opinion
```{r}
arrange(df,desc(age))
```
Next, let's sort **df** by gender (alphabetically) and then by age
from oldest to youngest. The rows relating to a gender of **female** are going to be listed first because, alphabetically speaking, the letter "F" comes before the letter "M". Then within those categories we have the ages sorted from oldest to youngest.
```{r}
arrange(df, gender,desc(age))
```
If we used the older approach it would look like the following. Ugh !
```{r}
df[order(df$gender,-df$age),]
```
### select()
The select() functions allows us to select one or more columns from a data frame.
```{r}
# Reorder the columns
select(df,gender,id,age)
```
```{r}
# Select all but the age column
select(df,-age)
```
```{r}
# Can use the ":" operator to select a range
select(df,id:age)
```
The select() function provides the ability to select by "regular expressions"" or numeric patterns:
```{r}
# Select all columns that start with an "a"
select(df,starts_with("a"))
```
```{r}
names(mtcars)
# Get only columns that start with "c"
select(mtcars,starts_with("c"))
```
This example is more realistic in that data frames can have a large number of columns named according to some convention. For example, the measurements on a patient might not be labelled specifically - they might have a common prefix such as "m_" followed by some sequential number (or not).
```{r}
testdf <- expand.grid(m_1=seq(60,70,10),
age=c(25,32),
m_2=seq(50,60,10),
m_3=seq(60,70,10))
```
```{r}
testdf
```
Find all the columns that include a "_" character
```{r}
select(testdf,matches("_"))
```
This will select columns beginning with "m_" but only those with a suffix of 1 or 2.
```{r}
select(testdf,num_range("m_",1:2))
```
### group_by()
The **group_by()** function let’s you organize a data frame by some factor or grouping variable. This a very powerful function that is typically used in conjunction with a function called **summarize**. Here is what it looks like by itself. It's somewhat underwhelming. It does seem to create table of some kind but it doesn't do much else.
```{r}
df
# Hmm. the following doesn't do anything - or so it seems
group_by(df)
```
So as mentioned, the **group_by** function is usually paired with the **summarize** function. Ah. so what this does is to first group the data frame by the **gender** column and then it **counts** the number of occurrences therein. So this is a form of aggregation.
```{r}
summarize(group_by(df,gender),total=n())
```
![](./figures/grp.png)
Let's group the data frame by gender and then compute the average age for each group.
```{r}
summarize(group_by(df,gender),av_age=mean(age))
```
![](./figures/grpa.png)
Let's group by gender and then compute the total number of observations in each gender group and then compute the mean age.
```{r}
summarize(group_by(df,gender),av_age=mean(age),total=n())
```
# Split Apply Combine
This pattern of using **group_by()** followed by **summarize()**
is called **Split Apply Combine**. The idea is that we
1) Split up the data frame by gender group
2) Then for each group, apply the average function
3) Then combine the average results for each group
```{r}
summarize(group_by(df,gender),av_age=mean(age))
```
![](./figures/morpheus.png){width=600}
## What Are Pipes ?
Before moving forward let us consider the "pipe" operator that is included with the dplyr - well actually **magrittr** package. This is used to make it possible to "pipe" the results of one command into another command and so on.
The inspiration for this comes from the UNIX/LINUX operating system where pipes are used all the time. So in effect using "pipes" is nothing new in the world of research computation.
![](./figures/unix_pipe.png)
**Warning:** Once you get used to pipes it is hard to go back to not using them.
Let’s use the mtcars data frame to illustrate the basics of the piping mechanism as used by dplyr. Here we will select the **mpg** and **am** column from mtcars and view the top 5 rows.
```{r}
head(select(mtcars, mpg, am))
```
Okay, how would we do this using pipes ? Whoa ! Note that each command is "it's own thing" independently of the pipe character.
So the:
- output of mtcars goes into the
- input of the **select** function whose output goes into the
- input of the **head** function
```{r}
mtcars %>% select(mpg, am) %>% head
```
Break this down:
```{r}
mtcars %>% select(mpg, am)
```
The key to understanding how this works is to **read** this from left to right. It bears repeating that each command is "it's own thing" independently of the pipe character. So the:
- output of mtcars goes into the
- input of the **select** function whose output goes into the
- input of the **head** function
Let's use our new found knowledge to re-imagine our use of the **group_by** and **summarize** functions that we have been using in **composite** form up until now.
## Using Pipes To Do Split-Apply-Combine
```{r}
df %>% group_by(gender) %>% summarize(avg=mean(age))
# Same as the following but the pipes don't require you to "commit"
# With the following, you have to know in advance what you want to do
summarize(group_by(df,gender), avg=mean(age))
```
This approach allows us to build a "pipeline" containing commands. We don't have to commit to a specific sequence of functions. This enables a free-form type of exploration.
```{r}
df %>%
group_by(gender) %>%
summarize(avg=mean(age),total=n())
```
What is the median age of all males ?
```{r}
df %>%
filter(gender == "MALE") %>%
summarize(med_age=median(age))
```
![](./figures/chrt.png)
### Saving Results
It should be observed that if you want to save the results of some sequence of commands that you will need to use the "<-" operator. Using the previous example we could the following to save our result.
```{r}
results <- df %>%
filter(gender == "MALE") %>%
summarize(med_age=median(age))
```
## An Example
Using the built in mtcars dataframe, do the following:
1) **filter** for records where the wt is greater than 3.3 tons.
2) Then, using the **mutate** function to create a column called
ab_be (Y or N) that indicates whether that observation’s
mpg is greater (or not) than the average mpg for the filtered set.
3) Then present the average mpg for each group.
This is easy using pipes and dplyr verbs.
```{r}
mtcars %>% filter(wt > 3.3) %>%
mutate(ab_be=ifelse(mpg > mean(mpg),"Y","N")) %>%
group_by(ab_be) %>%
summarize(mean_mpg=mean(mpg))
```
This could be then "piped" into the input of the **ggplot** command to plot a corresponding bar chart. If you don't yet know ggplot then it's okay as this will nudge you in that direction. Both **ggplot** and **dplyr** are part of the **tidyverse** which means that the two packages "talk" to each other well.
```{r}
mtcars %>% filter(wt > 3.3) %>%
mutate(ab_be=ifelse(mpg > mean(mpg),"Y","N") ) %>%
group_by(ab_be) %>% summarize(mean_mpg=mean(mpg)) %>%
ggplot(aes(x=ab_be,y=mean_mpg)) +
geom_bar(stat="identity") +
ggtitle("Mean MPG") + labs(x = "ab_be", y = "Mean MPG")
```
## Working With Flowers
Let's work with the built in **iris** data frame to explore the world of flowers. This is famous (Fisher's or Anderson's) **iris** data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
I'll use dyplr idioms here as opposed to the typical Base R approach which might involve use of composite functions. First, load up the iris data
```{r}
data(iris)
```
### Structure of The Data frame
It's always helpful to look at what types of data you have in a data frame. As you already known, base R has a function called **str** which is useful. The tidyverse equivalent is **glimpse** although the two commands basically provide the same types of information. Personally, I still prefer the "old" **str** function if only because I've been using it for so long.
```{r}
str(iris)
```
```{r}
glimpse(iris)
```
### More Practice
1) Get all the rows where the Species is "setosa" Sepal.Length is > 4.7
but < 5.0. Use the pipes feature to help you with this.
```{r}
iris %>%
filter(Sepal.Length > 4.7 & Sepal.Length < 5.0 & Species=="setosa" )
```
2) Select out only the columns relating Sepal measurements. List only the top five rows:
```{r}
iris %>% select(c(Sepal.Length,Sepal.Width)) %>% head(5)
# Or use a helper function to process strings
iris %>% select(starts_with("Sepal")) %>% head(5)
# Or if you know which column numbers you want
iris %>% select(c(1:2)) %>% head(5)
```
3) Sort the data frame by Sepal.Width such that the row with the largest Sepal.Width is listed first. Print only the first 5 rows
```{r}
iris %>% arrange(desc(Sepal.Width)) %>% head(5)
```
4) How many observations are there for each Species group ?
```{r}
iris %>% group_by(Species) %>% count()
# Or
iris %>% group_by(Species) %>% summarize(total=n())
```
5) Select all columns that do NOT relate to Length. Limit output to 5 rows
```{r}
iris %>% select(-ends_with("Length")) %>% head()
```
6) Select all columns that do NOT relate to Length or Species. Limit output to 5 rows
```{r}
iris %>% select(-c(ends_with("Length"),"Species")) %>% head()
# Or
iris %>% select(-ends_with("Length")) %>% select(-"Species") %>% head()
```
7) Select all columns that do NOT relate to Length or Species. But only for observations where Sepal.Width is > 3.9. There are multiples way to attack this problem.
```{r}
iris %>% filter(Sepal.Width > 3.9) %>%
select(-ends_with("Length")) %>%
select(-"Species")
# Or
iris %>% select(-ends_with("Length")) %>%
select(-"Species") %>%
filter(Sepal.Width > 3.9)
```
8) Determine the mean, standard deviation, max, and min for Sepal.Length
```{r}
iris %>% summarize(mean=mean(Sepal.Length),
sd=sd(Sepal.Length),
max=max(Sepal.Length),
min=min(Sepal.Length))
```
9) For each Species type, determine the mean, standard deviation, max, and min for Sepal.Length
```{r}
iris %>% group_by(Species) %>% summarize(mean=mean(Sepal.Length),
sd=sd(Sepal.Length),
max=max(Sepal.Length),
min=min(Sepal.Length))
```
# Merging Data Frames
This section describes the process of joiningg data frames which is an important skill to have. R has a command called **merge** which can handle some of these tasks but if you look in the help pages for this command it will in turn make references to "join operations" which is a more general way to describe mering activity. In reality joining tables is quite common in SQL (Structured Query Language) so developing facility will pay dividends in the future. Let's start with a very basic example. The data is simple.
```{r}
inventory <- data.frame(part_num=c("001","002","003"),
description=c("Indispensable Widget",
"Flux Capacitor",
"Radiator"),
price=c(20,25,15),stringsAsFactors = FALSE)
sales <- data.frame(part_num=c("001","001","001","003","110"),
quantity_sold=c(23,100,44,98,98),
sales_regions=c("east","west","north","north","south"), stringsAsFactors = FALSE)
```
## Using keys
So the first data frame represents and inventory that lists part numbers, their descriptions, and current quantity in inventory. Basically, one line for each part.
```{r}
inventory
```
The second data frame, sales, represents the sales of various parts within one or more regions. Notice that no quantity of part 2 was sold at all. Notice also that 3 units of part number 1 were sold in three regions.
```{r}
sales
```
What if we wanted *merge* or *join* these two data frames in various ways. Let's explore these. So the **part_num** column in each data frame appears to relate to the same thing so this will be our **key** by which to merge the two data sources.
To visualize what various joins look like here is a diagram from [R for Data Science](https://r4ds.had.co.nz/)
![](./figures/joining.png)
## full_join()
We will start with a **full_join** which will seek to involve all rows in both data frames based on a matching **key** which, in this case, is the **part_num** column.
Let's use the two data frames, along with a join command, to create a single table that contains a sales report for **all** part numbers even if there were no sales of that part number or it is not in the current inventory.
```{r}
full_join(inventory,sales)
```
This table includes all matched records (using **part_num** as a **key**) from both the data frames. The operation will supply NA values for missing matches on either side. What does "matched records" mean ?
![](./figures/inventorymatch1.png){width=550}
![](./figures/salesmatch1.png)
Notice we see part number 002 even though there were no sales. We also see a row for part 110 even though it doesn't appear in the inventory. Notice that there are missing values.
The reason we get a value of NA for **description** and **price** in the 6th row is that part number 110 does not exist within the **inventory** data frame. So, to reasonably include the information in the fully joined data, the NAs must be provided as placeholders for the missing values.
Also notice that there are missing values for the quantity and sales_regions realtive to part number 002 since it doesn't appear
within the sales data frame.
## inner_join()
Next, let's create a joined table that shows **only** information for parts listed in inventory and only those parts that had at least one transaction. In effect we want only rows that are in common (based on the **part_num** key) to both data frames.
The **inner_join** will produce a data frame that includes rows pertaining to part numbers *in common* to both data frames. An inner join matches pairs of observations whenever their keys are equal.
```{r}
inner_join(inventory,sales)
```
This would mean then that the row in the sales data frame that refers to a part number of 110 would be omitted in the result since it does not occur in the inventory data frame. Part number 002 does not appear in the sales data frame so it is not listed here.
## left_join()
Next up we have the **left_join** which will "favor" the first data frame referenced in the command in that it will include rows from inventory AND rows from **sales** where the **part_num** matches. So part number 002 is still included even though it does not appear in the **sales** data frame.
```{r}
left_join(inventory,sales)
```
## right_join()
The right_join() will favor the second specified data frame which it will return all the rows from sales and rows from sales where the part number matches. This is why we get a line in the result for part 110 that includes missing values for description and price since those values are missing the inventory data frame.
```{r}
right_join(inventory,sales)
```
There are other join types but these cover the main ones.
# Your Turn
Now it's time for you to try some exercises on your own. The **ggplot2** package (which is part of the larger **tidyverse** comes with some data sets one of which is called **msleep**. It's easy to load into your work space:
```{r}
data(msleep)
```
This data set is
> an updated and expanded version of the mammals sleep dataset. Updated sleep times and weights were taken from V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, 104 (3):1051-1056, 2007
![](./figures/vore.png){width=600}
You can always get the names of the columns in a data frame using **str** or **names**
```{r}
names(msleep)
#
str(msleep)
```
Not only is this a new data frame for you it also has the added challenge of containing some missing values - those pesky "NAs". In the older approach we used the function **complete.cases** to help us figure out which rows had at least one missing value. We could still use that here although the **dplyr** world has a function called **drop_na()** which could be used but itself or as part of a pipeline.
Here are some questions I want you to answer. Note that the answers are provided so you can check your work. However, the code used to generate the answer is not visible. There is typically more than one way to answer the question.
#### What is the average total sleep time for Omnivores ?
```{r echo=FALSE}
msleep %>% filter(vore=="omni") %>% summarize(mean=mean(sleep_total))
# Or
msleep %>% group_by(vore) %>%
summarize(mean=mean(sleep_total)) %>%
filter(vore=="omni")
```
#### Group the msleep data frame by taxanomic order and then summarize the mean sleep total for each group. Make sure the resulting table is arranged in descending order of the average - from highest sleep_total average to the lowest
```{r echo=FALSE}
msleep %>%
group_by(order) %>%
summarize(avg=mean(sleep_total)) %>%
arrange(desc(avg))
```
#### What is the average total sleep time for all vore types ?
```{r echo=FALSE}
msleep %>% group_by(vore) %>%
summarize(mean=mean(sleep_total))
```
#### Omitting any rows that contain missing values, what is the average total sleep time for all vore types ? Remember that you will need to use the **drop_na** function to help you.
```{r echo=FALSE}
msleep %>%
drop_na() %>%
group_by(vore) %>%
summarize(mean=mean(sleep_total))
```
#### Remove all rows that contain a missing value and save the result into a new data table called **msleep_na**. How many rows does msleep_na contain ?
```{r echo=FALSE}
msleep_na <- msleep %>% drop_na()
msleep_na %>% nrow()
```
#### Using msleep_na, find the average braintwt for all vores by order. Note that the group_by() function can take multiple arguments.
```{r echo=FALSE}
msleep_na %>%
group_by(order,vore) %>%
summarize(mean=mean(brainwt))
```
# Chicago Crime Data
Here we look at some data obtained from the City of Chicago Open Data Portal <https://data.cityofchicago.org/>. This is data that represents phone calls to the police during the year of 2012. Just pick a folder on your hard drive and download the data.
## Getting the Data
```{r eval=FALSE}
url <- "https://raw.githubusercontent.com/steviep42/nursing_741/master/chi_crimes.csv"
download.file(url,"chi_crimes.csv")
```
Next, read the file in to a local data frame:
```{r}
chi <- read_csv("chi_crimes.csv")
```
## Inspecting the Data
The way the data is organized is that each row represents a call to the police which may or may not result in an actual arrest. In any case there will be a case number assigned to the call and other information such as the Best, Ward, Block, and Community Area will be noted. If there is an arrest it will be recorded and then it will be categorized according to an FBI code. The location of the call will be recorded in latitude and longitude. Some basic questions might be how many rows and columns there are, etc
```{r}
glimpse(chi)
```
So let's check this data frame out some more. For example, what are the factors in this data frame ? Those that take on only a few different values might be good summary variables.
The way to do this with Base R would be:
```{r}
sapply(chi,function(x) length(unique(x)))
```
The way to do this with dplyr could be:
```{r}
chi %>% summarize_all(n_distinct)
```
From viewing this perhaps variables like Arrest, Domestic, FBI Code, District, and Ward could be good summary variables. We don't really have any kind of measured data here.
We do have lots of dates which at this point are actually character strings so let's turn them into dates and then work with them to get some more summary variables created. There are also some latitude and longitude information that we perhaps later use to draw some maps if we want. We'll consider that later.
## Creating Some Dates
```{r}
suppressMessages(library(lubridate))
# Right now the dates and times are just a character string
str(chi$Date)
# Note that the dates have a PM or AM string. Let's exploit
# that information to create a category that captures this information.
chi <- chi %>% mutate(am_pm=ifelse(grepl("PM",Date),"PM","AM"))
# Let's turn them into actual dates and times using the lubridate package
chi <- chi %>%
mutate(Date = parse_date_time(chi$Date,'%m/%d/%Y %I:%M:%S %p'))
range(chi$Date)
```
It's worth noting that we can use some functions from lubridate to further group the dates and times so we can do some more specific analysis. [Here](http://www.noamross.net/blog/2014/2/10/using-times-and-dates-in-r---presentation-code.html) is link to a tutorial on lubridate. Anyway try some of these:
```{r}
# List the first 5 records from the chi data table
chi$Date %>% head
# This will tell us what day of the week the given date and time represents
chi$Date %>% weekdays %>% head
# This will tell us what month the date and time falls into
chi$Date %>% months %>% head
# Let's add a factor to the data table/data frame that gives the month
chi <- chi %>% mutate(Month=months(Date))
chi <- chi %>% mutate(Month=factor(Month, levels = month.name))
chi %>% distinct(Month)
```
Let's add in a few more factors based on the date since it will help us build some interesting tables as we try to understand the patterns in Chicago crime during the year of 2012. It is also important to understand how to work with dates since they wind up being important when looking at data. Not every data set will be a function of dates or be reliant upon them for interpretation but this one is.
It's also not necessary to create these categories in advance or make them part of the data frame. We could do this on the fly as we will see later. We will create a category to classify the quarter during which the call to the police occurred.
```{r}
chi <- chi %>% mutate(quarter=quarter(Date))
chi <- chi %>% mutate(weekdays=weekdays(Date))
chi %>% select(Month, am_pm, weekdays, quarter) %>% head()
```
## Count Arrests
Now we can start to ask some basic questions to better understand the data. When we have lots of categories in our data for which we could make some count-based summaries which gives us some frequencies. For example, let's focus on the "Arrest" variable to figure out how many Arrests happened. What about the FBI codes ?
```{r}
# This gives us a table of Arrests vs non-Arrests
chi %>% count(Arrest)
# And we could see the count of reported crimes according to FBI code
chi %>% count(`FBI Code`) %>% arrange(desc(n))
```
Relative to the FBI codes I don't know what the code actually mean but I found a description of them at [this URL](http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html)
So it seems to me that the top 5 areas, at least by FBI codes, are:
* Larceny
* Simple Battery
* Vandalism
* Drug Abuse
* Misc Non-Index Offense
We could do two-way table of course. This helps us look at relationships between categories. Here we look at how Arrests relate to whether it's day or night. When do most arrests occur ?
```{r}
chi %>%
count(Arrest,am_pm) %>%
filter(Arrest==TRUE) %>%
arrange(desc(n))
```
So let's plot a bar chart that shows is the total number of calls to the police for each month and what portion of those calls each month resulted in an arrest.
```{r}
chi %>% count(Arrest,Month) %>%
ggplot(aes(x=Month,y=n,fill=Arrest)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=45)) +
ylab("Calls To Police") +
ggtitle("Chicago Arrests per Month in 2013")
```
We could also put the bars next to each other. Notice here that I use the **group_by()** function with **summarize()** whereas before I used the **count()** function. The two approaches are equivalent.
```{r}
chi %>% group_by(Arrest,Month) %>%
summarize(total=n()) %>%
ggplot(aes(x=Month,y=total,fill=Arrest)) +
geom_bar(stat="identity",position=position_dodge()) +
theme(axis.text.x=element_text(angle=45)) +
ggtitle("Chicago Arrests per Month in 2013")
```
## Crime At Day vs Night
In terms of number of calls to Police let's see how many calls there were per month and then break that down by night vs day. You can see then that calling at night is more frequent. Intuitively this makes some sense.
```{r}
chi %>% group_by(am_pm,Month) %>%
summarize(total=n()) %>%
ggplot(aes(x=Month,y=total,fill=am_pm)) +
geom_bar(stat="identity",position=position_dodge()) +
theme(axis.text.x=element_text(angle=45)) +
ggtitle("Calls to Police per Month in 2013")
```
## Line Graphs and Histograms
Let's try something else here. The lubridate package has a function that we can use to get the numeric day of the year. For example January 5, 2012 would be the 5th day of the year. So we could add up the number of calls to the police for each day of the year and maybe plot that as a line chart. Notice here how we create the **ydays** category on the fly. We don't need to create it as part of the data frame.
```{r}
chi %>% mutate(ydays = yday(Date)) %>%
group_by(ydays) %>% summarize(total_calls=n()) %>%
ggplot(aes(x=ydays,y=total_calls)) + geom_line() +
ggtitle("Calls per Each Day of 2012") +
xlab("Day of the year 1 - 365") +
ylab("Total Calls to the Police")
```
Now if we wanted to we could split the line graph up into two different panels or "facets" as ggplot calls it. This won't take much more than what we already had before. We'll just throw in the am_pm variable.
```{r}
chi %>%
mutate(ydays = yday(Date)) %>%
group_by(am_pm,ydays) %>%
summarize(total_calls=n()) %>%
ggplot(aes(x=ydays,y=total_calls)) + geom_line() + facet_grid(am_pm~.)
```
That looks okay but what about changing the arrangement of the panels ? That's easy - we just change the formula in the **facet_grid** layer
```{r}
chi %>%
mutate(ydays = yday(Date)) %>%
group_by(am_pm,ydays) %>%
summarize(total_calls=n()) %>%
ggplot(aes(x=ydays,y=total_calls)) + geom_line() + facet_grid(.~am_pm)
```
Now we could also use a function called **hour** which given the date will extract the hour of the day that the call to the police occurred. Let's generate a histogram of calls to the police based on the hour of the day. We'll pick a bin size of 24 since this will match the number of hours
```{r}
chi %>% mutate(hod=hour(Date)) %>%
ggplot(aes(x=hod)) +
geom_histogram(bins=24,color="black",fill="white") +
ggtitle("Call Count by Hour of the Day") +
xlab("Hour of the day 0 - 23") + theme_bw()
```
## Calendar Map
The **googleVis** package has a way to create something known as a calendar heatmap. We can use dplyr functions, as we have been, to count the number of calls per day of the year and then use that to create the heatmap. Most of the work here is involved with mutating the date to a "Year Month Day" format after which we count the number of police calls that day.
```{r eval=FALSE}
calend <- chi %>%
mutate(ydays = ymd(format(chi$Date,"%Y-%m-%d"))) %>%
group_by(ydays) %>% summarize(total_calls=n()) %>%
gvisCalendar(.,datevar="ydays",numvar="total_calls",
options=list(width="1000px", height="300px"))
plot(calend)
```
![](./figures/calend.png)
## What Types of Crime Do We See ?
Next up lets look at the counts for the most frequently committed types of crimes. This will help us understand what our risks are should we consider living in Chicago. First we can create a table of all the Crime types by using the **Primary Type** variable.
```{r}
# Sort the Crime Types From Highest Count to Lowest
chi %>% count(`Primary Type`) %>% arrange(desc(n))
# Equivalent to
chi %>%
group_by(`Primary Type`) %>%
summarize(n=n()) %>%
arrange(desc(n))
```
Let's make a barplot. Most of this relates to using ggplot to plot things which can make you think that the dplyr verbs are complex when they aren't. We did have to **mutate** the **Primary Type** variable to reorder it according to the corresponding number of crimes for each type. This is what makes the bar plot organize the bars according to highest to lowest.
```{r}
chi %>% count(`Primary Type`) %>%
arrange(desc(n)) %>%
ggplot(aes(x=reorder(`Primary Type`,-n),y=n)) +
geom_bar(stat="identity",fill="#f68060", alpha=.6, width=.8) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Let's clean this plot up to show say only the top 12 crime types. Well, we don't know that these were actual crimes - only that the call was classified into a certain category.
```{r}
chi %>% group_by(`Primary Type`) %>%
summarize(n=n()) %>%
arrange(desc(n)) %>%
slice(1:20) %>%
ggplot(aes(x=reorder(`Primary Type`,-n),y=n)) +
geom_bar(stat="identity",fill="#f68060", alpha=.6, width=.8) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
So this just gives us the total number of police calls relating to a given crime type. It doesn't say very much about the total number of arrests that were made.
```{r}
chi %>% group_by(Arrest,`Primary Type`) %>%
summarize(n=n()) %>%
arrange(desc(n)) %>%
slice(1:20) %>%
ggplot(aes(x=reorder(`Primary Type`,-n),y=n,fill=Arrest)) +
geom_bar(stat="identity", width=.8) +
xlab("Crime Type") +
ylab("Total Calls") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
This is illuminating in that almost all the calls relating to NARCOTICS seem to result in an arrest. Let's examine these offenses in greater detail since they don't seem to fit the pattern. That is most of the other crime reports don't seem to result in arrest except maybe prostitution, criminal trespass, and weapons violation. We can drill down some more. Let's slice out the lower 10 of the top 20.
```{r}
chi %>% group_by(Arrest,`Primary Type`) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
slice(10:20) %>%
ggplot(aes(x=reorder(`Primary Type`,-count),
y=count,fill=Arrest)) +
geom_bar(stat="identity",width=.8) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Oh so now we see that there are several areas that tend to result in an arrest. Narcotics isn't the only one at all. This seems a little weird to me though because ALL prostitution calls result in an arrest. Let's filter out some of these crimes to see what is going on.
```{r}
chi %>% filter(grepl("PROST",Description)) %>%
group_by(Arrest) %>%
summarize(cnt=n())
#
chi %>% filter(grepl("BURGLARY",`Primary Type`)) %>%
group_by(Arrest) %>%
summarize(cnt=n())
```
More stuff to consider about NARCOTICS
```{r}
chi %>% filter(grepl("NARCOTICS",`Primary Type`)) %>%
group_by(Description) %>%
summarize(count=n()) %>%