---
title: "Intro To dplyr"
author: "Steve Pittard"
date: "`r Sys.Date()`"
bibliography:
- book.bib
- packages.bib
description: An Intro to data manipulation with dplyr
documentclass: book
link-citations: yes
site: bookdown::bookdown_site
output: bookdown::gitbook
biblio-style: apalike
---
# R Data Structures
There are a number of data structures in R such as **vectors, lists** and **matrices**. The **vector** structure winds up being helpful in understanding how to work with data frames.
## Vectors
A vector is a container for a series of related data of the same type: height measurements of students, whether a group of people smoke or not, their blood pressure. The only rule here is that a vector can contain only one data type at a time.
```{r}
names <- c("P1","P2","P3","P4","P5")
temp <- c(98.2,101.3,97.2,100.2,98.5)
pulse <- c(66,72,83,85,90)
gender <- c("M","F","M","M","F")
```
Accessing elements, or ranges of elements, within a vector involves the "bracket" notation:
```{r}
# Get the first element of temp
temp[1]
# Get elements 3,4, and 5 from pulse
pulse[3:5]
```
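A couple of other indexing tricks are worth knowing. This sketch shows negative indices and named elements (the labels here are just the hypothetical patient IDs from above):

```{r}
temp <- c(98.2, 101.3, 97.2, 100.2, 98.5)
ids  <- c("P1", "P2", "P3", "P4", "P5")
# A negative index drops elements: everything except the first
temp[-1]
# Vectors can carry names, which then work as indices
names(temp) <- ids
temp["P3"]
```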
Base R likes to use the composite function style. Not surprising since the language was written by statisticians. Create some vectors:
```{r}
# create a sequence of numbers
# looks like f(x)
seq(0,16,4)
# looks like f(g(x))
str(seq(0,16,4))
# create some numbers from a Normal distributions
rnorm(10)
# Looks like f(g(x))
hist(rnorm(100000))
# You kind of know what you want to do before you type it
# What about this ? (the %>% pipe comes from the magrittr package)
library(magrittr)
rnorm(100000) %>% hist()
```
We can also use logical expressions to find elements that satisfy some condition. This is a very powerful capability in R.
```{r}
temp < 98
```
Whoa. What was that ? Well we get back a T/F logical vector that tells us what elements satisfy the specified condition. We can then use this info to get the elements of interest.
```{r}
temp[temp < 98]
```
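As a related sketch, the **which()** function returns the positions of the matches rather than the values themselves, and multiple conditions can be combined with & (and) and | (or):

```{r}
temp  <- c(98.2, 101.3, 97.2, 100.2, 98.5)
pulse <- c(66, 72, 83, 85, 90)
# Positions of the matches, not the values
which(temp < 98)
# Conditions combine element-wise with & and |
temp[temp < 98 | pulse > 84]
```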
Working with individual vectors is fine but a more general way of working with them is in data frames, which provide more flexibility:
```{r}
(my_df <- data.frame(names,temp,pulse,gender))
```
Looking at each column, we see that they are the vectors we were just working with. If we need to access them from the data frame it's easy.
```{r}
# Get the temp column
my_df$temp
# What's the mean of the temp column
mean(my_df$temp)
```
## Data Frames
![](./figures/dframe.png){width=300}
But we are getting ahead of ourselves. Just know that the **premier data structure** in R is the **data.frame**. This structure can be described as follows:
- A data frame is a special type of list that contains data in a format that allows for easier manipulation, reshaping, and open-ended analysis
- Data frames are tightly coupled collections of variables. It is one of the more important constructs you will encounter when using R so learn all you can about it
- A data frame is an analogue to the Excel spreadsheet but is much more flexible for storing, manipulating, and analyzing data
- Data frames can be constructed from existing vectors, lists, or matrices. Many times they are created by reading in comma-delimited (CSV) files using the read.table or read.csv command
Once you become accustomed to working with data frames, R becomes much easier to use. In fact, it could well be argued that until you wrap your head around the data frame concept, you cannot be productive in R. This is mostly true, in my experience.
R comes with a variety of built-in data sets that are very useful for getting used to data sets and how to manipulate them.
```{r eval=FALSE}
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different diets
DNase Elisa assay of DNase
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
Indometh Pharmacokinetics of Indomethacin
InsectSprays Effectiveness of Insect Sprays
JohnsonJohnson Quarterly Earnings per Johnson & Johnson Share
LakeHuron Level of Lake Huron 1875-1972
LifeCycleSavings Intercountry Life-Cycle Savings Data
Loblolly Growth of Loblolly pine trees
Nile Flow of the River Nile
Orange Growth of Orange Trees
OrchardSprays Potency of Orchard Sprays
PlantGrowth Results from an Experiment on Plant Growth
Puromycin Reaction Velocity of an Enzymatic Reaction
Theoph Pharmacokinetics of Theophylline
```
## A Reference Data Frame
We will use a well-known data frame, at least in R circles, called **mtcars** which is part of any default installation of R. It is a simple data set relating to, well, automobiles. This data frame has the distinction of being the most (ab)used data frame in R education.
```{r eval=FALSE}
The data was extracted from the 1974 Motor Trend US
magazine, and comprises fuel consumption and 10 aspects
of automobile design and performance for 32 automobiles
(1973–74 models).
A data frame with 32 observations on 11 (numeric)
variables.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
```
## Relation to dplyr
What you will discover is that the **dplyr** package, which itself is part of the much larger **tidyverse** package set, extends the idea of the basic R data frame in a way that some feel is superior. That depends on your point of view, though the **tidyverse** has a lot of what I call a philosophic consistency in it which makes it **very** useful once you get some concepts in mind.
While you could start exclusively with **dplyr** and the **tidyverse** the world is still full of older code. Plus, many of the advantages of **dplyr** only become quite apparent when compared to the "older way" of doing things. So my recommendation is to know how to deal with data frames in base R while also spending time to learn the **dplyr** way of doing things.
```{r include=FALSE}
# automatically create a bib database for R packages
knitr::write_bib(c(
.packages(), 'bookdown', 'knitr', 'rmarkdown'
), 'packages.bib')
```
<!--chapter:end:index.Rmd-->
# Digging In
Data frames look like an Excel Spreadsheet. The rows are observations and the columns are variables or "features" that represent some measurement or character-based description of a given observation. When viewed from the row point of view, the data can be heterogeneous. When viewed as a column, the data is homogeneous.
```{r}
data(mtcars)
mtcars
```
We can do things with this data such as making plots or creating models:
```{r cars}
plot(mpg ~ wt, data=mtcars)
```
Let's create a regression model. It doesn't take long to realize that most functions in R will use a data frame as input. This means that you will spend a lot of time working with data frames to get them into shape for use with modeling and visualization tools. In fact you will spend most of your time **importing, transforming, and cleaning**.
```{r}
(mylm <- lm(mpg ~ ., data = mtcars))
```
There are some useful functions that help you understand the structure of a data frame. One of the most important is the **str()** function, which is shorthand for **structure**.
## Structure
```{r}
str(mtcars)
```
This gives you some idea about the number of rows and columns of the data frame along with a description of the variable types and their values. I use this function frequently. Other functions that will help you include the following.
## Meta Information
```{r}
# how many rows
nrow(mtcars)
# how many columns
ncol(mtcars)
# Column names
names(mtcars)
```
## Printing
Some data frames, such as mtcars, don't have many rows but others might have hundreds, thousands or even more than that ! Imagine trying to view one of those data frames. It is for this reason that the **head()** and **tail()** functions exist.
```{r}
head(mtcars,5) # First 5 rows
tail(mtcars,3) # Last 3 rows
```
## Accessing Rows And Columns
There are various ways to select, remove, or exclude rows and columns of a data frame. We use the **bracket** notation to do this. This is very powerful. Keep in mind that data frames have rows and columns so it would make sense that you need a way to specify what rows and columns you want to access.
```{r}
mtcars[1,] # First row, all columns
```
```{r}
mtcars[1:3,] # First three rows, all columns
```
```{r}
# All rows, and first 4 columns
mtcars[,1:4]
```
```{r}
# Rows 1-4 and columns 1,2 and 8-10
mtcars[1:4,c(1:2,8:10)]
```
```{r}
# Rows 1-4 and columns by name
mtcars[1:4,c("mpg","wt","drat")]
```
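One base R quirk worth knowing: selecting a single column with the bracket notation simplifies the result to a plain vector unless you ask otherwise. A quick sketch:

```{r}
data(mtcars)
# A single column selected with [rows, col] drops to a vector by default
mtcars[1:4, "mpg"]
# drop = FALSE keeps the result as a one-column data frame
mtcars[1:4, "mpg", drop = FALSE]
```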
## Interrogation
Many times you will wish to find rows that satisfy certain conditions. For example, what rows have an mpg > 11 and a wt < 2.0 ? We use the bracket notation to help us. We can pass logical conditions into the brackets. Note the following:
```{r}
mtcars$mpg > 11 & mtcars$wt < 2.0
```
There are 32 elements in this logical vector each with a value of either TRUE or FALSE. When passed into the row index of the bracket notation, it will print that row if the corresponding value is TRUE. If FALSE, the row will not be printed.
```{r}
mtcars[mtcars$mpg > 11 & mtcars$wt < 2.0,]
```
What if we just want to know how many cars satisfy this condition ?
```{r}
nrow(mtcars[mtcars$mpg > 11 & mtcars$wt < 2.0,])
```
Find all rows that correspond to cars with 4 cylinders
```{r}
mtcars[mtcars$cyl == 4,]
```
We can even use other R functions in the bracket notation. Extract all rows whose MPG value exceeds the mean MPG for the entire data frame.
```{r}
mtcars[mtcars$mpg > mean(mtcars$mpg),]
```
Now find the cars for which the MPG exceeds the 75% percentile value for MPG
```{r}
mtcars[mtcars$mpg > quantile(mtcars$mpg)[4],]
```
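As an aside, base R also has the **subset()** function, which reads a bit more like a sentence than bracket interrogation and avoids repeating mtcars$ for each condition. A minimal sketch:

```{r}
data(mtcars)
# Same interrogation as the bracket version, plus a column selection
subset(mtcars, mpg > 11 & wt < 2.0, select = c(mpg, wt))
```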
## Missing values
This is a big deal. Most "real" data has rows that do not contain values for all columns. This is the so-called "missing value" problem. Here is an example. The following code will read in a version of the mtcars data frame that has some missing values:
```{r}
url <- "https://raw.githubusercontent.com/steviep42/utilities/master/data/mtcars_na.csv"
(mtcars_na <- read.csv(url, stringsAsFactors = FALSE))
```
If you look, you can see the missing values "NA" present in certain columns. This is R's way of indicating what is missing. There are functions that can help you find these. This is important because, for example, if you wanted to find the average value of a column, say the **wt** column, there would be a problem as it contains a missing value:
```{r}
mean(mtcars_na$wt)
```
We have to tell the function to remove missing values from consideration.
```{r}
mean(mtcars_na$wt, na.rm=TRUE)
```
A more general approach would involve the following functions.
```{r}
complete.cases(mtcars_na)
```
```{r}
# How many rows in the df do not contain any NAs ?
sum(complete.cases(mtcars_na))
# How many rows in the df do contain at least one NA ?
sum(!complete.cases(mtcars_na))
```
How would we find those rows and print them ?
```{r}
mtcars_na[complete.cases(mtcars_na),]
```
And here are the ones that do contain missing values:
```{r}
mtcars_na[!complete.cases(mtcars_na),]
```
One quick way to omit rows with missing values is:
```{r}
na.omit(mtcars_na)
```
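A quick way to see where the gaps are is to count the NAs per column with **is.na()**. Here is a sketch using a small made-up data frame:

```{r}
# A tiny data frame with deliberate gaps (the values are made up)
d <- data.frame(wt = c(2.6, NA, 3.2), mpg = c(21.0, 22.8, NA))
# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
colSums(is.na(d))
# na.rm = TRUE averages only the observed values
mean(d$wt, na.rm = TRUE)
```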
## Continuous vs Factors
One **recipe** that I use frequently is given below. This tells me how many unique values are assumed by each column which then helps to identify continuous quantities and categories. If a column assumes only a small number of unique values then perhaps it should be classified as a factor. Don't let the code here scare you. If you are new to R and don't yet understand what is going on then just use this as a "recipe" for now.
```{r}
sapply(mtcars, function(x) length(unique(x)))
```
So it looks to me, for example, that **cyl, vs, am, gear, and carb** are actually categories rather than measured quantities. If you look at the help page for mtcars you will see that **am** is a 0 or 1 which corresponds to, respectively, a car with an automatic transmission (0) or a manual transmission (1). If you use the **summary** function it will treat this variable as a numeric, continuous quantity.
Is it actually possible to have a transmission value of 0.4062 ?
```{r}
summary(mtcars$am)
```
I might then use some code to transform these columns into factors so that when they are used with various modeling functions they will be recognized as such.
Let's turn **am** into a factor
```{r}
mtcars$am <- factor(mtcars$am,
levels = c(0,1),
labels = c("Auto","Man") )
```
Now the summary will make more sense. This is also useful because graphics packages such as ggplot2 will know how to handle factors.
```{r}
summary(mtcars$am)
```
```{r echo=FALSE}
suppressMessages(library(tidyverse))
```
```{r}
ggplot(mtcars,aes(x=wt,y=mpg)) +
geom_point() +
facet_wrap(~am)
```
## Sorting data
Sorting rows in a data frame is a common activity. However, in Base R this is called "ordering" because of the function used to "order" the data. Let's say we want to sort or "order" the mtcars data frame such that the row with the lowest mpg value is listed first and the row with the highest mpg value is listed last. First, look at the **order** function's output. What are those numbers ?
```{r}
order(mtcars$mpg)
```
Oh, so they are row numbers corresponding to rows in mtcars. Row 15 has the car with the lowest mpg. Row 16 corresponds to the car with the next lowest mpg and so on. So we can use this information to order our dataframe accordingly:
```{r}
mtcars[order(mtcars$mpg),]
```
To invert the sense of the order, use the **rev** function. We'll also use the head function to list only the first few rows of the result. Note that in base R, using composite functions is welcomed, although you will find that this is not the preferred style in the tidyverse. For math people, using a composite function is natural which, in large part, is why R embraced that approach early on.
```{r}
head(mtcars[rev(order(mtcars$mpg)),])
```
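As an alternative sketch, **order()** accepts a decreasing argument directly, which avoids wrapping the call in rev():

```{r}
data(mtcars)
# Sort from highest to lowest mpg without rev()
head(mtcars[order(mtcars$mpg, decreasing = TRUE), ])
```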
## Reading CSV Files
Many times data will be read in from a comma delimited file exported from Excel. These are known as Comma Separated Value files - generally abbreviated as **CSV**. The file can be read from a local drive or even from the Web as long as you know the URL associated with the file. In this example, there is a file on the Internet relating to some testing data involving students and various subjects.
```{r}
url <- "https://raw.githubusercontent.com/pittardsp/bios545r_spring_2018/master/SUPPORT/hsb2.csv"
data1 <- read.csv(url,header=T,sep=",")
head(data1)
```
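The **read.csv** function has useful arguments beyond header and sep. For example, na.strings controls which tokens are treated as missing on the way in. A sketch using a small inline CSV (the data here is made up):

```{r}
# A two-column CSV with one empty field and one literal "NA"
csv_text <- "id,score\n1,88\n2,\n3,NA"
# Treat both the empty string and "NA" as missing values
d <- read.csv(textConnection(csv_text), na.strings = c("", "NA"))
d
sum(is.na(d$score))
```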
<!--chapter:end:01-intro.Rmd-->
# The tidyverse
![](./figures/tidyverse.png){width=350}
The **dplyr** package is part of the larger **tidyverse** package set which has expanded considerably in recent years and continues to grow in size and utility such that many people never learn the "older way" of doing things in R. But we've already been through that in the previous section. The tidyverse has the following packages. The descriptions have been lifted from the [tidyverse home page](https://www.tidyverse.org/packages/).
**ggplot2** - ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
**dplyr** - dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.
**tidyr** - tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.
**readr** - readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
**tibble** - tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code.
**stringr** - stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.
**lubridate** - Date-time data can be frustrating to work with in R. R commands for date-times are generally unintuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight savings times, and other time related quirks. Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not.
## Installing
You will probably use a number of functions from several of these packages so it's best to go ahead and install the entire **tidyverse** in one go. To install it, do one of the following:
1) At the R Console from within RStudio, type:
```{r eval=FALSE}
install.packages("tidyverse")
```
2) Use the Tools -> Install Packages menu item in RStudio:
![](./figures/inst.png){width=400}
After you have installed the package you may load it by doing:
```{r}
suppressMessages(library(tidyverse))
```
Note that the **cheatsheet** for **dplyr** can be [found here](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)
![](./figures/cheat.png)
## dplyr Basics
**dplyr** is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. In fact, if you were paying attention during the opening section on data frames, you will have noticed that most of the activities we performed relate to the following. In dplyr-speak, these are the **verbs** that help us get work done.
- **mutate()** - adds new variables that are functions of existing variables
- **select()** - picks variables based on their names
- **filter()** - picks cases based on their values
- **summarise()** - reduces multiple values down to a single summary
- **arrange()** - changes the ordering of the rows
## First Steps
Note that this material references ["Becoming a data ninja with dplyr"](https://speakerdeck.com/dpastoor/becoming-a-data-ninja-with-dplyr) as well as [this dplyr tutorial](http://genomicsclass.github.io/book/pages/dplyr_tutorial.html)
We'll go back to the basics here by using a very small data frame which will make it clear how the various **dplyr** verbs actually work:
```{r}
df <- data.frame(id = 1:5,
gender = c("MALE","MALE","FEMALE","MALE","FEMALE"),
age = c(70,76,60,64,68))
```
![](./figures/df1.png){width=400}
### filter()
The **filter()** function allows us to sift through the data frame to find rows that satisfy some logical condition. (With the older approach we would be using the bracket notation). The following example allows us to find only the observations relating to a declared gender of **female**.
```{r}
filter(df,gender == "FEMALE")
# Given this data frame, the following is equivalent
filter(df, gender != "MALE")
```
![](./figures/df2.png)
So, now find only the **ids** that relate to rows 1,3, or 5. This is a highly specialized search but it is helpful to show that you can use a wide variety of logical constructs.
```{r}
filter(df, id %in% c(1,3,5))
```
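dplyr also ships helper functions that work inside **filter()**. For example, **between()** is an inclusive range check. A sketch using the same small data frame:

```{r}
suppressMessages(library(dplyr))
df <- data.frame(id = 1:5,
                 gender = c("MALE", "MALE", "FEMALE", "MALE", "FEMALE"),
                 age = c(70, 76, 60, 64, 68))
# Keep rows where age is between 64 and 70, endpoints included
filter(df, between(age, 64, 70))
```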
![](./figures/df3.png)
### mutate()
Mutate is used to add new columns to a data frame or modify existing ones. Let's create a new column in the data frame that contains the mean value of the age column.
```{r}
mutate(df,meanage = mean(age))
```
![](./figures/meanage.png)
Next we will create a new column designed to tell us if a given observation has an age that is greater than or equal to the average age. Specifically, create a variable called **old_young** and assign a value of "Y" if the observed age for that row is above the mean age and a value of "N" if it is not.
```{r}
mutate(df, old_young = ifelse(age >= mean(age), "Y", "N"))
```
One way we could use something like this is in making a plot where the observations exhibiting an age value above the mean are plotted in a certain color and those below the mean are in another color.
```{r}
tmp <- mutate(df, color = ifelse(age > mean(age),"red","blue"))
plot(tmp$age,col=tmp$color, type="p",
pch=19,main="Ages",ylab="Age")
grid()
abline(h=mean(tmp$age),lty=2)
legend("topright",
c("Above Avg","Below Avg"),col=c("red","blue"),pch=19)
```
### arrange()
Use arrange for sorting the data frame by one or more columns. When using the basic data frame structure from R we had to use the **order()** function to help us generate a vector that has the row numbers of the data frame that correspond to the desired order of display (lowest to highest, etc).
Let's sort the data frame **df** by age from oldest to youngest.
First we'll use the older approach. While this works, it is not very intuitive.
```{r}
df[rev(order(df$age)),]
```
**dplyr** makes this process simpler - at least in my opinion
```{r}
arrange(df,desc(age))
```
Next, let's sort **df** by gender (alphabetically) and then by age
from oldest to youngest. The rows relating to a gender of **female** are going to be listed first because, alphabetically speaking, the letter "F" comes before the letter "M". Then within those categories we have the ages sorted from oldest to youngest.
```{r}
arrange(df, gender,desc(age))
```
If we used the older approach it would look like the following. Ugh !
```{r}
df[order(df$gender,-df$age),]
```
### select()
The select() function allows us to select one or more columns from a data frame.
```{r}
# Reorder the columns
select(df,gender,id,age)
```
```{r}
# Select all but the age column
select(df,-age)
```
```{r}
# Can use the ":" operator to select a range
select(df,id:age)
```
The select() function also provides the ability to select by "regular expressions" or numeric patterns:
```{r}
# Select all columns that start with an "a"
select(df,starts_with("a"))
```
```{r}
names(mtcars)
# Get only columns that start with "c"
select(mtcars,starts_with("c"))
```
This example is more realistic in that data frames can have a large number of columns named according to some convention. For example, the measurements on a patient might not be labelled specifically - they might have a common prefix such as "m_" followed by some sequential number (or not).
```{r}
testdf <- expand.grid(m_1=seq(60,70,10),
age=c(25,32),
m_2=seq(50,60,10),
m_3=seq(60,70,10))
```
```{r}
testdf
```
Find all the columns that include a "_" character
```{r}
select(testdf,matches("_"))
```
This will select columns beginning with "m_" but only those with a suffix of 1 or 2.
```{r}
select(testdf,num_range("m_",1:2))
```
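The other selection helpers follow the same pattern. For instance, **contains()** matches a fixed string anywhere in a column name while **ends_with()** matches a suffix. A sketch using the same testdf:

```{r}
suppressMessages(library(dplyr))
testdf <- expand.grid(m_1 = seq(60, 70, 10), age = c(25, 32),
                      m_2 = seq(50, 60, 10), m_3 = seq(60, 70, 10))
# contains() matches a literal substring anywhere in the name
names(select(testdf, contains("_")))
# ends_with() matches only a suffix, so just m_2 here
names(select(testdf, ends_with("2")))
```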
### group_by()
The **group_by()** function lets you organize a data frame by some factor or grouping variable. This is a very powerful function that is typically used in conjunction with a function called **summarize**. Here is what it looks like by itself. It's somewhat underwhelming. It does seem to create a table of some kind but it doesn't do much else.
```{r}
df
# Hmm. the following doesn't do anything - or so it seems
group_by(df)
```
So as mentioned, the **group_by** function is usually paired with the **summarize** function. What this does is first group the data frame by the **gender** column and then **count** the number of occurrences therein. So this is a form of aggregation.
```{r}
summarize(group_by(df,gender),total=n())
```
![](./figures/grp.png)
Let's group the data frame by gender and then compute the average age for each group.
```{r}
summarize(group_by(df,gender),av_age=mean(age))
```
![](./figures/grpa.png)
Let's group by gender and then compute the total number of observations in each gender group and then compute the mean age.
```{r}
summarize(group_by(df,gender),av_age=mean(age),total=n())
```
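There is no limit to how many statistics you can compute in one **summarize()** call. A sketch with a few more per group:

```{r}
suppressMessages(library(dplyr))
df <- data.frame(id = 1:5,
                 gender = c("MALE", "MALE", "FEMALE", "MALE", "FEMALE"),
                 age = c(70, 76, 60, 64, 68))
# One row per gender group, with several statistics at once
summarize(group_by(df, gender),
          av_age = mean(age), oldest = max(age),
          youngest = min(age), total = n())
```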
# Split Apply Combine
This pattern of using **group_by()** followed by **summarize()**
is called **Split Apply Combine**. The idea is that we
1) Split up the data frame by gender group
2) Then for each group, apply the average function
3) Then combine the average results for each group
```{r}
summarize(group_by(df,gender),av_age=mean(age))
```
![](./figures/morpheus.png){width=600}
## What Are Pipes ?
Before moving forward, let us consider the "pipe" operator that is loaded with **dplyr** (it actually comes from the **magrittr** package). It makes it possible to "pipe" the results of one command into another command and so on.
The inspiration for this comes from the UNIX/LINUX operating system where pipes are used all the time. So in effect using "pipes" is nothing new in the world of research computation.
![](./figures/unix_pipe.png)
**Warning:** Once you get used to pipes it is hard to go back to not using them.
Let’s use the mtcars data frame to illustrate the basics of the piping mechanism as used by dplyr. Here we will select the **mpg** and **am** columns from mtcars and view the top rows.
```{r}
head(select(mtcars, mpg, am))
```
Okay, how would we do this using pipes ? Whoa ! Note that each command is "its own thing" independently of the pipe character.
So the:
- output of mtcars goes into the
- input of the **select** function whose output goes into the
- input of the **head** function
```{r}
mtcars %>% select(mpg, am) %>% head
```
Break this down:
```{r}
mtcars %>% select(mpg, am)
```
The key to understanding how this works is to **read** this from left to right. It bears repeating that each command is "its own thing" independently of the pipe character. So the:
- output of mtcars goes into the
- input of the **select** function whose output goes into the
- input of the **head** function
Let's use our newfound knowledge to re-imagine our use of the **group_by** and **summarize** functions that we have been using in **composite** form up until now.
## Using Pipes To Do Split-Apply-Combine
```{r}
df %>% group_by(gender) %>% summarize(avg=mean(age))
# Same as the following but the pipes don't require you to "commit"
# With the following, you have to know in advance what you want to do
summarize(group_by(df,gender), avg=mean(age))
```
This approach allows us to build a "pipeline" containing commands. We don't have to commit to a specific sequence of functions. This enables a free-form type of exploration.
```{r}
df %>%
group_by(gender) %>%
summarize(avg=mean(age),total=n())
```
What is the median age of all males ?
```{r}
df %>%
filter(gender == "MALE") %>%
summarize(med_age=median(age))
```
![](./figures/chrt.png)
### Saving Results
It should be observed that if you want to save the results of some sequence of commands, you will need to use the "<-" operator. Using the previous example, we could do the following to save our result.
```{r}
results <- df %>%
filter(gender == "MALE") %>%
summarize(med_age=median(age))
```
## An Example
Using the built in mtcars dataframe, do the following:
1) **filter** for records where the wt is greater than 3.3 (wt is measured in units of 1,000 lbs, so 3,300 lbs).
2) Then use the **mutate** function to create a column called ab_be (Y or N) that indicates whether that observation’s mpg is greater (or not) than the average mpg for the filtered set.
3) Then present the average mpg for each group.
This is easy using pipes and dplyr verbs.
```{r}
mtcars %>% filter(wt > 3.3) %>%
mutate(ab_be=ifelse(mpg > mean(mpg),"Y","N")) %>%
group_by(ab_be) %>%
summarize(mean_mpg=mean(mpg))
```
This could be then "piped" into the input of the **ggplot** command to plot a corresponding bar chart. If you don't yet know ggplot then it's okay as this will nudge you in that direction. Both **ggplot** and **dplyr** are part of the **tidyverse** which means that the two packages "talk" to each other well.
```{r}
mtcars %>% filter(wt > 3.3) %>%
mutate(ab_be=ifelse(mpg > mean(mpg),"Y","N") ) %>%
group_by(ab_be) %>% summarize(mean_mpg=mean(mpg)) %>%
ggplot(aes(x=ab_be,y=mean_mpg)) +
geom_bar(stat="identity") +
ggtitle("Mean MPG") + labs(x = "ab_be", y = "Mean MPG")
```
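As an aside, newer versions of ggplot2 provide **geom_col**, which is shorthand for geom_bar(stat="identity"), so the bar chart above could also be written as:

```{r}
# geom_col is equivalent to geom_bar(stat = "identity")
mtcars %>% filter(wt > 3.3) %>%
   mutate(ab_be = ifelse(mpg > mean(mpg), "Y", "N")) %>%
   group_by(ab_be) %>% summarize(mean_mpg = mean(mpg)) %>%
   ggplot(aes(x = ab_be, y = mean_mpg)) +
   geom_col() +
   ggtitle("Mean MPG") + labs(x = "ab_be", y = "Mean MPG")
```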
## Working With Flowers
Let's work with the built-in **iris** data frame to explore the world of flowers. This famous (Fisher's or Anderson's) **iris** data set gives the measurements, in centimeters, of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
I'll use dplyr idioms here as opposed to the typical Base R approach, which might involve the use of composite functions. First, load up the iris data:
```{r}
data(iris)
```
### Structure of The Data frame
It's always helpful to look at what types of data you have in a data frame. As you already know, base R has a useful function called **str**. The tidyverse equivalent is **glimpse**, although the two commands basically provide the same types of information. Personally, I still prefer the "old" **str** function, if only because I've been using it for so long.
```{r}
str(iris)
```
```{r}
glimpse(iris)
```
### More Practice
1) Get all the rows where the Species is "setosa" and Sepal.Length is > 4.7
   but < 5.0. Use the pipes feature to help you with this.
```{r}
iris %>%
filter(Sepal.Length > 4.7 & Sepal.Length < 5.0 & Species=="setosa" )
```
2) Select out only the columns relating to Sepal measurements. List only the top five rows:
```{r}
iris %>% select(c(Sepal.Length,Sepal.Width)) %>% head(5)
# Or use a helper function to process strings
iris %>% select(starts_with("Sepal")) %>% head(5)
# Or if you know which column numbers you want
iris %>% select(c(1:2)) %>% head(5)
```
3) Sort the data frame by Sepal.Width such that the row with the largest Sepal.Width is listed first. Print only the first 5 rows
```{r}
iris %>% arrange(desc(Sepal.Width)) %>% head(5)
```
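If your version of dplyr is recent (1.0 or later), **slice_max** offers a more direct way to grab the rows with the largest values of a variable:

```{r}
# slice_max (dplyr >= 1.0) returns the n rows with the largest values.
# Note that by default it keeps ties, so it can return more than n rows.
iris %>% slice_max(Sepal.Width, n = 5)
```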
4) How many observations are there for each Species group?
```{r}
iris %>% group_by(Species) %>% count()
# Or
iris %>% group_by(Species) %>% summarize(total=n())
```
5) Select all columns that do NOT relate to Length. Limit output to 5 rows
```{r}
iris %>% select(-ends_with("Length")) %>% head(5)
```
6) Select all columns that do NOT relate to Length or Species. Limit output to 5 rows
```{r}
iris %>% select(-c(ends_with("Length"),"Species")) %>% head(5)
# Or
iris %>% select(-ends_with("Length")) %>% select(-"Species") %>% head(5)
```
7) Select all columns that do NOT relate to Length or Species, but only for observations where Sepal.Width is > 3.9. There are multiple ways to attack this problem.
```{r}
iris %>% filter(Sepal.Width > 3.9) %>%
select(-ends_with("Length")) %>%
select(-"Species")
# Or
iris %>% select(-ends_with("Length")) %>%
select(-"Species") %>%
filter(Sepal.Width > 3.9)
```
8) Determine the mean, standard deviation, max, and min for Sepal.Length
```{r}
iris %>% summarize(mean=mean(Sepal.Length),
sd=sd(Sepal.Length),
max=max(Sepal.Length),
min=min(Sepal.Length))
```
9) For each Species type, determine the mean, standard deviation, max, and min for Sepal.Length
```{r}
iris %>% group_by(Species) %>% summarize(mean=mean(Sepal.Length),
sd=sd(Sepal.Length),
max=max(Sepal.Length),
min=min(Sepal.Length))
```
# Merging Data Frames
This section describes the process of joining data frames, which is an important skill to have. R has a command called **merge** which can handle some of these tasks, but if you look in the help pages for this command, it will in turn make references to "join operations", which is a more general way to describe merging activity. In reality, joining tables is quite common in SQL (Structured Query Language), so developing facility with joins will pay dividends in the future. Let's start with a very basic example. The data is simple.
```{r}
inventory <- data.frame(part_num=c("001","002","003"),
description=c("Indispensable Widget",
"Flux Capacitor",
"Radiator"),
price=c(20,25,15),stringsAsFactors = FALSE)
sales <- data.frame(part_num=c("001","001","001","003","110"),
quantity_sold=c(23,100,44,98,98),
sales_regions=c("east","west","north","north","south"), stringsAsFactors = FALSE)
```
## Using keys
So the first data frame represents an inventory that lists part numbers, their descriptions, and their prices. Basically, one line for each part.
```{r}
inventory
```
The second data frame, sales, represents the sales of various parts within one or more regions. Notice that part number 002 was not sold at all, that part number 001 was sold in three different regions, and that part number 110 was sold but does not appear in the inventory.
```{r}
sales
```
What if we wanted to *merge* or *join* these two data frames in various ways? Let's explore this. The **part_num** column in each data frame appears to relate to the same thing, so this will be our **key** by which to merge the two data sources.
To visualize what various joins look like here is a diagram from [R for Data Science](https://r4ds.had.co.nz/)
![](./figures/joining.png)
## full_join()
We will start with a **full_join** which will seek to involve all rows in both data frames based on a matching **key** which, in this case, is the **part_num** column.
Let's use the two data frames, along with a join command, to create a single table that contains a sales report for **all** part numbers even if there were no sales of that part number or it is not in the current inventory.
```{r}
full_join(inventory,sales)
```
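By default, **full_join** guesses the key from the columns the two data frames have in common and prints a message telling you what it joined by. You can silence the message, and make your intent explicit, by naming the key yourself. Base R's **merge** with all = TRUE produces an equivalent result:

```{r}
# Explicitly naming the key column
full_join(inventory, sales, by = "part_num")
# The base R equivalent of a full join
merge(inventory, sales, by = "part_num", all = TRUE)
```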