02-literature.Rmd

# The tidyverse

![](./figures/tidyverse.png){width=350}

The **dplyr** package is part of the larger **tidyverse** package set which has expanded considerably in recent years and continues to grow in size and utility such that many people never learn the "older way" of doing things in R. But we've already been through that in the previous section. The tidyverse has the following packages. The descriptions have been lifted from the [tidyverse home page](https://www.tidyverse.org/packages/).

**ggplot2** - ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

**dplyr** - dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. 

**tidyr** - tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.

**readr** - readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.

**tibble** - tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code. 

**stringr** - stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.

**lubdriate** - Date-time data can be frustrating to work with in R. R commands for date-times are generally unintuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight savings times, and other time related quirks. Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not.

## Installing

You will probably use a number of functions from several of these packages so it's best to go ahead and install the entire **tidyverse** in one go. To install it, do one of the following: 

1) At the R Console from within RStudio, type:

```{r eval=FALSE}
install.packages("tidyverse")
```

2) Use the Tools -> Install Packages menu item in RStudio:

![](./figures/inst.png){width=400}

After you have installed the package you may load it by doing:

```{r}
suppressMessages(library(tidyverse))
```

Note that the **cheatsheet** for **dplyr** can be [found here](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)

![](./figures/cheat.png)

## dplyr Basics

**dplyr** is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. In fact if you were paying attention during the opening section on data frames you will have noticed that most of the activities we performed related to the following activities. In dplyr-speak there are the **verbs** that help us get work done.


    mutate() - adds new variables that are functions of existing variables
    select() - picks variables based on their names.
    filter() - picks cases based on their values.
    summarise() - reduces multiple values down to a single summary.
    arrange() - changes the ordering of the rows.


## First Steps

Note that this material references ["Becoming a data ninja with dplyr"](https: //speakerdeck.com/dpastoor/becoming-a-data-ninja-with-dplyr) as well as [this dplyr tutorial]( http://genomicsclass.github.io/book/pages/dplyr_tutorial.html)

We'll go back to the basics here by using a very small data frame which will make it clear how the various **dplyr** verbs actually work:

```{r}
df <- data.frame(id = 1:5,
                 gender = c("MALE","MALE","FEMALE","MALE","FEMALE"),
                 age = c(70,76,60,64,68))
```

![](./figures/df1.png){width=400}


### filter()

The **filter()** function allows us to sift through the data frame to find rows that satisfy some logical condition. (With the older approach we would be using the bracket notation). The following example allows is to find only the observations relating to a declared gender of **female**.

```{r}
filter(df,gender == "FEMALE")

# Given this data frame, the following is equivalent 

filter(df, gender != "MALE")
```

![](./figures/df2.png)


So, now find only the **ids** that relate to rows 1,3, or 5. This is a highly specialized search but it is helpful to show that you can use a wide variety of logical constructs.

```{r}
filter(df, id %in% c(1,3,5))
```

![](./figures/df3.png)

### mutate()

Mutate is used to add or remove columns in a data frame. Let's create a new column in the data frame that contains the mean value of the age column.

```{r}
mutate(df,meanage = mean(age))
```

![](./figures/meanage.png)


Next we will create a new column designed to tell us if a given observation has an age that is greater than or equal to the average age. Specifically, create a variable called **old_young** and assign a value of "Y" if the observed age for that row is above the mean age and a value of "N" if it is not.

```{r}
mutate(df,old_young=ifelse(df$age>=mean(df$age),"Y","N"))
```

One way we could use something like this is in making a plot where the observations exhibiting an age value above the mean are plotted in a certain color and those below the mean are in another color.

```{r}
tmp <- mutate(df, color = ifelse(age > mean(age),"red","blue")) 

plot(tmp$age,col=tmp$color, type="p",
     pch=19,main="Ages",ylab="Age") 

grid()

abline(h=mean(tmp$age),lty=2)

legend("topright",
       c("Above Avg","Below Avg"),col=c("red","blue"),pch=19)
```

### arrange()

Use arrange for sorting the data frame by one or more columns. When using the basic data frame structure from R we had to use the **order()** function to help us generate a vector that has the row numbers of the data frame that correspond to the desired order of display (lowest to highest, etc). 

Let's sort the data frame **dff** by age from oldest to youngest.
First we'll use the older approach. While this will work, it is not exactly very intuitive.

```{r}
df[rev(order(df$age)),]
```

**dplyr** makes this process more simple - at least in my opinion

```{r}
arrange(df,desc(age))
```

Next, let's sort **df** by gender (alphabetically) and then by age
from oldest to youngest. The rows relating to a gender of **female** are going to be listed first because, alphabetically speaking, the letter "F" comes before the letter "M". Then within those categories we have the ages sorted from oldest to youngest.

```{r}
arrange(df, gender,desc(age))
```

If we used the older approach it would look like the following. Ugh !

```{r}
df[order(df$gender,-df$age),]
```

### select()

The select() functions allows us to select one or more columns from a data frame.

```{r}
# Reorder the columns
select(df,gender,id,age)  
```


```{r}
# Select all but the age column
select(df,-age)
```

```{r}
# Can use the ":" operator to select a range
select(df,id:age) 
```

The select() function provides the ability to select by "regular expressions"" or numeric patterns:

```{r}
# Select all columns that start with an "a"
select(df,starts_with("a"))
```

```{r}
names(mtcars)

# Get only columns that start with "c"

select(mtcars,starts_with("c"))
```

This example is more realistic in that data frames can have a large number of columns named according to some convention. For example, the measurements on a patient might not be labelled specifically - they might have a common prefix such as "m_" followed by some sequential number (or not). 

```{r}
testdf <- expand.grid(m_1=seq(60,70,10),
                      age=c(25,32),
                      m_2=seq(50,60,10),
                      m_3=seq(60,70,10))
```


```{r}
testdf
```


Find all the columns that include a "_" character

```{r}
select(testdf,matches("_"))
```

This will select columns beginning with "m_" but only those with a suffix of 1 or 2.

```{r}
select(testdf,num_range("m_",1:2))
```

### group_by()

The **group_by()** function let’s you organize a data frame by some factor or grouping variable. This a very powerful function that is typically used in conjunction with a function called **summarize**. Here is what it looks like by itself. It's somewhat underwhelming. It does seem to create table of some kind but it doesn't do much else. 

```{r}
df

# Hmm. the following doesn't do anything - or so it seems

group_by(df)
```

So as mentioned, the **group_by** function is usually paired with the **summarize** function. Ah. so what this does is to first group the data frame by the **gender** column and then it **counts** the number of occurrences therein. So this is a form of aggregation.

```{r}
summarize(group_by(df,gender),total=n())
```

![](./figures/grp.png)

Let's group the data frame by gender and then compute the average age for each group.

```{r}
summarize(group_by(df,gender),av_age=mean(age))
```

![](./figures/grpa.png)

Let's group by gender and then compute the total number of observations in each gender group and then compute the mean age. 

```{r}
summarize(group_by(df,gender),av_age=mean(age),total=n())
```

# Split Apply Combine

This pattern of using **group_by()** followed by **summarize()** 
is called **Split Apply Combine**. The idea is that we 

1) Split up the data frame by gender group
2) Then for each group, apply the average function
3) Then combine the average results for each group

```{r}
summarize(group_by(df,gender),av_age=mean(age))
```

![](./figures/morpheus.png){width=600}


## What Are Pipes ?

Before moving forward let us consider the "pipe" operator that is included with the dplyr - well actually **magrittr** package. This is used to make it possible to "pipe" the results of one command into another command and so on.

The inspiration for this comes from the UNIX/LINUX operating system where pipes are used all the time. So in effect using "pipes" is nothing new in the world of research computation. 

![](./figures/unix_pipe.png)

**Warning:** Once you get used to pipes it is hard to go back to not using them. 

Let’s use the mtcars data frame to illustrate the basics of the piping mechanism as used by dplyr. Here we will select the **mpg** and **am** column from mtcars and view the top 5 rows.

```{r}
head(select(mtcars, mpg, am))
```

Okay, how would we do this using pipes ? Whoa ! Note that each command is "it's own thing" independently of the pipe character. 
So the:
      - output of mtcars goes into the 
      - input of the **select** function whose output goes into the
      - input of the **head** function

```{r}
mtcars %>% select(mpg, am) %>% head
```

Break this down:

```{r}
mtcars %>% select(mpg, am)
```

The key to understanding how this works is to **read** this from left to right. It bears repeating that each command is "it's own thing" independently of the pipe character. So the:

    - output of mtcars goes into the 
    - input of the **select** function whose output goes into the
    - input of the **head** function

Let's use our new found knowledge to re-imagine our use of the **group_by** and **summarize** functions that we have been using in **composite** form up until now. 

## Using Pipes To Do Split-Apply-Combine

```{r}
df %>% group_by(gender) %>% summarize(avg=mean(age))

# Same as the following but the pipes don't require you to "commit"
# With the following, you have to know in advance what you want to do

summarize(group_by(df,gender), avg=mean(age))
```

This approach allows us to build a "pipeline" containing commands. We don't have to commit to a specific sequence of functions. This enables a free-form type of exploration.

```{r}
df %>% 
  group_by(gender) %>% 
  summarize(avg=mean(age),total=n())
```

What is the median age of all males ? 

```{r}
df %>% 
  filter(gender == "MALE") %>% 
  summarize(med_age=median(age))
```

![](./figures/chrt.png)

### Saving Results

It should be observed that if you want to save the results of some sequence of commands that you will need to use the "<-" operator. Using the previous example we could the following to save our result.

```{r}
results <- df %>% 
              filter(gender == "MALE") %>% 
              summarize(med_age=median(age))
```


## An Example 

Using the built in mtcars dataframe, do the following:
   
   1) **filter** for records where the wt is greater than 3.3 tons.

   2) Then, using the **mutate** function to create a column called 
      ab_be (Y or N) that indicates whether that observation’s 
      mpg is greater (or not) than the average mpg for the filtered set.

   3) Then present the average mpg for each group. 
   
This is easy using pipes and dplyr verbs. 

```{r}
mtcars %>% filter(wt > 3.3)  %>%
           mutate(ab_be=ifelse(mpg > mean(mpg),"Y","N")) %>%
           group_by(ab_be) %>% 
           summarize(mean_mpg=mean(mpg))
```

This could be then "piped" into the input of the **ggplot** command to plot a  corresponding bar chart. If you don't yet know ggplot then it's okay as this will nudge you in that direction. Both **ggplot** and **dplyr** are part of the **tidyverse** which means that the two packages "talk" to each other well. 

```{r}
mtcars %>% filter(wt > 3.3)  %>%
           mutate(ab_be=ifelse(mpg > mean(mpg),"Y","N")  ) %>%
           group_by(ab_be) %>% summarize(mean_mpg=mean(mpg)) %>%
           ggplot(aes(x=ab_be,y=mean_mpg)) +
           geom_bar(stat="identity") +
           ggtitle("Mean MPG") + labs(x = "ab_be", y = "Mean MPG")
```

## Working With Flowers

Let's work with the built in **iris** data frame to explore the world of flowers. This is famous (Fisher's or Anderson's) **iris** data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.  

I'll use dyplr idioms here as opposed to the typical Base R approach which might involve use of composite functions. First, load up the iris data 

```{r}
data(iris)
```


### Structure of The Data frame 

It's always helpful to look at what types of data you have in a data frame. As you already known, base R has a function called **str** which is useful. The tidyverse equivalent is **glimpse** although the two commands basically provide the same types of information. Personally, I still prefer the "old" **str** function if only because I've been using it for so long. 

```{r}
str(iris)
```

```{r}
glimpse(iris)
```


### More Practice

1) Get all the rows where the Species is "setosa" Sepal.Length is > 4.7 
but < 5.0. Use the pipes feature to help you with this.

```{r}
iris %>% 
  filter(Sepal.Length > 4.7 & Sepal.Length < 5.0 & Species=="setosa" )
```

2) Select out only the columns relating Sepal measurements. List only the top five rows:

```{r}
iris %>% select(c(Sepal.Length,Sepal.Width)) %>% head(5)

# Or use a helper function to process strings

iris %>% select(starts_with("Sepal")) %>% head(5)

# Or if you know which column numbers you want

iris %>% select(c(1:2)) %>% head(5)
```

3) Sort the data frame by Sepal.Width such that the row with the largest Sepal.Width is listed first. Print only the first 5 rows

```{r}
iris %>% arrange(desc(Sepal.Width)) %>% head(5)
```


4) How many observations are there for each Species group ? 

```{r}
iris %>% group_by(Species) %>% count()

# Or

iris %>% group_by(Species) %>% summarize(total=n())
```

5) Select all columns that do NOT relate to Length. Limit output to 5 rows

```{r}
iris %>% select(-ends_with("Length")) %>% head()
```

6) Select all columns that do NOT relate to Length or Species. Limit output to 5 rows

```{r}
iris %>% select(-c(ends_with("Length"),"Species")) %>% head()

# Or

iris %>% select(-ends_with("Length")) %>% select(-"Species") %>% head()
```

7) Select all columns that do NOT relate to Length or Species. But only for observations where Sepal.Width is > 3.9.  There are multiples way to attack this problem.


```{r}
iris %>% filter(Sepal.Width > 3.9) %>% 
  select(-ends_with("Length")) %>% 
  select(-"Species")

# Or

iris %>% select(-ends_with("Length")) %>% 
  select(-"Species") %>%
  filter(Sepal.Width > 3.9) 
```

8) Determine the mean, standard deviation, max, and min for Sepal.Length

```{r}
iris %>% summarize(mean=mean(Sepal.Length),
                   sd=sd(Sepal.Length),
                   max=max(Sepal.Length),
                   min=min(Sepal.Length))
```


9) For each Species type, determine the mean, standard deviation, max, and min for Sepal.Length

```{r}
iris %>% group_by(Species) %>% summarize(mean=mean(Sepal.Length),
                                         sd=sd(Sepal.Length),
                                         max=max(Sepal.Length),
                                         min=min(Sepal.Length))
```

# Merging Data Frames

This section describes the process of joiningg data frames which is an important skill to have. R has a command called **merge** which can handle some of these tasks but if you look in the help pages for this command it will in turn make references to "join operations" which is a more general way to describe mering activity. In reality joining tables is quite common in SQL (Structured Query Language) so developing facility will pay dividends in the future. Let's start with a very basic example. The data is simple.

```{r}
inventory <- data.frame(part_num=c("001","002","003"),
                    description=c("Indispensable Widget",
                                  "Flux Capacitor",
                                  "Radiator"),
                    price=c(20,25,15),stringsAsFactors = FALSE)
                    
sales <- data.frame(part_num=c("001","001","001","003","110"),
                    quantity_sold=c(23,100,44,98,98),
               sales_regions=c("east","west","north","north","south"), stringsAsFactors = FALSE)
```

## Using keys

So the first data frame represents and inventory that lists part numbers, their descriptions, and current quantity in inventory. Basically, one line for each part. 

```{r}
inventory
```

The second data frame, sales, represents the sales of various parts within one or more regions. Notice that no quantity of part 2 was sold at all. Notice also that 3 units of part number 1 were sold in three regions. 

```{r}
sales
```

What if we wanted *merge* or *join* these two data frames in various ways. Let's explore these. So the **part_num** column in each data frame appears to relate to the same thing so this will be our **key** by which to merge the two data sources. 

To visualize what various joins look like here is a diagram from [R for Data Science](https://r4ds.had.co.nz/) 

![](./figures/joining.png)

## full_join()

We will start with a **full_join** which will seek to involve all rows in both data frames based on a matching **key** which, in this case, is the **part_num** column. 

Let's use the two data frames, along with a join command, to create a single table that contains a sales report for **all** part numbers even if there were no sales of that part number or it is not in the current inventory. 

```{r}
full_join(inventory,sales)
```

This table includes all matched records (using **part_num** as a **key**) from both the data frames. The operation will supply NA values for missing matches on either side. What does "matched records" mean ? 

![](./figures/inventorymatch1.png){width=550}

![](./figures/salesmatch1.png)

Notice we see part number 002 even though there were no sales. We also see a row for part 110 even though it doesn't appear in the inventory. Notice that there are missing values. 

The reason we get a value of NA for **description** and **price** in the 6th row is that part number 110 does not exist within the **inventory** data frame. So, to reasonably include the information in the fully joined data, the NAs must be provided as placeholders for the missing values. 

Also notice that there are missing values for the quantity and sales_regions realtive to part number 002 since it doesn't appear 
within the sales data frame. 

## inner_join()

Next, let's create a joined table that shows **only** information for parts listed in inventory and only those parts that had at least one transaction. In effect we want only rows that are in common (based on the **part_num** key) to both data frames. 

The  **inner_join** will produce a data frame that includes rows pertaining to part numbers *in common* to both data frames. An inner join matches pairs of observations whenever their keys are equal. 

```{r}
inner_join(inventory,sales)
```

This would mean then that the row in the sales data frame that refers to a part number of 110 would be omitted in the result since it does not occur in the inventory data frame. Part number 002 does not appear in the sales data frame so it is not listed here. 

## left_join()

Next up we have the **left_join** which will "favor" the first data frame referenced in the command in that it will include rows from inventory AND rows from **sales** where the **part_num** matches. So part number 002 is still included even though it does not appear in the **sales** data frame. 

```{r}
left_join(inventory,sales)
```

## right_join()

The right_join() will favor the second specified data frame which it will return all the rows from sales and rows from sales where the part number matches. This is why we get a line in the result for part 110 that includes missing values for description and price since those values are missing the inventory data frame. 

```{r}
right_join(inventory,sales)
```

There are other join types but these cover the main ones. 

# Your Turn

Now it's time for you to try some exercises on your own. The **ggplot2** package (which is part of the larger **tidyverse** comes with some data sets one of which is called **msleep**. It's easy to load into your work space:

```{r}
data(msleep)
```

This data set is 

 > an updated and expanded version of the mammals sleep dataset. Updated sleep times and weights were taken from V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, 104 (3):1051-1056, 2007

![](./figures/vore.png){width=600}

You can always get the names of the columns in a data frame using **str** or **names**

```{r}
names(msleep)

#

str(msleep)
```


Not only is this a new data frame for you it also has the added challenge of containing some missing values - those pesky "NAs". In the older approach we used the function **complete.cases** to help us figure out which rows had at least one missing value. We could still use that here although the **dplyr** world has a function called **drop_na()** which could be used but itself or as part of a pipeline. 


Here are some questions I want you to answer. Note that the answers are provided so you can check your work. However, the code used to generate the answer is not visible. There is typically more than one way to answer the question. 


#### What is the average total sleep time for Omnivores ? 

```{r echo=FALSE}
msleep %>% filter(vore=="omni") %>% summarize(mean=mean(sleep_total))

# Or

msleep %>% group_by(vore) %>% 
  summarize(mean=mean(sleep_total)) %>% 
  filter(vore=="omni")
```


#### Group the msleep data frame by taxanomic order and then summarize the mean sleep total for each group. Make sure the resulting table is arranged in descending order of the average - from highest sleep_total average to the lowest


```{r echo=FALSE}
msleep %>% 
  group_by(order) %>% 
  summarize(avg=mean(sleep_total)) %>%
  arrange(desc(avg))
```

#### What is the average total sleep time for all vore types ? 


```{r echo=FALSE}
msleep %>% group_by(vore) %>% 
  summarize(mean=mean(sleep_total))
```


#### Omitting any rows that contain missing values, what is the average total sleep time for all vore types ? Remember that you will need to use the **drop_na** function to help you. 

```{r echo=FALSE}
msleep %>% 
  drop_na() %>% 
  group_by(vore) %>% 
  summarize(mean=mean(sleep_total))
```


#### Remove all rows that contain a missing value and save the result into a new data table called **msleep_na**. How many rows does msleep_na contain ? 

```{r echo=FALSE}
msleep_na <- msleep %>% drop_na()
msleep_na %>% nrow()
```


#### Using msleep_na, find the average braintwt for all vores by order. Note that the group_by() function can take multiple arguments. 

```{r echo=FALSE}
msleep_na %>% 
  group_by(order,vore) %>% 
  summarize(mean=mean(brainwt))
```


# Chicago Crime Data

Here we look at some data obtained from the City of Chicago Open Data Portal <https://data.cityofchicago.org/>. This is data that represents phone calls to the police during the year of 2012. Just pick a folder on your hard drive and download the data. 


## Getting the Data

```{r eval=FALSE}
url <- "https://raw.githubusercontent.com/steviep42/nursing_741/master/chi_crimes.csv"

download.file(url,"chi_crimes.csv")
```


Next, read the file in to a local data frame:


```{r}
chi <- read_csv("chi_crimes.csv")
```

## Inspecting the Data

The way the data is organized is that each row represents a call to the police which may or may not result in an actual arrest. In any case there will be a case number assigned to the call and other information such as the Best, Ward, Block, and Community Area will be noted.  If there is an arrest it will be recorded and then it will be categorized according to an FBI code. The location of the call will be recorded in latitude and longitude. Some basic questions might be how many rows and columns there are, etc


```{r}
glimpse(chi)
```

So let's check this data frame out some more. For example, what are the factors in this data frame ? Those that take on only a few different values might be good summary variables.

The way to do this with Base R would be:

```{r}
sapply(chi,function(x) length(unique(x)))
```

The way to do this with dplyr could be:

```{r}
chi %>% summarize_all(n_distinct)
```


From viewing this perhaps variables like Arrest, Domestic, FBI Code, District, and Ward could be good summary variables. We don't really have any kind of measured data here. 

We do have lots of dates which at this point are actually character strings so let's turn them into dates and then work with them to get some more summary variables created. There are also some latitude and longitude information that we perhaps later use to draw some maps if we want. We'll consider that later. 

## Creating Some Dates

```{r}
suppressMessages(library(lubridate))
# Right now the dates and times are just a character string

str(chi$Date)  

# Note that the dates have a PM or AM string. Let's exploit 
# that information to create a category that captures this information.

chi <- chi %>% mutate(am_pm=ifelse(grepl("PM",Date),"PM","AM")) 

# Let's turn them into actual dates and times using the lubridate package

chi <- chi %>% 
  mutate(Date = parse_date_time(chi$Date,'%m/%d/%Y %I:%M:%S %p'))
range(chi$Date)
```

It's worth noting that we can use some functions from lubridate to further group the dates and times so we can do some more specific analysis. [Here](http://www.noamross.net/blog/2014/2/10/using-times-and-dates-in-r---presentation-code.html) is link to a tutorial on lubridate. Anyway try some of these:

```{r}
# List the first 5 records from the chi data table
chi$Date %>% head

# This will tell us what day of the week the given date and time represents
chi$Date %>% weekdays %>% head

# This will tell us what month the date and time falls into
chi$Date %>% months %>% head

# Let's add a factor to the data table/data frame that gives the month
chi <- chi %>% mutate(Month=months(Date)) 
chi <- chi %>% mutate(Month=factor(Month, levels = month.name))

chi %>% distinct(Month)
```


Let's add in a few more factors based on the date since it will help us build some interesting tables as we try to understand the patterns in Chicago crime during the year of 2012. It is also important to understand how to work with dates since they wind up being important when looking at data. Not every data set will be a function of dates or be reliant upon them for interpretation but this one is. 

It's also not necessary to create these categories in advance or make them part of the data frame. We could do this on the fly as we will see later. We will create a category to classify the quarter during which the call to the police occurred.

```{r}
chi <- chi %>% mutate(quarter=quarter(Date)) 

chi <- chi %>% mutate(weekdays=weekdays(Date))

chi %>% select(Month, am_pm, weekdays, quarter) %>% head()
```

## Count Arrests

Now we can start to ask some basic questions to better understand the data. When we have lots of categories in our data for which we could make some count-based summaries which gives us some frequencies. For example, let's focus on the "Arrest" variable to figure out how many Arrests happened. What about the FBI codes ? 

```{r}
# This gives us a table of Arrests vs non-Arrests

chi %>% count(Arrest)

# And we could see the count of reported crimes according to FBI code

chi %>% count(`FBI Code`) %>% arrange(desc(n))
```

Relative to the FBI codes I don't know what the code actually mean but I found a description of them at [this URL](http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html)
So it seems to me that the top 5 areas, at least by FBI codes, are: 

* Larceny
* Simple Battery
* Vandalism
* Drug Abuse
* Misc Non-Index Offense

We could do two-way table of course. This helps us look at relationships between categories. Here we look at how Arrests relate to whether it's day or night. When do most arrests occur ? 

```{r}
chi %>% 
  count(Arrest,am_pm) %>% 
  filter(Arrest==TRUE) %>% 
  arrange(desc(n))
```

So let's plot a bar chart that shows is the total number of calls to the police for each month and what portion of those calls each month resulted in an arrest.

```{r}
chi %>% count(Arrest,Month) %>% 
  ggplot(aes(x=Month,y=n,fill=Arrest)) + 
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=45)) + 
  ylab("Calls To Police") +
  ggtitle("Chicago Arrests per Month in 2013")
```

We could also put the bars next to each other. Notice here that I use the **group_by()** function with **summarize()** whereas before I used the **count()** function. The two approaches are equivalent. 

```{r}
chi %>% group_by(Arrest,Month) %>% 
  summarize(total=n()) %>% 
  ggplot(aes(x=Month,y=total,fill=Arrest)) + 
  geom_bar(stat="identity",position=position_dodge()) + 
  theme(axis.text.x=element_text(angle=45)) + 
  ggtitle("Chicago Arrests per Month in 2013")

```

## Crime At Day vs Night

In terms of number of calls to Police let's see how many calls there were per month and then break that down by night vs day. You can see then that calling at night is more frequent. Intuitively this makes some sense. 

```{r}
chi %>% group_by(am_pm,Month) %>% 
  summarize(total=n()) %>% 
  ggplot(aes(x=Month,y=total,fill=am_pm)) + 
  geom_bar(stat="identity",position=position_dodge()) + 
  theme(axis.text.x=element_text(angle=45)) + 
  ggtitle("Calls to Police per Month in 2013")

```

## Line Graphs and Histograms

Let's try something else here. The lubridate package has a function that we can use to get the numeric day of the year. For example January 5, 2012 would be the 5th day of the year. So we could add up the number of calls to the police for each day of the year and maybe plot that as a line chart. Notice here how we create the **ydays** category on the fly. We don't need to create it as part of the data frame. 

```{r}
chi %>% mutate(ydays = yday(Date)) %>% 
  group_by(ydays) %>% summarize(total_calls=n()) %>% 
  ggplot(aes(x=ydays,y=total_calls)) + geom_line() +
  ggtitle("Calls per Each Day of 2012") + 
  xlab("Day of the year 1 - 365") +
  ylab("Total Calls to the Police")

```

Now if we wanted to we could split the line graph up into two different panels or "facets" as ggplot calls it. This won't take much more than what we already had before. We'll just throw in the am_pm variable.

```{r}
chi %>% 
  mutate(ydays = yday(Date)) %>% 
  group_by(am_pm,ydays) %>% 
  summarize(total_calls=n()) %>%
  ggplot(aes(x=ydays,y=total_calls)) + geom_line() + facet_grid(am_pm~.)
```

That looks okay but what about changing the arrangement of the panels ? That's easy - we just change the formula in the **facet_grid** layer

```{r}
chi %>% 
  mutate(ydays = yday(Date)) %>% 
  group_by(am_pm,ydays) %>% 
  summarize(total_calls=n()) %>%
  ggplot(aes(x=ydays,y=total_calls)) + geom_line() + facet_grid(.~am_pm)

```


Now we could also use a function called **hour** which given the date will extract the hour of the day that the call to the police occurred. Let's generate a histogram of calls to the police based on the hour of the day. We'll pick a bin size of 24 since this will match the number of hours 

```{r}

chi %>% mutate(hod=hour(Date)) %>% 
  ggplot(aes(x=hod)) + 
  geom_histogram(bins=24,color="black",fill="white") + 
  ggtitle("Call Count by Hour of the Day") +
  xlab("Hour of the day 0 - 23") + theme_bw()

```

## Calendar Map

The **googleVis** package has a way to create something known as a calendar heatmap. We can use dplyr functions, as we have been, to count the number of calls per day of the year and then use that to create the heatmap. Most of the work here is involved with mutating the date to a "Year Month Day" format after which we count the number of police calls that day. 

```{r eval=FALSE}
calend <- chi %>% 
  mutate(ydays = ymd(format(chi$Date,"%Y-%m-%d"))) %>% 
  group_by(ydays) %>% summarize(total_calls=n()) %>% 
  gvisCalendar(.,datevar="ydays",numvar="total_calls",
               options=list(width="1000px", height="300px"))
plot(calend)
```

![](./figures/calend.png)


## What Types of Crime Do We See ? 

Next up lets look at the counts for the most frequently committed types of crimes. This will help us understand what our risks are should we consider living in Chicago. First we can create a table of all the Crime types by using the **Primary Type** variable.

```{r}
# Sort the Crime Types From Highest Count to Lowest
chi %>% count(`Primary Type`) %>% arrange(desc(n))

# Equivalent to 

chi %>% 
  group_by(`Primary Type`) %>% 
  summarize(n=n()) %>%
  arrange(desc(n))
```

Let's make a barplot. Most of this relates to using ggplot to plot things which can make you think that the dplyr verbs are complex when they aren't. We did have to **mutate** the **Primary Type** variable to reorder it according to the corresponding number of crimes for each type. This is what makes the bar plot organize the bars according to highest to lowest. 

```{r}
chi %>% count(`Primary Type`) %>% 
  arrange(desc(n)) %>% 
  ggplot(aes(x=reorder(`Primary Type`,-n),y=n)) + 
  geom_bar(stat="identity",fill="#f68060", alpha=.6, width=.8) + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
```


Let's clean this plot up to show say only the top 12 crime types. Well, we don't know that these were actual crimes - only that the call was classified into a certain category. 

```{r}
chi %>% group_by(`Primary Type`) %>% 
  summarize(n=n()) %>% 
  arrange(desc(n)) %>% 
  slice(1:20) %>% 
  ggplot(aes(x=reorder(`Primary Type`,-n),y=n)) + 
  geom_bar(stat="identity",fill="#f68060", alpha=.6, width=.8) + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
```

So this just gives us the total number of police calls relating to a given crime type. It doesn't say very much about the total number of arrests that were made.

```{r}
chi %>% group_by(Arrest,`Primary Type`) %>% 
  summarize(n=n()) %>% 
  arrange(desc(n)) %>% 
  slice(1:20) %>% 
  ggplot(aes(x=reorder(`Primary Type`,-n),y=n,fill=Arrest)) + 
  geom_bar(stat="identity", width=.8) +
  xlab("Crime Type") +
  ylab("Total Calls") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

This is illuminating in that almost all the calls relating to NARCOTICS seem to result in an arrest. Let's examine these offenses in greater detail since they don't seem to fit the pattern. That is most of the other crime reports don't seem to result in arrest except maybe prostitution, criminal trespass, and weapons violation. We can drill down some more. Let's slice out the lower 10 of the top 20. 


```{r}
chi %>% group_by(Arrest,`Primary Type`) %>% 
  summarize(count=n()) %>% 
  arrange(desc(count)) %>% 
  slice(10:20) %>% 
  ggplot(aes(x=reorder(`Primary Type`,-count),
             y=count,fill=Arrest)) +     
  geom_bar(stat="identity",width=.8) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```


Oh so now we see that there are several areas that tend to result in an arrest. Narcotics isn't the only one at all. This seems a little weird to me though because ALL prostitution calls result in an arrest. Let's filter out some of these crimes to see what is going on.

```{r}
chi %>% filter(grepl("PROST",Description)) %>% 
  group_by(Arrest) %>% 
  summarize(cnt=n())

#

chi %>% filter(grepl("BURGLARY",`Primary Type`)) %>% 
  group_by(Arrest) %>% 
  summarize(cnt=n())

```

More stuff to consider about NARCOTICS

```{r}
chi %>% filter(grepl("NARCOTICS",`Primary Type`)) %>% 
  group_by(Description) %>%
  summarize(count=n()) %>%
  arrange(desc(count)) %>% 
  slice(1:10) %>% 
  ggplot(aes(x=reorder(Description,-count),
             y=count)) +      
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

```

So looks like we have lots of people getting arrested for possession of cannabis of 30 grams or less. Heroin, Crack, and Cocaine arrests are next in line but the bulk of arrests relate to cannabis. Now there are also some arrests relating to Cannabis possession with amounts greater than 30 Grams.


```{r}

# Most of the > 30GMS cannabis possession happens in the STREET

chi %>% filter(grepl("CANNABIS MORE THAN 30GMS",Description)) %>%
  group_by(`Location Description`) %>% 
  summarize(count=n()) %>%
  arrange(desc(count))


```


## Creating A Map

Next we will put some of these locations on a map - the ones that relate to narcotics. 

```{r echo=FALSE}
chi_map <- readRDS("chi_map.rds")
chi_map_13 <- readRDS("chi_map_13.rds")
```


```{r etmap13,eval=FALSE}
library(ggmap)

chi_map <- get_map("Chicago, Illinois",zoom=11,extent="device")
```


```{r}

# The following reveals that there are 284 violations. 
street <- chi %>% 
  filter(grepl("CANNABIS MORE THAN 30GMS",Description)) %>%
  filter(grepl("STREET",`Location Description`)) 
  
nrow(street)

# Let's map these using R googleVis
ggmap(chi_map) + geom_point(aes(x=Longitude,y=Latitude, 
                         color=am_pm),alpha=1,data=street)


```

Now, let's compare this to all Narcotics Offenses
```{r cache=TRUE}
narcotics<- chi %>% filter(grepl("NARCOTICS",`Primary Type`))

ggmap(chi_map) + geom_point(aes(x=Longitude,y=Latitude, 
                         color=am_pm),alpha=.9,data=narcotics) 
 
```

There is a geometry called "stat_bin2d" that will allow us to bin locations into "containers" for purposes of plotting. Here we create a "bubble" chart which let's us collect reported crime locations areas (squares) so we can maybe see where these crimes are being committed. The problem with this approach is that it doesn't give us a strong idea about how many crimes are happening in these areas. 


```{r cache=TRUE}
ggmap(chi_map) + 
  stat_bin2d(
    aes(x=Longitude, y=Latitude, color=am_pm, fill=am_pm),
    size= .5,bins=30, alpha=1/2,
    data = narcotics
  )
```

We can use some predefined terms such as "..level.." to get counts of the lat lons in an area which will then be reflected in "fill". This plot winds up being very cool and informative. We can see that the Narcotics offenses are happening on the East Side and near the University Village and Little Italy. 

```{r echo=FALSE,val=FALSE}
register_google(key="AIzaSyAl4fVkTWpnEFanFZaKegkkRNC-WOFZjBk")
```


```{r getmap13,eval=FALSE,cache=TRUE}
chi_map_13 <- get_map("Chicago,Illinois",zoom=13,extent="device")
```


```{r statd13,eval=FALSE}
ggmap(chi_map_13) +
  stat_density2d(
    aes(x = Longitude, y =Latitude, fill = ..level..,
        alpha = ..level..),
    size = 2, bins = 4, data = narcotics, geom = "polygon")

```

![](./figures/dense2.png){width=800}

```{r cache=TRUE,eval=FALSE}
ChiMap <- ggmap(chi_map_13, 
                base_layer = ggplot(aes(x = Longitude, 
                                        y = Latitude),
                                        data = narcotics))

# Now we overlay the base with a density based approach
# The more crime a location experienvces, the darker the color

ChiMap +
 stat_density2d(aes(x = Longitude, y =Latitude, 
                    fill = ..level.., alpha = ..level..),
                size = 2, bins = 4, 
                data = narcotics, 
                geom = "polygon") + facet_wrap(~weekdays)
```

![](./figures/sfacets.png){width=800}

## Leaflet 

**Leaflet** is a javascript library that simplifies the creation of maps and placement of markers there upon. R has a companion package called "leaflet" that you can use to help create some really nice maps. All that is required is some latitude and longitude information which we already have since we are working with the Chicago data. 

Let's continue with the Narcotics data and see where these are occurring. I think you will like the output that leaflet provides because it is interactive and you can zoom in and out of the data


```{r leaflet}
library(leaflet)

m <- leaflet() %>% 
  addTiles() %>% 
  addMarkers(lng=narcotics$Longitude,
             lat=narcotics$Latitude,
             popup=narcotics$Description,
             clusterOptions = markerClusterOptions())

m
```