---
title: "Creating new features"
author: "Seth Mottaghinejad"
output: github_document
date: "`r Sys.Date()`"
---
```{r chap04chunk01, include=FALSE}
source('setup.R')
```
Feature creation and feature selection can be among the most time-consuming parts of model-building. It is often where the interplay between technical knowledge (for example, about the models we intend to build) and business or domain knowledge (data requirements and how the analysis will be put to use) is at the forefront.
Let's review our workflow so far:
1. load all the data (and combine them if necessary)
2. inspect the data in preparation for cleaning it
3. clean the data in preparation for analysis
4. **add any interesting features or columns as far as they pertain to the analysis**
5. find ways to analyze or summarize the data and report your findings
### Exercises
Feature extraction is the process of creating new (and interesting) columns in our data out of the existing columns. Sometimes new features can be extracted directly from one or several columns in the data. For example, we can extract the day of the week from `pickup_datetime` and `dropoff_datetime`. Sometimes new features rely on third-party data. For example, we could have a `holiday_flag` column to indicate which dates were holidays.
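As a quick sketch of the second case, a holiday flag could be created by checking each pick-up date against a vector of holiday dates (the dates below are just placeholders; a real list would come from a third-party source):
```{r chap04extra01}
holidays <- as.Date(c("2016-01-01", "2016-07-04", "2016-12-25")) # placeholder holiday dates
pickup_is_holiday <- as.Date(nyc_taxi$pickup_datetime) %in% holidays # flag trips starting on a holiday
head(pickup_is_holiday)
```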
Let's take a look at the data as it now stands.
```{r chap04chunk02}
head(nyc_taxi)
```
Discuss possible 'features' (columns) that we can extract from already existing columns. Recall that our goal is to tell interesting (unexpected, or not immediately obvious) stories based on the data, so think of features that would make this dataset more interesting to analyze and the story more compelling.
## Date and time features
The first set of features we extract are date and time related features. Specifically, we would like to know the day of the week and the time of the day (based on our own cutoffs).
```{r chap04chunk03}
library(dplyr)
library(lubridate)
weekday_labels <- c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
cut_levels <- c(1, 5, 9, 12, 16, 18, 22) # used to bucket the hour of the day into the intervals labeled below
hour_labels <- c('1AM-5AM', '5AM-9AM', '9AM-12PM', '12PM-4PM', '4PM-6PM', '6PM-10PM', '10PM-1AM')
nyc_taxi <- mutate(nyc_taxi,
pickup_hour = addNA(cut(hour(pickup_datetime), cut_levels)),
pickup_dow = factor(wday(pickup_datetime), levels = 1:7, labels = weekday_labels),
dropoff_hour = addNA(cut(hour(dropoff_datetime), cut_levels)),
dropoff_dow = factor(wday(dropoff_datetime), levels = 1:7, labels = weekday_labels),
trip_duration = as.integer(as.duration(dropoff_datetime - pickup_datetime))
)
levels(nyc_taxi$pickup_hour) <- hour_labels
levels(nyc_taxi$dropoff_hour) <- hour_labels
head(nyc_taxi)
```
Before moving on, let's look at how `trip_duration` (stored in seconds) is distributed by plotting its percentiles:
```{r chap04chunk04}
library(ggplot2)
dp <- data.frame(x = seq(.01, .99, by = .01))
dp$y <- quantile(nyc_taxi$trip_duration, p = dp$x, na.rm = TRUE) # trip duration percentiles
ggplot(aes(y = y/60, x = x*100), data = dp) + # duration in minutes against percentile
  geom_line()
```
## Geographical features
The next set of features we extract from the data are geographical features, for which we load the following geospatial packages:
```{r chap04chunk05}
library(rgeos)
library(sp)
library(maptools)
```
It is common to store GIS data in R as **shapefiles**. A shapefile is essentially a data object that stores geospatial information such as region names and boundaries, where a region can be anything from a continent to a city neighborhood. The shapefile we use here was provided by Zillow.com and can be found [here](http://www.zillow.com/howto/api/neighborhood-boundaries.htm). It is a shapefile for the state of New York, and it contains neighborhood-level information for New York City.
```{r chap04chunk06}
nyc_shapefile <- readShapePoly('ZillowNeighborhoods-NY/ZillowNeighborhoods-NY.shp')
```
We can see what sort of information is available by peeking at `nyc_shapefile@data`:
```{r chap04chunk07}
head(nyc_shapefile@data, 10)
```
The data stores information about neighborhoods under the column `NAME`. Since we have the longitude and latitude of the pick-up and drop-off locations, we can use the above dataset to find the pick-up and drop-off neighborhoods for each cab ride. To keep the analysis simple, we limit the data to Manhattan only, where the great majority of cab rides take place.
```{r chap04chunk08}
nyc_shapefile <- subset(nyc_shapefile, COUNTY == 'New York') # limit the data to Manhattan only
```
Notice that even though `nyc_shapefile` is not a `data.frame`, `subset` still worked. This is because `subset` is a function that works on more than one kind of input. Quite a few R functions are the same way, such as `plot` and `predict`.
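A quick way to see this is to check the class of each object; the same function call can then behave appropriately for each class (a small illustration using the objects created above):
```{r chap04extra02}
class(nyc_taxi)      # the taxi dataset
class(nyc_shapefile) # the shapefile object, a SpatialPolygonsDataFrame from the sp package
```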
With a bit of work, we can plot a map of the whole area, showing the boundaries separating each neighborhood. We won't go into great detail on how the plots are generated, as it would derail us from the main topic.
```{r chap04chunk09}
library(ggplot2)
nyc_shapefile@data$id <- as.character(nyc_shapefile@data$NAME)
nyc_points <- fortify(gBuffer(nyc_shapefile, byid = TRUE, width = 0), region = "NAME") # fortify neighborhood boundaries
```
As part of the code to create the plot, we use `dplyr` to summarize the data and get median coordinates for each neighborhood, but since we revisit `dplyr` in greater depth in the next section, we skip the explanation for now.
```{r chap04chunk10}
library(dplyr)
nyc_df <- inner_join(nyc_points, nyc_shapefile@data, by = "id")
nyc_centroids <- summarize(group_by(nyc_df, id), long = median(long), lat = median(lat))
library(ggrepel)
library(ggplot2)
ggplot(nyc_df) +
aes(long, lat, fill = id) +
geom_polygon() +
geom_path(color = "white") +
coord_equal() +
theme(legend.position = "none") +
geom_text_repel(aes(label = id), data = nyc_centroids, size = 3)
```
We now go back to the data to find the neighborhood information based on the pick-up and drop-off coordinates. We store pick-up longitude and latitude in a separate `data.frame`, replacing NAs with zeroes (the function we're about to use doesn't work with NAs). We then use the `coordinates` function to point to the columns that correspond to the geographical coordinates. Finally, we use the `over` function to find the region (in this case the neighborhood) that the coordinates in the data fall into, and we append the neighborhood name as a new column to the `nyc_taxi` dataset.
```{r chap04chunk11}
data_coords <- data.frame(
long = ifelse(is.na(nyc_taxi$pickup_longitude), 0, nyc_taxi$pickup_longitude),
lat = ifelse(is.na(nyc_taxi$pickup_latitude), 0, nyc_taxi$pickup_latitude)
)
coordinates(data_coords) <- c('long', 'lat') # we specify the columns that correspond to the coordinates
# we replaced NAs with zeroes above, because NAs won't work with the `over` function
nhoods <- over(data_coords, nyc_shapefile) # returns the neighborhoods based on coordinates
nyc_taxi$pickup_nhood <- nhoods$NAME # we attach the neighborhoods to the original data and call it `pickup_nhood`
```
We can use `table` to get a count of pick-up neighborhoods:
```{r chap04chunk12}
head(table(nyc_taxi$pickup_nhood, useNA = "ifany"))
```
We now repeat the above process, this time using the drop-off coordinates to get the drop-off neighborhood.
```{r chap04chunk13}
data_coords <- data.frame(
long = ifelse(is.na(nyc_taxi$dropoff_longitude), 0, nyc_taxi$dropoff_longitude),
lat = ifelse(is.na(nyc_taxi$dropoff_latitude), 0, nyc_taxi$dropoff_latitude)
)
coordinates(data_coords) <- c('long', 'lat')
nhoods <- over(data_coords, nyc_shapefile)
nyc_taxi$dropoff_nhood <- nhoods$NAME
```
And since `data_coords` and `nhoods` are potentially large objects, we remove them from our session when they're no longer needed.
```{r chap04chunk14}
rm(data_coords, nhoods) # delete these objects, as they are no longer needed
```
Note how we had to repeat the same process in two different steps, once to get pick-up and once to get drop-off neighborhoods. Now if we need to change something about the above code, we have to change it in two different places. For example, if we want to reset the factor levels so that only Manhattan neighborhoods are showing, we need to remember to do it twice.
Another downside is that we ended up with the leftover objects `data_coords` and `nhoods`. Since both objects have the same number of rows as the `nyc_taxi` dataset, they are relatively large, so we manually deleted them from the R session using `rm` after we finished using them. Carrying around too many by-product objects that are no longer needed can clog up memory, especially if the objects take up a lot of space, so we need to do some housecleaning every now and then to keep our session clean. That is easier said than done.
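As a rough guide for this sort of housecleaning, we can check how much memory an object takes up and remove it once we no longer need it (the exact numbers will of course depend on the data):
```{r chap04extra03}
print(object.size(nyc_taxi), units = "Mb") # approximate size of the dataset in memory
# rm(data_coords, nhoods); gc() # remove leftover objects and optionally reclaim memory
```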
There is however something we can do to avoid both of the above headaches: wrap the process into an R function.
### Exercises
(1) Conceptually describe the function we alluded to in the above section. In other words:
- what would be the input(s) to the function?
- what would the function return as output?
- what are the intermediate results that are created by the function?
(2) Here's a basic user-defined R function:
```{r chap04chunk15}
p <- 5
do.something <- function(n, d) {
m <- n + p
return(m/d)
}
```
- What is the name of the function?
- What are the function's arguments?
- What are the local and global variables?
- What is the "body" of the function?
- What does the function return?
- What happens to `m` and `d` (local variables) once the function finishes running?
Try to predict what the function returns in the two cases here:
```{r chap04chunk16}
do.something(10, 3)
p <- 8
do.something(10, 3)
```
(3) Change the above function so that `p` is always 5 for the function. There is more than one way to do this.
### Solutions
Here's the solution to part (3). As we mentioned, there's more than one way of doing this. One way is to turn `p` into an argument for the function `do.something`, with the default value set to 5.
```{r chap04chunk17}
do.something <- function(n, d, p = 5) {
m <- n + p
return(m/d)
}
```
Another approach is to make `p` a local variable, instead of a global variable. When the function runs, regardless of what `p` is assigned to globally, `p` will always be assigned the value 5 in the function's environment.
```{r chap04chunk18}
do.something <- function(n, d) {
p <- 5
m <- n + p
return(m/d)
}
p <- 8
do.something(10, 3)
```
## Creating a function
With the last exercise as an introduction, believe it or not, we now know everything we need to accomplish the automation task we set out to do. We already have the bulk of the code that the function relies on, so it's often just a matter of pasting it into the body of the function and making some minor changes. To write good functions, we often begin by writing code that works, then we identify the need for automation (to reuse code, or to automatically clean up intermediate results), and finally we wrap the code in a function and modify and test it to make sure it still works.
Of course writing good functions can be more involved than what we described here. This is especially true when we write functions that we intend to use across multiple projects or share with others. In such cases, we often spend more time anticipating all the ways that the function could break given different inputs, and we try to account for such cases.
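As a small sketch of what that can look like (using a hypothetical `do.something.safe` so we don't overwrite the function from the exercise), we might add a few input checks that fail early with a clear message:
```{r chap04extra04}
do.something.safe <- function(n, d, p = 5) {
  stopifnot(is.numeric(n), is.numeric(d), is.numeric(p)) # fail early on non-numeric inputs
  if (any(d == 0, na.rm = TRUE)) warning("some divisors are zero; expect Inf in the result")
  (n + p) / d
}
do.something.safe(10, 3)
```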
With the last exercise as a backdrop, let's now delete `pickup_nhood` and `dropoff_nhood` from the data and recreate those columns, this time by writing a function.
```{r chap04chunk19}
nyc_taxi$pickup_nhood <- NULL # we drop this column so we can re-create it here
nyc_taxi$dropoff_nhood <- NULL # we drop this column so we can re-create it here
```
### Exercises
Here is our earlier code for appending neighborhood information (drop-off in this case) to the data using the shapefile. Wrap this code in a new function called `add.neighborhoods`. Give the function the correct inputs (think about what those need to be) and modify the code so that the function takes those inputs and produces the correct output (think about what that needs to be). Once you have your function, run it twice: once to get the pick-up neighborhood and once to get the drop-off neighborhood.
```{r chap04chunk20, eval=FALSE}
data_coords <- data.frame(
long = ifelse(is.na(nyc_taxi$dropoff_longitude), 0, nyc_taxi$dropoff_longitude),
lat = ifelse(is.na(nyc_taxi$dropoff_latitude), 0, nyc_taxi$dropoff_latitude)
)
coordinates(data_coords) <- c('long', 'lat')
nhoods <- over(data_coords, nyc_shapefile)
nyc_taxi$dropoff_nhood <- nhoods$NAME
```
### Solution
We call the function `add.neighborhoods`. Its inputs are the longitude and latitude coordinate vectors and the shapefile. The output we return is a single column containing the neighborhood names.
```{r chap04chunk21}
add.neighborhoods <- function(long_var, lat_var, shapefile) {
require(rgeos)
require(maptools)
data_coords <- data.frame(long = ifelse(is.na(long_var), 0, long_var),
lat = ifelse(is.na(lat_var), 0, lat_var)) # create `data.frame` with only those two columns
coordinates(data_coords) <- c('long', 'lat') # designate the columns as geographical coordinates
nhoods <- over(data_coords, shapefile) # find the neighborhoods the coordinates fall into
nhoods$NAME <- factor(nhoods$NAME, levels = as.character(shapefile@data$NAME)) # reset factor levels to Manhattan only
return(nhoods$NAME) # return only the column with the neighborhoods
}
```
We can now use our function twice. Once to find the pick-up neighborhood:
```{r chap04chunk22}
nyc_taxi$pickup_nhood <- add.neighborhoods(nyc_taxi$pickup_longitude, nyc_taxi$pickup_latitude, nyc_shapefile)
table(nyc_taxi$pickup_nhood, useNA = "ifany")
```
And a second time to find the drop-off neighborhood:
```{r chap04chunk23}
nyc_taxi$dropoff_nhood <- add.neighborhoods(nyc_taxi$dropoff_longitude, nyc_taxi$dropoff_latitude, nyc_shapefile)
table(nyc_taxi$dropoff_nhood, useNA = "ifany")
```
### Exercises
Let's revisit the function we defined in the last exercise:
```{r chap04chunk24}
p <- 5
do.something <- function(n, d) {
m <- n + p
m/d
}
```
(1) Is the function **"vectorized"**? That is, if the inputs are vectors, are the outputs also vectors? Show by example. This question is not trivial, because vectorized functions can be used directly for column transformations (after all, columns in a `data.frame` are just vectors).
(2) Based on what we learned about vectorized functions, is the `ifelse` function vectorized? What about `if` and `else`? Once again, answer the question using examples for each.
### Solutions
For part (1), instead of feeding single numbers as inputs to the function, we try an `integer` vector. Since the function has two inputs, we can try an `integer` vector for the first input (and a singleton for the second), vice versa, or `integer` vectors for both inputs.
```{r chap04chunk25}
do.something(10, 3) # singleton inputs
```
```{r chap04chunk26}
do.something(1:10, 3) # first input is a vector, output is also a vector
```
```{r chap04chunk27}
do.something(1:10, seq(1, 20, by = 2)) # both inputs are vectors, as is the output
```
For part (2), here's how we can show that `ifelse` is vectorized:
```{r chap04chunk28}
ifelse(2 > 1, 55, 0) # singleton condition and output
```
```{r chap04chunk29}
ifelse(0:5 > 1, 55, 0) # condition is vector, so is the output
```
```{r chap04chunk30}
ifelse(0:5 > 1, letters[1:6], LETTERS[1:6]) # all inputs are vectors, so is the output
```
However, `if` and `else` are not vectorized functions, as can be seen by the following example:
```{r chap04chunk31}
if(2 > 1) 55 else 0 # singleton condition works fine
```
```{r chap04chunk32}
if(0:5 > 1) 55 else 0 # a vector of conditions does not work (only the first element is considered, and recent versions of R throw an error)
```
This means that we generally use `ifelse` when we need to transform the data (e.g. create a new column) based on conditional statements, whereas we use `if` and `else` when we need to check a single condition, such as this:
```{r chap04chunk33}
if(length(dir(data_dir)) < 1) {
warning(sprintf("%s folder seems empty!", data_dir))
} else {
dir(data_dir)
}
```
## Tipping behavior
We now calculate the tipping percentage for every trip.
```{r chap04chunk34}
nyc_taxi <- mutate(nyc_taxi, tip_percent = as.integer(tip_amount / (tip_amount + fare_amount) * 100))
```
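We can get a first look at the result by tabulating the lower end of `tip_percent` (just a quick check of the counts):
```{r chap04extra05}
head(table(nyc_taxi$tip_percent), 20) # counts for the 20 lowest tip percentages
```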
The percentage of people who tipped nothing is a bit suspicious. The above table is useful, but it might be easier to see the distribution if we plot it: first the percentiles of `tip_percent`, and then a histogram. And since there's a good chance that the method of payment affects tipping, we break up the histogram by `payment_type`.
```{r chap04chunk35}
dp <- data.frame(x = seq(.01, .99, by = .01))
dp$y <- quantile(nyc_taxi$tip_percent, p = dp$x, na.rm = TRUE)
ggplot(aes(y = y, x = x*100), data = dp) +
geom_line()
```
```{r chap04chunk36}
library(ggplot2)
ggplot(data = nyc_taxi) +
geom_histogram(aes(x = tip_percent), binwidth = 1) + # show a separate bar for each percentage
facet_grid(payment_type ~ ., scales = "free") + # break up by payment type and allow different scales for 'y-axis'
xlim(c(-1, 31)) # only show tipping percentages between 0 and 30
```
The histogram confirms what we suspected: tipping is affected by the method of payment. However, it is hard to believe that people who pay cash simply don't tip. A more believable scenario is that cash customers tip too, but their tips do not get recorded in the system. In the next exercise, we try our hand at simulating tipping behavior for cash customers.
Instead of ignoring the tip amount for customers who pay cash, or pretending that it's really zero, in the next exercise we will write a function that uses a simple rule-based approach to determine how much they tip, and apply that function to the dataset. But before we do that, let's try an alternative to the rule-based method: a statistical technique for estimating tipping behavior. Here's one naive way of doing it:
Since even among card-paying customers, a small proportion don't tip, we can toss a coin and do as follows:
- With 5 percent probability the customer does not tip
- With 95 percent probability the customer tips, and the tip is a certain percentage of the fare amount plus a random component. More specifically, the tip is drawn from a normal distribution centered around 20 percent of the fare amount, with a standard deviation of 25 cents.
Here's how we can apply the above logic to the dataset:
```{r chap04chunk37}
nyc_taxi <- mutate(nyc_taxi,
toss_coin = rbinom(nrow(nyc_taxi), 1, p = .95), # toss a coin
tip_if_heads = rnorm(nrow(nyc_taxi), mean = fare_amount * 0.20, sd = .25),
tip_if_tails = 0, # if tails don't tip
tip_amount =
ifelse(payment_type == 'cash',
ifelse(toss_coin, tip_if_heads, tip_if_tails), # when payment method is cash apply the above rule
ifelse(payment_type == 'card', tip_amount, NA)), # otherwise just use the data we have
tip_percent = as.integer(tip_amount / (tip_amount + fare_amount) * 100), # recalculate tip percentage
toss_coin = NULL, # drop variables we no longer need
tip_if_heads = NULL,
tip_if_tails = NULL
)
```
Let's now visualize the percentage tipped by card and cash customers.
```{r chap04chunk38}
library(ggplot2)
ggplot(data = nyc_taxi) +
geom_histogram(aes(x = tip_percent), binwidth = 1) + # show a separate bar for each percentage
facet_grid(payment_type ~ ., scales = "free") + # break up by payment type and allow different scales for 'y-axis'
xlim(c(-1, 31)) # only show tipping percentages between 0 and 30
```
### Exercises
When processing a `data.frame` with R, **vectorized functions** show up in many places. Without them, our R code would be more verbose, and often (though not always) less efficient. Let's look at another example of this by examining the relationship between tipping and method of payment. Let's assume that most cash customers tip (but the amount they tip does not show up in the data). We further assume that tipping behavior for cash vs card customers differs in the following way:
- card customers might tip based on a certain percentage (automatically calculated when they swipe)
- cash customers might tip by rounding up (and thereby avoid getting small change)
For example, a card customer could tip 10 percent regardless of the fare amount, but a cash customer whose fare is \$4.65 would round up to \$6, and if the fare is \$26.32 they would round up to \$30. So the cash customer's tip is also proportional to the fare amount, but partly driven by the need to avoid getting change or doing the math. We want to find a way to simulate this behavior.
In other words, we want to write a function that calculates tip by *rounding up* the fare amount. Writing such a function from scratch is a little tedious. Fortunately, there is already a function in `base` R to help us:
```{r chap04chunk39}
findInterval(3.66, c(1, 3, 4.5, 6, 10))
```
Take a moment to inspect and familiarize yourself with the above function:
- What does the above function return?
- What are some ways the function could "misbehave"? In other words, check what the function returns when odd inputs are provided, including NAs.
```{r chap04chunk40}
findInterval(NA, c(1, 3, 4.5, 6, 10))
```
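Other inputs worth trying include values that fall below the first cut-off or above the last one:
```{r chap04extra06}
findInterval(0.5, c(1, 3, 4.5, 6, 10)) # below the first cut-off: returns 0
findInterval(99, c(1, 3, 4.5, 6, 10))  # above the last cut-off: returns 5, the length of the vector
```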
Let's break up the above code into two parts:
```{r chap04chunk41}
upper_limits <- c(1, 3, 4.5, 6, 10)
findInterval(3.66, upper_limits)
```
(1) Modify the last line so that we return the first number higher than the number we provide. In this case: the number we provide is 3.66, the first number higher than 3.66 is 4.5, so modify the code to return 4.5 only. (HINT: think of the line as the index to another vector.)
(2) Is the function `findInterval` vectorized? show by example.
(3) Wrap the above solution into a function called `round_up_fare` and test it with the following input:
```{r chap04chunk42, eval=FALSE}
sample_of_fares <- c(.55, 2.33, 4, 6.99, 15.20, 18, 23, 44)
round_up_fare(sample_of_fares, upper_limits)
```
Here's the result we expect to get: the first four fares should round up to 1, 3, 4.5, and 10 respectively, and fares above the highest limit (10) should return `NA`.
(4) Replace the statistical approach to simulating `tip_amount` for the cash customers with the rule-based approach implemented in the above function. In the data transformation above (under `nyc_taxi <- mutate(...)`), replace the line `tip_if_heads = rnorm(...)` with the transformation corresponding to the rule-based approach, as implemented by `round_up_fare`. Use the following fare round-up upper limits:
```{r chap04chunk43, eval=FALSE}
fare_intervals <- c(0:10, seq(12, 20, by = 2), seq(25, 50, by = 5), seq(55, 100, by = 10))
round_up_fare(23, fare_intervals)
```
Run the new transformation and recreate the plot, then comment on the new distribution.
### Solutions
(1) We want to return one of the elements of `upper_limits`. Which element we return is dynamically determined by `findInterval`, except that we need to add 1 to its result to get the upper limit (otherwise the lower limit is returned).
```{r chap04chunk44}
upper_limits[findInterval(3.66, upper_limits) + 1] # returns 4.5, the first element of upper_limits greater than 3.66
```
(2) The problem reduces to finding out if `findInterval` is vectorized. We simply feed a vector, `1:5` in this example, to `findInterval` and make sure that it returns a vector.
```{r chap04chunk45}
upper_limits[findInterval(1:5, upper_limits) + 1]
```
(3) Once we have the logic figured out, wrapping it into a neat function is usually the easy part. Here the function defaults to using `upper_limits` unless otherwise specified.
```{r chap04chunk46}
round_up_fare <- function(x, ul = upper_limits) {
  ul[findInterval(x, ul) + 1] # use the supplied cut-offs; fares above the highest cut-off return NA
}
sample_of_fares <- c(.55, 2.33, 4, 6.99, 15.20, 18, 23, 44)
round_up_fare(sample_of_fares)
```
(4) We just replace `tip_if_heads = rnorm(nrow(nyc_taxi), mean = fare_amount * 0.20, sd = .25)` with `tip_if_heads = round_up_fare(fare_amount, fare_intervals) - fare_amount`, then rerun the whole code chunk as well as the one after it to recreate the plot.
```{r chap04chunk47}
fare_intervals <- c(0:10, seq(12, 20, by = 2), seq(25, 50, by = 5), seq(55, 100, by = 10)) # fare round-up limits (defined here since the earlier chunk was not evaluated)
nyc_taxi <- mutate(nyc_taxi,
toss_coin = rbinom(nrow(nyc_taxi), 1, p = .95), # toss a coin
tip_if_heads = round_up_fare(fare_amount, fare_intervals) - fare_amount,
tip_if_tails = 0, # if tails don't tip
tip_amount =
ifelse(payment_type == 'cash',
ifelse(toss_coin, tip_if_heads, tip_if_tails), # when payment method is cash apply the above rule
ifelse(payment_type == 'card', tip_amount, NA)), # otherwise just use the data we have
tip_percent = as.integer(tip_amount / (tip_amount + fare_amount) * 100), # recalculate tip percentage
toss_coin = NULL, # drop variables we no longer need
tip_if_heads = NULL,
tip_if_tails = NULL)
library(ggplot2)
ggplot(data = nyc_taxi) +
geom_histogram(aes(x = tip_percent), binwidth = 1) + # show a separate bar for each percentage
facet_grid(payment_type ~ ., scales = "free") + # break up by payment type and allow different scales for 'y-axis'
xlim(c(-1, 31)) # only show tipping percentages between 0 and 30
```
It shouldn't be surprising that the rounding behavior results in a histogram with certain gaps between the bars, especially between the numbers 10 and 20.