---
title: "Tidyverse Recipe"
author: "Maliat Islam"
date: "4/8/2021"
output:
  html_document:
    code_folding: "hide"
  prettydoc::html_pretty:
    theme: leonids
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Tidyverse Recipe:
### The Tidyverse is a collection of R packages. Loading it with `library(tidyverse)` attaches ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats.
### Forcats and ggplot2:
#### To demonstrate the Tidyverse, I have selected the forcats and ggplot2 packages; dplyr is used as well. The data is the Disney movies gross income dataset (1937-2016) from Kaggle.
#### The purpose of this analysis is to categorize Disney movies according to their genre and to analyze their gross income.
#### https://www.kaggle.com/rashikrahmanpritom/disney-movies-19372016-total-gross
```{r d, warning=FALSE, message=FALSE, results='hide'}
library(kableExtra)
library(data.table)
library(tidyverse)   # attaches dplyr, forcats, ggplot2, stringr, tibble, ...

url <-
  "https://raw.githubusercontent.com/maliat-hossain/FileProcessing/main/disney_movies_total_gross.csv"
disney_movies_total_gross <-
  read.csv(url)

head(disney_movies_total_gross) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped",
                font_size = 10) %>%
  scroll_box(height = "500px", width = "100%")
```
#### Only the necessary rows and columns are selected with the Tidyverse package dplyr. For this assignment I am focusing on the Disney movies released from 1937 to 1961, which are the first 10 rows of the dataset.
```{r e}
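DisneyMovies <-
  disney_movies_total_gross %>%
  dplyr::select(1)   # first column: movie title
# Row-subsetting a one-column data frame returns a plain vector of titles
DisneyMovies1 <-
  DisneyMovies[1:10, ]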
```
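As a minimal sketch of the same selection written with column names instead of positions, the block below uses a small made-up data frame (`toy_movies` and its values are hypothetical; only the column name `movie_title` is taken from the dataset). `slice_head()` and `pull()` are dplyr functions, so the result is a plain character vector, as above.

```r
library(dplyr)

# Hypothetical stand-in for the Disney data; values are invented for illustration.
toy_movies <- data.frame(
  movie_title = c("Movie A", "Movie B", "Movie C", "Movie D"),
  total_gross = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Select the title column by name, keep the first three rows,
# and extract the column as a vector.
first_titles <- toy_movies %>%
  select(movie_title) %>%
  slice_head(n = 3) %>%
  pull(movie_title)

first_titles
```

Selecting by name keeps the code working even if the column order of the CSV changes.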
#### The movie titles are converted to a factor so that categories can be assigned. The movies are categorized as Musical, Adventure, Comedy, and Drama. Forcats from the Tidyverse works really well for manipulating categorical variables.
```{r f}
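DisneyMovies2 <-
  factor(DisneyMovies1)
# view() only opens an interactive viewer, so the factor is wrapped
# in a data frame and passed straight to kable()
data.frame(movie_title = DisneyMovies2) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped",
                font_size = 10) %>%
  scroll_box(height = "500px", width = "100%")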
```
```{r g}
DisneyMovies2<-
fct_recode(DisneyMovies2,
Musical="Snow White and the Seven Dwarfs",
Adventure="Pinocchio",
Musical="Fantasia",
Adventure="Song of the South",
Drama="Cinderella",
Adventure="20,000 Leagues Under the Sea",
Drama="Lady and the Tramp",
Drama="Sleeping Beauty",
Comedy="101 Dalmatians",
Comedy="The Absent Minded Professor")
```
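When several old levels map to the same new level, `fct_collapse()` from forcats is an alternative to repeating the new name in `fct_recode()`. The sketch below uses invented film names (`titles` is hypothetical), not the actual Disney titles.

```r
library(forcats)

# Hypothetical titles, used only to illustrate the many-to-one mapping.
titles <- factor(c("Film A", "Film B", "Film C", "Film D"))

# Each new level lists the old levels that should be merged into it.
genres <- fct_collapse(titles,
  Musical   = c("Film A", "Film C"),
  Adventure = "Film B",
  Drama     = "Film D"
)

table(genres)
```

Each observation keeps its position; only the level labels change.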
#### The total gross income column for these movies has been added.
```{r h}
DisneyMovies3<-
disney_movies_total_gross %>%
dplyr::select(1,5)
DisneyMovies3<-
DisneyMovies3[1:10,]
```
#### Summary statistics for the total gross revenue of these Disney movies have been calculated.
```{r i}
summary(DisneyMovies3)
```
#### case_when from dplyr is used for binning the gross income of the movies. A variable named comparison_movies has been created, which shows whether the gross income of a selected movie is "Below Average", "Around Average", or "Above Average". The cut-off values are taken from the summary statistics above.
```{r j}
DisneyMovies4 <-
  DisneyMovies3 %>%
  # case_when() checks the conditions in order, so the second branch
  # covers 81219150 up to (but not including) 83810000
  mutate(comparison_movies = case_when(
    total_gross < 81219150 ~ "Below Average",
    total_gross < 83810000 ~ "Around Average",
    TRUE ~ "Above Average")) %>%
  select(movie_title, total_gross, comparison_movies)
```
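The cut-offs above are typed in by hand from the summary output. As a hedged alternative, the sketch below derives them from the data itself. The toy values and the 10% tolerance (`band`) are assumptions made for illustration, not the thresholds used in this analysis.

```r
library(dplyr)

# Hypothetical gross figures, chosen so the mean is exactly 100.
toy_gross <- data.frame(
  movie_title = c("A", "B", "C", "D", "E"),
  total_gross = c(50, 95, 100, 105, 150),
  stringsAsFactors = FALSE
)

avg  <- mean(toy_gross$total_gross)
band <- 0.10   # assumed tolerance: within 10% of the mean counts as "around"

toy_binned <- toy_gross %>%
  mutate(comparison_movies = case_when(
    total_gross < avg * (1 - band) ~ "Below Average",
    total_gross > avg * (1 + band) ~ "Above Average",
    TRUE ~ "Around Average"
  ))

toy_binned
```

Computing the thresholds this way means the bins update automatically if a different set of rows is selected.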
```{r a}
DisneyMovies4 %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped",
                font_size = 10) %>%
  scroll_box(height = "500px",
             width = "100%")
```
#### The income status of the selected movies is visualized with a bar plot. Each color represents a different income status.
```{r b}
ggplot(data = DisneyMovies4,aes(x = movie_title,fill = comparison_movies))+
geom_bar(position = "dodge")+
coord_flip()
```
### Conclusion
#### The plot shows that most of the selected Disney movies, released from 1937 to 1961, earned above average.
--------------------------------------------------------------------------------
### Extension of code by Tage N Singh, April 24, 2021
```{r tns_extension}
library(ggplot2)
library(latex2exp)
dis_mov <- as.data.frame(disney_movies_total_gross)
dis_mov$genre[dis_mov$genre == ""] <- "Unknown" # replacing blank genre values
dis_genre <- dis_mov %>%
  select(genre, total_gross, inflation_adjusted_gross) %>%
  group_by(genre) # grouping by genre
# mean gross and mean inflation-adjusted gross per genre
dis_genre_tots <- aggregate(cbind(total_gross, inflation_adjusted_gross) ~ genre, data = dis_genre, FUN = mean)
# convert to millions of dollars
dis_genre_tots$total_gross <- dis_genre_tots$total_gross / 1000000
dis_genre_tots$inflation_adjusted_gross <- dis_genre_tots$inflation_adjusted_gross / 1000000
head(dis_genre_tots, 13)
summary(dis_genre_tots)
```
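The same per-genre means can also be sketched in pure dplyr with `group_by()` and `summarise()`. The data frame below is a hypothetical stand-in with invented values; only the column names follow the dataset.

```r
library(dplyr)

# Hypothetical rows; one has a blank genre, as in the real data.
toy_genre <- data.frame(
  genre = c("Comedy", "Comedy", "Drama", ""),
  total_gross = c(2e6, 4e6, 6e6, 8e6),
  inflation_adjusted_gross = c(4e6, 8e6, 1.2e7, 1.6e7),
  stringsAsFactors = FALSE
)

genre_means <- toy_genre %>%
  mutate(genre = if_else(genre == "", "Unknown", genre)) %>%  # label blank genres
  group_by(genre) %>%
  summarise(across(c(total_gross, inflation_adjusted_gross),
                   ~ mean(.x) / 1e6))                         # mean, in millions

genre_means
```

This keeps the cleaning, grouping, and scaling in one pipeline instead of mixing `aggregate()` with manual column updates.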
```{r tns_extension_2}
library(lubridate)
str(disney_movies_total_gross$release_date) # testing the format for date
# the year is taken as the last four characters of release_date
disney_movies_total_gross$year <- str_sub(disney_movies_total_gross$release_date, start = -4)
dis_mov2 <- data.frame(disney_movies_total_gross)
dis_year <- dis_mov2 %>%
  select(year, total_gross, inflation_adjusted_gross) %>%
  group_by(year) # grouping by year
# mean gross and mean inflation-adjusted gross per release year
dis_year_tots <- aggregate(cbind(total_gross, inflation_adjusted_gross) ~ year, data = dis_year, FUN = mean)
# convert to millions of dollars, rounded to whole numbers
dis_year_tots$total_gross <- round(dis_year_tots$total_gross / 1000000, 0)
dis_year_tots$inflation_adjusted_gross <- round(dis_year_tots$inflation_adjusted_gross / 1000000, 0)
dis_year_tots$year <- as.numeric(as.character(dis_year_tots$year))
summary(dis_year_tots)
str(dis_year_tots)
ggplot(dis_year_tots, aes(x = year)) +
  geom_line(aes(y = total_gross), color = "#00AFBB", size = 2) +
  geom_line(aes(y = inflation_adjusted_gross), color = "red", size = 1) +
  ggtitle("Comparison of Gross and Adjusted Gross Sales (millions of dollars)")
```
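Since lubridate is loaded above, the year could also be extracted by parsing the date rather than by cutting characters. This sketch assumes the dates are in month/day/year form; the three sample strings are made up for illustration.

```r
library(lubridate)

# Hypothetical release dates in month/day/year format (an assumption).
release_date <- c("12/21/1937", "2/9/1940", "11/13/1940")

# mdy() parses the strings into Date objects; year() pulls out the year.
release_year <- year(mdy(release_date))

release_year
```

Parsing first has the advantage that malformed dates become `NA` instead of silently producing a wrong year.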
## But wait - there's more we can do with Forcats (Eric Hirsch revision)
In addition to creating categories as shown above, the forcats package helps us solve many other problems related to the display of categorical variables. For example:

1. How do we display a category by its frequency?
2. How can we reduce our categories by creating an "other" category?
3. How can we order a category by another variable?

Or for even more advanced forcats functionality:

4. How can we make our categories anonymous?
5. How can we shuffle our categories in random order?

#### 1. Display a category by its count frequency - we use fct_infreq(), which orders the levels from most to least frequent. We also use fct_rev() to reverse that order, so that after coord_flip() the most frequent genre appears at the top of the chart:
```{r freq a}
Disney5 <-
disney_movies_total_gross %>%
filter(genre!="")
(g1 <-
ggplot(Disney5, aes(x=fct_rev(fct_infreq(genre)))) +
geom_bar() +
coord_flip() +
ggtitle("Total Counts by Genre") +
ylab("Counts") +
xlab("Genre"))
```
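To see what these two functions do to the level order without plotting, here is a small sketch on a made-up factor (the `genre` vector is hypothetical):

```r
library(forcats)

# Hypothetical genres: Comedy appears 3 times, Drama 2, Adventure 1.
genre <- factor(c("Drama", "Comedy", "Comedy", "Adventure", "Comedy", "Drama"))

levels(genre)                       # default: alphabetical
levels(fct_infreq(genre))           # most frequent first
levels(fct_rev(fct_infreq(genre)))  # least frequent first
```

ggplot2 draws levels in this order along the axis, which is why the reversed version puts the largest bar on top once the coordinates are flipped.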
#### 2. Create an "other" category to collect together the smaller categories.
Forcats has many ways to do this, with many options for choosing which categories to collect. Here we use fct_lump() with n = 5, which keeps the five most frequent genres and combines all the others into "Other":
```{r freq f}
Disney6 <-
disney_movies_total_gross %>%
filter(genre!="") %>%
mutate(genre = fct_lump(genre, n=5))
(g1 <-
ggplot(Disney6, aes(x=fct_rev(fct_infreq(genre)))) +
geom_bar() +
coord_flip() +
ggtitle("Total Counts by Genre") +
ylab("Counts") +
xlab("Genre"))
```
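A minimal sketch of the lumping on a made-up factor, using `fct_lump_n()`, the more specific forcats function for the "keep the top n" case (the `genre` vector is hypothetical):

```r
library(forcats)

# Hypothetical genres: two common levels and two rare ones.
genre <- factor(c(rep("Comedy", 4), rep("Drama", 3), "Musical", "Western"))

# Keep the 2 most frequent levels; everything else becomes "Other".
lumped <- fct_lump_n(genre, n = 2)

table(lumped)
```

The "Other" level is placed last, so it usually ends up at one end of a bar chart.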
#### 3. Reorder a category based on another variable - here we use fct_reorder() to order our genres by the revenue generated:
```{r freq b}
Disney6 <- disney_movies_total_gross %>%
group_by(genre) %>%
filter(genre!="") %>%
summarize(Revenue= sum(round(total_gross/1000000)))
ggplot(Disney6, aes(x=fct_reorder(genre, Revenue), y=Revenue)) +
geom_col() +
coord_flip() +
ggtitle("Total Revenue By Genre") +
ylab("Revenue (in millions)") +
xlab("Genre")
```
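The effect of `fct_reorder()` on the levels can be checked on a tiny made-up example (the genres and revenue values are hypothetical):

```r
library(forcats)

# Hypothetical genres with one revenue value each.
genre   <- factor(c("Adventure", "Comedy", "Drama"))
revenue <- c(300, 100, 200)

# Levels are sorted by the second argument, in ascending order by default.
reordered <- fct_reorder(genre, revenue)

levels(reordered)
```

With several values per level, `fct_reorder()` summarises them first (the median by default) and sorts by that summary.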
#### 4. Make a category anonymous - we use fct_anon():
Imagine every movie has one chief hair stylist who gets rated 1-10 for each movie they work on. Management wants to analyze these ratings for trends compared to the previous year and plans to present the findings at a general staff meeting. However, management is interested in trends, not individual performance, and would like you to hide the individual names from the graph.
First we show the graph as it would appear without anonymizing:
```{r freq d}
set.seed(12348)
# Simulated stylists and ratings; n() is the number of rows in the data
Disney7 <- disney_movies_total_gross %>%
  mutate(hair_stylist = factor(sample(letters[1:15], n(), replace = TRUE))) %>%
  mutate(hair_stylist_rating = sample(10, n(), replace = TRUE)) %>%
  group_by(hair_stylist) %>%
  summarize(AveRating = mean(hair_stylist_rating))
(g1 <-
ggplot(Disney7, aes(x=fct_reorder(hair_stylist, AveRating), y=AveRating)) +
geom_col() +
coord_flip() +
ggtitle("Average Hairstylist Ratings") +
ylab("Ratings") +
xlab("Hair Stylists"))
```
With fct_anon we can make categories anonymous simply and effectively:
```{r c}
Disney7$hair_stylist2 <-
fct_anon(Disney7$hair_stylist, "hair_stylist_")
(g1 <-
ggplot(Disney7, aes(x=fct_reorder(hair_stylist2, AveRating), y=AveRating)) +
geom_col() +
coord_flip() +
ggtitle("Average Hairstylist Ratings") +
ylab("Ratings") +
xlab("Hair Stylists"))
```
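As a small check of what `fct_anon()` guarantees, the sketch below anonymizes a made-up factor (the names are hypothetical). The assignment of IDs is random, so only the structure is inspected, not which name gets which ID.

```r
library(forcats)

set.seed(1)  # fct_anon() assigns IDs at random

# Hypothetical stylist names; "alice" appears twice.
stylist <- factor(c("alice", "bob", "carol", "alice"))

anon <- fct_anon(stylist, prefix = "stylist_")

levels(anon)
anon
```

Observations that shared a name still share an anonymous ID, so counts and group summaries are unaffected.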
#### 5. For our last piece of functionality we will use fct_shuffle() to randomly shuffle our category order.
The Hairstylist Review committee is holding their monthly meeting where hairstylists will present their latest ideas. You always put the presentation list in alphabetical, reverse alphabetical, or rating order, and stylists 'e', 'f' and 'g' are demanding your resignation since they never get to go first. Senior management asks you to randomize the order, and you can do it easily with fct_shuffle().
```{r freq e}
Disney5 <-
disney_movies_total_gross %>%
group_by(genre) %>%
mutate(Revenue= sum(total_gross)) %>%
filter(genre!="")
ggplot(Disney7, aes(x=fct_shuffle(hair_stylist), y=AveRating)) +
geom_col() +
coord_flip() +
ggtitle("Presentation Order - with Hairstylists and their Ratings") +
ylab("Ratings") +
xlab("Hair Stylists")
```
Alas, "f" is still near the bottom of the list, but random is random.
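A short sketch of what `fct_shuffle()` changes and what it leaves alone, on a made-up factor (the `stylist` vector and the seed are arbitrary):

```r
library(forcats)

set.seed(42)  # makes the shuffle reproducible

# Hypothetical stylists 'a' to 'e'.
stylist <- factor(letters[1:5])

shuffled <- fct_shuffle(stylist)

levels(shuffled)        # same levels, in a random order
as.character(shuffled)  # the values themselves are unchanged
```

Only the level order is permuted, so the data stay attached to the right stylist while the plotting order changes.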
### Conclusion 2
Factors make categories easy to use in R, and forcats makes it easy to manipulate them.
### <span style="color:blue">Tidyverse Extend</span>
<span style="color:blue">Selecting movies released from 1937 to 1961 can also be done with the `filter()` function from the `dplyr` package, as shown below. This is intended to select the same rows as `DisneyMovies[1:10,]` and `select(1)`. The same pipeline can also keep the `total_gross` column and create the `comparison_movies` column, all in one chunk.</span>
<span style="color:red">**NOTE** The date, which is presently in *month/day/year* format, is most likely of class character. This can be verified with the function `class()` as shown below. In order to filter on the dates, they are converted on the fly with `as.Date()` inside `filter()`.</span>
```{r}
class(disney_movies_total_gross$release_date)
(DisneyMovies1_ext<-disney_movies_total_gross%>%
select(movie_title,release_date,total_gross)%>%
filter(
as.Date(release_date,format = "%m/%d/%Y") > "1937-1-1" &
as.Date(release_date,format = "%m/%d/%Y") < "1961-12-1" )%>%
mutate(comparison_movies=case_when(
total_gross < 81219150 ~ "Below Average",
total_gross > 81219150 & total_gross <83810000 ~ "Around Average",
TRUE ~ "Above Average"))%>%
select(movie_title,release_date, total_gross,comparison_movies)
)
```
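As a hedged variant of the date filter, the sketch below converts the date once with `mutate()` and compares against explicit `Date` values. The toy rows are invented, and the window used here (1 January 1937 through 31 December 1961) is an assumption about the intended range.

```r
library(dplyr)

# Hypothetical rows with month/day/year strings.
toy_dates <- data.frame(
  movie_title  = c("A", "B", "C"),
  release_date = c("12/21/1937", "3/16/1961", "6/1/1975"),
  stringsAsFactors = FALSE
)

toy_window <- toy_dates %>%
  mutate(release = as.Date(release_date, format = "%m/%d/%Y")) %>%  # parse once
  filter(release >= as.Date("1937-01-01"),
         release <= as.Date("1961-12-31"))

toy_window
```

Comparing `Date` to `Date` avoids relying on R to coerce the character bounds during the comparison.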
<span style="color:blue"> The `summary()` function can still be used with this larger data.frame, but the `total_gross` column needs to be extracted with `$`.</span>
```{r}
summary(DisneyMovies1_ext$total_gross)
```
<span style="color:blue">The interesting thing about `ggplot()` is that the `fill` aesthetic essentially treats a column as a factor *if* a column name is mapped to it instead of a specific color.</span>
```{r}
ggplot(data = DisneyMovies1_ext,aes(x = movie_title,fill = comparison_movies))+
geom_bar(position = "dodge")+
coord_flip()
```
<span style="color:blue"> My favorite feature of the `dplyr` package is the ability to pipe with `%>%` inside another function call. As an example, I piped the data in the same way it was used to create the `Disney5` data.frame, only I did so from within the `ggplot()` call, and I colored the bars by genre using `fill`.</span>
```{r}
(g1 <-
ggplot(disney_movies_total_gross %>%
filter(genre!=""),
aes(x=fct_rev(fct_infreq(genre)),fill = genre)) +
geom_bar() +
coord_flip() +
ggtitle("Total Counts by Genre") +
ylab("Counts") +
xlab("Genre"))
```
<span style="color:blue"> Finally, `ggplot()` has various features that can really enhance the visualizations I create. In this very simple example, I add the counts as labels to the plot that originally used the `Disney6` data.frame, which, depending on the circumstances, can add value to the visual representation.</span>
```{r}
(g1 <-
ggplot(disney_movies_total_gross %>%
filter(genre!="") %>%
mutate(genre = fct_lump(genre, n=5)),
aes(x=fct_rev(fct_infreq(genre)), fill = genre)) +
geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)), hjust = 1) +
coord_flip() +
ggtitle("Total Counts by Genre") +
ylab("Counts") +
xlab("Genre"))
```