-
Notifications
You must be signed in to change notification settings - Fork 0
/
lab_answer.Rmd
325 lines (220 loc) · 11 KB
/
lab_answer.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
---
title: "My answers"
author: "My name"
date: '`r format(Sys.time(), "%d %B, %Y")`'
output: html_document
---
## Learning Goals
By the end of this tutorial you will be able to:
1. Transform text data to obey "tidy" data principles.
2. Summarize the length a set of texts.
3. Clean up text data by removing stop words, excess numbers, and lemmatize words.
4. Visualize differences in text across different pre-classified classes of text.
5. Use the insights from text analytics to answer questions that are managerially relevant and/or have marketing implications.
## Instructions to Students
These tutorials are **not graded**, but we encourage you to invest time and effort into working through them from start to finish.
Add your solutions to the `lab_answer.Rmd` file as you work through the exercises so that you have a record of the work you have done.
Obtain a copy of both the question and answer files using Git.
To clone a copy of this repository to your own PC, use the following command:
Once you have your copy, open the answer document in RStudio as an RStudio project and work through the questions.
The goal of the tutorials is to explore how to "do" the technical side of social media analytics.
Use this as an opportunity to push your limits and develop new skills.
## Exercise 1: Text Analytics with Fake Reviews
This exercise works with data on hotel reviews.
The reviews are a collection of truthful and deceptive (i.e. fake) reviews of 20 hotels in the Chicago area, known in the computational linguistics community as the Deceptive Opinion Spam dataset.^[
The data originally was published in the paper "Finding Deceptive Opinion Spam by Any Stretch of the Imagination" by M. Ott, Y. Choi, C. Cardie, and J.T. Hancock in 2011 in the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
]
Deceptive reviews are reviews that have been written by someone who has not stayed at the hotel they are reviewing.
The data contains 1600 reviews:
* 400 truthful, positive reviews from TripAdvisor
* 400 deceptive positive reviews from Mechanical Turk
* 400 truthful, negative reviews from Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor, and Yelp
* 400 deceptive negative reviews from Mechanical Turk
The data is saved as part of this repository, stored in `data/reviews.csv`:
You might need to use the following `R` libraries throughout this exercise:^[
If you haven't installed one or more of these packages, do so by entering `install.packages("PKG_NAME")` into the R console and pressing ENTER.
]
```{r, eval = TRUE, message=FALSE, warning=FALSE}
library(readr)
library(dplyr)
library(tibble)
library(tidyr)
library(stringr)
library(tidytext)
library(ggplot2)
library(magrittr)
library(textstem)
library(reshape2)
library(wordcloud)
```
1. Assuming online reviews are all truthful, explain whether they should benefit or harm consumers and the hotels themselves.
Write your answer here
2. If fake reviews are mixed in with truthful reviews on a website (for example, TripAdvisor) identify two reasons why their presence may harm consumers, hotels and the review website.
Explain the mechanisms through which the harm is done.
Write your answer here
3. Load the data located in `data/reviews.csv` into a data frame called `hotel_reviews`.
```{r}
# Write your answer here
```
4. In what follows we will need each hotel review (i.e. each row of the data) to have a unique identifier.
Create a new column called `ID` in the dataset that has a unique identifier for each review.
HINT: the function `row_to_column()` function will help you do this.
```{r}
# Write your answer here
```
5. Convert the dataset to conform with tidy data principles:
* Each variable is a column
* Each observation is a row
* Every cell is a single value
Name the resulting dataset `tidy_reviews`.
```{r}
# Write your answer here
```
6. Create a dataset called `review_length` that has two columns, a column that identifies the review and a column that counts the number of words in the review.
Merge the the `review_length` data into the `hotels_df` dataset.
```{r}
# Write your answer here
```
7. Are deceptive reviews longer than truthful reviews?
If you presented this result to a marketing manager that has little analytics training what do you expect they would say about the possibility of detecting deceptive reviews?
Would you agree or disagree?
```{r}
# Write your answer here
```
8. What are stop words?
Explain why you would want to remove them from the review text.
Write your answer here
9. Remove stop words from each review using a default stop word list. Name the dataset with no stop words `tidy_reviews_no_stop`.
What are the 20 most common words in the dataset?
```{r}
# Write your answer here
```
10. Create a custom stop word list that contains words you might want to remove in addition to the default stop word list.
Remove these words from `tidy_reviews_no_stop`.
```{r}
# Write your answer here
```
11. Also remove any numbers from `tidy_reviews_no_stop`.
The following code can be used to get started:
```{r, eval = FALSE}
# First, find all numbers:
nums <- YOUR_CODE %>%
# find all numbers in the data
filter(str_detect(word, "^[0-9]")) %>%
YOUR_CODE
# now remove those:
tidy_reviews_no_stop <- YOUR_CODE
```
```{r}
# Write your answer here
```
The next step we want to perform is to lemmatize each word.
The goal of lemmatizing is to reduce inflectional forms and some common derivationally related forms of a word to a base word.
For example:
* am, are, is $\Rightarrow$ be
* car, cars, car's, cars' $\Rightarrow$ car
To lemmatize each word in the reviews we will use the `lemmatize_words()` from the `textstem package`:^[
An alternative would be to "stem" each word, which essentially chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.
]
```{r}
# Write your answer here
```
12. Create a plot that graphs the frequency a word is used in deceptive reviews on the x-axis against the frequency the same word is used on the y-axis.
To help you in coding up this plot, your final result should look like the following:^[
Depending on your custom stopword list in (10) you might see slight differences between your figure and this one.
]
```{r, echo = FALSE, eval = TRUE, fig.align="center", out.width="75%"}
knitr::include_graphics('figs/reviews_plot-01.png')
```
Perform the following steps:
(a) Create a dataset called `wrd_frequency` that has three columns: a column `word` that contains all the unique words, a column `deceptive` that has the percentage of deceptive reviews that each word is included in and a column `truthful` that has the percentage of truthful reviews that each word is included in.
(b) Create the plot described above.
Include a 45 degree line for reference.
To assist you in creating the plot, use the following starter code:
```{r, eval = FALSE}
# part (a)
wrd_frequency <- tidy_reviews_lemma %>%
count(YOUR_CODE) %>%
group_by(YOUR_CODE) %>%
mutate(YOUR_CODE) %>%
spread(YOUR_CODE)
# part (b)
wrd_frequency %>%
ggplot(aes(x = YOUR_CODE,
y = YOUR_CODE,
label = YOUR_CODE
)) +
geom_YOURCODE(alpha = 0.15,
size = 2.5
) +
geom_text(aes(label = YOUR_CODE),
check_overlap = TRUE,
vjust = 1.5
) +
geom_abline(YOUR_CODE,
lty = 2,
color = "grey40"
) +
YOUR_CODE
```
```{r}
# Write your answer here
```
An alternative way to visualize the differences and commonalities in word use between truthful and deceptive reviews is to construct word clouds.
To assist you in creating the plots, use the following starter code for each of the next two questions:
```{r, eval=FALSE}
tidy_reviews_lemma %>%
count(YOUR_CODE,
YOUR_CODE,
sort = TRUE
) %>%
acast(YOUR_CODE ~ YOUR_CODE,
value.var = "n",
fill = 0
) %>%
YOUR_CODE
```
13. Create a word cloud that visualizes the differences in word use between deceptive and truthful reviews.
Use a maximum of 75 words in the word cloud.
```{r}
# Write your answer here
```
14. Create a word cloud that visualizes the commonalities in word use between deceptive and truthful reviews.
Use a maximum of 75 words in the word cloud.
```{r}
# Write your answer here
```
15. Provide a brief summary of what you learn from the three plots. (Max. 10 sentences)
Write your answer here
16. (Harder!) Suppose you were working in an analytics team that wanted to build a predictive model to detect fake reviews on a platform such as TripAdvisor.
(a) Can you sketch out a method/model to predict whether a review is fake? Your method should be scaleable so that it can handle thousands of new reviews per day.
You can use any combination of words, figures, simple equations or pseudocode to explain your thoughts.
Write your answer here
(b) How could you assess how successful the model is at detecting fake reviews?
Write your answer here
(c) Suppose the model was adopted by the platform.
If you detected a newly posted review was fake, would you want to remove it from the platform?
Or would you leave the review online with a notification that this review might be fake?
Explain.
Write your answer here
Exercise 2: Online Review Manipulation
In their 2014 AER article, titled "Promotional Reviews: An Empirical Investigation of Online Review Manipulation." Mayzlin, Dover and Chevalier investigate online review manipulation by hotels.^[
Mayzlin, Dina, Yaniv Dover, and Judith Chevalier. 2014. "Promotional Reviews: An Empirical Investigation of Online Review Manipulation." American Economic Review, 104 (8): 2421-55.
]
This exercise reviews the research design and main findings of this article.
1. List and explain the two main arguments that support the idea that online reputation is most useful when it is not tainted by fake reviews.
Write your answer here.
2. What are the research questions that the paper tries to answer.
Write your answer here.
3. Explain the research design that the authors adopt to answer the research question. In what ways is it similar to a difference in differences design? In what ways is it different?
Write your answer here.
4. Provide an intuitive explanation of why the research design "works".
Write your answer here.
5. Explain the regression equation that the authors use to obtain their main results. What are the variables included? Of these, what are the explanatory variables that directly relate to the research question(s)?
Write your answer here.
6. Interpret the regression results in Table 3 of the main text and provide a brief summary of the results.
Write your answer here.
7. What are the managerial implications of the results in this paper (for hotel managers and for review platforms)?
Write your answer here.
8. In the article, the authors argue that fake/deceptive reviews are difficult to detect. Based on your analysis in Exercise 1, do you agree with their presumption? Explain.
Write your answer here.