
Paper Dataset


Collecting data for deceptive opinion spam detection is difficult because human labelling is only slightly better than random [1], and for this reason large-scale ground-truth data is hard to find. One available dataset provides 800 gold-standard, labelled reviews [1, 2]. These reviews are all deceptive and were written by paid, crowdsourced workers for popular Chicago hotels. Although small, this dataset is useful for assessing the performance of our model.

This research uses the dataset introduced by [3], the largest ground-truth dataset available to date. The deceptive reviews in this dataset are those filtered by Yelp's review software for being manufactured, solicited or malicious [4]. Yelp removed 7% of its reviews and marked 22% as not recommended [5], and acknowledges that its recommendation software makes errors [4]. The dataset is broken into three review sets [3]: one containing 67,395 hotel and restaurant reviews from Chicago, one containing 359,052 restaurant reviews from NYC, and a final one containing 608,598 restaurant reviews from a number of zip codes. Because the zip code set overlaps with the NYC set, and because there are significant differences between product review categories [6] (hotels and restaurants), we use only the zip code set to train our model.

There are many more genuine reviews than deceptive ones, so we extract 80,466 reviews from each class to create a balanced dataset; the full dataset contains 447,666 additional (unused) genuine reviews. On top of the words of the review text, the reviewer data allows us to derive a number of useful features, such as the maximum number of reviews the author has posted in a single day and the author's average review length. For supervised training we split the balanced data into a 61.25% training set, a 26.25% validation set, and a 12.5% test set, as sketched below.
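The following is a minimal sketch of this preprocessing using pandas and scikit-learn. It assumes the zip code review set has been exported to a hypothetical CSV file with illustrative column names (`user_id`, `date`, `review_text`, `label`); these names and the file name are assumptions for the example, not the actual schema of the [3] dataset.

```python
# Sketch of feature derivation, class balancing and the 61.25/26.25/12.5 split.
# Assumes a hypothetical CSV export with columns:
#   user_id, date, review_text, label  (label: 1 = deceptive/filtered, 0 = genuine)
import pandas as pd
from sklearn.model_selection import train_test_split

reviews = pd.read_csv("zip_code_reviews.csv", parse_dates=["date"])

# Reviewer-level features derived on top of the review text.
reviews["review_length"] = reviews["review_text"].str.split().str.len()
per_day = reviews.groupby(["user_id", reviews["date"].dt.date]).size()
max_per_day = per_day.groupby(level="user_id").max().rename("max_reviews_in_one_day")
avg_length = reviews.groupby("user_id")["review_length"].mean().rename("avg_review_length")
reviews = reviews.join(max_per_day, on="user_id").join(avg_length, on="user_id")

# Balance the classes by downsampling genuine reviews to the number of
# deceptive reviews (80,466 of each in our case), then shuffle.
deceptive = reviews[reviews["label"] == 1]
genuine = reviews[reviews["label"] == 0].sample(n=len(deceptive), random_state=42)
balanced = pd.concat([deceptive, genuine]).sample(frac=1, random_state=42)

# 61.25% train / 26.25% validation / 12.5% test: carve off the 12.5% test set
# first, then take 30% of the remaining 87.5% as validation (0.875 * 0.30 = 0.2625).
train_val, test = train_test_split(
    balanced, test_size=0.125, stratify=balanced["label"], random_state=42)
train, val = train_test_split(
    train_val, test_size=0.30, stratify=train_val["label"], random_state=42)

print(len(train), len(val), len(test))
```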

[1] M. Ott, Y. Choi, C. Cardie, and J. T. Hancock. 2011. Finding Deceptive Opinion Spam by Any Stretch of the Imagination.
[2] M. Ott, C. Cardie, and J. T. Hancock. 2013. Negative Deceptive Opinion Spam.
[3] S. Rayana and L. Akoglu. 2015. Collective Opinion Spam Detection: Bridging Review Networks and Metadata.
[4] https://www.yelpblog.com/2010/03/yelp-review-filter-explained
[5] https://www.yelp.com/factsheet
[6] J. Blitzer, M. Dredze, and F. Pereira. 2007. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification.
