StefanKennedy edited this page Mar 9, 2019 · 5 revisions

The Data

We need data that labels online user-submitted content as genuine or purposely falsified. We have found three sources of data.

Yelp Reviews

1,035,045 reviews, split into three sets: New York, Chicago, and a third 'zip' set.
Ground truth set, with labels derived from Yelp's spam detection.
Yelp's spam detection is regarded as accurate.

NYC

359,052 reviews, all for restaurants.
923 restaurants
160,225 reviewers
322,167 genuine, 36,885 fake

Chicago

67,395 reviews, split between restaurants and hotels.
201 hotels/restaurants
38,063 reviewers
5,854 hotel reviews / 61,541 restaurant reviews
~36,874 fake reviews

Zip

608,598 reviews, all for restaurants.
5,045 restaurants
260,277 reviewers
Organised by zip codes (NY, NJ, VT, CT, PA)

Amazon Reviews

Compiled as a result of an investigation into fake review production.
628 fake reviews, 942 real reviews
Gold standard set, from book authors who confessed to buying fake reviews.
"All reviews were between 50 and 150 words as a minimum length"

OpSpam reviews

Gold standard set?

1,600 reviews for 20 Chicago hotels. Each hotel has 20 reviews in each of 4 datasets:

400 truthful positive reviews from TripAdvisor (described in [1])
400 deceptive positive reviews from Mechanical Turk (described in [1])
400 truthful negative reviews from Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor and Yelp (described in [2])
400 deceptive negative reviews from Mechanical Turk (described in [2])
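The four OpSpam subsets above can be organised into a single labelled collection. A minimal sketch, assuming each review ends up as a record tagged with veracity and polarity (the loading itself is simulated here; the real corpus ships as per-hotel text files):

```python
# Sketch: assembling the four 400-review OpSpam subsets into one labelled list.
# The review text here is a placeholder; real code would read the corpus files.
subsets = [
    ("truthful", "positive"),   # TripAdvisor
    ("deceptive", "positive"),  # Mechanical Turk
    ("truthful", "negative"),   # Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor, Yelp
    ("deceptive", "negative"),  # Mechanical Turk
]

corpus = []
for veracity, polarity in subsets:
    for i in range(400):  # 400 reviews per subset
        corpus.append({
            "text": f"placeholder review {i}",
            "veracity": veracity,
            "polarity": polarity,
        })

print(len(corpus))  # 1600 reviews in total, 800 truthful and 800 deceptive
```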

Which dataset to use

We should think about which dataset to use in order to avoid poor performance later. For example, the Review Skeptic site uses a technique that can achieve 90% accuracy on a specific hotel review corpus; however, it does not achieve the same results cross-domain.

We have multiple categories of reviews (restaurants, hotels, books). There are significant differences between these categories. To use all of them together we must use multi-domain learning, and understand its implications.

Mahesh Joshi et al. Multi-Domain Learning: When Do Domains Matter?

Well known points

There are significant differences between product review categories (John Blitzer et al. Domain Adaptation for Sentiment Classification). Trained classifiers lose accuracy when the test data distribution is significantly different from the training data distribution.

Features may be distributed differently in different domains. This is the case of p(x) changing between domains. The result is that some features may only appear in a single domain.

Features may behave differently in different domains. This is the case of p(y|x) changing between domains. The result is that the learning algorithm cannot generalise feature behaviour across domains.

Is there a problem with interlacing all our data?

Possibly. A natural first step is to attempt to create our benchmark from a single domain. Afterwards, we can try interlacing the data to see what results we get. When we do this, we should add a 'domain' feature indicating which domain each review comes from (restaurants, hotels or books).
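The 'domain' feature can be appended to ordinary text features. A minimal sketch, assuming each review is a (text, domain) pair and using TF-IDF plus a one-hot domain indicator (the example reviews are invented placeholders):

```python
# Sketch: concatenating a one-hot 'domain' feature onto text features, so the
# classifier can condition on domain when p(y|x) differs between domains.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack

# Placeholder (text, domain) pairs; real data would come from the review sets.
reviews = [
    ("Great pasta, friendly staff", "restaurant"),
    ("Room was spotless and quiet", "hotel"),
    ("Best book I have ever read", "book"),
    ("Amazing food, will come back", "restaurant"),
]
texts = [text for text, _ in reviews]
domains = [[domain] for _, domain in reviews]

text_features = TfidfVectorizer().fit_transform(texts)
domain_features = OneHotEncoder().fit_transform(domains)

# Each row is now [tf-idf terms | restaurant | hotel | book].
X = hstack([text_features, domain_features])
print(X.shape)  # (4, vocabulary size + 3 domain columns)
```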

Which training set do we use?

The most trustworthy sets are those that are gold standard. The problem with these is that there is not enough data available in them.

We don't know how much the different domains will affect our results. It might be possible to train on our ground truth set of hotel reviews and test on the gold standard set, however we only have 5854 ground truth hotel reviews. We do not know how this will perform until we experiment with it.

Initial experiments could be:

Train & test on the single-domain 'ground truth' restaurant set from NYC.
Train & test on the multi-domain 'ground truth' restaurant sets from NYC and Chicago (domains are NYC and Chicago).
Train & test on the multi-domain 'ground truth' restaurant and hotel sets from NYC and Chicago (domains are NYC-Res, Chi-Res and Chi-Hotels).

And the following, testing on the 'gold standard' TripAdvisor hotel set:

Train on the single-domain 'ground truth' hotel set from Chicago.
Train on the multi-domain 'ground truth' hotel and restaurant set from Chicago.
Train on the multi-domain 'ground truth' hotel and restaurant sets from Chicago and NYC.

All of these training sets are in a different domain to the TripAdvisor hotel set.
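The cross-domain experiments above reduce to fitting on one domain's reviews and scoring on another's. A minimal sketch with toy placeholder texts and labels (real runs would load the ground truth and gold standard sets; nothing here is taken from the actual corpora):

```python
# Sketch: train a text classifier on one domain, evaluate on another.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy 'ground truth' training domain (restaurants); labels are placeholders.
train_texts = [
    "great tasty food and lovely pasta",
    "awful slow service and cold food",
    "friendly staff, wonderful dinner",
    "terrible rude staff, never again",
]
train_labels = [1, 0, 1, 0]  # 1 = genuine, 0 = fake (illustrative only)

# Toy 'gold standard' test domain (hotels) - a different domain to training.
test_texts = ["comfortable clean quiet room", "dirty noisy room, rude desk staff"]
test_labels = [1, 0]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Cross-domain accuracy; expected to drop relative to in-domain testing
# when p(x) or p(y|x) differs between the two domains.
accuracy = model.score(test_texts, test_labels)
```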

Relevant Datasets
