
Data Science Hypothesis


Hypothesis

Considerations taken in forming a hypothesis

Should we self-annotate data to increase the size of our dataset?

Tempting, as neural models benefit greatly from large datasets. However, the model would then risk becoming biased towards our own labeller, and our reported accuracies may not reflect ground truth. The only way this could work is if we ran the SOTA methods over our personally labelled dataset and compared the accuracies, which seems to be far more effort than it is worth.

Should stance detection, a hot topic in fake news detection, be used as a feature?

After deliberation, no. Stance detection matters in fake news detection because there is a 'topic' to be agreed or disagreed with. However, in our domain of deceptive opinions on products and services, stance would more or less reduce to sentiment, so the benefits would be diminished.

Should we focus on pushing the SOTA in cross-domain adaptation?

Before discovering the low-hanging fruit that is Generative Adversarial Networks, we considered focusing on cross-domain adaptation through review-reviewer embeddings. However, there have been multiple papers on this approach, so the chance of finding novelty in cross-domain adaptation is low compared to Generative Adversarial Networks, where only one paper has been produced, and a very promising one at that.

Proposed hypothesis

We propose the use of generative adversarial networks [1] following the FakeGAN architecture [2] to aid in the detection of deceptive opinion spam [3]. Using the most important review- and reviewer-centric features [4, 11] combined with extensive feature importance selection, which has been shown to increase accuracy [4, 5], word and feature embeddings [14, 15], dimensionality reduction [12, 13], and transfer learning from kernels in the fake news detection [6] and email spam detection domains [7], we hope to build on the state-of-the-art accuracy with a novel approach.
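
For orientation, the sketch below shows a minimal adversarial training loop in PyTorch over pre-extracted review feature vectors. It is an assumption-laden illustration of the general GAN setup, not the FakeGAN implementation: FakeGAN itself generates token sequences and uses two discriminators, and the feature dimension, layer sizes, and optimiser settings here are placeholders.

```python
# Minimal sketch (not the FakeGAN implementation): a plain GAN over
# pre-extracted review feature vectors. FEATURE_DIM, NOISE_DIM, layer sizes,
# and the training loop are illustrative assumptions only.
import torch
import torch.nn as nn

FEATURE_DIM = 128   # assumed size of the review/reviewer feature vector
NOISE_DIM = 64      # assumed size of the generator's noise input

# Generator: maps noise to a synthetic "deceptive review" feature vector.
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 128), nn.ReLU(),
    nn.Linear(128, FEATURE_DIM),
)

# Discriminator: scores a feature vector as real vs. generated.
discriminator = nn.Sequential(
    nn.Linear(FEATURE_DIM, 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid(),
)

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_features: torch.Tensor) -> None:
    """One adversarial update on a batch of real review feature vectors."""
    batch = real_features.size(0)
    noise = torch.randn(batch, NOISE_DIM)
    fake_features = generator(noise)

    # Discriminator update: real -> 1, generated -> 0.
    opt_d.zero_grad()
    d_loss = bce(discriminator(real_features), torch.ones(batch, 1)) + \
             bce(discriminator(fake_features.detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make generated features look real.
    opt_g.zero_grad()
    g_loss = bce(discriminator(fake_features), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
```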

Possible proposed hypotheses

  • Building on FakeGAN [2] and the findings of Ott et al. in 2013 [8], two generators, corresponding to positive- and negative-sentiment deception, could be used to slow down convergence and thereby gain a few more points of accuracy.

  • Stance detection [6, 9], a feature found to be important in the domain of fake news detection [9], could be incorporated into our feature set for training.

  • The use of gradient boosted decision trees [10] in place of CNNs in our GAN [1] architecture (see the sketch after this list).
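
As a rough illustration of the last idea, the sketch below trains scikit-learn's GradientBoostingClassifier on a placeholder feature matrix standing in for our engineered review- and reviewer-centric features. The data, labels, and hyperparameters are assumptions, and boosted trees are not differentiable, so they could only replace the CNN on the classification side rather than pass gradients back to a generator.

```python
# Sketch only: a gradient-boosted decision tree classifier as a possible
# stand-in for the CNN discriminator, trained on engineered review/reviewer
# features. X and y are placeholders for our real feature matrix
# (e.g. rating deviation, review length, reviewer burstiness) and labels.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))      # placeholder feature matrix
y = rng.integers(0, 2, size=1000)    # placeholder deceptive/truthful labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
gbdt.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, gbdt.predict(X_test)))
# Feature importances could feed back into the feature-selection step
# described in the proposed hypothesis.
print("top features:", np.argsort(gbdt.feature_importances_)[::-1][:5])
```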

References:

[1] Goodfellow et al., 2014: 'Generative Adversarial Nets'

[2] Aghakhani et al., 2018: 'Detecting Deceptive Reviews using Generative Adversarial Networks'

[3] Jindal and Liu, 2008: 'Opinion Spam and Analysis'

[4] Crawford et al., 2015: 'Survey of review spam detection using machine learning techniques'

[5] Mukherjee et al., 2013: 'What Yelp Fake Review Filter Might Be Doing?'

[6] Ågren, 2018: 'Combating Fake News with Stance Detection using Recurrent Neural Networks'

[7] Faris et al., 2018: 'An intelligent system for spam detection and identification of the most relevant features based on evolutionary Random Weight Networks'

[8] Ott et al., 2013: 'Negative Deceptive Opinion Spam'

[9] Riedel et al., 2018: 'A simple but tough-to-beat baseline for the Fake News Challenge stance detection task'

[10] Hazim et al., 2018: 'Detecting opinion spams through supervised boosting approach'

[11] Mukherjee et al., 2012: 'Spotting fake reviewer groups in consumer reviews'

[12] Cagnina et al., 2017: 'Detecting Deceptive Opinions: Intra and Cross-Domain Classification Using an Efficient Representation'

[13] Cagnina et al., 2016: 'Classification of deceptive opinions using a low dimensionality representation'

[14] You et al., 2018: 'An Attribute Enhanced Domain Adaptive Model for Cold-Start Spam Review Detection'

[15] Wang et al., 2016: 'Learning to Represent Review with Tensor Decomposition for Spam Detection'

