
Evaluation of Statistical Classifiers


Update: It now appears that a high max-reviews-in-one-day statistic corresponds to a review being genuine. This would be the case if fake reviewers did not have many places at which to leave fake reviews in a single day.

Initially we create a Naive Bayes model to understand how it predicts deception. Derived features describing reviewer behaviour and the structure of the review were the most predictive, while sentiment and POS tags were only slightly predictive. To select features that work with this model, we graphed the probability of deception as the feature of interest changed (a sketch of this step follows the list below). This revealed three types of result:

  1. Features are predictive:

The most separable features are the maximum number of reviews in one day, where more reviews are a very strong indicator of a fake review, and the average length of a reviewer's reviews, where reviews with a longer average length are much more likely to be genuine.

  2. Features are not predictive:

This graph shows the proportion of words tagged with each POS tag. Naive Bayes fails to find a relationship between any of these proportions and the overall deceptiveness of the review, which explains why these features cannot improve the classifier's performance.

  3. Multicollinearity exists within our features:

Short words and short sentences are not as common in deceptive reviews. A short review, written in haste by a crowdsourced worker, is intuitively less likely to contain punctuation, and may convey its fake message without much thought given to structuring the review. It could also be argued that longer reviews have more room for shorter words: consider “Fantastic service”, where the average word length is 8. These three features could therefore be collinear. One of the assumptions of Naive Bayes is that multicollinearity does not exist in the data. Later we find that these features fail to significantly improve our classifier, and multicollinearity is likely the reason. Multicollinearity in the data is undesirable, and some classifiers handle it better than others.
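
As a rough illustration of the feature-selection step above, the sketch below fits a Naive Bayes model on a single derived feature and plots the predicted probability of deception across its range, then runs a quick collinearity check. The file path and column names are hypothetical stand-ins for our real derived features.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

# Hypothetical derived-feature table: one row per review,
# label = 1 for deceptive, 0 for genuine.
df = pd.read_csv("derived_features.csv")
feature = "max_reviews_per_day"

model = GaussianNB()
model.fit(df[[feature]].values, df["label"].values)

# Sweep the feature's range and plot P(deceptive) to judge separability.
grid = np.linspace(df[feature].min(), df[feature].max(), 200).reshape(-1, 1)
plt.plot(grid, model.predict_proba(grid)[:, 1])
plt.xlabel(feature)
plt.ylabel("P(deceptive)")
plt.show()

# Quick collinearity check on the length-related features discussed above.
print(df[["avg_word_length", "avg_sentence_length", "punctuation_count"]].corr())
```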

Switching to BoW without any derived features increased our accuracy significantly, from 62.5% to 67%. Experimentation suggests the Naive Bayes classifier suffers because, with such a large number of features, the original un-logged probability values would hit the underflow problem. As a result it is very difficult to improve performance; for example, tf-idf causes performance to decrease slightly.
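
A minimal sketch of this BoW baseline, assuming `reviews` is a list of review texts and `labels` the matching 0/1 deception labels (both names are assumptions, not our real code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
# `reviews` (list of texts) and `labels` (0/1 array) are assumed to exist.
bow_nb = make_pipeline(CountVectorizer(), MultinomialNB())
print(cross_val_score(bow_nb, reviews, labels, cv=5, scoring="accuracy").mean())
```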

Naive Bayes classified our sample in the following distribution:

Logistic Regression easily competes with Naive Bayes, and using tf-idf instead of BoW increases the accuracy to 68%. This is because it does not suffer from the same underflow problem as Naive Bayes.
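
A corresponding sketch with tf-idf and Logistic Regression, under the same assumed `reviews` and `labels` variables:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# tf-idf weighting feeding a linear Logistic Regression classifier.
tfidf_lr = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
print(cross_val_score(tfidf_lr, reviews, labels, cv=5, scoring="accuracy").mean())
```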

We can also see that Logistic Regression is capable of finding better meaning in the features, as shown by comparing how Naive Bayes and Logistic Regression each learn to predict from the standard deviation of user ratings:

We then assess which features will be beneficial with Logistic Regression, and find reviewer features to be the most separable:

The results for Logistic Regression are similar to those for Naive Bayes, except that this time the standard deviation of a user's ratings is also informative: highly varying ratings from a user tend to indicate that the user's reviews are genuine, and a higher percentage of positive reviews is a slight indicator of genuineness.
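
One way to carry out this assessment is to standardise the reviewer features and rank them by the magnitude of their Logistic Regression weights. A sketch, with illustrative feature names and the `df` table assumed from earlier:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical reviewer-behaviour columns from the derived-feature table.
features = ["max_reviews_per_day", "avg_review_length",
            "rating_std_dev", "pct_positive_reviews"]
X_rev = StandardScaler().fit_transform(df[features])
lr = LogisticRegression().fit(X_rev, df["label"])

# With standardised inputs, coefficient magnitudes are comparable.
for name, coef in sorted(zip(features, lr.coef_[0]), key=lambda t: -abs(t[1])):
    print(f"{name}: {coef:+.3f}")
```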

Combining the reviewer features with our tf-idf features allowed Logistic Regression to achieve a validation accuracy of 72%, significantly higher than Naive Bayes.
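
A sketch of this combination, stacking the sparse tf-idf matrix side by side with the numeric reviewer features (variable names carried over from the earlier sketches, so still assumptions):

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stack text features and derived reviewer features into one matrix.
X_text = TfidfVectorizer().fit_transform(reviews)
X_all = hstack([X_text, csr_matrix(df[features].values)]).tocsr()
y = df["label"].values

X_train, X_val, y_train, y_val = train_test_split(X_all, y, test_size=0.2,
                                                  random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_val, y_val))  # validation accuracy
```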

A visualisation of the classification contour produced by Logistic Regression is as follows:

The green marks in the graphs represent genuine reviews, and the red marks represent deceptive reviews. Logistic Regression finds the division shown above in the data samples. Although the samples are not perfectly separable, the maximum-likelihood solution is a linear decision boundary, and in this case it appears as a straight-line division of the samples.
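
A contour plot of this kind can be produced along the following lines: project the features onto 2 PCA components, fit the classifier in that plane, and colour the plane by its predictions. This is a sketch reusing the assumed variables from above, not our exact plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Project to 2 components purely for visualisation.
X2 = PCA(n_components=2).fit_transform(X_all.toarray())
clf2d = LogisticRegression().fit(X2, y)

# Evaluate the classifier over a grid covering the projected samples.
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min(), X2[:, 0].max(), 300),
                     np.linspace(X2[:, 1].min(), X2[:, 1].max(), 300))
zz = clf2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X2[y == 0, 0], X2[y == 0, 1], c="green", s=8, label="genuine")
plt.scatter(X2[y == 1, 0], X2[y == 1, 1], c="red", s=8, label="deceptive")
plt.legend()
plt.show()
```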

One noticeable characteristic of our samples is that a linear separation appears unlikely to achieve very high accuracy. This projection is a simplification of the space the samples actually occupy, but it is worthwhile to assess how they could be separated in a non-linear fashion.

We set up an SVM with a non-linear kernel (NuSVC), and tweaked the nu parameter using grid search to achieve the following separation:

This kernel appears much more capable of dividing the reviews; however, in practice it does not outperform the linear kernel. This is not apparent from the visualisations above, but is explainable: we are only viewing 2 components calculated by the PCA algorithm, and the linear hyperplane must find a better separation in the original high-dimensional space. This classifier achieves an accuracy between 71% and 72%.
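
A sketch of the nu search, assuming the same train/validation split as before; the grid values are illustrative rather than the exact ones we used, and infeasible nu values may need to be dropped for a given dataset.

```python
from sklearn.svm import NuSVC
from sklearn.model_selection import GridSearchCV

# nu bounds the fraction of margin errors and support vectors.
search = GridSearchCV(NuSVC(kernel="rbf", gamma="scale"),
                      {"nu": [0.2, 0.3, 0.4, 0.5]},
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_val, y_val))
```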

Finally, we make one last push to improve our classifier by using a linear kernel and tweaking the C parameter. For our samples, grid search found that a C value of 0.05 was the highest performer, with a validation accuracy between 74% and 75%.
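
The final tuning step might look like the following sketch, with the reported C=0.05 falling inside the searched grid (variable names again carried over as assumptions):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Smaller C means stronger regularisation of the linear hyperplane.
search = GridSearchCV(SVC(kernel="linear"),
                      {"C": [0.01, 0.05, 0.1, 0.5, 1.0]},
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)           # C=0.05 performed best for our samples
print(search.score(X_val, y_val))    # validation accuracy
```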
