% XK3 Portfolio Project -- Evaluation & Methodological Documentation
% Richard Bergmair
% Jan-31 2017
As part of this hiring exercise, I was asked to create a spam classifier to identify web spam. The point of departure was data on a set of domains, where each domain was represented in the data through a domain name, a label identifying whether or not the domain was to be considered spam, and an html page taken to be representative for the content to be found under that domain.
Given the very stringent time limit of 20 hours, the present submission represents nothing more than what I thought to be the smallest nontrivial, coherent set of software modules needed to address the issue. It should be taken as a mere point of departure for serious work on the problem. High-quality software engineering was given priority over the adequacy of the overall approach for addressing a highly nontrivial problem.
As a placeholder for a core machine learning component, I implemented into this software framework a Bernoulli naive Bayes classifier.
This means that page content on each domain is represented for machine learning purposes through the following two kinds of features:
- Each word in the page's text content is a feature.
- The host-part of each domain that the page links to is a feature.
Each feature is significant only in terms of its presence or absence on each page/domain ("binary feature vectors"). Occurrence frequencies are not significant in this approach.
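To make this representation concrete, here is a minimal sketch, using only the Python standard library, of how such binary feature sets might be extracted from a page. The class name, feature prefixes, and tokenization regex are illustrative assumptions, not taken from the actual submission.

```python
# A sketch of binary feature extraction: words from text content and the
# host parts of link targets, collected as a set (presence/absence only).
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageFeatureExtractor(HTMLParser):
    """Collects lowercased words from text content and the host part of
    every hyperlink target as a set of binary features."""

    def __init__(self):
        super().__init__()
        self.features = set()

    def handle_data(self, data):
        for word in re.findall(r"[a-z0-9']+", data.lower()):
            self.features.add("word:" + word)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    host = urlparse(value).netloc.lower()
                    if host:
                        self.features.add("link:" + host)


def extract_features(html_text):
    parser = PageFeatureExtractor()
    parser.feed(html_text)
    return parser.features  # presence/absence only, no counts
```

Feeding it, say, `<a href="http://ads.doubleclick.net/x">click</a>` would produce the feature set {"word:click", "link:ads.doubleclick.net"}.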
In this approach, every feature is assigned a weight. If, for example, the number of spam domains that contain the word "click" makes up a large proportion of all spam domains, but the number of nonspam domains that contain the word "click" makes up only a small proportion of all nonspam domains, then the weight of the feature "click" will be large.
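Ignoring, for brevity, the contribution that Bernoulli naive Bayes also assigns to absent features, one common way to define such a weight is the log-ratio of the smoothed conditional feature probabilities (the smoothing constant $\alpha$ here is an illustrative choice; the code's actual smoothing may differ):

$$
w_f = \log\frac{\hat{P}(f \mid \text{spam})}{\hat{P}(f \mid \text{nonspam})},
\qquad
\hat{P}(f \mid \text{spam}) = \frac{n_{f,\text{spam}} + \alpha}{n_{\text{spam}} + 2\alpha},
$$

where $n_{f,\text{spam}}$ counts the spam domains containing feature $f$, $n_{\text{spam}}$ counts all spam domains, and the nonspam estimate is analogous. A word like "click" that is frequent among spam domains but rare among nonspam domains then receives a large positive weight.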
When called upon to classify data fetched from a previously unseen domain, the model makes its decision by looking up which features occur and summing the corresponding weights. If that sum exceeds a threshold (the so-called "bias term"), then the domain is classified as spam.
The host-part of each domain that a page links to is treated in exactly the same way. So if spam domains are known to frequently link to a particular advertising network, e.g. ads.doubleclick.net, then all pages linking there will be considered more likely to be spam.
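The decision rule just described amounts to a few lines; here is a sketch under the assumption that the trained model exposes a feature-to-weight mapping and a scalar bias term (the names are illustrative, not the ones in the code).

```python
# Sketch of the decision rule: sum the weights of the features present on
# the page and compare the total against the bias term.
def classify(features, weights, bias):
    """Return True (spam) if the summed weights of the features present on
    the page exceed the bias term, False (nonspam) otherwise."""
    score = sum(weights.get(f, 0.0) for f in features)
    return score > bias
```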
As a first step in this methodology, one needs to ensure that training and testing data are kept separate. This takes care of data snooping bias. If I were to fit a model to some data, and then use the same data to evaluate how good the fit is, I might end up fooling myself. In an extreme case, my model might be nothing more than a data store for the association between page content and spam classification (overfitting). This would not be useful. Instead it is necessary to check how well the model generalizes, i.e. one needs to fit the model to some training data, and then test on data drawn from a dataset which is representative (i.e. was generated by the same underlying stochastic process) but which is actually different.
To that end, I divide the provided data into 10 folds by assigning random numbers from one to ten to each of the domains. One can then, for example, fit the model to data from all domains except those that were randomly assigned to fold number one. One would then evaluate how well the model predicts the spam/no-spam classification on the domains from fold number one, i.e. exactly those domains omitted from training.
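A sketch of such a fold assignment (the fixed seed and the function name are illustrative assumptions; the actual submission may do this differently):

```python
import random

# Randomly assign each domain to one of n_folds folds; a fixed seed keeps
# the split reproducible across runs.
def assign_folds(domains, n_folds=10, seed=0):
    rng = random.Random(seed)
    return {domain: rng.randint(1, n_folds) for domain in domains}

# Fitting on everything except fold 1, and testing on fold 1:
#   folds = assign_folds(all_domains)
#   train = [d for d in all_domains if folds[d] != 1]
#   test  = [d for d in all_domains if folds[d] == 1]
```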
During the development cycle, I generally evaluate only by looking at evaluation metrics pertaining to that one fold. After I'm done, I can rerun on the other nine folds and get an idea of the statistical spread around the evaluation measures thus obtained (crossvalidation).
I then use the usual battery of evaluation measures, such as precision, recall, and F-measure.
On fold #2, these evaluation statistics are as follows:
    total:            114
    true_positives:    37
    false_positives:   20
    true_negatives:    53
    false_negatives:    4
    dataset_bias:      0.3596
    model_bias:        0.5000
    accuracy:          0.7895
    precision:         0.6491
    recall:            0.9024
    fmeasure:          0.7551
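For reference, the derived figures follow from the confusion counts in the usual way. My reading of the dataset_bias and model_bias rows, consistent with the numbers shown, is that they are the proportions of actual and of predicted positives, respectively ($41/114 \approx 0.3596$ and $57/114 = 0.5$).

$$
\begin{aligned}
\text{accuracy}  &= \frac{TP + TN}{\text{total}} = \frac{37 + 53}{114} \approx 0.7895,\\
\text{precision} &= \frac{TP}{TP + FP} = \frac{37}{37 + 20} \approx 0.6491,\\
\text{recall}    &= \frac{TP}{TP + FN} = \frac{37}{37 + 4} \approx 0.9024.
\end{aligned}
$$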
A few comments are in order about these numbers.
It can be seen that in 78.95% of all cases, the model decision agrees with the decision as recorded in the data.
This number needs to be contextualized by considering the bias involved. The proportion of spam domains in the test data is 35.96%, so a baseline model that outputs the negative label every time (i.e. that considers nothing to be spam) would score 64.04%. The accuracy figure of 78.95% represents a healthy improvement over that, but there is clearly a lot of room for improvement: we have only covered 41.46% of the way from the baseline to a perfect model.
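Concretely, the 41.46% figure expresses the accuracy gain as a fraction of the gap between the all-negative baseline and a perfect model:

$$
\frac{0.7895 - 0.6404}{1 - 0.6404} = \frac{0.1491}{0.3596} \approx 0.4146.
$$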
In cases like this, where the classification is strongly biased, i.e. very asymmetrical, it becomes more useful to look at precision and recall with respect to the privileged "positive" class than at the raw probability of agreement between the two class labels.
Here it can be seen that, among all the cases where our model considers something to be spam, we get it right 64.91% of the time (precision). This needs to be contrasted with a baseline strategy, such as marking items as spam randomly, in which case we would only get it right 35.96% of the time. So, there is some substance to our model's decision making there, but room for improvement.
On the other hand, if we've got a particular spam domain in front of us, the probability that our model will actually identify it as spam is as high as 90.24% (recall). In this case it is somewhat tedious to think about possible baselines, as we could hit 100% recall simply by considering everything to be spam. So one always needs to think about recall in relation to precision.
Often it is useful to be able to summarize the quality of a filter like that using a single number, in which case the F-measure is commonly used. It is defined as the harmonic mean of precision and recall.
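With the fold #2 numbers above:

$$
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2 \cdot 0.6491 \cdot 0.9024}{0.6491 + 0.9024} \approx 0.7551.
$$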
Feature List: There is a lot of work that one would normally need to do in relation to a modelling effort such as this. In particular, something I would always do is extract a list of features sorted by their Bayesian weight. It is always useful to eyeball such a list and form a commonsense opinion about whether or not the features make sense. One would expect keywords associated with spam to feature very prominently, as well as links to domains of high-paying advertising networks, etc. One could also evaluate this based on mutual information. Using such an analysis one could even make quantitative statements such as "this feature carries twice the information content w.r.t. the spam/no-spam distinction as that other feature", etc. A sketch of such a listing follows.
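This assumes the trained model exposes its feature-to-weight mapping; the attribute and function names are illustrative.

```python
# Sketch of the feature-list inspection: sort features by Bayesian weight
# and print the most spam-indicative ones for eyeballing.
def top_features(weights, n=25):
    """The n features with the largest (most spam-indicative) weights."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]

# for feature, weight in top_features(model_weights):
#     print(f"{weight:+.3f}  {feature}")
```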
Crossvalidation / Sensitivity Analysis: The above evaluation measures would normally need to be computed for each of the ten folds, and then a standard deviation figure should be reported alongside the point estimate for each of the evaluation statistics.
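A sketch of that summary step, assuming one score per fold for some evaluation measure has already been computed:

```python
from statistics import mean, stdev

# Report the point estimate together with its spread across the ten folds.
def summarize(per_fold_scores):
    return mean(per_fold_scores), stdev(per_fold_scores)
```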
Bias Term / ROC Curve: The model still has a serious drawback in that the threshold for the spam/no-spam decision includes a more or less arbitrary element (called "fudge factor" in the code). This could instead be based on nonparametric statistics about which percentile of the data corresponds to which value for the sum of the Bayesian weights; the threshold could then be set to the value observed at the percentile corresponding to the class bias that is actually in the data. The reason why the analytical solution for the bias term does not work very well lies in the smoothing factors, which would also require some more work for proper calibration. Last but not least, instead of looking at precision/recall as single values, it would be more useful to look at them as an ROC curve, obtained by varying the threshold.
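A sketch of both suggestions, assuming per-domain scores (sums of Bayesian weights) and gold labels are available; all names are illustrative.

```python
# Sweep the decision threshold over the observed scores and record the
# true-positive and false-positive rates at each setting (ROC curve).
def roc_points(scores, labels):
    """scores: per-domain weight sums; labels: True for spam.
    Returns a list of (false_positive_rate, true_positive_rate) pairs."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    points = []
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Nonparametric calibration of the bias term: pick the score value at the
# percentile such that the fraction of domains scored above it roughly
# matches the spam proportion observed in the data.
def threshold_for_bias(scores, spam_proportion):
    cutoff = int(round((1.0 - spam_proportion) * (len(scores) - 1)))
    return sorted(scores)[cutoff]
```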
More Features: And, obviously, there is no end to the creative ways of constructing features for this classification task. Some things that come to mind would be to look at owners of domains as per WHOIS data, the idea being that there will be certain companies creating the web-spam, or that web-spammers will use domains-by-proxy anonymization, etc. One could also look at the IP-ranges where domains are hosted, the idea being that spammers will host many domains from the same IPs, all of them being spam, etc. etc. There are many books in print that contain the SEO industry's speculations about what it is that Google really does. Some of those ideas are worth looking into for purposes of generating a classifier such as this.