Naive Bayes

As a first step towards a better solution to the 'opinion spam' problem, and to give ourselves a benchmark, we built our first model with a traditional statistical modelling approach: Naive Bayes, as implemented in the scikit-learn library.
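
For reference, a baseline of this kind can be set up in a few lines of scikit-learn. The snippet below is only an illustrative sketch, not the project's actual pipeline; the example reviews, labels and preprocessing are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for the opinion-spam reviews and their labels.
reviews = ["great location, friendly staff, would stay again",
           "best hotel ever, amazing, amazing, unbelievable luxury"]
labels = ["truthful", "deceptive"]

# Bag-of-words counts fed into a multinomial Naive Bayes classifier.
baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(reviews, labels)
print(baseline.predict(["amazing luxury, best staff ever"]))
```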

Naive Bayes is a common method for producing a baseline that can be compared against other, more experimental methods. It is based on Bayes' theorem with a 'naive' twist.

It is Bayes' theorem plus one assumption: "the conditions are independent."

What is a condition?

In Bayes' theorem we speak of the 'probability of A given B'. The condition is B:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Now, when we have multiple conditions, Bayes' theorem looks like this:

$$P(A \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid A)\,P(A)}{P(x_1, \ldots, x_n)}$$

where $P(x_1, \ldots, x_n)$ denotes the joint probability of the features $x_1, \ldots, x_n$ occurring together.

The joint probability can be expressed like this:

$$P(x_1, x_2, x_3, \ldots, x_n) = P(x_1 \mid x_2, x_3, \ldots, x_n)\,P(x_2 \mid x_3, \ldots, x_n) \cdots P(x_n)$$

For each term we resolve we need to reuse this expression, so the number of conditional terms (and the parameters to estimate) grows exponentially with the number of features. This exponential algorithm is highly computationally intensive.

Instead of using this exponential algorithm, we can change the complexity to linear time by making an assumption.

Naive Bayes assumes features are independent

By assuming our incoming features are independent, we can compute our formula by simply multiplying all of their feature probabilities together. We no longer need to worry about combinations, so our algorithm goes from exponential to linear.

$$P(A \mid x_1, \ldots, x_n) \propto P(A) \prod_{i=1}^{n} P(x_i \mid A)$$

Note also that the denominator $P(x_1, \ldots, x_n)$ is not needed, since it is constant across classes; it has been dropped from the equation above.
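
As a toy illustration of how cheap this product is to compute, here is a small sketch in Python; the class priors and per-word probabilities are made-up numbers, not estimates from any real corpus.

```python
import math

priors = {"spam": 0.4, "ham": 0.6}                       # made-up P(A)
likelihoods = {                                           # made-up P(x_i | A)
    "spam": {"free": 0.30, "win": 0.20, "hello": 0.05},
    "ham":  {"free": 0.02, "win": 0.01, "hello": 0.30},
}

def log_score(cls, words):
    # log P(A) + sum_i log P(x_i | A); the constant denominator is dropped.
    return math.log(priors[cls]) + sum(math.log(likelihoods[cls][w]) for w in words)

document = ["free", "win"]
print(max(priors, key=lambda c: log_score(c, document)))  # -> 'spam'
```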

The independence assumption is rarely correct in practice, yet Naive Bayes often performs well regardless, which is why it has become a common benchmark technique for data-science experiments.

Assuming that the words in every document are conditionally independent (according to the naive assumption), two different models can be used to compute the class-conditional probabilities: the multivariate Bernoulli model and the multinomial model.

Multivariate Bernoulli Naive Bayes

The Multivariate Bernoulli model is based on binary data: Every token in the feature vector of a document is associated with the value 1 or 0. The feature vector has m dimensions where m is the number of words in the whole vocabulary. (1 means the word occurred in the document, 0 means it did not.)

The Bernoulli trials can be written as

$$P(\mathbf{x} \mid \omega_j) = \prod_{i=1}^{m} P(x_i \mid \omega_j)^{b} \cdot \left(1 - P(x_i \mid \omega_j)\right)^{(1-b)}, \quad b \in \{0, 1\},$$

where $b$ is the binary value of feature $x_i$. Let $\hat{P}(x_i \mid \omega_j)$ be the maximum-likelihood estimate that a particular word (or token) $x_i$ occurs in class $\omega_j$:

$$\hat{P}(x_i \mid \omega_j) = \frac{df_{x_i,y} + 1}{df_y + 2}, \quad i = 1, \ldots, m$$

where

$df_{x_i,y}$ is the number of documents in the training dataset that contain the feature $x_i$ and belong to class $\omega_j$.

$df_y$ is the number of documents in the training dataset that belong to class $\omega_j$.

+1 and +2 are the parameters of Laplace smoothing, so that a word which never appears in the training documents of a class does not force that class's likelihood to 0 at test time.
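
A small NumPy sketch of this estimate on an invented binary document-term matrix (three documents, three vocabulary words) might look as follows.

```python
import numpy as np

# Rows = documents, columns = vocabulary words; 1 if the word occurs in the document.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])
y = np.array([0, 0, 1])          # class label of each document

c = 0                            # the class omega_j we estimate for
df_y = (y == c).sum()                     # documents belonging to class c
df_xi_y = X[y == c].sum(axis=0)           # documents in class c containing word i
p_hat = (df_xi_y + 1) / (df_y + 2)        # Laplace-smoothed estimate P^(x_i | omega_j)
print(p_hat)                              # [0.75 0.5  0.5 ] for this toy data
```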

Although this model works, it is better to take term frequencies into account rather than assigning words binary values. That brings us to...

Multinomial Naive Bayes

Modelling and Classification

Multinomial Naive Bayes models the distribution of words in a document as a multinomial. A document is treated as a sequence of words, and it is assumed that each word position is generated independently of every other. For classification, we assume that there is a fixed number of classes, $c \in \{1, 2, \ldots, m\}$, each with a fixed set of multinomial parameters. The parameter vector for a class $c$ is $\vec{\theta}_c = (\theta_{c1}, \theta_{c2}, \ldots, \theta_{cn})$, where $n$ is the size of the vocabulary,

$\sum_i \theta_{ci} = 1$, and $\theta_{ci}$ is the probability that word $i$ occurs in that class. The likelihood of a document is a product of the parameters of the words that appear in the document,

$$(1)\quad p(d \mid \vec{\theta}_c) = \frac{\left(\sum_i f_i\right)!}{\prod_i f_i!} \prod_i (\theta_{ci})^{f_i}$$

where $f_i$ is the count of word $i$ in document $d$. Applying a prior distribution over the set of classes, we arrive at the minimum-error classification rule (Duda & Hart, 1973), which selects the class with the largest posterior probability,

$$(2)\quad l(d) = \arg\max_c \left[ \log p(\vec{\theta}_c) + \sum_i f_i \log \theta_{ci} \right]$$

$$(3)\quad\;\; = \arg\max_c \left[ b_c + \sum_i f_i w_{ci} \right]$$

where $b_c$ is the threshold term and $w_{ci}$ is the class-$c$ weight for word $i$. These values are natural parameters for the decision boundary. This is especially easy to see for binary classification, where the boundary is defined by setting the differences between the positive and negative class parameters equal to zero,

$$(4)\quad (b_+ - b_-) + \sum_i f_i (w_{+i} - w_{-i}) = 0.$$

(The form of this equation is identical to the decision boundary learned by the (linear) Support Vector Machine, logistic regression, linear least squares and the perceptron. Naive Bayes' relatively poor performance results from how it chooses $b_c$ and $w_{ci}$.)

To estimate the parameters from the training data, we select a Dirichlet Prior and take the expectation of the parameter with respect to the posterior.

This gives us a simple form for the estimate of the multinomial parameter: the number of times word $i$ appears in the documents of class $c$ ($N_{ci}$), divided by the total number of word occurrences in class $c$ ($N_c$). For word $i$, the prior adds in $\alpha_i$ imagined occurrences, so that the estimate is a smoothed version of the maximum-likelihood estimate,

$$(5)\quad \hat{\theta}_{ci} = \frac{N_{ci} + \alpha_i}{N_c + \alpha}$$

where $\alpha$ denotes the sum of the $\alpha_i$. While $\alpha_i$ can be set differently for each word, we follow common practice (Laplace smoothing) and set $\alpha_i = 1$ for all words. Substituting the true parameters in (2) with our estimates, we get the MNB classifier,

$$(6)\quad l_{MNB}(d) = \arg\max_c \left[ \log \hat{p}(\theta_c) + \sum_i f_i \log \frac{N_{ci} + \alpha_i}{N_c + \alpha} \right]$$

where $\hat{p}(\theta_c)$ is the class prior estimate.

The weights for the decision boundary defined by the MNB classifier are the log parameter estimates,

$$(7)\quad \hat{w}_{ci} = \log \hat{\theta}_{ci}.$$
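
Equations (5)-(7) translate fairly directly into NumPy. The sketch below trains and applies an MNB classifier on an invented count matrix; it is meant only to make the formulas concrete.

```python
import numpy as np

X = np.array([[3, 0, 1],          # rows = documents, columns = word counts f_i
              [2, 1, 0],
              [0, 4, 2]])
y = np.array([0, 0, 1])           # class label of each document
classes = np.unique(y)
alpha_i = 1.0                     # Laplace smoothing; alpha = alpha_i * vocabulary size

N_ci = np.array([X[y == c].sum(axis=0) for c in classes])    # word counts per class
N_c = N_ci.sum(axis=1, keepdims=True)                        # total word count per class
theta_hat = (N_ci + alpha_i) / (N_c + alpha_i * X.shape[1])  # equation (5)
w_hat = np.log(theta_hat)                                    # equation (7)
log_prior = np.log(np.bincount(y) / len(y))                  # class prior estimate

test = np.array([1, 3, 1])                                   # counts of a test document
scores = log_prior + test @ w_hat.T                          # equation (6)
print(classes[np.argmax(scores)])                            # -> 1 for this toy data
```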

Complement Naive Bayes

It has been shown (see fig. 1) that skewed training data (that is, more training samples in one class than in another) biases the classifier towards the class with more samples, causing it to prefer that class. To deal with this problem, we introduce the 'Complement' variant of Naive Bayes, CNB. Instead of regular MNB, which uses the training data from a single class $c$ to estimate its weights, CNB estimates them using the training data from every class except $c$.

Fig. 1

CNB's estimate is

$$(8)\quad \hat{\theta}_{\bar{c}i} = \frac{N_{\bar{c}i} + \alpha_i}{N_{\bar{c}} + \alpha}$$

where $N_{\bar{c}i}$ is the number of times word $i$ occurred in documents in classes other than $c$, $N_{\bar{c}}$ is the total number of word occurrences in classes other than $c$, and $\alpha_i$ and $\alpha$ are smoothing parameters. As before, the weight estimate is $\hat{w}_{\bar{c}i} = \log \hat{\theta}_{\bar{c}i}$ and the classification rule is

$$(9)\quad l_{CNB}(d) = \arg\max_c \left[ \log p(\vec{\theta}_c) - \sum_i f_i \log \frac{N_{\bar{c}i} + \alpha_i}{N_{\bar{c}} + \alpha} \right].$$

The negative sign represents the fact that we want to assign to class $c$ the documents that poorly match the complement parameter estimates.
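
Continuing the toy NumPy sketch above, the only change for CNB is that counts are summed over every class except $c$ and the score is subtracted rather than added.

```python
import numpy as np

X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 4, 2]])
y = np.array([0, 0, 1])
classes = np.unique(y)
alpha_i = 1.0

# Counts taken from every class except c (equation 8).
N_comp = np.array([X[y != c].sum(axis=0) for c in classes])
theta_comp = (N_comp + alpha_i) / (N_comp.sum(axis=1, keepdims=True) + alpha_i * X.shape[1])
w_comp = np.log(theta_comp)

log_prior = np.log(np.bincount(y) / len(y))
test = np.array([1, 3, 1])
scores = log_prior - test @ w_comp.T        # note the minus sign (equation 9)
print(classes[np.argmax(scores)])           # -> 1, agreeing with MNB on this toy data
```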

CNB is related to the one-versus-all-but-one technique that is frequently used in multi-label classification, where each example may have more than one label. Berger (1999) and Zhang and Oles (2001) have found that one-vs-all-but-one MNB works better than regular MNB. The combined classification rule is essentially the MNB rule (6) with the complement weights of (8) subtracted from the normal weights. However, CNB performs better than OVA MNB and regular MNB because it eliminates the biased regular MNB weights.

Weight Magnitude Errors

Because of the independence assumption, Naive Bayes can be biased to give more weight to those classes that most violate the independence assumption. A good example of this is:

Consider the problem of distinguishing between documents that discuss Boston and ones that discuss San Francisco. Let’s assume that “Boston” appears in Boston documents about as often as “San Francisco” appears in San Francisco documents (as one might expect).

Let’s also assume that it’s rare to see the words “San” and “Francisco” apart. Then, each time a test document has an occurrence of “San Francisco,” Multinomial Naive Bayes will double count — it will add in the weight for “San” and the weight for “Francisco.”

Since "San Francisco" and "Boston" occur equally often in their respective classes, a single occurrence of "San Francisco" contributes twice the weight of an occurrence of "Boston." Hence, the summed contributions of the classification weights may be larger for one class than another; this causes MNB to prefer one class incorrectly.

For example, if a document has five occurrences of "Boston" and three of "San Francisco," MNB will label the document "San Francisco" rather than "Boston," since the three double-counted "San Francisco" occurrences contribute the equivalent of six weights against Boston's five.

We correct for the fact that some classes have greater dependencies by normalizing the weight vectors. Instead of assigning $\hat{w}_{ci} = \log \hat{\theta}_{ci}$, we assign

$$(10)\quad \hat{w}_{ci} = \frac{\log \hat{\theta}_{ci}}{\sum_k \left| \log \hat{\theta}_{ck} \right|}.$$

We call this, combined with CNB, Weight-normalized Complement Naive Bayes (WCNB). Experiments indicate that WCNB is effective.
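
In code, the normalization of equation (10) is a single line. The sketch below reuses the complement log-parameters produced by the toy CNB example above.

```python
import numpy as np

# Complement log-parameter estimates, shape (n_classes, n_words);
# the values are the theta_comp rows produced by the toy CNB sketch above.
w_comp = np.log(np.array([[1/9, 5/9, 3/9],
                          [6/10, 2/10, 2/10]]))

# Equation (10): divide each class's weights by the sum of their absolute values.
w_norm = w_comp / np.abs(w_comp).sum(axis=1, keepdims=True)
print(w_norm.sum(axis=1))   # [-1. -1.]: every log-parameter is negative, so rows sum to -1
```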

Transforming Term Frequency

It was found that term distributions had heavier tails than predicted by the multinomial model, instead appearing like a power-law distribution. Using a simple transform, we can make these power-law-like term distributions look more multinomial.

One such transform is

$$(11)\quad f'_i = \log(d + f_i).$$

Setting $d = 1$,

$$(12)\quad f'_i = \log(1 + f_i),$$

has the advantage of being an identity transform for zero and one counts, while pushing down larger counts as we would like.

(Although setting d = 1 does not match the data as well as an optimized d, it does produce a distribution that is much closer to the empirical distribution than the best fit multinomial.)
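
In NumPy this is just an element-wise `log1p` over the count matrix; the counts below are invented to show how larger values get damped.

```python
import numpy as np

counts = np.array([0, 1, 7, 50])   # raw term frequencies f_i for one document
f_prime = np.log1p(counts)         # equation (12): f'_i = log(1 + f_i)
print(f_prime.round(2))            # [0.   0.69 2.08 3.93] -- big counts are pushed down
```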

Transforming by Document Frequency

Another useful transform discounts terms that occur in many documents. Common words are unlikely to be related to the class of a document, but random variation can create apparent, fictitious correlations. This adds noise to the parameter estimates and hence to the classification weights. Since common words appear often, they can hold sway over a classification decision even if their weight differences between classes are small. For this reason, it is advantageous to downweight these words.

'Inverse document frequency' (Jones, 1972) is one way to do this. It is given by

$$f'_i = f_i \log \frac{\sum_j 1}{\sum_j \delta_{ij}}$$

where $\delta_{ij}$ is 1 if word $i$ occurs in document $j$ and 0 otherwise, and the sums run over all document indices (Salton & Buckley, 1988). Rare words are given increased term frequencies; common words are given less weight.
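
A short sketch of this IDF weighting on an invented count matrix, computing $\delta_{ij}$ directly from the counts:

```python
import numpy as np

counts = np.array([[2, 1, 0],      # rows = documents j, columns = words i
                   [3, 0, 1],
                   [1, 0, 4]])
delta = (counts > 0)                          # delta_ij: word i occurs in document j
n_docs = counts.shape[0]                      # sum_j 1
df = delta.sum(axis=0)                        # sum_j delta_ij (document frequency)
f_prime = counts * np.log(n_docs / df)        # f'_i = f_i * log(sum_j 1 / sum_j delta_ij)
print(f_prime.round(2))                       # the word appearing in every document gets weight 0
```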

Transforming Based on Length

Documents have strong word inter-dependencies: after a word first appears in a document, it is more likely to appear again. Since MNB assumes occurrence independence, long documents can negatively affect the parameter estimates. We normalize word counts to avoid this problem, again using a common transform that is not traditionally seen with Naive Bayes. We discount the influence of long documents by transforming the term frequencies according to

$$f'_i = \frac{f_i}{\sqrt{\sum_k (f_k)^2}}$$

yielding a unit-length term-frequency vector for each document. The transform keeps any single document from dominating the parameter estimates.
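
The corresponding length normalization is an L2 normalization of each document's (already transformed) count vector, e.g.:

```python
import numpy as np

f = np.array([[2.0, 1.0, 0.0],     # two documents with the same word proportions,
              [8.0, 4.0, 0.0]])    # one four times longer than the other
norms = np.sqrt((f ** 2).sum(axis=1, keepdims=True))
f_prime = f / norms                # each row now has Euclidean length 1
print(f_prime)                     # both documents map to the same unit vector
```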

Conclusion and Derivation of Steps to Arrive at TWCNB

Let $\vec{d} = (\vec{d}_1, \ldots, \vec{d}_n)$ be a set of documents, where $d_{ij}$ is the count of word $i$ in document $j$. Let $\vec{y} = (y_1, \ldots, y_n)$ be the labels. TWCNB($\vec{d}$, $\vec{y}$):

1. $d_{ij} = \log(d_{ij} + 1)$ (TF transform)
2. $d_{ij} = d_{ij} \log \frac{\sum_k 1}{\sum_k \delta_{ik}}$ (IDF transform)
3. $d_{ij} = \frac{d_{ij}}{\sqrt{\sum_k (d_{kj})^2}}$ (length normalization)
4. $\hat{\theta}_{ci} = \frac{\sum_{j: y_j \neq c} d_{ij} + \alpha_i}{\sum_{j: y_j \neq c} \sum_k d_{kj} + \alpha}$ (complement)
5. $w_{ci} = \log \hat{\theta}_{ci}$
6. $w_{ci} = \frac{w_{ci}}{\sum_i w_{ci}}$ (weight normalization)

Let $\vec{t} = (t_1, \ldots, t_n)$ be a test document, where $t_i$ is the count of word $i$. Label the document according to

$$l(t) = \arg\min_c \left[ \sum_i t_i w_{ci} \right]$$
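
The TF, IDF and length transforms above are close to what a sublinear-TF TF-IDF vectorizer produces, and scikit-learn ships a ComplementNB estimator (with an optional weight normalization) based on the same Rennie et al. approach. A TWCNB-style pipeline can therefore be sketched roughly as below; scikit-learn's IDF smoothing and other details differ slightly from the formulas above, so treat this as an approximation rather than a faithful reimplementation, and note that the documents and labels are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# sublinear_tf=True -> 1 + log(tf), use_idf=True -> IDF weighting, norm='l2' -> length norm.
# ComplementNB(norm=True) adds the WCNB-style weight normalization.
twcnb_like = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, use_idf=True, norm="l2"),
    ComplementNB(norm=True),
)

docs = ["cheap deal click now", "see you at the meeting tomorrow"]   # placeholder data
labels = ["deceptive", "truthful"]
twcnb_like.fit(docs, labels)
print(twcnb_like.predict(["click this cheap deal"]))
```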
