{{{begin-summary}}}
- The “naïve assumption” is that the presence of a word does not give any information about the probability of the presence of any other word. This assumption gives us $P(w_1, w_2, \ldots | C_c) = \prod_i P(w_i|C_c)$.
- $d_i$ is the number of documents (in the training set, of course) with word $i$.
- $d_{ci}$ is the number of documents in category $C_c$ with word $i$.
- $n_c$ is the number of documents in category $C_c$.
- $n$ is the number of documents.
- $|C|$ is the number of categories.
- $P(C_c) = \frac{n_c + 1}{n + |C|}$ is the probability of any random document having category $C_c$.
- $\hat{X} = \langle w_1, w_2, \ldots, w_k \rangle$ is a binary document vector where each $w_i \in \{0,1\}$.
- $P(\hat{X}|C_c) = P(w_1, w_2, \ldots, w_k | C_c) = \prod_i P(w_i|C_c)$ is the probability of the document having the words that it does, assuming the document comes from category $C_c$.
- $P(w_i|C_c) = \frac{d_{ci} + 1}{d_i + n_c}$ is the probability of a document from category $C_c$ having word $w_i$.
- $\arg\max_{C_c} P(\hat{X}|C_c) P(C_c)$ is the question: which category $C_c$ maximizes the probability that the document has these words?
- $\arg\max_{C_c} \sum_i \log P(w_i|C_c) + \log P(C_c)$ is the new question, using log-probabilities to avoid “underflow” in the floating-point values.
{{{end-summary}}}
Documents are simply binary word vectors. Each dimension in the vector represents a unique word. The value in this dimension is 0 or 1 depending on whether the document has at least one occurrence of that word.
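As an illustration, here is a minimal sketch (not from the original notes) of building such a binary vector; the vocabulary and example sentence are made up:

```python
# Turn raw text into a 0/1 vector over a fixed, illustrative vocabulary.
vocabulary = ["vision", "image", "market", "wheat"]  # made-up vocabulary

def to_binary_vector(text, vocabulary):
    """Return a 0/1 vector: 1 if the word occurs at least once, else 0."""
    words_present = set(text.lower().split())
    return [1 if word in words_present else 0 for word in vocabulary]

print(to_binary_vector("The image shows a wheat field", vocabulary))
# -> [0, 1, 0, 1]
```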
Each category vector is represented as a series of probabilities, one probability per word (each vector dimension represents a word, just like a document feature vector). Each probability means, “the probability of this word being present in a document that is a member of this category.” Thus, the category vector has terms $C_c = (p_{c1}, p_{c2}, \ldots, p_{ck})$, and

$$p_{ci} = P(w_i|C_c) = \frac{d_{ci} + 1}{d_i + n_c},$$

where $d_{ci}$ is the number of documents in category $C_c$ with word $i$, $d_i$ is the number of training documents with word $i$, and $n_c$ is the number of documents in category $C_c$.
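As a concrete sketch of this estimate, the snippet below computes $p_{ci} = \frac{d_{ci} + 1}{d_i + n_c}$ from a toy training set; the documents, labels, and vocabulary are made up for illustration:

```python
# Toy training set: (document text, category label) pairs, purely illustrative.
training = [
    ("the image shows a wheat field", "grain"),
    ("wheat and corn prices rose", "grain"),
    ("computer vision uses image data", "tech"),
]
vocabulary = ["image", "wheat", "vision"]

def word_probability(word, category, training):
    """Estimate p_ci = (d_ci + 1) / (d_i + n_c) for one word and one category."""
    d_i = sum(1 for text, _ in training if word in text.split())    # docs with word i
    d_ci = sum(1 for text, cat in training
               if cat == category and word in text.split())         # docs in C_c with word i
    n_c = sum(1 for _, cat in training if cat == category)          # docs in C_c
    return (d_ci + 1) / (d_i + n_c)

for w in vocabulary:
    print(w, word_probability(w, "grain", training))
# image 0.5, wheat 0.75, vision 0.333...
```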
We assume, for simplicity, that the occurrences of words in documents are completely independent (this is what makes the method “naïve”). This is patently false since, for instance, the words “vision” and “image” often both appear in documents about computer vision; so seeing the word “vision” suggests that “image” will also appear in the document.
We further assume that the order the words appear in the document does not matter.
Because we make this independence assumption, we can calculate the probability of a document being a member of some category quite easily:

$$P(\hat{X}|C_c) = P(w_1, w_2, \ldots, w_k | C_c) = \prod_i P(w_i|C_c),$$

where $P(w_i|C_c) = p_{ci}$ (from the definition above).
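A sketch of this product, assuming `x` is the 0/1 document vector and `p_c` holds the category’s per-word probabilities (the names and the interpretation, a product over the words the document contains, follow the formula above):

```python
def document_likelihood(x, p_c):
    """P(X_hat | C_c): x is a 0/1 document vector; p_c[i] = P(word i | C_c)."""
    likelihood = 1.0
    for x_i, p_ci in zip(x, p_c):
        if x_i == 1:              # only the words the document has contribute
            likelihood *= p_ci
    return likelihood

print(document_likelihood([1, 1, 0], [0.75, 0.50, 0.20]))   # -> 0.375
```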
Now, Bayes’ theorem gives us:

$$P(C_c|\hat{X}) = \frac{P(\hat{X}|C_c) \, P(C_c)}{P(\hat{X})},$$

with,

$$P(C_c) = \frac{n_c + 1}{n + |C|},$$

where $n_c$ is the number of documents in category $C_c$, $n$ is the total number of documents, and $|C|$ is the number of categories.

Since we want to find the category $C_c$ that maximizes $P(C_c|\hat{X})$, and the denominator $P(\hat{X})$ is the same for every category, we can ignore it. Thus, we are actually looking for:

$$\arg\max_{C_c} P(\hat{X}|C_c) \, P(C_c).$$

We just check all the categories, and choose the single best or the top few.
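A minimal sketch of this argmax over categories; the category names, probability vectors, and priors below are made-up numbers, not values from the notes:

```python
def classify(x, category_probs, priors):
    """Return the category C_c maximizing P(X_hat|C_c) * P(C_c).

    x: 0/1 document vector; category_probs[c]: (p_c1, ..., p_ck); priors[c]: P(C_c).
    """
    best_category, best_score = None, -1.0
    for c, p_c in category_probs.items():
        score = priors[c]
        for x_i, p_ci in zip(x, p_c):
            if x_i == 1:          # product of p_ci over words the document contains
                score *= p_ci
        if score > best_score:
            best_category, best_score = c, score
    return best_category

# Toy usage with a three-word vocabulary and two categories:
category_probs = {"grain": [0.75, 0.50, 0.20], "tech": [0.20, 0.50, 0.90]}
priors = {"grain": 0.5, "tech": 0.5}
print(classify([1, 1, 0], category_probs, priors))   # -> "grain"
```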
With a lot of unique words, we create very small values by multiplying many $p_{ci}$ terms. On a computer, the values may become so small that they may “underflow” (run out of bits required to represent the value). To prevent this, we just throw a logarithm around everything:

$$\log\left(P(\hat{X}|C_c) \, P(C_c)\right) = \log P(\hat{X}|C_c) + \log P(C_c),$$

and furthermore,

$$\log P(\hat{X}|C_c) = \log \prod_i P(w_i|C_c) = \sum_i \log P(w_i|C_c).$$

So our multiplications turn to sums, and we avoid the underflow problem. Rewriting again, we ultimately have this problem:

$$\arg\max_{C_c} \sum_i \log P(w_i|C_c) + \log P(C_c).$$
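The same decision rule in log space, as a sketch with the same made-up numbers as above; the only change is that the products become sums of logarithms:

```python
import math

def classify_log(x, category_probs, priors):
    """Return the category maximizing sum_i log P(w_i|C_c) + log P(C_c)."""
    best_category, best_score = None, float("-inf")
    for c, p_c in category_probs.items():
        score = math.log(priors[c])
        for x_i, p_ci in zip(x, p_c):
            if x_i == 1:                  # only words present in the document
                score += math.log(p_ci)   # multiplications become sums of logs
        if score > best_score:
            best_category, best_score = c, score
    return best_category

category_probs = {"grain": [0.75, 0.50, 0.20], "tech": [0.20, 0.50, 0.90]}
priors = {"grain": 0.5, "tech": 0.5}
print(classify_log([1, 1, 0], category_probs, priors))   # -> "grain", same answer as before
```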
The book *Introduction to Information Retrieval* gathered some published results (F1 scores on ten categories of the Reuters-21578 collection). We can see that naïve Bayes is usually not as good as k-nearest neighbor (which we did learn about) or support vector machines (which we didn't learn about).
Dataset | Naïve Bayes | k-nearest neighbor | Support vector machines |
---|---|---|---|
earn | 0.96 | 0.97 | 0.98 |
acq | 0.88 | 0.92 | 0.94 |
money-fx | 0.57 | 0.78 | 0.75 |
grain | 0.79 | 0.82 | 0.95 |
crude | 0.80 | 0.86 | 0.89 |
trade | 0.64 | 0.77 | 0.76 |
interest | 0.65 | 0.74 | 0.78 |
ship | 0.85 | 0.79 | 0.86 |
wheat | 0.70 | 0.77 | 0.92 |
corn | 0.65 | 0.78 | 0.90 |
- Naïve Bayes is very fast. Although it does not score as well in the table above, it is significantly more efficient than either k-nearest neighbor or support vector machines; the latter are painfully slow, at least in the training phase.
- On the other hand, its accuracy is generally lower than that of the alternatives, as seen in the table above.