This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is.
Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example.
When Naive Bayes is used, a set of probabilities of the form Pr'(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing:
Pr'(label) * Product_{features i} Pr'(feature i | label)
(The Pr’ notation indicates these are estimated values.)
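To make this concrete, here is a minimal sketch of a Bernoulli Naive Bayes classifier of this form. The function names, the {0, 1} encoding of labels and features, and the Beta(alpha, alpha) smoothing prior are illustrative choices, not something prescribed above:

```python
import numpy as np

def train_naive_bayes(X, y, alpha=1.0):
    """Estimate Pr'(label) and Pr'(feature i | label) by counting,
    with a symmetric Beta(alpha, alpha) prior as smoothing.
    X: (m, n) binary feature matrix, y: (m,) binary labels in {0, 1}."""
    priors, cond = {}, {}
    for label in (0, 1):
        rows = X[y == label]
        priors[label] = len(rows) / len(X)
        # Smoothed per-feature probability that the feature is 1 given the label.
        cond[label] = (rows.sum(axis=0) + alpha) / (len(rows) + 2 * alpha)
    return priors, cond

def predict(x, priors, cond):
    """Return the label maximizing Pr'(label) * prod_i Pr'(feature i | label).
    Work in log space to avoid numerical underflow; this does nothing about
    the estimation errors discussed below."""
    scores = {}
    for label in (0, 1):
        p = cond[label]
        likelihoods = np.where(x == 1, p, 1.0 - p)
        scores[label] = np.log(priors[label]) + np.log(likelihoods).sum()
    return max(scores, key=scores.get)

# Tiny synthetic usage example:
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
priors, cond = train_naive_bayes(X, y)
print(predict(np.array([1, 0, 0]), priors, cond))
```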
There is nothing wrong with this method as long as (a) the prior for the sample counts is very strong and (b) the prior (on the conditional independences and the sample counts) is “correct”—the actual problem is drawn from it. However, (a) seems to never be true and (b) is often not true.
At this point, we can think a bit from an estimation perspective. When trying to estimate a coin with bias Pr(feature i | label), after observing m IID samples, the estimate is accurate to (at most) c/m for some constant c. (Actually, it's c/m^0.5 in the general case and c/m for coins with bias near 0 or 1.) Given this observation, we should expect the estimates Pr' to differ from the true values by c/m or more when the prior on the sample counts is weak.
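As an aside, a small simulation (the biases, sample sizes, and trial count here are arbitrary illustrative choices) can be used to check this scaling empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_error(bias, m, trials=5_000):
    """Average |Pr' - Pr| when a coin of the given bias is estimated
    from m IID samples by its empirical frequency."""
    estimates = (rng.random((trials, m)) < bias).mean(axis=1)
    return np.abs(estimates - bias).mean()

for m in (100, 400, 1600):
    # For a fair coin the error shrinks roughly like c/m^0.5, while for a
    # coin with bias near 0 (here 1/(2m)) it shrinks roughly like c/m.
    print(m, mean_abs_error(0.5, m), mean_abs_error(1.0 / (2 * m), m))
```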
The problem to notice is that errors of c/m can quickly accumulate. The final product in the Naive Bayes classifier is n-way linear in the error terms, where n is the number of features. If every feature's true value happens to be v, and a 1/2 + 1/n^0.5 fraction of the feature estimates happen to be too large while a 1/2 – 1/n^0.5 fraction are too small (as might happen with a reasonable chance), the value of the product might be overestimated by:
(v – c/m)^(n/2 – n^0.5) (v + c/m)^(n/2 + n^0.5) – v^n

When c/m is very small, this approximates as c n^0.5/m, which suggests problems must arise when the number of features n is greater than the number of samples squared, n > m^2. This can actually happen in the text classification settings where Naive Bayes is often applied.
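As a rough numerical check (the values v = 0.9, c = 1, and m = 1000 below are arbitrary illustrative choices), the overestimate above can be evaluated relative to v^n:

```python
import math

def relative_overestimate(n, m, v=0.9, c=1.0):
    """((v - c/m)^(n/2 - n^0.5) * (v + c/m)^(n/2 + n^0.5) - v^n) / v^n,
    computed in log space to avoid underflow for large n."""
    eps = c / m
    log_prod = ((n / 2 - math.sqrt(n)) * math.log(v - eps)
                + (n / 2 + math.sqrt(n)) * math.log(v + eps))
    return math.exp(log_prod - n * math.log(v)) - 1.0

# The distortion stays small while n << m^2 and becomes a large
# multiplicative factor as n approaches m^2 (here m^2 = 1,000,000).
for n in (100, 10_000, 1_000_000):
    print(n, relative_overestimate(n, m=1000))
```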
All of the above is under the assumption that the conditional independences encoded in the Naive Bayes classifier are correct for the problem. When these aren't correct, as is often true in practice, the estimation errors can be systematic rather than stochastic, implying much more brittle behavior.
In all of the above, note that we used Naive Bayes as a simple example; this brittleness can be found in a number of other common prediction systems.
An important question is “What can you do about this brittleness?” There are several answers:
- Use a different system for prediction (there are many).
- Get much more serious about following Bayes' law here. (a) Integrating over the posterior rather than taking the maximum likelihood element of the posterior tends to reduce the sampling effects (see the sketch below). (b) Realize that the conditional independence assumptions producing the multiplication are probably excessively strong, and design softer priors which better fit reasonable beliefs.
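For (a), here is a minimal sketch assuming a Beta-Bernoulli model (the prior parameters a and b and the example counts are illustrative), contrasting a plug-in maximum likelihood estimate with the estimate obtained by integrating over the posterior:

```python
def ml_estimate(heads, m):
    """Maximum likelihood estimate of a coin's bias: the raw frequency."""
    return heads / m

def posterior_predictive(heads, m, a=1.0, b=1.0):
    """Probability of heads after integrating over the Beta(a + heads,
    b + m - heads) posterior instead of plugging in a point estimate."""
    return (heads + a) / (m + a + b)

# With few samples the ML estimate can be exactly 0 or 1, which is fatal
# inside a product of learned probabilities; the integrated estimate is
# pulled away from the extremes and degrades more gracefully.
for heads, m in [(0, 3), (3, 3), (1, 10)]:
    print(heads, m, ml_estimate(heads, m), posterior_predictive(heads, m))
```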