This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is.

Several learning systems have the property that they estimate some conditional probabilities *P(event | other events)* either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example.

When Naive Bayes is used, a set of probabilities of the form *Pr'(feature i | label)* are estimated via counting statistics and some prior. Predictions are made according to the label maximizing:

*Pr'(label) * Product*

_{features i}Pr'(feature i | label)(The *Pr’* notation indicates these are estimated values.)

There is nothing wrong with this method as long as (a) the prior for the sample counts is very strong and (b) the prior (on the conditional independences and the sample counts) is “correct”—the actual problem is drawn from it. However, (a) seems to never be true and (b) is often not true.

At this point, we can think a bit from a estimation perspective. When trying to estimate a coin with bias *Pr(feature i | label)*, after observing *n* IID samples, the estimate is accurate to (at most) *c/m* for some constant *c*. (Actually, it’s *c/m ^{0.5}* in the general case

*c/m*for coins with bias near 0 or 1.) Given this observation, we should expect the estimates

*Pr’*to differ by

*c/m*or more when the prior on the sample counts is weak.

The problem to notice is that errors of *c/m* can quickly accumulate. The final product in the naive bayes classifier is n-way linear in the error terms where *n* is the number of features. If every features true value happens to be *v* and we happen to have a *1/2 + 1/n ^{0.5}* feature fraction estimate too large and

*1/2 – 1/n*fraction estimate too small (as might happen with a reasonable chance), the value of the product might be overestimated by:

^{0.5}*(v – c/m)*

^{n/2 + n^0.5}(v + c/m)^{n/2 + n^0.5}– v^{n}When

*c/m*is very small, this approximates as

*c n*, which suggests problems must arise when the number of features

^{0.5}/m*n*is greater than the number of samples squared

*n > m*. This can actually happen in the text classification settings where Naive Bayes is often applied.

^{2}All of the above is under the assumption that the conditional independences encoded in the Naive Bayes classifier are correct for the problem. When these aren’t correct, as is often true in practice, the estimation errors can be systematic rather than stochastic implying much more brittle behavior.

In all of the above, note that we used Naive bayes as a simple example—this brittleness can be found in a number of other common prediction systems.

An important question is “What can you do about this brittleness?” There are several answers:

- Use a different system for prediction (there are many).
- Get much more serious about following Bayes law here. (a) The process of integrating over a posterior rather than taking the maximum likelihood element of a posterior tends to reduce the sampling effects. (b) Realize that the conditional independence assumptions producing the multiplication are probably excessively strong and design softer priors which better fit reasonable beliefs.

Very interesting. I have a question from the implementation point of view. When the examples have many features we need to compute a product of many numbers less than 1. In this case what can we do to avoid underflow? (For example, if we have 1000 features and the probabilities for their values for a certain class are 0.1 on average, then the nearest machine value to the true product would be 0, right?)

Precision problems are separate from what I was worrying about above and can be dealt with fairly easily. There are several tricks for coping with it:

log(x)rather thanx.The exact details of how to do these operations without losing precision are slightly tricky. Here is a bit of ocaml code from my libraries implementing the operations for (2). Even if you don’t use ocaml, it should be possible to puzzle out what is going on.

Why does Naive Bayes tend to be so successful in text classification, where you would expect the thousands of features to kill you? Is NB possibly

benefittingfrom the violation of the independence assumptions?It’s a bit more efficient to do the following (basically, you want to minimize the number of calls to log and exp, since they are really really slow). Though, in general, if speed is your concern, you should take the first suggestion.

let zero = neg_infinity

let one = 0.

`let add x y =`

if x == zero then y

else if y == zero then x

else if x -. y > 32 then x

else if x > y then (x +. log (1. +. exp (y -. x)))

else if y -. x > 32. then y

else (y +. log (1. +. exp (x -. y)))

This works because if y-x>32 (for instance), then the precision isn’t going to reflect any change when added. This alone cuts down the number of exp/log evaluations. Moreoever, we get rid of one call to exp for each evaluation that’s actually computed.

I tend to think that the success of Naive Bayes on text is a case of “our worst fears are not realized”. It may be because of a lack of independence, very strong signal, etc…

Nevertheless, the approximation sensitivity does suggest that it is possible to do better than Naive Bayes on text via more robust methods, and the experimental results I’ve seen suggest that to be true.

For the success of Naive Bayes, as I rememmer, there is a theoretic analysis paper “Beyond independence: Conditions for the Optimality of the Simple Bayesian Classifier” in ICML 96. It presents some good properties of NB. However, in practice, corresponding conditions are still hard to verify.

As to the problem of multiplication of probalibities or in some case potentials, I have some different opinion. In fact, the multiplication has been quite widely applied in Bayesian Network related learning problem, especially in variaous Markov networks.

John,

Calling failure in the case

n > m“our worst fear” is maybe a bit too strong, don’t you think. What method really works in such a case? Using the plain counts rather than smoothed ones is of course very bad; let’s not use that straw-man.^{2}Somehow I’m still not comfortable with the “multiplication of learned probabilities” interpretation. Rather I like the probabilistic model interpretation where naive Bayes is a joint probability model for the data (class plus features) that allows a nice factorization. Maximum likelihood, maximum a posteriori, and integrating over the posterior “just happen to” factor in a way that gives a look multiplying independently learned probabilities. From this view-point it is evident that putting together the learned probabilities can’t give a bad joint model.

Another issue is whether a good joint model is what we want. If not, we should probably use discriminative methods, such as optimizing the

conditionallikelihood, which amounts to logistic regression.Online learning algorithms can certainly do something sane in the n>m^2 case. In general, there is no relationship between number of samples and number of features required to achieve some level of performance.

You have two viewpoints on Naive Bayes, one of which is more precise (because it deals with sampling complexity issues). I tend to trust the more precise viewpoint.

The Naive Bayes to Logistic regression switch always seems a bit odd from a Bayesian point of view. If you believe in the Naive Bayes prior, it’s not quite clear you should believe in the conditional likelihood model posited for logistic regression.

About the Bayesian point of view: You get logistic regression by taking the conditional likelihood model of naive Bayes and assuming a separate model for the features. (The feature model is actually irrelevant unless there is missing data.) This way the conditional model can be learned independently of the model for the features. This doesn’t hold for naive Bayes. Is this what you meant by the ‘Naive Bayes prior’? If so, it is quite true.

Yes. There is some sort of logical gap here: if you believe in a naive bayes prior then you should not believe that logistic regression is the right approach.

Naive Bayes is a special case of logistic regression, so the priors are not mutually exclusive.

Furthermore, it is very easy to design a prior where a latent parameter lambda interpolates beween a naive Bayes model and an ordinary logistic regression model.

Prior isn’t something you would “believe in” – it is just your best guess about how your hypothesis space should be structured.

There seems to be some misunderstanding. Can you show us what specifically you mean by ‘special case’? (Feel free to add a link if html math is too limiting.)

this might be an example.

That’s a good link.

Tom Minka seems to say well what I’m trying to express: that Naive Bayes and Logistic Regression are simply different models. If you think a Naive Bayes prior/model is appropriate for your problem, then you should think that a logistic regression prior/model is suboptimal (and vice-versa).

You can use the prior to force the coefficients in logistic regression to hold the same values as they do in a corresponding naive Bayes model. It can be hard to do it the other way round. Anyway, see http://www.ailab.si/blaz/papers/2004-PKDD.pdf for examples of conversion of naive Bayes models into logistic regression ones for the purpose of visualization.

Thanks to Aleks for the link to Minka’s note, very interesting. Although, I don’t necessarily agree that the term ‘discriminative training’ should be abandoned. For one thing, conditional likelihood is not the only thing we may be interested in. Incorporating other loss functions is possible (Yamanishi, Vovk) but not quite as straightforward.

Thanks to Hal, I mean.

Interesting link. Here’s a derivation of what this parameter uncoupling means for Naive Bayes — http://yaroslavvb.blogspot.com/2006/04/naive-bayes-vs-logistic-regression.html

Something, which I feel was ignored in the above interesting discussion is that in classification, decision boundary matters, and not the probability estimates.

SVM seems to do better than NBC on text. But lack of speed kills its practical use.