An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply^{2}.

The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior *P(D)* over possible processes *D* producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior:

*P(D|x) = P(x|D)P(D)/P(x)*

After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss.

This basic idea is reused thousands of times with various choices of *P(D)* and loss functions which is unsurprising given the many nice properties:

- There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from
*P(D)*there is no better method. One way to think about this is that in the Bayesian setting, the worst case analysis*is*the average case analysis. - The Bayesian method is a straightforward extension of the engineering method for designing a solution to a problem.
- The Bayesian method is modular. The three information sources are prior
*P(D)*, data*x*, and loss, but loss only interacts with*P(D)*and*x*via the posterior*P(D|x)*.

The fly in the ointment is approximation. The basic claim of the approximation argument is that approximation is unavoidable in all real-world problems that we care about. There are several ways in which approximation necessarily invades applications of Bayes Law.

- When specifying the prior, the number of bits needed to describe the “real”
*P(D)*is typically too large. The meaning of “real”*P(D)*actually varies, but this statement appears to hold true across all of them. What happens instead is that people take short-cuts specifying something which isn’t quite the real prior. - Even if the real
*P(D)*is somehow specifiable, computing the posterior*P(D|x)*is often computationally intractable. Again, the common short-cut is to alter the prior so as to make it computationally tractable. (There are a few people who instead attempt to approximately compute the posterior via monte carlo methods.)

Consider for example the problem of speech recognition. A “real” prior *P(D)* (according to some definitions) might involve a distribution over the placement of air molecules, the shape of the throat producing the sound, and what is being pronounced. This prior might be both inarticulable (prior elicitation is nontrivial) and unrepresentable (because too many bits are required to store on a modern machine).

If the necessity of approximation is accepted, the question becomes “what do you do about it?” There are many answers:

- Ignore the problem. This works well sometimes but can not be a universal prescription.
- Avoid approximation and work (or at least work a computer) very hard. This also can work well, at least for some problems.
- Use an approximate Bayesian method and leave a test set on the side to sanity check results. This is a common practical approach.
- Violate the modularity of loss and attempt to minimize approximation errors near the decision boundary. There seems to be little deep understanding of the viability and universality of this approach but there are examples where this approach can provide significant benefits.

Some non-Bayesian approaches can be thought of as attempts at (4).

There are several Bayesian answers to this argument:

1) The same difficulty of all solutions being approximations applies to every other method too – it’s just that the Bayesian framework is more well-defined, so you can point more precisely to what is being approximated, and have some hope of understanding what the implications of this approximation might be.

2) You can’t hope for a perfect solution, but it’s by trying to reduce the degree of approximation as much as possible that you make progress in understanding your problem (and in understanding general classes of problems).

3) You’re right. Sometimes the problem is just too hard, and we should do something non-Bayesian. Bayesians are sometimes remiss in not realizing this. But non-Bayesians are also often remiss in not recognizing that some problems are actually very ammenable to good Bayesian solutions.

If you want to do anything other than classification (or regression), then just representing the decision boundary (or some other simple non-Bayesian formulation) may not be sufficient, and formulating a sensible non-Bayesian solution may be very difficult (as compared to an approximate Bayesian solution).

Can’t you make a similar argument against non-Bayesian approaches? Minimizing (regularized) empirical 0/1 loss is intractable, so we approximate it. Are there better answers to this approach than to the Bayesian approach? (It seems everything is basically the same…their prior is almost arbitrary, but regularization functions such as ell-2 norm are also almost arbitrary, modulo what seem to be rather weak bounds.)

One of the things which seems necessary to acknowledge is that there is some divergence in goals. I am probably extreme as a machine learning person in wanting highly (or fully) automated methods. In contrast, many other people (quite reasonably) want to solve specific encountered problems. This divergence in goals makes Radford’s (2) unconvincing to me—I don’t want to understand my problem better, but rather to have a computer solve it.

To the extent that non-Bayesian approaches optimize some other arbitrary choice (hinge loss, l

_{2}, etc…) the same complaint of “approximations can kill you” can be made. But this is not necessarily so—there are several algorithms (and meta-algorithms such as reductions) which take explicit account of the loss imposed directly by the world in the process of learning. This explicit knowledge seems quite helpful in reducing real-world loss.The ability to take losses explicitly into account is not an ‘always non-Bayesian’ thing. For example, here we played with the approximation process in a particle filter, yielding empirical improvements.

“It seems everything is basically the sameÃ¢â‚¬Â¦their prior is almost arbitrary”

One difference is that specifying prior knowledge using a real-valued function over models is hard. For instance in Knowledge Intensive Learning project at my school we get prior knowledge in the form of constraints on probability distribution. It’s easier for an expert to say things like “X and Y are independent / synergistic / monotonic”, than to pick a measureable function over R^n that will have a similar effect when you integrate over it. When knowledge is specified as constraints on true models, the non-Bayesian methods like MaxEnt seem more appropriate.

If an expert says “X and Y are independent”, there’s little doubt that they think random variables X,Y are independent. But even so, you can ask them questions related to properties of independence, ie, “do you also think X,Y are uncorrelated?”

But if they give us a prior they crafted, can we still be sure what they gave corresponds to what they believe? And how could we check?

> “I am probably extreme as a machine learning person in wanting highly (or fully) automated methods”

Reading a recent , it seems to me that the main difficulty for having “fully automated methods” is to automated the collection and the encoding of preliminary knowledge. In my opinion, whatever you call it ,”prior probability”, inductive bias, loss function, etc, this knowledge you (a human) have about a problem is fundamental for good results. Relying just on, say, a 2 hours movie of data for learning what is a human face in a noisy video seems to me impossible. For instance, for the problem of face tracking in images, a *huge* prior knowledge is coded in the shape and functionality of our body (eye, brain…) and a “smaller part” was acquired during life. Then if we want “fully automated methods” for this kind of difficult real world problems, do we have to simulate those million years of evolution ? Does a “fully automated methods” need to contain this kind of knowledge for all problems ?

In my opinion, taking into account this prior knowledge is the key idea, and one advantage of the bayesian approach is to say this explicitely. (even if deriving a prior is not easy!)

Oups, I made a mistake in the link syntax: I was pointing to O. Bousquet recent post:

http://ml.typepad.com/machine_learning_thoughts/2006/03/freakonomics.html

Sorry…

arrghh !

The underscrores are interpreted as italic.

It was “Freakonomics post” on http://ml.typepad.com/

Pierre