## The Approximation Argument

An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply.

The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior:

P(D|x) = P(x|D)P(D)/P(x)

After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss.

This basic idea is reused thousands of times with various choices of P(D) and loss functions, which is unsurprising given its many nice properties:

1. There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method. One way to think about this is that in the Bayesian setting, the worst case analysis is the average case analysis.
2. The Bayesian method is a straightforward extension of the engineering method for designing a solution to a problem.
3. The Bayesian method is modular. The three information sources are prior P(D), data x, and loss, but loss only interacts with P(D) and x via the posterior P(D|x).
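The recipe above (prior, data, posterior, then a loss-minimizing decision) can be sketched for a finite set of candidate processes. This is a minimal illustration, not a general implementation: the two processes `D1` and `D2`, the observations `x0` and `x1`, and all the numbers are made up for the example.

```python
def posterior(prior, likelihood, x):
    """Return P(D | x) for a finite set of candidate processes D."""
    # Unnormalized posterior: P(x | D) * P(D) for each D.
    joint = {d: likelihood[d][x] * p for d, p in prior.items()}
    evidence = sum(joint.values())  # P(x)
    return {d: j / evidence for d, j in joint.items()}


def bayes_decision(post, loss, actions):
    """Pick the action with the smallest posterior-expected loss."""
    def expected_loss(a):
        return sum(p * loss[(a, d)] for d, p in post.items())
    return min(actions, key=expected_loss)


# Hypothetical toy problem: two candidate processes, one binary observation.
prior = {"D1": 0.5, "D2": 0.5}
likelihood = {"D1": {"x0": 0.9, "x1": 0.1}, "D2": {"x0": 0.2, "x1": 0.8}}
loss = {
    ("left", "D1"): 0.0, ("left", "D2"): 1.0,
    ("right", "D1"): 1.0, ("right", "D2"): 0.0,
}

post = posterior(prior, likelihood, "x1")
action = bayes_decision(post, loss, ["left", "right"])
```

Note that `bayes_decision` only ever sees the posterior, which is the modularity property in item 3: the loss never touches the prior or the data directly.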

The fly in the ointment is approximation. The basic claim of the approximation argument is that approximation is unavoidable in all real-world problems that we care about. There are several ways in which approximation necessarily invades applications of Bayes Law.

1. When specifying the prior, the number of bits needed to describe the “real” P(D) is typically too large. The meaning of “real” P(D) actually varies, but this statement appears to hold true across all of them. What happens instead is that people take short-cuts specifying something which isn’t quite the real prior.
2. Even if the real P(D) is somehow specifiable, computing the posterior P(D|x) is often computationally intractable. Again, the common short-cut is to alter the prior so as to make it computationally tractable. (There are a few people who instead attempt to approximately compute the posterior via Monte Carlo methods.)

Consider for example the problem of speech recognition. A “real” prior P(D) (according to some definitions) might involve a distribution over the placement of air molecules, the shape of the throat producing the sound, and what is being pronounced. This prior might be both inarticulable (prior elicitation is nontrivial) and unrepresentable (because too many bits are required to store on a modern machine).

If the necessity of approximation is accepted, the question becomes “what do you do about it?” There are many answers:

1. Ignore the problem. This works well sometimes but cannot be a universal prescription.
2. Avoid approximation and work (or at least work a computer) very hard. This also can work well, at least for some problems.
3. Use an approximate Bayesian method and leave a test set on the side to sanity check results. This is a common practical approach.
4. Violate the modularity of loss and attempt to minimize approximation errors near the decision boundary. There seems to be little deep understanding of the viability and universality of this approach but there are examples where this approach can provide significant benefits.

Some non-Bayesian approaches can be thought of as attempts at (4).

## Why do people count for learning?

This post is about a confusion of mine with respect to many commonly used machine learning algorithms.

A simple example where this comes up is Bayes net prediction. A Bayes net is a directed acyclic graph over a set of nodes where each node is associated with a variable and the edges indicate dependence. The joint probability distribution over the variables is given by a set of conditional probabilities. For example, a very simple Bayes net might express:
P(A,B,C) = P(A | B,C)P(B)P(C)

What I don’t understand is the mechanism commonly used to estimate P(A | B, C). If we let N(A,B,C) be the number of instances of A,B,C then people sometimes form an estimate according to:

P'(A | B,C) = [N(A,B,C)/N] / [N(B)/N * N(C)/N] = N(A,B,C) N / [N(B) N(C)]

… in other words, people just estimate P'(A | B,C) according to observed relative frequencies. This is a reasonable technique when you have a large number of samples compared to the size of the space A x B x C, but it (naturally) falls apart when this is not the case, as typically happens with “big” learning problems such as machine translation, vision, etc…

To compensate, people often try to pick some prior (such as a Dirichlet prior with one “virtual count” per joint parameter setting) to provide a reasonable default value for the count. Naturally, in the “big learning” situations where this applies, the precise choice of prior can greatly affect the system performance, leading to finicky tuning of various sorts. It’s also fairly common to fit some parametric model (such as a Gaussian) in an attempt to predict A given B and C.
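For concreteness, here is a rough sketch of counting with virtual counts. It is an illustration under assumptions: the helper names `fit_counts` and `estimate` are made up, the toy data is invented, and the denominator uses the joint parent count N(B,C) directly rather than the product form N(B) N(C) / N written above.

```python
from collections import Counter


def fit_counts(samples):
    """Count joint (a, b, c) occurrences and parent (b, c) occurrences."""
    joint = Counter(samples)
    parents = Counter((b, c) for _, b, c in samples)
    return joint, parents


def estimate(a, b, c, joint, parents, a_values, virtual=1.0):
    """Smoothed relative frequency for P'(a | b, c).

    `virtual` is the number of virtual counts added to every joint setting.
    With virtual=0 this is the raw relative frequency, which is undefined
    when the parent setting (b, c) was never observed.
    """
    return (joint[(a, b, c)] + virtual) / (parents[(b, c)] + virtual * len(a_values))


# Hypothetical toy data: tuples of (a, b, c).
samples = [(1, 0, 0), (1, 0, 0), (0, 0, 0), (1, 1, 0)]
joint, parents = fit_counts(samples)

seen = estimate(1, 0, 0, joint, parents, a_values=[0, 1])    # (2 + 1) / (3 + 2)
unseen = estimate(1, 1, 1, joint, parents, a_values=[0, 1])  # (0 + 1) / (0 + 2)
```

For a parent setting that never appears in the data, the estimate is entirely determined by the virtual counts, which is why the choice of prior matters so much when counts are low.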

Stepping back a bit, we can think of the estimation of P(A | B, C) as a simple self-contained prediction (sub)problem. Why don’t we use existing technology for doing this prediction? Viewed as a learning algorithm “counting with a Dirichlet prior” is exactly memorizing the training set and then predicting according to either (precisely) matching training set elements or using a default. It’s hard to imagine a more primitive learning algorithm.

There seems to be little downside to trying this approach. In low count situations, a general purpose prediction algorithm has a reasonable hope of performing well. In a high count situation, any reasonable general purpose algorithm converges to the same estimate as above. In either case something reasonable happens.

Using a general purpose probabilistic prediction algorithm isn’t a new idea (for example, see page 57), but it appears greatly underutilized. This is a very small modification of existing systems with a real hope of dealing with low counts in {speech recognition, machine translation, vision}. It seems that using a general purpose probabilistic prediction algorithm should be the default rather than the exception.

## Multiplication of Learned Probabilities is Dangerous

This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is.

Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example.

When Naive Bayes is used, a set of probabilities of the form Pr'(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing:

Pr'(label) * Product_{features i} Pr'(feature i | label)

(The Pr’ notation indicates these are estimated values.)
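A minimal sketch of this prediction rule for binary features might look as follows. The function names, the toy “spam”/“ham” data, and the choice of one virtual count per outcome are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples, virtual=1.0):
    """Estimate Pr'(label) and Pr'(feature i = 1 | label) from binary data."""
    label_counts = Counter(label for _, label in examples)
    ones = defaultdict(Counter)  # ones[label][i] = number of times feature i was 1
    for features, label in examples:
        for i, f in enumerate(features):
            ones[label][i] += f
    n_features = len(examples[0][0])
    total = len(examples)
    prior = {y: c / total for y, c in label_counts.items()}
    # Counting statistics plus `virtual` counts for each of the two outcomes.
    cond = {
        y: [(ones[y][i] + virtual) / (label_counts[y] + 2 * virtual)
            for i in range(n_features)]
        for y in label_counts
    }
    return prior, cond


def predict(features, prior, cond):
    """Return the label maximizing Pr'(label) * prod_i Pr'(feature i | label)."""
    def log_score(y):
        s = math.log(prior[y])
        for f, p in zip(features, cond[y]):
            s += math.log(p if f else 1.0 - p)
        return s
    return max(prior, key=log_score)


# Hypothetical toy data: (binary feature vector, label).
examples = [((1, 1, 0), "spam"), ((1, 0, 0), "spam"),
            ((0, 0, 1), "ham"), ((0, 1, 1), "ham")]
prior, cond = train(examples)
label = predict((1, 1, 0), prior, cond)
```

The sketch sums logarithms instead of multiplying probabilities. The maximizing label is the same, and it avoids numerical underflow when there are many features.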

There is nothing wrong with this method as long as (a) the prior for the sample counts is very strong and (b) the prior (on the conditional independences and the sample counts) is “correct”—the actual problem is drawn from it. However, (a) seems to never be true and (b) is often not true.

At this point, we can think a bit from an estimation perspective. When trying to estimate a coin with bias Pr(feature i | label), after observing m IID samples, the estimate is accurate to (at most) c/m for some constant c. (Actually, it’s c/m^0.5 in the general case and c/m for coins with bias near 0 or 1.) Given this observation, we should expect the estimates Pr’ to differ by c/m or more when the prior on the sample counts is weak.

The problem to notice is that errors of c/m can quickly accumulate. The final product in the Naive Bayes classifier is n-way linear in the error terms, where n is the number of features. If every feature’s true value happens to be v and we happen to have a 1/2 + 1/n^0.5 fraction of the feature estimates too large and a 1/2 - 1/n^0.5 fraction too small (as might happen with a reasonable chance), the value of the product might be overestimated by:

(v - c/m)^(n/2 - n^0.5) (v + c/m)^(n/2 + n^0.5) - v^n

When c/m is very small, this approximates as c n^0.5 / m, which suggests problems must arise when the number of features n is greater than the number of samples squared, n > m^2. This can actually happen in the text classification settings where Naive Bayes is often applied.
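The accumulation can be checked numerically. The sketch below computes the relative overestimate of the product (the ratio to v^n, minus one) and compares it to a first-order approximation. The values of v, c, n, and m are arbitrary choices for illustration, and the 2/v constant comes from this sketch's own Taylor expansion rather than from the text, which suppresses constants.

```python
import math


def relative_overestimate(v, eps, n):
    """Relative error of the product when each estimate is off by eps = c/m.

    n/2 + sqrt(n) factors are v + eps and n/2 - sqrt(n) factors are v - eps;
    the true product is v**n. Logs are used to avoid underflow for large n.
    """
    root = math.sqrt(n)
    log_ratio = ((n / 2 + root) * math.log(v + eps)
                 + (n / 2 - root) * math.log(v - eps)
                 - n * math.log(v))
    return math.expm1(log_ratio)


v, c = 0.5, 1.0   # assumed true feature probability and constant
n = 10_000        # number of features

rows = []
for m in (10_000, 100_000, 1_000_000):  # number of samples
    eps = c / m
    first_order = 2 * math.sqrt(n) * eps / v  # grows like c * n^0.5 / m
    rows.append((m, relative_overestimate(v, eps, n), first_order))
```

In this regime the relative error tracks n^0.5 c/m up to a constant. The first-order approximation only holds while that quantity is small; once n^0.5 c/m is of order one, higher-order terms take over and the product estimate is unreliable.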

All of the above is under the assumption that the conditional independences encoded in the Naive Bayes classifier are correct for the problem. When these aren’t correct, as is often true in practice, the estimation errors can be systematic rather than stochastic implying much more brittle behavior.

In all of the above, note that we used Naive Bayes as a simple example; this brittleness can be found in a number of other common prediction systems.

What can be done about this brittleness? There are a couple of options:

1. Use a different system for prediction (there are many).
2. Get much more serious about following Bayes law here. (a) The process of integrating over a posterior rather than taking the maximum likelihood element of a posterior tends to reduce the sampling effects. (b) Realize that the conditional independence assumptions producing the multiplication are probably excessively strong and design softer priors which better fit reasonable beliefs.

## On Coding via Mutual Information & Bayes Nets

Say we have two random variables X,Y with mutual information I(X,Y). Let’s say we want to represent them with a Bayes net of the form X <- M -> Y, such that the entropy of M equals the mutual information, i.e. H(M)=I(X,Y). Intuitively, we would like our hidden state to be as simple as possible (entropy wise). The data processing inequality means that H(M)>=I(X,Y), so the mutual information is a lower bound on how simple M could be. Furthermore, if such a construction existed it would have a nice coding interpretation: one could jointly code X and Y by first coding the mutual information, then coding X with this mutual info (without Y) and coding Y with this mutual info (without X).
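For reference, the lower bound follows from a short chain of inequalities. This is a sketch, assuming M is discrete and that X and Y are conditionally independent given M (which is what the net X <- M -> Y encodes):

```latex
\begin{align*}
I(X;Y) &\le I(X;M) && \text{(data processing, since } X \to M \to Y \text{ is a Markov chain)} \\
       &= H(M) - H(M \mid X) \\
       &\le H(M) && \text{(since } H(M \mid X) \ge 0 \text{ for discrete } M\text{)}
\end{align*}
```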

It turns out that such a construction does not exist in general (Thx Alina Beygelzimer for a counterexample! see below for the sketch).

What are the implications of this? Well, it’s hard for me to say, but it does suggest to me that the ‘generative’ model philosophy might be burdened with a harder modeling task. If all we care about is an information-theoretic, compact hidden state, then constructing an accurate Bayes net might be harder, due to the fact that it takes more bits to specify the distribution of the hidden state. In fact, since we usually condition on the data, it seems odd that we should bother specifying a (potentially more complex) generative model. What are the alternatives? The information bottleneck seems interesting, though this has peculiarities of its own.

Alina’s counterexample:

Here is the joint distribution P(X,Y). Sample binary X from an unbiased coin. Now choose Y to be the OR function of X and some other ‘hidden’ random bit (uniform). So the joint is:

P(0,0)=1/4
P(0,1)=1/4
P(1,0)=0
P(1,1)=1/2

Note P(X=1)=1/2 and P(Y=1)=3/4. Here,

I(X,Y) = 3/4 log2(4/3) ~= 0.31 bits
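The value can be checked directly from the joint table. This is a small sketch, assuming logarithms base 2 (so the result is in bits); the function name `mutual_information` is just a label for this example.

```python
import math

# Joint distribution P(X, Y) from the counterexample, keyed by (x, y).
joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.5}


def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # Zero-probability cells contribute nothing to the sum.
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)


mi = mutual_information(joint)
closed_form = 0.75 * math.log2(4 / 3)
```

Both `mi` and `closed_form` come out to roughly 0.31 bits.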

The rest of the proof showing that this is not achievable in a ‘compact’ Bayes net is in a comment.

## Watchword: model

In everyday use a model is a system which explains the behavior of some system, hopefully at the level where some alteration of the model predicts some alteration of the real-world system. In machine learning “model” has several variant definitions.

1. Everyday. The common definition is sometimes used.
2. Parameterized. Sometimes model is a short-hand for “parameterized model”. Here, it refers to a model with unspecified free parameters. In the Bayesian learning approach, you typically have a prior over (everyday) models.
3. Predictive. Even further from everyday use is the predictive model. Examples of this are “my model is a decision tree” or “my model is a support vector machine”. Here, there is no real sense in which an SVM explains the underlying process. For example, an SVM tells us nothing in particular about how alterations to the real-world system would create a change.

Which definition is being used at any particular time is important information. For example, if it’s a parameterized or predictive model, this implies some learning is required. If it’s a predictive model, then the set of operations which can be done to the model is restricted with respect to everyday usage. I don’t have any particular advice here other than “watch out”: be aware of the distinctions, watch for this source of ambiguity, and clarify when necessary.