I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of:
- Specify a prior over world models.
- Integrate using Bayes’ law with respect to all observed information to compute a posterior over world models.
- Predict according to the posterior.
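For concreteness, here is a minimal sketch of these three steps over a finite set of world models; the model class (coin biases) and the data are invented for illustration.

```python
# A minimal sketch of the three steps over a finite set of world models.
# The "world models" here are coin biases -- an invented toy example.
import numpy as np

thetas = np.linspace(0.01, 0.99, 99)             # candidate world models
prior = np.full_like(thetas, 1.0 / len(thetas))  # step 1: specify a prior

observations = [1, 0, 1, 1]                      # observed coin flips
likelihood = np.ones_like(thetas)
for x in observations:                           # step 2: Bayes' law
    likelihood *= thetas if x == 1 else (1.0 - thetas)
posterior = prior * likelihood
posterior /= posterior.sum()

# step 3: predict according to the posterior (posterior predictive of heads)
p_heads = float((posterior * thetas).sum())
print(p_heads)
```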
Bayesian learning has many advantages over other learning programs:
- Interpolation: Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the Mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. This means that a recipe of “think harder” (about specifying a prior over world models) and “compute harder” (to calculate a posterior) will eventually succeed. Many other machine learning approaches don’t have this guarantee.
- Language: Bayesian and near-Bayesian methods have an associated language for specifying priors and posteriors. This is significantly helpful when working on the “think harder” part of a solution.
- Intuitions: Bayesian learning involves specifying a prior and integration, two activities which seem to be universally useful. (see intuitions).
With all of these advantages, Bayesian learning is a strong program. However, there are also some very significant disadvantages.
- Information-theoretically infeasible: It turns out that specifying a prior is extremely difficult. Roughly speaking, we must specify a real number for every setting of the world model parameters. Many people well-versed in Bayesian learning don’t notice this difficulty for two reasons:
- They know languages allowing more compact specification of priors. Acquiring this knowledge takes some significant effort.
- They lie. They don’t specify their actual prior, but rather one which is convenient. (This shouldn’t be taken too badly, because it often works.)
- Computationally infeasible: Let’s suppose I could accurately specify a prior over every air molecule in a room. Even then, computing a posterior may be extremely difficult. This difficulty implies that computational approximation is required.
- Unautomatic: The “think harder” part of the Bayesian research program is (in some sense) a “Bayesian employment” act. It guarantees that as long as new learning problems exist, there will be a need for Bayesian engineers to solve them. (Zoubin likes to counter that a superprior over all priors can be employed for automation, but this seems to add to the other disadvantages.)
Overall, if a learning problem must be solved, a Bayesian should probably be working on it and has a good chance of solving it.
I wish I knew whether or not the drawbacks can be convincingly addressed. My impression so far is “not always”.
I self-identify as a Bayesian (although I am perfectly happy to use simpler methods when appropriate), and I would characterize some of those benefits of the approach differently.
Philosophically, I believe that Bayesian probability is the “right” way to think about inference and estimation, or, at least, it’s the best theory we have.
Every problem can be posed as a probabilistic inference problem, and Bayesian methods can do inference in all kinds of cases where no other method can help. For example, multiple sources of information (such as combining image and laser range measurements), hierarchical models (in which you don’t know the hyperparameters for the dataset you want to process), problems for which you need to use your domain knowledge, and complex models are all cases where your choices are (usually) either a Bayesian model or a fragile heuristic.
The big, big problem with the proper Bayesian approach is the many computational headaches one runs into. However, I believe it’s better to understand the real problem you’re trying to solve and then approximate that problem — while understanding what you’re losing in the approximation — rather than employing heuristics without an understanding of the deeper issues. Most of the computational difficulties (and difficulties defining priors) can be dodged by heuristics and approximations, although at some cost.
I feel like the ML community tends to overemphasize classification and regression in the abstract. These are problems for which the advantages of Bayesian methods are subtler than for the kinds of domain-specific models that people ought to be building for specific applications, rather than trying to shoehorn everything into a binary classification problem.
One thing to be aware of is that “one man’s heuristic is another man’s solid theory”. What is (and is not) a heuristic to a person has a lot to do with what you consider the goals of learning to be. Bayesians are almost entirely in camp (1), “the goal of learning is to solve individual learning problems”, while many other people are in camp (2), “the goal of learning is to solve all learning problems”. Bayesian learning looks great with respect to (1), but maybe not so good with respect to (2).
It’s interesting that you only point out the computational problems. My impression is that the information theoretic problem and the unautomaticity are “silent drawbacks”. Nobody who isn’t educated in Bayes-based learning really tries it, so you don’t hear from all the people with learning problems that aren’t Bayes-educated. The information-theoretic problems are also silent because there is a strong tendency to not hear about failures to find the “right” prior.
I agree with Aaron that the Bayesian way of thinking about inference is the “right” way: probability is epistemic, a degree-of-belief, rather than something “out there”. However, I think that the information-theoretic problem (Disadvantage 1 in John’s list) should be taken seriously.
It is my impression that in real-life inference problems it is often impossible to elicit the actual prior. In particular, we often have to live with the M-open scenario of Bernardo and Smith, i.e., we use model classes that do not contain the data-generating mechanism. John pointed out that using a convenient but wrong prior often works fine, but this sounds to me like a heuristic.
Sure enough, one can prove that Bayesian methods work quite well even in the M-open case, for instance in terms of regret (additional loss over the best individual model in a model class); see e.g. Hutter’s work. However, if this is the goal, why use Bayesian methods at all? Why not optimize this criterion directly?
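For concreteness, one standard formalization of this regret, under log-loss, looks like the following (the notation is mine, not Hutter’s):

```latex
% Cumulative log-loss regret of a predictor \hat{p} relative to the best
% model in a class \mathcal{M}, over a sequence x_1, ..., x_n:
R_n \;=\; \sum_{t=1}^{n} -\log \hat{p}(x_t \mid x_{<t})
      \;-\; \min_{m \in \mathcal{M}} \sum_{t=1}^{n} -\log p_m(x_t \mid x_{<t})
```

For the Bayes mixture over the class, the regret against any model m is at most ln(1/w_m), where w_m is the prior weight of m; this is the kind of guarantee alluded to above.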
I may not be enough of a learning expert to know if I’m rehashing old arguments, but I don’t see those as problems unique to Bayesian methods.
With regard to the information-theoretic problem, I thought the standard argument is that prior assumptions are implicit within all learning methods; they’re made explicit only in Bayesian methods. Although some Frequentist methods use bounds that should apply for any data distribution, the bounds aren’t that useful for small datasets, exactly when priors become important.
In terms of the “unautomatic” problem: I don’t see why this isn’t a problem for any other method. There are “general-purpose” techniques like Gaussian Processes, Bayesian neural nets, Infinite Mixture Models, etc., that can be treated as black boxes competitive with any non-Bayesian technique, if one is really not willing to put the extra modeling effort into the problem. These methods address the goal of “learning to solve all problems” just as well as any other technique. However, if you want to use a blackbox technique, then you may not care whether it’s Bayesian or not (except that the Bayesian model gives some extra benefits, like giving a probabilistic answer and providing the option of extending the model, doing model selection, etc.)
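To illustrate the black-box usage (using scikit-learn as a modern stand-in, not a tool from this discussion): a Gaussian Process regressor can be dropped in with essentially no modeling effort, fits its kernel hyperparameters by maximizing the marginal likelihood, and returns a probabilistic answer. Data here is invented.

```python
# A sketch of GP regression used as a "black box".
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

X = np.linspace(0, 10, 50)[:, None]                # toy inputs
y = np.sin(X).ravel() + 0.1 * np.random.randn(50)  # noisy toy targets

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel)  # hyperparameters are learned
gp.fit(X, y)                                  # by maximizing marginal likelihood

mean, std = gp.predict(X, return_std=True)    # probabilistic answer: mean + uncertainty
```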
On a related note, building a new model and algorithm for each new problem has been a huge hassle, and training students to do so is often a deal-killer. I would love to have a numerical package that could take a specification of a probabilistic model and a rough description of an inference algorithm, and perform efficient inference automatically, even on the kinds of crazy non-exponential models that I want to use. Of course, building a model specialized to a new problem in a non-Bayesian setting seems to be nearly impossible.
Regarding Aaron’s question: there has been some work on these things, like the AutoBayes system. Still, the generic sampler approach is preferred these days (WinBUGS), at least by the statisticians. In all, Bayesian modelling has become simpler, but it’s possible to do much more here, in the AutoBayes vein.
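The generic-sampler idea is simple at its core: the user supplies only an (unnormalized) log-posterior, and a general algorithm does the inference. A minimal sketch follows (model and data invented; real systems like WinBUGS are far more sophisticated):

```python
# Generic Metropolis sampler: inference from a user-supplied log-posterior.
import numpy as np

def log_posterior(theta, data):
    log_prior = -0.5 * theta ** 2                 # N(0, 1) prior on theta
    log_lik = -0.5 * np.sum((data - theta) ** 2)  # N(theta, 1) likelihood
    return log_prior + log_lik

def metropolis(log_post, data, n_steps=5000, step=0.5):
    theta, samples = 0.0, []
    lp = log_post(theta, data)
    for _ in range(n_steps):
        prop = theta + step * np.random.randn()      # propose a move
        lp_prop = log_post(prop, data)
        if np.log(np.random.rand()) < lp_prop - lp:  # accept/reject
            theta, lp = prop, lp_prop
        samples.append(theta)
    return np.array(samples)

data = np.array([1.2, 0.8, 1.5, 1.1])
posterior_draws = metropolis(log_posterior, data)  # approximate posterior
```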
John’s criticism of Bayesian statistics is to the point, and manifests the machine learning perspective: you just feed in the data, and do not specify the model at all. I guess that Hutter’s (and others’) work on universal priors addresses this problem in particular. Efficiency is also one of the concerns addressed by universal priors.
Teemu’s concern about loss functions is also valid. Herman Rubin’s work is quite clear about the inseparability of the utility and the prior. If one doesn’t want to accept log-score as the proximate utility function, then one needs to redefine the prior and the likelihood function. All this is fully compatible with the Bayesian framework, and we need not abandon it.
Finally, yes, let’s not be pretentious: Bayesian statistics is a heuristic. But I like it anyway, at least until something better comes along. My main reason is that it obeys the Epicurean maxim, which can be paraphrased as “Keep all hypotheses that are consistent with the facts.”
It is certainly a nice thing in Bayesian methods that the assumptions are made explicit. However, this is also a problem in a sense: one should be able to elicit all the prior knowledge and belief of the phenomenon in question. And like I said, this seems to me like an impossible task in practice. Take for instance image, speech, or text modeling. There will always be properties of the domain that have to be left out of the model.
In other words, one doesn’t use the actual subjective prior but one that is convenient; Bayarri and Berger make a similar point in “The interplay between Bayesian and frequentist analysis”.
Isn’t it somewhat disappointing that the Bayesian framework isn’t self-contained enough to do without having to lean towards frequentist techniques in order to justify itself?
Regarding loss functions and priors, even though I may be taking the risk of getting stuck on the definition of Bayesian methods, I do see some question marks in adjusting the prior and the likelihood based on the loss function. Shouldn’t inference and decision-making be two separate steps in the Bayesian framework?
I’m not sure what “small” is, but it is possible to have nontrivial distribution-free bounds for (say) 10 or 20 examples. (Note: this is only distribution free given the constraint of IID samples.)
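For concreteness, here is what a Hoeffding-style test-set bound gives at n = 20 (my choice of bound; there are tighter binomial-tail versions): with probability at least 1 − δ over the IID sample, the true error and the empirical error differ by no more than sqrt(ln(2/δ) / (2n)).

```python
# Hoeffding test-set bound at n = 20 examples, delta = 0.05.
import math

n, delta = 20, 0.05
eps = math.sqrt(math.log(2 / delta) / (2 * n))
print(f"deviation bound at n={n}: {eps:.3f}")  # about 0.30
```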
At the risk of descent into a flamewar: Bayesian neural nets are like neural nets, only slower. Gaussian Processes are like SVMs, only slower. I regard computational issues as a very significant distinction here. Any system for making predictions can be automated, but we don’t really expect the system to be generally efficient and effective unless we optimize for those properties. It may be that Bayes-motivated learning algorithms are the right answer to these criteria, but I haven’t seen convincing evidence yet.
I don’t quite understand Aleks. If we have a correct prior, then using Bayes’ law to construct a posterior and then making the choice that optimizes a loss function is optimal, for all losses. “Correct prior” means: the system producing data is drawn from the prior.
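A small sketch of that point (models, losses, and numbers invented): once the posterior is computed, the same recipe of minimizing posterior expected loss works for whatever loss function you plug in.

```python
# Bayes-optimal decision: minimize expected loss under the posterior.
import numpy as np

posterior = np.array([0.7, 0.2, 0.1])  # P(model | data), from Bayes' law
# loss[a, m] = loss of taking action a when model m is the truth
loss = np.array([[0.0, 1.0, 5.0],
                 [2.0, 0.0, 1.0],
                 [4.0, 3.0, 0.0]])

expected_loss = loss @ posterior             # E[loss | data] for each action
best_action = int(np.argmin(expected_loss))  # swap in any other loss and repeat
```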
There is no ‘correct’ Bayesian school. There are quite a few ways of looking at things. There is notable disagreement between the “objective” Bayesians (such as Bayarri and Berger) and the more “subjective” Bayesians. It’s just a question of what you hold more sacred: the utilities or the probabilities. A subjective Bayesian will tell you that if you pick a proper loss function and elicit your prior (that’s clearly not “correct”), frequentistically meaningful “probabilities” will come out. An objective Bayesian will fix her “correct” prior and likelihood function with appropriate “objective” justifications, separate decision-making from modelling, and work from there. But the actual operations are the same. What one cannot deny is that priors and utilities are inherently entangled. If you can stomach some vague philosophising, I have a paper on the topic.
As for efficiency, John is right: Bayesian methods do tend to be slower. They provide a benefit primarily in borderline cases: when there is not enough data to unambiguously estimate a complex model, for example. But many of the “anomalies” and the unexplained good performance of various heuristics in non-Bayesian methods are just run-of-the-mill benefits of following a Bayesian approach. Even if it’s hard to follow a 100% pure Bayesian approach, it helps to *think* in terms of approximating a Bayesian approach. Most of the recent innovations in machine learning can be interpreted as adoptions of a slightly more rigorous Bayesian approach. At the same time, Bayesians can see some new ways to do the approximations.
I disagree. Well, I agree that they’re a lot slower. But the important difference is that GPs and Bayes NNs provide elegant and easy ways to learn the hyperparameters. If there are a lot of hyperparameters (e.g., a kernel function with dozens of parameters), this matters.
If all you care about is classification, and you don’t mind tuning a few hyperparameters, and your data sets are reasonably large, and you don’t have any useful domain knowledge, then you’ll probably be happy about using SVMs for everything. These assumptions seem to characterize a lot of research in learning, and, as a consequence, applications of learning. However, I think this is an artificially limiting view.
Bayesian methods really shine when those assumptions don’t hold, and Bayesian methods are routinely used in situations where non-Bayesians would never dare venture. These are cases where, for example, measurements are ambiguous or noisy, you have valuable domain knowledge, multiple types of measurements, unknown hyperparameters, and so on.
For example, in one project I’ve worked on, we simultaneously inferred 3D motion of non-rigid objects from noisy video sequences with outliers. This is a case with many types of hidden variables, which all have complex interactions with each other, and are all very, very difficult to tune by hand. Moreover, we did not have any relevant training data — learning and inference are all done at the same time for each new input. I have no idea how this could possibly be done in a non-Bayesian context and still be automatic.
Incidentally, I’ve observed that most people outside of learning who want to use learning methods would like a magic black box that solves all learning problems (and vaguely feel like they’ve been promised this). So this is certainly what people want, and I’ve read plenty of papers in which one step of the system is an invocation of SVMlight or AdaBoost. I believe that the algorithm that solves all learning problems does not exist, but that methods like SVMs and GPs can be applied to many problems. I also think it does a disservice to focus entirely on universal systems, since you can get so much mileage out of building domain-specific models (such as the example I described above).
My impression is that coping with noise is not too difficult for non-Bayesian learning algorithms. The real key where Bayesian methods can greatly shine is when there is very strong domain knowledge.
Perhaps I’m a radical here, but I believe there is a black box learning algorithm that can solve most of the learning problems I care about. (“all” is clearly too much to hope for) The algorithm’s name might even be Aaron Hertzmann.
… which, of course, is Bayesian. 🙂
Seriously, whether a universal algorithm exists is a fascinating question, and I think the “superprior” approach is one plausible solution. I’m skeptical that a “universal” algorithm could exist in the way we use learning algorithms today, namely, on relatively small data sets. Humans use a tremendous amount of contextual information. So, you can do inference on a single datapoint, e.g., make inferences about someone’s personality from a short conversation, but you have lots and lots of contextual information that tells you what the data means, and you’ve also been observing humans all your life.
Bayes’ theorem and other learning models have been derived as solutions to information-theoretic optimization problems and are 100% efficient in the sense that input information equals output information for each learning model. See my 1988 American Statistician article, “Optimal Information Processing and Bayes’ Theorem”, Vol. 42, No. 4, pp. 278-294, for my derivation of Bayes’ theorem as an optimal information processing rule, with discussion by Jaynes, Kullback, Hill and Bernardo. In later published articles, other optimal learning models have been derived and applied, some not requiring the use of a prior density and/or a likelihood function. Recent papers on this topic and references to earlier published papers can be found on my home page: http://gsbwww.uchicago.edu/fac/arnold.zellner/more
In particular, see Sect. IV of my 2005 paper, “Some Thoughts about S. James Press and Bayesian Analysis”, which can be downloaded and will be published.
Happy holidays and the best of the best New Year!
Zellner’s arguments re: Bayes’ theorem are enlightening. His assumptions are primitive, and thus difficult to refute. The way Bayes’ theorem (and other learning theorems – these results will surprise some Bayesians) falls out of his use of variational optimization of net information loss is beautiful. These papers are well worth reading.
From a brief skim, that’s not a bad article.
Berger is often described as an objective Bayesian; I think Wikipedia describes him as the most prominent objective Bayesian at present. There is a quote from him saying objective Bayes is nothing but a collection of useful ad-hoc devices (which suggests he is really a subjectivist!). He does advocate the label “objective Bayes”, essentially for marketing purposes, in a recent issue of Bayesian Analysis, but in the lively discussion following that work most discussants are rather militantly subjective, arguing that objectivity is attempting to sell a lie; a few are cautiously in favour of the label. While a few would agree with Berger, I think objective Bayes is now a fringe idea, although the popularity of the work of Ed Jaynes in machine learning makes it seem less so than it is. I believe Zellner, who comments at the end of this page, might also advocate an objective Bayes approach, though I haven’t read much of Zellner’s work. The idea of objective Bayesian statistics was developed primarily by Harold Jeffreys; a paper by Dawid, Stone and Zidek is cited by most people as showing that objective Bayesian inference is not possible, although there are several other criticisms.
One thing that extreme subjectivists like de Finetti, Michael Goldstein and Frank Lad seem to have in common with objective Bayesians like Berger or Jaynes is the use of imprecisely defined models. It’s pretty obvious that imprecise prior specification is necessary in real problems: e.g., a prior over 32 bits requires 2^32 probability assertions, representing indifference to betting, that must sum to one. A realistic attempt to do this would not only require enormous amounts of work, but would probably cause underflows in C doubles! And this is very much a toy problem.
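The underflow worry is real once a model grows past toy size and many small probabilities get multiplied together; the standard dodge is to work in log-space. A sketch (the numbers are mine, chosen to trip the underflow):

```python
# Doubles bottom out around 1e-308, so long products of probabilities
# underflow; sums of log-probabilities do not.
import math

p_bit = 0.5
print(p_bit ** 1100)            # 0.0 -- underflows a double
log_p = 1100 * math.log(p_bit)  # about -762.46, fine in log-space
print(log_p)
```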
Common Bayesian practice at the moment is to attempt to fully specify a model. In practice, some coarse specification of the model is done, and convenience priors are used to model the remainder; a certain amount of variation of these priors is sometimes undertaken, i.e. ‘sensitivity analysis’. Also note that in any finite real-world problem, full specification is in principle possible: if you want to predict a result that will be stored in N bits, then there are ‘only’ 2^N possible results. The idea of infinite elicitation only occurs if you attempt to put a probability on a non-observable.
Perhaps ironically, a fully specified model is generally completely intractable, so the required calculations are approximated crudely, normally with Monte Carlo methods.
The full-specification approach is, I think, correct when (1) the number of outcomes of the experiment is small enough that a full probability distribution can realistically be elicited, and (2) you are talking about the beliefs of a single person.
In other cases life becomes messy. Essentially a model of coherent beliefs is put forward, but the analyst is cagey about whether anybody does or should hold those beliefs – either due to incomplete elicitation or because flat ‘objective’ priors were used.
I think machine learning might have quite a bit to gain from considering specification of incomplete models, from either the objective or the subjective perspective. From the objective side there are flat priors, Jeffreys priors, Maxent (Jaynes) and reference priors (Bernardo and Berger). From the subjective side there is the use of imprecise probabilities (see Peter Walley), Bayes Linear (new book by Michael Goldstein out soon!) and the fundamental theorem of prevision (Lad).
The idea of consensus becomes a fascinating topic. In a machine learning problem, is it possible for you and me both to have ‘reasonable priors’ and, after seeing the data, to disagree about the optimal decision rule and about the expected utility of the optimal decision rule? If this were generally the case it might have interesting implications. There are some papers that suggest this, although I can’t understand them because they use measure theory!