Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used.

**Bayesian** The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (e.g. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at *Y* to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of *Y*. The probability is then *Y/(Y+1)*. Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively.

**Frequentist** The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs *k* times in *n* experiments then it has probability about *k/n*. Frequentist probabilities can be used to measure how sure you are about something. They may be appropriate in a learning context for measuring confidence in various predictors. The frequentist notion of probability is common in physics, other sciences, and computer science theory.

**Estimated** The estimated notion of probability is measured by running some learning algorithm which predicts the probability of events rather than events. I tend to dislike this use of the word because it confuses the world with the model of the world.
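The two measurable definitions above (betting odds and rate of occurrence) can be written down in a few lines. This is only a sketch: the function names and the example numbers are made up for illustration and do not come from the post.

```python
from fractions import Fraction


def probability_from_odds(y):
    """Degree of belief implied by switching sides at Y-to-1 odds: Y / (Y + 1)."""
    y = Fraction(y)
    return y / (y + 1)


def relative_frequency(k, n):
    """Frequentist estimate: an event seen k times in n experiments."""
    if n <= 0:
        raise ValueError("need at least one experiment")
    return Fraction(k, n)


belief = probability_from_odds(3)   # hypothetical bettor switches sides at 3-to-1 odds
rate = relative_frequency(30, 40)   # hypothetical event seen 30 times in 40 trials
print(belief, rate)
```

The two numbers can coincide while meaning different things: the first describes a bettor, the second describes a repeated experiment.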

To avoid confusion, you should be careful to understand what other people mean by this word. It is helpful to always be explicit about which variables are randomized and which are constant whenever probability is used because Bayesian and Frequentist probabilities commonly switch this role.

What a way to start out! There’s an unfortunate possibility of Holy War, but I still thought it would be worth giving a somewhat different impression of the types of probability. I certainly agree that interpretation is important and worth understanding. Here are a few of the different views the way I see them:

The Bayesian point of view: It is said that there are more varieties of Bayesian than there are statisticians. I specifically mean the Laplace-Cox-Jaynes variant. As John said, probability measures one’s degree of belief in an event. Bayesians generally believe that in the absence of adversarial uncertainty, probability is the uniquely correct way to consider uncertainty. All Bayesian reasoning looks something like this: I believe this proposition, therefore I believe some other proposition.

Frequentism, on the other hand, is almost not an interpretation at all. For instance, how does one interpret “independence” as a frequentist? It certainly doesn’t mean causal or physical independence. The Frequentist takes the law of large numbers as essentially axiomatic, while the Bayesian takes it as a theorem connecting events in the world to beliefs. Frequentist probability is really just relative frequencies of certain events.

An important point I’d disagree with John on is the generality of each. A Bayesian agrees that everything a Frequentist proves is true. For instance, a proof that, say, a sorting algorithm works well “in the average case” means that an agent who believes that each of the possible pivots is equally likely also believes (mostly!) that the algorithm will sort quickly. In fact, a Bayesian would generally have no quarrel with a computer scientist’s probabilistic statements at all. In cases of inferring from data, it’s usually just that the Bayesian doesn’t find the Frequentist’s statements interesting.

Which brings us to John’s last point:

It’s really more fundamental than this: the entire terminology of “random” sits poorly with the Bayesian. Randomness doesn’t really enter into consideration; the essential issue is what quantities are uncertain. Bayesians would much prefer, and sometimes say, “uncertain variables” where frequentists would say “random variables”.

For the Bayesian, the data is certain; it’s not particularly interesting to ask questions about “randomization” over data you already have. The uncertain thing is usually some model or hypothesis.

This reflects itself in many ways. For instance, a frequentist provides “confidence intervals” during estimation. The tortured interpretation is that if we say we have a measurement of a parameter with confidence \alpha, then at least a fraction \alpha of the intervals computed over many “random” instances of data will contain the true value of the parameter inside their limits. It’s actually rather hard to say correctly…
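Since the statement is hard to say correctly, it may help to see it simulated. The sketch below is an illustration only, under assumptions that are mine rather than the thread’s: Gaussian data with a known true mean of 2.0, and a normal-approximation interval with z = 1.96 (roughly \alpha = 0.95). The “randomness” is over repeated datasets, while the parameter stays fixed.

```python
import random
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Normal-approximation interval for the mean (z=1.96 gives roughly 95%)."""
    m = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / len(sample) ** 0.5
    return m - half_width, m + half_width


def coverage(true_mean=2.0, n=50, trials=2000, seed=0):
    """Fraction of simulated datasets whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A fresh "random instance of data"; the parameter itself never changes.
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        low, high = mean_confidence_interval(sample)
        hits += low <= true_mean <= high
    return hits / trials


print(coverage())  # should land near 0.95, though not exactly on it
```

Note what the simulation does not tell you: anything about the one interval computed from the data you actually have, which is the Bayesian complaint.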

In learning theory (see, for instance, John’s very nice tutorial on “Prediction Theory”) a similar thing happens. Bounds are given with respect to random draws of data. For a Bayesian, this simply isn’t the interesting quantity to know. The Bayesian might be interested in the *conditional* probability of failure of her algorithm *given* the data she has already seen.

Finally, it’s worth noting that many (most?) physicists would describe themselves as Bayesian. After all, it’s the probability of Laplace, Cox, and Jaynes, physicists all. In lectures, even highly regarded physicists who in writing seem to avoid interpretation per se, like, for instance, Robert Swendsen and Bob Griffiths, say remarkably Bayesian things in their descriptions of probability. Only fairly recently has Bayesian reasoning found a foothold in statistics, engineering and the information sciences.

Finally, it’s worth pointing out that there are some other interesting interpretations of probability:

a) Algorithmic Probability

and

b) Game Theoretic Probability

The latter is particularly interesting and threatens to out-generalize even the Bayesian point of view.

I think Drew’s response goes well beyond the subject of the post (watch out for “probability”), but it is an interesting subject.

The interpretation of independence as a frequentist is simply that one event’s randomization is not tied to another event’s randomization. In other words, my measurements of my scattering experiments do not influence your measurements of your scattering experiments.

One point which should be clarified is that both Bayesian and Frequentist notions of probability are manipulated with the same axioms.

I think each approach can rationalize the other. To make a Bayesian statement sit within Frequentist statistics, the reasoning goes: “If the world is drawn randomly from the prior, then using Bayes’ law is the correct mechanism to use data to make predictions.”
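That rationalization can be checked by brute force. In the sketch below (my own toy setup, not something from the thread) each “world” is a coin whose bias is drawn from a uniform prior. Among the worlds that happen to produce 7 heads in 10 flips, the rate of heads on the next flip should match the Bayesian posterior predictive, which for a uniform prior is Laplace’s rule of succession, (k+1)/(n+2).

```python
import random


def next_flip_rate(n=10, k=7, worlds=200_000, seed=0):
    """Among simulated worlds with k heads in n flips, how often is the next flip heads?"""
    rng = random.Random(seed)
    matching = heads_next = 0
    for _ in range(worlds):
        theta = rng.random()                      # world drawn from the uniform prior
        heads = sum(rng.random() < theta for _ in range(n))
        if heads == k:                            # keep only worlds that match the data
            matching += 1
            heads_next += rng.random() < theta    # one more flip in that world
    return heads_next / matching


bayes_prediction = (7 + 1) / (10 + 2)   # posterior predictive under a uniform prior
print(next_flip_rate(), bayes_prediction)
```

The agreement is a frequentist statement about a Bayesian procedure, and it holds only because the worlds really were drawn from the prior being used.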

I think the frequentist viewpoint on the Bayesian approach is that it is often naive. In particular, the process of writing down a correct prior and doing the necessary integrations is simply intractable. Consider, for instance, trying to predict the turbulent flow of air. The Bayesian recipe might be to form a prior over the locations of every atom, and then use Bayes’ law (with some very severe computations) to make predictions.

Practical people won’t do this. They will approximate in various ways and create a flawed but plausibly useful predictor. Frequentist statements give you some way to measure how good this approximate predictor is irrespective of the method of approximation.
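One standard example of such a statement is a held-out test-set bound from Hoeffding’s inequality. I am choosing it here for illustration; the comment above does not name a particular bound. It assumes the predictor was fixed before the test examples were drawn and that those examples are independent. Nothing in it depends on how the predictor was built.

```python
import math


def hoeffding_error_bound(test_errors, n_test, delta=0.05):
    """Upper bound on the true error rate that holds with probability >= 1 - delta
    over the random draw of n_test independent test examples."""
    if not 0 < delta < 1 or n_test <= 0:
        raise ValueError("need n_test > 0 and 0 < delta < 1")
    empirical = test_errors / n_test
    # One-sided Hoeffding: P(true - empirical >= slack) <= exp(-2 * n * slack**2)
    slack = math.sqrt(math.log(1 / delta) / (2 * n_test))
    return min(1.0, empirical + slack)


# Hypothetical numbers: 120 mistakes on 1000 held-out examples.
print(hoeffding_error_bound(120, 1000))
```

As with confidence intervals, the probability is over the draw of the test set, not over the predictor or the true error rate.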

I tend to agree that the interpretation you gave to confidence intervals is tortured. I think of it with a different interpretation: “If I use the confidence interval many times in my future, I won’t often be wrong.” In other words, you can compute confidence intervals, and you are usually correct to say, use, or rely on “the interval contains the true value”. This kind of result seems possibly useful in further automating learning.

My understanding is that thermodynamics has a reasonable Bayesian interpretation under the prior “every state of the system is equally likely”. I don’t understand a reasonable Bayesian interpretation of quantum mechanics, because it seems the basic laws simply predict rates of occurrence.

There is at least one more sense in which probability is used: the measure theoretic sense! I don’t know about physicists, but mainstream math people do use this viewpoint. In fact, this is the primary viewpoint they use. Although the measure theoretic treatment of probability is not commonly used in machine learning, it has its advantages. Simple concepts like conditional expectation, martingales and related concentration properties just make life so much easier. I am not sure if it’s easy to gather equivalent intuitions using frequency based techniques.

Finally, just repeating Drew’s point, the recent game theoretic foundations of probability by Shafer and Vovk are perhaps the most interesting for machine learning theory, since the framework seems to have all the important laws of large numbers (classical and martingale forms) of the measure theoretic set-up and, more importantly, it is very intuitive (you don’t have to start explaining what a sigma-algebra is!).

Let the holy war continue …

To follow up: because I find it difficult to understand formally “randomization” in the frequentist sense, I suppose I equally have a hard time understanding this version of independence.

I think no-one argues about the potential computational difficulty of Bayesian methods. This is why one considers things like Maximum Entropy as large n limits of Bayesian inference. (To deal with John’s thermodynamic example.) I think QM is a dangerous red herring for interpretation, but there is a fairly large set of physicists who would argue QM nevertheless has only a Bayesian interpretation. (I.e. it’s all about incomplete states of knowledge.) For instance, Carlton Caves argues that this is exactly what Gleason’s theorem is fundamentally telling us. I’d stay away from this whole scene though; I think it really distracts from the issues ML people should care about.

Finally, I think it’s important to distinguish a mathematical formalization from its interpretation. Many Bayesians and Frequentists happily use measure theoretic probability; it’s just that most applications in ML don’t need to bother with unnecessary formalizations. There are times when it’s helpful. (Deterministic generative models fit nicely here as does, for instance, Tatikonda’s work on belief propagation.)

I actually think that algebras of events (if not sigma- ones) are rather easy to understand in contrast to the game theoretic approach. The infinite varieties are hairy though. Vovk addresses this in a way I rather like: non-standard analysis. There are non-standard approaches to measure theoretic probability as well, like Edward Nelson’s “Radically Elementary Probability Theory”.

I can’t resist a further knife-twist, but I must thank Drew for demonstrating why probability is a watchword.

Since no-one disputes that there are computational (and perhaps even information theoretic) difficulties in applying Bayesian methods, no-one should dispute that approximation errors are an issue. Furthermore, no-one should dispute that techniques for quantifying performance, even when there are *arbitrary* forms of approximation, can be interesting. This is what confidence interval statements provide. We may not like their interpretation, but it seems plausible they have a useful role to play.

For QM, if you open a standard textbook and read the axioms of quantum mechanics, it seems very much like a frequentist interpretation. In my version of Liboff (which is second edition), postulate 3 uses words like “…One prepares a very large number (*N*) of identical replicas of *X*…” Bayesian interpretations of QM may exist, but I have not encountered or felt the need to encounter them. They do not seem to be in the standard material an undergraduate or graduate physicist learns at Caltech.

Hi John, The information theoretic comment almost certainly deserves its own post. I’m pretty sure I know exactly where you’re going and it’s a point worth emphasizing. (And it’s a fight I’d like to see between you and a rabid Bayesian.)

Regarding computation/approximation, I think there is a great deal of interest in quantifying the performance of an algorithm. For instance, bounding the distance to the true posterior or the true decision. And there’s nothing wrong, per se, with proving an algorithm will work well on average. It’s just more interesting to answer the conditional probability given the data we already see. I suspect we (mostly) see eye-to-eye…

I tend to think more physicists are frequentists. Or maybe it’s just that physicists are vague enough that I can’t pin down what they are.

Carl Caves has some nice notes which he calls “Resource material for promoting the Bayesian view of everything” where he discusses his joint work with Chris Fuchs and Rudiger Schack about a “Bayesian view of quantum theory.”