Watchword: Probability

Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used.

  1. Bayesian: The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (e.g. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of Y. The probability is then Y/(Y+1). Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively.
  2. Frequentist: The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in n experiments then it has probability about k/n. Frequentist probabilities can be used to measure how sure you are about something. They may be appropriate in a learning context for measuring confidence in various predictors. The frequentist notion of probability is common in physics, other sciences, and computer science theory.
  3. Estimated: The estimated notion of probability is measured by running some learning algorithm which predicts the probability of events rather than the events themselves. I tend to dislike this use of the word because it confuses the world with the model of the world.
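
As a minimal sketch of the first two notions in the list above, the Python snippet below converts a switching point of Y-to-1 odds into a degree of belief Y/(Y+1), and estimates a rate of occurrence as k/n from repeated experiments. The function names, the simulated event, and the seed are illustrative assumptions, not anything taken from the post.

```python
import random


def probability_from_odds(y):
    """Degree of belief implied by switching sides at Y-to-1 odds: Y / (Y + 1)."""
    return y / (y + 1)


def empirical_rate(event, n, rng):
    """Frequentist estimate k / n: repeat the experiment n times and count occurrences."""
    k = sum(1 for _ in range(n) if event(rng))
    return k / n


rng = random.Random(0)  # fixed seed so the run is repeatable

# A bettor who switches from 'for' to 'against' at 3-to-1 odds.
belief = probability_from_odds(3)  # 0.75

# A hypothetical repeatable experiment whose event occurs with rate 0.75.
rate = empirical_rate(lambda r: r.random() < 0.75, 10_000, rng)

print(belief, rate)
```

The two numbers can agree while meaning different things: the first describes one bettor's state of knowledge, the second describes the outcome of many repeated trials.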

To avoid confusion, you should be careful to understand what other people mean by this word. It is helpful to always be explicit about which variables are randomized and which are constant whenever probability is used because Bayesian and Frequentist probabilities commonly switch these roles.
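
One way to see the switch of roles is to estimate a coin's bias both ways. In the hedged sketch below, the frequentist computation holds the parameter constant and re-draws the data many times, while the Bayesian computation holds the observed data constant and treats the parameter as uncertain. The normal-approximation interval, the uniform prior, and all names and numbers are assumptions chosen for illustration.

```python
import math
import random


def normal_approx_interval(k, n, z=1.96):
    """Approximate 95% confidence interval for a rate from k successes in n trials."""
    p_hat = k / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def coverage(true_p, n, trials, rng):
    """Frequentist view: true_p is constant, the data are randomized on every trial."""
    hits = 0
    for _ in range(trials):
        k = sum(rng.random() < true_p for _ in range(n))
        lo, hi = normal_approx_interval(k, n)
        hits += lo <= true_p <= hi
    return hits / trials


def posterior_interval(k, n, draws, rng, mass=0.95):
    """Bayesian view: the data (k, n) are constant, the parameter is uncertain.

    Assumes a uniform prior, so the posterior is Beta(k + 1, n - k + 1);
    the interval is read off sorted posterior samples.
    """
    samples = sorted(rng.betavariate(k + 1, n - k + 1) for _ in range(draws))
    tail = (1 - mass) / 2
    return samples[int(tail * draws)], samples[int((1 - tail) * draws) - 1]


rng = random.Random(1)
cov = coverage(true_p=0.3, n=200, trials=2000, rng=rng)
lo, hi = posterior_interval(k=60, n=200, draws=20_000, rng=rng)
print(cov, (lo, hi))
```

The first number is a statement about repeated use of the procedure over fresh data; the second interval is a statement about the parameter given one fixed data set.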

7 Replies to “Watchword: Probability”

  1. What a way to start out! There’s an unfortunate possibility of Holy War,
    but I still thought it would be worth giving a somewhat different
    impression of the types of probability. I certainly agree that
    interpretation is important and worth understanding. Here is a summary
    of the different views the way I see them:

    The Bayesian point of view: It is said that there are more varieties
    of Bayesian than there are statisticians. I specifically mean the
    Laplace-Cox-Jaynes variant. As John said, probability measures one’s
    degree of belief in an event. Bayesians generally believe that in the
    absence of adversarial uncertainty, probability is the uniquely
    correct way to consider uncertainty. All Bayesian reasoning looks
    something like this: I believe this proposition, therefore I believe
    some other proposition.

    Frequentism, on the other hand, is almost not an interpretation at
    all. For instance, how does one interpret “independence” as a
    frequentist? It certainly doesn’t mean causal or physical
    independence. The Frequentist takes the law of large numbers as
    essentially axiomatic, while the Bayesian takes it as a theorem
    connecting events in the world to beliefs. Frequentist probability is
    really just relative frequencies of certain events.

    An important point I’d disagree with John on is the generality of
    each. A Bayesian agrees that everything a Frequentist proves is
    true. For instance, a proof that, say, a sorting algorithm works well
    “in the average case”, means that an agent who believes that each of
    the possible pivots is equally likely also believes (mostly!) that
    the algorithm will sort quickly. In fact, a Bayesian would generally have
    no quarrel with a computer scientist’s probabilistic statements at
    all. In cases of inferring from data, it’s usually just that the
    Bayesian doesn’t find the Frequentist’s statements
    interesting.

    Which brings us to John’s last point:

    “To avoid confusion, you should be careful to understand what other
    people mean by this word. It is helpful to always be explicit about
    which variables are randomized and which are constant whenever
    probability is used because Bayesian and Frequentist probabilities
    commonly switch these roles.”

    It’s really more fundamental than this: the entire terminology of
    “random” sits poorly with the Bayesian. Randomness doesn’t really
    enter into consideration– the essential issue is what quantities are
    uncertain. Bayesians would much prefer, and sometimes say,
    “uncertain variables” where frequentists would say “random variables”.
    For the Bayesian, the data is certain– it’s not particularly
    interesting to ask questions about “randomization” over data you
    already have. The uncertain thing is usually some model or
    hypothesis.

    This reflects itself in many ways. For instance, a frequentist
    provides “confidence intervals” during estimation. The tortured
    interpretation is that if we say we have a measurement of a parameter
    with confidence \alpha, then at least a fraction \alpha of the intervals
    computed over many “random” instances of data will contain the true
    value of the parameter within their limits. It’s actually rather hard
    to say correctly…

    In learning theory (see, for instance, John’s very nice tutorial on
    “Prediction Theory”) a similar thing happens. Bounds are given with
    respect to random draws of data. For a Bayesian, this simply isn’t the
    interesting quantity to know. The Bayesian might be interested in
    the conditional probability of failure of her algorithm given
    the data she has already seen.

    Finally, it’s worth noting that many (most?) physicists would describe
    themselves as Bayesian. After all, it’s the probability of Laplace,
    Cox, and Jaynes, physicists all. In lectures, even highly regarded
    physicists who in writing seem to avoid interpretation per se, like,
    for instance, Robert Swendsen and Bob Griffiths, say remarkably
    Bayesian things in their descriptions of probability. Only fairly
    recently has Bayesian reasoning found a foothold in statistics,
    engineering and the information sciences.

    Finally, it’s worth pointing out that there are some other interesting
    interpretations of probability:

    a) Algorithmic Probability

    and

    b) Game Theoretic Probability

    The latter is particularly interesting and threatens to out-generalize even the Bayesian point of view.

  2. I think Drew’s response goes well beyond the subject of the post (watch out for “probability”), but it is an interesting subject.

    “For instance, how does one interpret ‘independence’ as a frequentist?”

    The interpretation of independence as a frequentist is simply that one event’s randomization is not tied to another event’s randomization. In other words, my measurements of my scattering experiments do not influence your measurements of your scattering experiments.

    One point which should be clarified is that both Bayesian and Frequentist notions of probability are manipulated with the same axioms.

    “An important point I’d disagree with John on is the generality of each.”

    I think each approach can rationalize the other. To make a Bayesian statement sit within Frequentist statistics, the reasoning goes as: “If the world is drawn randomly from the prior, then using Bayes Law is the correct mechanism to use data to make predictions.”

    “… it’s usually just that the Bayesian doesn’t find the Frequentist’s statements interesting.”

    I think the frequentist viewpoint on the Bayesian approach is that it is often naive. In particular, the process of writing down a correct prior and doing the necessary integrations is simply intractable. Consider, for instance, trying to predict the turbulent flow of air. The Bayesian recipe might be to form a prior over the locations of every atom, and then use Bayes law (with some very severe computations) to make predictions.

    Practical people won’t do this. They will approximate in various ways and create a flawed but plausibly useful predictor. Frequentist statements give you some way to measure how good this approximate predictor is irrespective of the method of approximation.

    “For instance, a frequentist provides ‘confidence intervals’ during estimation.”

    I tend to agree that the interpretation you gave to confidence intervals is tortured. I think of it with a different interpretation: “If I use the confidence interval many times in my future, I won’t often be wrong.” In other words, you can compute confidence intervals, and you are usually correct to say, use, or rely on “the interval contains the true value”. This kind of result seems possibly useful in further automating learning.

    “Finally, it’s worth noting that many (most?) physicists would describe themselves as Bayesian.”

    My understanding is that thermodynamics has a reasonable Bayesian interpretation under the prior “every state of the system is equally likely”. I don’t understand a reasonable Bayesian interpretation of quantum mechanics, because it seems the basic laws simply predict rates of occurrence.

  3. There is at least one more sense in which probability is used – in the measure theoretic sense! I don’t know about physicists, but mainstream math people do use this viewpoint. In fact, this is the primary viewpoint they use. Although measure theoretic treatment of probability is not commonly used in machine learning, it has its advantages. Simple concepts like conditional expectation, martingales and related concentration properties just make life so much easier. I am not sure if it’s easy to gather equivalent intuitions using frequency based techniques.

    Finally, just repeating Drew’s point, the recent game theoretic foundations of probability by Shafer and Vovk are perhaps the most interesting for machine learning theory – since they seem to have all the important laws of large numbers (classical and martingale forms) of the measure theoretic set-up, and, more importantly, they are very intuitive [you don’t have to start explaining what a sigma-algebra is!].

    Let the holy war continue …

  4. To follow up: because I find it difficult to understand “randomization” formally in the frequentist sense, I suppose I equally have a hard time understanding this version of independence.
    I think no one argues about the potential computational difficulty of Bayesian methods. This is why one considers things like Maximum Entropy as large-n limits of Bayesian inference. (To deal with John’s thermodynamic example.) I think QM is a dangerous red herring for interpretation, but there is a fairly large set of physicists who would argue QM nevertheless has only a Bayesian interpretation. (I.e. it’s all about incomplete states of knowledge.) For instance, Carlton Caves argues that this is exactly what Gleason’s theorem is fundamentally telling us. I’d stay away from this whole scene though– I think it really distracts from the issues ML people should care about.

    Finally, I think it’s important to distinguish a mathematical formalization from its interpretation. Many Bayesians and Frequentists happily use measure theoretic probability– it’s just that most applications in ML don’t need to bother with unnecessary formalizations. There are times when it’s helpful. (Deterministic generative models fit nicely here as does, for instance, Tatikonda’s work on belief propagation.) I actually think that algebras of events (if not sigma-algebras) are rather easy to understand in contrast to the game theoretic approach. The infinite varieties are hairy though. Vovk addresses this in a way I rather like: non-standard analysis. There are non-standard approaches to measure theoretic probability as well, like Edward Nelson’s “Radically Elementary Probability Theory”.

  5. I can’t resist a further knife-twist, but I must thank Drew for demonstrating why probability is a watchword.

    Since no one disputes that there are computational (and perhaps even information theoretic) difficulties in applying Bayesian methods, no one should dispute that approximation errors are an issue. Furthermore, no one should dispute that techniques for quantifying performance, even when there are arbitrary forms of approximation, can be interesting. This is what confidence interval statements provide. We may not like their interpretation, but it seems plausible they have a useful role to play.

    For QM, if you open a standard textbook, and read the axioms of quantum mechanics, it seems very much like a frequentist interpretation. In my version of Liboff (which is second edition), postulate 3 uses words like “…One prepares a very large number (N) of identical replicas of X…” Bayesian interpretations of QM may exist, but I have not encountered or felt the need to encounter them. They do not seem to be in the standard material an undergraduate or graduate physicist learns at Caltech.

  6. 😉
    Hi John, The information theoretic comment almost certainly deserves its own post– I’m pretty sure I know exactly where you’re going and it’s a point worth emphasizing. (And it’s a fight I’d like to see between you and a rabid Bayesian.)
    Regarding computation/approximation, I think there is a great deal of interest in quantifying the performance of an algorithm. For instance, bounding the distance to the true posterior or the true decision. And there’s nothing wrong, per se, with proving an algorithm will work well on average. It’s just more interesting to answer the conditional probability given the data we already see. I suspect we (mostly) see eye-to-eye…

  7. I tend to think more physicists are frequentists. Or maybe it’s just that physicists are vague enough that I can’t pin down what they are.

    “I don’t understand a reasonable Bayesian interpretation of quantum mechanics, because it seems the basic laws simply predict rates of occurrence.”

    Carl Caves has some nice notes which he calls “Resource material for promoting the Bayesian view of everything” where he discusses his joint work with Chris Fuchs and Rudiger Schack about a “Bayesian view of quantum theory”.
