Prior, “Prior” and Bias

Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differ from one framework to the next.

  1. Prior (Bayesian): A prior is a probability distribution over a set of distributions which expresses the belief that a particular one of them is the distribution generating the data.
  2. “Prior” (Prediction bounds & online learning): The “prior” is a measure over a set of classifiers which expresses the degree to which you hope each classifier will predict well.
  3. Bias (Regularization, early termination of neural network training, etc.): The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another. (A sketch after this list shows how notions 1 and 3 can coincide.)
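
Though distinct, these notions can line up. As a standard illustration (not from the post): a Gaussian prior on the weights in linear regression yields, at the MAP solution, exactly the L2 bias of ridge regression. The data, noise scale, and regularization constant in this sketch are invented for illustration.

```python
import numpy as np

# Sketch: MAP estimation under a Gaussian prior equals ridge regression.
# All data and constants below are made up for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# lam corresponds to noise_variance / prior_variance in the Bayesian reading.
lam = 0.1
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_map)  # the MAP weights coincide with the ridge-regularized solution
```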

This only scratches the surface; there are yet more subtleties. For example, the meaning of probability itself (as discussed in the earlier post on the meaning of probability) shifts from one viewpoint to another.

Problem: Online Learning

Despite my best intentions, this is not a fully specified problem, but rather a research direction.

Competitive online learning is one of the more compelling pieces of learning theory because typical statements of the form “this algorithm will perform almost as well as a large set of other algorithms” rely only on fully-observable quantities, and are therefore applicable in many situations. Examples include Winnow, Weighted Majority, and Binomial Weighting. Algorithms with this property haven’t taken over the world yet. Here are some possible reasons:

  1. Lack of caring. Many people working on learning theory don’t care about particular applications much. This means constants in the algorithm are not optimized, usable code is often not produced, and empirical studies aren’t done.
  2. Inefficiency. Viewed from the perspective of other learning algorithms, online learning is terribly inefficient. It requires that every hypothesis (called an expert in the online learning setting) be enumerated and tested on every example; a minimal sketch of this pattern appears after the list. (This is similar to the inefficiency of using Bayes’ law directly as an algorithm, and there are strong similarities in the algorithms.)
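
To make the enumerate-and-test pattern concrete, here is a minimal sketch of Weighted Majority; the expert interface, penalty parameter, and toy data are assumptions for illustration, not anything from the post.

```python
import numpy as np

def weighted_majority(experts, stream, beta=0.5):
    """experts: list of callables x -> {0, 1}; stream: iterable of (x, y);
    beta in (0, 1) is the multiplicative penalty for a wrong prediction."""
    w = np.ones(len(experts))
    mistakes = 0
    for x, y in stream:
        votes = np.array([e(x) for e in experts])  # every expert runs on every example
        pred = int(w[votes == 1].sum() >= w[votes == 0].sum())
        mistakes += int(pred != y)
        w[votes != y] *= beta                      # shrink every expert that erred
    return mistakes, w

# Toy usage: experts are threshold rules on a scalar input.
experts = [lambda x, t=t: int(x > t) for t in (0.2, 0.5, 0.8)]
stream = [(x, int(x > 0.5)) for x in np.linspace(0, 1, 20)]
print(weighted_majority(experts, stream))
```

The classic analysis bounds the total mistakes by a multiple of the best expert’s mistakes plus a log(number of experts) term, which is exactly the kind of fully-observable guarantee described above.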

For an example of (1), there is a mysterious factor of 2 in the Binomial Weighting algorithm which has not been fully resolved. Some successful applications also exist, such as those based on SNoW.

The way to combat problem (2) is to introduce more structure into the hypotheses/experts. Some efforts have already been made in this direction. For example, it’s generally feasible to introduce an arbitrary bias or “prior” over the experts in the form of some probability distribution, and perform well with respect to that bias (a sketch of how such a prior enters the updates appears below). Another nice piece of work by Adam Kalai and Santosh Vempala discusses how to efficiently handle several forms of structured experts. At an intuitive level, further development discovering how to efficiently work with new forms of structure might pay off well.
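
As one standard way an arbitrary “prior” enters such algorithms, here is a sketch of the exponential-weights (Hedge) update with the weights initialized at the prior instead of uniform; the function name, learning rate, and loss-matrix interface are assumptions for illustration.

```python
import numpy as np

def hedge_with_prior(loss_matrix, prior, eta=0.5):
    """loss_matrix: (T, N) array of per-round expert losses in [0, 1];
    prior: length-N probability vector over the experts."""
    w = np.array(prior, dtype=float)   # start at the prior, not uniform
    total = 0.0
    for losses in loss_matrix:
        p = w / w.sum()                # current distribution over experts
        total += p @ losses            # algorithm's expected loss this round
        w *= np.exp(-eta * losses)     # multiplicative update
    return total
```

Initializing at the prior makes the regret against expert i scale with log(1/prior[i]) rather than log N, so the experts you hope will predict well are cheaper to compete with.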

The basic research direction here is to address the gap between theory and practice, possibly by solving the above problems and possibly by discovering and addressing other problems.

NIPS: Online Bayes

One nice use for this blog is to consider and discuss papers that have appeared at recent conferences. I really enjoyed Andrew Ng and Sham Kakade’s paper Online Bounds for Bayesian Algorithms. From the paper:

The philosophy taken in the Bayesian methodology is often at odds with
that in the online learning community…. the online learning setting
makes rather minimal assumptions on the conditions under which the
data are being presented to the learner —usually, Nature could provide
examples in an adversarial manner. We study the performance of
Bayesian algorithms in a more adversarial setting… We provide
competitive bounds when the cost function is the log loss, and we
compare our performance to the best model in our model class (as in
the experts setting).

It’s a very nice analysis of some of my favorite algorithms, all hinging on a beautiful theorem:

Let Q be any distribution over parameters \theta. Then for all sequences S:

L_{\text{Bayes}}(S) \leq L_Q(S) + \text{KL}(Q \| P)

where P is our prior, L_{\text{Bayes}}(S) is the cumulative log loss of the Bayes algorithm run online on S, and L_Q(S) is the expected cumulative log loss when parameters are drawn from an arbitrary distribution Q.
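
The theorem is easy to check numerically. Below is a small sketch for Bernoulli experts, using the fact that the online Bayes algorithm’s cumulative log loss telescopes to the negative log marginal likelihood; the parameter grid, prior, comparator Q, and data sequence are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
thetas = np.array([0.1, 0.3, 0.5, 0.7, 0.9])     # Bernoulli parameter grid
prior = np.full(len(thetas), 1.0 / len(thetas))  # the prior P (uniform here)
seq = rng.binomial(1, 0.7, size=50)              # any sequence works, even adversarial

# Cumulative log loss of each expert theta on the whole sequence.
expert_losses = np.array(
    [-np.log(np.where(seq == 1, t, 1 - t)).sum() for t in thetas]
)

# Online Bayes: its cumulative log loss equals -log of the marginal likelihood.
L_bayes = -np.log(np.sum(prior * np.exp(-expert_losses)))

# Comparator Q: e.g., a point mass on the best expert in hindsight.
Q = np.zeros(len(thetas))
Q[np.argmin(expert_losses)] = 1.0
L_Q = np.sum(Q * expert_losses)
KL = np.sum(Q[Q > 0] * np.log(Q[Q > 0] / prior[Q > 0]))

print(f"L_Bayes = {L_bayes:.3f} <= L_Q + KL(Q||P) = {L_Q + KL:.3f}")
```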

Any thoughts? Any other papers you think we should read?