Learning Theory, by assumption

One way to organize learning theory is by assumption (in the assumption = axiom sense), from no assumptions to many assumptions. As you travel down this list, the statements become stronger, but the scope of applicability decreases.

  1. No assumptions
    1. Online learning There exist a meta prediction algorithm which compete well with the best element of any set of prediction algorithms.
    2. Universal Learning Using a “bias” of 2– description length of turing machine in learning is equivalent to all other computable biases up to some constant.
    3. Reductions The ability to predict well on classification problems is equivalent to the ability to predict well on many other learning problems.
  2. Independent and Identically Distributed (IID) Data
    1. Performance Prediction Based upon past performance, you can predict future performance.
    2. Uniform Convergence Performance prediction works even after choosing classifiers based on the data from large sets of classifiers.
  3. IID and partial constraints on the data source
    1. PAC Learning There exists fast algorithms for learning when all examples agree with some function in a function class (such as monomials, decision list, etc…)
    2. Weak Bayes The Bayes law learning algorithm will eventually reach the right solution as long as the right solution has a positive prior.
  4. Strong Constraints on the Data Source
    1. Bayes Learning When the data source is drawn from the prior, using Bayes law is optimal

This doesn’t include all forms of learning theory, because I do not know them all. If there are other bits you know of, please comment.

Watchword: Loss

A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss.

There seems to be a strong dichotomy between two views of what “loss” means in learning.

  1. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer.
  2. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SVMs), “log loss” (common in Bayesian Learning), and “exponential loss” (one incomplete explanation of boosting). One advantage of this viewpoint is that an appropriate choice of loss function (such as any of the above) results in a (relatively tractable) convex optimization problem.

I don’t fully understand the second viewpoint. It seems (to some extent) like looking where the light is rather than where your keys fell on the ground. Many of these losses-of-convenience also seem to have behavior unlike real world problems. For example in this contest somebody would have been the winner except they happened to predict one example incorrectly with very low probability. Under log loss, their loss became very high. This does not seem to correspond to the intuitive notion of what the loss should be on the problem.

Watchword: Assumption

“Assumption” is another word to be careful with in machine learning because it is used in several ways.

  1. Assumption = Bias There are several ways to see that some form of ‘bias’ (= preferring of one solution over another) is necessary. This is obvious in an adversarial setting. A good bit of work has been expended explaining this in other settings with “no free lunch” theorems. This is a usage specialized to learning which is particularly common when talking about priors for Bayesian Learning.
  2. Assumption = “if” of a theorem The assumptions are the ‘if’ part of the ‘if-then’ in a theorem. This is a fairly common usage.
  3. Assumption = Axiom The assumptions are the things that we assume are true, but which we cannot verify. Examples are “the IID assumption” or “my problem is a DNF on a small number of bits”. This is the usage which I prefer.

One difficulty with any use of the word “assumption” is that you often encounter “if assumption then conclusion so if not assumption then not conclusion“. This is incorrect logic. For example, with variant (1), “the assumption of my prior is not met so the algorithm will not learn”. Or, with variant (3), “the data is not IID, so my learning algorithm designed for IID data will not work”. In each of these cases “will” must be replaced with “may” for correctness.

Learning Complete Problems

Let’s define a learning problem as making predictions given past data. There are several ways to attack the learning problem which seem to be equivalent to solving the learning problem.

  1. Find the Invariant This viewpoint says that learning is all about learning (or incorporating) transformations of objects that do not change the correct prediction. The best possible invariant is the one which says “all things of the same class are the same”. Finding this is equivalent to learning. This viewpoint is particularly common when working with image features.
  2. Feature Selection This viewpoint says that the way to learn is by finding the right features to input to a learning algorithm. The best feature is the one which is the class to predict. Finding this is equivalent to learning for all reasonable learning algorithms. This viewpoint is common in several applications of machine learning. See Gilad’s and Bianca’s comments.
  3. Find the Representation This is almost the same as feature selection, except internal to the learning algorithm rather than external. The key to learning is viewed as finding the best way to process the features in order to make predictions. The best representation is the one which processes the features to produce the correct prediction. This viewpoint is common for learning algorithm designers.
  4. Find the Right Kernel The key to learning is finding the “right” kernel. The optimal kernel is the one for which K(x, z)=1 when x and z have the same class and 0 otherwise. With the right kernel, an SVM(or SVM-like optimization process) can solve any learning problem. This viewpoint is common for people who work with SVMs.
  5. Find the Right Metric The key to learning is finding the right metric. The best metric is one which states that features with the same class label have distance 0 while features with different class labels have distance 1. With the best metric, the nearest neighbor algorithm can solve any problem.

Each of these viewpoints seems to be “right”, and each seems to have some utility in it’s context. It also seems important to realize that these are just different versions of the same problem. One consequence of this observation is that “wrapper methods” which try to automatically find a subset of features to feed into a learning algorithm in order to improve learning performance are simply trying to repair weaknesses in the learning algorithm.