Intuitions from applied learning

Since learning is far from an exact science, it’s good to pay attention to basic intuitions of applied learning. Here are a few I’ve collected.

  1. Integration In Bayesian learning, the posterior is computed by an integral, and the optimal thing to do is to predict according to this integral. This phenomenon appears to be far more general: you can average over many different classification predictors to improve performance, and Bagging, Boosting, SVMs, and Neural Networks all take advantage of this idea to some extent. Sources: Zoubin, Caruana
  2. Differentiation The different components of an average should differentiate from each other to achieve good performance. This is known as the ‘symmetry breaking’ problem for neural networks, and it’s why weights are initialized randomly. Boosting explicitly attempts to achieve good differentiation by creating new, different learning problems. Sources: Yann LeCun, Phil Long
  3. Deep Representation A deep representation is necessary for a good general learner. Decision Trees and Convolutional Neural Networks take advantage of this. SVMs get around it by letting the user engineer knowledge into the kernel. Boosting and Bagging rely on another algorithm for this. Sources: Yann LeCun
  4. Fine Representation of Bias Many learning theory applications use only a coarse representation of bias, such as “function in the hypothesis class or not”. In practice, significantly better performance is achieved with a more finely tuned bias. Bayesian learning has this built in via the prior. Other techniques can take advantage of it as well. Sources: Zoubin, personal experience.
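The first two intuitions (averaging plus differentiation) can be illustrated with a minimal, self-contained sketch: a majority vote over classifiers that err independently is far more accurate than any single one. The flip probability and voter count below are arbitrary illustrative choices, not taken from any of the sources above.

```python
import random

def noisy_classifier(x, rng, flip_prob=0.3):
    """A weak classifier: predicts the true label sign(x), but flips its
    answer with probability flip_prob, simulating an imperfect predictor."""
    true_label = 1 if x >= 0 else -1
    return -true_label if rng.random() < flip_prob else true_label

def majority_vote(x, n_voters, rng):
    """Average over many classifiers whose errors are independent
    (the 'differentiation' requirement) via a simple majority vote."""
    votes = sum(noisy_classifier(x, rng) for _ in range(n_voters))
    return 1 if votes >= 0 else -1

rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(2000)]
truth = [1 if x >= 0 else -1 for x in xs]

single_acc = sum(noisy_classifier(x, rng) == t for x, t in zip(xs, truth)) / len(xs)
vote_acc = sum(majority_vote(x, 11, rng) == t for x, t in zip(xs, truth)) / len(xs)
print(single_acc, vote_acc)  # a lone ~70%-accurate voter vs. a ~92%-accurate majority
```

If the voters’ errors were perfectly correlated (no differentiation), the vote would be exactly as accurate as a single voter, which is why intuition 2 matters alongside intuition 1.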

If you have others, please comment on them.

The State of the Reduction

What? Reductions are machines which turn solvers for one problem into solvers for another problem.
Why? Reductions are useful for several reasons.

  1. Laziness. Reducing a problem to classification makes at least 10 learning algorithms available to solve it. Inventing 10 learning algorithms is quite a bit of work. Similarly, programming a reduction is often trivial, while programming a learning algorithm is a great deal of work.
  2. Crystallization. The problems we often want to solve in learning are worst-case impossible but average-case feasible. By reducing many problems to one or a few primitives, we can fine-tune those primitives to perform well on real-world problems with greater precision, thanks to the greater number of problems available for validation.
  3. Theoretical Organization. By studying what reductions are easy vs. hard vs. impossible, we can learn which problems are roughly equivalent in difficulty and which are much harder.

What we know now.

Typesafe reductions. In the beginning, there was the observation that every complex object on a computer can be written as a sequence of bits. This observation leads to the notion that a classifier (which predicts a single bit) can be used to predict any complex object. Using this observation, we can make the following statements:

  1. Any prediction problem which can be broken into examples can be solved with a classifier.
  2. In particular, reinforcement learning can be decomposed into examples given a generative model (see Lagoudakis & Parr and Fern, Yoon, & Givan).
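The bits observation can be sketched concretely: any label in {0, …, 2^k − 1} can be predicted by k binary classifiers, one per bit. The memorizing “learner” below is a hypothetical stand-in for whatever binary classifier you prefer; only the encode/decode reduction is the point.

```python
from collections import defaultdict

def label_to_bits(y, n_bits):
    """Write a complex object (here a multiclass label) as a sequence of bits."""
    return [(y >> i) & 1 for i in range(n_bits)]

def bits_to_label(bits):
    return sum(b << i for i, b in enumerate(bits))

class MemorizingBinaryLearner:
    """Toy stand-in for any binary classifier: memorizes the majority bit
    seen for each distinct input."""
    def fit(self, X, bits):
        counts = defaultdict(int)
        for x, b in zip(X, bits):
            counts[x] += 1 if b else -1
        self.table = {x: 1 if c >= 0 else 0 for x, c in counts.items()}
        return self
    def predict(self, x):
        return self.table.get(x, 0)

def multiclass_via_bits(X, Y, n_bits):
    """Reduce one k-class prediction problem to n_bits binary problems."""
    learners = [MemorizingBinaryLearner().fit(X, [label_to_bits(y, n_bits)[i] for y in Y])
                for i in range(n_bits)]
    return lambda x: bits_to_label([lrn.predict(x) for lrn in learners])

X = [0, 1, 2, 3, 4]
Y = [4, 1, 6, 3, 0]                      # labels in {0..7}, i.e. 3 bits each
predict = multiclass_via_bits(X, Y, n_bits=3)
print([predict(x) for x in X])           # → [4, 1, 6, 3, 0]
```

Note the brittleness: if any one of the three bit-classifiers errs, the whole decoded label is wrong.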

This observation often doesn’t work well in practice, because classifiers are sometimes wrong, and when a prediction depends on many classifiers, at least one of them is often wrong.

Error Transform Reductions. Worrying about errors leads to the notion of robust reductions (= ways of using simple predictors such as classifiers to make complex predictions). Error correcting output codes were proposed in analogy to coding theory. These were analyzed in terms of error rates on training sets and general losses on training sets. The robustness can be (more generally) analyzed with respect to arbitrary test distributions, and algorithms optimized with respect to this notion are often very simple and yield good performance. Solving created classification problems up to error rate e implies:

  1. Solving importance weighted classification up to error rate eN, where N is the expected importance. Costing
  2. Solving multiclass classification up to error rate 4e using ECOC. Error limiting reductions paper
  3. Solving cost sensitive classification up to loss 2eZ, where Z is the sum of costs. Weighted All Pairs algorithm
  4. Finding a policy within expected reward (T+1)e/2 of the optimal policy for T step reinforcement learning with a generative model. RLgen paper
  5. The same statement holds much more efficiently when the distribution of states of a near optimal policy is also known. PSDP paper
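The idea behind item 1 (Costing) can be sketched as rejection sampling: keep each example with probability proportional to its importance weight, then hand the resulting unweighted dataset to any ordinary classifier. The toy data and weights below are illustrative assumptions.

```python
import random

def costing_resample(examples, weights, rng):
    """Rejection-sample a weighted dataset into an unweighted one: keep each
    example with probability weight / max_weight, so that an ordinary
    (unweighted) classifier trained on the result respects the importances."""
    w_max = max(weights)
    return [ex for ex, w in zip(examples, weights) if rng.random() < w / w_max]

rng = random.Random(0)
examples = list(range(10000))
weights = [5.0 if x % 2 == 0 else 1.0 for x in examples]   # evens 5x as important
sample = costing_resample(examples, weights, rng)

even_frac = sum(1 for x in sample if x % 2 == 0) / len(sample)
print(round(even_frac, 2))   # evens are now ~5/6 of the sample, matching their weight
```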

A new problem arises: sometimes the subproblems created are inherently hard, for example when estimating class probabilities from a classifier. In this situation, the guarantee “good subproblem performance implies good overall performance” is vacuous, because good subproblem performance may be unachievable.

Regret Transform Reductions. To cope with this, we can analyze how the gap between achieved performance and the best possible performance (called “regret”) is transformed under reduction. Solving the created binary classification problems to regret r implies:

  1. Solving importance weighted classification up to regret rN, using the same algorithm as for errors. Costing
  2. Estimating class membership probability up to l2 regret 2r. Probing paper
  3. Solving multiclass classification up to regret 4√r. SECOC paper
  4. Predicting costs in cost sensitive classification up to l2 regret 4r. SECOC again
  5. Solving cost sensitive classification up to regret 4√(rZ), where Z is the sum of the costs of each choice. SECOC again
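The error/regret distinction can be made concrete with a tiny worked example (the conditional probabilities are made up): on a noisy problem even the Bayes-optimal predictor suffers nonzero error, so a guarantee phrased in raw error rate can be unachievable, while regret measures only the avoidable part.

```python
# Hypothetical noisy source: P(y=1 | x) for x in {0, 1}, with x uniform.
p_y1 = {0: 0.1, 1: 0.7}
p_x = {0: 0.5, 1: 0.5}

def error_rate(h):
    """Expected 0/1 error of a predictor h: x -> {0, 1} under the source above."""
    return sum(p_x[x] * (p_y1[x] if h(x) == 0 else 1 - p_y1[x]) for x in p_x)

bayes = lambda x: 1 if p_y1[x] > 0.5 else 0     # best possible predictor
always_one = lambda x: 1                        # a suboptimal predictor

bayes_err = error_rate(bayes)                   # 0.5*0.1 + 0.5*0.3 = 0.2
regret = error_rate(always_one) - bayes_err     # 0.6 - 0.2 = 0.4
print(round(bayes_err, 2), round(regret, 2))    # → 0.2 0.4
```

No predictor can get error below 0.2 here, so "error rate e" guarantees with e < 0.2 are vacuous, while "regret r" guarantees remain meaningful.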

There are several reduction-related problems currently being worked on which I’ll discuss in the future.

Learning Theory, by assumption

One way to organize learning theory is by assumption (in the assumption = axiom sense), from no assumptions to many assumptions. As you travel down this list, the statements become stronger, but the scope of applicability decreases.

  1. No assumptions
    1. Online learning There exists a meta-prediction algorithm which competes well with the best element of any set of prediction algorithms.
    2. Universal Learning Using a “bias” of 2^(-description length of Turing machine) in learning is equivalent to any other computable bias, up to a constant.
    3. Reductions The ability to predict well on classification problems is equivalent to the ability to predict well on many other learning problems.
  2. Independent and Identically Distributed (IID) Data
    1. Performance Prediction Based upon past performance, you can predict future performance.
    2. Uniform Convergence Performance prediction works even after choosing classifiers based on the data from large sets of classifiers.
  3. IID and partial constraints on the data source
    1. PAC Learning There exist fast algorithms for learning when all examples agree with some function in a function class (such as monomials, decision lists, etc.)
    2. Weak Bayes The Bayes-law learning algorithm will eventually reach the right solution as long as the right solution has positive prior probability.
  4. Strong Constraints on the Data Source
    1. Bayes Learning When the data source is drawn from the prior, using Bayes’ law is optimal.

This doesn’t include all forms of learning theory, because I do not know them all. If there are other bits you know of, please comment.