John Langford – Page 96 – Machine Learning (Theory)

2/10/2005 2/11/2005

Conferences, Dates, Locations

Conference	Location	Date
COLT	Bertinoro, Italy	June 27-30
AAAI	Pittsburgh, PA, USA	July 9-13
UAI	Edinburgh, Scotland	July 26-29
IJCAI	Edinburgh, Scotland	July 30 – August 5
ICML	Bonn, Germany	August 7-11
KDD	Chicago, IL, USA	August 21-24

The big winner this year is Europe. This is partly a coincidence, and partly due to the general internationalization of science over the last few years. With cuts to basic science in the US and increased hassle for visitors, conferences outside the US become more attractive. Europe and Australia/New Zealand are the immediate winners because they have the science, the infrastructure, and English in place. China and India are possible future winners.

2/9/2005

Intuitions from applied learning

Since learning is far from an exact science, it’s good to pay attention to basic intuitions of applied learning. Here are a few I’ve collected.

Integration. In Bayesian learning, the posterior is computed by an integral, and the optimal thing to do is to predict according to this integral. This phenomenon seems to be far more general: Bagging, Boosting, SVMs, and Neural Networks all take advantage of the idea to some extent, and you can average over many different classification predictors to improve performance. Sources: Zoubin, Caruana
Differentiation. The different pieces of an average should differentiate from one another, achieving good performance by different methods. This is known as the ‘symmetry breaking’ problem for neural networks, and it’s why weights are initialized randomly. Boosting explicitly attempts to achieve good differentiation by creating new, different, learning problems. Sources: Yann LeCun, Phil Long
Deep Representation. Having a deep representation is necessary for having a good general learner. Decision Trees and Convolutional neural networks take advantage of this. SVMs get around it by allowing the user to engineer knowledge into the kernel. Boosting and Bagging rely on another algorithm for this. Sources: Yann LeCun
Fine Representation of Bias. Many learning theory applications use just a coarse representation of bias such as “function in the hypothesis class or not”. In practice, significantly better performance is achieved from a more finely tuned bias. Bayesian learning has this built in via the prior. Other techniques can take advantage of this as well. Sources: Zoubin, personal experience.
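
To make the first two intuitions concrete, here is a minimal sketch of averaging over differentiated predictors. It is an illustration under stated assumptions, not any particular published algorithm: the threshold "stump" learner, the toy data, and every function name are hypothetical. Bootstrap resampling is what makes the individual stumps differ from one another.

```python
import random
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the predictions."""
    return Counter(predictions).most_common(1)[0][0]

def train_stump(sample):
    """Fit a 1-D threshold classifier that predicts 1 when x >= threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x >= threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_stumps(data, n_models=25, seed=0):
    """Train each stump on its own bootstrap resample so the stumps differ."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]

def predict(thresholds, x):
    """Average the stumps by taking a majority vote over their predictions."""
    return majority_vote([int(x >= t) for t in thresholds])

# Hypothetical toy data: (feature, label) pairs.
data = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 1), (0.6, 1), (0.8, 1), (0.9, 1)]
ensemble = bagged_stumps(data)
print(predict(ensemble, 0.05), predict(ensemble, 0.95))
```

If every stump were trained on the same data, they would all pick the same threshold and the vote would add nothing, which is the differentiation point in miniature.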

If you have others, please comment on them.

2/8/2005 2/9/2005

Some Links

Yaroslav Bulatov collects some links to other technical blogs.

2/7/2005 2/8/2005

The State of the Reduction

What? Reductions are machines which turn solvers for one problem into solvers for another problem.
Why? Reductions are useful for several reasons.

Laziness. Reducing a problem to classification makes at least 10 learning algorithms available to solve it. Inventing 10 learning algorithms is quite a bit of work. Similarly, programming a reduction is often trivial, while programming a learning algorithm is a great deal of work.
Crystallization. The problems we often want to solve in learning are worst-case impossible but average-case feasible. By reducing all problems onto one or a few primitives, we can fine tune these primitives to perform well on real-world problems with greater precision, because there are more problems to validate on.
Theoretical Organization. By studying what reductions are easy vs. hard vs. impossible, we can learn which problems are roughly equivalent in difficulty and which are much harder.

What we know now.

Typesafe reductions. In the beginning, there was the observation that every complex object on a computer can be written as a sequence of bits. This observation leads to the notion that a classifier (which predicts a single bit) can be used to predict any complex object. Using this observation, we can make the following statements:

Any prediction problem which can be broken into examples can be solved with a classifier.
In particular, reinforcement learning can be decomposed into examples given a generative model (see Lagoudakis & Parr and Fern, Yoon, & Givan).

This approach often doesn’t work well in practice: each classifier is sometimes wrong, so when many classifiers are combined, at least one of them is often wrong.
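
A minimal sketch of the bit-by-bit idea and of why it is fragile. The helper names and the toy target are hypothetical, and the error calculation assumes the bit classifiers err independently, which real classifiers need not do.

```python
def to_bits(value, n_bits):
    """Write a non-negative integer as a list of bits, least significant first."""
    return [(value >> i) & 1 for i in range(n_bits)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))

def predict_object(bit_classifiers, x):
    """Predict a complex object by asking one binary classifier per bit."""
    return from_bits([classifier(x) for classifier in bit_classifiers])

def prob_all_correct(error_rate, n_bits):
    """Chance that every bit is right, ASSUMING independent errors per bit."""
    return (1 - error_rate) ** n_bits

# Hypothetical stand-ins for learned classifiers: each one outputs bit i of (3 * x) % 8.
def target(x):
    return (3 * x) % 8

bit_classifiers = [lambda x, i=i: (target(x) >> i) & 1 for i in range(3)]

print(predict_object(bit_classifiers, 5))   # reassembles target(5) from its bits
print(prob_all_correct(0.05, 32))           # roughly 0.19
```

Under the independence assumption, 32 bit classifiers that are each 95% accurate produce a fully correct 32-bit object only about 19% of the time, which is the failure described above.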

Error Transform Reductions. Worrying about errors leads to the notion of robust reductions (= ways of using simple predictors such as classifiers to make complex predictions). Error correcting output codes were proposed in analogy to coding theory. These were analyzed in terms of error rates on training sets and general losses on training sets. The robustness can be (more generally) analyzed with respect to arbitrary test distributions, and algorithms optimized with respect to this notion are often very simple and yield good performance. Solving created classification problems up to error rate e implies:

Solving importance weighted classification up to error rate eN, where N is the expected importance. Costing
Solving multiclass classification up to error rate 4e using ECOC. Error limiting reductions paper
Solving Cost sensitive classification up to loss 2eZ where Z is the sum of costs. Weighted All Pairs algorithm
Finding a policy within expected reward (T+1)e/2 of the optimal policy for T step reinforcement learning with a generative model. RLgen paper
The same statement holds much more efficiently when the distribution of states of a near optimal policy is also known. PSDP paper
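
As an illustration of how simple such a reduction can be, here is a sketch of the rejection-sampling step behind the importance weighted case: each example is kept with probability proportional to its importance, and the survivors form an ordinary, unweighted classification problem. The function name and toy data are hypothetical, and this is a simplified sketch rather than the full Costing algorithm.

```python
import random

def costing_resample(examples, rng):
    """Keep each (x, y, importance) example with probability importance / max importance.

    Assumes every importance is positive. The result is a list of plain (x, y)
    pairs that any ordinary classifier learner can consume.
    """
    max_importance = max(importance for _, _, importance in examples)
    return [
        (x, y)
        for x, y, importance in examples
        if rng.random() < importance / max_importance
    ]

# Hypothetical toy data: (features, label, importance).
examples = [("a", 1, 10.0), ("b", 0, 1.0)]
rng = random.Random(0)
resampled_sets = [costing_resample(examples, rng) for _ in range(2000)]

kept_b = sum(("b", 0) in resampled for resampled in resampled_sets)
print(kept_b / len(resampled_sets))  # close to 0.1, the importance ratio
```

Because a single resample throws away many low-importance examples, a natural design choice is to draw several resamples, train a classifier on each, and average their predictions.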

A new problem arises: sometimes the subproblems created are inherently hard, for example when estimating class probability from a classifier. In this situation saying “good performance implies good performance” is vacuous.

Regret Transform Reductions. To cope with this, we can analyze how “regret” (performance minus the best possible performance) is transformed under reduction. Solving created binary classification problems to regret r implies:

Solving importance weighted regret up to r N using the same algorithm as for errors. Costing
Solving class membership probability up to l₂ regret 2r. Probing paper
Solving multiclass classification to regret 4 r^0.5. SECOC paper
Predicting costs in cost sensitive classification up to l₂ regret 4r. SECOC again
Solving cost sensitive classification up to regret 4(r Z)^0.5 where Z is the sum of the costs of each choice. SECOC again
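
The bounds in this list are easy to tabulate for a given binary regret r. The sketch below only transcribes the formulas above; the function name, the dictionary keys, and the default values of N and Z are hypothetical conveniences.

```python
import math

def regret_bounds(r, N=1.0, Z=1.0):
    """Regret guarantees implied by binary regret r, as listed above.

    N is the expected importance and Z is the sum of the costs of each choice.
    """
    return {
        "importance_weighted": r * N,
        "class_probability_l2": 2 * r,
        "multiclass": 4 * math.sqrt(r),
        "cost_prediction_l2": 4 * r,
        "cost_sensitive": 4 * math.sqrt(r * Z),
    }

for name, bound in regret_bounds(0.01).items():
    print(f"{name}: {bound:.3f}")
```

The square root matters in practice: the multiclass bound 4 r^0.5 drops below 1 only once r is below 1/16, so a binary regret of 0.01 guarantees a multiclass regret of only about 0.4.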

There are several reduction-related problems currently being worked on which I’ll discuss in the future.

2/4/2005

JMLG

The Journal of Machine Learning Gossip has some fine satire about learning research. In particular, the guides are amusing and remarkably true.

As in all things, it’s easy to criticize the way things are and harder to make them better.