February 2005 – Machine Learning (Theory)

2/28/20052/28/2005

Regularization

Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints.

Functionally. Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l₁ or l₂ norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data.
Statistical Learning Viewpoint Regularization is about the failiure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S. There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S, e(c,D) less than e(c,S) + f(complexity) where complexity is some measure of the size of a set of functions. Unfortunately, we have never convincingly nailed the exact value of f(). We can note that f() is always monotonically increasing with the complexity measure and so there exists a unique constant C such that f(complexity)=C*complexity at the value of complexity which minimizes the bound. Empirical parameter tuning such as for the C constant in a support vector machine can be regarded as searching for this “right” tradeoff.
Computationally Regularization can be thought of as a computational shortcut to computing the f() above. Hence, smoothness, convexity, and other computational constraints are important issues.

One thing which should be clear is that there is no one best method of regularization for all problems. “What is a good regularizer for my problem?” is another “learning complete” question since solving it perfectly implies solving the learning problem (For example consider the “regularizer” which assigns complexity 0 to the best prediction function and infinity to all others). Similarly, “What is an empirically useful regularizer?” is like “What is a good learning algorithm?” The choice of regularizer used when solving empirical problems is a degree of freedom with which prior information and biases can be incorporated in order to improve performance.

2/27/20052/28/2005

Antilearning: When proximity goes bad

Joel Predd mentioned “Antilearning” by Adam Kowalczyk, which is interesting from a foundational intuitions viewpoint.

There is a pervasive intuition that “nearby things tend to have the same label”. This intuition is instantiated in SVMs, nearest neighbor classifiers, decision trees, and neural networks. It turns out there are natural problems where this intuition is opposite of the truth.

One natural situation where this occurs is in competition. For example, when Intel fails to meet its earnings estimate, is this evidence that AMD is doing badly also? Or evidence that AMD is doing well?

This violation of the proximity intuition means that when the number of examples is few, negating a classifier which attempts to exploit proximity can provide predictive power (thus, the term “antilearning”).

2/26/20052/26/2005

Problem: Reductions and Relative Ranking Metrics

This, again, is something of a research direction rather than a single problem.

There are several metrics people care about which depend upon the relative ranking of examples and there are sometimes good reasons to care about such metrics. Examples include AROC, “F1”, the proportion of the time that the top ranked element is in some class, the proportion of the top 10 examples in some class (google‘s problem), the lowest ranked example of some class, and the “sort distance” from a predicted ranking to a correct ranking. See here for an example of some of these.

Problem What does the ability to classify well imply about performance under these metrics?

Past Work

Probabilistic classification under squared error can be solved with a classifier. A counterexample shows this does not imply a good AROC.
Sample complexity bounds for AROC (and here).
A paper on “Learning to Order Things“.

Difficulty Several of these may be easy. Some of them may be hard.

Impact Positive or negative results will broaden our understanding of the relationship between different learning goals. It might also yield new algorithms (via the reduction) for solving these problems.

2/25/20052/25/2005

Why Papers?

Makc asked a good question in comments—“Why bother to make a paper, at all?” There are several reasons for writing papers which may not be immediately obvious to people not in academia.

The basic idea is that papers have considerably more utility than the obvious “present an idea”.

Papers are a formalized units of work. Academics (especially young ones) are often judged on the number of papers they produce.
Papers have a formalized method of citing and crediting other—the bibliography. Academics (especially older ones) are often judged on the number of citations they receive.
Papers enable a “more fair” anonymous review. Conferences receive many papers, from which a subset are selected. Discussion forums are inherently not anonymous for anyone who wants to build a reputation for good work.
Papers are an excuse to meet your friends. Papers are the content of conferences, but much of what you do is talk to friends about interesting problems while there. Sometimes you even solve them.
Papers are an excuse to get a large number of smart people in the same room and think about the same topic.
Good papers are easy to read. In particular, they are much easier to read (and understand) then a long discussion thread. They are even easy to read in several decades. (Writing good papers is hard)

All of the above are reasons why writing papers is a good idea. It’s also important to understand that academia is a large system and large systems have a lot of inertia. This means switching from paper writing to some other method of doing research won’t happen unless the other method is significantly more effective, and even then there will be a lot of inertia.

Also note: the “similar sites” link to the right is to other discussion forums, etc…

2/25/20052/25/2005

Problem: Online Learning

Despite my best intentions, this is not a fully specified problem, but rather a research direction.

Competitive online learning is one of the more compelling pieces of learning theory because typical statements of the form “this algorithm will perform almost as well as a large set of other algorithms” rely only on fully-observable quantities, and are therefore applicable in many situations. Examples include Winnow, Weighted Majority, and Binomial Weighting. Algorithms with this property haven’t taken over the world yet. Here might be some reasons:

Lack of caring. Many people working on learning theory don’t care about particular applications much. This means constants in the algorithm are not optimized, usable code is often not produced, and empirical studies aren’t done.
Inefficiency. Viewed from the perspective of other learning algorithms, online learning is terribly inefficient. It requires that every hypothesis (called an expert in the online learning setting) be enumerated and tested on every example. (This is similar to the inefficiency of using Bayes law as an algorithm directly, and there are strong similarities in the algorithms.)

For an example of (1), there is a mysterious factor of 2 in the Binomial Weighting algorithm which has not been fully resolved. Some succesful applications also exist such as those based on SNoW.

The way to combat problem (2) is to introduce more structure into the hypothesis/experts. Some efforts have already been made in this direction. For example, it’s generally feasible to introduce an arbitrary bias or “prior” over the experts in the form of some probability distribution, and perform well with respect to that bias. Another nice piece of work by Adam Kalai and Santosh Vempala discusses how to efficiently handle several forms of structured experts. At an intuitive level, further development discovering how to efficiently work with new forms of structure might payoff well.

The basic research direction here is to address the gap between theory and practice, possibly by solving the above problems and possibly by discovering and addressing other problems.