Antilearning: When proximity goes bad

Joel Predd mentioned “Antilearning” by Adam Kowalczyk, which is interesting from a foundational-intuitions viewpoint.

There is a pervasive intuition that “nearby things tend to have the same label”. This intuition is instantiated in SVMs, nearest neighbor classifiers, decision trees, and neural networks. It turns out there are natural problems where this intuition is the opposite of the truth.

One natural situation where this occurs is in competition. For example, when Intel fails to meet its earnings estimate, is this evidence that AMD is doing badly also? Or evidence that AMD is doing well?

This violation of the proximity intuition means that when the number of examples is small, negating a classifier which attempts to exploit proximity can provide predictive power (hence the term “antilearning”).
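As a toy illustration (not Kowalczyk's construction): below is a one-dimensional dataset where the nearest neighbor of every point has the opposite label, so leave-one-out 1-NN gets every prediction wrong, and simply negating it gets every prediction right.

```python
# A toy illustration (not Kowalczyk's construction): alternating labels on a
# line, so the nearest neighbor of every point has the OPPOSITE label.
def nearest_neighbor_label(points, labels, i):
    """Leave-one-out 1-NN: predict the label of the closest other point."""
    j = min((k for k in range(len(points)) if k != i),
            key=lambda k: abs(points[k] - points[i]))
    return labels[j]

points = list(range(10))                  # evenly spaced points on a line
labels = [(-1) ** i for i in range(10)]   # labels strictly alternate

nn_correct = sum(nearest_neighbor_label(points, labels, i) == labels[i]
                 for i in range(10))
anti_correct = len(points) - nn_correct   # the negated classifier

print(nn_correct, anti_correct)  # prints "0 10"
```

Of course, real antilearning examples (such as the genomic data Kowalczyk studies) are high-dimensional; this sketch only shows that the proximity intuition can fail completely.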

Problem: Reductions and Relative Ranking Metrics

This, again, is something of a research direction rather than a single problem.

There are several metrics people care about which depend upon the relative ranking of examples, and there are sometimes good reasons to care about such metrics. Examples include AROC, “F1”, the proportion of the time that the top ranked element is in some class, the proportion of the top 10 examples in some class (Google’s problem), the lowest ranked example of some class, and the “sort distance” from a predicted ranking to a correct ranking. See here for an example of some of these.
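To make the first of these concrete, AROC can be computed directly as the probability that a randomly chosen positive example outranks a randomly chosen negative one (ties counted as one half). The scores and labels below are invented for illustration.

```python
# AROC computed as the probability that a random positive outranks a random
# negative (ties count 1/2).  Scores and labels here are invented.
def aroc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.35, 0.4, 0.1]
labels = [1,   1,   1,    0,   0]
print(aroc(scores, labels))  # 5 of the 6 positive/negative pairs are ordered correctly
```

Note this is a pairwise definition over the whole sample, which is exactly why a classifier optimized for per-example accuracy need not score well on it.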

Problem: What does the ability to classify well imply about performance under these metrics?

Past Work

  1. Probabilistic classification under squared error can be solved with a classifier. A counterexample shows this does not imply a good AROC.
  2. Sample complexity bounds for AROC (and here).
  3. A paper on “Learning to Order Things”.

Difficulty: Several of these may be easy. Some of them may be hard.

Impact: Positive or negative results will broaden our understanding of the relationship between different learning goals. It might also yield new algorithms (via the reduction) for solving these problems.

Why Papers?

Makc asked a good question in the comments: “Why bother to make a paper, at all?” There are several reasons for writing papers which may not be immediately obvious to people outside academia.

The basic idea is that papers have considerably more utility than the obvious “present an idea”.

  1. Papers are formalized units of work. Academics (especially young ones) are often judged on the number of papers they produce.
  2. Papers have a formalized method of citing and crediting others: the bibliography. Academics (especially older ones) are often judged on the number of citations they receive.
  3. Papers enable a “more fair” anonymous review. Conferences receive many papers, from which a subset are selected. Discussion forums are inherently not anonymous for anyone who wants to build a reputation for good work.
  4. Papers are an excuse to meet your friends. Papers are the content of conferences, but much of what you do is talk to friends about interesting problems while there. Sometimes you even solve them.
  5. Papers are an excuse to get a large number of smart people in the same room and think about the same topic.
  6. Good papers are easy to read. In particular, they are much easier to read (and understand) than a long discussion thread. They remain easy to read decades later. (Writing good papers is hard.)

All of the above are reasons why writing papers is a good idea. It’s also important to understand that academia is a large system and large systems have a lot of inertia. This means switching from paper writing to some other method of doing research won’t happen unless the other method is significantly more effective, and even then there will be a lot of inertia.

Also note: the “similar sites” link to the right is to other discussion forums, etc…

Problem: Online Learning

Despite my best intentions, this is not a fully specified problem, but rather a research direction.

Competitive online learning is one of the more compelling pieces of learning theory because typical statements of the form “this algorithm will perform almost as well as a large set of other algorithms” rely only on fully-observable quantities, and are therefore applicable in many situations. Examples include Winnow, Weighted Majority, and Binomial Weighting. Algorithms with this property haven’t taken over the world yet. Here are some possible reasons:

  1. Lack of caring. Many people working on learning theory don’t care about particular applications much. This means constants in the algorithm are not optimized, usable code is often not produced, and empirical studies aren’t done.
  2. Inefficiency. Viewed from the perspective of other learning algorithms, online learning is terribly inefficient. It requires that every hypothesis (called an expert in the online learning setting) be enumerated and tested on every example. (This is similar to the inefficiency of using Bayes law as an algorithm directly, and there are strong similarities in the algorithms.)
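To make (2) concrete, here is a minimal sketch of deterministic Weighted Majority; the experts and outcome sequence are invented for illustration. Note that every expert is evaluated and reweighted on every example, which is exactly the inefficiency described above.

```python
# Deterministic Weighted Majority over a small pool of experts.  The experts
# and outcome sequence are invented; note the inefficiency: every expert is
# evaluated and reweighted on every example.
def weighted_majority(expert_preds, outcomes, beta=0.5):
    """expert_preds[t][i] is expert i's 0/1 prediction at time t."""
    w = [1.0] * len(expert_preds[0])
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        vote1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        vote0 = sum(wi for wi, p in zip(w, preds) if p == 0)
        mistakes += (1 if vote1 >= vote0 else 0) != y
        # multiplicatively penalize every expert that was wrong
        w = [wi * (beta if p != y else 1.0) for wi, p in zip(w, preds)]
    return mistakes

outcomes = [0, 1, 1, 0, 1, 0, 1, 1]
# expert 0 is perfect, expert 1 always predicts 1, expert 2 always predicts 0
expert_preds = [[y, 1, 0] for y in outcomes]
print(weighted_majority(expert_preds, outcomes))  # prints "0"
```

The appeal is in the guarantee: the number of mistakes is bounded in terms of the best expert's mistakes plus a log-of-n term, with no distributional assumptions.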

For an example of (1), there is a mysterious factor of 2 in the Binomial Weighting algorithm which has not been fully resolved. Some successful applications also exist, such as those based on SNoW.

The way to combat problem (2) is to introduce more structure into the hypotheses/experts. Some efforts have already been made in this direction. For example, it’s generally feasible to introduce an arbitrary bias or “prior” over the experts in the form of some probability distribution, and perform well with respect to that bias. Another nice piece of work by Adam Kalai and Santosh Vempala discusses how to efficiently handle several forms of structured experts. At an intuitive level, further development discovering how to efficiently work with new forms of structure might pay off well.
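A sketch of the follow-the-perturbed-leader idea in the spirit of the Kalai–Vempala work: rather than maintaining a weight per expert, add a random perturbation to the cumulative losses and follow the best-looking expert. The loss sequence below is invented, and the experts are enumerated for simplicity; the point of the structured-expert setting is that the argmin could instead be computed by an efficient optimizer (e.g. a shortest-path routine).

```python
# Follow the perturbed leader: follow the expert whose perturbed cumulative
# loss is smallest.  For structured experts the argmin would be computed by
# an efficient optimizer rather than enumeration; losses here are invented.
import random

def fpl_choice(cum_losses, epsilon=1.0, rng=random):
    perturbed = [L + rng.uniform(0.0, 1.0 / epsilon) for L in cum_losses]
    return min(range(len(cum_losses)), key=perturbed.__getitem__)

rng = random.Random(0)
cum = [0.0, 0.0]
picks = []
for loss in [[0.1, 0.9], [0.2, 0.8], [0.0, 1.0]] * 10:  # expert 0 is better
    i = fpl_choice(cum, rng=rng)
    picks.append(i)
    cum = [c + l for c, l in zip(cum, loss)]
print(picks.count(0))  # the better expert is followed in almost every round
```

The randomization is what makes the guarantee work: a deterministic leader-follower can be forced to switch on every round, while the perturbed version switches rarely once one expert pulls ahead.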

The basic research direction here is to address the gap between theory and practice, possibly by solving the above problems and possibly by discovering and addressing other problems.

Solution: Reinforcement Learning with Classification

I realized that the tools needed to solve the problem posted earlier have just been created. I tried to sketch out the solution here (also in .lyx and .tex). It is still quite sketchy (and probably only the few people who understand reductions well can follow it).

One of the reasons why I started this weblog was to experiment with “research in the open”, and this is an opportunity to do so. Over the next few days, I’ll be filling in details and trying to get things to make sense. If you have additions or ideas, please propose them.