Explorations of Exploration

Exploration is one of the big unsolved problems in machine learning. This isn’t for lack of trying—there are many models of exploration which have been analyzed in many different ways by many different groups of people. At some point, it is worthwhile to sit back and see what has been done across these many models.

  • Reinforcement Learning (1). Reinforcement learning has traditionally focused on Markov Decision Processes where the next state s’ is given by a conditional distribution P(s’|s,a) given the current state s and action a. The typical result here is that certain specific algorithms controlling an agent can behave within e of optimal for horizon T except for poly(1/e,T,S,A) “wasted” experiences (with high probability). This started with E3 by Satinder Singh and Michael Kearns. Sham Kakade’s thesis has significant discussion. Extensions have typically been of the form “under extra assumptions, we can prove more”, for example Factored-E3 and Metric-E3. (It turns out that the number of wasted samples can be less than the number of bits required to describe an MDP.) A weakness of all these results is that they rely upon (a) assumptions which are often false for real applications, (b) state spaces are too large, and (c) make a gurantee that is rather weak. Good performance is only guaranteed after suffering the possibly catastrophic consequences of exploration.
  • Reinforcement Learning (2). Several recent papers have been attempting to analyze reinforcement learning via reduction. To date, all results are either nonconstructive or involve the use of various hints (oracle access to an optimal policy, the distribution over states of an optimal policy etc…) which short-circuit the need to explore. Obviously, these hints are not always available for real-world problems.
  • Reinforcement Learning (3). Much of the rest of reinforcement learning has something to do with exploration, but it’s difficult to summarize succinctly.
  • Online Learning. The n-armed bandit setting can be thought of as an MDP with one state and many actions. In some variants, there is even an adversary who chooses the payoffs of the arms in a non-stochastic manner. The typical result here says that you can compete well with the best constant action after some wasted actions. The exact number of wasted actions varies with the precise setting, but it is typically linear in the number of actions. This work can be traced back to (at least) Gittins indices which (unfortunately) don’t seem to have a good description available on the internet.
  • Active Learning(1) The common current use of this term has to do with “selective sampling”=choosing unlabeled samples to label so as to minimize the number of labels required to learn a good predictor (typically a classifier). A typical result has the form: Given that your classifier comes from restricted class C and the labeled data distribution obeys some constraint, the number of adaptively labeled samples required O(log (1/e)) where e is the error rate. (It turns out that the even noisy distributions are allowed.) The constraints on distributions and hypothesis spaces required to achieve these speedups are often severe.
  • Active Learning(2) Membership query learning is another example. The distinguishing difference with respect to selective sampling is that the a labeled can be requested for any unlabeled point (not just those drawn according to some natural distribution). Several relatively strong results hold for membership query learning, but there is a significant drawback: it seems that the ability to query for a label on an arbitrary point is not very natural. For example, imagine query whether a text document is about sports or politics when the text is generated at random.
  • Active Learning(3) Experimental design (which is mostly based in statistics), is often about finding the extrema of some function rather than generalization. Often, the data generating distribution is assumed to come from some specific parametric family. Unfortunately, my knowledge is sketchy here.

The striking thing about all of these models is that they fail to apply to typical real world problems. This failure is either by design (making assumptions which are simply rarely met), by failure to prove interesting results, or both.

And yet, many of the pieces are here. Active learning deals with generalization, online learning can deal with adversarial situations, and reinforcement learning deals with the situation where your choices influence what you can later learn. At a high level, there is much room for research here by design of a new model of exploration, new theoretical statements, or both.

I’ve been told “exploration is too hard”, and that’s a good warning to bear in mind, but I’m still hopeful for progress.

Server Shift, Site Tweaks, Suggestions?

Hunch.net has shifted to a new server, and wordpress has been updated to the latest version. If anyone notices difficulties associated with this, please comment. (Note that DNS updates can take awhile so the shift may not yet be complete.)
More generally, this is a good time to ask for suggestions. What would make this blog more useful?

What is the best regret transform reduction from multiclass to binary?

This post is about an open problem in learning reductions.

Background A reduction might transform a a multiclass prediction problem where there are k possible labels into a binary learning problem where there are only 2 possible labels. On this induced binary problem we might learn a binary classifier with some error rate e. After subtracting the minimum possible (Bayes) error rate b, we get a regret r = e – b. The PECOC(Probabilistic Error Correcting Output Code) reduction has the property that binary regret r implies multiclass regret at most 4r0.5.

The problem This is not the “rightest” answer. Consider the k=2 case, where we reduce binary to binary. There exists a reduction (the identity) with the property that regret r implies regret r. This is substantially superior to the transform given by the PECOC reduction, which suggests that a better reduction may exist for general k. For example, we can not rule out the possibility that a reduction R exists with regret transform guaranteeing binary regret r implies at most multiclass regret c(k) r where c(k) is a k dependent constant between 1 and 4.

Difficulty I believe this is a solvable problem, given some serious thought.

Impact The use of some reduction from multiclass to binary is common practice, so a good solution could be widely useful. One thing to be aware of is that there is a common and reasonable concern about the ‘naturalness’ of induced problems. There seems to be no way to address this concern other than via empirical testing. On the theoretical side, a better reduction may help us understand whether classification or l2 regression is the more natural primitive for reduction. The PECOC reduction essentially first turns a binary classifier into an l2 regressor and then uses the regressor repeatedly to make multiclass predictions.

Some background material which may help:

  1. Dietterich and Bakiri introduce Error Correcting Output Codes.
  2. Guruswami and Sahai analyze ECOC as an error transform reduction. (see lemma 2)
  3. Allwein, Schapire, and Singer generalize ECOC to use loss-based decoding.
  4. Beygelzimer and Langford showed that ECOC is not a regret transform and proved the PECOC regret transform.

NIPS paper evaluation criteria

John Platt, who is PC-chair for NIPS 2006 has organized a NIPS paper evaluation criteria document with input from the program committee and others.

The document contains specific advice about what is appropriate for the various subareas within NIPS. It may be very helpful, because the standards of evaluation for papers varies significantly.

This is a bit of an experiment: the hope is that by carefully thinking about and stating what is important, authors can better understand whether and where their work fits.

Update: The general submission page and Author instruction including how to submit an appendix.

The value of the orthodox view of Boosting

The term “boosting” comes from the idea of using a meta-algorithm which takes “weak” learners (that may be able to only barely predict slightly better than random) and turn them into strongly capable learners (which predict very well). Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although there were theoretical boosting algorithms floating around since 1990 (see the bottom of this page).

Since then, many different interpretations of why boosting works have arisen. There is significant discussion about these different views in the annals of statistics, including a response by Yoav Freund and Robert Schapire.

I believe there is a great deal of value to be found in the original view of boosting (meta-algorithm for creating a strong learner from a weak learner). This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to really capture important properties.

Comparing with all other views of boosting is too clumsy, so I will pick one: “boosting coordinate-wise gradient descent (CWGD for short) on an exponential loss function” which started here and compare it with Adaboost.

There are two advantages of the “weak to strong learning” view:

  1. Automatic computational speedups. In the “weak to strong learning” view, you automatically think about using a learning algorithm as a subroutine. As a consequence, we know the computation can be quite fast. In the CWGD view, using C4.5 (or some other algorithm) to pick the coordinate is an unmotivated decision. The straightforward thing to do is simply check each coordinate in turn which yields no computational speedups.
  2. Meta-algorithm based performance gains. Using a nontrivial base learning algorithm seems to improve performance. This is unsurprising—simply consider the limit where only one round of boosting is done. This is not well-accounted for by the CWGD view.

The point here is not that the old view subsumes the CWGD view, but rather that the CWGD view does not account for all the value in the old view. In particular, the CWGD view suggests that a broader family of algorithms may be useful than the weak-to-strong view might suggest.

This might seem like a “too meta” discussion, but it is very relevant to the process of research. We as researchers in machine learning have a choice of many methods of thinking about developing algorithms. Some methods are harder to use than others, so it is useful to think about what works well. Gradient descent is a core algorithmic tool in machine learning. After making a sequence of more-or-less unmotivated steps, we can derive Adaboost (and other algorithms) as an application of gradient descent. Or, we can study the notion of boosting weak learning to achieve strong learning and derive Adaboost. My impression is that the “weak learning to achieve strong learning” view is significantly more difficult to master than gradient descent, but it is also a significantly more precise mechanism for deriving useful algorithms. There are many gradient descent algorithms which are not very useful in machine learning. Amongst other things, the “weak to strong” view significantly informed some of the early development of learning reductions. It is no coincidence that Adaboost can be understood in this framework.