I have had interesting discussions about the distinction between static and dynamic classes with Kishore and Hal.
The distinction arises in multiclass prediction settings. A static set of classes is given by a set of labels {1,…,k} and the goal is generally to choose the most likely label given features. The static approach is the one that we typically analyze and think about in machine learning.
The dynamic setting is one that is often used in practice. The basic idea is that the set of classes is not fixed, but instead varies on a per-example basis. These different classes are generally defined by a choice of features.
As far as theory goes, the distinction between these two settings appears to be very substantial. For example, in the static setting, in learning-reductions land, we now have techniques for robust O(log(k))-time prediction in many multiclass setting variants. In the dynamic setting, the best known techniques are O(k), and furthermore this exponential gap may be essential, at least without further assumptions.
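To make the O(log(k)) claim concrete, here is a minimal sketch (not any particular published reduction) of prediction with a balanced tree of binary classifiers. The `classifiers` map, its keying by interval, and the `predict` interface are hypothetical stand-ins for whatever trained binary learners one has.

```python
# A minimal sketch of O(log k) multiclass prediction via a balanced tree
# of binary classifiers, in the spirit of tree-based learning reductions.
# The `classifiers` map and its training procedure are hypothetical; any
# binary learner exposing predict(features) -> {0, 1} would do.

def tree_predict(features, classes, classifiers):
    """Descend a balanced binary tree over `classes`, calling one binary
    classifier per internal node: O(log k) calls instead of O(k)."""
    lo, hi = 0, len(classes)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # The node classifier for interval (lo, hi) answers:
        # does this example belong in the left or the right half?
        if classifiers[(lo, hi)].predict(features) == 0:
            hi = mid  # go left
        else:
            lo = mid  # go right
    return classes[lo]
```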
Are there techniques for converting from dynamic multiclass to static multiclass? For example, we could embed a dynamic set of classes within a much larger static set ranging over all possible dynamic classes, while eliminating all class-dependent features. In some cases this approach may work well, but I’ve also seen it fail, the basic problem being that a learning algorithm can easily choose an invalid class. We could of course force a learning algorithm to choose amongst the dynamically valid set, but I don’t know a general way to do that without making the running time scale at least linearly with the number of valid classes.
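For contrast, here is a sketch of that O(k) fallback: restricting prediction to the dynamically valid set by scoring each valid class in turn. The `score` function over shared and class-dependent features is a hypothetical learned regressor; the point is only that the running time scales linearly with the number of valid classes.

```python
# Brute-force prediction restricted to the dynamically valid classes.
# `score` is a hypothetical learned function of shared and per-class
# features; running time is O(|valid_classes|), the linear cost at issue.

def predict_dynamic(shared_features, valid_classes, class_features, score):
    best, best_score = None, float("-inf")
    for c in valid_classes:  # one score evaluation per valid class
        s = score(shared_features, class_features[c])
        if s > best_score:
            best, best_score = s, c
    return best
```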
So, a basic question that’s bothering me is: when and how can we effectively predict amongst a set of dynamically defined classes in sublinear time? A quick answer is “it’s not possible, because simply reading off the set of dynamically defined classes requires O(class count) time”. This answer isn’t satisfying, because there are many ways to implicitly specify a set in sublinear time. So the modified question is: “Are there natural ways to dynamically define classes in sublinear time? And can these be used for sublinear prediction?”
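As a toy illustration of implicit specification (an assumption-laden example, not a general answer): a contiguous integer range stands in for its members with O(1) space, O(1) membership testing, and O(1) size, so a class set of this form never needs to be read off explicitly.

```python
# A toy implicit class set: the range [lo, hi) represents hi - lo classes
# without enumerating them. Illustrative assumption only; real dynamic
# class sets would need richer implicit representations than a range.

class RangeClasses:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __contains__(self, c):
        return self.lo <= c < self.hi  # O(1) validity check

    def __len__(self):
        return self.hi - self.lo  # O(1) size, no enumeration
```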
Isn’t this what structured prediction is all about? Specifying the structure of the output domain is what can make prediction feasible in sublinear time. For example, finding the maximum a posteriori labeling of a tree-structured graphical model selects, in time sublinear in the size of the output domain, from an exponentially large discrete set of labelings.
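To ground that example, here is a sketch of max-product (Viterbi) MAP inference on a chain, the simplest tree. `node_score` and `edge_score` are assumed to hold log-potentials; with n nodes of m states each, the m**n possible labelings are searched in O(n·m²) time.

```python
# MAP inference on a chain-structured model via the max-product (Viterbi)
# recursion. node_score[i][s] and edge_score[i][s][t] are assumed
# log-potentials; the m**n labelings are searched in O(n * m**2) time.

def chain_map(node_score, edge_score):
    n, m = len(node_score), len(node_score[0])
    best = [list(node_score[0])]  # best[i][s]: best prefix score ending in state s
    back = []                     # backpointers for recovering the argmax
    for i in range(1, n):
        row, ptr = [], []
        for t in range(m):
            s = max(range(m), key=lambda u: best[-1][u] + edge_score[i - 1][u][t])
            row.append(best[-1][s] + edge_score[i - 1][s][t] + node_score[i][t])
            ptr.append(s)
        best.append(row)
        back.append(ptr)
    # Backtrack from the best final state to recover the MAP labeling.
    state = max(range(m), key=lambda u: best[-1][u])
    labeling = [state]
    for ptr in reversed(back):
        state = ptr[state]
        labeling.append(state)
    return list(reversed(labeling))
```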
For a more abstract and general view on discrete output domains, a natural field to seek inspiration from is combinatorial optimization: finding optimal solutions within large finite sets. One of the most successful and principled approaches there is polyhedral combinatorics, in which the structure of the output domain is analyzed in terms of polytopes. An efficient description of the polytope guarantees the existence of a polynomial-time optimization algorithm, where efficient descriptions are precisely those for which a polynomial-time separation oracle exists. In a sense, the language of polyhedral combinatorics is universal, as every finite set implicitly defines a polytope (the convex hull of its elements).
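For concreteness, the separation/optimization connection referred to here is, as I understand it, the standard equivalence of Grötschel, Lovász, and Schrijver, roughly stated as:

```latex
% For a finite output set S with P = conv(S) \subseteq R^d:
%   OPT(c):  compute  max { c^T x : x in P }
%   SEP(y):  either assert y in P, or return (a, b) with
%            a^T x <= b for all x in P  and  a^T y > b
% Equivalence: OPT is solvable in polynomial time if and only if SEP is
% (up to polynomial overhead, under mild technical conditions).
P = \operatorname{conv}(S) \subseteq \mathbb{R}^d,
\qquad \mathrm{OPT}(c):\ \max_{x \in P} c^\top x,
\qquad \mathrm{SEP}(y):\ \text{assert } y \in P \ \text{ or find } (a,b)
\text{ with } a^\top x \le b \ \forall x \in P,\ a^\top y > b.
```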
As an illustration that such transfer already happens, in the graphical model example, finding the MAP state is equivalent to optimizing a linear function over the so-called marginal polytope of the model. For tree-structured models this polytope has a simple structure, so the optimization is efficiently solvable. The work of Martin Wainwright and David Sontag extensively analyzes the structure of this polytope. Also, the discrete optimization workshop at this year’s NIPS will hopefully push further in this direction (http://www.discml.cc/).
“Structured prediction” has many meanings, so I’m not surprised if yours encompasses this. I’ll need to better understand the polytope specification to comment further about the approach you suggest. (Pointer?)
What many people think of as structured prediction is not quite what I had in mind, as Yisong suggests. For example, with no particular structure on a set of static labels I can provide an effective O(log k) solution to multiclass prediction when certain fairly natural binary prediction problems are solvable. I don’t yet have a good grasp of whether/when a similar result is possible with dynamic classes.
I see your point. I am not familiar with your reductions work so I cannot comment on the relation to what I meant with structured prediction.
For the polytope approach, an excellent graphical model perspective is in Wainwright and Jordan’s not-so-short short book (available at http://www.eecs.berkeley.edu/~wainwrig/Papers/WaiJor08_FTML.pdf).
For general polyhedral combinatorics, a very brief and readable introduction can be found in Laurence Wolsey’s 1998 book “Integer Programming”, whereas the definitive reference is the three-volume work of Lex Schrijver, “Combinatorial Optimization: Polyhedra and Efficiency” (Springer, 2003). For something readable that is accessible online, there is a pair of introductory survey papers by Karen Aardal from 1996, titled “Polyhedral Techniques in Combinatorial Optimization”.
While structured prediction has been effective for many problems, specifying the structure of the output space implies imposing additional assumptions. I believe John is addressing the more general assumption-free problem (or alternatively the problem of finding the minimal set of assumptions).
Side note: it seems that the sleeping experts problem ( http://colt2008.cs.helsinki.fi/papers/114-Kleinberg.pdf ) can be thought of as an online and non-contextual version of this problem.