Papers – Page 9 – Machine Learning (Theory)

4/21/20054/21/2005

Dynamic Programming Generalizations and Their Use

David Mcallester gave a talk about this paper (with Pedro Felzenszwalb). I’ll try to give a high level summary of why it’s interesting.

Dynamic programming is most familiar as instantiated by Viterbi decoding in a hidden markov model. It is a general paradigm for problem solving where subproblems are solved and used to solve larger problems. In the Viterbi decoding example, the subproblem is “What is the most probable path ending at each state at timestep t?”, and the larger problem is the same except at timestep t+1. There are a few optimizations you can do here:

Dynamic Programming -> queued Dynamic Programming. Keep track of the “cost so far” (or “most probable path”) and (carefully) only look at extensions to paths likely to yield the shortest path. “Carefully” here is defined by Dijkstra’s shortest path algorithm.
queued Dynamic programming -> A^*Add a lower bound on the cost to complete a path (or an upper bound on the probability of a completion) for the priority queue of Dijkstra’s shortest path. This can yield computational speedups varying between negligible and outstanding.
A^* -> Hierarchical A^* The efficiency of A^* search is dependent on the tightness of it’s lower bound, which brings up the question: “Where do you get the lower bound?” One appealing answer is from A^* applied to a simplified problem equivalent to the original problem, but with states aliased (many states in original problem = 1 state in new problem). This technique can be applied recursively until the problem is trivial.

Each of these steps has been noted previously (although perhaps not in the generality of this paper). What seems new and interesting is that the entire hierarchy of A^* searches can be done simultaneously on one priority queue.

The resulting algorithm can use low level information to optimize high level search as well as high level information to optimize low level search in a holistic process. It’s not clear yet how far this approach can be pushed, but this quality is quite appealing. Naturally, there are many plausible learning-related applications.

4/8/20054/8/2005

Fast SVMs

There was a presentation at snowbird about parallelized support vector machines. In many cases, people parallelize by ignoring serial operations, but that is not what happened here—they parallelize with optimizations. Consequently, this seems to be the fastest SVM in existence.

There is a related paper here.

4/6/20054/7/2005

Structured Regret Minimization

Geoff Gordon made an interesting presentation at the snowbird learning workshop discussing the use of no-regret algorithms for the use of several robot-related learning problems. There seems to be a draft here. This seems interesting in two ways:

Drawback Removal One of the significant problems with these online algorithms is that they can’t cope with structure very easily. This drawback is addressed for certain structures.
Experiments One criticism of such algorithms is that they are too “worst case”. Several experiments suggest that protecting yourself against this worst case does not necessarily incur a great loss.

3/18/20053/18/2005

Binomial Weighting

Suppose we have a set of classifiers c making binary predictions from an input x and we see examples in an online fashion. In particular, we repeatedly see an unlabeled example x, make a prediction y’(possibly based on the classifiers c), and then see the correct label y.

When one of these classifiers is perfect, there is a great algorithm available: predict according to the majority vote over every classifier consistent with every previous example. This is called the Halving algorithm. It makes at most log₂ |c| mistakes since on any mistake, at least half of the classifiers are eliminated.

Obviously, we can’t generally hope that the there exists a classifier which never errs. The Binomial Weighting algorithm is an elegant technique allowing a variant Halving algorithm to cope with errors by creating a set of virtual classifiers for every classifier which occasionally disagree with the original classifier. The Halving algorithm on this set of virtual classifiers satisfies a theorem of the form:

errors of binomial weighting algorithm less than min_c f(number of errors of c, number of experts)

The Binomial weighting algorithm takes as a parameter the maximal minimal number of mistakes of a classifier. By introducing a “prior” over the number of mistakes, it can be made parameter free. Similarly, introducing a “prior” over the set of classifiers is easy and makes the algorithm sufficiently flexible for common use.

However, there is a problem. The minimal value of f() is 2 times the number of errors of any classifier, regardless of the number of classifiers. This is frustrating because a parameter-free learning algorithm taking an arbitrary “prior” and achieving good performance on an arbitrary (not even IID) set of examples is compelling for implementation and use, if we had a good technique for removing the factor of 2. How can we do that?

See the weighted majority algorithm for an example of a similar algorithm which can remove a factor of 2 using randomization and at the expense of introducing a parameter. There are known techniques for eliminating this parameter, but they appear not as tight (and therefore practically useful) as introducing a “prior” over the number of errors.

2/19/20052/19/2005

Loss Functions for Discriminative Training of Energy-Based Models

This is a paper by Yann LeCun and Fu Jie Huang published at AISTAT 2005. I found this paper very difficult to read, but it does have some point about a computational shortcut.

This paper takes for granted that the method of solving a problem is gradient descent on parameters. Given this assumption, the question arises: Do you want to do gradient descent on a probabilistic model or something else?

All (conditional) probabilistic models have the form p(y|x) = f(x,y)/Z(x) where Z(x) = sum_y f(x,y) (the paper calls – log f(x,y) an “energy”). If f is parameterized by some w, the gradient has a term for Z(x), and hence for every value of y. The paper claims, that such models can be optimized for classification purposes using only the correct y and the other y’ not y which maximizes f(x,y). This can even be done on unnormalizable models. The paper further claims that this can be done with an approximate maximum. These claims are plausible based on experimental results and intuition.

It wouldn’t surprise me to learn that ignoring Z(x) (and renormalizing later) is common in fast implementations of some probabilistic model fitting algorithms, but I haven’t seen this previously discussed. Ability to use an approximate maximum y’ seems potentially very useful.

With that encouragement, I had significant difficulties with the paper, including the following:

Lack of a theorem. A good theorem proving these things work would be quite helpful. It isn’t clear whether the claims are always true, just true on the examples encountered, or true with some small modification.
Definition of Loss. For better or worse, the paper uses the second definition of loss, “Loss is part of the solution”, which I find unnatural.
Claims I don’t understand or which aren’t technically true. None of these seem to be related to the main point of the paper, but they are very distracting. For example, there is a claim that log-loss is the “only well-justified loss function”. The meaning of well-justified is unclear, and I can think of several meanings where other losses (such as squared error) are well-justified.

With the above difficulties, this paper seems lucky to have been accepted. This isn’t a criticism of AISTAT because it also seems plausible that this computational shortcut may eventually help many optimizations.