Dynamic Programming Generalizations and Their Use

David McAllester gave a talk about this paper (with Pedro Felzenszwalb). I’ll try to give a high-level summary of why it’s interesting.

Dynamic programming is most familiar as instantiated by Viterbi decoding in a hidden Markov model. It is a general paradigm for problem solving in which subproblems are solved and then used to solve larger problems. In the Viterbi decoding example, the subproblem is “What is the most probable path ending at each state at timestep t?”, and the larger problem is the same question at timestep t+1. There are a few optimizations you can do here:

  1. Dynamic Programming -> queued Dynamic Programming. Keep track of the “cost so far” (or “most probable path”) and (carefully) only look at extensions of paths likely to yield the shortest path. “Carefully” here is defined by Dijkstra’s shortest path algorithm.
  2. queued Dynamic Programming -> A*. Add a lower bound on the cost to complete a path (or an upper bound on the probability of a completion) to the priority used in Dijkstra’s shortest path algorithm; a minimal sketch follows this list. This can yield computational speedups varying between negligible and outstanding.
  3. A* -> Hierarchical A*. The efficiency of A* search depends on the tightness of its lower bound, which raises the question: “Where do you get the lower bound?” One appealing answer is from A* applied to a simplified version of the original problem with states aliased (many states in the original problem = 1 state in the new problem). This technique can be applied recursively until the problem is trivial.
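
For concreteness, here is a minimal sketch of the queued-DP/A* idea in Haskell. This is my own rendering over a hypothetical implicit-graph interface (succs, h, isGoal), not code from the paper: with h fixed at zero it is ordinary Dijkstra (queued dynamic programming), and a nontrivial admissible lower bound h gives A*.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

-- A* over an implicit graph.  With h = const 0 this reduces to Dijkstra,
-- i.e. "queued dynamic programming".  Assumes a consistent lower bound h
-- and nonnegative edge costs.
astar :: Ord node
      => (node -> [(node, Double)])   -- successors with edge costs
      -> (node -> Double)             -- lower bound on remaining cost
      -> (node -> Bool)               -- goal test
      -> node                         -- start state
      -> Maybe Double                 -- cost of a cheapest path to a goal, if any
astar succs h isGoal start = go (S.singleton (h start, start)) (M.singleton start 0)
  where
    go frontier best = case S.minView frontier of
      Nothing -> Nothing                    -- frontier exhausted: no goal reachable
      Just ((_, n), rest)
        | isGoal n  -> Just (best M.! n)    -- first goal popped is cheapest
        | otherwise ->
            let g = best M.! n
                relax (fr, b) (n', c)
                  | maybe True (g + c <) (M.lookup n' b) =
                      (S.insert (g + c + h n', n') fr, M.insert n' (g + c) b)
                  | otherwise = (fr, b)
                (fr', b') = foldl relax (rest, best) (succs n)
            in  go fr' b'
```

In the hierarchical variant, h itself would come from running the same search on a state-aliased version of the problem.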

Each of these steps has been noted previously (although perhaps not in the generality of this paper). What seems new and interesting is that the entire hierarchy of A* searches can be done simultaneously on one priority queue.

The resulting algorithm can use low level information to optimize high level search as well as high level information to optimize low level search in a holistic process. It’s not clear yet how far this approach can be pushed, but this quality is quite appealing. Naturally, there are many plausible learning-related applications.

Which Assumptions are Reasonable?

One of the most confusing things about understanding learning theory is the vast array of differing assumptions. Some critical thought about which of these assumptions are reasonable for real-world problems may be useful.

Before we even start thinking about assumptions, it’s important to realize that the word has multiple meanings. The meaning used here is “assumption = axiom” (i.e. something you cannot verify).

| Assumption | Reasonable? | Which analysis? | Example/notes |
|---|---|---|---|
| Independent and Identically Distributed (IID) data | Sometimes | PAC, ERM, prediction bounds, statistics | The KDD Cup 2004 physics dataset is plausibly IID. There are a number of situations which are “almost IID” in the sense that IID analysis yields correct intuitions. Unreasonable in adversarial situations (stock market, war, etc.). |
| Independently Distributed data | More than IID, but still only sometimes | online-to-batch conversion | Losing “identical” can be helpful in situations where you have a cyclic process generating data. |
| Finite exchangeability (FEX) | Sometimes reasonable | As for IID | There are a good number of situations where there is a population we wish to classify: we pay someone to classify a random subset and then try to learn. |
| Input space uniform on a sphere | No | PAC, active learning | I’ve never observed this in practice. |
| Functional form: “or” of variables, decision list, “and” of variables | Sometimes reasonable | PAC analysis | There are often at least OK functions of this form that make good predictions. |
| No noise | Rarely reasonable | PAC, ERM | Most learning problems appear to be of the form where the correct prediction given the inputs is fundamentally ambiguous. |
| Functional form: monotonic in variables | Often | PAC-style | Many natural problems seem to have behavior monotonic in their input variables. |
| Functional form: xor | Occasionally | PAC | I was surprised to observe this. |
| Fast mixing | Sometimes | RL | Interactive processes often fail to mix, ever, because entropy always increases. |
| Known optimal state distribution | Sometimes | RL | Sometimes humans know what is going on, and sometimes not. |
| Small approximation error everywhere | Rarely | RL | Approximate policy iteration is known for sometimes behaving oddly. |

If anyone particularly agrees or disagrees with the reasonableness of these assumptions, I’m quite interested.

What can Type Theory teach us about Machine Learning?

This post is some combination of belaboring the obvious and speculating wildly about the future. The basic issue to be addressed is how to think about machine learning in terms given to us by Programming Language theory.

Types and Reductions

John’s research programme (I feel this should be in British spelling to reflect the grandiosity of the idea…) of machine learning reductions StateOfReduction is at some essential level type-theoretic in nature. The fundamental elements are the classifier, a function f: alpha -> beta, and the corresponding classifier trainer g: List of (alpha,beta) -> (alpha -> beta). The research goal is to create *combinators* that produce new f’s and g’s given existing ones. John (probably quite rightly) seems unwilling at the moment to commit to any notion stronger than that these combinators are correctly typed. One way to see the result of a reduction is something typed like the following. (For those denied the joy of the Hindley-Milner type system, “simple” is probably wildly wrong.)

g’: List of (gamma,delta) -> (List of (alpha,beta) -> (alpha -> beta)) -> (gamma -> delta)

Perhaps another way is to think of the reduction itself type-theoretically as a combinator that takes something typed like g above and spits out a new classifier trainer. (I.e., a reduction “lifts” a lower-level function up to operate on some type it wasn’t originally designed for.)
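
To make the combinator reading concrete, here is one hedged Haskell rendering; the type synonyms and the toy one-against-all reduction below are my own illustration, not John’s definitions.

```haskell
type Classifier a b = a -> b
type Trainer    a b = [(a, b)] -> Classifier a b

-- A reduction, read as a combinator: it lifts a trainer for one learning
-- problem into a trainer for another.
type Reduction a b c d = Trainer a b -> Trainer c d

-- A toy instance: one-against-all lifts a binary trainer to a multiclass one.
oneAgainstAll :: Eq label => [label] -> Reduction a Bool a label
oneAgainstAll labels binaryTrainer examples = predict
  where
    -- train one binary classifier per label: "is the answer this label?"
    classifiers = [ (l, binaryTrainer [ (x, y == l) | (x, y) <- examples ])
                  | l <- labels ]
    -- predict the first label whose binary classifier fires (ties broken
    -- arbitrarily; a real implementation would use scores or confidences)
    predict x = case [ l | (l, c) <- classifiers, c x ] of
                  (l:_) -> l
                  []    -> head labels   -- fallback; assumes labels is non-empty
```

Nothing in these types enforces the error or regret properties discussed next; the point is only that the plumbing is ordinary typed composition, which is exactly what a compiler could be asked to check.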

Many of the things John considers reductions have particular error- and sample-complexity-preserving properties. For instance, a reduction may have the property that small regret of the low-level classifier implies small regret for the result of the reduction. There are a number of interesting issues here:

* Can we “type” these properties so a compiler might check them?
* Is this enough? Some reductions (including, I would argue, some that have been discussed seriously) are “trivial” reductions in the sense that the antecedent never holds. That is, they are combinators that set up learning problems impossible to solve. Can we somehow type this kind of thing so we can separate good from bad reductions, where we try to define bad as meaning something like “creates impossible subproblems where it was possible to do otherwise”?

RLBench hints at how powerful formalizing interfaces for things like reductions can be. You can chain together powerful commands that leverage existing classifier trainers to learn in a trace-generating model which is adapted (again, essentially a combinator library) from a generative model. Unfortunately, RLBench operates at the command line, which is poorly suited to this kind of composition.

Types and Probabilistic Models

Arguably the graphical models/Bayesian community has plans just as grandiose as John’s, but here the reductions are to learning probabilities in the sense of “beliefs”. Graphical models form a language that allows us to compose together little subgraphs that express our beliefs about some subsystem. Much research has gone into allowing the modeling language to drive inference as well. It’s interesting to explore what type theory can tell us about this as well.
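
As a crude, typed caricature of “composing subgraphs” (my own toy types, not an existing modeling language): a subsystem contributes a factor, and composing beliefs is composing factors.

```haskell
import qualified Data.Map as M

-- A toy factor-graph view: a factor scores an assignment of named variables,
-- and a model is just the product of the factors contributed by subsystems.
type Assignment = M.Map String Int
type Factor     = Assignment -> Double   -- nonnegative potential

-- Composing beliefs about subsystems is composing their factors.
composeFactors :: [Factor] -> Factor
composeFactors fs x = product [ f x | f <- fs ]
```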

Existing work

Lloyd Allison has taken thoughts like this quite seriously in his relatively recent work, “Types and Classes in Machine Learning and Data Mining”. Allison uses Haskell to type a number of ideas, including MML and some classifiers, and can check many things statically. Unfortunately, ML is one of those parts of information science where we demand every computational advantage we can get. If we learn to solve some problem with acceptable speed, in no time it is the inner loop of some more sophisticated machine learning algorithm. Despite my appreciation (well, mostly) of Haskell, it simply isn’t practical to write ML algorithms in Haskell. (Although presumably both Allison and … would disagree with me.)

It may still make a huge amount of sense to think about things in something like the Haskell type system and then translate them to the capable (but gross) type system of, say, C++. Understanding polymorphism and type classes and their relation to Machine Learning may be a real, fundamental breakthrough in making ML widely useful.

Contributions flowing towards PL theory

Right now, when push comes to shove, all good interfaces between systems basically amount to invoking functions or closures. When you get past “object oriented” and other such hype, this makes sense for structuring work. What’s interesting is that recent machine learning and AI work *can’t* be expressed that way. I think of graphical models as tools for expressing future interfaces because they preserve uncertainty across boundaries. This seems to me where ML people can challenge PL people: help us design a new conception of “interface” that preserves uncertainty between layers. (Say, one that passes probability or “belief messages” back and forth.) Perhaps the probabilistic machinery already exists: we can always define “sampling interfaces” between systems. My guess is that these interfaces are basically multi-directional (unlike functional interfaces). Why? Say I want a program to understand speech, and I build a layered system that consists of “signal processing”, “phoneme recognition”, “word recognition”, “language modeling”, and some form of “semantic understanding”. I can resolve something ambiguous about, say, a phoneme by appealing to the higher-level language model, so to get the phoneme parsing right I have to feed back results from the language layer. In this sense, interfaces need to preserve uncertainty and probably pass information both ways.
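
Here is a speculative sketch of what such an interface could look like, continuing the typed style above; the types and the upward/downward framing are illustrative assumptions, not an existing design.

```haskell
-- Beliefs are passed across the interface instead of single values.
type Dist a = [(a, Double)]   -- finite distribution: outcomes with probabilities

-- A layer passes beliefs upward, and also revises lower-level beliefs given
-- feedback from above, so information flows both ways (e.g. phonemes <-> words).
data Layer low high = Layer
  { up   :: Dist low -> Dist high
  , down :: Dist high -> Dist low -> Dist low
  }

-- Stacking two layers composes the upward passes and threads feedback back down.
stack :: Layer a b -> Layer b c -> Layer a c
stack lower upper = Layer
  { up   = up upper . up lower
  , down = \beliefTop beliefBottom ->
      let beliefMid  = up lower beliefBottom
          beliefMid' = down upper beliefTop beliefMid
      in  down lower beliefMid' beliefBottom
  }
```

The only point of the sketch is that down lets higher-level beliefs flow back to lower layers, as in the phoneme/language-model example above.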

How to start (small)

I think a serious effort should be made to explain things like reductions in terms of PL theory, even if it means that, like Allison, you have to do some explaining of type systems first. I’d love to spend some time hashing some of this out with PL people. (If only there were more hours in the day!)

We should write our libraries to have super-clean functional interfaces and make them as (parametrically) polymorphic as reasonable.
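
For instance, something as small as the following already buys composability (the names here are hypothetical, not an existing library): any feature map can be bolted onto any trainer without touching it.

```haskell
import Data.List (maximumBy)
import Data.Ord  (comparing)

type Classifier a b = a -> b
type Trainer    a b = [(a, b)] -> Classifier a b

-- A trivial baseline trainer: always predict the most common training label.
-- (Assumes a non-empty training set.)
majorityTrainer :: Eq b => Trainer a b
majorityTrainer examples = const (mostCommon [ y | (_, y) <- examples ])
  where
    mostCommon ys = snd (maximumBy (comparing fst)
                          [ (length (filter (== y) ys), y) | y <- ys ])

-- Preprocessing is just precomposition with a feature map.
withFeatures :: (c -> a) -> Trainer a b -> Trainer c b
withFeatures phi train examples = train [ (phi x, y) | (x, y) <- examples ] . phi
```

Once signatures are written down this way, translating them into, say, C++ templates is comparatively mechanical.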

Any other thoughts on ways to proceed on this in the short term?

Academic Mechanism Design

From game theory, there is a notion of “mechanism design”: setting up the structure of the world so that participants have some incentive to do sane things (rather than obviously counterproductive things). Application of this principle to academic research may be fruitful.

What is misdesigned about academic research?

  1. The JMLG guides give many hints.
  2. The common nature of bad reviewing also suggests the system isn’t working optimally.
  3. There are many ways to experimentally “cheat” in machine learning.
  4. Funding Prisoner’s Dilemma. Good researchers often write grant proposals for funding rather than doing research. Since the pool of grant money is finite, this means that grant proposals are often rejected, implying that more must be written. This is essentially a “prisoner’s dilemma”: anyone not writing grant proposals loses, but the entire process of doing research is slowed by the distraction. If everyone wrote half as many grant proposals, roughly the same distribution of funding would occur, and time would be freed for more research.

Mechanism design is not that easy: many counterintuitive effects can occur. Academic mechanism design is a particularly difficult problem because there are many details. Nevertheless, it may be worthwhile because it’s hard to overstate the value of an improvement in the rate of useful research.

The good news is that not everything needs to be solved at once. For example, on the empirical side, if we set up an easy system allowing anyone to create challenges like KDDCup, we might achieve a better (i.e. less cheat-prone) understanding of what works and what does not.

Going all the Way, Sometimes

At many points in research, you face a choice: should I keep on improving some old piece of technology or should I do something new? For example:

  1. Should I refine bounds to make them tighter?
  2. Should I take some learning theory and turn it into a learning algorithm?
  3. Should I implement the learning algorithm?
  4. Should I test the learning algorithm widely?
  5. Should I release the algorithm as source code?
  6. Should I go see what problems people actually need to solve?

The universal temptation of people attracted to research is doing something new. That is sometimes the right decision, but often it is not. I’d like to discuss some reasons why not.

  1. Expertise. Once expertise is developed on some subject, you are the right person to refine it.
  2. What is the real problem? Continually improving a piece of technology is a mechanism that forces you to confront this question. In many cases, this confrontation is uncomfortable because you discover that your method has fundamental flaws with respect to solving the real problem. Not going all the way means you never discover this, except the hard way: people lose interest in your work.
  3. Virtues of breadth. When you go all the way, you gain breadth, with a deeper understanding of which problems are important and why. This can be invaluable in focusing your future research.
  4. More Tangible Accomplishment. Going all the way means that you can point at the result and tell your peers “I solved it”.

Going all the way is sometimes problematic in research. For example, a paper with a theory, an algorithm, and experimental results invites defeat-in-detail: a reviewer can disagree with any one of these components and eliminate it from consideration. Another issue is that academia doesn’t directly reward implementing and releasing algorithms. A third issue is that you will almost certainly discover topics of interest which don’t fit your home conference(s). It is also very difficult to publish a paper with the title “an incremental improvement on X (which makes it work great in practice)”.

Along with this advice, it is important to remember to fail fast, where appropriate. When you discover that an idea is not workable, quickly quitting it and moving on is a real virtue.