This is about a fundamental motivation for the investigation of reductions in learning. It applies to many pieces of work other than my own.
The reductionist approach to problem solving is characterized by taking a problem, decomposing it into as-small-as-possible subproblems, discovering how to solve the subproblems, and then discovering how to use those solutions to solve larger problems. This approach has often paid off very well. Examples from computer science include:
- Reducing computation to the transistor. All of our CPUs are built from transistors.
- Reducing rendering of images to rendering a triangle (or other simple polygons). Computers can now render near-realistic scenes in real time. The big breakthrough came from learning how to render many triangles quickly.
This approach to problem solving extends well beyond computer science. Many fields of science focus on theories that make predictions about very simple systems; these predictions are then composed to predict where spacecraft go, how large a cannonball needs to be, and so on. This approach has obviously been quite successful.
It is an open question whether or not this approach can really succeed at learning.
- Against: We know that successful learning requires the incorporation of prior knowledge in fairly arbitrary forms. This suggests that we cannot easily decompose the process of learning.
- For: We know that humans can succeed at general purpose learning. It may be that arbitrary prior knowledge is required to solve arbitrary learning problems, but perhaps there are specific learning algorithms incorporating specific prior knowledge capable of solving the specific problems we encounter.
- Neutral: We know that learning reductions sometimes work. We don’t yet have a good comparison of how well they work relative to other approaches.
First, an alternate (and maybe more useful?) definition of what’s important about ‘reductionism’: In Sussman and Abelson’s SICP, they define three desiderata for a formal language (be it for modeling purposes or engineering purposes): (1) Expressive primitives, (2) Composition rules, and (3) Abstraction rules. Every successful modeling and engineering framework I’ve encountered has possessed all three; they seem to be necessary for modeling or building complex systems. The expressiveness of the composition and abstraction rules seems particularly key. (Particularly salient examples of successful formal languages include TTL for digital hardware, linear systems theory with block diagrams for basic signal processing and modeling simple physical systems, the lumped parameter abstraction for understanding analog circuits, and of course some programming languages. We suck at understanding nonlinear systems, for example, because we don’t know how to piece together phase space structure from simple bits; some topological results help us here, but not much.)
I’ve tended to be happiest with the probabilistic worldview in learning because probability theory provides this formal language: The notion of conditioning and independence lets me combine simple distributions to create more complicated ones, and the notion of marginalization allows me to abstract away the details of an implementation of a particular desired relationship. I’d argue graphical models are so popular because they help to make the formal language underpinning probabilistic learning explicit, and suggest intuitive means of combining simple models (gluing graph fragments together) and abstracting details (marginalization amounts to integrating out nodes). Two open issues in this kind of modeling, however, are 1) how to construct general-purpose inference algorithms whose generality matches the expressiveness of the modeling language and 2) how to include various infinite models in the language, with appropriate composition and abstraction rules.
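To make those two operations concrete, here is a minimal numpy sketch (the distributions and variable names are toy choices of my own, not anything from an existing library): conditionals compose via the chain rule, and marginalization abstracts a variable away.

```python
import numpy as np

p_z = np.array([0.3, 0.7])                       # p(z): a simple prior
p_y_given_z = np.array([[0.9, 0.1],              # p(y|z): rows indexed by z
                        [0.2, 0.8]])
p_x_given_y = np.array([[0.6, 0.4],              # p(x|y): rows indexed by y
                        [0.1, 0.9]])

# Composition: glue the pieces into a joint, p(z, y, x) = p(z) p(y|z) p(x|y).
joint = p_z[:, None, None] * p_y_given_z[:, :, None] * p_x_given_y[None, :, :]

# Abstraction: marginalize (integrate out) z and y to get p(x), hiding how x
# was produced behind the sum.
p_x = joint.sum(axis=(0, 1))
print(p_x, p_x.sum())   # a valid distribution over x (sums to 1)
```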
I would love to see how to do the analogous tasks within other frameworks for learning (e.g. no-regret learning, max-margin learning, etc.). Perhaps this would just boil down to a language for assembling structured objective functions over mixed discrete and continuous spaces.
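As a purely hypothetical sketch of what such a language might look like (none of these function names come from an existing library), objective terms could be the primitives and summation the composition rule:

```python
import numpy as np

def hinge(w, X, y):
    """Max-margin data term; labels y are in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * (X @ w)).sum()

def l2(w, lam):
    """Smooth regularization term."""
    return lam * np.dot(w, w)

def compose(*terms):
    """Composition rule: a sum of objective terms is again an objective."""
    return lambda w: sum(t(w) for t in terms)

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
objective = compose(lambda w: hinge(w, X, y), lambda w: l2(w, 0.1))
print(objective(np.array([0.5, 0.5])))   # evaluates the assembled objective
```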
This composability has led to real learning successes (my AI favorite: Beal et al.’s audio-visual tracking hack; my cogsci favorite: Kemp and Tenenbaum’s learning domain structures stuff, which is soon to be extended). As far as the general question of whether reductionism will work at all: I feel there’s empirical evidence saying it has helped, but I’m not sure what the No Free Lunch theorems have to say about this. That is, can we pin down a class of learning problems including the ones people can solve which *are* amenable to solution by reduction?
Finally, a comment about existing work under the umbrella of ‘learning reductions’, which may be quite wrong since I’ve only glanced at what’s been done here: It seems to me that they’ve basically followed the notion of reduction coming from complexity theory (given a solution to problem A, produce a box that solves problem B). This notion has been extremely useful in complexity because the composition rules allowed in constructing box B from box A are both expressive and robust (e.g. stable to polynomial changes in complexity of primitives and polynomial amounts of work). In learning, we’ll need to keep this notion (a learning reduction will probably only be relevant if it doesn’t change the computational complexity class of a problem) and add sample complexity (or account for this in the structure of the reduction).
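To show the flavor I mean, here is a minimal sketch of one standard such construction, one-against-all: given a box that learns binary classifiers, build a box that solves multiclass problems with k calls to it (the logistic regression learner below is just a stand-in for whatever binary oracle is available):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_against_all_train(X, y, make_binary_learner=LogisticRegression):
    """Reduce multiclass learning to k binary problems, one per class."""
    classes = np.unique(y)
    boxes = {}
    for c in classes:
        learner = make_binary_learner()
        learner.fit(X, (y == c).astype(int))   # one call to the binary oracle
        boxes[c] = learner
    return classes, boxes

def one_against_all_predict(X, classes, boxes):
    """Predict the class whose binary box is most confident."""
    scores = np.column_stack([boxes[c].predict_proba(X)[:, 1] for c in classes])
    return classes[np.argmax(scores, axis=1)]
```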
The major issue I see there is that it’s not at all clear to me that all or most interesting learning problems will happen to have asymptotic polynomial complexity; instead, it might just be that we only need to solve them for small enough problem sizes that the exponential doesn’t *quite* kill us.
I would also love to see this.
I agree that the graphical model approach has been the most successful example of reductionism in learning. A big question here is: can the remaining difficulties of graphical models be solved in that paradigm, or do we need other paradigms for reductionism? (The big problems: inference is hard, and performance is sometimes suboptimal compared to other approaches.)
What are your favorite examples of the suboptimality in terms of performance of explicitly probabilistic approaches (not just ones expressible as graphical models)? I’ve seen some stuff arguing that (when a model is only intended for classification) ‘discriminative’ approaches do better than ‘generative’ ones; is this what you’re referring to? What is the most compelling instance you can point to here?
The discriminative vs. generative argument has some traction with me. The evidence I have is only anecdotal.
Observationally, I’ve only seen Radford Neal win competitions with probabilistic techniques.
Another item that comes up is that when people publish empirical results for graphical models, they typically do not compare with the approaches I would expect to achieve the best performance for classification. When pressed, a typical response is “but we are only comparing with understandable models”. This isn’t satisfying if you are oriented towards error rate performance.
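For concreteness, the kind of head-to-head comparison I have in mind looks like the sketch below; it uses synthetic data, so it demonstrates the setup rather than any conclusion, with Gaussian naive Bayes standing in for a generative model and logistic regression for a discriminative one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic classification problem; both models see the same train/test split.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)   # test-set accuracy
    print(type(model).__name__, round(acc, 3))
```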
See this report for an extensive comparison of some machine learning and probabilistic techniques: Benchmarking Support Vector Machines.
It’s important to note that few statistical methods concern themselves with zero-one loss explicitly – it’s considered to be an unreliable loss function. Robust statistics shows how it’s possible to work with things other than log-likelihood.
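As a small illustration of the point about loss functions (the margin values below are made up), the zero-one loss we report is rarely what gets optimized; methods optimize surrogates such as the log-likelihood (logistic) loss or the hinge loss instead:

```python
import numpy as np

margins = np.linspace(-2.0, 2.0, 5)    # hypothetical values of y * f(x)

zero_one = (margins <= 0).astype(float)       # the loss we actually report
log_loss = np.log1p(np.exp(-margins))         # log-likelihood (logistic) surrogate
hinge    = np.maximum(0.0, 1.0 - margins)     # max-margin surrogate
for name, vals in [("0/1", zero_one), ("log", log_loss), ("hinge", hinge)]:
    print(name, np.round(vals, 3))
```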