Suppose we have a sequence of observations over time x_1, x_2, …, x_t and want to predict some future event y_{t+1}. An inevitable problem arises: learning a predictor h(x_1, …, x_t) of y_{t+1} is generically intractable, because the input grows with the length of the history. To make the problem tractable, we need a way to summarize the information in past observations that is relevant to prediction in the future. In other words, state is required.
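To make the interface concrete, here is a toy Python sketch of my own (not from any particular paper): rather than feeding the whole history to a predictor, maintain a bounded-size state s_t = f(s_{t-1}, x_t) and predict from the state alone. The update rule (a running mean) and the predictor (a threshold) are hypothetical placeholders.

```python
# Toy illustration: prediction from a bounded-size state rather than the full history.

def update_state(state, observation):
    """Fold the newest observation into a fixed-size state summary."""
    count, mean = state
    count += 1
    mean += (observation - mean) / count  # running mean as a toy summary
    return (count, mean)

def predict(state):
    """Predict the next outcome from the state alone, not the full history."""
    _, mean = state
    return 1 if mean > 0.5 else 0  # toy threshold predictor

observations = [0.2, 0.9, 0.7, 0.8]
state = (0, 0.0)                   # initial state: (count, running mean)
for x in observations:
    state = update_state(state, x)
print(predict(state))              # prediction of y_{t+1} uses only the state
```

The point of the sketch is only the shape of the problem: everything the predictor sees about the past must fit in the state.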
Existing approaches for deriving state have some limitations.
- Hidden Markov models learned with EM suffer from local minima, rely on tabular representations with dubious generalization ability, and often require substantial a priori specification of the observations.
- Kalman Filters and Particle Filters are very parametric in the sense that substantial information must be specified up front.
- Dynamic Bayesian Networks (graphical models through time) require substantial a priori specification and often require solving difficult computational problems to use. Some of these difficulties are representational rather than computational.
- The Subspace-ID approach from control theory uses a linear representation, with the basic claim that it works well when all transformations are linear, but not so well when things are nonlinear. (Thanks to Drew for pointing it out.) In making this post, I ran across this two-day tutorial which discusses extensions of the idea to nonlinear systems. Unfortunately, I’ll miss the tutorial, and I haven’t found the related paper.
The point of this paper at ICML is that some dynamic systems (those which are “invertible”) can be decomposed into separate bounded-resource prediction problems which, when solved, create an implicit definition of state. This lets us use any general-purpose supervised learning algorithm to solve the state-formation problem, without requiring linearity or any particular representation. When writing papers you don’t generally gush too hard, but it’s fair to say that I’m excited by this approach (a toy sketch of the flavor of the reduction follows the list below).
- It’s not a known dead end.
- It doesn’t require lots of prior specification & information when you have lots of data.
- It leverages the huge amount of work that has gone into supervised learning algorithm design.
- It works in controlled systems also, where the control is simply another observation.
- It works with generalization from the start, rather than requiring the (often awkward) addition of generalization later.
- It doesn’t require predicting everything in order to predict what you want.
- It can work with very large observation spaces, and can even work better the larger the observation space, because larger observations imply more invertibility.
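Here is the promised sketch of the flavor of the reduction, as I read it; it is not the paper’s exact algorithm. Learn to predict near-future observations from a bounded window of past observations using any supervised learner; the vector of such predictions then acts as an implicit state, which a second learned predictor maps to the quantity you actually care about. The window sizes, the linear least-squares learner, and the synthetic data below are all illustrative assumptions.

```python
# Hedged sketch: supervised prediction problems whose solutions define an implicit state.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scalar time series with hidden second-order structure.
T = 2000
x = np.zeros(T)
x[1] = rng.normal()
for t in range(2, T):
    x[t] = 1.2 * x[t - 1] - 0.5 * x[t - 2] + 0.1 * rng.normal()

k, m = 4, 3  # past-window and future-window sizes (assumed bounds)

# Step 1: supervised problems "predict each near-future observation from the past window".
rows = range(k, T - m)
past = np.array([x[t - k:t] for t in rows])        # features: bounded history window
future = np.array([x[t:t + m] for t in rows])      # targets: next m observations
W, *_ = np.linalg.lstsq(past, future, rcond=None)  # any supervised learner would do here

# Step 2: the predicted futures act as state; learn the final predictor from state alone.
state = past @ W                                   # implicit state representation
target = x[[t + m for t in rows]]                  # the y we actually want to predict
v, *_ = np.linalg.lstsq(state, target, rcond=None)

print("training RMSE from state:", np.sqrt(np.mean((state @ v - target) ** 2)))
```

The design choice to notice is that no explicit state space is ever specified: the learned predictions of the future are the state, and a richer supervised learner could be swapped in for the least-squares fits to handle nonlinear systems.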
I expect some people reading this paper will be disappointed that it doesn’t solve all problems. That’s good news for anyone interested in research. For those who aren’t, note that this is (in some sense) a generalization of subspace ID, and hence there are other applications of the approach known to work in practice. Furthermore, we have some sample-complexity analysis in the linear case.
It’s relatively rare to see a paper proposing a new approach to a problem as stubbornly intractable as nonlinear dynamics has proved to be, so if you see a flaw, please speak up.