Suppose we have a set of observations over time *x_1, x_2, …, x_t* and want to predict some future event *y_{t+1}*. An inevitable problem arises, because learning a predictor *h(x_1, …, x_t)* of *y_{t+1}* is generically intractable due to the size of the input. To make this problem tractable, what’s necessary is a method for summarizing the relevant information in past observations for the purpose of prediction in the future. In other words, state is required.

Existing approaches for deriving state have some limitations.

- Hidden Markov models learned with EM suffer from local minima, use tabular learning approaches which provide dubious generalization ability, and often require substantial a priori specification of the observations.
- Kalman Filters and Particle Filters are very parametric in the sense that substantial information must be specified up front.
- Dynamic Bayesian Networks (graphical models through time) require substantial a priori specification and often require the solution of difficult computational problems to use. Some of these difficulties are representational rather than computational.
- The Subspace-ID approach from control theory uses a linear representation, with the basic claim that it works well when all transformations are linear, but not so well when things are nonlinear. (Thanks to Drew for pointing it out.) In making this post, I ran across this two-day tutorial which discusses extensions of this idea to nonlinear systems. Unfortunately, I’ll miss the tutorial, and I haven’t found the related paper.

The point of this paper at ICML is that some dynamic systems (those which are “invertible”) can be decomposed into separate bounded resource prediction problems which, when solved, create an implicit definition of state. This allows us to use any general purpose supervised learning algorithm to solve the state formation problem without requiring linearity or any specific representation. When writing papers you don’t generally gush too hard, but it’s fair to say that I’m excited by this approach.

- It’s not a known dead end.
- It doesn’t require lots of prior specification & information when you have lots of data.
- It leverages the huge amount of work that has gone into supervised learning algorithm design.
- It works in controlled systems also, where the control is simply another observation.
- It works with generalization from the start, rather than requiring the (often awkward) addition of generalization later.
- It doesn’t require predicting everything in order to predict what you want.
- It can work with very large observation spaces, and can even work better the larger the observation space, because larger observations imply more invertibility.
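To make the flavor of the reduction concrete, here is a toy sketch of my own (not the algorithm from the paper): a fixed-size summary of the recent past replaces the unbounded input *x_1, …, x_t*, and predicting the next observation becomes an ordinary supervised regression problem. Linear least squares stands in for the supervised learner, and the simulated system is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dynamical system: a hidden state z evolves linearly and we see
# noisy observations x_t; the goal is to predict y = x_{t+1}.
A = np.array([[0.9, -0.2], [0.1, 0.8]])   # hidden dynamics (stable)
C = np.array([[1.0, 0.5], [0.3, 1.0]])    # observation map

def simulate(T):
    z = rng.normal(size=2)
    xs = []
    for _ in range(T):
        z = A @ z + 0.05 * rng.normal(size=2)
        xs.append(C @ z + 0.05 * rng.normal(size=2))
    return np.array(xs)

# Instead of a predictor h(x_1, ..., x_t) over an ever-growing input, use a
# bounded summary of the past (here simply the last two observations) as a
# stand-in for state, and fit one supervised regressor on that fixed-size
# input.  Any supervised learner could replace the least-squares fit below.
X = simulate(500)
summary = np.hstack([X[:-2], X[1:-1]])    # fixed-size summary of the past
target = X[2:]                            # next observation to predict
W, *_ = np.linalg.lstsq(summary, target, rcond=None)

mse_model = np.mean((summary @ W - target) ** 2)
mse_naive = np.mean((X[1:-1] - target) ** 2)   # "no change" baseline
print(f"model MSE {mse_model:.4f}  vs  naive MSE {mse_naive:.4f}")
```

The step the paper takes beyond this sketch is that the summary itself is learned, so state is defined implicitly by whatever suffices for prediction rather than being fixed in advance as it is here.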

I expect some people reading this paper will be disappointed that it doesn’t solve all problems. That’s good news for anyone interested in research. For those who aren’t, note that this is (in some sense) a generalization of subspace ID, and hence that there are other applications of the approach known to work in practice. Furthermore, we have some sample complexity analysis in the linear case.

It’s relatively rare to have a paper about a new approach to solving a problem as intractable as nonlinear dynamics has proved to be, so if you see a flaw please speak up.

Just saying ‘hi’. One of my PhD students decided to talk about this paper in our networks journal club on Monday, and I noticed the familiar name. It sounds like everything is going well. I hope you still make time for D & D.

I haven’t really thought about modelling dynamical systems before but this does look like an interesting problem.

At least superficially, it seems to be tackling the same sort of problem as Song et al’s paper “Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems”, also in ICML. I’m not really familiar with your or their paper but am curious as to the main differences in the approaches.

Maybe some comments on the ICML discussion site are in order?

A pointer to conflate.net discussion.

Hi John,

I recently read this paper and have talked to a number of people who have also read it. Two main comments based on this:

1) What you describe sounds very much like the “back propagation through time” method, albeit generalised from neural networks to a wider range of learning architectures. I presume you don’t think so?

2) Nobody I’ve talked to has managed to understand the 3 step learning algorithm the paper describes. Exactly what are all the B_t, how do they relate to the B_i in step 3, and what exactly is the alternation with B’ in step 2? You definitely need to write a longer and more explicit description of this algorithm.

A reasonable way to think of it is “backprop through time, except done right”. My understanding is that backprop through time has problems with local minima both theoretically and empirically. The method of training here is much better at avoiding local minima, essentially by the virtue of having localized prediction problems.

The B_t and B_i’s are different indices. Index ‘t’ is for the timestep while index ‘i’ is for the iteration. So, B_t is the propagator for timestep t, and a sequence of B_1,…,B_T are the propagators for all the timesteps. These propagators are then iterated (index ‘i’) towards convergence. I agree this is confusing—thanks for pointing it out, and I’ll work on it.

Perhaps it’s easiest to just understand step ‘1’, which gives you consistency, and then understand steps 2&3 as a sound technique to make all the B_t converge to a fixed function B.
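Here is a small schematic in code of just the indexing (illustrative only — least squares stands in for the supervised learner, and the toy system and the final averaging step are my simplifications, not the paper’s procedure): the inner loop runs over timesteps t, fitting one propagator B_t per timestep, while the outer loop runs over iterations i; at the end the per-timestep propagators are tied toward a single fixed B.

```python
import numpy as np

rng = np.random.default_rng(1)

# N short sequences of length T from a toy linear system.
N, T, d = 500, 6, 2
A = np.array([[0.9, -0.3], [0.3, 0.9]])
Z = rng.normal(size=(N, d))
X = np.zeros((N, T, d))
for t in range(T):
    Z = Z @ A.T + 0.05 * rng.normal(size=(N, d))
    X[:, t] = Z + 0.05 * rng.normal(size=(N, d))

def fit(inp, out):
    """Stand-in supervised learner: linear least squares."""
    W, *_ = np.linalg.lstsq(inp, out, rcond=None)
    return W

# Index t: one propagator per timestep.  B_t maps (s_{t-1}, x_t) to a new
# state s_t, trained so the state predicts the next observation x_{t+1}.
S = X.copy()                     # states initialized to raw observations
B = [None] * T
for i in range(3):               # index i: iterations toward convergence
    for t in range(1, T - 1):
        inp = np.hstack([S[:, t - 1], X[:, t]])
        B[t] = fit(inp, X[:, t + 1])
        S[:, t] = inp @ B[t]     # refresh state using the new B_t

mse_model = np.mean((S[:, T - 2] - X[:, T - 1]) ** 2)
mse_naive = np.mean((X[:, T - 2] - X[:, T - 1]) ** 2)
print(f"state-based MSE {mse_model:.4f}  vs  persistence {mse_naive:.4f}")

# Tie the per-timestep propagators toward one fixed function B.
B_fixed = np.mean([B[t] for t in range(1, T - 1)], axis=0)
```

Again, this only shows how the two indices interact; the actual alternation with B’ in step 2 of the paper is more careful than the crude averaging above.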