I had a chance to attend UAI this year, where several papers interested me, including:

- Hoifung Poon and Pedro Domingos, Sum-Product Networks: A New Deep Architecture. We’ve already discussed this one, but in a nutshell, they identify a large class of efficiently normalizable distributions and learn with them.
- Yao-Liang Yu and Dale Schuurmans, Rank/norm regularization with closed-form solutions: Application to subspace clustering. This paper is about matrices, and in particular they prove that certain matrices are the solutions of certain matrix optimization problems. I’m not matrix inclined enough to fully appreciate this one, but I believe many others may be, and anytime closed-form solutions come into play, you can get speedups of two orders of magnitude, as they show experimentally.
- Laurent Charlin, Richard Zemel and Craig Boutilier, A Framework for Optimizing Paper Matching. This is about what works in matching papers to reviewers, as tested at several previous NIPS conferences. We are looking into using this system for ICML 2012.

In addition I wanted to comment on Karl Friston’s invited talk. At the outset, he made a claim that seems outlandish to me: the way the brain works is to minimize surprise as measured by a probabilistic model. The majority of the talk was not actually about this; instead it was about how probabilistic models can plausibly do things that you might not have thought possible, such as birdsong. Nevertheless, I think several of us in the room ended up stuck on the claim during the questions afterward.

My personal belief is that world modeling (probabilistic or not) is a useful subroutine for intelligence, but it could not possibly be the entirety of intelligence. A key reason for this is the bandwidth of our senses—we simply take in too much information to model everything with equal attention. It seems critical for the efficient functioning of intelligence that only things which might plausibly matter are modeled, and only to the degree that matters. In other words, I do not model the precise placement of items on my desk, or even the precise content of my desk, because these details simply do not matter.

This argument can be made in another way. Suppose for the moment that all the brain does is probabilistic modeling. Then, the primary notion of failure to model is “surprise”, meaning the occurrence of low-probability events. Surprises (stumbles, car wrecks, and other accidents) certainly can be unpleasant, but that would be equally true if modeling were only a subroutine. The clincher is that there are many unpleasant things which are **not** surprises, including keeping your head under water, fasting, and self-inflicted wounds.

Accounting for the unpleasantness of these events requires more than probabilistic modeling. In other words, it requires rewards, which is why reinforcement learning is important. As a byproduct, rewards also naturally create a focus of attention, addressing the computational efficiency issue. Believing that intelligence is just probabilistic modeling is another example of a simple wrong answer.
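To make the notion of surprise concrete, here is a minimal sketch that measures it as Shannon surprisal, the negative log probability of an event. The events, probabilities, and reward values are hypothetical numbers picked only for illustration; the point is that a nearly certain event can carry almost no surprisal and still be unpleasant.

```python
import math


def surprisal_bits(p: float) -> float:
    """Shannon surprisal, in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


# (probability, reward) pairs are hypothetical, chosen for illustration only.
events = {
    "car wreck on the commute": (1e-4, -100.0),
    "hunger while deliberately fasting": (0.99, -5.0),
}

for name, (p, reward) in events.items():
    print(f"{name}: surprisal={surprisal_bits(p):.2f} bits, reward={reward}")
```

Under these made-up numbers the fast is almost unsurprising yet still has negative reward, which is the gap that a surprise-only account has to explain.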

It looks like you don’t really understand Friston’s work. Your argument against it is superficial (and wrong). Please read his papers in more detail.

Counter points:

1) Humans undergo a gestation period which trains on what is comfortable and what is not.

Read:

Friston K, Mattout J, Kilner J. Action understanding and active inference.

Feldman H, Friston KJ. Attention, uncertainty, and free-energy.

Friston KJ, Daunizeau J, Kilner J, Kiebel SJ. Action and behavior: a free-energy formulation.

Friston KJ, Daunizeau J, Kiebel SJ. Reinforcement learning or active inference?

2) Probabilistic inference accounts for attention.

Read:

Feldman H, Friston KJ. Attention, uncertainty, and free-energy.

Friston K, Stephan KE. Free energy and the brain.

Friston K, Kilner J, Harrison L. A free energy principle for the brain.

My comment above was overly harsh. I’m sorry for sounding a bit aggressive.

I’m happy enough to have a debate, but realistically it can’t start with me reading 7 papers. Can you respond to the arguments? They weren’t made lightly—I listened to Friston’s talk carefully and we had a discussion afterwards.

Mohamad, can you summarize why you think that John didn’t understand Friston’s main idea?

I completely agree with your assessment of Friston’s work. In fact, Nathaniel Daw and I have written a book chapter in which we discuss some of these issues (it happens to be in a book edited by Friston; see section 5.1):

http://www.princeton.edu/~sjgershm/book_chapter.pdf

Gershman, S.J. & Daw, N.D. (in press). Perception, action and utility: the tangled skein. In M. Rabinovich, K. Friston & P. Varona (Eds.), Principles of Brain Dynamics: Global State Interactions. MIT Press.

I alerted Karl Friston to this exchange and he sent me the following summary, which we thought might help:

“I agree entirely that probabilistic inference is not a sufficient account of intelligence. Indeed, in free energy minimisation, it is just a back story that provides (Bayes optimal) predictions for action to fulfil. The real story is active inference; in other words, the seeking out of sensory signals that are least surprising. This seems to address how ‘broadband’ sensory signals are selected and managed through active sampling of salient sensory subspaces. In vision, this can be through visual search or through the selective biasing of sensory signals (prediction errors) that are deemed to be precise (this is the model of attention mentioned above).

In relation to value or utility, I do not think there is any real conflict here. In active inference, value or utility functions are replaced by prior beliefs. The implicit exchangeability of loss functions and priors is established by the complete class theorem. In this setting, value becomes (inverse) surprise and surprise is determined by prior beliefs. These priors can be inherited through natural selection (cf. Bayesian model selection based on free energy) or learned as empirical priors (perceptual learning of the parameters of hierarchical models).

Having said this, I think there is an issue with reward in the optimal control theory sense, where value or ‘cost to go’ is the path integral of a reward or cost function. In brief, solutions of the Bellman optimality equations may be an unnecessary and unnatural way of optimising prior expectations about state transitions (policies), if there are simpler alternatives that optimise state transitions under prior beliefs about future states.”

Friston’s concept of “active inference” is (I find) one of the most perplexing aspects of his theory. It seems to fly directly in the face of what people in machine learning call “active perception” or “active learning”: seeking out surprising sensory signals. One can place this in a reinforcement learning context and show that information-gathering actions will improve policies (cf. Howard’s value of information, and later POMDP work by Kaelbling, Littman and others). In addition to being decision-theoretically justifiable, there is ample behavioral evidence that people and animals also do something like active learning (in the traditional sense).
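As a rough illustration of why a reward maximizer would seek out informative (and hence initially uncertain) observations, here is a minimal sketch of the expected value of perfect information for a one-shot decision. The states, actions, and utilities are hypothetical, and this is a much simpler setting than the POMDP work mentioned above.

```python
def expected_value_of_perfect_information(prior, utility):
    """EVPI for a one-shot decision.

    prior:   dict mapping state -> probability
    utility: dict mapping action -> dict mapping state -> utility
    """
    # Best expected utility when acting on the prior alone.
    act_now = max(
        sum(prior[s] * u[s] for s in prior) for u in utility.values()
    )
    # Expected utility if the true state is revealed before acting.
    act_informed = sum(
        prior[s] * max(u[s] for u in utility.values()) for s in prior
    )
    return act_informed - act_now


# Hypothetical foraging problem: the food is on the left or on the right.
prior = {"food_left": 0.5, "food_right": 0.5}
utility = {
    "go_left": {"food_left": 1.0, "food_right": 0.0},
    "go_right": {"food_left": 0.0, "food_right": 1.0},
}

print(expected_value_of_perfect_information(prior, utility))  # 0.5
```

With a uniform prior, looking before choosing is worth half a unit of utility; if the agent were already certain, the same quantity would be zero, so the value comes precisely from resolving uncertainty.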

Yet Friston is arguing the opposite: that one should shrink from the unknown. This arises in his theory because utility is defined in terms of prior probability. Moreover, in order to justify his concept of active inference, it is necessary to assume that this prior is very strong, for otherwise Bayesian learning would change the utility function and the utility-probability equivalence would be broken. This seems to imply that either (a) learning is impossible, since the prior overpowers any new information, or (b) learning is possible, but then the utility function would change with the posterior. Neither of these seems right to me.
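The first horn of that dilemma can be seen in a toy conjugate model. This is only a sketch under my own assumptions (a Beta prior on a Bernoulli rate, with made-up data), not anything taken from Friston’s papers: when the prior carries millions of pseudo-observations, a hundred real observations barely move the posterior.

```python
def posterior_mean(prior_a, prior_b, successes, failures):
    """Posterior mean of a Bernoulli rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)


# Hypothetical data: 90 successes and 10 failures.
data = (90, 10)

weak = posterior_mean(1, 1, *data)        # uniform Beta(1, 1) prior
strong = posterior_mean(1e6, 1e6, *data)  # prior worth two million pseudo-observations

print(f"weak prior:   {weak:.4f}")
print(f"strong prior: {strong:.4f}")
```

The weak prior ends up close to the observed rate of 0.9, while the strong prior stays essentially at 0.5, which is what “the prior overpowers any new information” looks like numerically.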

Perhaps I am misunderstanding all of this, but so far no one has shown me that my reasoning is fallacious. I’m totally open to being set right.

I think another way to frame this is that it is clearly possible, as Karl says above, to define a prior such that “surprise” under that prior coincides with (dis)utility and inference and utility maximization are equivalent. Karl (and others) have used this sort of transformation to solve RL problems; it has the same spirit as a lot of John’s work in reducing machine learning problems to other machine learning problems.
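One simple way to build such a prior, offered here as an assumption for illustration rather than as Karl’s actual construction, is to set p(s) proportional to exp(U(s)). Surprise is then -log p(s) = -U(s) + log Z, so ranking states by surprise is the same as ranking them by utility. The states and utilities below are hypothetical.

```python
import math


def prior_from_utility(utility):
    """Build a 'prior' with p(s) proportional to exp(U(s))."""
    z = sum(math.exp(u) for u in utility.values())
    return {s: math.exp(u) / z for s, u in utility.items()}


def surprise(prior, state):
    """Surprise (negative log probability, in nats) of a state under the prior."""
    return -math.log(prior[state])


# Hypothetical utilities, chosen only for illustration.
utility = {"breathing air": 0.0, "head under water": -10.0, "eating lunch": 2.0}
prior = prior_from_utility(utility)

best_by_utility = max(utility, key=utility.get)
least_surprising = min(prior, key=lambda s: surprise(prior, s))

print(best_by_utility, least_surprising)
for s in utility:
    print(f"{s}: U={utility[s]:+.1f}, surprise={surprise(prior, s):.3f} nats")
```

Differences in surprise equal differences in utility with the sign flipped, so the two objectives pick the same state; whether the resulting distribution deserves to be called a prior is a separate question.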

Then the response to John’s concerns is that holding your head under water really is surprising under that prior. But then John’s objection would be that this is a pretty strange prior and the word “surprise” doesn’t really apply; in particular, you don’t want to use this prior for inference (vs decision) since it misrepresents the actual probability of events.

In the end I think what is provocative about Karl’s proposal, to the extent I understand it, is that he really is proposing that the utility-function-as-prior and the inferential prior-as-event-probability coincide. In particular, I suspect he’d say self-drowning was rare among your ancestors or they wouldn’t have been your ancestors, so the prior you inherited from them by natural selection makes self-drowning surprising. I find this a little tough to swallow but I think it’s definitely worth engaging.

Finally, a related point (which Sam has another paper about, among others) is that the idea of probabilistic inference as an isolated subroutine is a little problematic. In particular if you are using probabilistic inference as a subroutine for decision (loss minimization), and if you are limited to approximate inference, then the approximation you should best adopt (in the loss-minimizing sense) depends on your loss function. You can’t just isolate the inference subroutine and do it “as well as possible” in a loss-independent sense. This may bring you back to Friston-like formulations and license otherwise weird priors.
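A very crude stand-in for this point, and not the setup of the paper alluded to above, is to approximate a posterior by a single point. Which point is “best” depends entirely on the loss: the sketch below uses a hypothetical skewed posterior and finds the best integer guess under three different losses.

```python
# Hypothetical skewed posterior over an integer-valued quantity.
posterior = {0: 0.4, 1: 0.15, 2: 0.15, 10: 0.3}


def expected_loss(guess, loss):
    """Expected loss of a point guess under the posterior."""
    return sum(p * loss(guess, x) for x, p in posterior.items())


losses = {
    "squared": lambda g, x: (g - x) ** 2,
    "absolute": lambda g, x: abs(g - x),
    "zero_one": lambda g, x: 0.0 if g == x else 1.0,
}

candidates = range(0, 11)

# For each loss, pick the single point that minimizes expected loss.
best = {
    name: min(candidates, key=lambda g: expected_loss(g, loss))
    for name, loss in losses.items()
}

print(best)  # {'squared': 3, 'absolute': 1, 'zero_one': 0}
```

The three losses select three different summaries of the same posterior (near the mean, the median, and the mode respectively), so there is no loss-independent sense in which one of them is the best approximation.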

I don’t have much to add as Friston’s statement was along the lines of what I understood at UAI.

I am personally both surprise seeking (exploring new things) and surprise avoiding (avoiding exploring unpromising things), which is easy to explain with a theory of rewards, but which seems to require real gymnastics otherwise. And, it’s not clear the gymnastics buy you something algorithmically—if you are going to have “priors” equivalent to (inverse) rewards so strong that you can’t change them based on experience, then probabilistic updating rules degenerate.