It was a fine time for learning in Pittsburgh. John and Sam mentioned some of my favorites. Here are a few more worth checking out:
Online Multitask Learning
Ofer Dekel, Phil Long, Yoram Singer
This is on my reading list. Definitely an area I’m interested in.
Maximum Entropy Distribution Estimation with Generalized Regularization
Miroslav Dudík, Robert E. Schapire
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path
András Antos, Csaba Szepesvári, Rémi Munos
Again, on the list to read. I saw Csaba and Remi talk about this and related work at an ICML Workshop on Kernel Reinforcement Learning. The big question in my head is how this compares/contrasts with existing work in reductions to reinforcement learning. Are there advantages/disadvantages?
Higher Order Learning On Graphs by Sameer Agarwal, Kristin Branson, and Serge Belongie looks to be interesting. They seem to pooh-pooh “tensorization” of existing graph algorithms.
Cover Trees for Nearest Neighbor (Alina Beygelzimer, Sham Kakade, John Langford) finally seems to have gotten published. It’s an embarrassment to the community that it took this long, and a reminder of how diligent one has to be in ensuring good work gets published. This seems to happen on a regular basis. (See A New View of EM.)
Finally, I thought this one was very cool:
Constructing Informative Priors by Rajat Raina, Andrew Y. Ng, Daphne Koller.
Same interest as the first paper on the list.
Check them out!
\”Learning near-optimal policies with Bellman-residual minimization..\” if there are there any advantages to existing work in reductions?
Indeed, the paper can be seen as a reduction of reinforcement learning to regression. The question asked is quite simple: Given a single stream of experience, what performance can be guaranteed? How fast (if ever) can we approach optimality? What is the class of problems for which this is possible? How do the properties of the problem (smoothness, mixing rate) influence the rates?
I realize that there exist quite a few works that could be relevant, though I did not find a box on your RL reductions page into which I could easily put this algorithm. Actually, the model considered lies between the Direct Experience Model and the Reset Model. Here the agent does not generate the data for itself, as would be the case in the Direct Experience Model, nor does it have access to a reset; instead it observes the experience trace of some policy in a passive manner. This is a form of “Off-Policy Learning”.
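To make the passive, single-trajectory setting concrete, here is a minimal sketch of such a learner. It is my own illustration, not the algorithm from the paper: plain least-squares fitted backups stand in for the modified Bellman-residual minimization, and the feature map and all names are hypothetical.

```python
import numpy as np

def fitted_policy_iteration(trajectory, featurize, n_actions,
                            gamma=0.95, n_iterations=10):
    """Sketch: greedy fitted policy iteration from one logged trajectory.

    `trajectory` is a list of (state, action, reward, next_state) tuples
    observed passively from some fixed behavior policy; the learner never
    interacts with the environment and never resets it.  Least-squares
    fitted backups stand in for the paper's modified Bellman-residual
    minimization; `featurize(s, a)` is a hypothetical feature map.
    """
    dim = featurize(trajectory[0][0], 0).shape[0]
    w = np.zeros(dim)  # weights of the current Q-function estimate

    def q(s, a):
        return featurize(s, a) @ w

    def greedy(s):
        return max(range(n_actions), key=lambda a: q(s, a))

    for _ in range(n_iterations):
        # Policy evaluation step: regress onto one-step backed-up values
        # of the current greedy policy, using only the logged transitions.
        X = np.array([featurize(s, a) for s, a, _, _ in trajectory])
        y = np.array([r + gamma * q(s2, greedy(s2))
                      for _, _, r, s2 in trajectory])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)

    return greedy  # the final greedy policy
```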
The result guarantees ε-optimality, if you want.
If we rewrote the result as a sample-complexity result, we would get, omitting many-many-many factors and fine details, O(V/ε^4), where V is a (new) complexity measure of the function space used for fitting the Q-functions. By using a different proof technique the 1/ε^4 dependence can, in theory, be improved to 1/ε^2. We plan to work on that (1/ε^2 would be the comparable rate for regression).
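Written out (and still hiding constants, log factors, and the dependence on mixing and smoothness), the two rates being compared are roughly:

```latex
N(\epsilon) \;=\; O\!\left(\frac{V}{\epsilon^{4}}\right) \quad \text{(current proof technique)},
\qquad
N(\epsilon) \;=\; O\!\left(\frac{V}{\epsilon^{2}}\right) \quad \text{(hoped for; the comparable rate for regression)}
```

where N(ε) (notation introduced here, not the paper's) denotes the number of observed transitions needed to guarantee an ε-optimal policy, and V is the new capacity measure of the function class used to fit the Q-functions.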
Actually, there is a longer version with the full proofs for those who would like to dive into the details, downloadable from here.
I don’t consider this paper a reductions result, because it makes assumptions about the world. Every other paper on the reductions page makes no assumptions about the world. This may be counterintuitive the first time you run into it, because making assumptions about the world is so common when thinking about machine learning.
John, I think I agree, but I think your reply is a bit terse.
I wouldn\’t view at as a reduction because of (from the paper):
with the biggest objections being the smoothness assumptions on the MDP. You could object as well to it being an MDP at all
by assumption, but this would also remove some other interesting work at least I think of as a reduction. Finally, you could
see the \”approximation power\” bit to be orthogonal to the reduction idea, but I think that\’s fine. It is somewhat disturbing that
this work seems to need a new notion of approximator complexity, especially when existing reductionss are happy with the
classic definitions.
Can you clarify?
The assumption that it is an MDP, that the MDP is smooth, and that the observed experience mixes, are all assumptions about the world.
It is entirely possible that the bar of a pure reduction cannot be met in this (or other) RL settings, in which case it may be necessary to introduce additional assumptions. Proving the necessity of additional assumptions, or the sufficiency of the ability to generalize, is one of the bigger open questions.
John: I looked at the reductions page and picked two papers (Metric-E3, Pegasus). In my reading they make assumptions about the environment: Metric-E3 needs the “Local Modeling Assumption”, a kind-of-smoothness assumption (this allows it to generalize to neighboring states), and Pegasus, at least for the infinite action space result, again requires a smoothness assumption. Further, any result that assumes that the MDP is finite arguably makes a kind of smoothness assumption (every function over a discrete space, when the space is considered as a topological space, is smooth). Assuming that you can generate samples “at will” is also an assumption. But of course, it is a question of taste which of these assumptions you like.
Actually, one assumption on the MDP that is sufficient for the result to hold is that the transition kernel admits a uniformly bounded density. Even so, this still gives you a big set of MDPs, as shown by a lower bound due to Chow and Tsitsiklis in their 1989 paper. The mixing assumption is weaker than assuming that you have access to a generative model or a reset, at least in my opinion.
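For concreteness, one common way to write such a bounded-density condition is the following (a sketch of the assumption, with λ a fixed base measure on the state space; the paper’s exact statement may differ in details):

```latex
P(\,\cdot \mid x, a) \;\text{admits a density}\; p(\,\cdot \mid x, a) \;\text{w.r.t. } \lambda,
\qquad
\sup_{x \in \mathcal{X},\; a \in \mathcal{A}} \bigl\| p(\,\cdot \mid x, a) \bigr\|_{\infty} \;\le\; C \;<\; \infty .
```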
I think we all agree that finding out which of these assumptions are essential is important.
Drew: Re the new notion of approximator complexity. Yes, it is disturbing:) I think this is the price of using policy iteration *and* a single stream of experience. However, this is yet to be proven (or disproven)!
I see a significant difference between two kinds of assumptions: verifiable and unverifiable. The mechanism used to gather data is verifiable, because you can check whether or not some interface is available. Your own ability to learn is often verifiable, because you can check whether or not you make correct predictions. IIDness of samples is not verifiable. Smoothness properties of the real world are not verifiable. In general, using verifiable assumptions seems preferable to using unverifiable assumptions.
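As a toy illustration of the distinction (hypothetical interface names, not any particular library): the first two checks below can actually be run against the world, the third cannot be.

```python
def check_assumptions(env, learner, holdout):
    """Toy illustration: which assumptions can be checked at runtime?"""
    # Verifiable: the data-gathering mechanism either exposes a reset /
    # generative-model interface or it does not.
    has_reset = hasattr(env, "reset") and callable(env.reset)

    # Verifiable: your own ability to learn can be measured by checking
    # predictions against held-out observations.
    error_rate = sum(learner.predict(x) != y for x, y in holdout) / len(holdout)

    # Not verifiable: no finite sample lets you confirm that the world
    # generated the data i.i.d., or that its dynamics are smooth.
    samples_are_iid = None  # cannot be established from the data alone

    return has_reset, error_rate, samples_are_iid
```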
I don’t regard the metric-E3 result as a good reduction. The problem is that there is a Markov assumption which is not verifiable.
What about finiteness of the state and action spaces? Is finiteness a verifiable assumption? I feel that this assumption is a rather strong one. In fact, as I argued, this assumption is much stronger than assuming smoothness. (In particular, the above paper’s smoothness assumptions are automatically satisfied in finite state spaces.)
Mixing: I agree, this is a strong assumption.
Even worse, we assume that the trajectory is exponentially (beta-)mixing (i.e. mixing is very fast). I believe that under algebraic (i.e., weaker) beta-mixing no polynomial-time learning is possible (though, I did not yet see a proof of this). If there is no mixing then all random variables (samples) could be the same; learning is not possible in that case.
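For readers who have not met the terminology, one standard way to state the β-mixing coefficients of the trajectory and the two regimes contrasted above is the following (a textbook-style sketch; the paper’s exact definitions may differ slightly):

```latex
\beta_m \;=\; \sup_{t}\;
\mathbb{E}\!\left[\; \sup_{B \,\in\, \sigma(X_{t+m},\, X_{t+m+1},\, \dots)}
\bigl|\, \Pr\!\bigl(B \mid X_1, \dots, X_t\bigr) - \Pr(B) \,\bigr| \;\right],
\\[4pt]
\text{exponential } \beta\text{-mixing:}\quad \beta_m \le \bar{\beta}\, e^{-b\, m^{\kappa}},
\qquad
\text{algebraic } \beta\text{-mixing:}\quad \beta_m \le \bar{\beta}\, m^{-\kappa},
\qquad \bar{\beta},\, b,\, \kappa > 0 .
```

Exponential mixing means the dependence between samples m steps apart decays geometrically; algebraic mixing allows it to decay only polynomially, which is the weaker condition referred to above.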
Is it possible to avoid making assumptions about the data generating mechanism? In universal prediction the bounds hold for any data sequences (though there are assumptions like that the actual data sequence does not depend on the decision maker’s choice). However, these bounds are relative to some model class (expert set) and do not yield any bounds on the actual performance. To arrive at such a bound one needs to argue about the relationship of the actual environment and the considered expert set. This argument seems to require some assumptions on the data generating mechanism.
Another result that seems relevant here is that for any universal method (e.g. in classification/regression, sticking to the iid setting) you can always pick a bad distribution that makes the convergence of the method slow. Knowing this, it then makes sense to ask what makes fast learning possible (e.g. smoothness of the regressor). The next step is to study algorithms that are guaranteed to achieve the best rate whatever environment they are exposed to. This removes the need to know whether the assumptions hold. Yet it makes sense to consider these various assumptions, since they help us to differentiate between algorithms (algorithms that achieve the best rates vs. those that don’t).
This argument suggests that results that make nonverifiable assumptions about the environment can be viewed as a means to learn about the possible rates and I indeed like to think about them that way.