I found Tong Zhang’s paper on Data Dependent Concentration Bounds for Sequential Prediction Algorithms interesting. Roughly speaking, it states a tight bound on the future error rate for online learning algorithms assuming that samples are drawn independently. This bound is easily computed and will make the progressive validation approaches used here significantly more practical.
The cross validation problem: cash reward
I just presented the cross validation problem at COLT.
The problem now has a cash prize (up to $500) associated with it—see the presentation for details.
The write-up for colt.
Languages of Learning
A language is a set of primitives which can be combined to succesfully create complex objects. Languages arise in all sorts of situations: mechanical construction, martial arts, communication, etc… Languages appear to be the key to succesfully creating complex objects—it is difficult to come up with any convincing example of a complex object which is not built using some language. Since languages are so crucial to success, it is interesting to organize various machine learning research programs by language.
The most common language in machine learning are languages for representing the solution to machine learning. This includes:
- Bayes Nets and Graphical Models A language for representing probability distributions. The key concept supporting modularity is conditional independence. Michael Kearns has been working on extending this to game theory.
- Kernelized Linear Classifiers A language for representing linear separators, possibly in a large space. The key form of modularity here is kernelization.
- Neural Networks A language for representing and learning functions. The key concept supporting modularity is backpropagation. (Yann LeCun gave some very impressive demos at the Chicago MLSS.)
- Decision Trees Another language for representing and learning functions. The key concept supporting modularity is partitioning the input space.
Many other learning algorithms can be seen as falling into one of the above families.
In addition there are languages related to various aspects of learning.
- Reductions A language for translating between varying real-world losses and core learning algorithm optimizations.
- Feature Languages Exactly how features are specified varies from on learning algorithm to another. Several people have been working on languages for features that cope with sparsity or the cross-product nature of databases.
- Data interaction languages The statistical query model of learning algorithms provides a standardized interface between data and learning algorithm.
These lists surely miss some languages—feel free to point them out below.
With respect to research “interesting” language-related questions include:
- For what aspects of learning is a language missing? Anytime adhocery is encountered, this suggests that there is room for a language. Finding what is not there is both hard and valuable.
- Are any of these languages fundamentally flawed or fundamentally advantageous with respect to another language?
- What are the most easy to use and effective primitives for these languages?
Lower Bounds for Learning Reductions
Learning reductions transform a solver of one type of learning problem into a solver of another type of learning problem. When we analyze these for robustness we can make statement of the form “Reduction R has the property that regret r (or loss) on subproblems of type A implies regret at most f ( r ) on the original problem of type B“.
A lower bound for a learning reduction would have the form “for all reductions R, there exists a learning problem of type B and learning algorithm for problems of type A where regret r on induced problems implies at least regret f ( r ) for B“.
The pursuit of lower bounds is often questionable because, unlike upper bounds, they do not yield practical algorithms. Nevertheless, they may be helpful as a tool for thinking about what is learnable and how learnable it is. This has already come up here and here.
At the moment, there is no coherent theory of lower bounds for learning reductions, and we have little understanding of how feasible they are or which techniques may be useful in proving them. Here is a rough summary of what I know:
- For structured prediction, we have a partially worked out lower bound for all reductions using the structure to only carry single bits. A proof for reductions using the structure in others ways seems tricky at the moment.
- For Reinforcement learning it may (this is unclear) be possible to prove a lower bound showing that prediction ability alone can not solve RL well.
- There are various results which can be thought of as lower bounds for more limited families of reductions. One example is analyzing exactly how badly margin optimization can underperform for 0-1 loss when there is noise.
Overall, this is a moderately interesting direction of research which has not been much investigated.
Reopening RL->Classification
In research, it’s often the case that solving a problem helps you realize that it wasn’t the right problem to solve. This is the case for the “reduce RL to classification” problem with the solution hinted at here and turned into a paper here.
The essential difficulty is that the method of stating and analyzing reductions ends up being nonalgorithmic (unlike previous reductions) unless you work with learning from teleoperated robots as Greg Grudic does. The difficulty here is due to the reduction being dependent on the optimal policy (which a human teleoperator might simulate, but which is otherwise unavailable).
So, this problem is “open” again with the caveat that this time we want a more algorithmic solution.
Whether or not this is feasible at all is still unclear and evidence in either direction would greatly interest me. A positive answer might have many practical implications in the long run.