A substantial difficulty with the 2009 and 2008 ICML discussion system was a communication vacuum, where authors were not informed of comments, and commenters were not informed of responses to their comments without explicit monitoring. Mark Reid has setup a new discussion system for 2010 with the goal of addressing this.
Mark didn’t want to make it to intrusive, so you must opt-in. As an author, find your paper and “Subscribe by email” to the comments. As a commenter, you have the option of providing an email for follow-up notification.
A language is a set of primitives which can be combined to succesfully create complex objects. Languages arise in all sorts of situations: mechanical construction, martial arts, communication, etc… Languages appear to be the key to succesfully creating complex objects—it is difficult to come up with any convincing example of a complex object which is not built using some language. Since languages are so crucial to success, it is interesting to organize various machine learning research programs by language.
The most common language in machine learning are languages for representing the solution to machine learning. This includes:
- Bayes Nets and Graphical Models A language for representing probability distributions. The key concept supporting modularity is conditional independence. Michael Kearns has been working on extending this to game theory.
- Kernelized Linear Classifiers A language for representing linear separators, possibly in a large space. The key form of modularity here is kernelization.
- Neural Networks A language for representing and learning functions. The key concept supporting modularity is backpropagation. (Yann LeCun gave some very impressive demos at the Chicago MLSS.)
- Decision Trees Another language for representing and learning functions. The key concept supporting modularity is partitioning the input space.
Many other learning algorithms can be seen as falling into one of the above families.
In addition there are languages related to various aspects of learning.
- Reductions A language for translating between varying real-world losses and core learning algorithm optimizations.
- Feature Languages Exactly how features are specified varies from on learning algorithm to another. Several people have been working on languages for features that cope with sparsity or the cross-product nature of databases.
- Data interaction languages The statistical query model of learning algorithms provides a standardized interface between data and learning algorithm.
These lists surely miss some languages—feel free to point them out below.
With respect to research “interesting” language-related questions include:
- For what aspects of learning is a language missing? Anytime adhocery is encountered, this suggests that there is room for a language. Finding what is not there is both hard and valuable.
- Are any of these languages fundamentally flawed or fundamentally advantageous with respect to another language?
- What are the most easy to use and effective primitives for these languages?
Chicago ‘05 ended a couple of weeks ago. This was the sixth Machine Learning Summer School, and the second one that used a wiki. (The first was Berder ‘04, thanks to Gunnar Raetsch.) Wikis are relatively easy to set up, greatly aid social interaction, and should be used a lot more at summer schools and workshops. They can even be used as the meeting’s webpage, as a permanent record of its participants’ collaborations — see for example the wiki/website for last year’s NVO Summer School.
A basic wiki is a collection of editable webpages, maintained by software called a wiki engine. The engine used at both Berder and Chicago was TikiWiki — it is well documented and gets you something running fast. It uses PHP and MySQL, but doesn’t require you to know either. Tikiwiki has far more features than most wikis, as it is really a full Content Management System. (My thanks to Sebastian Stark for pointing this out.) Here are the features we found most useful:
A couple of other TikiWiki features that we didn’t get working at Chicago, but would have been nice to have, are these:
- Image Galleries. Gunnar got this working at Berder, where it was a huge success. Photographs are great icebreakers, even the ones that don’t involve dancing on tables.
- Surveys. These are easy to set up, and have option for participants to see, or not to see, the results of surveys — useful when asking people to rate lectures.
TikiWiki also has several features that we didn’t use, such as blogs and RSS feeds. It also has a couple of bugs (and features that are bad enough to be called bugs), such as permission issues and the inability to print calendars neatly. These will doubtless get cleaned up in due course.
Finally, owing to much prodding from John and some other MLSS participants, I’ve written up my experiences in using TikiWiki @ Chicago ‘05 on my website, including installation instructions and a list of “Good Things to Do”. This documentation is meant to be a survival guide complementary to the existing TikiWiki documentation, which can sometimes be overwhelming.
The diagram above shows a very broad viewpoint of learning theory.
|| Some prediction algorithm A does almost as well as any of a set of algorithms.
||Assuming independent samples, past performance predicts future performance.
||PAC analysis, ERM analysis
||Future prediction performance on subproblems implies future prediction performance using algorithm A.
A basic question is: Are there other varieties of statements of this type? Avrim noted that there are also “arrows between arrows”: generic methods for transforming between Past->Past statements and Past->Future statements. Are there others?
Steve Smale and I have a debate about goals of learning theory.
Steve likes theorems with a dependence on unobservable quantities. For example, if D is a distribution over a space X x [0,1], you can state a theorem about the error rate dependent on the variance, E(x,y)~D (y-Ey’~D|x[y'])2.
I dislike this, because I want to use the theorems to produce code solving learning problems. Since I don’t know (and can’t measure) the variance, a theorem depending on the variance does not help me—I would not know what variance to plug into the learning algorithm.
Recast more broadly, this is a debate between “declarative” and “operative” mathematics. A strong example of “declarative” mathematics is “a new kind of science”. Roughly speaking, the goal of this kind of approach seems to be finding a way to explain the observations we make. Examples include “some things are unpredictable”, “a phase transition exists”, etc…
“Operative” mathematics helps you make predictions about the world. A strong example of operative mathematics is Newtonian mechanics in physics: it’s a great tool to help you predict what is going to happen in the world.
In addition to the “I want to do things” motivation for operative mathematics, I find it less arbitrary. In particular, two reasonable people can each be convinced they understand a topic in ways so different that they do not understand the viewpoint. If these understandings are operative, the rest of us on the sidelines can better appreciate which understanding is “best”.
Machine Learning is a field with an impressively diverse set of reseearch styles. Understanding this may be important in appreciating what you see at a conference.
- Engineering. How can I solve this problem? People in the engineering research style try to solve hard problems directly by any means available and then describe how they did it. This is typical of problem-specific conferences and communities.
- Scientific. What are the principles for solving learning problems? People in this research style test techniques on many different problems. This is fairly common at ICML and NIPS.
- Mathematical. How can the learning problem be mathematically understood? People in this research style prove theorems with implications for learning but often do not implement (or test algorithms). COLT is a typical conference for this style.
Many people manage to cross these styles, and that is often beneficial.
Whenver we list a set of alternative, it becomes natural to think “which is best?” In this case of learning it seems that each of these styles is useful, and can lead to new useful discoveries. I sometimes see failures to appreciate the other approaches, which is a shame.
This is an attempt to organize the broad research programs related to machine learning currently underway. This isn’t easy—this map is partial, the categories often overlap, and there are many details left out. Nevertheless, it is (perhaps) helpful to have some map of what is happening where. The word ‘typical’ should not be construed narrowly here.
- Learning Theory Focuses on analyzing mathematical models of learning, essentially no experiments. Typical conference: COLT.
- Bayesian Learning Bayes law is always used. Focus on methods of speeding up or approximating integration, new probabilistic models, and practical applications. Typical conferences: NIPS,UAI
- Structured learning Predicting complex structured outputs, some applications. Typiical conferences: NIPS, UAI, others
- Reinforcement Learning Focused on ‘agent-in-the-world’ learning problems where the goal is optimizing reward. Typical conferences: ICML
- Unsupervised Learning/Clustering/Dimensionality Reduction Focused on simpiflying data. Typicaly conferences: Many (each with a somewhat different viewpoint)
- Applied Learning Worries about cost sensitive learning, what to do on very large datasets, applications, etc.. Typical conference: KDD
- Supervised Leanring Chief concern is making practical algorithms for simpler predictions. Many applications. Typical conference: ICML
Please comment on any missing pieces—it would be good to build up a better understanding of what are the focuses and where they are.
“Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: overrepresenting performance on particular datasets and (implicitly) overrepresenting performance of a method on future datasets.
We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid.
||Train a complex predictor on too-few examples.
- Hold out pristine examples for testing.
- Use a simpler predictor.
- Get more training examples.
- Integrate over many predictors.
- Reject papers which do this.
|Parameter tweak overfitting
||Use a learning algorithm with many parameters. Choose the parameters based on the test set performance.
||For example, choosing the features so as to optimize test set performance can achieve this.
||Same as above.
||Use a measure of performance which is especially brittle to overfitting.
||“entropy”, “mutual information”, and leave-one-out cross-validation are all surprisingly brittle. This is particularly severe when used in conjunction with another approach.
||Prefer less brittle measures of performance.
||Misuse statistics to overstate confidences.
||One common example is pretending that cross validation performance is drawn from an i.i.d. gaussian, then using standard confidence intervals. Cross validation errors are not independent. Another standard method is to make known-false assumptions about some system and then derive excessive confidence.
||Don’t do this. Reject papers which do this.
|Choice of measure
||Choose the best of Accuracy, error rate, (A)ROC, F1, percent improvement on the previous best, percent improvement of error rate, etc.. for your method. For bonus points, use ambiguous graphs.
||This is fairly common and tempting.
||Use canonical performance measures. For example, the performance measure directly motivated by the problem.
||Instead of (say) making a multiclass prediction, make a set of binary predictions, then compute the optimal multiclass prediction.
||Sometimes it’s tempting to leave a gap filled in by a human when you don’t otherwise succeed.
|| Reject papers which do this.
||Use a human as part of a learning algorithm and don’t take into account overfitting by the entire human/computer interaction.
||This is subtle and comes in many forms. One example is a human using a clustering algorithm (on training and test examples) to guide learning algorithm choice.
||Make sure test examples are not available to the human.
|Data set selection
||Chose to report results on some subset of datasets where your algorithm performs well.
||The reason why we test on natural datasets is because we believe there is some structure captured by the past problems that helps on future problems. Data set selection subverts this and is very difficult to detect.
||Use comparisons on standard datasets. Select datasets without using the test set. Good Contest performance can’t be faked this way.
||Alter the problem so that your performance improves.
||For example, take a time series dataset and use cross validation. Or, ignore asymmetric false positive/false negative costs. This can be completely unintentional, for example when someone uses an ill-specified UCI dataset.
||Discount papers which do this. Make sure problem specifications are clear.
||Create an algorithm for the purpose of improving performance on old datasets.
||After a dataset has been released, algorithms can be made to perform well on the dataset using a process of feedback design, indicating better performance than we might expect in the future. Some conferences have canonical datasets that have been used for a decade…
||Prefer simplicity in algorithm design. Weight newer datasets higher in consideration. Making test examples not publicly available for datasets slows the feedback design process but does not eliminate it.
|Overfitting by review
||10 people submit a paper to a conference. The one with the best result is accepted.
||This is a systemic problem which is very difficult to detect or eliminate. We want to prefer presentation of good results, but doing so can result in overfitting.
- Be more pessimistic of confidence statements by papers at high rejection rate conferences.
- Some people have advocated allowing the publishing of methods with poor performance. (I have doubts this would work.)
I have personally observed all of these methods in action, and there are doubtless others.
One way to organize learning theory is by assumption (in the assumption = axiom sense), from no assumptions to many assumptions. As you travel down this list, the statements become stronger, but the scope of applicability decreases.
- No assumptions
- Online learning There exist a meta prediction algorithm which compete well with the best element of any set of prediction algorithms.
- Universal Learning Using a “bias” of 2- description length of turing machine in learning is equivalent to all other computable biases up to some constant.
- Reductions The ability to predict well on classification problems is equivalent to the ability to predict well on many other learning problems.
- Independent and Identically Distributed (IID) Data
- Performance Prediction Based upon past performance, you can predict future performance.
- Uniform Convergence Performance prediction works even after choosing classifiers based on the data from large sets of classifiers.
- IID and partial constraints on the data source
- PAC Learning There exists fast algorithms for learning when all examples agree with some function in a function class (such as monomials, decision list, etc…)
- Weak Bayes The Bayes law learning algorithm will eventually reach the right solution as long as the right solution has a positive prior.
- Strong Constraints on the Data Source
- Bayes Learning When the data source is drawn from the prior, using Bayes law is optimal
This doesn’t include all forms of learning theory, because I do not know them all. If there are other bits you know of, please comment.
Let’s define a learning problem as making predictions given past data. There are several ways to attack the learning problem which seem to be equivalent to solving the learning problem.
- Find the Invariant This viewpoint says that learning is all about learning (or incorporating) transformations of objects that do not change the correct prediction. The best possible invariant is the one which says “all things of the same class are the same”. Finding this is equivalent to learning. This viewpoint is particularly common when working with image features.
- Feature Selection This viewpoint says that the way to learn is by finding the right features to input to a learning algorithm. The best feature is the one which is the class to predict. Finding this is equivalent to learning for all reasonable learning algorithms. This viewpoint is common in several applications of machine learning. See Gilad’s and Bianca’s comments.
- Find the Representation This is almost the same as feature selection, except internal to the learning algorithm rather than external. The key to learning is viewed as finding the best way to process the features in order to make predictions. The best representation is the one which processes the features to produce the correct prediction. This viewpoint is common for learning algorithm designers.
- Find the Right Kernel The key to learning is finding the “right” kernel. The optimal kernel is the one for which K(x, z)=1 when x and z have the same class and 0 otherwise. With the right kernel, an SVM(or SVM-like optimization process) can solve any learning problem. This viewpoint is common for people who work with SVMs.
- Find the Right Metric The key to learning is finding the right metric. The best metric is one which states that features with the same class label have distance 0 while features with different class labels have distance 1. With the best metric, the nearest neighbor algorithm can solve any problem.
Each of these viewpoints seems to be “right”, and each seems to have some utility in it’s context. It also seems important to realize that these are just different versions of the same problem. One consequence of this observation is that “wrapper methods” which try to automatically find a subset of features to feed into a learning algorithm in order to improve learning performance are simply trying to repair weaknesses in the learning algorithm.
All branches of machine learning seem to be united in the idea of using data to make predictions. However, people disagree to some extent about what this means. One way to categorize these different goals is on an axis, where one extreme is “tools to aid a human in using data to do prediction” and the other extreme is “tools to do prediction with no human intervention”. Here is my estimate of where various elements of machine learning fall on this spectrum.
||Human partially necessary
|Clustering, data visualization
||Bayesian Learning, Probabilistic Models, Graphical Models
||Kernel Learning (SVM’s, etc..)
The exact position of each element is of course debatable. My reasoning is that clustering and data visualization are nearly useless for prediction without a human in the loop. Bayesian/probabilistic models/graphical models generally require a human to sit and think about what is a good prior/structure. Kernel learning approaches have a few standard kernels which often work on simple problems, although sometimes significant kernel engineering is required. I’ve been impressed of late how ‘black box’ decision trees or boosted decision trees are. The goal of reinforcement learning (rather than perhaps the reality) is designing completely automated agents.
The position in this spectrum provides some idea of what the state of progress is. Things at the ‘human necessary’ end have been succesfully used by many people to solve many learning problems. At the ‘human unnecessary’ end, the systems are finicky and often just won’t work well.
I am most interested in the ‘human unnecessary’ end.