Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints.
- Functionally. Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l1 or l2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data.
- Statistical Learning Viewpoint Regularization is about the failiure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S. There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S, e(c,D) less than e(c,S) + f(complexity) where complexity is some measure of the size of a set of functions. Unfortunately, we have never convincingly nailed the exact value of f(). We can note that f() is always monotonically increasing with the complexity measure and so there exists a unique constant C such that f(complexity)=C*complexity at the value of complexity which minimizes the bound. Empirical parameter tuning such as for the C constant in a support vector machine can be regarded as searching for this “right” tradeoff.
- Computationally Regularization can be thought of as a computational shortcut to computing the f() above. Hence, smoothness, convexity, and other computational constraints are important issues.
One thing which should be clear is that there is no one best method of regularization for all problems. “What is a good regularizer for my problem?” is another “learning complete” question since solving it perfectly implies solving the learning problem (For example consider the “regularizer” which assigns complexity 0 to the best prediction function and infinity to all others). Similarly, “What is an empirically useful regularizer?” is like “What is a good learning algorithm?” The choice of regularizer used when solving empirical problems is a degree of freedom with which prior information and biases can be incorporated in order to improve performance.
Joel Predd mentioned “Antilearning” by Adam Kowalczyk, which is interesting from a foundational intuitions viewpoint.
There is a pervasive intuition that “nearby things tend to have the same label”. This intuition is instantiated in SVMs, nearest neighbor classifiers, decision trees, and neural networks. It turns out there are natural problems where this intuition is opposite of the truth.
One natural situation where this occurs is in competition. For example, when Intel fails to meet its earnings estimate, is this evidence that AMD is doing badly also? Or evidence that AMD is doing well?
This violation of the proximity intuition means that when the number of examples is few, negating a classifier which attempts to exploit proximity can provide predictive power (thus, the term “antilearning”).
This, again, is something of a research direction rather than a single problem.
There are several metrics people care about which depend upon the relative ranking of examples and there are sometimes good reasons to care about such metrics. Examples include AROC, “F1″, the proportion of the time that the top ranked element is in some class, the proportion of the top 10 examples in some class (google’s problem), the lowest ranked example of some class, and the “sort distance” from a predicted ranking to a correct ranking. See here for an example of some of these.
Problem What does the ability to classify well imply about performance under these metrics?
Past Work
- Probabilistic classification under squared error can be solved with a classifier. A counterexample shows this does not imply a good AROC.
- Sample complexity bounds for AROC (and here).
- A paper on “Learning to Order Things“.
Difficulty Several of these may be easy. Some of them may be hard.
Impact Positive or negative results will broaden our understanding of the relationship between different learning goals. It might also yield new algorithms (via the reduction) for solving these problems.
Makc asked a good question in comments—”Why bother to make a paper, at all?” There are several reasons for writing papers which may not be immediately obvious to people not in academia.
The basic idea is that papers have considerably more utility than the obvious “present an idea”.
- Papers are a formalized units of work. Academics (especially young ones) are often judged on the number of papers they produce.
- Papers have a formalized method of citing and crediting other—the bibliography. Academics (especially older ones) are often judged on the number of citations they receive.
- Papers enable a “more fair” anonymous review. Conferences receive many papers, from which a subset are selected. Discussion forums are inherently not anonymous for anyone who wants to build a reputation for good work.
- Papers are an excuse to meet your friends. Papers are the content of conferences, but much of what you do is talk to friends about interesting problems while there. Sometimes you even solve them.
- Papers are an excuse to get a large number of smart people in the same room and think about the same topic.
- Good papers are easy to read. In particular, they are much easier to read (and understand) then a long discussion thread. They are even easy to read in several decades. (Writing good papers is hard)
All of the above are reasons why writing papers is a good idea. It’s also important to understand that academia is a large system and large systems have a lot of inertia. This means switching from paper writing to some other method of doing research won’t happen unless the other method is significantly more effective, and even then there will be a lot of inertia.
Also note: the “similar sites” link to the right is to other discussion forums, etc…
Despite my best intentions, this is not a fully specified problem, but rather a research direction.
Competitive online learning is one of the more compelling pieces of learning theory because typical statements of the form “this algorithm will perform almost as well as a large set of other algorithms” rely only on fully-observable quantities, and are therefore applicable in many situations. Examples include Winnow, Weighted Majority, and Binomial Weighting. Algorithms with this property haven’t taken over the world yet. Here might be some reasons:
- Lack of caring. Many people working on learning theory don’t care about particular applications much. This means constants in the algorithm are not optimized, usable code is often not produced, and empirical studies aren’t done.
- Inefficiency. Viewed from the perspective of other learning algorithms, online learning is terribly inefficient. It requires that every hypothesis (called an expert in the online learning setting) be enumerated and tested on every example. (This is similar to the inefficiency of using Bayes law as an algorithm directly, and there are strong similarities in the algorithms.)
For an example of (1), there is a mysterious factor of 2 in the Binomial Weighting algorithm which has not been fully resolved. Some succesful applications also exist such as those based on SNoW.
The way to combat problem (2) is to introduce more structure into the hypothesis/experts. Some efforts have already been made in this direction. For example, it’s generally feasible to introduce an arbitrary bias or “prior” over the experts in the form of some probability distribution, and perform well with respect to that bias. Another nice piece of work by Adam Kalai and Santosh Vempala discusses how to efficiently handle several forms of structured experts. At an intuitive level, further development discovering how to efficiently work with new forms of structure might payoff well.
The basic research direction here is to address the gap between theory and practice, possibly by solving the above problems and possibly by discovering and addressing other problems.
I realized that the tools needed to solve the problem just posted were just created. I tried to sketch out the solution here (also in .lyx and .tex). It is still quite sketchy (and probably only the few people who understand reductions well can follow).
One of the reasons why I started this weblog was to experiment with “research in the open”, and this is an opportunity to do so. Over the next few days, I’ll be filling in details and trying to get things to make sense. If you have additions or ideas, please propose them.
At an intuitive level, the question here is “Can reinforcement learning be solved with classification?”
Problem Construct a reinforcement learning algorithm with near-optimal expected sum of rewards in the direct experience model given access to a classifier learning algorithm which has a small error rate or regret on all posed classification problems. The definition of “posed” here is slightly murky. I consider a problem “posed” if there is an algorithm for constructing labeled classification examples.
Past Work
- There exists a reduction of reinforcement learning to classification given a generative model. A generative model is an inherently stronger assumption than the direct experience model.
- Other work on learning reductions may be important.
- Several algorithms for solving reinforcement learning in the direct experience model exist. Most, such as E3, Factored-E3, and metric-E3 and Rmax require that the observation be the state. Recent work extends this approach to POMDPs.
- This problem is related to predictive state representations, because we are trying to solve reinforcement learning with prediction ability.
Difficulty It is not clear whether this problem is solvable or not. A proof that it is not solvable would be extremely interesting, and even partial success one way or another could be important.
Impact At the theoretical level, it would be very nice to know if the ability to generalize implies the ability to solve reinforcement learning because (in a general sense) all problems can be cast as reinforcement learning. Even if the solution is exponential in the horizon time it can only motivate relaxations of the core algorithm like RLgen.
The essential problem here is the large gap between experimental observation and theoretical understanding.
Method K-fold cross validation is a commonly used technique which takes a set of m examples and partitions them into K sets (”folds”) of size m/K. For each fold, a classifier is trained on the other folds and then test on the fold.
Problem Assume only independent samples. Derive a classifier from the K classifiers with a small bound on the true error rate.
Past Work (I’ll add more as I remember/learn.)
- Devroye, Rogers, and Wagner analyzed cross validation and found algorithm specific bounds. Not all of this is online, but here is one paper.
- Michael Kearns and Dana Ron analyzed cross validation and found that under additional stability assumptions the bound for the classifier which learns on all the data is not much worse than for a test set of size m/K.
- Avrim Blum, Adam Kalai, and myself analyzed cross validation and found that you can do at least as well as a test set of size m/K with no additional assumptions using the randomized classifier which draws uniformly from the set of size K.
- Yoshua Bengio and Yves Grandvalet analyzed cross validation and concluded that there was no unbiased estimator of variance.
- Matti Kääriäinen noted that you can safely derandomize a stochastic classifier (such as one that randomizes over the K folds) using unlabeled data without additional assumptions.
Some Extreme Cases to Sharpen Intuition
- Suppose on every fold the learned classifier is the same. Then, the cross-validation error should behave something like a test set of size m. This is radically superior to a test set of size m/K.
- Consider leave-one-out cross validation. Suppose we have a “learning” algorithm that uses the classification rule “always predict the parity of the labels on the training set”. Suppose the learning problem is defined by a distribution which picks y=1 with probability 0.5. Then, with probability 0.5, all leave-one-out errors will be 0 and otherwise 1 (like a single coin flip).
(some discussion is here)
Difficulty I consider this a hard problem. I’ve worked on it without success and it’s an obvious problem (due to the pervasive use of cross validation) that I suspect other people have considered. Analyzing the dependency structure of cross validation is quite difficult.
Impact On any individual problem, solving this might have only have a small impact due to slightly improved judgement of success. But, because cross validation is used extensively, the overall impact of a good solution might be very significant.
This is near the one month point, so it seems appropriate to consider meta-issues for the moment.
The number of posts is a bit over 20.
The number of people speaking up in discussions is about 10.
The number of people viewing the site is somewhat more than 100.
I am (naturally) dissatisfied with many things.
- Many of the potential uses haven’t been realized. This is partly a matter of opportunity (no conferences in the last month), partly a matter of will (no open problems because it’s hard to give them up), and partly a matter of tradition. In academia, there is a strong tradition of trying to get everything perfectly right before presentation. This is somewhat contradictory to the nature of making many posts, and it’s definitely contradictory to the idea of doing “public research”. If that sort of idea is to pay off, it must be significantly more succesful than previous methods. In an effort to continue experimenting, I’m going to use the next week as “open problems week”.
- Spam is a problem. Wordpress allows you to block specific posts by match, but there seems to be some minor bug (or maybe a misuse) in how it matches. This resulted in everything being blocked pending approval, which is highly unnatural for any conversation. I approved all posts by real people, and I think the ‘everything blocked pending approval’ problem has been solved. A site discussing learning ought to have a better system for coping with what is spam and what is not. Today’s problem is to solve the spam problem with learning techniques. (It’s not clear this is research instead of just engineering, but it is clear that it would be very valuable here and in many other places.)
- Information organization is a problem. This comes up in many ways. Threading would be helpful in comments because it would help localize discussion to particular contexts. Tagging of posts with categories seems inadequate because it’s hard to anticipate all the ways something might be thought about. Accessing old posts via “archives” is cumbersome. Idealy, the sequence of posts would create a well-organized virtual site. In many cases there are very good comments and it seems altering the post to summarize the comments is appropriate, but doing so leaves the comments out of context. Some mechanism of refinement which avoids this problem would be great. Many comments develop into something that should (essentially) be their own post on a new topic. Doing so is currently cumbersome, and a mechanism for making that shift would be helpful.
- Time commitment is a problem. Making a stream of good posts is hard and takes awhile. Naturally, some were (and even still are) stored up, but that store is finite, and eventually will be exhausted. Since I’m unwilling to compromise quality, this means the rate of posts may eventually fall. The obvious solution to this is to invite other posters. (Which I have with Alex Gray and Drew Bagnell.) Consider yourself invited. Email me (jl@hunch.net) with anything you consider worth posting.
It’s not all bad and I plan to continue experimenting. Several of the discussions have been quite interesting, and I often find that the process of writing posts helps clarify my understanding. Creating a little bit more understanding seems like it is worthwhile.
Yaroslav collected an extensive list of machine learning reading groups.
This is a paper by Yann LeCun and Fu Jie Huang published at AISTAT 2005. I found this paper very difficult to read, but it does have some point about a computational shortcut.
This paper takes for granted that the method of solving a problem is gradient descent on parameters. Given this assumption, the question arises: Do you want to do gradient descent on a probabilistic model or something else?
All (conditional) probabilistic models have the form p(y|x) = f(x,y)/Z(x) where Z(x) = sumy f(x,y) (the paper calls - log f(x,y) an “energy”). If f is parameterized by some w, the gradient has a term for Z(x), and hence for every value of y. The paper claims, that such models can be optimized for classification purposes using only the correct y and the other y’ not y which maximizes f(x,y). This can even be done on unnormalizable models. The paper further claims that this can be done with an approximate maximum. These claims are plausible based on experimental results and intuition.
It wouldn’t surprise me to learn that ignoring Z(x) (and renormalizing later) is common in fast implementations of some probabilistic model fitting algorithms, but I haven’t seen this previously discussed. Ability to use an approximate maximum y’ seems potentially very useful.
With that encouragement, I had significant difficulties with the paper, including the following:
- Lack of a theorem. A good theorem proving these things work would be quite helpful. It isn’t clear whether the claims are always true, just true on the examples encountered, or true with some small modification.
- Definition of Loss. For better or worse, the paper uses the second definition of loss, “Loss is part of the solution”, which I find unnatural.
- Claims I don’t understand or which aren’t technically true. None of these seem to be related to the main point of the paper, but they are very distracting. For example, there is a claim that log-loss is the “only well-justified loss function”. The meaning of well-justified is unclear, and I can think of several meanings where other losses (such as squared error) are well-justified.
With the above difficulties, this paper seems lucky to have been accepted. This isn’t a criticism of AISTAT because it also seems plausible that this computational shortcut may eventually help many optimizations.
I want to try to describe what doing research means, especially from the point of view of an undergraduate. The shift from a class-taking mentality to a research mentality is very significant and not easy.
- Problem Posing Posing the right problem is often as important as solving them. Many people can get by in research by solving problems others have posed, but that’s not sufficient for really inspiring research. For learning in particular, there is a strong feeling that we just haven’t figured out which questions are the right ones to ask. You can see this, because the answers we have do not seem convincing.
- Gambling your life When you do research, you think very hard about new ways of solving problems, new problems, and new solutions. Many conversations are of the form “I wonder what would happen if…” These processes can be short (days or weeks) or years-long endeavours. The worst part is that you’ll only know if you were succesful at the end of the process (and sometimes not even then because it can take a long time for good research to be recognized). This is very risky compared to most forms of work (or just going to classes).
- Concentration This is not so different from solving problems in class, except that you may need to concentrate on a problem for much longer to solve it. This often means shutting yourself off from the world (no TV, no interruptions, no web browsing, etc…) and really thinking.
- Lack of feedback While doing research there is often a lack of feedback or contradicting feedback. The processing of writing a paper can take a month, you may not get reviews for several months, and the review process can be extremely (and sometimes systematically) noisy.
- Curiousity This is not merely idle curiousity. A desire to understand things from different viewpoints, to understand that niggling detail which isn’t right, and to understand the global picture of the way things are. This often implies questioning the basics.
- Honesty Good Researchers have to understand the way things are (at least with respect to research). Learning to admit when you are wrong can be very hard.
- Prioritization You have many things to do and not enough time to do them in. The need to prioritize generally becomes common, but it’s not so common in undergrad life. This often means saying ‘no’ when you want to say ‘yes’.
- Memory Problems often aren’t solved in the first pass. Conversations from a year ago often contain the key to solving today’s problem. A good suite of problem-solving methods and a global understanding of how things fit together are often essential.
- Ephemeral Contact The set of people who you know and work with may only be talked with for a few brief but intense hours a year at a conference.
- Opportunism Possibilities come up. They must be recognized (which is hard for conservative people) and seized (which is hard for people without enough confidence to gamble).
Not all of these traits are necessary to do good research—some of them can be compensated for and others can be learned. Many parts of academia can be understood as helping to reduce some of these difficulties. For example, teaching reduces the extreme variance of gambling on research output. Tenure provides people a stable base from which they can take greater gambles (… and often results in people doing nothing). Conferences are partly succesful because they provide much more feedback than journals (which are generally slower). Weblogs might, in the future, provide even faster feedback. Many people are quite succesful simply solving problems that others pose.
This is an attempt to organize the broad research programs related to machine learning currently underway. This isn’t easy—this map is partial, the categories often overlap, and there are many details left out. Nevertheless, it is (perhaps) helpful to have some map of what is happening where. The word ‘typical’ should not be construed narrowly here.
- Learning Theory Focuses on analyzing mathematical models of learning, essentially no experiments. Typical conference: COLT.
- Bayesian Learning Bayes law is always used. Focus on methods of speeding up or approximating integration, new probabilistic models, and practical applications. Typical conferences: NIPS,UAI
- Structured learning Predicting complex structured outputs, some applications. Typiical conferences: NIPS, UAI, others
- Reinforcement Learning Focused on ‘agent-in-the-world’ learning problems where the goal is optimizing reward. Typical conferences: ICML
- Unsupervised Learning/Clustering/Dimensionality Reduction Focused on simpiflying data. Typicaly conferences: Many (each with a somewhat different viewpoint)
- Applied Learning Worries about cost sensitive learning, what to do on very large datasets, applications, etc.. Typical conference: KDD
- Supervised Leanring Chief concern is making practical algorithms for simpler predictions. Many applications. Typical conference: ICML
Please comment on any missing pieces—it would be good to build up a better understanding of what are the focuses and where they are.
Luis von Ahn has been running the espgame for awhile now. The espgame provides a picture to two randomly paired people across the web, and asks them to agree on a label. It hasn’t managed to label the web yet, but it has produced a large dataset of (image, label) pairs. I organized the dataset so you could explore the implied bipartite graph (requires much bandwidth).
Relative to other image datasets, this one is quite large—67000 images, 358,000 labels (average of 5/image with variation from 1 to 19), and 22,000 unique labels (one every 3 images). The dataset is also very ‘natural’, consisting of images spidered from the internet. The multiple label characteristic is intriguing because ‘learning to learn’ and metalearning techniques may be applicable. The ‘natural’ quality means that this dataset varies greatly in difficulty from easy (predicting “red”) to hard (predicting “funny”) and potentially more rewarding to tackle.
The open problem here is, of course, to make an internet image labeling program. At a minimum this might be useful for blind people and image search. Solving this problem well seems likely to require new learning methods.
“Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: overrepresenting performance on particular datasets and (implicitly) overrepresenting performance of a method on future datasets.
We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid.
| Name |
Method |
Explanation |
Remedy |
| Traditional overfitting |
Train a complex predictor on too-few examples. |
|
- Hold out pristine examples for testing.
- Use a simpler predictor.
- Get more training examples.
- Integrate over many predictors.
- Reject papers which do this.
|
| Parameter tweak overfitting |
Use a learning algorithm with many parameters. Choose the parameters based on the test set performance. |
For example, choosing the features so as to optimize test set performance can achieve this. |
Same as above. |
| Brittle measure |
Use a measure of performance which is especially brittle to overfitting. |
“entropy”, “mutual information”, and leave-one-out cross-validation are all surprisingly brittle. This is particularly severe when used in conjunction with another approach. |
Prefer less brittle measures of performance. |
| Bad statistics |
Misuse statistics to overstate confidences. |
One common example is pretending that cross validation performance is drawn from an i.i.d. gaussian, then using standard confidence intervals. Cross validation errors are not independent. Another standard method is to make known-false assumptions about some system and then derive excessive confidence. |
Don’t do this. Reject papers which do this. |
| Choice of measure |
Choose the best of Accuracy, error rate, (A)ROC, F1, percent improvement on the previous best, percent improvement of error rate, etc.. for your method. For bonus points, use ambiguous graphs. |
This is fairly common and tempting. |
Use canonical performance measures. For example, the performance measure directly motivated by the problem. |
| Incomplete Prediction |
Instead of (say) making a multiclass prediction, make a set of binary predictions, then compute the optimal multiclass prediction. |
Sometimes it’s tempting to leave a gap filled in by a human when you don’t otherwise succeed. |
Reject papers which do this. |
| Human-loop overfitting. |
Use a human as part of a learning algorithm and don’t take into account overfitting by the entire human/computer interaction. |
This is subtle and comes in many forms. One example is a human using a clustering algorithm (on training and test examples) to guide learning algorithm choice. |
Make sure test examples are not available to the human. |
| Data set selection |
Chose to report results on some subset of datasets where your algorithm performs well. |
The reason why we test on natural datasets is because we believe there is some structure captured by the past problems that helps on future problems. Data set selection subverts this and is very difficult to detect. |
Use comparisons on standard datasets. Select datasets without using the test set. Good Contest performance can’t be faked this way. |
| Reprobleming |
Alter the problem so that your performance improves. |
For example, take a time series dataset and use cross validation. Or, ignore asymmetric false positive/false negative costs. This can be completely unintentional, for example when someone uses an ill-specified UCI dataset. |
Discount papers which do this. Make sure problem specifications are clear. |
| Old datasets |
Create an algorithm for the purpose of improving performance on old datasets. |
After a dataset has been released, algorithms can be made to perform well on the dataset using a process of feedback design, indicating better performance than we might expect in the future. Some conferences have canonical datasets that have been used for a decade… |
Prefer simplicity in algorithm design. Weight newer datasets higher in consideration. Making test examples not publicly available for datasets slows the feedback design process but does not eliminate it. |
| Overfitting by review |
10 people submit a paper to a conference. The one with the best result is accepted. |
This is a systemic problem which is very difficult to detect or eliminate. We want to prefer presentation of good results, but doing so can result in overfitting. |
- Be more pessimistic of confidence statements by papers at high rejection rate conferences.
- Some people have advocated allowing the publishing of methods with poor performance. (I have doubts this would work.)
|
I have personally observed all of these methods in action, and there are doubtless others.
Foster Provost and I discussed the merits of ROC curves vs. accuracy estimation. Here is a quick summary of our discussion.
The “Receiver Operating Characteristic” (ROC) curve is an alternative to accuracy for the evaluation of learning algorithms on natural datasets. The ROC curve is a curve and not a single number statistic. In particular, this means that the comparison of two algorithms on a dataset does not always produce an obvious order.
Accuracy (= 1 – error rate) is a standard method used to evaluate learning algorithms. It is a single-number summary of performance.
AROC is the area under the ROC curve. It is a single number summary of performance.
The comparison of these metrics is a subtle affair, because in machine learning, they are compared on different natural datasets. This makes some sense if we accept the hypothesis “Performance on past learning problems (roughly) predicts performance on future learning problems.”
The ROC vs. accuracy discussion is often conflated with “is the goal classification or ranking?” because ROC curve construction requires a ranking be produced. Here, we assume the goal is classification rather than ranking. (There are several natural problems where ranking of instances is much preferred to classification. In addition, there are several natural problems where classification is the goal.)
| Arguments for ROC |
Explanation |
| Ill-specification |
The costs of choices are not well specified. The training examples are often not drawn from the same marginal distribution as the test examples. ROC curves allow for an effective comparison over a range of different choice costs and marginal distributions. |
| Ill-dominance |
Standard classification algorithms do not have a dominance structure as the costs vary. We should not say “algorithm A is better than algorithm B” when you don’t know the choice costs well enough to be sure. |
| Just-in-Time use |
Any system with a good ROC curve can easily be designed with a ‘knob’ that controls the rate of false positives vs. false negatives. |
AROC inherits the arguments of ROC except for Ill-dominance.
| Arguments for AROC |
Explanation |
| Summarization |
Humans don’t have the time to understand the complexities of a conditional comparison, so having a single number instead of a curve is valuable. |
| Robustness |
Algorithms with a large AROC are robust against a variation in costs. |
Accuracy is the traditional approach.
| Arguments for Accuracy |
Explanation |
| Summarization |
As for AROC. |
| Intuitiveness |
People understand immediately what accuracy means. Unlike (A)ROC, it’s obvious what happens when one additional example is classified wrong. |
| Statistical Stability |
The basic test set bound shows that accuracy is stable subject to only the IID assumption. For AROC (and ROC) this is only true when the number in each class is not near zero. |
| Minimality |
In the end, a classifier makes classification decisions. Accuracy directly measures this while (A)ROC comprimises this measure with hypothetical alternate choice costs. For the same reason, computing (A)ROC may require significantly more work than solving the problem. |
| Generality |
Accuracy generalizes easily to multiclass accuracy, importance weighted accuracy, and general (per-example) cost sensitive classification. ROC curves become problematic when there are just 3 classes. |
The Just-in-Time argument seems to be the strongest for (A)ROC. One way to rephrase this argument is “Lack of knowledge of relative costs means that classifiers should be rankers so false positive to false negative ratios can be easily altered.” In other words, this is an argument for “ranking instead of classification” rather than “(A)ROC instead of Accuracy”.
| Conference |
Locate |
Date |
| COLT |
Bertinoro, Italy |
June 27-30 |
| AAAI |
Pittsburgh, PA, USA |
July 9-13 |
| UAI |
Edinburgh, Scotland |
July 26-29 |
| IJCAI |
Edinburgh, Scotland |
July 30 – August 5 |
| ICML |
Bonn, Germany |
August 7-11 |
| KDD |
Chicago, IL, USA |
August 21-24 |
The big winner this year is Europe. This is partly a coincidence, and partly due to the general internationalization of science over the last few years. With cuts to basic science in the US and increased hassle for visitors, conferences outside the US become more attractive. Europe and Australia/New Zealand are the immediate winners because they have the science, infrastructure, and english in place. China and India are possible future winners.
Since learning is far from an exact science, it’s good to pay attention to basic intuitions of applied learning. Here are a few I’ve collected.
- Integration In Bayesian learning, the posterior is computed by an integral, and the optimal thing to do is to predict according to this integral. This phenomena seems to be far more general. Bagging, Boosting, SVMs, and Neural Networks all take advantage of this idea to some extent. The phenomena is more general: you can average over many different classification predictors to improve performance. Sources: Zoubin, Caruana
- Differentiation Different pieces of an average should differentiate to achieve good performance by different methods. This is know as the ’symmetry breaking’ problem for neural networks, and it’s why weights are initialized randomly. Boosting explicitly attempts to achieve good differentiation by creating new, different, learning problems. Sources: Yann LeCun, Phil Long
- Deep Representation Having a deep representation is necessary for having a good general learner. Decision Trees and Convolutional neural networks take advantage of this. SVMs get around it by allowing the user to engineer knowledge into the kernel. Boosting and Bagging rely on another algorithm for this. Sources: Yann LeCun
- Fine Representation of Bias Many learning theory applications use just a coarse representation of bias such as “function in the hypothesis class or not”. In practice, significantly better performance is achieved from a more finely tuned bias. Bayesian learning has this builtin with a prior. Other techniques can take advantage of this as well. Sources: Zoubin, personal experience.
If you have others, please comment on them.
Yaroslav Bulatov collects some links to other technical blogs.
What? Reductions are machines which turn solvers for one problem into solvers for another problem.
Why? Reductions are useful for several reasons.
- Laziness. Reducing a problem to classification make at least 10 learning algorithms available to solve a problem. Inventing 10 learning algorithms is quite a bit of work. Similarly, programming a reduction is often trivial, while programming a learning algorithm is a great deal of work.
- Crystallization. The problems we often want to solve in learning are worst-case-impossible, but average case feasible. By reducing all problems onto one or a few primitives, we can fine tune these primitives to perform well on real-world problems with greater precision due to the greater number of problems to validate on.
- Theoretical Organization. By studying what reductions are easy vs. hard vs. impossible, we can learn which problems are roughly equivalent in difficulty and which are much harder.
What we know now.
Typesafe reductions. In the beginning, there was the observation that every complex object on a computer can be written as a sequence of bits. This observation leads to the notion that a classifier (which predicts a single bit) can be used to predict any complex object. Using this observation, we can make the following statements:
- Any prediction problem which can be broken into examples can be solved with a classifier.
- In particular, reinforcement learning can be decomposed into examples given a generative model (see Lagoudakis & Parr and Fern, Yoon, & Givan).
This observation also often doesn’t work well in practice, because the classifiers are sometimes wrong, so one of many classifiers are often wrong.
Error Transform Reductions. Worrying about errors leads to the notion of robust reductions (= ways of using simple predictors such as classifiers to make complex predictions). Error correcting output codes were proposed in analogy to coding theory. These were analyzed in terms of error rates on training sets and general losses on training sets. The robustness can be (more generally) analyzed with respect to arbitrary test distributions, and algorithms optimized with respect to this notion are often very simple and yield good performance. Solving created classification problems up to error rate e implies:
- Solving importance weighed classifications up to error rate eN where N is the expected importance. Costing
- Solving multiclass classification up to error rate 4e using ECOC. Error limiting reductions paper
- Solving Cost sensitive classification up to loss 2eZ where Z is the sum of costs. Weighted All Pairs algorithm
- Finding a policy within expected reward (T+1)e/2 of the optimal policy for T step reinforcement learning with a generative model. RLgen paper
- The same statement holds much more efficiently when the distribution of states of a near optimal policy is also known. PSDP paper
A new problem arises: sometimes the subproblems created are inherently hard, for example when estimating class probability from a classifier. In this situation saying “good performance implies good performance” is vacuous.
Regret Transform Reductions To cope with this, we can analyze how good performance minus the best possible performance (called “regret”) is transformed under reduction. Solving created binary classification problems to regret r implies:
- Solving importance weighted regret up to r N using the same algorithm as for errors. Costing
- Solving class membership probability up to l2 regret 2r. Probing paper
- Solving multiclass classification to regret 4 r0.5. SECOC paper
- Predicting costs in cost sensitive classification up to l2 regret 4r SECOC again
- Solving cost sensitive classification up to regret 4(r Z)0.5 where Z is the sum of the costs of each choice. SECOC again
There are several reduction-related problems currently being worked on which I’ll discuss in the future.