Learning Research Programs

This is an attempt to organize the broad research programs related to machine learning currently underway. This isn’t easy: the map is partial, the categories often overlap, and many details are left out. Nevertheless, it is (perhaps) helpful to have some map of what is happening where. The word ‘typical’ should not be construed narrowly here.

  1. Learning Theory: Focuses on analyzing mathematical models of learning, with essentially no experiments. Typical conference: COLT.
  2. Bayesian Learning: Bayes’ law is always used. Focus is on methods of speeding up or approximating integration, new probabilistic models, and practical applications. Typical conferences: NIPS, UAI.
  3. Structured Learning: Predicting complex structured outputs, with some applications. Typical conferences: NIPS, UAI, others.
  4. Reinforcement Learning: Focuses on ‘agent-in-the-world’ learning problems where the goal is optimizing reward. Typical conference: ICML.
  5. Unsupervised Learning/Clustering/Dimensionality Reduction: Focuses on simplifying data. Typical conferences: many (each with a somewhat different viewpoint).
  6. Applied Learning: Concerned with cost-sensitive learning, what to do on very large datasets, applications, etc. Typical conference: KDD.
  7. Supervised Learning: Chief concern is making practical algorithms for simpler predictions. Many applications. Typical conference: ICML.

Please comment on any missing pieces; it would be good to build up a better understanding of what the focuses are and where the work is happening.

ESPgame and image labeling

Luis von Ahn has been running the ESPgame for a while now. The ESPgame presents a picture to two randomly paired people across the web and asks them to agree on a label. It hasn’t managed to label the web yet, but it has produced a large dataset of (image, label) pairs. I organized the dataset so you could explore the implied bipartite graph (requires much bandwidth).

Relative to other image datasets, this one is quite large: 67,000 images, 358,000 labels (an average of about 5 per image, with variation from 1 to 19), and 22,000 unique labels (roughly one unique label per 3 images). The dataset is also very ‘natural’, consisting of images spidered from the internet. The multiple-label characteristic is intriguing because ‘learning to learn’ and metalearning techniques may be applicable. The ‘natural’ quality means that this dataset varies greatly in difficulty, from easy (predicting “red”) to hard (predicting “funny”), and is potentially more rewarding to tackle.
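
As a rough illustration of what working with this data involves, here is a minimal sketch (in Python) that organizes (image, label) pairs into the implied bipartite graph and recomputes summary statistics like those above. The file name and tab-separated format are assumptions made for illustration; they are not the actual distribution format.

    # Sketch: build the image/label bipartite graph from (image, label) pairs.
    # "espgame_pairs.tsv" and its "image<TAB>label" format are hypothetical.
    from collections import defaultdict

    labels_per_image = defaultdict(set)   # image -> set of labels
    images_per_label = defaultdict(set)   # label -> set of images

    with open("espgame_pairs.tsv") as f:
        for line in f:
            if not line.strip():
                continue
            image, label = line.rstrip("\n").split("\t")
            labels_per_image[image].add(label)
            images_per_label[label].add(image)

    num_images = len(labels_per_image)
    num_pairs = sum(len(ls) for ls in labels_per_image.values())
    num_unique_labels = len(images_per_label)

    print(f"{num_images} images, {num_pairs} (image, label) pairs, "
          f"{num_unique_labels} unique labels")
    print(f"labels per image: mean {num_pairs / num_images:.1f}, "
          f"min {min(len(ls) for ls in labels_per_image.values())}, "
          f"max {max(len(ls) for ls in labels_per_image.values())}")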

The open problem here is, of course, to make an internet image labeling program. At a minimum this might be useful for blind people and image search. Solving this problem well seems likely to require new learning methods.

Clever Methods of Overfitting

“Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: over-representing performance on particular datasets and (implicitly) over-representing the performance of a method on future datasets.

We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reprobleming” and “old datasets”, and may have participated in “overfitting by review”; some of these are very difficult to avoid.

Each entry below names a method of overfitting, explains it, and gives a remedy.

Traditional overfitting: Train a complex predictor on too-few examples. Remedies:
  1. Hold out pristine examples for testing.
  2. Use a simpler predictor.
  3. Get more training examples.
  4. Integrate over many predictors.
  5. Reject papers which do this.
Parameter tweak overfitting: Use a learning algorithm with many parameters and choose the parameters based on test set performance. For example, choosing the features so as to optimize test set performance can achieve this. Remedy: same as above (see also the sketch after this list).
Brittle measure: Use a measure of performance which is especially brittle to overfitting. “Entropy”, “mutual information”, and leave-one-out cross-validation are all surprisingly brittle; this is particularly severe when used in conjunction with another approach. Remedy: prefer less brittle measures of performance.
Bad statistics: Misuse statistics to overstate confidence. One common example is pretending that cross-validation performance is drawn from an i.i.d. Gaussian and then using standard confidence intervals; cross-validation errors are not independent. Another standard method is to make known-false assumptions about some system and then derive excessive confidence. Remedy: don’t do this, and reject papers which do.
Choice of measure: Choose whichever of accuracy, error rate, (A)ROC, F1, percent improvement on the previous best, percent improvement of error rate, etc. favors your method. For bonus points, use ambiguous graphs. This is fairly common and tempting. Remedy: use canonical performance measures, for example the performance measure directly motivated by the problem.
Incomplete prediction: Instead of (say) making a multiclass prediction, make a set of binary predictions and then compute the optimal multiclass prediction. Sometimes it’s tempting to leave a gap filled in by a human when you don’t otherwise succeed. Remedy: reject papers which do this.
Human-loop overfitting: Use a human as part of a learning algorithm and don’t take into account overfitting by the entire human/computer interaction. This is subtle and comes in many forms; one example is a human using a clustering algorithm (on training and test examples) to guide the choice of learning algorithm. Remedy: make sure test examples are not available to the human.
Data set selection: Choose to report results only on the subset of datasets where your algorithm performs well. The reason we test on natural datasets is that we believe there is some structure captured by past problems that helps on future problems; dataset selection subverts this and is very difficult to detect. Remedy: use comparisons on standard datasets, and select datasets without using the test set. Good contest performance can’t be faked this way.
Reprobleming: Alter the problem so that your performance improves. For example, take a time series dataset and use cross-validation, or ignore asymmetric false positive/false negative costs. This can be completely unintentional, for example when someone uses an ill-specified UCI dataset. Remedy: discount papers which do this, and make sure problem specifications are clear.
Old datasets: Create an algorithm for the purpose of improving performance on old datasets. After a dataset has been released, algorithms can be made to perform well on it by a process of feedback design, indicating better performance than we might expect in the future. Some conferences have canonical datasets that have been used for a decade… Remedy: prefer simplicity in algorithm design, weight newer datasets more heavily, and note that keeping test examples private slows the feedback design process but does not eliminate it.
Overfitting by review: Ten people submit a paper to a conference; the one with the best result is accepted. This is a systemic problem which is very difficult to detect or eliminate. We want to prefer presentation of good results, but doing so can result in overfitting. Remedies:
  1. Be more pessimistic about confidence statements in papers at high-rejection-rate conferences.
  2. Some people have advocated allowing the publication of methods with poor performance. (I have doubts this would work.)
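
As a concrete illustration of the “hold out pristine examples” remedy and the parameter tweak entry above, here is a minimal sketch (in Python) of choosing a parameter on a validation split while touching a pristine test set exactly once. The train and error_rate functions are hypothetical stubs standing in for a real learner and its evaluation.

    import random

    def train(examples, c):
        """Hypothetical learner: here just a stub that records the parameter."""
        return {"c": c}

    def error_rate(model, examples):
        """Hypothetical evaluation: a random stand-in for a real error estimate."""
        return random.random()

    data = [("x%d" % i, i % 2) for i in range(1000)]   # placeholder labeled examples
    random.shuffle(data)

    test = data[:200]                  # pristine test set: never used for any choice
    train_pool = data[200:]
    validation = train_pool[:200]      # used only to pick the parameter
    train_set = train_pool[200:]

    candidates = [0.01, 0.1, 1.0, 10.0]
    best_c = min(candidates,
                 key=lambda c: error_rate(train(train_set, c), validation))

    final_model = train(train_set + validation, best_c)
    print("reported error:", error_rate(final_model, test))   # computed exactly once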

I have personally observed all of these methods in action, and there are doubtless others.

Edit: a repost on kdnuggets.

ROC vs. Accuracy vs. AROC

Foster Provost and I discussed the merits of ROC curves vs. accuracy estimation. Here is a quick summary of our discussion.

The “Receiver Operating Characteristic” (ROC) curve is an alternative to accuracy for the evaluation of learning algorithms on natural datasets. The ROC curve is a curve rather than a single-number statistic. In particular, this means that the comparison of two algorithms on a dataset does not always produce an obvious order.

Accuracy (= 1 – error rate) is a standard method used to evaluate learning algorithms. It is a single-number summary of performance.

AROC is the area under the ROC curve. It is a single-number summary of performance.

The comparison of these metrics is a subtle affair, because in machine learning, they are compared on different natural datasets. This makes some sense if we accept the hypothesis “Performance on past learning problems (roughly) predicts performance on future learning problems.”

The ROC vs. accuracy discussion is often conflated with “is the goal classification or ranking?” because ROC curve construction requires a ranking be produced. Here, we assume the goal is classification rather than ranking. (There are several natural problems where ranking of instances is much preferred to classification. In addition, there are several natural problems where classification is the goal.)
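
Since ROC construction is exactly where a ranking enters, here is a minimal sketch (in Python) of how an ROC curve and AROC might be computed from real-valued scores. The scores and labels are made-up illustrative values, and ties in scores are ignored for simplicity (a careful implementation would group tied scores).

    def roc_curve(labels, scores):
        """Return (false positive rate, true positive rate) points, one per threshold."""
        pairs = sorted(zip(scores, labels), reverse=True)   # rank by decreasing score
        P = sum(labels)                # number of positives (labels are 0/1)
        N = len(labels) - P            # number of negatives
        tp = fp = 0
        points = [(0.0, 0.0)]
        for score, label in pairs:     # lower the threshold one example at a time
            if label == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / N, tp / P))
        return points

    def aroc(points):
        """Area under the ROC curve by the trapezoid rule."""
        area = 0.0
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            area += (x1 - x0) * (y0 + y1) / 2
        return area

    labels = [1, 0, 1, 1, 0, 0, 1, 0]          # 1 = positive class (illustrative)
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
    print("AROC =", aroc(roc_curve(labels, scores)))   # 0.6875 for these values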

Arguments for ROC:
Ill-specification: The costs of choices are not well specified, and the training examples are often not drawn from the same marginal distribution as the test examples. ROC curves allow for an effective comparison over a range of different choice costs and marginal distributions.
Ill-dominance: Standard classification algorithms do not have a dominance structure as the costs vary. We should not say “algorithm A is better than algorithm B” when we don’t know the choice costs well enough to be sure.
Just-in-time use: Any system with a good ROC curve can easily be designed with a ‘knob’ that controls the rate of false positives vs. false negatives.

AROC inherits the arguments of ROC except for Ill-dominance.

Arguments for AROC:
Summarization: Humans don’t have the time to understand the complexities of a conditional comparison, so having a single number instead of a curve is valuable.
Robustness: Algorithms with a large AROC are robust against a variation in costs.

Accuracy is the traditional approach.

Arguments for Accuracy:
Summarization: As for AROC.
Intuitiveness: People understand immediately what accuracy means. Unlike (A)ROC, it’s obvious what happens when one additional example is classified wrong.
Statistical Stability: The basic test set bound shows that accuracy is stable subject to only the i.i.d. assumption. For AROC (and ROC) this is only true when the number of examples in each class is not near zero. (A sketch of a simple form of this bound appears after this list.)
Minimality: In the end, a classifier makes classification decisions. Accuracy directly measures this, while (A)ROC compromises this measure with hypothetical alternate choice costs. For the same reason, computing (A)ROC may require significantly more work than solving the problem.
Generality: Accuracy generalizes easily to multiclass accuracy, importance-weighted accuracy, and general (per-example) cost-sensitive classification. ROC curves become problematic when there are just 3 classes.
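
For reference, here is a minimal sketch of the test set bound mentioned under Statistical Stability, in its simple Hoeffding form (the exact version uses the binomial tail): with probability at least 1 - delta over the draw of m i.i.d. test examples, the true error rate exceeds the observed test error rate by at most sqrt(ln(1/delta) / (2m)).

    # Hoeffding form of the test set bound on the true error rate.
    from math import log, sqrt

    def test_set_error_upper_bound(errors, m, delta=0.05):
        """Observed `errors` mistakes on `m` i.i.d. test examples -> a bound on the
        true error rate that holds with probability at least 1 - delta."""
        return errors / m + sqrt(log(1 / delta) / (2 * m))

    print(test_set_error_upper_bound(errors=13, m=1000))   # about 0.013 + 0.039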

The Just-in-Time argument seems to be the strongest for (A)ROC. One way to rephrase this argument is “Lack of knowledge of relative costs means that classifiers should be rankers so false positive to false negative ratios can be easily altered.” In other words, this is an argument for “ranking instead of classification” rather than “(A)ROC instead of Accuracy”.
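
A minimal sketch of that knob, assuming some trained scoring function is already in hand: the only deployment-time decision is a single threshold, and moving it trades false positives against false negatives. The score function below is a made-up example.

    from math import exp

    def classify(x, score, threshold):
        # Raise `threshold` for fewer false positives; lower it for fewer false negatives.
        return 1 if score(x) >= threshold else 0

    # e.g., with a hypothetical sigmoid-style score and a conservative threshold:
    print(classify(3.0, score=lambda v: 1.0 / (1.0 + exp(-v)), threshold=0.9))   # -> 1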

Conferences, Dates, Locations

Conference, location, and date:
COLT: Bertinoro, Italy, June 27-30
AAAI: Pittsburgh, PA, USA, July 9-13
UAI: Edinburgh, Scotland, July 26-29
IJCAI: Edinburgh, Scotland, July 30 - August 5
ICML: Bonn, Germany, August 7-11
KDD: Chicago, IL, USA, August 21-24

The big winner this year is Europe. This is partly a coincidence and partly due to the general internationalization of science over the last few years. With cuts to basic science in the US and increased hassle for visitors, conferences outside the US become more attractive. Europe and Australia/New Zealand are the immediate winners because they have the science, infrastructure, and English in place. China and India are possible future winners.