At One Month

This is near the one month point, so it seems appropriate to consider meta-issues for the moment.

The number of posts is a bit over 20.
The number of people speaking up in discussions is about 10.
The number of people viewing the site is somewhat more than 100.

I am (naturally) dissatisfied with many things.

  1. Many of the potential uses haven’t been realized. This is partly a matter of opportunity (no conferences in the last month), partly a matter of will (no open problems because it’s hard to give them up), and partly a matter of tradition. In academia, there is a strong tradition of trying to get everything perfectly right before presentation. This is somewhat contradictory to the nature of making many posts, and it’s definitely contradictory to the idea of doing “public research”. If that sort of idea is to pay off, it must be significantly more successful than previous methods. In an effort to continue experimenting, I’m going to use the next week as “open problems week”.
  2. Spam is a problem. WordPress allows you to block specific posts by pattern match, but there seems to be some minor bug (or maybe a misuse) in how it matches. This resulted in everything being blocked pending approval, which is highly unnatural for any conversation. I approved all posts by real people, and I think the ‘everything blocked pending approval’ problem has been solved. A site discussing learning ought to have a better system for coping with what is spam and what is not. Today’s problem is to solve the spam problem with learning techniques. (A minimal sketch of this idea appears after this list.) (It’s not clear this is research instead of just engineering, but it is clear that it would be very valuable here and in many other places.)
  3. Information organization is a problem. This comes up in many ways. Threading would be helpful in comments because it would help localize discussion to particular contexts. Tagging of posts with categories seems inadequate because it’s hard to anticipate all the ways something might be thought about. Accessing old posts via “archives” is cumbersome. Ideally, the sequence of posts would create a well-organized virtual site. In many cases there are very good comments and it seems altering the post to summarize the comments is appropriate, but doing so leaves the comments out of context. Some mechanism of refinement which avoids this problem would be great. Many comments develop into something that should (essentially) be their own post on a new topic. Doing so is currently cumbersome, and a mechanism for making that shift would be helpful.
  4. Time commitment is a problem. Making a stream of good posts is hard and takes a while. Naturally, some were (and even still are) stored up, but that store is finite, and eventually will be exhausted. Since I’m unwilling to compromise quality, this means the rate of posts may eventually fall. The obvious solution to this is to invite other posters. (Which I have with Alex Gray and Drew Bagnell.) Consider yourself invited. Email me (jl@hunch.net) with anything you consider worth posting.
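
As a loose illustration of the learned spam filter asked for in item 2, here is a minimal sketch of a naive Bayes filter over word counts. This is not the filter used on this site; the training comments and helper names below are hypothetical placeholders.

```python
# Minimal sketch of a learned comment-spam filter: naive Bayes over word counts.
# Illustration only; the training comments below are hypothetical placeholders.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(labeled_comments):
    """labeled_comments: list of (text, label) pairs, label=True for spam."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, label in labeled_comments:
        docs[label] += 1
        counts[label].update(tokenize(text))
    return counts, docs

def is_spam(text, counts, docs, smoothing=1.0):
    vocab = set(counts[True]) | set(counts[False])
    scores = {}
    for label in (True, False):
        total = sum(counts[label].values())
        # log prior + sum of log likelihoods with add-one smoothing
        score = math.log(docs[label] / (docs[True] + docs[False]))
        for word in tokenize(text):
            score += math.log((counts[label][word] + smoothing) /
                              (total + smoothing * len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]

# Hypothetical usage:
model = train([("buy cheap pills now", True),
               ("interesting point about sample complexity", False)])
print(is_spam("cheap pills", *model))
```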

It’s not all bad and I plan to continue experimenting. Several of the discussions have been quite interesting, and I often find that the process of writing posts helps clarify my understanding. Creating a little bit more understanding seems like it is worthwhile.

What it means to do research.

I want to try to describe what doing research means, especially from the point of view of an undergraduate. The shift from a class-taking mentality to a research mentality is very significant and not easy.

  1. Problem Posing Posing the right problems is often as important as solving them. Many people can get by in research by solving problems others have posed, but that’s not sufficient for really inspiring research. For learning in particular, there is a strong feeling that we just haven’t figured out which questions are the right ones to ask. You can see this because the answers we have do not seem convincing.
  2. Gambling your life When you do research, you think very hard about new ways of solving problems, new problems, and new solutions. Many conversations are of the form “I wonder what would happen if…” These processes can be short (days or weeks) or years-long endeavours. The worst part is that you’ll only know if you were successful at the end of the process (and sometimes not even then because it can take a long time for good research to be recognized). This is very risky compared to most forms of work (or just going to classes).
  3. Concentration This is not so different from solving problems in class, except that you may need to concentrate on a problem for much longer to solve it. This often means shutting yourself off from the world (no TV, no interruptions, no web browsing, etc…) and really thinking.
  4. Lack of feedback While doing research there is often a lack of feedback or contradicting feedback. The process of writing a paper can take a month, you may not get reviews for several months, and the review process can be extremely (and sometimes systematically) noisy.
  5. Curiosity This is not merely idle curiosity, but a desire to understand things from different viewpoints, to understand that niggling detail which isn’t right, and to understand the global picture of the way things are. This often implies questioning the basics.
  6. Honesty Good researchers have to understand the way things are (at least with respect to research). Learning to admit when you are wrong can be very hard.
  7. Prioritization You have many things to do and not enough time to do them in. The need to prioritize becomes routine in research, but it’s not so common in undergrad life. This often means saying ‘no’ when you want to say ‘yes’.
  8. Memory Problems often aren’t solved in the first pass. Conversations from a year ago often contain the key to solving today’s problem. A good suite of problem-solving methods and a global understanding of how things fit together are often essential.
  9. Ephemeral Contact The people you know and work with may only be available for a few brief but intense hours a year at a conference.
  10. Opportunism Possibilities come up. They must be recognized (which is hard for conservative people) and seized (which is hard for people without enough confidence to gamble).

Not all of these traits are necessary to do good research; some of them can be compensated for and others can be learned. Many parts of academia can be understood as helping to reduce some of these difficulties. For example, teaching reduces the extreme variance of gambling on research output. Tenure provides people a stable base from which they can take greater gambles (… and often results in people doing nothing). Conferences are partly successful because they provide much more feedback than journals (which are generally slower). Weblogs might, in the future, provide even faster feedback. Many people are quite successful simply solving problems that others pose.

ROC vs. Accuracy vs. AROC

Foster Provost and I discussed the merits of ROC curves vs. accuracy estimation. Here is a quick summary of our discussion.

The “Receiver Operating Characteristic” (ROC) curve is an alternative to accuracy for the evaluation of learning algorithms on natural datasets. The ROC curve is a curve rather than a single-number statistic. In particular, this means that the comparison of two algorithms on a dataset does not always produce an obvious order.

Accuracy (= 1 – error rate) is a standard method used to evaluate learning algorithms. It is a single-number summary of performance.

AROC is the area under the ROC curve. It is a single-number summary of performance.
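
To make the relationship between these three quantities concrete, here is a minimal sketch, assuming a binary problem where the learner outputs a real-valued score (higher means more likely positive). The labels and scores are made-up illustration data; score ties and empty classes are ignored for brevity.

```python
# Sketch: accuracy, ROC curve, and AROC for a binary problem where the learner
# outputs a real-valued score. Illustration data only; ties are not handled.

def accuracy(labels, scores, threshold=0.5):
    return sum((s >= threshold) == y for y, s in zip(labels, scores)) / len(labels)

def roc_curve(labels, scores):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos          # assumes both classes are present
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in pairs:
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def aroc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(accuracy(labels, scores), aroc(roc_curve(labels, scores)))
```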

The comparison of these metrics is a subtle affair, because in machine learning, they are compared on different natural datasets. This makes some sense if we accept the hypothesis “Performance on past learning problems (roughly) predicts performance on future learning problems.”

The ROC vs. accuracy discussion is often conflated with “is the goal classification or ranking?” because ROC curve construction requires a ranking be produced. Here, we assume the goal is classification rather than ranking. (There are several natural problems where ranking of instances is much preferred to classification. In addition, there are several natural problems where classification is the goal.)

Arguments for ROC:

  1. Ill-specification: The costs of choices are not well specified. The training examples are often not drawn from the same marginal distribution as the test examples. ROC curves allow for an effective comparison over a range of different choice costs and marginal distributions.
  2. Ill-dominance: Standard classification algorithms do not have a dominance structure as the costs vary. We should not say “algorithm A is better than algorithm B” when we don’t know the choice costs well enough to be sure.
  3. Just-in-Time use: Any system with a good ROC curve can easily be designed with a ‘knob’ that controls the rate of false positives vs. false negatives. (A threshold sketch appears after this list.)
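
To make the Just-in-Time point concrete, here is a hedged sketch of the ‘knob’: sweeping the decision threshold of a scoring classifier trades false positives for false negatives. The scores and labels are made-up illustration data.

```python
# Sketch of the Just-in-Time 'knob': varying the decision threshold of a
# scoring classifier trades false positives for false negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
for threshold in (0.2, 0.5, 0.8):
    preds = [s >= threshold for s in scores]
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```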

AROC inherits the arguments of ROC except for Ill-dominance.

Arguments for AROC:

  1. Summarization: Humans don’t have the time to understand the complexities of a conditional comparison, so having a single number instead of a curve is valuable.
  2. Robustness: Algorithms with a large AROC are robust against a variation in costs.

Accuracy is the traditional approach.

Arguments for Accuracy:

  1. Summarization: As for AROC.
  2. Intuitiveness: People understand immediately what accuracy means. Unlike (A)ROC, it’s obvious what happens when one additional example is classified wrong.
  3. Statistical Stability: The basic test set bound shows that accuracy is stable subject to only the IID assumption. For AROC (and ROC) this is only true when the number of examples in each class is not near zero. (One standard form of the bound is sketched after this list.)
  4. Minimality: In the end, a classifier makes classification decisions. Accuracy directly measures this while (A)ROC compromises this measure with hypothetical alternate choice costs. For the same reason, computing (A)ROC may require significantly more work than solving the problem.
  5. Generality: Accuracy generalizes easily to multiclass accuracy, importance-weighted accuracy, and general (per-example) cost-sensitive classification. ROC curves become problematic with even 3 classes.
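
For the Statistical Stability argument, the post does not spell out the “basic test set bound”; one standard (Hoeffding-style) form of it, for a classifier c with true error rate err(c) and empirical error rate computed on m IID test examples, is the following.

```latex
% One standard (Hoeffding-style) form of the test set bound:
% for any classifier c, m IID test examples, and any \delta > 0,
\Pr\left[\; \bigl|\widehat{\mathrm{err}}(c) - \mathrm{err}(c)\bigr|
    \ge \sqrt{\frac{\ln(2/\delta)}{2m}} \;\right] \le \delta
```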

The Just-in-Time argument seems to be the strongest for (A)ROC. One way to rephrase this argument is “Lack of knowledge of relative costs means that classifiers should be rankers so false positive to false negative ratios can be easily altered.” In other words, this is an argument for “ranking instead of classification” rather than “(A)ROC instead of Accuracy”.

Intuitions from applied learning

Since learning is far from an exact science, it’s good to pay attention to basic intuitions of applied learning. Here are a few I’ve collected.

  1. Integration In Bayesian learning, the posterior is computed by an integral, and the optimal thing to do is to predict according to this integral. This phenomenon seems to be far more general: you can average over many different classification predictors to improve performance. Bagging, Boosting, SVMs, and Neural Networks all take advantage of this idea to some extent. (A minimal averaging sketch appears after this list.) Sources: Zoubin, Caruana
  2. Differentiation Different pieces of an average should differentiate to achieve good performance by different methods. This is known as the ‘symmetry breaking’ problem for neural networks, and it’s why weights are initialized randomly. Boosting explicitly attempts to achieve good differentiation by creating new, different, learning problems. Sources: Yann LeCun, Phil Long
  3. Deep Representation A deep representation is necessary for a good general learner. Decision trees and convolutional neural networks take advantage of this. SVMs get around it by allowing the user to engineer knowledge into the kernel. Boosting and Bagging rely on another algorithm for this. Sources: Yann LeCun
  4. Fine Representation of Bias Many learning theory applications use just a coarse representation of bias such as “function in the hypothesis class or not”. In practice, significantly better performance is achieved from a more finely tuned bias. Bayesian learning has this built in via the prior. Other techniques can take advantage of this as well. Sources: Zoubin, personal experience.
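
As a minimal sketch of the Integration intuition from item 1: average the probabilistic predictions of several base classifiers and threshold the average. The base classifiers here are entirely hypothetical stand-ins; in practice they might be trees in a bagged ensemble or models weighted by an approximate posterior.

```python
# Sketch of the 'Integration' intuition: averaging the predictions of several
# different classifiers often beats any single one of them.

def averaged_prediction(x, classifiers):
    """Average probabilistic predictions, then threshold to get a class."""
    avg = sum(c(x) for c in classifiers) / len(classifiers)
    return 1 if avg >= 0.5 else 0

# Hypothetical base predictors mapping a feature vector to P(label = 1).
classifiers = [
    lambda x: 0.9 if x[0] > 0 else 0.2,
    lambda x: 0.6 if x[1] > 0 else 0.4,
    lambda x: 0.7 if x[0] + x[1] > 0 else 0.3,
]
print(averaged_prediction((1.0, -0.5), classifiers))
```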

If you have others, please comment on them.