# Machine Learning (Theory)

## 3/21/2005

### Research Styles in Machine Learning

Tags: Organization jl@ 4:50 pm

Machine Learning is a field with an impressively diverse set of reseearch styles. Understanding this may be important in appreciating what you see at a conference.

1. Engineering. How can I solve this problem? People in the engineering research style try to solve hard problems directly by any means available and then describe how they did it. This is typical of problem-specific conferences and communities.
2. Scientific. What are the principles for solving learning problems? People in this research style test techniques on many different problems. This is fairly common at ICML and NIPS.
3. Mathematical. How can the learning problem be mathematically understood? People in this research style prove theorems with implications for learning but often do not implement (or test algorithms). COLT is a typical conference for this style.

Many people manage to cross these styles, and that is often beneficial.

Whenver we list a set of alternative, it becomes natural to think “which is best?” In this case of learning it seems that each of these styles is useful, and can lead to new useful discoveries. I sometimes see failures to appreciate the other approaches, which is a shame.

## 3/18/2005

### Binomial Weighting

Tags: Online,Papers,Problems jl@ 8:16 pm

Suppose we have a set of classifiers c making binary predictions from an input x and we see examples in an online fashion. In particular, we repeatedly see an unlabeled example x, make a prediction y’(possibly based on the classifiers c), and then see the correct label y.

When one of these classifiers is perfect, there is a great algorithm available: predict according to the majority vote over every classifier consistent with every previous example. This is called the Halving algorithm. It makes at most log2 |c| mistakes since on any mistake, at least half of the classifiers are eliminated.

Obviously, we can’t generally hope that the there exists a classifier which never errs. The Binomial Weighting algorithm is an elegant technique allowing a variant Halving algorithm to cope with errors by creating a set of virtual classifiers for every classifier which occasionally disagree with the original classifier. The Halving algorithm on this set of virtual classifiers satisfies a theorem of the form:

errors of binomial weighting algorithm less than minc f(number of errors of c, number of experts)

The Binomial weighting algorithm takes as a parameter the maximal minimal number of mistakes of a classifier. By introducing a “prior” over the number of mistakes, it can be made parameter free. Similarly, introducing a “prior” over the set of classifiers is easy and makes the algorithm sufficiently flexible for common use.

However, there is a problem. The minimal value of f() is 2 times the number of errors of any classifier, regardless of the number of classifiers. This is frustrating because a parameter-free learning algorithm taking an arbitrary “prior” and achieving good performance on an arbitrary (not even IID) set of examples is compelling for implementation and use, if we had a good technique for removing the factor of 2. How can we do that?

See the weighted majority algorithm for an example of a similar algorithm which can remove a factor of 2 using randomization and at the expense of introducing a parameter. There are known techniques for eliminating this parameter, but they appear not as tight (and therefore practically useful) as introducing a “prior” over the number of errors.

## 3/17/2005

### Going all the Way, Sometimes

Tags: General jl@ 8:01 pm

At many points in research, you face a choice: should I keep on improving some old piece of technology or should I do something new? For example:

1. Should I refine bounds to make them tighter?
2. Should I take some learning theory and turn it into a learning algorithm?
3. Should I implement the learning algorithm?
4. Should I test the learning algorithm widely?
5. Should I release the algorithm as source code?
6. Should I go see what problems people actually need to solve?

The universal temptation of people attracted to research is doing something new. That is sometimes the right decision, but is also often not. I’d like to discuss some reasons why not.

1. Expertise Once expertise are developed on some subject, you are the right person to refine them.
2. What is the real problem? Continually improving a piece of technology is a mechanism forcing you to confront this question. In many cases, this confrontation is uncomfortable because you discover that your method has fundamental flaws with respect to solving the real problem. Not going all the way means you never discover this, except the hard way—people lose interest in your work.
3. Virtues of breadth When you go all the way, you gain breadth, with a deeper understanding about which problems are important and why. This can be invaluable in focusing your future research.
4. More Tangible Accomplishment Going all the way means that you can point to your peers and say “I solved it”.

Going all the way is sometimes problematic in research. For example, a paper with a theory, an algorithm, and experimental results invites defeat-in-detail: a reviewer can disagree with any one of these components and eliminate it from consideration. Another issue is that academia doesn’t directly reward implementing and releasing algorithms. A third issue is that you will almost certainly discover topics of interest which don’t fit your home conference(s). It is also very difficult to publish a paper with the title “an incremental improvement on X (which makes it work great in practice)”.

Along with this advice, it is important to remember to fail fast, where appropriate. When you discover that an idea is not workable, quickly quitting it and moving on is a real virtue.

## 3/15/2005

### The State of Tight Bounds

Tags: Prediction Theory jl@ 11:49 am

What? Bounds are mathematical formulas relating observations to future error rates assuming that data is drawn independently. In classical statistics, they are calld confidence intervals.
Why?

1. Good Judgement. In many applications of learning, it is desirable to know how well the learned predictor works in the future. This helps you decide if the problem is solved or not.
2. Learning Essence. The form of some of these bounds helps you understand what the essence of learning is.
3. Algorithm Design. Some of these bounds suggest, motivate, or even directly imply learning algorithms.

What We Know Now

There are several families of bounds, based on how information is used.

1. Testing Bounds. These are methods which use labeled data not used in training to estimate the future error rate. Examples include the test set bound, progressive validation also here and here, train and test bounds, and cross-validation (but see the big open problem). These techniques are the best available for goal (1) above, but provide little information towards goals (2) or (3). Some of these techniques are computationally efficient while others are not.
2. Unlabeled test set bounds Instead of using labeled data to construct a tight bound, it is sometimes possible to use unlabeled data.
3. Training Bounds These are methods which use labeled data to for both training and testing. These bounds provide insight into goals (2) and (3), but are not very effective for goal (1) above. These bounds teach us about how many independent examples are required to guarantee learning success on different on different representations. Any bound of this sort implies a learning algorithm: “minimize the bound”. The set covering machine is an application of this approach to a (variant) sample compression bound. Most other potential applications aren’t “quite there” yet, although they are close. Here is a list of learning algorithms and bounds that “almost work” for these algorithms.
1. SVM PAC-Bayes margin bound.
2. Perceptron PAC-Bayes margin bound, Sample Compression Bound
3. Neural Network PAC-Bayes bound.
4. Decision Tree Occam’s razor bound.
5. Decision Tree Pruning Rademacher Complexity Bounds
6. Semisupervised Clustering PAC-MDL or transductive PAC-Bayes bound
7. Nearest neighbor ??(Note that the cross validation approaches work well here.)

Limitations
The independence assumption is a significant limitation in the applicability of this approach since we often can not believe that independence is satisfied on natural learning problems. Some work has gone into weakening this assumption.

## 3/13/2005

Tags: Reviewing jl@ 11:08 am

If we accept that bad reviewing often occurs and want to fix it, the question is “how”?

Reviewing is done by paper writers just like yourself, so a good proxy for this question is asking “How can I be a better reviewer?” Here are a few things I’ve learned by trial (and error), as a paper writer, and as a reviewer.

1. The secret ingredient is careful thought. There is no good substitution for a deep and careful understanding.
2. Avoid reviewing papers that you feel competitive about. You almost certainly will be asked to review papers that feel competitive if you work on subjects of common interest. But, the feeling of competition can easily lead to bad judgement.
3. If you feel biased for some other reason, then you should avoid reviewing. For example…
4. Feeling angry or threatened by a paper is a form of bias. See above.
5. Double blind yourself (avoid looking at the name even in a single-blind situation). The significant effect of a name you recognize is making you pay close attention to a paper. Since not paying enough attention is a standard failure mode of reviewers, a name you recognize is inevitably unfair to authors you do not recognize.
6. Don’t review fast. For conferences there is a tendency to review papers right at the deadline. This tendency can easily result in misjudgements because you do not have the opportunity to really understand what a paper is saying.
7. Don’t review too much. “Too much” is defined on a per-person basis. If you don’t have time to really understand the papers that you review, then you should say “no” to review requests.
8. Overconfidence is the enemy of truth. If you are not confident about your review, you should not state that you are. Bad reviews are often overconfident reviews.
9. Always try to make review comments nonpersonal and constructive, especially in a rejection.

Given the above, observations, a few suggestions for improved review organization can be derived.

1. Double blind. A common argument against double blind reviewing is that it is often defeatable. This is correct and misses the point. The reason why double blind reviewing is helpful is that a typical reviewer who wants to review well is aided by the elimination of side information which should not effect the acceptance of a paper. (ICML and AAAI are double blind this year.) Another reason why double blind reviewing is “right”, is that it simply appears fairer. This makes it easier on average for authors to take rejections in a more constructive manner.
2. Staggered deadlines. Many people can’t prioritize reviews well, so the prioritization defaults to deadline proximity. Consequently, instead of having many paper reviews due on one day, having them due at the rate of one-per-day (or an even slower rate) may be helpful. These should be real deadlines in the sense that “you get it in by this date or you are excluded from conversation and decision making about the paper”.
3. Large PCs. There is a tendency to value (and romanticize) the great researcher. But a great researcher with many papers to review can only be a mediocre reviewer due to lack of available attention and time. Consequently, increasing the size of the PC may be helpful for small PC conferences.
4. Communication channels. A typical issue in reviewing a paper is that some detail is unintentionally (and accidentally) unclear. In this case, being able to communicate with the authors is helpful. This communication can be easily setup to respect the double blind guarantee by routing through the conference site. This communication does not change the meaning of a reviewers job. ICML and AAAI are allowing author feedback. I mean something more spontaneous, but this is a step in that direction.
5. Refusal. In many cases, it is not possible to tell that you have a conflict of interest in a paper until after seeing it. A mechanism for saying “I have a conflict of interest, please reassign the paper” should exist, and it’s use should be respected.
6. Independence. Access to other reviews should not be available until after completing your own review. The point of having multiple reviews is reducing noise. Allowing early access to other reviews increases noise by decreasing independence amongst reviewers. Many conferences (but not all) follow this pattern.