Presentation Preparation

A big part of doing research is presenting it at a conference. Since many people start out shy of public presentations, this can be a substantial challenge. Here are a few notes which might be helpful when thinking about preparing a presentation on research.

  1. Motivate. Talks which don’t start by describing the problem to solve cause many people to zone out.
  2. Prioritize. It is typical that you have more things to say than time to say them, and many presenters fall into the failure mode of trying to say too much. This is an easy-to-understand failure mode as it’s very natural to want to include everything. A basic fact is: you can’t. Example of this are:
    1. Your slides are so densely full of equations and words that you can’t cover them.
    2. Your talk runs over and a moderator prioritizes for you by cutting you off.
    3. You motor-mouth through the presentation, and the information absorption rate of the audience prioritizes in some uncontrolled fashion.
    4. The rate of flow of concepts simply exceeds the information capacity of the audience. Even with nondense slides and an easy succinct delivery, this can often happen.

    One way to prioritize is figure out: “What would I present in 1 minute?” or “What would I present in 5 minutes?”, and then let this guide your overall presentation.

  3. Unassume. When you are working in an area, it’s typical to buildup an internal shorthand for concepts. This needs to be peeled away when preparing a presentation. Decide what the minimal set of concepts are, and then be sure to define them as they are introduced. For people familiar with the basic concepts, this gives them a way to reconcile choices of language, and others at least have a prayer of following.
  4. Practice Well. Some people try to get a talk right by practicing it relentlessly until it is memorized, and then deliver it as a memorized monologue. This is terrible, because people in the audience know it is a memorized monologue and zone out. A good talk is delivered like a conversation, where it happens to be your turn to speak for awhile, and practicing that is more difficult. Some practice by yourself can be helpful—but not too much. A much better method is to practice on your friends by delivering to them before delivering it to the wider world.

The points above avoid the common failure modes which seem to come up with first-time presenters. There is much more advice to give (and for me to learn) about giving better presentations.

Interesting Papers at COLT 2007

Here are two papers that seem particularly interesting at this year’s COLT.

  1. Gilles Blanchard and François Fleuret, Occam’s Hammer. When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight. A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper.
  2. Adam Tauman Kalai. Learning Nested Halfspaces and Uphill Decision Trees. Classification PAC-learning, where you prove that any problem amongst some set is polytime learnable with respect to any distribution over the input X is extraordinarily challenging as judged by lack of progress over a long period of time. This paper is about regression PAC-learning, and the results appear much more encouraging than exist in classification PAC-learning. Under the assumption that:
    1. The level sets of the correct regressed value are halfspaces.
    2. The level sets obey a Lipschitz condition.

    this paper proves that a good regressor can be PAC-learned using a boosting algorithm. (The “uphill decision trees” part of the paper is about one special case where you don’t need the Lipschitz condition.)

Not Posting

If you have been disappointed by the lack of a post for the last month, consider contributing your own (I’ve been busy+uninspired). Also, keep in mind that there is a community of machine learning blogs (see the sidebar).

Loss Function Semantics

Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself.

  1. Optimizing squared loss lsq(y,y’)=(y-y’)2 means predicting the (conditional) mean of y.
  2. Optimizing absolute value loss lav(y,y’)=|y-y’| means predicting the (conditional) median of y. Variants can handle other quantiles. 0/1 loss for classification is a special case.
  3. Optimizing log loss llog(y,y’)=log (1/Prz~y’(z=y)) means minimizing the description length of y.

The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form:
For all distributions D over Y, if

y’ = arg miny’ Ey ~ D lsq (y,y’)
then
y’ = Ey~D y

Similar theorems hold for the other examples above, and they can all be extended to predictors of y’ for distributions D over a context X and a value Y.

There are 3 points to this post.

  1. Everyone doing general machine learning should be aware of the laundry list above. They form a handy toolkit which can match many of the problems naturally encountered.
  2. People also try to optimize a variety of other loss functions. Some of these are (effectively) a special case of the above. For example, “hinge loss” is absolute value loss when the hinge point is at the upper range. Some of the other losses do not have any known semantics. In this case, discovering a semantics could be quite valuable.
  3. The natural direction when thinking about how to solve a problem is to start with the semantics you want and then derive a loss. I don’t know of any general way to do this other than simply applying the laundry list above. As one example, what is a loss function for estimating the mean of a random variable y over the 5th to 95th quantile? (How do we do squared error regression which is insensitive to outliers?) Which semantics are satisfiable with a loss?

The Missing Bound

Sham Kakade points out that we are missing a bound.

Suppose we have m samples x drawn IID from some distribution D. Through the magic of exponential moment method we know that:

  1. If the range of x is bounded by an interval of size I, a Chernoff/Hoeffding style bound gives us a bound on the deviations like O(I/m0.5) (at least in crude form). A proof is on page 9 here.
  2. If the range of x is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). This can be a huge improvment when the true variance small.

What’s missing here is a bound that depends on the observed variance rather than a bound on the variance. This means that many people attempt to use Bennett’s bound (incorrectly) by plugging the observed variance in as the true variance, invalidating the bound application. Most of the time, they get away with it, but this is a dangerous move when doing machine learning. In machine learning, we are typically trying to find a predictor with 0 expected loss. An observed loss of 0 (i.e. 0 training error) implies an observed variance of 0. Plugging this into Bennett’s bound, you can construct a wildly overconfident bound on the expected loss.

One safe way to apply Bennett’s bound is to use McDiarmid’s inequality to bound the true variance given an observed variance, and then plug this bound on the true variance into Bennett’s bound (making sure to share the confidence parameter between both applications) on the mean. This is a clumsy and relatively inelegant method.

There should exist a better bound. If we let the observed mean of a sample S be u(S) and the observed variance be v(S), there should exist a bound which requires only a bounded range (like Chernoff), yet which is almost as tight as the Bennett bound. It should have the form:

PrS ~ Dm ( Ex~D x <= f(u(S), v(S) ,d)) >= 1 – d

For machine learning, a bound of this form may help design learning algorithms which learn by directly optimizing bounds. However, there are many other applications both within and beyond machine learning.

(*) Incidentally, sometimes people try to apply the Bennett inequality when they only know the range of the random variable by computing the worst case variance within that range. This is never as good as a proper application of the Chernoff/Hoeffding bound.