Loss Function Semantics

Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself.

  1. Optimizing squared loss lsq(y,y’)=(y-y’)2 means predicting the (conditional) mean of y.
  2. Optimizing absolute value loss lav(y,y’)=|y-y’| means predicting the (conditional) median of y. Variants can handle other quantiles. 0/1 loss for classification is a special case.
  3. Optimizing log loss llog(y,y’)=log (1/Prz~y’(z=y)) means minimizing the description length of y.

The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form:
For all distributions D over Y, if

y’ = arg miny’ Ey ~ D lsq (y,y’)
then
y’ = Ey~D y

Similar theorems hold for the other examples above, and they can all be extended to predictors of y’ for distributions D over a context X and a value Y.

There are 3 points to this post.

  1. Everyone doing general machine learning should be aware of the laundry list above. They form a handy toolkit which can match many of the problems naturally encountered.
  2. People also try to optimize a variety of other loss functions. Some of these are (effectively) a special case of the above. For example, “hinge loss” is absolute value loss when the hinge point is at the upper range. Some of the other losses do not have any known semantics. In this case, discovering a semantics could be quite valuable.
  3. The natural direction when thinking about how to solve a problem is to start with the semantics you want and then derive a loss. I don’t know of any general way to do this other than simply applying the laundry list above. As one example, what is a loss function for estimating the mean of a random variable y over the 5th to 95th quantile? (How do we do squared error regression which is insensitive to outliers?) Which semantics are satisfiable with a loss?

The Missing Bound

Sham Kakade points out that we are missing a bound.

Suppose we have m samples x drawn IID from some distribution D. Through the magic of exponential moment method we know that:

  1. If the range of x is bounded by an interval of size I, a Chernoff/Hoeffding style bound gives us a bound on the deviations like O(I/m0.5) (at least in crude form). A proof is on page 9 here.
  2. If the range of x is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). This can be a huge improvment when the true variance small.

What’s missing here is a bound that depends on the observed variance rather than a bound on the variance. This means that many people attempt to use Bennett’s bound (incorrectly) by plugging the observed variance in as the true variance, invalidating the bound application. Most of the time, they get away with it, but this is a dangerous move when doing machine learning. In machine learning, we are typically trying to find a predictor with 0 expected loss. An observed loss of 0 (i.e. 0 training error) implies an observed variance of 0. Plugging this into Bennett’s bound, you can construct a wildly overconfident bound on the expected loss.

One safe way to apply Bennett’s bound is to use McDiarmid’s inequality to bound the true variance given an observed variance, and then plug this bound on the true variance into Bennett’s bound (making sure to share the confidence parameter between both applications) on the mean. This is a clumsy and relatively inelegant method.

There should exist a better bound. If we let the observed mean of a sample S be u(S) and the observed variance be v(S), there should exist a bound which requires only a bounded range (like Chernoff), yet which is almost as tight as the Bennett bound. It should have the form:

PrS ~ Dm ( Ex~D x <= f(u(S), v(S) ,d)) >= 1 – d

For machine learning, a bound of this form may help design learning algorithms which learn by directly optimizing bounds. However, there are many other applications both within and beyond machine learning.

(*) Incidentally, sometimes people try to apply the Bennett inequality when they only know the range of the random variable by computing the worst case variance within that range. This is never as good as a proper application of the Chernoff/Hoeffding bound.

Alternative Machine Learning Reductions Definitions

A type of prediction problem is specified by the type of samples produced by a data source (Example: X x {0,1}, X x [0,1], X x {1,2,3,4,5}, etc…) and a loss function (0/1 loss, squared error loss, cost sensitive losses, etc…). For simplicity, we’ll assume that all losses have a minimum of zero.

For this post, we can think of a learning reduction as

  1. A mapping R from samples of one type T (like multiclass classification) to another type T’ (like binary classification).
  2. A mapping Q from predictors for type T’ to predictors for type T.

The simplest sort of learning reduction is a “loss reduction”. The idea in a loss reduction is to prove a statement of the form:
Theorem For all base predictors b, for all distributions D over examples of type T:

E(x,y) ~ D LT(y,Q(b,x)) <= f(E(x’,y’)~R(D) LT’(y’,b(x’)))

Here LT is the loss for the type T problem and LT’ is the loss for the type T’ problem. Also, R(D) is the distribution over samples induced by first drawing from D and then mapping the sample via R. The function f() is the loss transform function—we try to find reductions R,Q which minimize it’s value.

If R,Q are deterministic, then there always exists a choice of D,b such that the loss rate on the right hand side is 0. However, it’s common to encounter real-world learning problems D which are inherently noisy, implying that the induced problem D’ is often inherently noisy. Distinguishing between errors due to environmental noise and errors due to base predictor mistakes seems important (and experimentally, it has been). Regret transform reductions can get at this. They have theorems of the form:
Theorem For all base predictors b, for all distributions D over examples of type T:

E(x,y) ~ D LT(y,Q(b,x)) – minc E(x,y) ~ D LT(y,c(x)) <= f(E(x’,y’)~R(D) LT’(y’,b(x’)) – minb’ E(x’,y’)~R(D) LT’(y’,b'(x’)))

The essential idea in regret transform reductions is that we subtract off the inherent noise in both the induced and original problem, and bound the excess loss due to suboptimal prediction directly.

The skeletons of the theory for these families of reductions have been layed out at this point. There remain some open problems, but another interesting direction to consider is other families of reductions. The hope is that by placing more stringent requirements on reductions, we limit ourselves to algorithms which tend to perform better in practice. This hope is pretty reasonable—empirically, we have observed a consistent step up in performance going from loss transform to regret transform reductions.

  1. Limited Regret Transform Reductions. The fact that the minimum is taken over all predictors in regret transforms is counterintuitive to some people, who are used to “Empirical Risk Minimization” statements where a minimum is taken over a limited set of predictors. We could imagine theorem statements of the form:
    Theorem For all sets of base predictors B, For all base predictors b, for all distributions D over examples of type T:
    E(x,y) ~ D LT(y,Q(b,x)) – minb’ in B E(x,y) ~ D LT(y,Q(b’,x)) <= f(E(x’,y’)~R(D) LT’(y’,b(x’)) – minb’ in B E(x’,y’)~R(D) LT’(y’,b'(x’)))

    This is a more general statement than a regret transform reduction—when B is the set of all base predictors, we recover standard regret transforms
    One case where it’s easy to see that this kind of statement holds is for the reduction from importance weighted binary classification to binary classification. However, little more is currently known.
  2. Reversible Reductions. This is an idea which Russell Impagliazzo first mentioned to me. Essentially, we limit ourselves to reductions with the property that they are reversible. Reversibility can be tested by mapping from one problem to another, and then back. There are a several variant theorem statements we could imagine. The most tractable variant for analysis might be the following:
    Theorem There exists R-1,Q-1 such that for all base predictors b, for base learning problems D’:
    E(x’,y’)~D’ LT’(y’,b(x’)) = E(x’,y’) ~ R(R-1(D’)) LT’(y’,b(x’))

    and Q-1(Q(b))=b

    Closely related (but different) is the following:
    Theorem There exists R-1,Q-1 such that for all type T predictors h, for all type T distributions D:
    E(x,y) ~ D LT(y,h(x)) = E(x,y)~R-1(R(D)) LT(y,h(x))

    and Q(Q-1(h)) = h
  3. Bayesian Reductions This is an idea which Simon Osindero mentioned. The basic observation is that Bayes Law is pretty important to the process of learning. We would like it to be the case that Bayes Law and reductions compose. A theorem statement of the following form might be about right.
    Theorem For some large family of priors P over distributions D of type T:
    Bayes(P,(x,y)~D~P) = Q(Bayes(R(P),(x’,y’)~D’~R(P)))

    Here “Bayes” is a learning algorithm which takes as input a prior P (or R(P)), and a sample (x,y) drawn by first drawing a D from P and then drawing from D (and similarly for the induced problem). Also, R(P) is the prior induced by mapping D to R(D) after drawing from P.

The two missing components for these kinds of reductions are:

  1. Theoretical evidence that we can satisfy these definitions of reduction between interesting types of learning problems.
  2. Empirical evidence that algorithmic modifications driven by the theory are useful.

My experience is that analyzing reductions has yielded significant insight into how to solve learning problems, so I would encourage anyone with a bit of theoretical inclination in Machine Learning to consider the above (or other) families of reductions.

Parallel Machine Learning Problems

Parallel machine learning is a subject rarely addressed at machine learning conferences. Nevertheless, it seems likely to increase in importance because:

  1. Data set sizes appear to be growing substantially faster than computation. Essentially, this happens because more and more sensors of various sorts are being hooked up to the internet.
  2. Serial speedups of processors seem are relatively stalled. The new trend is to make processors more powerful by making them multicore.
    1. Both AMD and Intel are making dual core designs standard, with plans for more parallelism in the future.
    2. IBM’s Cell processor has (essentially) 9 cores.
    3. Modern graphics chips can have an order of magnitude more separate execution units.

    The meaning of ‘core’ varies a bit from processor to processor, but the overall trend seems quite clear.

So, how do we parallelize machine learning algorithms?

  1. The simplest and most common technique is to simply run the same learning algorithm with different parameters on different processors. Cluster management software like OpenMosix, Condor, or Torque are helpful here. This approach doesn’t speed up any individual run of a learning algorithm.
  2. The next simplest technique is to decompose a learning algorithm into an adaptive sequence of statistical queries and parallelize the queries over the sample. This paper (updated from the term paper according to a comment) points out that statistical integration can be implemented via MapReduce which Google popularized (the open source version is Hadoop). The general technique of parallel statistical integration is already used by many people including IBM’s Parallel Machine Learning Toolbox. The disadvantage of this approach is that it is particularly good at speeding up slow algorithms. One example of a fast sequential algorithm is perceptron. The perceptron works on a per example basis making individual updates extremely fast. It is explicitly not a statistical query algorithm.
  3. The most difficult method for speeding up an algorithm is fine-grained structural parallelism. The brain works in this way: each individual neuron operates on it’s own. The reason why this is difficult is that the programming is particularly tricky—you must carefully optimize to avoid latency in network communication. The full complexity of parallel programming is exposed.

A basic question is: are there other approaches to speeding up learning algorithms which don’t incur the full complexity of option (3)? One approach is discussed here.

The Spam Problem

The New York Times has an article on the growth of spam. Interesting facts include: 9/10 of all email is spam, spam source identification is nearly useless due to botnet spam senders, and image based spam (emails which consist of an image only) are on the growth.

Estimates of the cost of spam are almost certainly far to low, because they do not account for the cost in time lost by people.

The image based spam which is currently penetrating many filters should be catchable with a more sophisticated application of machine learning technology. For the spam I see, the rendered images come in only a few formats, which would be easy to recognize via a support vector machine (with RBF kernel), neural network, or even nearest-neighbor architecture. The mechanics of setting this up to run efficiently is the only real challenge. This is the next step in the spam war.

The response to this system is to make the image based spam even more random. We should (essentially) expect to see Captcha spam, and our inability to recognize captcha spam should persist as long as the vision problem is not solved. This hopefully degrades the value of spam to the spammers, but it may not make the value of spam nonzero.

Solutions beyond machine learning may be necessary. One simple economic solution is to transfer from first time sender to receiver a small amount (10 cents?) in a verifiable manner. If the receiver classifies the email as spam then the charge repeats on the next receipt, and otherwise it goes away.

There are several difficulties with this approach: How do you change a huge system in heavy use which no one controls? How do you deal with mailing lists? These problems appear surmountable. For example, we could extend the mail protocol to include a payment system (using the “X-” lines) and use the existence of a payment as a feature in existing spam-or-not prediction systems. Over time, this feature may become the most useful feature encouraging every legitimate email user to offer a small payment with the first email to a recipient.