Optimal Proxy Loss for Classification

Many people in machine learning take advantage of the notion of a proxy loss: a loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x) = sum_i w_i x_i, where optimizing squared loss (y - f_w(x))^2 over many samples is much more tractable than optimizing 0-1 loss I(y ≠ Threshold(f_w(x) - 0.5)).

While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices:

  1. Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log(1/p'(y|x)), where p'(y|x) is the predicted probability of the observed label. A standard method for confining the prediction to [0,1] is logistic regression, which exponentiates the dot product and normalizes.
  2. Squared loss. The squared loss approach (discussed above) is also quite common. It shares the same “proper scoring rule” semantics as log loss: the optimal representation-independent predictor is the conditional probability of the label y given the features x.
  3. Hinge loss. For hinge loss, you optimize max(0, 1 - 4(y - 0.5)(f_w(x) - 0.5)). The form of hinge loss is slightly unfamiliar because the label is {0,1} rather than {-1,1}. The optimal prediction for hinge loss is not the probability of y given x, but rather some value which is at least 1 if the most likely label is 1 and at most 0 if the most likely label is 0. Hinge loss was popularized with support vector machines. Hinge loss is not a proper scoring rule for the mean, but since it does get the sign right, using it for classification is reasonable. All three losses are sketched in code below.
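As a concrete reference, here is a minimal sketch of the three proxy losses for labels in {0,1} (the function names are mine, not anything standard from the discussion above):

```python
import numpy as np

def log_loss(y, p):
    # p is the predicted probability that the label is 1; loss is log(1/p'(y|x))
    return -np.log(p) if y == 1 else -np.log(1.0 - p)

def squared_loss(y, f):
    # proper scoring rule: minimized in expectation at f = p(y=1|x)
    return (y - f) ** 2

def hinge_loss(y, f):
    # {0,1}-label form; equivalent to max(0, 1 - y'f') after mapping y and f to the {-1,1} scale
    return max(0.0, 1.0 - 4.0 * (y - 0.5) * (f - 0.5))
```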

Many people have made qualitative arguments about why one loss is better than another. For example see Yaroslav’s old post for an argument about the comparison of log loss and hinge loss and why hinge loss might be better. In the following, I make an elementary quantitative argument.

Log loss is qualitatively dissimilar from the other two, because it is unbounded on the range of interest. Restated, there is no reason other than representational convenience that f_w(x) needs to take a value outside of the interval [0,1] for squared loss or hinge loss. In fact, we can freely reduce these losses by considering instead the function f_w′(x) = max(0, min(1, f_w(x))). The implication is that optimization of log loss can be unstable in ways that optimization of these other losses is not. This can be stated precisely by noting that sample complexity bounds (simple ones here) for 0-1 loss hold for f_w′(x) under squared or hinge loss, but the same theorem statement does not hold for log loss without additional assumptions. Since stability and convergence are of substantial interest in machine learning, this suggests not using log loss.
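A quick numerical check of the clipping claim, reusing the loss functions sketched above (the test values are arbitrary):

```python
def clip01(f):
    # project a raw prediction onto [0,1]
    return max(0.0, min(1.0, f))

# clipping never increases squared or hinge loss when labels lie in {0,1}
for y in (0, 1):
    for f in (-0.7, 0.2, 0.5, 1.4):
        assert squared_loss(y, clip01(f)) <= squared_loss(y, f)
        assert hinge_loss(y, clip01(f)) <= hinge_loss(y, f)
```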

For further analysis, we must first define some function converting f_w′(x) into label predictions. The only reasonable approach is to threshold at 0.5. For log loss and squared loss, any other threshold is inconsistent. Since the optimal predictor for hinge loss always takes value 0 or 1, there is some freedom in how we convert, but a reasonable approach is to also threshold at 0.5.

Now, we want to analyze the stability of predictions. In other words, if an adversary picks the true conditional probability distribution p(y|x) and the prediction f_w′(x), how does the proxy loss of f_w′(x) bound the 0-1 loss? Since we imagine that the conditional distribution is noisy, it’s important to actually consider a regret: the loss our predictor incurs minus the loss of the best possible predictor.

For each of these losses, an optimal strategy of the adversary is to have p(y|x) take value 0.5 - eps and f_w′(x) = 0.5. The 0-1 regret induced is simply 2 eps, since the best possible predictor has error rate 0.5 - eps while the actual predictor has error rate 0.5 + eps. For the hinge loss as defined above, the regret is 2 eps, and for squared loss the regret is eps^2. Doing some algebra, this implies that hinge_regret bounds 0-1 regret while 2 squared_regret^0.5 bounds 0-1 regret. Since we are only interested in regrets less than 1, the square root is undesirable, and hinge loss is preferred, because a stronger convergence of squared loss is needed to achieve the same guarantee on 0-1 loss.
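A brute-force check of these regrets, reusing the loss functions from the sketch above (eps = 0.1 and the grid resolution are arbitrary choices; prediction-to-label conversion is the 0.5 threshold):

```python
import numpy as np

eps = 0.1
q = 0.5 - eps  # p(y=1|x): the best label prediction is 0

def expected_loss(loss, f):
    return q * loss(1, f) + (1 - q) * loss(0, f)

for loss in (squared_loss, hinge_loss):
    best = min(expected_loss(loss, f) for f in np.linspace(0.0, 1.0, 10001))
    regret = expected_loss(loss, 0.5) - best
    print(loss.__name__, round(regret, 4))
# squared_loss 0.01   (= eps^2)
# hinge_loss   0.2    (= 2*eps, matching the 0-1 regret of 2*eps)
```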

Can we improve on hinge loss? I don’t know any proxy loss which is quantitatively better, but generalizations exist. The regret of hinge loss is, up to a factor of 2, the same as for absolute value loss |y - f_w′(x)|, since for labels in {0,1} and predictions in [0,1] the hinge loss above is exactly twice the absolute value loss. One advantage of absolute value loss is that it has a known and sometimes useful semantics for values between 0 and 1: the optimal prediction is the median. This makes the work on quantile regression (Two Three) seem particularly relevant for machine learning.
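To illustrate the quantile semantics: absolute value loss is (up to scaling) the τ = 0.5 case of the pinball loss used in quantile regression, whose minimizer in expectation is the τ-th quantile. A small sketch, with an arbitrary distribution and search grid:

```python
import numpy as np

def pinball_loss(y, f, tau):
    # quantile regression loss; tau = 0.5 gives 0.5*|y - f|, minimized at the median
    return tau * np.maximum(0.0, y - f) + (1 - tau) * np.maximum(0.0, f - y)

rng = np.random.default_rng(0)
ys = rng.exponential(size=100_000)
fs = np.linspace(0.0, 5.0, 501)
best_f = fs[int(np.argmin([pinball_loss(ys, f, 0.9).mean() for f in fs]))]
print(best_f, np.quantile(ys, 0.9))  # both close to -ln(0.1) ~ 2.30
```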

Key Scientific Challenges

Yahoo announced the Key Scientific Challenges program. There is a Machine Learning list I worked on and a Statistics list which Deepak worked on.

I’m hoping this is taken quite seriously by graduate students. The primary value is that it gave us a chance to sit down and publicly specify directions of research which would be valuable to make progress on. A good strategy for a beginning graduate student is to pick one of these directions, pursue it, and make substantial advances for a PhD. The directions are sufficiently general that I’m sure any serious advance has applications well beyond Yahoo.

A secondary point (which I’m sure is primary for many 🙂) is that there is money for graduate students here. It’s unrestricted, so you can use it for any reasonable travel, supplies, etc.

Nearly all natural problems require nonlinearity

One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. This conventional wisdom appears unsupported by empirical evidence as far as I can tell. In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation on hand crafted features capturing substantial nonlinearities or learning directly on nonlinear representations.

There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail.
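To make the “linear predictor on an enormous sparse feature space” view concrete, here is a minimal sketch (the feature encoding and function names are illustrative, not anything from Hal’s post):

```python
from collections import defaultdict

def ngram_features(tokens, n=3):
    # one sparse feature per n-gram; the feature space is enormous,
    # but any single example activates only a handful of coordinates
    feats = defaultdict(float)
    for i in range(len(tokens) - n + 1):
        feats[" ".join(tokens[i:i + n])] += 1.0
    return feats

def linear_score(weights, feats):
    # a plain linear predictor: score = sum_i w_i x_i over the active features
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

print(ngram_features("the cat sat on the mat".split()))
```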

In contrast, if you go to a machine learning conference, a large number of the new algorithms are variations of learning on a linear representation. This claim should be understood broadly to include (for example) kernel methods, random projection methods, and more traditionally linear representations such as the perceptron. A basic question is: Why is the study of linear representations so prevalent?

There are several reasons for investigating the linear viewpoint.

  1. Linear learning is sufficient. As discussed above, this is really only true in practice if you have sufficiently capable humans hand-engineering features. On one hand, there is a compelling directness to that approach, but on the other it’s not the kind of approach which transfers well to new problems.
  2. Linear learning is a compelling primitive. Many of the effective approaches for nonlinear learning use some combination of linear primitives connected by nonlinearities to make a final prediction (see the sketch after this list). As such, there is a plausible hope that improvements in linear learning can be applied repeatedly in these more complex structures.
  3. Linear learning is the only thing tractable, empirically. This has a grain of truth to it, but it appears uncompelling when you get down to the nitty-gritty details. On a dataset large enough to require efficient algorithms, you often want to use online learning. And when you use online learning with a pure linear representation, the limiting factor is the speed at which data can be pulled into the CPU from the network or the disk. If you aren’t doing something more interesting than plain vanilla linear prediction, you are wasting most of your CPU cycles.
  4. Linear learning is the only thing tractable, theoretically. There are certainly many statements and guarantees that we only know how to make with linear representations and (typically) convex losses. However, there are fundamental limits to the extent that a well understood tool can be misused, and it’s important to understand that these theorems do not (and cannot) say that learning on a linear representation will solve some concrete problem like (say) face recognition from 10000 labeled examples. In addition, there are some analysis methods which apply to nonlinear learning systems—my favorite example is learning reductions, but there are others also.
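A minimal sketch of the “linear primitives connected by nonlinearities” idea from point 2, using a toy two-layer predictor (all sizes and initializations here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)  # 16 linear primitives over 8 inputs
w2, b2 = rng.normal(size=16), 0.0                # one more linear primitive on top

def predict(x):
    h = np.maximum(0.0, W1 @ x + b1)  # linear maps composed with a simple nonlinearity
    return float(w2 @ h + b2)

print(predict(rng.normal(size=8)))
```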

Some of the reasons for linear investigations appear sound, while others are simply variants of “looking where the light is”, which comes from an often retold story:
At night you see someone searching the ground under a streetlight.
You ask, “What happened?”
They say, “I’m looking for the keys I dropped in the bushes.”
“But there aren’t any bushes where you are searching.”
“Yes, but I can’t see over there.”

Netflix prize within epsilon

The competitors for the Netflix Prize are tantalizingly close to winning the million dollar prize. This year, BellKor and Commendo Research sent a combined solution that won the progress prize. Reading the writeups is instructive. Several aspects of the solutions are taken for granted, including stochastic gradient descent, ensemble prediction, and targeting residuals (a form of boosting). Relative to last year, it appears that many approaches have added parameterizations, especially for the purpose of modeling through time.

The big question is: will they make the big prize? At this point, the level of complexity involved in entering the competition is prohibitive, so perhaps only the existing competitors will continue to try. (This equation might change drastically if the teams open source their existing solutions, including parameter settings.) One fear is that the progress is asymptoting on the wrong side of the 10% threshold. In the first year, the teams progressed through 84.3% of the 10% gap, and in the second year, they progressed through just 64.4% of the remaining gap. While these numbers suggest an asymptote on the wrong side, in the month since the progress prize another 34.0% of the remaining gap has been closed. It’s remarkable that it’s too close to call, with just a 0.0035 RMSE gap to win the big prize. Clever people finding just the right parameterization might very well succeed.
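To make the arithmetic concrete, here is a quick check of those percentages, assuming Cinematch’s published baseline RMSE of roughly 0.9514 (a figure not stated in the post), so the full 10% gap is about 0.0951 RMSE:

```python
gap = 0.0951                        # assumed: 10% of Cinematch's ~0.9514 RMSE
after_y1 = gap * (1 - 0.843)        # 84.3% of the gap closed in year one
after_y2 = after_y1 * (1 - 0.644)   # 64.4% of the remainder closed in year two
now = after_y2 * (1 - 0.340)        # another 34.0% of the remainder since the progress prize
print(round(after_y1, 4), round(after_y2, 4), round(now, 4))
# ~0.0149, ~0.0053, ~0.0035 -- matching the 0.0035 RMSE gap quoted above
```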