Not Posting

If you have been disappointed by the lack of a post for the last month, consider contributing your own (I’ve been busy+uninspired). Also, keep in mind that there is a community of machine learning blogs (see the sidebar).

Loss Function Semantics

Some loss functions have a meaning, which can be understood in a manner independent of the loss function itself.

  1. Optimizing squared loss l_sq(y,y') = (y-y')^2 means predicting the (conditional) mean of y.
  2. Optimizing absolute value loss l_av(y,y') = |y-y'| means predicting the (conditional) median of y. Variants can handle other quantiles. 0/1 loss for classification is a special case.
  3. Optimizing log loss l_log(y,y') = log(1/Pr_{z~y'}(z = y)) means minimizing the description length of y.

The semantics (= meaning) of the loss are made explicit by a theorem in each case. For squared loss, we can prove a theorem of the form:
For all distributions D over Y, if

y' = arg min_{y'} E_{y~D} l_sq(y,y')
then
y' = E_{y~D} y

Similar theorems hold for the other examples above, and they can all be extended to predictors of y’ for distributions D over a context X and a value Y.
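As a quick numerical sketch (my illustration, not part of the post), minimizing the empirical risk over a grid of constant predictions recovers these semantics directly. On a skewed distribution like the exponential, the squared loss minimizer lands on the mean, the absolute value loss minimizer on the median, and the quantile variant on the corresponding quantile:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(size=100_000)   # skewed, so the mean and median differ

grid = np.linspace(0.0, 5.0, 2001)  # candidate constant predictions y'
sq_risk = np.array([np.mean((y - p) ** 2) for p in grid])
av_risk = np.array([np.mean(np.abs(y - p)) for p in grid])
# The quantile ("pinball") variant of absolute value loss, here for q = 0.9.
q = 0.9
pin_risk = np.array([np.mean(np.where(y > p, q * (y - p), (1 - q) * (p - y)))
                     for p in grid])

print(grid[sq_risk.argmin()], y.mean())            # both ~1.00: the mean
print(grid[av_risk.argmin()], np.median(y))        # both ~0.69: the median
print(grid[pin_risk.argmin()], np.quantile(y, q))  # both ~2.30: the 0.9 quantile
```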

There are 3 points to this post.

  1. Everyone doing general machine learning should be aware of the laundry list above. It forms a handy toolkit which matches many of the problems naturally encountered.
  2. People also try to optimize a variety of other loss functions. Some of these are (effectively) a special case of the above. For example, “hinge loss” is absolute value loss when the hinge point is at the upper range. Some of the other losses do not have any known semantics. In this case, discovering a semantics could be quite valuable.
  3. The natural direction when thinking about how to solve a problem is to start with the semantics you want and then derive a loss. I don’t know of any general way to do this other than simply applying the laundry list above. As one example, what is a loss function for estimating the mean of a random variable y over the 5th to 95th quantile? (How do we do squared error regression which is insensitive to outliers?) Which semantics are satisfiable with a loss?

The Missing Bound

Sham Kakade points out that we are missing a bound.

Suppose we have m samples x drawn IID from some distribution D. Through the magic of the exponential moment method, we know that:

  1. If the range of x is bounded by an interval of size I, a Chernoff/Hoeffding style bound gives us a bound on the deviations like O(I/m^0.5) (at least in crude form). A proof is on page 9 here.
  2. If the range of x is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). This can be a huge improvement when the true variance is small.
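To make the comparison concrete, here is a small sketch (mine, not the post’s) of both deviation widths. I use Bernstein’s inequality, a closed-form relaxation of Bennett’s bound, since it inverts cleanly in the confidence parameter; the true variance is assumed known, as the bound requires:

```python
from math import log, sqrt

def hoeffding_width(interval, m, delta):
    """One-sided Hoeffding deviation: O(I / m^0.5), needs only the range."""
    return interval * sqrt(log(1 / delta) / (2 * m))

def bernstein_width(var, b, m, delta):
    """One-sided Bernstein deviation (a closed-form relaxation of Bennett's
    bound): needs the true variance var and a range bound b on |x - E[x]|."""
    return sqrt(2 * var * log(1 / delta) / m) + 2 * b * log(1 / delta) / (3 * m)

m, delta = 10_000, 0.05
print(hoeffding_width(1.0, m, delta))        # ~0.0122 for x in [0, 1]
print(bernstein_width(0.01, 1.0, m, delta))  # ~0.0026 when the variance is 0.01
```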

What’s missing here is a bound that depends on the observed variance rather than a bound on the variance. In its absence, many people attempt to use Bennett’s bound incorrectly, plugging the observed variance in as the true variance, which invalidates the application of the bound. Most of the time they get away with it, but in machine learning this is a dangerous move. We are typically trying to find a predictor with 0 expected loss, and an observed loss of 0 (i.e. 0 training error) implies an observed variance of 0. Plugging this into Bennett’s bound, you can construct a wildly overconfident bound on the expected loss.

One safe way to apply Bennett’s bound is to use McDiarmid’s inequality to bound the true variance given an observed variance, and then plug this bound on the true variance into Bennett’s bound on the mean (making sure to split the confidence parameter between the two applications). This is a clumsy and relatively inelegant method.
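Here is a sketch of that two-step route under my own crude constants: for x in [0,1], the empirical variance satisfies a bounded-differences condition with constant roughly 3/m, so McDiarmid gives an upper confidence bound on its expectation; after correcting for the bias of the empirical variance and splitting the confidence parameter, a valid variance bound can be fed into the Bernstein relaxation of Bennett:

```python
from math import log, sqrt

def safe_mean_upper_bound(xs, delta):
    """Upper-bound E[x] with probability >= 1 - delta, for xs drawn IID
    from a distribution on [0,1] (m >= 2), using only observed quantities.
    Step 1: McDiarmid upper-bounds the true variance from the observed
    variance; the bounded-differences constant 3/m is my crude estimate.
    Step 2: the variance bound is plugged into a Bernstein-style deviation.
    The confidence parameter is split: delta/2 for each step."""
    m = len(xs)
    u = sum(xs) / m                           # observed mean u(S)
    v = sum((x - u) ** 2 for x in xs) / m     # observed variance v(S)

    # Step 1: w.p. >= 1 - delta/2, E[v] <= v + 3 * sqrt(log(2/delta) / (2m)),
    # and E[v] = (m-1)/m * true_variance, so:
    var_bound = (v + 3 * sqrt(log(2 / delta) / (2 * m))) * m / (m - 1)

    # Step 2: Bernstein deviation using the variance bound (range b = 1).
    eps = sqrt(2 * var_bound * log(2 / delta) / m) + 2 * log(2 / delta) / (3 * m)
    return u + eps
```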

There should exist a better bound. If we let the observed mean of a sample S be u(S) and the observed variance be v(S), there should exist a bound which requires only a bounded range (like Chernoff), yet which is almost as tight as the Bennett bound. It should have the form:

Pr_{S ~ D^m} ( E_{x~D} x <= f(u(S), v(S), δ) ) >= 1 - δ

For machine learning, a bound of this form may help design learning algorithms which learn by directly optimizing bounds. However, there are many other applications both within and beyond machine learning.

(*) Incidentally, sometimes people try to apply the Bennett inequality when they only know the range of the random variable by computing the worst case variance within that range. This is never as good as a proper application of the Chernoff/Hoeffding bound.

Conditional Tournaments for Multiclass to Binary

This problem has been cracked (but not quite completely solved) by Alina, Pradeep, and me. The problem is essentially finding a better way to reduce multiclass classification to binary classification. The solution is to use a carefully crafted tournament, the simplest version of which is a single elimination tournament where the “players” are the different classes. An example of the structure is here:

[Figure: a single elimination tournament over the class labels]

For the single elimination tournament, we can prove that:
For all multiclass problems D, for all learned binary classifiers c, the regret of an induced multiclass classifier is bounded by the regret of the binary classifier times log_2 k. Restated:

reg_multiclass(D, Filter_tree_test(c)) <= log_2(k) * reg_binary(Filter_tree_train(D), c)

Here:

  1. Filter_tree_train(D) is the induced binary classification problem.
  2. Filter_tree_test(c) is the induced multiclass classifier.
  3. reg_multiclass is the multiclass regret (= difference between error rate and minimum possible error rate).
  4. reg_binary is the binary regret.

This result has a slight dependence on k which we suspect is removable. The current conjecture is that this dependence can be removed by using higher order tournaments such as double elimination, triple elimination, up to log_2 k-elimination.

The key insight which makes the result possible is conditionally defining the prediction problems at interior nodes. In essence, we use the learned classifiers from the first level of the tree to filter the distribution over examples reaching the second level of the tree. This process repeats, until the root node is reached. Further details, including a more precise description and some experimental results are in the draft paper.
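To illustrate the test-time half (Filter_tree_test in the notation above), here is a minimal sketch under my own naming conventions: classifiers is assumed to map a (round, match) pair to a learned binary predictor, and an odd player out simply gets a bye:

```python
def filter_tree_predict(x, classifiers, k):
    """Predict a class for x by running a single elimination tournament
    over the k class labels.  classifiers[(rnd, match)] is a (hypothetical)
    learned binary predictor: it returns 1 if the right player should win
    the match and 0 if the left player should."""
    survivors = list(range(k))
    rnd = 0
    while len(survivors) > 1:
        next_round = []
        for match in range(0, len(survivors) - 1, 2):
            left, right = survivors[match], survivors[match + 1]
            winner = right if classifiers[(rnd, match // 2)](x) else left
            next_round.append(winner)
        if len(survivors) % 2 == 1:   # odd player out gets a bye
            next_round.append(survivors[-1])
        survivors = next_round
        rnd += 1
    return survivors[0]
```

The conditional filtering described above happens at training time: the examples used to train a node’s classifier are those passed upward by the winners of the rounds below it.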

The Coming Patent Apocalypse

Many people in computer science believe that patents are problematic. The truth is even worse—the patent system in the US is fundamentally broken in ways that will require much more significant reform than is being considered now.

The myth of the patent is the following: Patents are a mechanism for inventors to be compensated according to the value of their inventions while making the invention available to all. This myth sounds pretty desirable, but the reality is a strange distortion slowly leading towards collapse.

There are many problems associated with patents, but I would like to focus on just two of them:

  1. Patent Trolls The way that patents have generally worked over the last several decades is that they were a tool of large companies. Large companies would amass a large number of patents and then cross-license each other’s patents—in effect saying “we agree to owe each other nothing”. Smaller companies would sometimes lose in this game, essentially because they didn’t have enough patents to convince the larger companies that cross-licensing was a good idea. However, they didn’t necessarily lose, because small companies are also doing fewer things which makes their patent violation profile smaller.

    The patent trolls arrived recently. They are a new breed of company which does nothing but produce patents and collect money from them. The thing which distinguishes patent troll companies is that they have no patent violation profile: a company with a large number of patents cannot credibly threaten to sue a troll into cross-licensing, because the troll doesn’t do anything which might violate a patent.

    The best example (and proof that this method works) is NTP, which extracted $612.5M from RIM. Although this is the big case with lots of publicity, the process of extracting money goes on constantly, all around you, in backroom negotiations with the companies that actually do things. In effect, patent trolls impose an invisible tax, levied by companies that don’t do things on companies that do. Restated another way, patent trolls are akin to tax loophole exploiters, except that they exploit the law to make money rather than simply to avoid losing it. Smaller companies are particularly prone to lose, because they simply cannot afford the extreme legal fees associated with fighting even a winning battle, but even large companies are vulnerable to a patent troll.

    The other side of this argument is that patent trolls are simply performing a useful business function: employing researchers to come up with ideas or (at least) putting a floor on the value of ideas which they buy up through patents. Unfortunately, this is simply not true in my experience, due to the next problem.

  2. Combinatorial Ideas Patents are too easy. In fact, the process of coming up with patentable ideas is simply a matter of combinatorial application of existing ideas. This is a simple game that any reasonably intelligent person can play: you take idea 1 and idea 2, glue them together in any reasonable way, and get a patent.

    There are several reasons why the combinatorial application of existing ideas has become standard for patents.

    1. One of these is regulatory capture. It should surprise no one that the patent office, which gets paid for every patent application, has found a way to increase the number of patent applications.
    2. Another reason has to do with the way that patent law developed. Initially, patents were for processes that do things, rather than for ideas. The scope of patents has steadily extended over time, but the basic idea of patenting a process has been preserved. The fundamental problem is that processes can be created by the combinatorial application of ideas.

    The ease of patents is fundamentally valuable to patent troll type companies because they can acquire a large number of patents on processes which other companies accidentally violate.

The patent apocalypse happens when we project forward these two trends. Patents become ever easier to acquire and patent troll companies become ever more prevalent. In the end, every company which does something uses some obvious process that violates someone’s patent, and they have to pay at rates the patent owner chooses. There is no inherent bound on the number of patent troll type companies which can exist—they can multiply unchecked and drain money from every other company which does things until the system collapses.

I would like to make some positive suggestions here about how to reform the patent system, but it’s a hard mechanism design problem. Some obvious points are:

  1. Patents should have higher standards than research papers rather than substantially lower standards.
  2. The patent office should not make money from patents (this is not equivalent to saying that the patent applications should not be charged).
  3. The decision whether or not to grant a patent should weigh the cost of granting. Many private property aficionados think “patents are great, because they compensate inventors”, but there is a real cost to society in granting obvious patents: they block people from doing things in the straightforward way or create hidden liabilities.
  4. Patents should be far more readable if the “available for all” part of the myth is to be upheld. Published academic papers are substantially more readable (which isn’t all that high of a bar).

Patent troll companies have found a clever way to exhibit the flaws in the current patent system. Substantial patent reform to eliminate this style of company would benefit just about everyone, except for these companies.