Kolmogorov Complexity and Googling

Machine learning makes the New Scientist. From the article:

COMPUTERS can learn the meaning of words simply by plugging into Google. The finding could bring forward the day that true artificial intelligence is developed….
But Paul Vitanyi and Rudi Cilibrasi of the National Institute for Mathematics and Computer Science in Amsterdam, the Netherlands, realised that a Google search can be used to measure how closely two words relate to each other. For instance, imagine a computer needs to understand what a hat is.

You can read the paper at KC Google.
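
The measure in question is the normalized Google distance. Here is a hedged sketch (the formula is my reconstruction from the paper, and the hit counts below are invented inputs rather than real Google queries):

```python
import math

def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google distance between terms x and y, computed from
    page-hit counts: fx and fy are hits for each term alone, fxy is hits
    for both together, and n is the total number of indexed pages."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Terms that almost always co-occur score near 0; unrelated terms score
# near 1. These counts are made up purely for illustration.
print(ngd(fx=1e6, fy=2e6, fxy=9e5, n=1e10))  # ~0.09: closely related
print(ngd(fx=1e6, fy=2e6, fxy=3e2, n=1e10))  # ~0.96: barely related
```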

Hat tip: Kolmogorov Mailing List

Any thoughts on the paper?

NIPS: Online Bayes

One nice use for this blog is to consider and discuss papers that have appeared at recent conferences. I really enjoyed Andrew Ng and Sham Kakade’s paper Online Bounds for Bayesian Algorithms. From the paper:

The philosophy taken in the Bayesian methodology is often at odds with that in the online learning community…. the online learning setting makes rather minimal assumptions on the conditions under which the data are being presented to the learner —usually, Nature could provide examples in an adversarial manner. We study the performance of Bayesian algorithms in a more adversarial setting… We provide competitive bounds when the cost function is the log loss, and we compare our performance to the best model in our model class (as in the experts setting).

It’s a very nice analysis of some of my favorite algorithms, and it all hinges on a beautiful theorem:

Let Q be any distribution over parameters θ. Then for all sequences S:

L_{Bayes}(S) \leq L_Q(S) + KL(Q \| P)

where P is our prior, L_{Bayes}(S) is the cumulative log-loss of the Bayes algorithm run online on S, and L_Q(S) is the expected cumulative log-loss when the parameters are drawn from the arbitrary distribution Q.
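
To make the statement concrete, here is a small numerical sketch (entirely my own construction, not from the paper): a finite experts setting where each expert predicts a fixed probability of heads, the prior P is uniform, and Q is taken to be a point mass on the best expert, so KL(Q||P) = log(number of experts).

```python
import numpy as np

rng = np.random.default_rng(0)
experts = np.array([0.1, 0.5, 0.9])           # each expert's fixed P(y=1)
prior = np.ones(len(experts)) / len(experts)  # uniform prior P

ys = rng.binomial(1, 0.8, size=200)           # a sequence S of coin flips

weights = prior.copy()
bayes_loss = 0.0
expert_loss = np.zeros(len(experts))
for y in ys:
    p1 = weights @ experts                    # Bayes predictive P(y=1)
    bayes_loss += -np.log(p1 if y == 1 else 1.0 - p1)
    like = np.where(y == 1, experts, 1.0 - experts)
    expert_loss += -np.log(like)              # each expert's own log loss
    weights = weights * like                  # online posterior update
    weights /= weights.sum()

# Point-mass Q on the best expert: KL(Q||P) = log(number of experts)
bound = expert_loss.min() + np.log(len(experts))
print(f"Bayes: {bayes_loss:.2f}  best expert + KL: {bound:.2f}")
assert bayes_loss <= bound + 1e-9             # the theorem's guarantee
```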

Any thoughts? Any other papers you think we should read?

Watchword: Loss

A loss function is some function which, for any example, takes a prediction and the correct answer, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example, such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically, we try to find predictors that minimize loss.
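
As a minimal sketch of the definition (the function names here are mine, purely illustrative):

```python
# Two problem-determined losses: each maps (prediction, truth) for a
# single example to a nonnegative cost.
def zero_one_loss(prediction: int, truth: int) -> float:
    """Binary/multiclass classification: 1 if wrong, 0 if right."""
    return 0.0 if prediction == truth else 1.0

def squared_loss(prediction: float, truth: float) -> float:
    """l2 regression: the squared difference."""
    return (prediction - truth) ** 2

# A predictor is then judged by its total (or average) loss over examples.
```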

There seems to be a strong dichotomy between two views of what “loss” means in learning.

  1. Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer.
  2. Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SVMs), “log loss” (common in Bayesian Learning), and “exponential loss” (one incomplete explanation of boosting); a sketch of each follows this list. One advantage of this viewpoint is that an appropriate choice of loss function (such as any of the above) results in a (relatively tractable) convex optimization problem.
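
Here is a sketch of those three losses-of-convenience, written (as is standard) as functions of the margin m = y * f(x) for a label y in {-1, +1} and real-valued prediction f(x); the code is my own illustration:

```python
import math

def hinge_loss(margin: float) -> float:
    """SVM hinge loss: zero once the margin exceeds 1."""
    return max(0.0, 1.0 - margin)

def log_loss(margin: float) -> float:
    """Logistic/log loss: a smooth convex upper bound on 0-1 error."""
    return math.log(1.0 + math.exp(-margin))

def exp_loss(margin: float) -> float:
    """Exponential loss, as in one explanation of boosting."""
    return math.exp(-margin)

# All three are convex in the margin, which is what makes the resulting
# optimization problems (relatively) tractable.
```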

I don’t fully understand the second viewpoint. It seems (to some extent) like looking where the light is rather than where your keys fell on the ground. Many of these losses-of-convenience also behave unlike real-world problems. For example, in this contest, somebody would have been the winner except that they happened to predict one example incorrectly with very low probability. Under log loss, their loss became very high. This does not seem to correspond to the intuitive notion of what the loss should be on the problem.
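
A made-up numerical illustration of that failure mode:

```python
import math

good = -math.log(0.99)     # ~0.01: a confident, correct prediction
mistake = -math.log(1e-9)  # ~20.7: one confident error on a single example
print(f"one mistake costs as much as {mistake / good:.0f} good predictions")
# -> roughly 2000 good answers wiped out by a single confident error
```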

Watchword: Assumption

“Assumption” is another word to be careful with in machine learning because it is used in several ways.

  1. Assumption = Bias. There are several ways to see that some form of ‘bias’ (= preferring one solution over another) is necessary. This is obvious in an adversarial setting. A good bit of work has been expended explaining this in other settings with “no free lunch” theorems. This usage is specialized to learning and is particularly common when talking about priors for Bayesian Learning.
  2. Assumption = “if” of a theorem. The assumptions are the ‘if’ part of the ‘if-then’ in a theorem. This is a fairly common usage.
  3. Assumption = Axiom. The assumptions are the things that we assume are true, but which we cannot verify. Examples are “the IID assumption” or “my problem is a DNF on a small number of bits”. This is the usage which I prefer.

One difficulty with any use of the word “assumption” is that you often encounter “if assumption then conclusion, so if not assumption then not conclusion”. This is incorrect logic (the fallacy of denying the antecedent). For example, with variant (1), “the assumption of my prior is not met, so the algorithm will not learn”. Or, with variant (3), “the data is not IID, so my learning algorithm designed for IID data will not work”. In each of these cases, “will” must be replaced with “may” for correctness.