January 2005 – Machine Learning (Theory)

1/31/20052/4/2005

Watchword: Assumption

“Assumption” is another word to be careful with in machine learning because it is used in several ways.

Assumption = Bias There are several ways to see that some form of ‘bias’ (= preferring of one solution over another) is necessary. This is obvious in an adversarial setting. A good bit of work has been expended explaining this in other settings with “no free lunch” theorems. This is a usage specialized to learning which is particularly common when talking about priors for Bayesian Learning.
Assumption = “if” of a theorem The assumptions are the ‘if’ part of the ‘if-then’ in a theorem. This is a fairly common usage.
Assumption = Axiom The assumptions are the things that we assume are true, but which we cannot verify. Examples are “the IID assumption” or “my problem is a DNF on a small number of bits”. This is the usage which I prefer.

One difficulty with any use of the word “assumption” is that you often encounter “if assumption then conclusion so if not assumption then not conclusion“. This is incorrect logic. For example, with variant (1), “the assumption of my prior is not met so the algorithm will not learn”. Or, with variant (3), “the data is not IID, so my learning algorithm designed for IID data will not work”. In each of these cases “will” must be replaced with “may” for correctness.

1/27/20051/27/2005

Learning Complete Problems

Let’s define a learning problem as making predictions given past data. There are several ways to attack the learning problem which seem to be equivalent to solving the learning problem.

Find the Invariant This viewpoint says that learning is all about learning (or incorporating) transformations of objects that do not change the correct prediction. The best possible invariant is the one which says “all things of the same class are the same”. Finding this is equivalent to learning. This viewpoint is particularly common when working with image features.
Feature Selection This viewpoint says that the way to learn is by finding the right features to input to a learning algorithm. The best feature is the one which is the class to predict. Finding this is equivalent to learning for all reasonable learning algorithms. This viewpoint is common in several applications of machine learning. See Gilad’s and Bianca’s comments.
Find the Representation This is almost the same as feature selection, except internal to the learning algorithm rather than external. The key to learning is viewed as finding the best way to process the features in order to make predictions. The best representation is the one which processes the features to produce the correct prediction. This viewpoint is common for learning algorithm designers.
Find the Right Kernel The key to learning is finding the “right” kernel. The optimal kernel is the one for which K(x, z)=1 when x and z have the same class and 0 otherwise. With the right kernel, an SVM(or SVM-like optimization process) can solve any learning problem. This viewpoint is common for people who work with SVMs.
Find the Right Metric The key to learning is finding the right metric. The best metric is one which states that features with the same class label have distance 0 while features with different class labels have distance 1. With the best metric, the nearest neighbor algorithm can solve any problem.

Each of these viewpoints seems to be “right”, and each seems to have some utility in it’s context. It also seems important to realize that these are just different versions of the same problem. One consequence of this observation is that “wrapper methods” which try to automatically find a subset of features to feed into a learning algorithm in order to improve learning performance are simply trying to repair weaknesses in the learning algorithm.

1/26/20052/4/2005

Watchword: Probability

Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used.

Bayesian The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (i.e. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent better will switch from ‘for’ to ‘against’ at some single value of Y. The probability is then Y/(Y+1). Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge and expressing that lack flexibly makes the learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively.
Frequentist The Frequentist notion of probability is a rate of occurence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in n experiments then it has probability about k/n. Frequentist probabilities can be used to measure how sure you are about something. They may be appropriate in a learning context for measuring confidence in various predictors. The frequentist notion of probability is common in physics, other sciences, and computer science theory.
Estimated The estimated notion of probability is measured by running some learning algorithm which predicts the probability of events rather than events. I tend to dislike this use of the word because it confuses the world with the model of the world.

To avoid confusion, you should be careful to understand what other people mean for this word. It is helpful to always be explicit about which variables are randomized and which are constant whenever probability is used because Bayesian and Frequentist probabilities commonly switch this role.

1/26/20051/26/2005

Summer Schools

There are several summer schools related to machine learning.

We are running a two week machine learning summer school in Chicago, USA May 16-27.

IPAM is running a more focused three week summer school on Intelligent Extraction of Information from Graphs and High Dimensional Data in Los Angeles, USA July 11-29.

A broad one-week school on analysis of patterns will be held in Erice, Italy, Oct. 28-Nov 6.

1/24/20051/25/2005

Holy grails of machine learning?

Let me kick things off by posing this question to ML researchers:

What do you think are some important holy grails of machine learning?

For example:
– “A classifier with SVM-level performance but much more scalable”
– “Practical confidence bounds (or learning bounds) for classification”
– “A reinforcement learning algorithm that can handle the ___ problem”
– “Understanding theoretically why ___ works so well in practice”
etc.

I pose this question because I believe that when goals are stated explicitly and well (thus providing clarity as well as opening up the problems to more people), rather than left implicit, they are likely to be achieved much more quickly. I would also like to know more about the internal goals of the various machine learning sub-areas (theory, kernel methods, graphical models, reinforcement learning, etc) as stated by people in these respective areas. This could help people cross sub-areas.