Machine Learning – Page 64 – Machine Learning (Theory)

3/2/2006

Why do people count for learning?

This post is about a confusion of mine with respect to many commonly used machine learning algorithms.

A simple example where this comes up is Bayes net prediction. A Bayes net is a directed acyclic graph over a set of nodes, where each node is associated with a variable and the edges indicate dependence. The joint probability distribution over the variables is given by a set of conditional probabilities. For example, a very simple Bayes net might express:
P(A,B,C) = P(A | B,C)P(B)P(C)

What I don’t understand is the mechanism commonly used to estimate P(A | B, C). If we let N(A,B,C) be the number of instances of A,B,C then people sometimes form an estimate according to:

P'(A | B,C) = N(A,B,C) / N /[N(B)/N * N(C)/N] = N(A,B,C) N /[N(B) N(C)]
… in other words, people just estimate P'(A | B,C) according to observed relative frequencies. This is a reasonable technique when you have a large number of samples compared to the size of the space A x B x C, but it (naturally) falls apart when this is not the case, as typically happens with “big” learning problems such as machine translation, vision, etc…

To compensate, people often try to pick some prior (such as a Dirichlet prior with one “virtual count” per joint parameter setting) to provide a reasonable default value for the count. Naturally, in the “big learning” situations where this applies, the precise choice of prior can greatly affect the system performance, leading to finicky tuning of various sorts. It’s also fairly common to fit some parametric model (such as a Gaussian) in an attempt to predict A given B and C.
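As a minimal sketch of the two estimators just described, assuming discrete variables and a list of observed (a, b, c) tuples: count_estimate follows the relative-frequency formula above, and dirichlet_estimate adds one virtual count per joint setting. The function names, the toy data, and the choice of the joint count N(B,C) as the normalizer in the smoothed version are assumptions made for illustration, not taken from any particular system.

```python
def count_estimate(samples, a, b, c):
    """Relative-frequency estimate N(A,B,C) N / [N(B) N(C)]."""
    n = len(samples)
    n_abc = sum(1 for s in samples if s == (a, b, c))
    n_b = sum(1 for s in samples if s[1] == b)
    n_c = sum(1 for s in samples if s[2] == c)
    if n_b == 0 or n_c == 0:
        return None  # nothing observed for this conditioning event
    return n_abc * n / (n_b * n_c)


def dirichlet_estimate(samples, a, b, c, a_values, virtual=1.0):
    """Smoothed estimate (N(a,b,c) + virtual) / (N(b,c) + virtual * |A|).

    Assumption: the normalizer is the joint count N(b,c).
    """
    n_abc = sum(1 for s in samples if s == (a, b, c))
    n_bc = sum(1 for s in samples if s[1:] == (b, c))
    return (n_abc + virtual) / (n_bc + virtual * len(a_values))


# Toy (a, b, c) observations; the setting (b, c) = (1, 1) never occurs.
samples = [(1, 0, 0), (1, 0, 0), (0, 0, 1), (1, 1, 0)]

print(count_estimate(samples, 1, 1, 1))              # 0.0 for every value of a
print(dirichlet_estimate(samples, 1, 1, 1, [0, 1]))  # falls back to the default 0.5
```

For the unseen setting the raw counts give zero for every value of A, while the smoothed version returns the prior's default, which is the "memorize or use a default" behavior in question.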

Stepping back a bit, we can think of the estimation of P(A | B, C) as a simple self-contained prediction (sub)problem. Why don’t we use existing technology for doing this prediction? Viewed as a learning algorithm, “counting with a Dirichlet prior” is exactly memorizing the training set and then predicting according to either (precisely) matching training set elements or using a default. It’s hard to imagine a more primitive learning algorithm.

There seems to be little downside to trying this approach. In low count situations, a general purpose prediction algorithm has a reasonable hope of performing well. In a high count situation, any reasonable general purpose algorithm converges to the same estimate as above. In either case something reasonable happens.
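To make the suggestion concrete, here is a rough sketch that treats the estimation of P(A | B, C) as an ordinary prediction problem. Logistic regression stands in for the "general purpose prediction algorithm"; that choice, the additive feature encoding, the function names, and the toy data are all assumptions for illustration. Any probabilistic predictor could take its place.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def features(b, c):
    # Hypothetical encoding: a bias term plus one indicator each for B and C.
    return [1.0, float(b), float(c)]


def train_logistic(examples, steps=5000, lr=0.5):
    """Fit weights by batch gradient descent on the average log loss."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for (b, c), a in examples:
            x = features(b, c)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i, xi in enumerate(x):
                grad[i] += (p - a) * xi
        w = [wi - lr * g / len(examples) for wi, g in zip(w, grad)]
    return w


def predict(w, b, c):
    """Estimated P(A = 1 | B = b, C = c)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(b, c))))


# Toy data as ((b, c), a) pairs; the setting (b, c) = (1, 1) is never observed.
examples = (
    [((0, 0), 0)] * 3 + [((0, 0), 1)] * 1
    + [((1, 0), 1)] * 2 + [((1, 0), 0)] * 1
    + [((0, 1), 1)] * 2 + [((0, 1), 0)] * 1
)
w = train_logistic(examples)
print(predict(w, 0, 0))  # close to the observed frequency 1/4
print(predict(w, 1, 1))  # a non-default guess for the unseen setting
```

On the observed settings the fitted model reproduces the relative frequencies, and on the unseen one it extrapolates from B and C separately rather than falling back to a fixed default. Whether that extrapolation is sensible depends on the additive assumption built into the features.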

Using a general purpose probabilistic prediction algorithm isn’t a new idea (for example, see page 57), but it appears greatly underutilized. This is a very small modification of existing systems with a real hope of dealing with low counts in {speech recognition, machine translation, vision}. It seems that using a general purpose probabilistic prediction algorithm should be the default rather than the exception.

2/27/2006

The Peekaboom Dataset

Luis von Ahn’s Peekaboom project has yielded data (830MB).

Peekaboom is the second attempt (after Espgame) to produce a dataset which is useful for learning to solve vision problems based on voluntary game play. As a second attempt, it is meant to address all of the shortcomings of the first attempt. In particular:

The locations of specific objects are provided by the data.
The data collection is far more complete and extensive.

The data consists of:

The source images. (1 file per image, just short of 60K images.)
The in-game events. (1 file per image, in a lispy syntax.)
A description of the event language.

There is a great deal of very specific and relevant data here so the hope that this will help solve vision problems seems quite reasonable.

2/24/2006

A Fundamentalist Organization of Machine Learning

There are several different flavors of Machine Learning classes. Many classes are of the ‘zoo’ sort: many different learning algorithms are presented. Others avoid the zoo by not covering the full scope of machine learning.

This is my view of what makes a good machine learning class, along with why. I’d like to specifically invite comment on whether things are missing, misemphasized, or misplaced.

Phase	Subject	Why?
Introduction	What is a machine learning problem?	A good understanding of the characteristics of machine learning problems seems essential. Characteristics include: a data source, some hope the data is predictive, and a need for generalization. This is probably best taught in a case study manner: lay out the specifics of some problem and then ask “Is this a machine learning problem?”
Introduction	Machine Learning Problem Identification	Identification and recognition of the type of learning problem is (obviously) a very important step in solving such problems. People need to be familiar with the concepts of ‘regression’, ‘classification’, ‘cost sensitive classification’, ‘reinforcement learning’, etc… A good organization of these things is possible, but not yet well done.
Introduction	Example algorithm 1	To really understand machine learning, a couple of learning algorithms must be understood in detail.
Introduction	Example algorithm 2	Ditto. The reason why the number is “2” and not “1” or “3” is that 2 is the minimum number required to make people naturally aware of the degrees of freedom available in learning algorithm design.
Analysis	Bias for Learning	The need for a good bias is one of the defining characteristics of learning. This includes discussing the means to specify bias (via Bayesian priors, choice of features, graphical models, etc…). This statement is generic so it will always apply to one degree or another.
Analysis	Learning can be boosted.	This is the boosting observation: that it is possible to bootstrap predictive ability to create a better overall system. This statement is similarly generic.
Analysis	Learning can be transformed	This is the reductions observation: that the ability to solve one kind of learning problem implies the ability to solve other kinds of learning problems. This statement is similarly generic.
Analysis	Learning can be preserved	This is the online learning with experts observation: that we can have a master algorithm which preserves the best learning performance of subalgorithms. This statement is again generic.
Analysis	Overfitting	Learning algorithms can easily overfit to existing training data. How to analyze this (with an IID assumption), and how to avoid it are very important for success.
Analysis	Hardness of Learning	It turns out that there are several different ways in which machine learning can be hard including computational and information theoretic hardness. Some of PAC learning is relevant here. An understanding of how and why learning algorithms can fail seems important to understand the process.
Applications	Vision	One example of how learning is applied to solve vision problems.
Applications	Language	Ditto for language problems.
Applications	Robotics	Ditto for robotics
Applications	Speech	Ditto for speech
Applications	Businesses	Ditto for businesses
	Where is machine learning going?	Insert predictions of the future here. It should be understood that the field of machine learning is changing rapidly.

The emphasis here is on fundamentals: generally applicable mathematical statements and understandings of the learning problem. Given that emphasis, the ‘applications’ section could be cut without harming the integrity of the purpose.

2/18/2006

Multiplication of Learned Probabilities is Dangerous

This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is.

Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example.

When Naive Bayes is used, a set of probabilities of the form Pr'(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing:

Pr'(label) * Product_{features i} Pr'(feature i | label)

(The Pr’ notation indicates these are estimated values.)
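A small sketch of this prediction rule for binary features and binary labels, assuming one virtual count per outcome as the prior. The function names and the toy data are hypothetical. The product is computed as a sum of logarithms, which gives the same maximizing label while avoiding numerical underflow.

```python
import math


def train_naive_bayes(examples, n_features, virtual=1.0):
    """examples: list of (binary feature list, label in {0, 1})."""
    label_count = {0: 0, 1: 0}
    on_count = {0: [0] * n_features, 1: [0] * n_features}
    for x, y in examples:
        label_count[y] += 1
        for i, xi in enumerate(x):
            on_count[y][i] += xi
    total = len(examples)
    # Pr'(label), smoothed with `virtual` counts per label.
    prior = {y: (label_count[y] + virtual) / (total + 2 * virtual)
             for y in (0, 1)}
    # Pr'(feature i = 1 | label), smoothed the same way.
    cond = {y: [(on_count[y][i] + virtual) / (label_count[y] + 2 * virtual)
                for i in range(n_features)]
            for y in (0, 1)}
    return prior, cond


def predict_label(model, x):
    """Label maximizing log Pr'(label) + sum_i log Pr'(feature i | label)."""
    prior, cond = model

    def log_score(y):
        score = math.log(prior[y])
        for i, xi in enumerate(x):
            p = cond[y][i]
            score += math.log(p if xi else 1.0 - p)
        return score

    return max((0, 1), key=log_score)


examples = [([1, 0], 1), ([1, 1], 1), ([0, 1], 0), ([0, 0], 0)]
model = train_naive_bayes(examples, n_features=2)
print(predict_label(model, [1, 0]))
```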

There is nothing wrong with this method as long as (a) the prior for the sample counts is very strong and (b) the prior (on the conditional independences and the sample counts) is “correct”—the actual problem is drawn from it. However, (a) seems to never be true and (b) is often not true.

At this point, we can think a bit from an estimation perspective. When trying to estimate a coin with bias Pr(feature i | label), after observing m IID samples, the estimate is accurate to (at most) c/m for some constant c. (Actually, it’s c/m^0.5 in the general case and c/m for coins with bias near 0 or 1.) Given this observation, we should expect the estimates Pr’ to differ from the true values by c/m or more when the prior on the sample counts is weak.
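The two rates can be checked with the standard deviation of the observed frequency of a coin with bias p after m IID flips, which is (p(1-p)/m)^0.5. The snippet below is only a worked calculation under that standard formula; the function name and the particular values of m are illustrative.

```python
import math


def estimate_std(p, m):
    """Standard deviation of the observed frequency after m IID flips."""
    return math.sqrt(p * (1.0 - p) / m)


for m in (100, 10_000):
    fair = estimate_std(0.5, m)      # equals 0.5 / m**0.5
    rare = estimate_std(1.0 / m, m)  # just under 1 / m
    print(m, fair, rare)
```

For a fair coin the error shrinks like 1/m^0.5, while for a coin whose bias is about 1/m it is already on the order of 1/m.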

The problem to notice is that errors of c/m can quickly accumulate. The final product in the Naive Bayes classifier is n-way linear in the error terms, where n is the number of features. If every feature’s true value happens to be v and we happen to have a 1/2 + 1/n^0.5 fraction of the feature estimates too large and a 1/2 – 1/n^0.5 fraction too small (as might happen with a reasonable chance), the value of the product might be overestimated by:

(v – c/m)^{n/2 – n^0.5} (v + c/m)^{n/2 + n^0.5} – v^n
When c/m is very small, this approximates (up to constant factors) as c n^0.5 /m, which suggests problems must arise when the number of features n is greater than the number of samples squared (n > m^2). This can actually happen in the text classification settings where Naive Bayes is often applied.
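A quick numerical check of this approximation, as a sketch: n/2 + n^0.5 of the estimates are taken to be v + c/m and the remaining n/2 – n^0.5 to be v – c/m. The code reports the overestimate relative to v^n (a choice made here so the numbers stay readable) and compares it with the leading term 2 n^0.5 (c/m) / v, which matches c n^0.5 /m up to the constant 2/v. The values of v, c, and m are illustrative assumptions.

```python
import math


def relative_overestimate(v, eps, n):
    """(v + eps)^(n/2 + sqrt(n)) * (v - eps)^(n/2 - sqrt(n)) / v^n - 1.

    Computed through logarithms so that large n does not underflow.
    """
    s = math.sqrt(n)
    log_ratio = ((n / 2 + s) * math.log(v + eps)
                 + (n / 2 - s) * math.log(v - eps)
                 - n * math.log(v))
    return math.expm1(log_ratio)


def first_order(v, eps, n):
    """Leading term of the expansion in eps: 2 * sqrt(n) * eps / v."""
    return 2.0 * math.sqrt(n) * eps / v


v, c, m = 0.5, 1.0, 10_000  # illustrative values, not from the post
for n in (100, 10_000, 1_000_000):
    print(n, relative_overestimate(v, c / m, n), first_order(v, c / m, n))
```

With the per-feature error held fixed at c/m, the relative error of the product grows with the square root of the number of features, and the first-order term tracks the exact value while c/m stays small.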

All of the above is under the assumption that the conditional independences encoded in the Naive Bayes classifier are correct for the problem. When these aren’t correct, as is often true in practice, the estimation errors can be systematic rather than stochastic implying much more brittle behavior.

In all of the above, note that we used Naive Bayes as a simple example; this brittleness can be found in a number of other common prediction systems.

An important question is “What can you do about this brittleness?” There are several answers:

Use a different system for prediction (there are many).
Get much more serious about following Bayes’ law here. (a) The process of integrating over a posterior rather than taking the maximum likelihood element of a posterior tends to reduce the sampling effects. (b) Realize that the conditional independence assumptions producing the multiplication are probably excessively strong and design softer priors which better fit reasonable beliefs.

2/11/2006

Yahoo’s Learning Problems.

I just visited Yahoo Research which has several fundamental learning problems near to (or beyond) the set of problems we know how to solve well. Here are 3 of them.

Ranking: This is the canonical problem of all search engines. It is made extra difficult for several reasons.
1. There is relatively little “good” supervised learning data and a great deal of data with some signal (such as click through rates).
2. The learning must occur in a partially adversarial environment. Many people very actively attempt to place themselves at the top of rankings.
3. It is not even quite clear whether the problem should be posed as ‘ranking’ or as ‘regression’ which is then used to produce a ranking.
Collaborative filtering: Yahoo has a large number of recommendation systems for music, movies, etc… In these sorts of systems, users specify how they liked a set of things, and then the system can (hopefully) find some more examples of things they might like by reasoning across multiple such sets.
Exploration with Generalization: The cash cow of search engines is displaying advertisements which are relevant to search along with search results. Better targeting these advertisements makes money (a small improvement might be worth $millions) and improves the value of the search engine for the user.

It is natural to predict the set of advertisements which maximize the advertising payoff. This natural idea is stymied by both the extreme multiplicity of advertisements under contract (think millions) and a lack of ability to measure hypotheticals like “What would have happened if we had displayed a different set of advertisements for this (query,user) pair instead?” This is a combined exploration and generalization problem.

Good solutions to any of these problems would be extremely useful (and not just at Yahoo). Even further small improvements on the existing solutions may be very useful.

For those interested, Yahoo (as an organization) knows these are learning problems and is very actively interested in solving them. Yahoo Research is committed to a relatively open method of solving these problems. Dennis DeCoste is one contact point for machine learning research at Yahoo Research.