Yahoo’s Learning Problems.

I just visited Yahoo Research which has several fundamental learning problems near to (or beyond) the set of problems we know how to solve well. Here are 3 of them.

  1. Ranking. This is the canonical problem of all search engines. It is made extra difficult for several reasons.
    1. There is relatively little “good” supervised learning data and a great deal of data with some signal (such as click-through rates).
    2. The learning must occur in a partially adversarial environment. Many people very actively attempt to place themselves at the top of rankings.
    3. It is not even quite clear whether the problem should be posed as ‘ranking’ or as ‘regression’ which is then used to produce a ranking.
  2. Collaborative filtering. Yahoo has a large number of recommendation systems for music, movies, etc… In these sorts of systems, users specify how they liked a set of things, and then the system can (hopefully) find some more examples of things they might like by reasoning across multiple such sets (a toy factorization sketch follows this list).
  3. Exploration with Generalization. The cash cow of search engines is displaying advertisements relevant to a search along with the search results. Better targeting these advertisements makes money (a small improvement might be worth $millions) and improves the value of the search engine for the user.

    It is natural to predict the set of advertisements which maximize the advertising payoff. This natural idea is stymied by both the extreme multiplicity of advertisements under contract (think millions) and a lack of ability to measure hypotheticals like “What would have happened if we had displayed a different set of advertisements for this (query, user) pair instead?” This is a combined exploration and generalization problem; a toy sketch of the exploration side also follows this list.
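
To make the collaborative filtering item concrete, here is a minimal sketch of one common approach (low-rank matrix factorization fit by stochastic gradient descent). The ratings, dimensions, and hyperparameters are made up for illustration and have nothing to do with Yahoo’s actual systems.

```python
# Toy matrix-factorization collaborative filter (illustrative only).
# Ratings, dimensions, and hyperparameters are made up for the sketch.
import numpy as np

rng = np.random.default_rng(0)

# observed (user, item, rating) triples; 0-indexed ids, ratings in [1, 5]
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 0, 4.0), (1, 1, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 5.0), (3, 2, 4.0)]
n_users, n_items, rank = 4, 3, 2

# latent factors: user u likes item i roughly as U[u] . V[i]
U = 0.1 * rng.standard_normal((n_users, rank))
V = 0.1 * rng.standard_normal((n_items, rank))

lr, reg = 0.05, 0.01
for epoch in range(200):
    for u, i, r in ratings:
        err = r - U[u] @ V[i]              # prediction error on this rating
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * U[u] - reg * V[i])

# predict an unobserved (user, item) pair by the dot product of factors
print("predicted rating for user 0, item 1:", U[0] @ V[1])
```

The reasoning across multiple users happens implicitly: items rated similarly by overlapping sets of users end up with nearby latent vectors.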
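
And for the exploration-with-generalization item, here is a toy epsilon-greedy sketch of the exploration side: usually show the advertisement the current model predicts is best, occasionally show a random one so that counterfactuals get measured. The feature sizes, the click simulator, and epsilon are all invented for the example.

```python
# Toy epsilon-greedy exploration over ads with a linear payoff model per ad.
# Feature sizes, the click simulator, and epsilon are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_ads, dim, epsilon, lr = 5, 8, 0.1, 0.05

weights = np.zeros((n_ads, dim))             # one linear payoff estimator per ad
true_w = rng.standard_normal((n_ads, dim))   # hidden "world" used only to simulate clicks

def choose_ad(context):
    """Mostly exploit the current estimates, sometimes explore at random."""
    if rng.random() < epsilon:
        return int(rng.integers(n_ads))          # explore: we only learn about ads we show
    return int(np.argmax(weights @ context))     # exploit: predicted-best ad

for t in range(5000):
    context = rng.standard_normal(dim)       # stand-in for (query, user) features
    ad = choose_ad(context)
    click_prob = 1 / (1 + np.exp(-true_w[ad] @ context))
    reward = float(rng.random() < click_prob)
    # update only the chosen ad's model -- the payoff of unchosen ads stays unobserved
    pred = weights[ad] @ context
    weights[ad] += lr * (reward - pred) * context

print("learned vs true weights (ad 0):", weights[0][:3], true_w[0][:3])
```

The key point matches the problem statement: payoffs are only observed for the advertisements actually displayed, so some randomization is needed to estimate what would have happened otherwise.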

Good solutions to any of these problems would be extremely useful (and not just at Yahoo). Even small improvements over the existing solutions may be very useful.

For those interested, Yahoo (as an organization) knows these are learning problems and is very actively interested in solving them. Yahoo Research is committed to a relatively open method of solving these problems. Dennis DeCoste is one contact point for machine learning research at Yahoo Research.

Pittsburgh Mind Reading Competition

Francisco Pereira points out a fun Prediction Competition. Francisco says:

DARPA is sponsoring a competition to analyze data from an unusual functional Magnetic Resonance Imaging experiment. Subjects watch videos inside the scanner while fMRI data are acquired. Unbeknownst to these subjects, the videos have been seen by a panel of other subjects that labeled each instant with labels in categories such as representation (are there tools, body parts, motion, sound), location, presence of actors, emotional content, etc.

The challenge is to predict all of these different labels on an instant-by-instant basis from the fMRI data. A few reasons why this is particularly interesting:

  1. This is beyond the current state of the art, but not inconceivably hard.
  2. This is a new type of experiment design that current analysis methods cannot deal with.
  3. This is an opportunity to work with a heavily examined and preprocessed neuroimaging dataset.
  4. DARPA is offering prizes!

Research Budget Changes

The announcement of an increase in funding for basic research in the US is encouraging. There is some discussion of this at the Computing Research Policy blog.

One part of this discussion has a graph of NSF funding over time, presumably in dollar budgets. I don’t believe that dollar budgets are the right way to judge the impact of funding changes on researchers. A better way to judge seems to be dollar budget divided by GDP, which provides a measure of the relative emphasis on research.

This graph was assembled by dividing the NSF budget by the US GDP. For the 2005 GDP, I used the current estimate, and for 2006 and 2007 I assumed an increase by a factor of 1.04 per year. The 2007 number also uses the requested 2007 budget, which is certain to change.
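
For concreteness, the normalization is just a division with an extrapolated denominator; the numbers below are placeholders rather than the actual NSF budget or GDP figures.

```python
# Illustrative only: placeholder numbers, not the actual NSF budget or US GDP.
nsf_budget = {2005: 5.5e9, 2006: 5.6e9, 2007: 6.0e9}   # dollars (made up)
gdp = {2005: 12.4e12}                                   # current estimate (placeholder)
gdp[2006] = gdp[2005] * 1.04                            # assumed 4% growth per year
gdp[2007] = gdp[2006] * 1.04

for year in sorted(nsf_budget):
    print(year, f"{nsf_budget[year] / gdp[year]:.5%}")  # budget as a share of GDP
```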

This graph makes it clear why researchers were upset: research funding emphasis has fallen for 3 years in a row. The reality has been significantly more severe due to DARPA decreasing funding and industrial research labs (AT&T and Lucent, for example) laying off large numbers of researchers around the time the government’s emphasis on basic research started declining.

It is certainly encouraging to see the emphasis on science growing again.

Introspectionism as a Disease

In the AI-related parts of machine learning, it is often tempting to examine how you do things in order to imagine how a machine should do things. This is introspection, and it can easily go awry. I will call introspection gone awry introspectionism.

Introspectionism is almost unique to AI (and the AI-related parts of machine learning) and it can lead to huge wasted effort in research. It’s easiest to show how introspectionism arises by an example.

Suppose we want to solve the problem of navigating a robot from point A to point B given a camera. Then, the following research action plan might seem natural when you examine your own capabilities:

  1. Build an edge detector for still images.
  2. Build an object recognition system given the edge detector.
  3. Build a system to predict distance and orientation to objects given the object recognition system.
  4. Build a system to plan a path through the scene you construct from {object identification, distance, orientation} predictions.
  5. As the robot moves, constantly repeat the above steps.

Introspectionism begins when you believe this must be the way that it is done.

Introspectionist arguments are really arguments from a lack of imagination. It is like saying “This is the only way I can imagine doing things, so it must be the way they should be done.” This is a common weak argument style that can be very difficult to detect. It is particularly difficult to detect here because it is easy to confuse capability with reuse. Humans, via experimental tests, can be shown capable of executing each step above, but this does not imply they reuse these computations in the next step.

There are reasonable evolution-based reasons to believe that brains minimize the amount of computation required to accomplish goals. Computation costs energy, and since a human brain might consume 20% of the energy budget, we can be fairly sure that the evolutionary impetus to minimize computation is significant. This suggests telling a different energy-conservative story.

An energy-conservative version of the above example might look similar, but with very loose approximations.

  1. The brain might (by default) use a pathetically weak edge detector that is lazily refined into something more effective using time-sequenced images (since edges in moving scenes tend to stand out more).
  2. The puny edge detector might be used to fill a rough “obstacle-or-not” map that coarsens dramatically with distance.
  3. Given this, a decision about which direction to go next (rather than a full path) might be made.

This strategy avoids the need to build a good edge detector for still scenes, avoids the need to recognize objects, avoids the need to place them with high precision in a scene, and avoids the need to make a full path plan. All of these avoidances might result in more tractable computation or learning problems. Note that we can’t (and shouldn’t) say that the energy-conservative path “must” be right because that would also be introspectionism. However, it does exhibit an alternative, exposing the failure of imagination behind the introspectionist version of the first approach.
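
A toy rendering of this sort of strategy might look like the following. The grid, the coarse obstacle map, and the scoring rule are all invented for illustration; this is not a claim about how brains (or any real robot) actually work.

```python
# Toy "decide the next direction" sketch for the energy-conservative story above.
# The obstacle map, the horizon, and the scoring are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)

# A crude obstacle-or-not map around the robot (True = probably blocked).
# In the story this would be lazily filled by a weak, motion-aided edge detector
# and would coarsen with distance; here it is just random.
obstacle = rng.random((9, 9)) < 0.25
robot_r, robot_c = 8, 4                      # robot at the bottom, heading "up" the grid

def direction_score(direction):
    """Count apparently free cells in a rough cone toward one of three headings."""
    free = 0
    for step in range(1, 6):                 # look only a few cells ahead (coarse horizon)
        r = robot_r - step
        c = robot_c + direction * (step // 2)  # direction in {-1, 0, +1}: left, straight, right
        if 0 <= r < 9 and 0 <= c < 9 and not obstacle[r, c]:
            free += 1
    return free

best = max((-1, 0, 1), key=direction_score)
print("next move:", {-1: "veer left", 0: "go straight", 1: "veer right"}[best])
```

Nothing here recognizes objects, estimates precise distances, or plans a full path, which is exactly the point.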

It is reasonable to take introspection-derived ideas as suggestions for how to go about building a (learning) system. But if the suggestions don’t work, it’s entirely reasonable to try something else.

Should the Input Representation be a Vector?

Let’s suppose that we are trying to create a general purpose machine learning box. The box is fed many examples of the function it is supposed to learn and (hopefully) succeeds.

To date, most such attempts to produce a box of this form take a vector as input. The elements of the vector might be bits, real numbers, or ‘categorical’ data (a discrete set of values).

On the other hand, there are a number of successful applications of machine learning which do not seem to use a vector representation as input. For example, in vision, convolutional neural networks have been used to solve several vision problems. The input to the convolutional neural network is essentially the raw camera image as a matrix. In natural language learning, several people have had success on problems like part-of-speech tagging using predictors restricted to a window surrounding the word to be predicted.
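
As a heavily simplified illustration of the window idea for part-of-speech tagging, the features for each word might just be the words in a fixed-size window around it; the window radius and padding token below are arbitrary choices.

```python
# Minimal window-feature extraction for a tagging problem (illustrative only).
# Window radius and padding symbol are arbitrary choices for the sketch.
def window_features(words, position, radius=2, pad="<PAD>"):
    """Return the words within `radius` of `position`, padded at the edges."""
    padded = [pad] * radius + list(words) + [pad] * radius
    center = position + radius
    return padded[center - radius : center + radius + 1]

sentence = "the cat sat on the mat".split()
for i, word in enumerate(sentence):
    # a tagger would map this local window (not the whole sentence) to a tag
    print(word, "->", window_features(sentence, i))
```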

A vector window and a matrix both imply a notion of locality which is being actively and effectively used by these algorithms. In contrast, common algorithms like support vector machines, neural networks, and decision trees do not use or take advantage of locality in the input representation. For example, many of these algorithms are (nearly) invariant under permutation of input vector features.
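
One quick way to see the (near) permutation invariance, assuming scikit-learn is available: apply the same fixed permutation to the feature columns of the training and test sets and check that a decision tree’s accuracy is essentially unchanged. The dataset and sizes are arbitrary.

```python
# Near-invariance of a decision tree under a fixed permutation of feature columns.
# Dataset and sizes are arbitrary; numpy and scikit-learn are assumed available.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

perm = np.random.default_rng(0).permutation(X.shape[1])  # one fixed column shuffle

acc_original = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)
acc_permuted = DecisionTreeClassifier(random_state=0).fit(X_train[:, perm], y_train).score(X_test[:, perm], y_test)

# the two accuracies typically match (up to tie-breaking inside the tree),
# because nothing in the tree cares which feature sits "next to" which other feature
print(acc_original, acc_permuted)
```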

A basic question we should ask is: “Does it really matter whether or not a learning algorithm knows the locality structure of the input?” Consider a simplistic example where we have n input bits as features and the function f to learn uses only k of the n bits. Suppose we also know that the function is one of two candidates, f1 and f2, but that these are otherwise complicated functions. Then, finding which k of the n bits to use might require an O(n choose k) = O(n!/(k!(n-k)!)) computation and O(k log(n/k)) samples when k is much smaller than n. On the other hand, when we know which k bits are relevant, “learning” is trivial. This suggests that telling the algorithm which features are “near” which other features can be very helpful in solving the problem, at least computationally.
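
To make the gap concrete with arbitrary numbers (say n = 1000 and k = 10):

```python
# Arbitrary illustrative numbers for the n-choose-k argument above (Python 3.8+).
from math import comb, log2

n, k = 1000, 10
print(comb(n, k))        # ~2.6e23 candidate feature subsets to search over
print(k * log2(n / k))   # ~66 -- the rough sample requirement k*log2(n/k)
```

Knowing in advance which k features matter removes the exponential search entirely.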

There are several natural questions which arise:

  1. How do we specify locality? Using a neighborhood graph? A weighted neighborhood graph?
  2. What are natural algorithms subject to locality? It seems that most practical existing learning algorithms which use locality have some form of sweeping operation over local neighborhoods. Is that it, or is there more?
  3. Can a general purpose locality aware algorithm perform well on a broad variety of different tasks?
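
As one possible (and arbitrary) answer to the first question, locality for an image-like input could be handed to the learner as a neighborhood graph over feature indices, for example:

```python
# One possible encoding of locality: a neighborhood graph over the features of a
# width x height image, given to the learner as side information. The grid shape
# and the unit weights for adjacent pixels are arbitrary choices for the sketch.
def grid_neighborhood_graph(width, height):
    """Map each pixel index to {neighbor_index: weight} for its 4 grid neighbors."""
    def index(x, y):
        return y * width + x

    graph = {}
    for y in range(height):
        for x in range(width):
            neighbors = {}
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    neighbors[index(nx, ny)] = 1.0   # unit weight for adjacent pixels
            graph[index(x, y)] = neighbors
    return graph

print(grid_neighborhood_graph(3, 3)[4])   # the center pixel touches four neighbors
```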