Gradients everywhere

One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talk and several others may be approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: Given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance.

It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down.

  1. Sometimes there are clever approximations of the gradient. (Simon Osindero)
  2. Sometimes we can compute constrained gradients via iterated gradient/project steps. (Ben Taskar)
  3. Sometimes we can compute gradients anyway over mildly nondifferentiable functions. (Drew Bagnell)
  4. Even given a gradient, the choice of update is unclear, and might be cleverly chosen. (Nic Schraudolph)

Perhaps a more extreme example of this is Adaboost, which repeatedly reuses a classifier learner to implicitly optimize a gradient. Viewed as a gradient optimization algorithm, Adaboost is a sublinear algorithm (in the number of implicit parameters) when applied to decision trees.
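To make the "reuses a classifier learner" point concrete, here is a minimal AdaBoost sketch on toy data (the synthetic data and the decision-stump base learner are my own illustrative choices). The base learner never sees a gradient explicitly, only reweighted examples; those weights are, up to sign and normalization, the gradient of the exponential loss with respect to the ensemble's per-example scores.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Minimal AdaBoost sketch with decision stumps on toy data.  The example
# weights passed to the base learner are (up to sign and normalization) the
# gradient of the exponential loss with respect to the ensemble's scores.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)       # labels in {-1, +1}

w = np.full(len(y), 1.0 / len(y))                # example weights
stumps, alphas = [], []

for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)             # reuse a standard classifier learner
    pred = stump.predict(X)
    err = np.sum(w * (pred != y))
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    w = w * np.exp(-alpha * y * pred)            # up-weight mistakes
    w = w / w.sum()
    stumps.append(stump)
    alphas.append(alpha)

def ensemble(X):
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))

print("training accuracy:", np.mean(ensemble(X) == y))
```

Roughly speaking, the implicit parameters are one coefficient per possible tree, and each round of boosting touches only one of them, which is the sense in which the optimization is sublinear in that parameter count.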

Is Multitask Learning Black-Boxable?

Multitask learning is learning to predict multiple outputs given the same input. Mathematically, we might think of this as trying to learn a function f: X -> {0,1}^n. Structured learning is similar at this level of abstraction. Many people have worked on solving multitask learning (for example Rich Caruana) using methods which share an internal representation. In other words, the computation and learning of the ith prediction are shared with the computation and learning of the jth prediction. Another way to ask this question is: can we avoid sharing the internal representation?

For example, it might be feasible to solve multitask learning by some process feeding the ith prediction f(x)_i into the jth predictor f(x, f(x)_i)_j.
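As a hypothetical sketch of what this black-box route could look like, one can chain ordinary binary learners, appending each task's prediction to the input of later tasks. Everything below (the synthetic data, the logistic-regression learner, and the fixed task ordering) is an assumption for illustration; it says nothing about the bias or sample-complexity questions discussed below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch of the black-box route to multitask learning: a chain
# of ordinary binary classifiers, where task i's prediction f(x)_i is fed to
# later predictors as an extra feature.  Data and model choice are illustrative.

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
Y = (X @ rng.normal(size=(10, 3)) + 0.1 * rng.normal(size=(500, 3))) > 0  # 3 binary tasks

chain = []
features = X
for i in range(Y.shape[1]):
    clf = LogisticRegression(max_iter=1000).fit(features, Y[:, i])  # plain binary learner
    chain.append(clf)
    preds = clf.predict(features).astype(float)
    features = np.hstack([features, preds[:, None]])                # f(x)_i becomes an input

def predict_all(X_new):
    feats, outputs = X_new, []
    for clf in chain:
        p = clf.predict(feats).astype(float)
        outputs.append(p)
        feats = np.hstack([feats, p[:, None]])
    return np.column_stack(outputs)

print("train accuracy per task:", (predict_all(X) == Y).mean(axis=0))
```

scikit-learn's ClassifierChain is essentially this construction packaged as a meta-estimator.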

If the answer is “no”, then it implies we cannot take binary classification as a basic primitive in the process of solving prediction problems. If the answer is “yes”, then we can reuse binary classification algorithms to solve multitask learning problems.

Finding a satisfying answer to this question at a theoretical level appears tricky. If you consider the infinite data limit with IID samples for any finite X, the answer is “yes” because any function can be learned. However, this does not take into account two important effects:

  1. Using a shared representation alters the bias of the learning process. What this implies is that fewer examples may be required to learn all of the predictors. Of course, changing the input features also alters the bias of the learning process. Comparing these altered biases well enough to distinguish their power seems very tricky. For reference, Jonathon Baxter has done some related analysis (which still doesn’t answer the question).
  2. Using a shared representation may be computationally cheaper.

One thing which can be said about multitask learning (in either black-box or shared representation form), is that it can make learning radically easier. For example, predicting the first bit output by a cryptographic circuit is (by design) extraordinarily hard. However, predicting the bits of every gate in the circuit (including the first bit output) is easily done based upon a few examples.

Automated Labeling

One of the common trends in machine learning has been an emphasis on the use of unlabeled data. The argument goes something like “there aren’t many labeled web pages out there, but there are a huge number of web pages, so we must find a way to take advantage of them.” There are several standard approaches for doing this:

  1. Unsupervised Learning. You use only unlabeled data. In a typical application, you cluster the data and hope that the clusters somehow correspond to what you care about.
  2. Semisupervised Learning. You use both unlabeled and labeled data to build a predictor. The unlabeled data influences the learned predictor in some way.
  3. Active Learning. You have unlabeled data and access to a labeling oracle. You interactively choose which examples to label so as to optimize prediction accuracy.

It seems there is a fourth approach worth serious investigation—automated labeling. The approach goes as follows:

  1. Identify some subset of observed values to predict from the others.
  2. Build a predictor.
  3. Use the output of the predictor to define a new prediction problem.
  4. Repeat…
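A minimal sketch of this loop, with entirely invented data (a latent quantity observed through several noisy views, plus a handful of seed labels for the first stage):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of the automated-labeling loop on synthetic data: each stage's
# predictions become the training labels for the next stage, which sees a
# different (assumed) view of the observations.  The latent signal, the
# views, and the seed labels are all invented for illustration.

rng = np.random.default_rng(0)
n = 1000
latent = rng.normal(size=n)                                  # unobserved quantity
views = [latent[:, None] + 0.3 * rng.normal(size=(n, 2)) for _ in range(3)]
seed_labels = (latent[:100] > 0).astype(int)                 # a few real labels

labels, m = seed_labels, 100
predictors = []
for view in views:
    clf = LogisticRegression().fit(view[:m], labels)         # build a predictor
    predictors.append(clf)
    labels, m = clf.predict(view), n                         # its outputs label the next stage

# the later predictors were trained purely on automatically generated labels
print("agreement with latent sign:", (labels == (latent > 0)).mean())
```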

Examples of this sort seem to come up in robotics very naturally. An extreme version of this is:

  1. Predict nearby things given touch sensor output.
  2. Predict medium distance things given the nearby predictor.
  3. Predict far distance things given the medium distance predictor.

Some of the participants in the LAGR project are using this approach.

A less extreme version was the DARPA Grand Challenge winner, where the output of a laser range finder was used to form a road-or-not predictor for a camera image.

These automated labeling techniques transform an unsupervised learning problem into a supervised learning problem, which has huge implications: we understand supervised learning much better and can bring to bear a host of techniques.

The set of work on automated labeling is sketchy—right now it is mostly just an observed-as-useful technique for which we have no general understanding. Some relevant bits of algorithm and theory are:

  1. Reinforcement learning to classification reductions which convert rewards into labels.
  2. Cotraining which considers a setting containing multiple data sources. When predictors using different data sources agree on unlabeled data, an inferred label is automatically created.
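A rough sketch of that agreement rule, again on invented two-view data; the confidence threshold and logistic models are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rough cotraining-style sketch on invented data with two noisy views of the
# same underlying label.  When the two view-specific predictors confidently
# agree on an unlabeled example, the agreement becomes an inferred label.

rng = np.random.default_rng(0)
n, n_labeled = 2000, 50
truth = rng.integers(0, 2, size=n)
view_a = truth[:, None] + 0.5 * rng.normal(size=(n, 3))
view_b = truth[:, None] + 0.5 * rng.normal(size=(n, 3))

labeled = set(range(n_labeled))
labels = {i: int(truth[i]) for i in range(n_labeled)}

for _ in range(5):
    idx = sorted(labeled)
    y = np.array([labels[i] for i in idx])
    clf_a = LogisticRegression().fit(view_a[idx], y)
    clf_b = LogisticRegression().fit(view_b[idx], y)
    pa = clf_a.predict_proba(view_a)[:, 1]
    pb = clf_b.predict_proba(view_b)[:, 1]
    for i in range(n):
        if i in labeled:
            continue
        # both predictors confident and in agreement -> infer a label
        if (pa[i] > 0.95 and pb[i] > 0.95) or (pa[i] < 0.05 and pb[i] < 0.05):
            labels[i] = int(pa[i] > 0.5)
            labeled.add(i)

inferred = [i for i in labeled if i >= n_labeled]
print("inferred labels:", len(inferred),
      "accuracy:", np.mean([labels[i] == truth[i] for i in inferred]))
```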

It’s easy to imagine that undiscovered algorithms and theory exist to guide and use this empirically useful technique.

Fast Gradient Descent

Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD).

Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent.

  1. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”.
  2. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable.
  3. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be is unclear. Many algorithms have been developed for adjusting the step size (and to some extent the step direction). Unfortunately, many of the more sophisticated ones are not robust to noise, scale badly with the number of parameters (anything worse than O(n) is unacceptable for big applications), or both. Consequently, many people simply use gradient descent with a step size adjusted by a simple momentum heuristic.

Many people would add point (4): gradient descent on many architectures does not result in a global optimum. This seems like a confusion of goals to me. The goal in learning is good performance on future examples rather than achieving a global optimum on the training set.

SMD addresses point (3). It is an O(n) gradient descent algorithm that can compete with the sophisticated methods where they work, while remaining fairly robust to noise. Exactly how well it addresses point (3) is not entirely clear, but a few interesting problems have been solved with the algorithm, and perhaps we will see more evidence in the near future.
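As a very rough sketch of the flavor of SMD (not Schraudolph's exact formulation; the safeguard, constants, and curvature handling below are simplified guesses), each parameter keeps its own step size, and an auxiliary vector v tracks how the parameters depend on those step sizes, so correlated gradients grow the step size and oscillating gradients shrink it:

```python
import numpy as np

# Simplified SMD-style optimizer on a noisy quadratic.  Everything here is
# O(n) per step; the exact published SMD update differs in details.

rng = np.random.default_rng(0)
n = 20
A = np.diag(np.linspace(0.1, 10.0, n))       # badly conditioned quadratic 0.5 w'Aw
w = rng.normal(size=n)                       # parameters
p = np.full(n, 0.05)                         # per-parameter step sizes
v = np.zeros(n)                              # tracks d w / d log p
mu, lam = 0.05, 0.99                         # meta step size, decay

def grad(w):
    return A @ w + 0.1 * rng.normal(size=n)  # noisy gradient

for t in range(500):
    g = grad(w)
    # grow step sizes where the new gradient correlates with v, shrink otherwise
    p *= np.maximum(0.5, 1.0 - mu * g * v)
    w -= p * g                               # ordinary gradient step with local gains
    Hv = A @ v                               # Hessian-vector product (exact here)
    v = lam * v - p * (g + lam * Hv)         # update the dependence estimate

print("final loss:", 0.5 * w @ A @ w)
```

On real architectures the Hessian-vector product is computed without ever forming the Hessian (e.g. via Pearlmutter's R-operator), which is what keeps the whole update O(n).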

Fast SVMs

There was a presentation at Snowbird about parallelized support vector machines. In many cases, people parallelize by ignoring serial operations, but that is not what happened here—they parallelize with optimizations. Consequently, this seems to be the fastest SVM in existence.

There is a related paper here.