Interesting Presentations at Snowbird

Here are a few of the presentations that interested me at the Snowbird learning workshop (which, amusingly, was held in Florida alongside AISTATS).

  1. Thomas Breuel described machine learning problems within OCR and an open source OCR software/research platform with modular learning components, as well as a dataset of 60 million examples derived from Google’s scanned books.
  2. Kristen Grauman and Fei-Fei Li discussed using active learning with labels of differing cost and large datasets for building image ontologies. Both of them used Mechanical Turk as a labeling system, which looks set to become routine, at least for vision problems.
  3. Russ Tedrake discussed using machine learning for control, with a basic claim that it was the way to go for problems involving a medium Reynolds number, such as bird flight, where simulation is extremely computationally intensive.
  4. Yann LeCun presented a poster on an FPGA implementation of convolutional neural networks yielding a factor of 100 speedup in processing. In addition to the graphics processor approach Rajat has worked on, this seems like an effective way to handle the need to compute many dot products.

Asymmophobia

One striking feature of many machine learning algorithms is the gymnastics that designers go through to avoid symmetry breaking. In the most basic form of machine learning, there are labeled examples composed of features. Each of these can be treated symmetrically or asymmetrically by algorithms.

  1. Feature symmetry. Every feature is treated the same. In gradient update rules, the same update is applied whether the feature is first or last. In metric-based predictions, every feature is just as important in computing the distance.
  2. Example symmetry. Every example is treated the same. Batch learning algorithms are great exemplars of this approach.
  3. Label symmetry. Every label is treated the same. This is particularly noticeable in multiclass classification systems which predict according to arg max_l w_l · x (a minimal sketch of this rule follows the list), but it occurs in many other places as well.
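As a concrete illustration of the label-symmetric prediction rule, here is a minimal sketch in Python with NumPy; the weight matrix and input are made up for illustration rather than coming from any trained system.

```python
import numpy as np

# A label-symmetric multiclass linear predictor: one weight vector per label,
# and prediction is arg max_l w_l . x.  Every label is treated identically,
# so swapping two rows of W simply swaps the corresponding labels.

def predict(W, x):
    """W has shape (num_labels, num_features); x has shape (num_features,)."""
    return int(np.argmax(W @ x))

# Illustrative (made-up) weights for 3 labels over 4 features.
W = np.array([[ 0.2, -0.1, 0.0,  0.5],
              [ 0.1,  0.3, 0.2, -0.4],
              [-0.3,  0.1, 0.4,  0.1]])
x = np.array([1.0, 0.5, -0.2, 0.3])

print(predict(W, x))  # index of the label with the largest score
```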

Empirically, breaking symmetry well seems to yield great algorithms.

  1. Feature asymmetry. For those who like the “boosting is stepwise additive regression on exponential loss” viewpoint (I don’t entirely), boosting is an example of symmetry breaking over features.
  2. Example asymmetry. Online learning introduces an example asymmetry. Aside from providing a mechanism for large scale learning, it also enables learning in entirely new (online) settings.
  3. Label asymmetry. Tree structured algorithms are good instances of label asymmetry. This includes both older decision tree approaches like C4.5 and some newer ones we’ve worked on. These approaches are exponentially faster in the number of labels than more symmetric approaches (a rough sketch of tree-structured prediction follows this list).
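For contrast with the arg max rule sketched earlier, here is a rough sketch of the tree-structured idea. The node classifiers below are hypothetical (randomly initialized), so only the computational structure is meaningful: a balanced tree over K labels makes about log2(K) binary decisions per prediction instead of computing K dot products.

```python
import numpy as np

# Label-asymmetric prediction with a balanced binary tree over labels.
# Each internal node uses a binary classifier to decide whether the label
# lies in the left or right half of the node's label set, so prediction
# costs O(log K) dot products rather than the O(K) of arg max_l w_l . x.
# The node weight vectors here are random: in a real system they would be
# trained, but the point is the structure, not the accuracy.

rng = np.random.default_rng(0)
num_features = 4

def predict_tree(labels, x):
    """Recursively descend the label tree; `labels` is the current label set."""
    if len(labels) == 1:
        return labels[0]
    w = rng.standard_normal(num_features)  # hypothetical node classifier
    mid = len(labels) // 2
    branch = labels[:mid] if w @ x >= 0 else labels[mid:]
    return predict_tree(branch, x)

x = np.array([1.0, 0.5, -0.2, 0.3])
print(predict_tree(list(range(8)), x))  # 8 labels, only 3 binary decisions
```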

The examples above are notably important, with good symmetry breaking approaches yielding substantially improved prediction or computational performance. Given such strong evidence that symmetry breaking is a desirable property, a basic question is: Why isn’t it more prevalent, and more thoroughly studied? One reasonable answer is that doing symmetry breaking well requires more serious thought about learning algorithm design, so researchers simply haven’t gotten to it. This answer appears incomplete.

A more complete answer is that many researchers seem to reflexively avoid symmetry breaking. A simple reason for this is the now pervasive use of Matlab in machine learning. Matlab is a handy tool for fast prototyping of learning algorithms, but it has an intrinsic language-level bias towards symmetric approaches, since its built-in primitives are matrix operations. A more complex reason is a pervasive reflexive belief in fairness. While this is admirable when reviewing papers, it seems less so when designing learning algorithms. A third, related reason seems to be a fear of doing unmotivated things. Anytime symmetry breaking is undertaken, the method for symmetry breaking is in question, and many people feel uncomfortable without a theorem suggesting the method is the right one. Since there are few theorems motivating symmetry breaking methods, it is often avoided.

What methods for symmetry breaking exist?

  1. Randomization. Neural Network learning algorithms which initialize the weights randomly exemplify this. I consider the randomization approach particularly weak. It makes experiments non-repeatable, and it seems like the sort of solution that someone with asymmophobia would come up with if they were forced to do something asymmetric.
  2. Arbitrary. Arbitrary symmetry breaking is something like random, except there is no randomness—you simply declare this feature/label/example comes first and that one second. This seems mildly better than the randomized approach, but still not inspiring.
  3. Data-driven. Boosting is a good example where a data-driven approach drives symmetry breaking (over features). Data-driven approaches to symmetry breaking seem the most sound, as they can result in improved performance; a toy sketch contrasting all three methods follows this list.
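To make the contrast concrete, here is a toy sketch on synthetic data of the three styles applied to a single design decision: which feature should a greedy, stepwise procedure attend to first? The correlation-based ordering stands in for the data-driven approach, and everything here is illustrative rather than a real learning algorithm.

```python
import numpy as np

# Three ways to break feature symmetry when deciding which feature to treat
# first in a greedy, stepwise fit.  The data is synthetic: feature 3 carries
# essentially all of the signal.

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = 2.0 * X[:, 3] + 0.1 * rng.standard_normal(100)

# 1. Randomized: pick a feature order at random (non-repeatable without a seed).
random_order = rng.permutation(X.shape[1])

# 2. Arbitrary: declare that feature 0 comes first, feature 1 second, and so on.
arbitrary_order = np.arange(X.shape[1])

# 3. Data-driven (boosting-flavored): order features by how strongly each one
#    correlates with the target, so the most useful feature is treated first.
correlations = np.abs(X.T @ y)
data_driven_order = np.argsort(-correlations)

print("random:     ", random_order)
print("arbitrary:  ", arbitrary_order)
print("data-driven:", data_driven_order)  # feature 3 should come first
```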

While there are examples of learning algorithms doing symmetry breaking for features, labels, and examples individually, I know of none which does all three well. What would such an algorithm look like?

Machine Learning is too easy

One of the remarkable things about machine learning is how diverse it is. The viewpoints of Bayesian learning, reinforcement learning, graphical models, supervised learning, unsupervised learning, genetic programming, etc… share little enough overlap that many people can and do make their careers within one without touching, or even necessarily understanding, the others.

There are two fundamental reasons why this is possible.

  1. For many problems, many approaches work in the sense that they do something useful. This is true empirically, where for many problems we can observe that many different approaches yield better performance than any constant predictor. It’s also true in theory, where we know that for any set of predictors representable in a finite amount of RAM, minimizing training error over the set of predictors does something nontrivial when there are a sufficient number of examples (a standard bound making this precise appears after the list).
  2. There is nothing like a unifying problem defining the field. In many other areas there are unifying problems for which finer distinctions in approaches matter.
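A standard way to make point (1) precise is the finite-class (Occam’s razor) bound, which follows from Hoeffding’s inequality and a union bound: for m i.i.d. examples and a finite predictor set F,

```latex
\[
  \Pr\!\left[\, \forall f \in F:\;
      \operatorname{err}(f) \;\le\; \widehat{\operatorname{err}}(f)
        + \sqrt{\frac{\ln|F| + \ln(1/\delta)}{2m}} \,\right] \;\ge\; 1 - \delta .
\]
```

Since any set of predictors representable in a finite amount of RAM is finite, minimizing training error over it comes with a nontrivial guarantee once m is large relative to ln |F|.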

The implications of this observation agree with an inspection of the field.

  1. Particular problems are often “solved” by the first technique applied to them. This is particularly exacerbated when some other field first uses machine learning, as people there may not be aware of other approaches. A popular example of this is Naive Bayes for spam. In other fields, the baseline method is a neural network, an SVM, a graphical model, etc…
  2. The analysis of new learning algorithms is often subject to assumptions designed for the learning algorithm. Examples include large margins for support vector machines, a ‘correct’ Bayesian prior, correct conditional independence assumptions for graphical models, etc… Given such assumptions, it’s unsurprising to learn that the algorithm is the right method, and justifying a new algorithm becomes an exercise in finding a natural-sounding assumption under which the algorithm performs well. This assumption set selection problem is the theoretician’s version of the data set selection problem.

A basic problem is: How do you make progress in a field with this (lack of) structure? And what does progress even mean? Some possibilities are:

  1. Pick an approach and push it. [Insert your favorite technique] everywhere.
  2. Find new real problems and apply ML. The fact that ML is easy means there is a real potential for doing great things this way.
  3. Find a hard problem and work on it. Although almost anyone can do something nontrivial on most problems, achieving best-possible performance on some problems is not at all easy.
  4. Make the problem harder. Create algorithms that work fast online, in real time, with very few or no labeled examples, but very many examples, very many features, and very many things to predict.

I am least fond of approach (1), although many people successfully follow it for their careers. What’s frustrating about approach (1) is that there does not seem to be any single simple philosophy capable of solving all the problems we might recognize as machine learning problems. Consequently, people following approach (1) are at risk of being outpersuaded by someone sometime in the future.

Approach (2) is perhaps the easiest way to accomplish great things, and in some sense many advances come from new applications.

Approach (3) seems solid, promoting a different kind of progress than approach (2).

Approach (4) seems particularly cool to me at the moment. It is not as specialized as (2) or (3), and it seems many of the constraints are complementary. For example, to a first approximation, large scale learning = online learning.

Parallel ML primitives

Previously, we discussed parallel machine learning a bit. As parallel ML is rather difficult, I’d like to describe my thinking at the moment, and ask for advice from the rest of the world. This is particularly relevant right now, as I’m attending a workshop tomorrow on parallel ML.

Parallelizing slow algorithms seems uncompelling. Parallelizing many algorithms also seems uncompelling, because the effort required to parallelize is substantial. This leaves the question: Which one fast algorithm is the best to parallelize? What is a substantially different second?

One compellingly fast simple algorithm is online gradient descent on a linear representation. This is the core of Leon’s sgd code and Vowpal Wabbit. Antoine Bordes showed a variant was competitive in the large scale learning challenge. It’s also a decades-old primitive which has been reused in many algorithms and continues to be reused. It also applies to online learning rather than just online optimization, implying the algorithm can be used in a host of situations where batch algorithms are awkward or unviable.
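For concreteness, here is a minimal sketch of the primitive itself: online gradient descent on a linear representation with squared loss and a fixed learning rate. It is not the implementation in Leon’s sgd code or Vowpal Wabbit, and the synthetic stream exists only to demonstrate the update.

```python
import numpy as np

# Minimal online gradient descent on a linear representation, using squared
# loss and a fixed learning rate.  One cheap update per example, and the
# example is never revisited, which is what makes the primitive fast.

def online_sgd(stream, num_features, learning_rate=0.1):
    w = np.zeros(num_features)
    for x, y in stream:                  # examples arrive one at a time
        gradient = (w @ x - y) * x       # gradient of 0.5 * (w.x - y)^2
        w -= learning_rate * gradient
    return w

# Synthetic demonstration: recover w* = [1, -2, 3] from a stream of examples.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 3.0])
stream = ((x, true_w @ x) for x in rng.standard_normal((10000, 3)))
print(online_sgd(stream, num_features=3))  # should approach [1, -2, 3]
```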

If we start with a fast learning algorithm as a baseline, there seem to be two directions we can go with parallel ML:

  1. (easier) Try to do more in the same amount of time. For example, Paul and Neil suggest using parallelism to support ensemble methods (a crude sketch of this direction appears after the list).
  2. (harder) Try to use parallelism to reduce the amount of time required to effectively learn on large amounts of data. For this approach, bandwidth and latency become uncomfortably critical implementation details. Due to these issues, it appears important to at least loosen the goal to competing with learning on large amounts of data. Alternatively, we could consider this as evidence some other learning primitive is desirable, although I’m not sure which one.
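As a crude sketch of direction (1), the example below trains one linear model per data shard in parallel and then averages the per-shard models. The shard layout, the base learner (plain least squares), and the synthetic data are all placeholders; the point is only the shape of the parallelism.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

# "Do more in the same amount of time": train one model per data shard in
# parallel, then combine the per-shard models into a crude ensemble.

def train_on_shard(shard):
    X, y = shard
    # Least-squares fit on this shard alone; any fast base learner would do.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def make_shards(num_shards=4, examples_per_shard=1000, num_features=5, seed=0):
    rng = np.random.default_rng(seed)
    true_w = rng.standard_normal(num_features)
    shards = []
    for _ in range(num_shards):
        X = rng.standard_normal((examples_per_shard, num_features))
        y = X @ true_w + 0.01 * rng.standard_normal(examples_per_shard)
        shards.append((X, y))
    return shards, true_w

if __name__ == "__main__":
    shards, true_w = make_shards()
    with ProcessPoolExecutor() as pool:       # one worker per shard
        models = list(pool.map(train_on_shard, shards))
    averaged = np.mean(models, axis=0)        # combine by averaging weights
    print(np.max(np.abs(averaged - true_w)))  # should be small
```

For linear models, averaging the weights is the same as averaging the predictions. The harder direction (2) requires the shards to communicate during learning, which is exactly where the bandwidth and latency issues mentioned above become critical.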