Machine Learning (Theory)

4/2/2009

Asymmophobia

One striking feature of many machine learning algorithms is the gymnastics that designers go through to avoid symmetry breaking. In the most basic form of machine learning, there are labeled examples composed of features. Each of these can be treated symmetrically or asymmetrically by algorithms.

  1. Feature symmetry. Every feature is treated the same. In gradient update rules, the same update is applied whether the feature is first or last. In metric-based predictions, every feature is just as important in computing the distance.
  2. Example symmetry. Every example is treated the same. Batch learning algorithms are great exemplars of this approach.
  3. Label symmetry. Every label is treated the same. This is particularly noticeable in multiclass classification systems which predict according to $\arg\max_l w_l \cdot x$, but it occurs in many other places as well.
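To make the feature and label symmetries concrete, here is a minimal sketch in plain Python. The function names `predict_label` and `squared_distance` are my own, chosen for illustration; they are not taken from any particular library. The first scores every label with the same rule and takes the arg max, and the second weights every feature equally in a distance.

```python
def predict_label(weights, x):
    """Label symmetry: score every label l with the same rule w_l . x,
    then return the index of the highest-scoring label."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def squared_distance(a, b):
    """Feature symmetry: every coordinate contributes to the distance
    in exactly the same way, regardless of its position."""
    return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))


# Three labels, two features (illustrative numbers only).
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [2.0, 3.0]
best = predict_label(weights, x)
```

Note that the prediction cost grows linearly with the number of labels, since each one is scored in turn.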

Empirically, breaking symmetry well seems to yield great algorithms.

  1. Feature asymmetry. For those who like the “boosting is stepwise additive regression on exponential loss” viewpoint (I don’t entirely), boosting is an example of symmetry breaking on features.
  2. Example asymmetry. Online learning introduces an example asymmetry. Aside from providing a mechanism for large scale learning, it also enables learning in entirely new (online) settings.
  3. Label asymmetry. Tree structured algorithms are good instances of label asymmetry. This includes both the older decision tree approaches like C4.5 and some newer ones we’ve worked on. These approaches are exponentially faster in the number of labels than more symmetric approaches.
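The "exponentially faster in the number of labels" claim can be illustrated with a toy sketch. This is not C4.5 or any published algorithm; it is a hypothetical balanced binary tree over labels 0..k-1, where each internal node holds a binary decision rule. The helper names `build_threshold_tree` and `tree_predict`, and the threshold rule used at each node, are assumptions made purely for the demonstration. The point is only that prediction takes about log2(k) decisions rather than k score evaluations.

```python
def build_threshold_tree(num_labels):
    """Build hypothetical node classifiers for a toy problem in which the
    input is a number and the correct label is its integer part.
    Keys are half-open label ranges (lo, hi); values map x -> bool,
    where True means 'descend into the upper half of the range'."""
    classifiers = {}

    def build(lo, hi):
        if hi - lo <= 1:
            return  # a single label is a leaf, no decision needed
        mid = (lo + hi) // 2
        classifiers[(lo, hi)] = lambda x, mid=mid: x >= mid
        build(lo, mid)
        build(mid, hi)

    build(0, num_labels)
    return classifiers


def tree_predict(x, classifiers, num_labels):
    """Walk from the root to a leaf, halving the candidate label range
    at every step. Returns (label, number of classifier evaluations)."""
    lo, hi = 0, num_labels
    evaluations = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        go_upper = classifiers[(lo, hi)](x)
        evaluations += 1
        if go_upper:
            lo = mid
        else:
            hi = mid
    return lo, evaluations


k = 1024
classifiers = build_threshold_tree(k)
label, evaluations = tree_predict(700.5, classifiers, k)
```

With 1024 labels, the walk touches 10 node classifiers, whereas a symmetric arg max would score all 1024 labels. The labels near the root split are treated differently from those deep in one subtree, which is exactly the asymmetry.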

The examples above are notably important, with good symmetry breaking approaches yielding substantially improved prediction or computational performance. Given such strong evidence that symmetry breaking is a desirable property, a basic question is: Why isn’t it more prevalent, and more thoroughly studied? One reasonable answer is that doing symmetry breaking well requires more serious thought about learning algorithm design, so researchers simply haven’t gotten to it. This answer appears incomplete.

A more complete answer is that many researchers seem to reflexively avoid symmetry breaking. A simple reason for this is the now pervasive use of Matlab in machine learning. Matlab is a handy tool for fast prototyping of learning algorithms, but it has an intrinsic language-level bias towards symmetric approaches since there are built-in primitives for matrix operations. A more complex reason is a pervasive reflexive belief in fairness. While this is admirable when reviewing papers, it seems less so when designing learning algorithms. A third related reason seems to be a fear of doing unmotivated things. Any time symmetry breaking is undertaken, the method for symmetry breaking is in question, and many people feel uncomfortable without a theorem suggesting the method is the right one. Since there are few theorems motivating symmetry breaking methods, it is often avoided.

What methods for symmetry breaking exist?

  1. Randomization. Neural network learning algorithms which initialize the weights randomly exemplify this. I consider the randomization approach particularly weak. It makes experiments non-repeatable, and it seems like the sort of solution that someone with asymmophobia would come up with if they were forced to do something asymmetric.
  2. Arbitrary. Arbitrary symmetry breaking is something like random, except there is no randomness: you simply declare this feature/label/example comes first and that one second. This seems mildly better than the randomized approach, but still not inspiring.
  3. Data-driven. Boosting is a good example where a data-driven approach drives symmetry breaking (over features). Data-driven approaches for symmetry breaking seem the most sound, as they can result in improved performance.
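As a rough illustration of the difference between arbitrary and data-driven symmetry breaking over features, consider the sketch below. It is not boosting; it uses a much simpler stand-in criterion (absolute covariance between a feature and the label) that I picked only to keep the example short. The function names `arbitrary_order` and `data_driven_order` are hypothetical.

```python
def arbitrary_order(num_features):
    """Arbitrary symmetry breaking: declare that feature 0 comes first,
    feature 1 second, and so on, without looking at any data."""
    return list(range(num_features))


def data_driven_order(examples, labels):
    """Data-driven symmetry breaking: rank features by the absolute
    covariance between the feature and the label, largest first.
    (A stand-in criterion for illustration, not a boosting step.)"""
    n = len(examples)
    num_features = len(examples[0])
    mean_y = sum(labels) / n

    def abs_cov(j):
        mean_x = sum(x[j] for x in examples) / n
        cov = sum((x[j] - mean_x) * (y - mean_y)
                  for x, y in zip(examples, labels)) / n
        return abs(cov)

    return sorted(range(num_features), key=abs_cov, reverse=True)


# Toy data: the last feature equals the label, the first is uninformative.
examples = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]]
labels = [0, 1, 0, 1]
order = data_driven_order(examples, labels)
```

On this toy data the data-driven ordering puts the last feature first, while the arbitrary ordering leaves it last simply because of where it happens to sit.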

While there are examples of learning algorithms doing symmetry breaking for features, labels, and examples individually, I don’t know of any that do all three well. What would such an algorithm look like?
