One striking feature of many machine learning algorithms is the gymnastics that designers go through to avoid symmetry breaking. In the most basic form of machine learning, there are labeled examples composed of features. Each of these can be treated symmetrically or asymmetrically by algorithms.
- feature symmetry Every feature is treated the same. In gradient update rules, the same update is applied whether the feature is first or last. In metric-based predictions, every feature is just as important in computing the distance.
- example symmetry Every example is treated the same. Batch learning algorithms are great exemplars of this approach.
- label symmetry Every label is treated the same. This is particularly noticeable in multiclass classification systems which predict according to arg max_l w_l · x, but it occurs in many other places as well. (A minimal sketch of this fully symmetric baseline appears just after this list.)
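To make the baseline concrete, here is a minimal sketch (mine, not from the post) of a fully symmetric multiclass linear predictor: every feature enters through the same dot product and every label through the same argmax rule, so nothing distinguishes one feature or label from another. The dimensions and names are purely illustrative.

```python
# Sketch only: the fully symmetric baseline, where every feature and every
# label is handled by exactly the same rule.
import numpy as np

def predict(W, x):
    """Multiclass linear prediction: arg max_l  w_l . x."""
    scores = W @ x                 # same dot product for every label row
    return int(np.argmax(scores))  # same argmax rule for every label

# toy usage: 3 labels, 5 features, all treated identically
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
print(predict(W, x))
```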
Empirically, breaking symmetry well seems to yield great algorithms.
- feature asymmetry For those who like the “boosting is stepwise additive regression on exponential loss” viewpoint (I don’t entirely), boosting is an example of symmetry breaking on features.
- example asymmetry Online learning introduces an example asymmetry. Aside from providing a mechanism for large scale learning, it also enables learning in entirely new (online) settings.
- label asymmetry Tree structured algorithms are good instances of label asymmetry. This includes both the older decision tree approaches like C4.5 and some newer ones we’ve worked on. These approaches are exponentially faster in the number of labels than more symmetric approaches (a rough sketch of why follows this list).
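As a rough illustration (my sketch under simple assumptions, not the algorithms referenced above), a balanced tree of binary deciders over the labels answers a prediction query with about log2(K) weight-vector evaluations, where a flat arg max over per-label weight vectors needs all K:

```python
# Sketch: flat multiclass prediction touches all K label weight vectors,
# while a balanced label tree touches only ~log2(K) per-node deciders.
import numpy as np

def flat_predict(W, x):
    """O(K) dot products: one score per label."""
    return int(np.argmax(W @ x))

def tree_predict(node_w, x, K):
    """O(log K) dot products: node_w maps a label range (lo, hi) to a
    hypothetical binary decider's weight vector for that split."""
    lo, hi = 0, K
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if node_w[(lo, hi)] @ x >= 0:
            lo = mid               # descend into labels [mid, hi)
        else:
            hi = mid               # descend into labels [lo, mid)
    return lo

# toy usage with random, untrained deciders, just to show the control flow
rng = np.random.default_rng(0)
K, d = 8, 5
node_w, stack = {}, [(0, K)]
while stack:
    lo, hi = stack.pop()
    if hi - lo > 1:
        node_w[(lo, hi)] = rng.normal(size=d)
        mid = (lo + hi) // 2
        stack += [(lo, mid), (mid, hi)]
print(tree_predict(node_w, rng.normal(size=d), K))
```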
The examples above are notably important, with good symmetry breaking approaches yielding substantially improved prediction or computational performance. Given such strong evidence that symmetry breaking is a desirable property, a basic question is: Why isn’t it more prevalent, and more thoroughly studied? One reasonable answer is that doing symmetry breaking well requires more serious thought about learning algorithm design, so researchers simply haven’t gotten to it. This answer appears incomplete.
A more complete answer is that many researchers seem to reflexively avoid symmetry breaking. A simple reason for this is the now pervasive use of Matlab in machine learning. Matlab is a handy tool for fast prototyping of learning algorithms, but it has an intrinsic language-level bias towards symmetric approaches since there are built-in primitives for matrix operations. A more complex reason is a pervasive reflex belief in fairness. While this is admirable when reviewing papers, it seems less so when designing learning algorithms. A third related reason seems to be a fear of doing unmotivated things. Anytime symmetry breaking is undertaken, the method for symmetry breaking is in question, and many people feel uncomfortable without a theorem suggesting the method is the right one. Since there are few theorems motivating symmetry breaking methods, it is often avoided.
What methods for symmetry breaking exist?
- Randomization. Neural Network learning algorithms which initialize the weights randomly exemplify this (a small sketch of why the randomness matters appears after this list). I consider the randomization approach particularly weak. It makes experiments non-repeatable, and it seems like the sort of solution that someone with asymmophobia would come up with if they were forced to do something asymmetric.
- Arbitrary. Arbitrary symmetry breaking is something like random, except there is no randomness—you simply declare this feature/label/example comes first and that one second. This seems mildly better than the randomized approach, but still not inspiring.
- Data-driven. Boosting is a good example where a data-driven approach drives symmetry breaking (over features). Data-driven approaches for symmetry breaking seem the most sound, as they can result in improved performance.
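To see why the randomization bullet above exists at all, here is a small illustration (my own toy sketch, not anything from the post): if every hidden unit of a network starts with identical weights, each receives an identical gradient and the units remain clones forever, so the symmetry is commonly broken with random initialization.

```python
# Tiny two-hidden-unit network, y_hat = w2 . tanh(W1 x), squared loss.
# With identical initial weights the hidden units stay identical forever;
# random initialization breaks that symmetry.
import numpy as np

def grad_step(W1, w2, x, y, lr=0.1):
    h = np.tanh(W1 @ x)
    err = (w2 @ h) - y
    gW1 = np.outer(err * w2 * (1 - h**2), x)   # backprop through tanh
    gw2 = err * h
    return W1 - lr * gW1, w2 - lr * gw2

x, y = np.array([1.0, -2.0]), 1.0

# symmetric start: both hidden units identical -> identical gradients
W1, w2 = np.ones((2, 2)), np.ones(2)
for _ in range(100):
    W1, w2 = grad_step(W1, w2, x, y)
print(np.allclose(W1[0], W1[1]))   # True: the units never differentiate

# random start: the symmetry is broken and the units can specialize
rng = np.random.default_rng(0)
W1, w2 = rng.normal(size=(2, 2)), rng.normal(size=2)
for _ in range(100):
    W1, w2 = grad_step(W1, w2, x, y)
print(np.allclose(W1[0], W1[1]))   # False: the units differ
```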
While there are examples of learning algorithms doing symmetry breaking for features, labels, and examples individually, there aren’t any I know of which do all three well. What would such an algorithm look like?
Good post. What is a good alternative to matlab for prototyping asymmetric algorithms?
I am a bit of a newcomer to ML, but I think the biggest reason for avoiding symmetry breaking is that a lot of theoretical papers don’t have a specific application domain in mind, and so ab initio there is no reason to assume an asymmetric structure. I guess by “data-driven” you mean designing algorithms which can become asymmetric, but I would assume under iid inputs with sufficient “symmetry” even boosting would also be “symmetric.” Naturally, an algorithm tuned to an application with asymmetries will have better empirical performance, as will an adaptive algorithm that can accommodate asymmetries “on the fly” as it were.
So at least to me, the problem is that often the mathematics comes first and the application domain later (of course, this is the problem everywhere in theory 🙂 )
Does active learning also fit into example asymmetry? In principle, wouldn’t an active learning and boosted tree-based algorithm fit your description? Boosted trees is already a very popular approach.
There’s also Ping Li’s ABC Boost (arXiv:0811.1250) which I think combines the aforementioned symmetry breaking properties of boosting with label asymmetry. If I remember correctly, the class that the weak learner messes up more often is treated in a special way in the next round.
Cascades of boosting stages (popularized in computer vision by the Viola & Jones paper) successfully handle both extreme label imbalance and extreme feature imbalance. Label imbalance is extreme in some vision problems because you care only about a small positive class (e.g. “faces”) versus all other possible natural image patches, of which you can generate an almost infinite amount. Feature imbalance is likewise extreme because most features in the very large feature space considered are not useful, and even if you could consider all of them, at test time you want a fast and therefore small hypothesis (the early-exit structure is sketched after this comment).
Unfortunately, although popular, these cascades have received very little theoretical analysis.
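A schematic sketch (my own simplification, loosely in the spirit of Viola-Jones style cascades rather than their actual detector) of the asymmetry this comment describes: cheap early stages reject most negative patches, so the expensive later stages only ever see the rare promising candidates. The stage functions and thresholds below are made up.

```python
# Sketch of a cascade: each stage is a (score_function, threshold) pair,
# assumed ordered from cheapest to most expensive.  A patch is rejected at
# the first stage it fails, so most negatives never reach the costly stages.
from typing import Callable, List, Tuple

Stage = Tuple[Callable[[float], float], float]

def cascade_classify(stages: List[Stage], patch: float) -> bool:
    for score, threshold in stages:
        if score(patch) < threshold:   # early exit: reject and stop paying
            return False
    return True                        # survived every stage: accept

# toy usage with made-up stages that just look at a single number
stages = [(lambda p: p, 0.2),   # very cheap, loose filter
          (lambda p: p, 0.5),   # stricter
          (lambda p: p, 0.9)]   # strictest, "expensive" stage
print([cascade_classify(stages, p) for p in (0.1, 0.6, 0.95)])
# -> [False, False, True]
```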
I don’t see why Matlab and matrix algebra preclude symmetry breaking. Simple counter-examples: (1) for “example symmetry,” in prediction problems, weighted least squares is not symmetric (different data points have different weight), but it is done with matrix operations; (2) for “feature symmetry,” in clustering, we can change the metric (e.g., substitute another weighting matrix in the expression for Mahalanobis distance) and still do this with matrix operations. What examples did you have in mind where matrix algebra precludes symmetry breaking?
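For what it’s worth, here is a minimal sketch of the commenter’s first counter-example: weighted least squares treats examples asymmetrically (each row gets its own weight) while staying entirely inside matrix algebra. The data and per-example weights below are synthetic.

```python
# Weighted least squares via matrix operations: beta = (X' W X)^{-1} X' W y,
# with W = diag(w) giving each example its own importance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # 100 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
w = rng.uniform(0.1, 10.0, size=100)            # per-example weights: example asymmetry

W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta)                                     # close to [1, -2, 0.5]
```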
Paul, I don’t know a better programming environment for asymmetric algorithms than general purpose languages.
Yisong, I would consider active learning an instance of example asymmetry and the triple you outline a plausible example. I suspect there is a better triple out there, at least computationally.
Nikos, ABC boost seems like a weak example of symmetry breaking to me, as many of the labels remain symmetrically treated.
guest, I agree that cascade style algorithms merit substantially more study than they have received so far. They seem to be an essential enabler of several computationally constrained applications.
Guest, note that ‘bias toward symmetric approaches’ does not mean ‘preclude symmetric approaches’.
I cannot help but think in terms of invariance when you mention symmetry, as is typical in physics.
In PCA, for instance, it is standard practice to shift and scale each variable to zero mean and unit magnitude. Thus the results have translation and scaling symmetries. If you have e.g. “temperature A” and “produced mass B”, the symmetries are in effect saying that the relationship between A and B is unaffected by (invariant to) A being in Celsius or Kelvin, or B being in kilograms or pounds.
Assuming invariance in PCA thus means removing information believed to be irrelevant to the classification. The batch algorithms you mention assume that the ordering of the examples is arbitrary: a permutation symmetry, or invariance, of the training set.
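A small sketch of the standardization step described above (synthetic data, illustrative variable names): after shifting each variable to zero mean and unit scale, the principal directions are unchanged by a per-variable affine change of units such as Celsius to Kelvin or kilograms to pounds.

```python
# Standardize, then take principal directions: the result is invariant to
# per-variable changes of units (an affine rescaling of each column).
import numpy as np

def standardized_pca_directions(X):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)     # zero mean, unit scale
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt                                    # directions of the correlation structure

rng = np.random.default_rng(0)
temp_c = rng.normal(20.0, 5.0, size=200)                  # "temperature A" in Celsius
mass_kg = 3.0 * temp_c + rng.normal(0.0, 2.0, size=200)   # "produced mass B" in kilograms

X_metric = np.column_stack([temp_c, mass_kg])
X_other = np.column_stack([temp_c + 273.15, 2.20462 * mass_kg])   # Kelvin, pounds

V1 = standardized_pca_directions(X_metric)
V2 = standardized_pca_directions(X_other)
print(np.allclose(np.abs(V1), np.abs(V2)))    # True: same directions up to sign
```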
Symmetry would therefore seem both a tool for input feature reduction and an Occam’s razor approach to making one’s model as simple as possible. I think little symmetry breaking has been done for general purpose algorithms because “a good break” is context dependent.
I believe online learning on time series data succeeds because it is used in the specific contexts where the stationarity assumption breaks down. (Stationarity is itself an assumption of time translation invariance.)
As for a hypothetical system that breaks symmetry in all three areas, I think you’ve already answered it in a sense. Online learning, which breaks example symmetry, and boosting, which breaks feature symmetry, can both be implemented as ensemble learning. Feature symmetry is also broken in ensemble systems that perform information fusion. Finally, I think ensemble systems like ECOC, which can combine one-class classifiers (already asymmetric) into multiclass classifiers, can be said to break label symmetry.
A hypothetical triple breaker could then be: an ECOC or similar combination of one-class classifiers, where the classifiers are possibly of different types suited to different subsets of the heterogeneous inputs, and which implements online learning, directly or through the ensemble. The bigger question is where we would find a context where all these symmetries are broken simultaneously.
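A rough sketch of the triple breaker proposed above, with many assumptions of my own and no claim that this is the right design: an ECOC combination over the labels (label asymmetry), one online learner per code column (example asymmetry), each restricted to its own feature subset (feature asymmetry). For brevity the column learners are plain online perceptrons rather than the one-class classifiers mentioned in the comment, and the code matrix, feature subsets, and toy data are all made up.

```python
# ECOC over labels + one online perceptron per code column, each column
# seeing only its own feature subset.  A sketch, not a finished algorithm.
import numpy as np

class ECOCOnline:
    def __init__(self, code, feature_subsets):
        self.code = code                     # (K labels) x (B columns), entries +/-1
        self.subsets = feature_subsets       # one array of feature indices per column
        self.w = [np.zeros(len(s)) for s in feature_subsets]

    def _column_signs(self, x):
        return np.array([np.sign(w @ x[s]) for w, s in zip(self.w, self.subsets)])

    def predict(self, x):
        # label whose code row best agrees with the per-column predictions
        return int(np.argmax(self.code @ self._column_signs(x)))

    def update(self, x, y):
        # one online perceptron step per column; examples arrive one at a time
        for b, (w, s) in enumerate(zip(self.w, self.subsets)):
            target = self.code[y, b]
            if np.sign(w @ x[s]) != target:
                w += target * x[s]

# toy usage: 4 labels coded with 3 columns, 6 features split into blocks
rng = np.random.default_rng(0)
code = np.array([[+1, +1, +1], [+1, -1, -1], [-1, +1, -1], [-1, -1, +1]])
subsets = [np.arange(0, 3), np.arange(3, 6), np.arange(0, 6)]
model = ECOCOnline(code, subsets)

centers = 3.0 * rng.normal(size=(4, 6))          # well-separated class means
for _ in range(2000):
    y = int(rng.integers(4))
    model.update(centers[y] + rng.normal(size=6), y)

# sanity check on fresh points drawn from the same clusters
print([model.predict(centers[y] + rng.normal(size=6)) == y for y in range(4)])
```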