Wielding a New Abstraction

This post is partly meant as an advertisement for the reductions tutorial Alina, Bianca, and I are planning to do at ICML. Please come, if you are interested.

Many research programs can be thought of as finding and building new useful abstractions. The running example I'll use is learning reductions, where I have experience. The basic abstraction here is that we can build a learning algorithm capable of solving classification problems up to a small expected regret. This is used repeatedly to solve more complex problems.

In working on a new abstraction, I think you typically run into many substantial problems of understanding, which make publishing particularly difficult.

  1. It is difficult to seriously discuss the reason behind or mechanism for an abstraction in a conference paper with small page limits. People rarely see such discussions and hence have little basis on which to think about new abstractions. Another difficulty is that when building an abstraction, you often don't know the right way to state things.

    Here’s my current attempt: The process of abstraction for learning reductions can start with sample complexity bounds (or online learning against an adversary analysis). A very simple sample complexity bound is that for all sets of hypotheses H, for all distributions D on examples (x,y), and for all confidence parameters d

    Pr_{(x,y)^m ~ D^m}( for all h in H: |e(h,D) - e(h,(x,y)^m)| < (ln(|H|/d)/m)^{0.5} ) > 1 - d

    Here (x,y)^m is a sequence of m IID samples, e(h,D) is the error rate of h on D, and e(h,(x,y)^m) is the empirical error rate of h on the set of IID samples.
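
    As a sanity check on what the bound actually gives, here is a minimal numeric sketch in Python (the hypothesis count, sample size, and confidence value are made-up illustration numbers):

    ```python
    from math import log, sqrt

    def deviation_bound(num_hypotheses: int, m: int, d: float) -> float:
        # With probability at least 1 - d over the draw of m IID samples, every
        # h in H has |e(h,D) - e(h,(x,y)^m)| below this value.
        return sqrt(log(num_hypotheses / d) / m)

    # Illustration: 2^20 hypotheses, 10,000 samples, d = 0.05.
    print(deviation_bound(2 ** 20, 10_000, 0.05))  # ~0.041
    ```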

    The previous bound is a very simple example, and yet remarkably complex both to state and to interpret—many people get tripped up just by the meaning of d. The impact of this complexity is that it is difficult to effectively use these bounds in practical learning algorithm design, particularly in solving more complex learning problems where much more than one bit of prediction is required. This was a central frustration that I ran into in my thesis work. Some progress has been made since then, but it is still quite difficult. The abstraction in the learning reduction setting is:

    1. You throw away d, because it only has a logarithmic dependence anyway.
    2. You eliminate H and m on the theory that intelligent choices for H and m are made in practice.
    3. You eliminate the IID assumption, because it is no longer needed to define things.

    The statement then is

    e(A((x,y)^m), D) - e(h*, D) < eps

    where A() is the hypothesis output by the learning algorithm, h* is the best possible predictor, and eps is used to parameterize the theorems. This abstraction is radical in some sense, but something radical was needed to yield tractable and useful analysis on the complex problems people need to solve in practice.
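
    Concretely, the abstraction becomes a black-box interface that reductions consume. Here is a minimal sketch in Python (the names and the one-against-all example are mine, purely for illustration):

    ```python
    from typing import Callable, List, Tuple

    # The abstraction: a classifier "oracle" takes labeled examples and returns a
    # predictor assumed to have small regret eps on its problem. No H, no m, no d,
    # and no IID assumption appears in the interface.
    Example = Tuple[List[float], int]            # (features, binary label)
    Predictor = Callable[[List[float]], int]
    Oracle = Callable[[List[Example]], Predictor]

    def one_against_all(data: List[Tuple[List[float], int]],
                        num_labels: int,
                        learn: Oracle) -> Predictor:
        """A classic reduction: k-class classification via k binary problems.
        Its guarantee is stated in terms of the oracle's regret eps, which is
        exactly the abstraction described above."""
        binary_predictors = [
            learn([(x, 1 if y == label else 0) for x, y in data])
            for label in range(num_labels)
        ]
        def predict(x: List[float]) -> int:
            # Pick a label whose binary predictor fires (ties broken by index).
            scores = [p(x) for p in binary_predictors]
            return max(range(num_labels), key=lambda l: scores[l])
        return predict
    ```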

  2. A consequence of lack of familiarity is that people often misread. In reading a paper, there is a temptation to not read carefully and to fill in your own understanding of things. Most of the time this works out well, but not here. For example, we saw many instances where people inserted IID sample assumptions or other things that simply weren't there.
  3. Once you get past the lack of familiarity and misunderstandings, there is a feeling that the new abstraction is cheating. To some extent I understand, as I remember learning about abstractions in class, and I remember feeling that they were in some real sense cheating by dropping important details. For example:
    1. Big-O notation provides an upper bound specified up to constants. For example, O(log n) computational complexity means there exists a constant c such that the number of operations required is less than c log n. Big-O can be abused by hiding “constants” larger than the plausible values for the parameters. In machine learning, a particularly egregious case occurs in bandit analysis, where the punchline of some papers is “logarithmic regret”, hiding an arbitrarily large problem-dependent constant.
    2. TCP provides a mechanism for reliable transport over an unreliable network. It is a very commonly used mechanism for sending information over the internet—you used TCP in reading this. TCP is both a programming construct and a mechanism for abstracting communication over a network. The TCP abstraction is broken when the network is too unreliable for it to recover, such as on sketchy wireless networks, where a programmer who built for the TCP abstraction finds it isn't delivered.
    3. Dimensional analysis is a technique for quick estimation in physics. The basic idea is to look at the units of the quantities involved and combine them so the answer comes out with the right units. For example, to estimate the distance d traveled after time t under acceleration a, you simply use at^2, since that is the only way to combine a (units of distance/time^2) with t (units of time) to get units of distance. This answer is off by a factor of 2 from what a more detailed analysis using integration yields, which is typical (a one-line check follows this list). Dimensional analysis can be misleading when the constants are very large. One example is in Gravitation, where there is a table equating time and distance since they are related by a constant, the speed of light 3*10^8 m/s. For example, E=mc^2 becomes E=m.
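
    For the kinematics example, the factor-of-2 gap comes straight from the integration (assuming constant acceleration from rest):

    ```latex
    d = \int_0^t v(t')\,dt' = \int_0^t a t'\,dt' = \tfrac{1}{2} a t^2
    \qquad \text{vs. the dimensional estimate } a t^2 .
    ```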

    Although the above breakages are real, the usefulness of these abstractions, in terms of allowing us to quickly think about problems and make decisions, more than offsets the drawbacks. Indeed, even the breakages stated above are thought provoking or useful enough that I can't even say it is wrong to consider them. This property that abstractions can be abused is generically essential to the process of abstraction itself. Abstraction is about neglecting details, and when those details are not neglectable, the abstraction is abused or ineffective. Because of this, any abstraction is insufficient for analyzing and solving real problems where the neglected details matter.

    Just as for these abstractions, the learning reduction abstraction can be abused—the chosen learning algorithm can be pathetic, yielding vacuous bounds, or the reduction can scramble the feature information with an encryption algorithm so that no reasonable learning algorithm could achieve anything better than pathetic performance. Similarly, there are situations in which I don't know how to effectively use a learning reduction to build a learning algorithm, and it seems implausible that this observation will change as more is learned in the future.

For a good abstraction, the drawbacks are matched by the advantages. The principal advantage is that there is a new way to examine and solve problems. This has several interesting effects.

  1. A good abstraction can capture a more complete specification of the problem. As an example, the sample complexity view of learning is broken in practice, because when insufficient performance is achieved, people choose a different set of hypotheses H by throwing in additional features or choosing a different learning algorithm. Capturing this process in the sample complexity view requires an additional level of complexity. In the reduction view, this is entirely natural, because any means for achieving better generalization (more/better features, a better learning algorithm, a better prior, sticking a human in the learning process, etc.) is legitimate. This is particularly powerful when architecting solutions, providing a partial answer to the “What?” question Yehuda pointed out.
  2. A higher level abstraction can let you accidentally solve problems in other areas as well. A good example of this is error correcting tournaments which are useful for tournament design to select the best player/team/paper in real tournaments. Recently, I was amused to learn that a standard betting procedure for basketball tournaments exactly mirrors the importance weights suggested for the final elimination of ECTs. The first phase of ECTs provides a sound and practical method to seed a final elimination tournament, eliminating the need for (and biases of) a committee.
  3. Perhaps the most interesting effect is that the new abstraction can aid you in finding effective solutions to new problems. For learning reductions, there are about 3 compelling instances I’ve seen so far.
    1. Given training-time access to a good policy oracle, Searn provides a method for decomposing any complex prediction problem into simple problems, such that low regret solutions to the simple problems imply a low regret solution to the original problem. While Searn competes well (computationally and prediction-wise) with existing methods for linear chain style structured prediction, it really shines on more complex problems. Hal used Searn for automatic document summarization (see section 6.2) which previously wasn’t really solved via ML. More generally, when I learn about the details of other complex prediction systems for machine translation or vision, the base algorithms are tweaked, typically in ways that Searn would suggest. This suggests that Searn formalizes and automates the intuitions of practical people.
    2. The “one step RL” reduction in Bianca’s thesis (page 119) provided tractable and effective approaches to learning in partial feedback problems where only the loss of the chosen label is observed. An even simpler reduction exists as a matter of folklore—estimate the value of each label and then take an argmax. However, we have found that classification approaches generally work better where applicable, as the theory suggests.
    3. Many commonly used algorithms for prediction have a running time linear (or worse) in the number of labels, with decision trees being a notable exception. While simply predicting faster isn't normally solving a “new problem”, an exponential improvement in computational time seems to merit this description because it allows entirely new kinds of applications. It turns out that it is both very easy to do logarithmic time prediction wrong, and that this problem is often fixable. Furthermore, it appears logarithmic time prediction can really work in practice over very many labels (a minimal routing sketch follows this list).
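
    To make the structure of logarithmic time prediction concrete, here is a minimal routing sketch in Python. It is only an illustration of the idea (the node classifiers are arbitrary callables, and training them well is exactly where it is easy to go wrong), not the specific algorithm referenced above:

    ```python
    from typing import List

    class LabelTreeNode:
        """A node in a balanced binary tree over labels. Internal nodes hold a
        binary classifier routing an example left or right; leaves hold a label."""
        def __init__(self, label=None, router=None, left=None, right=None):
            self.label = label      # set at leaves
            self.router = router    # features -> 0 (go left) or 1 (go right)
            self.left = left
            self.right = right

    def predict(node: LabelTreeNode, x: List[float]) -> int:
        # O(log k) classifier evaluations for k labels, versus O(k) for a flat
        # arg-max-over-all-labels predictor.
        while node.label is None:
            node = node.right if node.router(x) else node.left
        return node.label

    # Toy usage with 4 labels and made-up routers.
    leaf = lambda y: LabelTreeNode(label=y)
    tree = LabelTreeNode(
        router=lambda x: int(x[0] > 0),
        left=LabelTreeNode(router=lambda x: int(x[1] > 0), left=leaf(0), right=leaf(1)),
        right=LabelTreeNode(router=lambda x: int(x[1] > 0), left=leaf(2), right=leaf(3)),
    )
    print(predict(tree, [0.5, -1.0]))  # -> 2
    ```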

When we started working on learning reductions, I had no idea what either the difficulties or rewards were going to be—it simply seemed like a natural and compelling direction of investigation. Given the substantial difficulties encountered, it’s not at all clear that this pursuit was personally worthwhile. It has cost much time which could have been put to good use in other ways.

On the other hand, the advantages are also substantial. I've learned something about architecting solutions to problems, both expanding the domain of application for the field and providing a personal edge that I can bring to many conversations about ML. It's also progress towards the AI goal, which interests me. When I think of what I could have worked on instead to achieve these goals, I don't have any more compelling answer yet. Learning reductions seem to have accomplished more per unit thought than any other theoretical approach I can identify over the last 5 or 6 years. Furthermore, they are composable by design, so they should stay relevant (and perhaps even become more so) when people use an online active deep semisupervised probabilistic convolutional algorithm to solve a problem, particularly for complex problems.

As I said at the beginning, please join us for the tutorial, if you are interested.

Interesting Presentations at Snowbird

Here are a few of the presentations that interested me at the Snowbird learning workshop (which, amusingly, was held in Florida with AIStat).

  1. Thomas Breuel described machine learning problems within OCR and an open source OCR software/research platform with modular learning components, as well as a dataset of size 60 million derived from Google's scanned books.
  2. Kristen Grauman and Fei-Fei Li discussed using active learning with differing label costs and large datasets for image ontologies. Both of them used Mechanical Turk as a labeling system, which looks set to become routine, at least for vision problems.
  3. Russ Tedrake discussed using machine learning for control, with a basic claim that it was the way to go for problems involving a medium Reynolds number, such as bird flight, where simulation is extremely computationally intense.
  4. Yann LeCun presented a poster on an FPGA implementation of convolutional neural networks yielding a factor of 100 speedup in processing. Alongside the graphics processor approach Rajat has worked on, this seems like an effective way to deal with the need to compute many dot products.

Asymmophobia

One striking feature of many machine learning algorithms is the gymnastics that designers go through to avoid symmetry breaking. In the most basic form of machine learning, there are labeled examples composed of features. Each of these can be treated symmetrically or asymmetrically by algorithms.

  1. feature symmetry Every feature is treated the same. In gradient update rules, the same update is applied whether the feature is first or last. In metric-based predictions, every feature is just as important in computing the distance.
  2. example symmetry Every example is treated the same. Batch learning algorithms are great exemplars of this approach.
  3. label symmetry Every label is treated the same. This is particularly noticeable in multiclass classification systems which predict according to arg max_l w_l x (sketched just after this list), but it occurs in many other places as well.
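
As a concrete instance of label symmetry, here is the flat arg max_l w_l x predictor in a few lines of Python (the sizes are made up for illustration):

```python
import numpy as np

def predict_symmetric(W: np.ndarray, x: np.ndarray) -> int:
    """Flat multiclass prediction: one weight vector per label, with every label
    scored in exactly the same way. Cost is linear in the number of labels."""
    return int(np.argmax(W @ x))

# Illustration: 5 labels, 3 features.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
x = rng.normal(size=3)
print(predict_symmetric(W, x))
```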

Empirically, breaking symmetry well seems to yield great algorithms.

  1. feature asymmetry For those who like the “boosting is stepwise additive regression on exponential loss” viewpoint (I don’t entirely), boosting is an example of symmetry breaking on features.
  2. example asymmetry Online learning introduces an example asymmetry. Aside from providing a mechanism for large scale learning, it also enables learning in entirely new (online) settings.
  3. label asymmetry Tree structured algorithms are good instances of label asymmetry. This includes both the older decision tree approaches like C4.5 and some newer ones we've worked on. These approaches are exponentially faster in the number of labels than more symmetric approaches.

The examples above are notably important, with good symmetry breaking approaches yielding substantially improved prediction or computational performance. Given such strong evidence that symmetry breaking is a desirable property, a basic question is: Why isn’t it more prevalent, and more thoroughly studied? One reasonable answer is that doing symmetry breaking well requires more serious thought about learning algorithm design, so researchers simply haven’t gotten to it. This answer appears incomplete.

A more complete answer is that many researchers seem to reflexively avoid symmetry breaking. A simple reason for this is the now pervasive use of Matlab in machine learning. Matlab is a handy tool for fast prototyping of learning algorithms, but it has an intrinsic language-level bias towards symmetric approaches, since there are built-in primitives for matrix operations. A more complex reason is a pervasive reflexive belief in fairness. While this is admirable when reviewing papers, it seems less so when designing learning algorithms. A third, related reason seems to be a fear of doing unmotivated things. Any time symmetry breaking is undertaken, the method for symmetry breaking is in question, and many people feel uncomfortable without a theorem suggesting the method is the right one. Since there are few theorems motivating symmetry breaking methods, it is often avoided.

What methods for symmetry breaking exist?

  1. Randomization. Neural network learning algorithms which initialize the weights randomly exemplify this. I consider the randomization approach particularly weak. It makes experiments non-repeatable, and it seems like the sort of solution that someone with asymmophobia would come up with if they were forced to do something asymmetric.
  2. Arbitrary. Arbitrary symmetry breaking is something like random, except there is no randomness—you simply declare this feature/label/example comes first and that one second. This seems mildly better than the randomized approach, but still not inspiring.
  3. Data-driven. Boosting is a good example where a data-driven approach drives symmetry breaking (over features). Data-driven approaches for symmetry breaking seem the most sound, as they can result in improved performance. (A toy illustration of all three options follows this list.)
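
As a toy illustration of the three options (my own framing, not taken from any particular algorithm), here is how each might order a set of features before an order-dependent learner sees them:

```python
import numpy as np

def order_features(X: np.ndarray, method: str = "data") -> np.ndarray:
    """Return an ordering of feature indices, breaking the symmetry among features.
    X is an (examples x features) matrix."""
    n_features = X.shape[1]
    if method == "random":
        # Randomized: non-repeatable unless the seed is pinned.
        return np.random.permutation(n_features)
    if method == "arbitrary":
        # Arbitrary: whatever order the features happen to arrive in.
        return np.arange(n_features)
    # Data-driven: for example, most variable features first. (Boosting's greedy
    # weak-learner selection is a far more sophisticated data-driven choice.)
    return np.argsort(-X.var(axis=0))
```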

While there are examples of learning algorithms doing symmetry breaking for features, labels, and examples individually, there aren’t any I know which do all three, well. What would such an algorithm look like?

Machine Learning is too easy

One of the remarkable things about machine learning is how diverse it is. The viewpoints of Bayesian learning, reinforcement learning, graphical models, supervised learning, unsupervised learning, genetic programming, etc… share little enough overlap that many people can and do make their careers within one without touching, or even necessarily understanding, the others.

There are two fundamental reasons why this is possible.

  1. For many problems, many approaches work in the sense that they do something useful. This is true empirically, where for many problems we can observe that many different approaches yield better performance than any constant predictor. It’s also true in theory, where we know that for any set of predictors representable in a finite amount of RAM, minimizing training error over the set of predictors does something nontrivial when there are a sufficient number of examples.
  2. There is nothing like a unifying problem defining the field. In many other areas there are unifying problems for which finer distinctions in approaches matter.

The implications of this observation agree with what we see on inspecting the field.

  1. Particular problems are often “solved” by the first technique applied to them. This is particularly exacerbated when some other field first uses machine learning, as people there may not be aware of other approaches. A popular example of this is Naive Bayes for spam. In other fields, the baseline method is a neural network, an SVM, a graphical model, etc…
  2. The analysis of new learning algorithms is often subject to assumptions designed for the learning algorithm. Examples include large margins for support vector machines, a ‘correct’ Bayesian prior, correct conditional independence assumptions for graphical models, etc… Given such assumptions, it's unsurprising to learn that the algorithm is the right method, and justifying a new algorithm becomes an exercise in figuring out a natural-sounding assumption under which the algorithm performs well. This assumption set selection problem is the theoretician's version of the data set selection problem.

A basic problem is: How do you make progress in a field with this (lack of) structure? And what does progress even mean? Some possibilities are:

  1. Pick an approach and push it. [Insert your favorite technique] everywhere.
  2. Find new real problems and apply ML. The fact that ML is easy means there is a real potential for doing great things this way.
  3. Find a hard problem and work on it. Although almost anyone can do something nontrivial on most problems, achieving best-possible performance on some problems is not at all easy.
  4. Make the problem harder. Create algorithms that work fast online, in real time, with very few or no labeled examples, but very many examples, very many features, and very many things to predict.

I am least fond of approach (1), although many people successfully follow it for their careers. What's frustrating about approach (1) is that there does not seem to be any single simple philosophy capable of solving all the problems we might recognize as machine learning problems. Consequently, people following approach (1) are at risk of being outpersuaded by someone sometime in the future.

Approach (2) is perhaps the easiest way to accomplish great things, and in some sense much of the field's advancement comes from new applications.

Approach (3) seems solid, promoting a different kind of progress than approach (2).

Approach (4) seems particularly cool to me at the moment. It is not as specialized as (2) or (3), and many of the constraints seem complementary. For example, to a first approximation, large scale learning = online learning.