All Models of Learning have Flaws

Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning.

The point here is not simply “woe unto us”. There are several implications which seem important.

  1. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes there “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students.
  2. Algorithms which conform to multiple approaches can have substantial value. “I don’t really understand it yet, because I only understand it one way”. Reinterpretation alone is not the goal—we want algorithmic guidance.
  3. We need to remain constantly open to new mathematical models of machine learning. It’s common to forget the flaws of the model that you are most familiar with in evaluating other models while the flaws of new models get exaggerated. The best way to avoid this is simply education.
  4. The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.

Here is a summary what is wrong with various frameworks for learning. To avoid being entirely negative, I added a column about what’s right as well.

Name Methodology What’s right What’s wrong
Bayesian Learning You specify a prior probability distribution over data-makers, P(datamaker) then use Bayes law to find a posterior P(datamaker|x). True Bayesians integrate over the posterior to make predictions while many simply use the world with largest posterior directly. Handles the small data limit. Very flexible. Interpolates to engineering.
  1. Information theoretically problematic. Explicitly specifying a reasonable prior is often hard.
  2. Computationally difficult problems are commonly encountered.
  3. Human intensive. Partly due to the difficulties above and partly because “first specify a prior” is built into framework this approach is not very automatable.
Graphical/generative Models Sometimes Bayesian and sometimes not. Data-makers are typically assumed to be IID samples of fixed or varying length data. Data-makers are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist. Relative to pure Bayesian systems, this approach is sometimes computationally tractable. More importantly, the graph language is natural, which aids prior elicitation.
  1. Often (still) fails to fix problems with the Bayesian approach.
  2. In real world applications, true conditional independence is rare, and results degrade rapidly with systematic misspecification of conditional independence.
Convex Loss Optimization Specify a loss function related to the world-imposed loss fucntion which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optima. Mathematically clean solutions where computational tractability is partly taken into account. Relatively automatable.
  1. The temptation to forget that the world imposes nonconvex loss functions is sometimes overwhelming, and the mismatch is always dangerous.
  2. Limited models. Although switching to a convex loss means that some optimizations become convex, optimization on representations which aren’t single layer linear combinations is often difficult.
Gradient Descent Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters. Relatively computationally tractable due to (a) modularity of gradient descent (b) directly optimizing the quantity you want to predict.
  1. Finicky. There are issues with paremeter initialization, step size, and representation. It helps a great deal to have accumulated experience using this sort of system and there is little theoretical guidance.
  2. Overfitting is a significant issue.
Kernel-based learning You chose a kernel K(x,x’) between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning. People often find the specification of a similarity function between objects a natural way to incorporate prior information for machine learning problems. Algorithms (like SVMs) for training are reasonably practical—O(n2) for instance. Specification of the kernel is not easy for some applications (this is another example of prior elicitation). O(n2) is not efficient enough when there is much data.
Boosting You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor. A focus on getting something that works quickly is natural. This approach is relatively automated and (hence) easy to apply for beginners. The boosting framework tells you nothing about how to build that initial algorithm. The weak learning assumption becomes violated at some point in the iterative process.
Online Learning with Experts You make many base predictors and then a master algorithm automatically switches between the use of these predictors so as to minimize regret. This is an effective automated method to extract performance from a pool of predictors. Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction.
Learning Reductions You solve complex machine learning problems by reducing them to well-studied base problems in a robust manner. The reductions approach can yield highly automated learning algorithms. The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm.
PAC Learning You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a near-best hypothesis amongst a given set of hypotheses in a computationally tractable manner. The focus on computation is pretty right-headed, because we are ultimately limited by what we can compute. There are not many substantial positive results, particularly when D is noisy. Data isn’t IID in practice anyways.
Statistical Learning Theory You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a near-best hypothesis from a set of hypotheses. There are substantially more positive results than for PAC Learning, and there are a few examples of practical algorithms directly motivated by this analysis. The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness).
Decision tree learning Learning is a process of cutting up the input space and assigning predictions to pieces of the space. Decision tree algorithms are well automated and can be quite fast. There are learning problems which can not be solved by decision trees, but which are solvable. It’s common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking.
Algorithmic complexity Learning is about finding a program which correctly predicts the outputs given the inputs. Any reasonable problem is learnable with a number of samples related to the description length of the program. The theory literally suggests solving halting problems to solve machine learning.
RL, MDP learning Learning is about finding and acting according to a near optimal policy in an unknown Markov Decision Process. We can learn and act with an amount of summed regret related to O(SA) where S is the number of states and A is the number of actions per state. Has anyone counted the number of states in real world problems? We can’t afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways.
RL, POMDP learning Learning is about finding and acting according to a near optimaly policy in a Partially Observed Markov Decision Process In a sense, we’ve made no assumptions, so algorithms have wide applicability. All known algorithms scale badly with the number of hidden states.

This set is incomplete of course, but it forms a starting point for understanding what’s out there. (Please fill in the what/pro/con of anything I missed.)

Create Your Own ICML Workshop

As usual ICML 2007 will be hosting a workshop program to be held this year on June 24th. The success of the program depends on having researchers like you propose interesting workshop topics and then organize the workshops. I’d like to encourage all of you to consider sending a workshop proposal. The proposal deadline has been extended to March 5. See the workshop web-site for details.

Organizing a workshop is a unique way to gather an international group of researchers together to focus for an entire day on a topic of your choosing. I’ve always found that the cost of organizing a workshop is not so large, and very low compared to the benefits. The topic and format of a workshop are limited only by your imagination (and the attractiveness to potential participants) and need not follow the usual model of a mini-conference on a particular ML sub-area. Hope to see some interesting proposals rolling in.

Thoughts regarding “Is machine learning different from statistics?”

Given John’s recent posts on CMU’s new machine learning department and “Deep Learning,” I asked for an opportunity to give a computational learning theory perspective on these issues.

To my mind, the answer to the question “Are the core problems from machine learning different from the core problems of statistics?” is a clear Yes. The point of this post is to describe a core problem in machine learning that is computational in nature and will appeal to statistical learning folk (as an extreme example note that if P=NP– which, for all we know, is true– then we would suddenly find almost all of our favorite machine learning problems considerably more tractable).

If the central question of statistical learning theory were crudely summarized as “given a hypothesis with a certain loss bound over a test set, how well will it generalize?” then the central question of computational learning theory might be “how can we find such a hypothesis efficently (e.g., in polynomial-time)?”

With this in mind, let’s spend the rest of the post thinking about agnostic learning. The following definition is due to Kearns, Schapire, and Sellie:

Fix a function class C. Given a training set drawn from an arbitrary distribution D and *arbitrarily labeled* (that is, we are given samples from an arbitrary distribution D over some set X x Y), efficiently output a hypothesis that is competitive with the best function in C with respect to D. If there exists a provably efficient solution for the above problem we say C is agnostically learnable with respect to D. Notice the difference between agnostic and PAC learning: we *do not assume* that the data is labeled according to a fixed function (I find this to be a common criticism of the PAC model).

To make this a little more concrete, let C be the class of halfspaces, let X = Rn, and let Y = {-1,1}. Let opt = min{h in C} (classification error of h with respect to D). Then the goal is, given an *arbitrarily* labeled data set of points in Rn , find a hypothesis whose true error with respect to D is at most opt + eps. Furthermore the solution should provably run in time polynomial in n and 1/eps. Note that the hypothesis we output *need not be a halfspace*– we only require that its error rate is close to the error rate of the optimal halfspace.

Why choose C to be halfspaces? At the heart of many of our most popular learning tools, such as Support Vector Machines, is an algorithm for learning a halfspace over a set of features in many dimensions. Perhaps surprisingly, *it is still not known*, given samples from an arbitrary distribution on X x Y, how to efficiently ‘find the best fitting halfspace’ over a set of features. In fact, if we require our algorithm to output a halfspace as its hypothesis then the agnostic halfspace learning problem is NP-hard. Recent results have shown that even in the case where there is a halfspace with error rate .0001 with respect to D it is NP-hard to output a halfspace with error rate .49999 (achieving .5 is of course trivial).

The astute reader here may comment that I am ignoring the whole point of the ‘kernel trick,’ which is to provide an implicit representation of a halfspace over some feature set in many dimensions, therefore bypassing the above hardness result. While this is true, I am still not aware of any results in the statistical learning literature that give provably efficient algorithms for agnostically learning halfspaces (regardless of the choice of kernel).

My larger point is that a fundamental approach from statistical learning theory (running an SVM) is deeply connected to strong negative results from computational complexity theory. Still, as I explain in the next paragraph, finding the ‘best fitting halfspace’ remains an open and important computational problem.

Negative results not withstanding, it now seems natural to ask ‘well, can we find any efficient algorithms for agnostically learning halfspaces if we allow ourselves to output hypotheses that are not halfspaces?’ Briefly, the answer is ‘to some extent Yes,’ if we restrict the marginal distribution on X to be one of our favorite nice distributions (such as Gaussian over R^n). The solution runs in polynomial-time for any constant eps > 0 (for various reasons involving the ‘noisy XOR’ topic three posts ago, obtaining efficient algorithms for subconstant eps seems intractable).

The complexity of agnostically learning halfspaces with respect to arbitrary distributions remains open. A solution (if one exists) will require a powerful new algorithmic idea and will have important consequences regarding our most commonly used machine learning tools.

To respond to Andrej Bauer’s comment three posts ago:

‘I guess I have a more general question. I understand that for the purposes of proving lower bounds in computational complexity it’s reasonable to show theorems like “can’t learn XOR with NOT, AND, OR.” But I always thought that machine learning was more “positive”, i.e., you really want to learn whenever possible. So, instead of dispairing one should try to use a different bag of tricks.’

I think Andrej is right. Solving future machine learning problems will require a more sophisticated ‘bag of tricks’ than the one we have now, as most of our algorithms are based on learning halfspaces (in a noisy or agnostic setting). Computational learning theory gives us a methodology for how to proceed: find provably efficient solutions to unresolved computational problems– novel learning algorithms will no doubt follow.

Parallel Machine Learning Problems

Parallel machine learning is a subject rarely addressed at machine learning conferences. Nevertheless, it seems likely to increase in importance because:

  1. Data set sizes appear to be growing substantially faster than computation. Essentially, this happens because more and more sensors of various sorts are being hooked up to the internet.
  2. Serial speedups of processors seem are relatively stalled. The new trend is to make processors more powerful by making them multicore.
    1. Both AMD and Intel are making dual core designs standard, with plans for more parallelism in the future.
    2. IBM’s Cell processor has (essentially) 9 cores.
    3. Modern graphics chips can have an order of magnitude more separate execution units.

    The meaning of ‘core’ varies a bit from processor to processor, but the overall trend seems quite clear.

So, how do we parallelize machine learning algorithms?

  1. The simplest and most common technique is to simply run the same learning algorithm with different parameters on different processors. Cluster management software like OpenMosix, Condor, or Torque are helpful here. This approach doesn’t speed up any individual run of a learning algorithm.
  2. The next simplest technique is to decompose a learning algorithm into an adaptive sequence of statistical queries and parallelize the queries over the sample. This paper (updated from the term paper according to a comment) points out that statistical integration can be implemented via MapReduce which Google popularized (the open source version is Hadoop). The general technique of parallel statistical integration is already used by many people including IBM’s Parallel Machine Learning Toolbox. The disadvantage of this approach is that it is particularly good at speeding up slow algorithms. One example of a fast sequential algorithm is perceptron. The perceptron works on a per example basis making individual updates extremely fast. It is explicitly not a statistical query algorithm.
  3. The most difficult method for speeding up an algorithm is fine-grained structural parallelism. The brain works in this way: each individual neuron operates on it’s own. The reason why this is difficult is that the programming is particularly tricky—you must carefully optimize to avoid latency in network communication. The full complexity of parallel programming is exposed.

A basic question is: are there other approaches to speeding up learning algorithms which don’t incur the full complexity of option (3)? One approach is discussed here.