All Models of Learning have Flaws

Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning.

The point here is not simply “woe unto us”. There are several implications which seem important.

1. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes there “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students.
2. Algorithms which conform to multiple approaches can have substantial value. “I don’t really understand it yet, because I only understand it one way”. Reinterpretation alone is not the goal—we want algorithmic guidance.
3. We need to remain constantly open to new mathematical models of machine learning. It’s common to forget the flaws of the model that you are most familiar with in evaluating other models while the flaws of new models get exaggerated. The best way to avoid this is simply education.
4. The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.

Here is a summary what is wrong with various frameworks for learning. To avoid being entirely negative, I added a column about what’s right as well.

 Name Methodology What’s right What’s wrong Bayesian Learning You specify a prior probability distribution over data-makers, P(datamaker) then use Bayes law to find a posterior P(datamaker|x). True Bayesians integrate over the posterior to make predictions while many simply use the world with largest posterior directly. Handles the small data limit. Very flexible. Interpolates to engineering. Information theoretically problematic. Explicitly specifying a reasonable prior is often hard. Computationally difficult problems are commonly encountered. Human intensive. Partly due to the difficulties above and partly because “first specify a prior” is built into framework this approach is not very automatable. Graphical/generative Models Sometimes Bayesian and sometimes not. Data-makers are typically assumed to be IID samples of fixed or varying length data. Data-makers are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist. Relative to pure Bayesian systems, this approach is sometimes computationally tractable. More importantly, the graph language is natural, which aids prior elicitation. Often (still) fails to fix problems with the Bayesian approach. In real world applications, true conditional independence is rare, and results degrade rapidly with systematic misspecification of conditional independence. Convex Loss Optimization Specify a loss function related to the world-imposed loss fucntion which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optima. Mathematically clean solutions where computational tractability is partly taken into account. Relatively automatable. The temptation to forget that the world imposes nonconvex loss functions is sometimes overwhelming, and the mismatch is always dangerous. Limited models. Although switching to a convex loss means that some optimizations become convex, optimization on representations which aren’t single layer linear combinations is often difficult. Gradient Descent Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters. Relatively computationally tractable due to (a) modularity of gradient descent (b) directly optimizing the quantity you want to predict. Finicky. There are issues with paremeter initialization, step size, and representation. It helps a great deal to have accumulated experience using this sort of system and there is little theoretical guidance. Overfitting is a significant issue. Kernel-based learning You chose a kernel K(x,x’) between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning. People often find the specification of a similarity function between objects a natural way to incorporate prior information for machine learning problems. Algorithms (like SVMs) for training are reasonably practical—O(n2) for instance. Specification of the kernel is not easy for some applications (this is another example of prior elicitation). O(n2) is not efficient enough when there is much data. Boosting You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor. A focus on getting something that works quickly is natural. This approach is relatively automated and (hence) easy to apply for beginners. The boosting framework tells you nothing about how to build that initial algorithm. The weak learning assumption becomes violated at some point in the iterative process. Online Learning with Experts You make many base predictors and then a master algorithm automatically switches between the use of these predictors so as to minimize regret. This is an effective automated method to extract performance from a pool of predictors. Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction. Learning Reductions You solve complex machine learning problems by reducing them to well-studied base problems in a robust manner. The reductions approach can yield highly automated learning algorithms. The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm. PAC Learning You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a near-best hypothesis amongst a given set of hypotheses in a computationally tractable manner. The focus on computation is pretty right-headed, because we are ultimately limited by what we can compute. There are not many substantial positive results, particularly when D is noisy. Data isn’t IID in practice anyways. Statistical Learning Theory You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a near-best hypothesis from a set of hypotheses. There are substantially more positive results than for PAC Learning, and there are a few examples of practical algorithms directly motivated by this analysis. The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness). Decision tree learning Learning is a process of cutting up the input space and assigning predictions to pieces of the space. Decision tree algorithms are well automated and can be quite fast. There are learning problems which can not be solved by decision trees, but which are solvable. It’s common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking. Algorithmic complexity Learning is about finding a program which correctly predicts the outputs given the inputs. Any reasonable problem is learnable with a number of samples related to the description length of the program. The theory literally suggests solving halting problems to solve machine learning. RL, MDP learning Learning is about finding and acting according to a near optimal policy in an unknown Markov Decision Process. We can learn and act with an amount of summed regret related to O(SA) where S is the number of states and A is the number of actions per state. Has anyone counted the number of states in real world problems? We can’t afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways. RL, POMDP learning Learning is about finding and acting according to a near optimaly policy in a Partially Observed Markov Decision Process In a sense, we’ve made no assumptions, so algorithms have wide applicability. All known algorithms scale badly with the number of hidden states.

This set is incomplete of course, but it forms a starting point for understanding what’s out there. (Please fill in the what/pro/con of anything I missed.)

24 Replies to “All Models of Learning have Flaws”

1. dennis says:

It would be helpful if you added
another column that identified
specific approaches for each
class of learning method.

2. “The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.” John, the value of theory is in its ideas, and you may be giving it short shrift here. For example, SVMs and Boosting first appeared in COLT (and EuroCOLT), which is a solidly theoretical conference. It’s difficult to think of machine learning papers that have been as applicable or influential. Do you think the theories in these papers were broadly tested before publication? For that matter, we may as well go back to Valiant’s original paper on PAC learning. I suspect his paper continues to have a deep influence on the way theorists and practitioners think about learning abstractly (this is not always easy to quantify by the usual metrics), although it may not predict much actual phenomena.

3. The only thing that bothers me is the conflation of statistical principles with computational methods. For example, the table lists “Bayesian learning,” “graphical/generative models,” and “gradient descent” (among others) as separate methods. But gradient descent is an optimization algorithm, while Bayesian decision analysis is an approach that tells you what to optimize (given assumptions). And I don’t see how he can distiguish graphical/generative models from “pure Bayesian systems.” Bayesian models are almost always generative, no? This must be a language difference in CS versus statistics.

Anyway, it’s an interesting set of comparisons. And, as Aleks points out, we’re trying our best to reduce the difficulties of the Bayesian approach (in particular, the difficulty of setting up models that are structured enough to learn from the data but weak enough to learn from the data).

4. jl says:

In my experience, boosting yields moderate but fairly consistent empirical improvements for decision trees, and occasionally provides improvement for other learning algorithms. I’m unclear on the extent to which COLT can claim credit for SVMs as both the kernel idea and the large margin idea existed previously. Do you have a particular citation in mind?

If you want to argue that learning theory has been influential, I think that’s reasonable. However, the empirical impact on practice could not reasonably be compared to Newton’s or Maxwell’s work in physics. We should aspire to theory at this level.

5. jl says:

I’m really trying to discuss frameworks in which people approach machine learning.

For some people, the way to approach things is to write down their best description of the prior, then apply bayes law and see what it implies.

For other people, the way to approach things is to create a graphical model which hopefully doesn’t lie to strongly about what is conditionally independent of what. This is not equivalent to the Bayesian approach—at best you make MAP estimates and more commonly you simply make maximum likelihood predictions.

A third camp thinks that “all learning algorithms are gradient descent”. The way you do ML is by specifying some representation and then doing gradient descent. There is no probabilistic semantics to the value of the output. A good choice of representation is simply one which allows you to minimize your energy function well on heldout samples.

(And there are a number of other approaches…)

1. Nice list!
I guess my recent work would fall into the “Gradient Descent” group but I think of it more in terms of

(Mostly) Differentiable Objective Functions

Starting with such a function, the optimization algorithm needed to minimize it then often turns out to simply be (a variants of) gradient descent.

6. Csaba Szepesvari says:

Some quotations that I find relevant for this topic:

Although this may seem a paradox, all exact science is dominated by the idea of approximation.
– Bertrand Russell

The purpose of models is not to fit the data but to sharpen the questions.
– Samuel Karlin

He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.
– Leonardo da Vinci

A theory is something nobody believes, except the person who made it. An experiment is something everybody believes, except the person who made it.
– Albert Einstein

7. Somehow this seems relevant, too (stolen from someone’s e-mail sig):

“I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth.”
— Umberto Eco

8. Your list is very interesting and well structured. My only comment concerns decision trees. According to me, one of their main advantages is to allow a clear view of the model itself (white box) in comparison with other categories such as kernel methods (black box). In the case of decision trees, both results (rules) as well as the model (tree) are clearly readable.

9. Michael Tangermann says:

Another PRO of decision trees worth mentioning is their applicability for categorial data.

@John: Using only orthorgonal categories for your above listing of frameworks is probably not possible. I encourage you to add names of methods to the list, even if there are double-entries.

10. I’m not sure I agree that decision trees are always/usually interpretable. My experience is that if you have a problem with more than a few dozen features and enough examples to build a full tree, the tree gets so large that it ceases to make sense. Moreover, in order to get really competitive performance from DTs, one often has to bag or boost them. Which again makes them uninterpretable.

This is perhaps a domain specific observation, but in my domains, DTs are really no better than any other algorithm (aside from, perhaps, simple rule induction, which typically underperforms significantly).

11. Agreed with hal. Igor Kononenko has actually asked physicians if they preferred the conditional probabilities out of naive Bayes or classification trees. They preferred conditional probabilities. They would like them even more had they been shown as a nomogram.

12. Chunyu Yang says:

I am interested in Bayesian Learning which is surely a Generative Model. “specify a prior” is quite boring, but some algorithms have been proposed to tackle this problem in the name of “Probability Estimation”. Maybe the variation of the a prior along with the data sequence is more serious.

13. You seem to have neglected logic-based learning, which is often neglected nowadays because of other “numerical” learning methods. But logic-based learning (eg inductive logic programming) is also a form of statistical learning. Also, if the substrate is a probabilistic logic then it is related to Bayesian networks. It would be very interesting to find out any deep connection between numerical and logic-based learning paradigms.

14. Yan King, you should definitely check out the Markov Logic Networks framework, which is a good combination of logic and statistical learning methods.

15. To follow up on that da Vinci quote, one of my favourites regarding theory and practice:

In theory, there is no difference between theory and practice. But, in practice, there is.

This has been attributed to Jan Snepschuet, Yogi Berra and Albert Einstein.

16. ik says:

My experience is that if you have a problem with more than a few dozen features and enough examples to build a full tree, the tree gets so large that it ceases to make sense. Moreover, in order to get really competitive performance from DTs, one often has to bag or boost them. Which again makes them uninterpretable.