All Models of Learning have Flaws

Attempts to abstract and study machine learning take place within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve outlined (below) the major flaws in some common models of machine learning.

The point here is not simply “woe unto us”. There are several implications which seem important.

  1. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing, and particularly important for students.
  2. Algorithms which conform to multiple approaches can have substantial value. “I don’t really understand it yet, because I only understand it one way”. Reinterpretation alone is not the goal—we want algorithmic guidance.
  3. We need to remain constantly open to new mathematical models of machine learning. It’s common to forget the flaws of the model you are most familiar with while evaluating other models, so the flaws of new models get exaggerated. The best way to avoid this is simply education.
  4. The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.

Here is a summary of what is wrong with various frameworks for learning. To avoid being entirely negative, I included what’s right about each as well.

Bayesian Learning
  Methodology: You specify a prior probability distribution over data-makers, P(datamaker), then use Bayes’ law to find a posterior P(datamaker|x). True Bayesians integrate over the posterior to make predictions, while many simply use the world with the largest posterior directly.
  What’s right: Handles the small-data limit. Very flexible. Interpolates to engineering.
  What’s wrong:
    1. Information-theoretically problematic. Explicitly specifying a reasonable prior is often hard.
    2. Computationally difficult problems are commonly encountered.
    3. Human intensive. Partly due to the difficulties above, and partly because “first specify a prior” is built into the framework, this approach is not very automatable.

Graphical/Generative Models
  Methodology: Sometimes Bayesian and sometimes not. Data-makers are typically assumed to be IID samples of fixed- or varying-length data, and are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist.
  What’s right: Relative to pure Bayesian systems, this approach is sometimes computationally tractable. More importantly, the graph language is natural, which aids prior elicitation.
  What’s wrong:
    1. Often (still) fails to fix the problems of the Bayesian approach.
    2. In real-world applications, true conditional independence is rare, and results degrade rapidly with systematic misspecification of conditional independence.

Convex Loss Optimization
  Methodology: Specify a loss function, related to the world-imposed loss function, which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optimum. (A sketch of this approach appears after this list.)
  What’s right: Mathematically clean solutions where computational tractability is partly taken into account. Relatively automatable.
  What’s wrong:
    1. The temptation to forget that the world imposes nonconvex loss functions is sometimes overwhelming, and the mismatch is always dangerous.
    2. Limited models. Although switching to a convex loss means that some optimizations become convex, optimization on representations which aren’t single-layer linear combinations is often difficult.

Gradient Descent
  Methodology: Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters.
  What’s right: Relatively computationally tractable due to (a) the modularity of gradient descent and (b) directly optimizing the quantity you want to predict.
  What’s wrong:
    1. Finicky. There are issues with parameter initialization, step size, and representation. It helps a great deal to have accumulated experience using this sort of system, and there is little theoretical guidance.
    2. Overfitting is a significant issue.

Kernel-based Learning
  Methodology: You choose a kernel K(x,x’) between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning.
  What’s right: People often find the specification of a similarity function between objects a natural way to incorporate prior information for machine learning problems. Algorithms (like SVMs) for training are reasonably practical (O(n^2), for instance).
  What’s wrong: Specification of the kernel is not easy for some applications (this is another example of prior elicitation). O(n^2) is not efficient enough when there is much data.

Boosting
  Methodology: You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor.
  What’s right: A focus on getting something that works quickly is natural. This approach is relatively automated and (hence) easy for beginners to apply.
  What’s wrong: The boosting framework tells you nothing about how to build the initial algorithm. The weak learning assumption becomes violated at some point in the iterative process.

Online Learning with Experts
  Methodology: You make many base predictors, and then a master algorithm automatically switches between them so as to minimize regret. (A sketch of the exponentiated-weights version appears after this list.)
  What’s right: This is an effective automated method to extract performance from a pool of predictors.
  What’s wrong: Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction.

Learning Reductions
  Methodology: You solve complex machine learning problems by reducing them to well-studied base problems in a robust manner.
  What’s right: The reductions approach can yield highly automated learning algorithms.
  What’s wrong: The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm.

PAC Learning
  Methodology: You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a near-best hypothesis amongst a given set of hypotheses in a computationally tractable manner.
  What’s right: The focus on computation is pretty right-headed, because we are ultimately limited by what we can compute.
  What’s wrong: There are not many substantial positive results, particularly when D is noisy. Data isn’t IID in practice anyways.

Statistical Learning Theory
  Methodology: You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a near-best hypothesis from a set of hypotheses.
  What’s right: There are substantially more positive results than for PAC Learning, and there are a few examples of practical algorithms directly motivated by this analysis.
  What’s wrong: The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness).

Decision Tree Learning
  Methodology: Learning is a process of cutting up the input space and assigning predictions to pieces of the space.
  What’s right: Decision tree algorithms are well automated and can be quite fast.
  What’s wrong: There are learning problems which cannot be solved by decision trees but which are solvable. It’s common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking.

Algorithmic Complexity
  Methodology: Learning is about finding a program which correctly predicts the outputs given the inputs.
  What’s right: Any reasonable problem is learnable with a number of samples related to the description length of the program.
  What’s wrong: The theory literally suggests solving halting problems to solve machine learning.

RL, MDP Learning
  Methodology: Learning is about finding and acting according to a near-optimal policy in an unknown Markov Decision Process.
  What’s right: We can learn and act with an amount of summed regret related to O(SA), where S is the number of states and A is the number of actions per state.
  What’s wrong: Has anyone counted the number of states in real-world problems? We can’t afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways.

RL, POMDP Learning
  Methodology: Learning is about finding and acting according to a near-optimal policy in a Partially Observed Markov Decision Process.
  What’s right: In a sense, we’ve made no assumptions, so algorithms have wide applicability.
  What’s wrong: All known algorithms scale badly with the number of hidden states.
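
To make the Convex Loss Optimization and Gradient Descent entries concrete, here is a minimal sketch in Python (not from the original list; it assumes NumPy, synthetic data, and a hand-picked step size): a linear predictor is trained by batch gradient descent on the convex logistic surrogate, then scored on the nonconvex 0/1 loss the world actually imposes.

    # Minimal sketch: batch gradient descent on the convex logistic surrogate loss
    # for a linear predictor, then evaluation on the nonconvex 0/1 loss.
    # Synthetic data; the step size and iteration count are hand-picked, not tuned.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.normal(size=(n, d))
    y = (X @ rng.normal(size=d) > 0).astype(float)    # labels in {0, 1}

    w = np.zeros(d)
    step = 0.5
    for _ in range(500):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))            # predicted probabilities
        w -= step * (X.T @ (p - y)) / n               # gradient of mean logistic loss

    print("training 0/1 error:", np.mean(((X @ w) > 0) != y))

The last line is the point of the corresponding "what's wrong" entries: the quantity being optimized (logistic loss) is only a convex stand-in for the 0/1 error we actually care about.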
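
The Online Learning with Experts entry also has a compact concrete instance: an exponentiated-weights master algorithm. The sketch below is illustrative rather than canonical; the number of experts, their simulated accuracies, and the learning rate eta are assumptions made for the example.

    # Minimal sketch of an exponentiated-weights master algorithm: keep a weight per
    # expert, predict with the weighted majority, and shrink the weight of any expert
    # that errs. Experts are simulated as noisy predictors of a random binary label.
    import numpy as np

    rng = np.random.default_rng(1)
    T, K = 1000, 4                                    # rounds, number of experts
    expert_acc = np.array([0.6, 0.7, 0.55, 0.9])      # hypothetical expert accuracies
    eta = np.sqrt(2 * np.log(K) / T)                  # standard learning-rate choice

    w = np.ones(K)
    master_mistakes = 0
    expert_mistakes = np.zeros(K)
    for _ in range(T):
        truth = rng.integers(2)
        preds = np.where(rng.random(K) < expert_acc, truth, 1 - truth)
        master = int((w / w.sum()) @ preds > 0.5)     # weighted-majority prediction
        master_mistakes += int(master != truth)
        expert_mistakes += (preds != truth)
        w *= np.exp(-eta * (preds != truth))          # downweight experts that erred

    print("master mistakes:", master_mistakes)
    print("best expert mistakes:", int(expert_mistakes.min()))

The master's mistake count tracks the best expert's up to an additive regret term, but nothing here says how to construct good experts, which is exactly the complaint in the "what's wrong" entry.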

This set is incomplete of course, but it forms a starting point for understanding what’s out there. (Please fill in the what/pro/con of anything I missed.)

The Forgetting

How many papers do you remember from 2006? 2005? 2002? 1997? 1987? 1967? One way to judge this would be to look at the citations of the papers you write—how many came from which year? For myself, the answers on recent papers are:

year   2006  2005  2002  1997  1987  1967
count     4    10     5     1     0     0

This spectrum is fairly typical of papers in general. There are many reasons that citations are focused on recent papers.

  1. The number of papers being published continues to grow. This is not a very significant effect, because the rate of publication has not grown nearly as fast as the citation counts fall off.
  2. Dead men don’t reject your papers for not citing them. This reason seems lame, because it’s a distortion from the ideal of science. Nevertheless, it must be stated because the effect can be significant.
  3. In 1997, I started as a PhD student. Naturally, papers after 1997 are better remembered because they were absorbed in real time. A large fraction of people writing papers and attending conferences haven’t been doing it for 10 years.
  4. Old papers aren’t on the internet. This is a huge effect for any papers prior to 1995 (or so). The ease of examining a paper greatly influences the ability of an author to read and understand it. There are a number of journals which essentially have “internet access for the privileged elite who are willing to pay”. In my experience, this is only marginally better than having them stuck in the library.
  5. The recent past is more relevant to the present than the far past. There is a lot of truth in this: people discover and promote various problems or techniques which take off for a while, until their turn to be forgotten arrives.

Should we be disturbed by this forgetting? There are a few good effects. For example, when people forget, they reinvent, and sometimes they reinvent better. Nevertheless, it seems like the effect of forgetting is bad overall, because it causes wasted effort. There are two implications:

  1. For paper writers, it is very common to overestimate the value of a paper, even though we know that the impact of most papers is bounded in time. Perhaps by looking at those older papers, we can get an idea of what is important in the long term. For example, looking at my own older citations, simplicity stands out. If you want a paper to have a long-term impact, it needs to have a simple algorithm, analysis method, or setting. Fundamentally, only those things which are teachable survive. Was your last paper simple? Could you teach it in a class? Are other people going to start doing so? Are the review criteria promoting the papers which have a hope of survival?
  2. For conference organizers, it’s important to understand the way science has changed. Originally, you had to be a giant to succeed at science. Then, you merely had to stand on the shoulders of giants to succeed. Now, it seems that even the ability to peer over the shoulders of people standing on the shoulders of giants might be helpful. This is generally a good thing, because it means more people can help on a very hard task. Nevertheless, it seems that much of this effort is getting wasted in forgetting, because we do not have the right mechanisms to remember the information. Which is going to be the first conference to switch away from an ordered list of papers to something with structure? Wouldn’t it be great if all the content at a conference was organized in a wikipedia-like easy-for-outsiders-to-understand style?

Best Practices for Collaboration

Many people, especially students, haven’t had an opportunity to collaborate with other researchers. Collaboration, especially with remote people, can be tricky. Here are some observations of what has worked for me on collaborations involving a few people.

  1. Travel and Discuss. Almost all collaborations start with in-person discussion. This implies that travel is often necessary. We can hope that in the future we’ll have better systems for starting collaborations remotely (such as blogs), but we aren’t quite there yet.
  2. Enable your collaborator. A collaboration can fall apart because one collaborator disables another. This sounds stupid (and it is), but it’s far easier than you might think.
    1. Avoid Duplication. Discovering that you and a collaborator have been editing the same thing and now need to waste time reconciling changes is annoying. The best way to avoid this is to be explicit about who has write permission to what. Most of the time, a write lock is held for the entire document, just to be sure.
    2. Don’t keep the write lock unnecessarily. Some people are perfectionists, so they have a real problem giving up the write lock on a draft until it is perfect. This prevents other collaborators from doing things. Releasing the write lock (at least) when you sleep is a good idea.
    3. Send all necessary materials. Some people try to save space or bandwidth by not passing ‘.bib’ files or other auxiliary components. Forcing your collaborator to deal with the missing-subdocument problem is disabling. Space and bandwidth are cheap while your collaborator’s time is precious. (Sending may be pass-by-reference rather than attach-to-message in most cases.)
    4. Version Control. This doesn’t mean “use version control software”, although that’s fine. Instead, it means: have a version number for drafts passed back and forth. This means you can talk about “draft 3” rather than “the draft that was passed last Tuesday”. Coupled with “send all necessary materials”, this implies that you naturally back up previous work.
  3. Be Generous. It’s common for people to feel insecure about what they have done or how much “credit” they should get.
    1. Coauthor standing. When deciding who should have a chance to be a coauthor, the rule should be “anyone who has helped produce a result conditioned on previous work”. “Helped produce” is often interpreted too narrowly—a theoretician should be generous about crediting experimental results and vice-versa. Potential coauthors may decline (and senior ones often do so). Control over who is a coauthor is best (and most naturally) exercised by the choice of who you talk to.
    2. Author ordering. Author ordering is the wrong thing to worry about, so don’t. The CS theory community has a substantial advantage here because they default to alpha-by-author ordering, as is understood by everyone.
    3. Who presents. A good default for presentations at a conference is “student presents” (or suitable generalizations). This gives young people a real chance to get involved and learn how things are done. Senior collaborators already have plentiful alternative methods to present research at workshops or invited talks.
  4. Communicate by default. Not cc’ing a collaborator is a bad idea. Even if you have a very specific question for one collaborator and not another, it’s a good idea to cc everyone. In the worst case, this is a few-second annoyance for the other collaborator. In the best case, the exchange answers unasked questions. It also prevents the “conversation shifts into subjects interesting to everyone, but oops! you weren’t cc’ed” problem.

These practices are imperfectly followed even by me, but they are a good ideal to strive for.

Parallel Machine Learning Problems

Parallel machine learning is a subject rarely addressed at machine learning conferences. Nevertheless, it seems likely to increase in importance because:

  1. Data set sizes appear to be growing substantially faster than computation. Essentially, this happens because more and more sensors of various sorts are being hooked up to the internet.
  2. Serial speedups of processors appear to have stalled. The new trend is to make processors more powerful by making them multicore.
    1. Both AMD and Intel are making dual core designs standard, with plans for more parallelism in the future.
    2. IBM’s Cell processor has (essentially) 9 cores.
    3. Modern graphics chips can have an order of magnitude more separate execution units.

    The meaning of ‘core’ varies a bit from processor to processor, but the overall trend seems quite clear.

So, how do we parallelize machine learning algorithms?

  1. The simplest and most common technique is to simply run the same learning algorithm with different parameters on different processors. Cluster management software like OpenMosix, Condor, or Torque is helpful here. This approach doesn’t speed up any individual run of a learning algorithm.
  2. The next simplest technique is to decompose a learning algorithm into an adaptive sequence of statistical queries and parallelize the queries over the sample. This paper (updated from the term paper according to a comment) points out that statistical integration can be implemented via MapReduce, which Google popularized (the open source version is Hadoop). The general technique of parallel statistical integration is already used by many people, including IBM’s Parallel Machine Learning Toolbox. The disadvantage of this approach is that it is mostly good at speeding up slow algorithms. One example of a fast sequential algorithm is the perceptron: it works on a per-example basis, making individual updates extremely fast, and it is explicitly not a statistical query algorithm. (A minimal sketch of this style of parallelism appears after this list.)
  3. The most difficult method for speeding up an algorithm is fine-grained structural parallelism. The brain works in this way: each individual neuron operates on its own. The reason this is difficult is that the programming is particularly tricky: you must carefully optimize to avoid latency in network communication. The full complexity of parallel programming is exposed.
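
To illustrate the statistical-query style of parallelism from item (2), here is a minimal sketch. It is not the method of the cited paper, just an illustration under simple assumptions: synthetic data, the standard-library multiprocessing module, and a logistic loss whose gradient is a sum over examples, so each worker can compute its shard’s contribution and the master combines them, MapReduce-style.

    # Minimal sketch of data-parallel statistical queries: each worker computes the
    # logistic-loss gradient over its own shard of the data, and the master sums the
    # per-shard gradients to take one batch step. Shard count, step size, and
    # iteration count are illustrative assumptions.
    import numpy as np
    from multiprocessing import Pool

    def shard_gradient(args):
        X, y, w = args
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        return X.T @ (p - y)                          # un-normalized shard gradient

    def parallel_logistic_fit(X, y, workers=4, steps=100, step=0.5):
        w = np.zeros(X.shape[1])
        shards = list(zip(np.array_split(X, workers), np.array_split(y, workers)))
        with Pool(workers) as pool:
            for _ in range(steps):
                grads = pool.map(shard_gradient, [(Xs, ys, w) for Xs, ys in shards])
                w -= step * sum(grads) / len(y)       # combine the shard statistics
        return w

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 10))
        y = (X @ rng.normal(size=10) > 0).astype(float)
        w = parallel_logistic_fit(X, y)
        print("training 0/1 error:", np.mean((X @ w > 0) != y))

The shape of the computation (map a statistic over shards, reduce by summation) is what makes the MapReduce framing natural; a per-example algorithm like the perceptron does not decompose this way.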

A basic question is: are there other approaches to speeding up learning algorithms which don’t incur the full complexity of option (3)? One approach is discussed here.

The Machine Learning Department

Carnegie Mellon School of Computer Science has the first academic Machine Learning department. This department already existed as the Center for Automated Learning and Discovery, but recently changed its name.

The reason for changing the name is obvious: very few people think of themselves as “Automated Learners and Discoverers”, but there are a number of people who think of themselves as “Machine Learners”. Machine learning is both more succinct and more recognizable, which are good properties for a name.

A more interesting question is “Should there be a Machine Learning Department?”. Tom Mitchell has a relevant whitepaper claiming that machine learning is answering a different question than other fields or departments. The fundamental debate here is “Is machine learning different from statistics?”

At a cultural level, there is no real debate: they are different. Machine learning is characterized by several very active large peer reviewed conferences, operating in a computer science mode. Statistics tends to function with a greater emphasis on journals and a lesser emphasis on conferences which often implies a much longer publishing cycle.

In terms of the basic questions driving the field, the answer seems less clear. It is true that the core problems of statistics in the past have typically differed from the core problems of machine learning today. Yet, there has been some substantial overlap, and there are a number of statisticians nowadays that are actively doing machine learning. It’s reasonably plausible that in the long term statistics departments will adopt the core problems of machine learning, removing the reasons for a separate machine learning department.

The parallel question for computer science comes up less often perhaps because computer science is a notoriously broad field.

The practical implication of a new department is the ability to create a more specific curriculum, admit more specific students, and hire faculty based upon more specific interests. Compared to a computer science program, classes on programming languages, computer architecture, or graphics might be dropped in favor of classes on learning theory, statistics, etc. Compared to a statistics program, classes on advanced parameter estimation and measure theory might be dropped in favor of algorithms and programming experience.

An alternative solution like “learn everything from computer science and statistics” is personally appealing to me, and I have benefited from and recommend a broad education. However, this is not practical for everyone. In my experience, a machine learning skill set is an effective specialization with which people can do important things in the world. Given this, having a department with a machine-learning-centered curriculum seems like a good idea. At Carnegie Mellon, this is the Machine Learning department. In the future and elsewhere it may have a different name, but the value of the machine learning skill set should grow with research, improving computers, and improving data sources.