## A NIPS paper

I’m skipping NIPS this year in favor of Ada, but I wanted to point out this paper by Andriy Mnih and Geoff Hinton. The basic claim of the paper is that by carefully but automatically constructing a binary tree over words, it’s possible to predict words well with huge computational resource savings over unstructured approaches.

I’m interested in this beyond the application to word prediction because it is relevant to the general normalization problem: If you want to predict the probability of one of a large number of events, often you must compute a predicted score for all the events and then normalize, a computationally inefficient operation. The problem comes up in many places using probabilistic models, but I’ve run into it with high-dimensional regression.

There are a couple workarounds for this computational bug:

1. Approximate. There are many ways. Often the approximations are uncontrolled (i.e. can be arbitrarily bad), and hence finicky in application.
2. Avoid. You don’t really want a probability, you want the most probable choice which can be found more directly. Energy based model update rules are an example of that approach and there are many other direct methods from supervised learning. This is great when it applies, but sometimes a probability is actually needed.

This paper points out that a third approach can be viable empirically: use a self-normalizing structure. It seems highly likely that this is true in other applications as well.

## It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

I have recently completed a 500+ page-book on MDL, the first comprehensive overview of the field (yes, this is a sneak advertisement 🙂 ).
Chapter 17 compares MDL to a menagerie of other methods and paradigms for learning and statistics. By far the most time (20 pages) is spent on the relation between MDL and Bayes. My two main points here are:

1. In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. By some theorems due to Andrew Barron, MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). Hence, MDL must be different from Bayes!
2. In contrast to what has sometimes been claimed, practical MDL algorithms do have a subjective component (which in many, but not all cases, may be implemented by something similar to a Bayesian prior; the interpretation is different though; it is more similar to what has been called a “luckiness function” in the computational learning theory literature).

Both points are explained at length in the book (see esp page 544). Here I’ll merely say a bit more about the first.

MDL is always based on designing a universal code L relative to some given model M. Informally this is a code such that whenever some distribution P in M can be used to compress some data set well, then L will compress this data set well as well (I’ll skip the formal definition here). One method (but by no means the only method) for designing a universal code relative to model M is by taking some prior W on M and using the corresponding Shannon-Fano code, i.e. the code that encodes data z with length

L(z) = – log Pbayes(z),

where Pbayes(.) = \int P(.) d W(P) is the Bayesian marginal distribution for M relative to prior W. If M is parametric, then with just about any ‘smooth’ prior, the Bayesian code with lengths L(z) = – log Pbayes(z) leads to a reasonable universal code. But if M is nonparametric (infinite dimensional, such as in Gaussian process regression, or histogram density estimation with an arbitrary nr of components) then many priors which are perfectly fine according to Bayesian theory are ruled out by MDL theory. The reason is that for some P in M, the Bayesian codes based on such priors do not compress data sampled from P at all, even if the amount of data tends to infinity. One can formally prove that such Bayesian codes are not “universal” according to the standard definition of universality.

Now there exist two theorems by Andrew Barron (from 1991 and 1998, respectively) that directly connect data compression with frequentist statistical consistency. In essence, they imply that estimation based on universal codes must always be statistically consistent (the theorems also directly connect the convergence rates to the amount of compression obtained). For Bayesian inference, there exist various inconsistency results such as those by Diaconis and Freedman (1986) and Barron (1998). These say that, for some nonparametric models M, and with some priors on M, Bayesian inference can be inconsistent, in the sense that for some P in M, if data are i.i.d. sampled from P then even with an infinite amount of data, the posterior puts all its mass on distributions P’ in M that are substantially different from the “true” P. By Barron’s theorems, something like this can never happen for MDL; Diaconis and Freedman use priors which are not allowed according to MDL theory. In fact, MDL-based reasoning can also motivate certain prior choices in nonparametric contexts. For example, if one has little prior knowledge, why would one adopt an RBF kernel in Gaussian process regression? Answer: because the corresponding code has excellent universal coding properties, as shown by Kakade, Seeger and Foster (NIPS 2005): it has only logarithmic coding overhead if the underlying data generating process satisfies some smoothness properties; many other kernels have polynomial overhead. Thus, Gaussian processes combined with RBF kernels lead to substantial compression of the data, and therefore, by Barron’s theorem, predictions based on such Gaussian processes converge fast to the optimal predictions that one could only make make if one had access to the unknown imagined “true” distribution.

In general, it is often thought that different priors on M lead to codes that better compress data for some P in M, and that worse compress data for other P in M. But with nonparametric contexts, it is not like that: then there exist priors with “universally good” and “universally bad” coding properties.

This is not to say that all’s well for MDL in terms of consistency: as John and I showed in a paper that appeared earlier this year (but is really much older), if the true distribution P is not contained in the model class M under consideration but contains a good approximation P’ in M then both MDL and Bayes may become statistically inconsistent in the sense that they don’t necessarily converge to P’ or any other good approximation of P.

Thus: if model M parametric and P in M , then MDL and Bayes consistent. If model M nonparametric and P in M, then MDL consistent, Bayes not necessarily so. If P not in M, then both Bayes and MDL may be inconsistent.

This leaves one more very important case: what if P is in the closure of M, but not in M itself? For example, M is the set of all Gaussian mixtures with arbitrarily many components, and P is not a Gaussian mixture, but can be arbitrarily well-approximated (in the sense of KL divergence) by a sequence of Gaussian mixtures with ever more components? In this case, Bayes will be consistent but it can be too slow, i.e. it needs more data before the posterior converges than some other methods (like leave-one-out-cross-validation combined with ML estimation). In our forthcoming NIPS 2007 paper, Steven de Rooij, Tim van Erven and I provide a universal-coding based procedure which converges faster than Bayes in those cases, but does not suffer from the disadvantages of leave-one-out-cross validation. Since the method is directly based on universal coding, I’m tempted to call it “MDL”, but the fact that nobody in the MDL community has thought about our idea before, makes me hesitate. When I talked about it to the famous Bayesian Jim Berger, I said “it’s MDL Jim, but not as we know it”.

## All Models of Learning have Flaws

Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning.

The point here is not simply “woe unto us”. There are several implications which seem important.

1. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes there “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students.
2. Algorithms which conform to multiple approaches can have substantial value. “I don’t really understand it yet, because I only understand it one way”. Reinterpretation alone is not the goal—we want algorithmic guidance.
3. We need to remain constantly open to new mathematical models of machine learning. It’s common to forget the flaws of the model that you are most familiar with in evaluating other models while the flaws of new models get exaggerated. The best way to avoid this is simply education.
4. The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.

Here is a summary what is wrong with various frameworks for learning. To avoid being entirely negative, I added a column about what’s right as well.

 Name Methodology What’s right What’s wrong Bayesian Learning You specify a prior probability distribution over data-makers, P(datamaker) then use Bayes law to find a posterior P(datamaker|x). True Bayesians integrate over the posterior to make predictions while many simply use the world with largest posterior directly. Handles the small data limit. Very flexible. Interpolates to engineering. Information theoretically problematic. Explicitly specifying a reasonable prior is often hard. Computationally difficult problems are commonly encountered. Human intensive. Partly due to the difficulties above and partly because “first specify a prior” is built into framework this approach is not very automatable. Graphical/generative Models Sometimes Bayesian and sometimes not. Data-makers are typically assumed to be IID samples of fixed or varying length data. Data-makers are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist. Relative to pure Bayesian systems, this approach is sometimes computationally tractable. More importantly, the graph language is natural, which aids prior elicitation. Often (still) fails to fix problems with the Bayesian approach. In real world applications, true conditional independence is rare, and results degrade rapidly with systematic misspecification of conditional independence. Convex Loss Optimization Specify a loss function related to the world-imposed loss fucntion which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optima. Mathematically clean solutions where computational tractability is partly taken into account. Relatively automatable. The temptation to forget that the world imposes nonconvex loss functions is sometimes overwhelming, and the mismatch is always dangerous. Limited models. Although switching to a convex loss means that some optimizations become convex, optimization on representations which aren’t single layer linear combinations is often difficult. Gradient Descent Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters. Relatively computationally tractable due to (a) modularity of gradient descent (b) directly optimizing the quantity you want to predict. Finicky. There are issues with paremeter initialization, step size, and representation. It helps a great deal to have accumulated experience using this sort of system and there is little theoretical guidance. Overfitting is a significant issue. Kernel-based learning You chose a kernel K(x,x’) between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning. People often find the specification of a similarity function between objects a natural way to incorporate prior information for machine learning problems. Algorithms (like SVMs) for training are reasonably practical—O(n2) for instance. Specification of the kernel is not easy for some applications (this is another example of prior elicitation). O(n2) is not efficient enough when there is much data. Boosting You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor. A focus on getting something that works quickly is natural. This approach is relatively automated and (hence) easy to apply for beginners. The boosting framework tells you nothing about how to build that initial algorithm. The weak learning assumption becomes violated at some point in the iterative process. Online Learning with Experts You make many base predictors and then a master algorithm automatically switches between the use of these predictors so as to minimize regret. This is an effective automated method to extract performance from a pool of predictors. Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction. Learning Reductions You solve complex machine learning problems by reducing them to well-studied base problems in a robust manner. The reductions approach can yield highly automated learning algorithms. The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm. PAC Learning You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a near-best hypothesis amongst a given set of hypotheses in a computationally tractable manner. The focus on computation is pretty right-headed, because we are ultimately limited by what we can compute. There are not many substantial positive results, particularly when D is noisy. Data isn’t IID in practice anyways. Statistical Learning Theory You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a near-best hypothesis from a set of hypotheses. There are substantially more positive results than for PAC Learning, and there are a few examples of practical algorithms directly motivated by this analysis. The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness). Decision tree learning Learning is a process of cutting up the input space and assigning predictions to pieces of the space. Decision tree algorithms are well automated and can be quite fast. There are learning problems which can not be solved by decision trees, but which are solvable. It’s common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking. Algorithmic complexity Learning is about finding a program which correctly predicts the outputs given the inputs. Any reasonable problem is learnable with a number of samples related to the description length of the program. The theory literally suggests solving halting problems to solve machine learning. RL, MDP learning Learning is about finding and acting according to a near optimal policy in an unknown Markov Decision Process. We can learn and act with an amount of summed regret related to O(SA) where S is the number of states and A is the number of actions per state. Has anyone counted the number of states in real world problems? We can’t afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways. RL, POMDP learning Learning is about finding and acting according to a near optimaly policy in a Partially Observed Markov Decision Process In a sense, we’ve made no assumptions, so algorithms have wide applicability. All known algorithms scale badly with the number of hidden states.

This set is incomplete of course, but it forms a starting point for understanding what’s out there. (Please fill in the what/pro/con of anything I missed.)

Data linkage is a problem which seems to come up in various applied machine learning problems. I have heard it mentioned in various data mining contexts, but it seems relatively less studied for systemic reasons.

A very simple version of the data linkage problem is a cross hospital patient record merge. Suppose a patient (John Doe) is admitted to a hospital (General Health), treated, and released. Later, John Doe is admitted to a second hospital (Health General), treated, and released. Given a large number of records of this sort, it becomes very tempting to try and predict the outcomes of treatments. This is reasonably straightforward as a machine learning problem if there is a shared unique identifier for John Doe used by General Health and Health General along with time stamps. We can merge the records and create examples of the form “Given symptoms and treatment, did the patient come back to a hospital within the next year?” These examples could be fed into a learning algorithm, and we could attempt to predict whether a return occurs.

The problem is that General Health and Health General don’t have any shared unique identifier for John Doe. Information is often mispelled (name misspellings are very common), mistyped, changed (people move), and simply not unique (how many people were born on your birthday?).

Although this is just one example, data linkage problems seem to be endemic to learning applications. There seem to be several solutions:

1. Improved recording. Sometimes minor changes to what information is recorded can strongly disambiguate. For example, there is a big difference between recording the pages visited at a website versus tracking the sequence of pages visited. The essential thing to think about when designing the information to record is: How will I track the consequences of decisions?
2. Two-stage learning. First predict which records should be linked, based upon a smaller dataset that is hand checked. Then, use your learned predictor to do the linkage, and then solve your real prediction problem. There are several pitfalls here.
1. Rarity problems. Links are typically much more rare than nonlinks. The training process needs to take this into account by properly representing the scarcity of nonlinks.
2. Information interfaces. A prediction of “link” or “no link” is too scarce an information source in an inherently noisy environment. Instead, a probability of link may need to be estimated.
3. Two stage estimation. A common approach to improving performance is turning a double approximation (given x predict y, given y predict z) into a single approximation (given x predict z). A method for achieving single approximation here is tricky because we have ancillary information about the intermediate prediction.
3. Customized algorithms. The Bayesian approach of “specify a prior, then use Bayes law to get a posterior, then predict with the posterior” is attractive here because we often have strong prior beliefs about at least the linkage portion of the problem.
4. Others?

The data linkage problem also makes very clear the tension between privacy and machine learning. For example, being able to cross index hospital cases might yield a large jump in our ability to predict outcomes, which might suggest improved treatments (it is only a weak suggestion that must be verified—we must be very careful about applying a predictor to an input distribution it did not learn with respect to). And yet, linking records can result in unexpectedly large pools of information on individuals. Furthermore explicitly sensitive information (like credit card numbers) might easily be the most useful bit of information for linkage.

## Objective and subjective interpretations of probability

An amusing tidbit (reproduced without permission) from Herman Chernoff’s delightful monograph, “Sequential analysis and optimal design”:

The use of randomization raises a philosophical question which is articulated by the following probably apocryphal anecdote.

The metallurgist told his friend the statistician how he planned to test the effect of heat on the strength of a metal bar by sawing the bar into six pieces. The first two would go into the hot oven, the next two into the medium oven, and the last two into the cool oven. The statistician, horrified, explained how he should randomize to avoid the effect of a possible gradient of strength in the metal bar. The method of randomization was applied, and it turned out that the randomized experiment called for putting the first two pieces into the hot oven, the next two into the medium oven, and the last two into the cool oven. “Obviously, we can’t do that,” said the metallurgist. “On the contrary, you have to do that,” said the statistician.

What are arguments for and against this design? In a “larger” design or sample, the effect of a reasonable randomization scheme could be such that this obvious difficulty would almost certainly not happen. Assuming that the original strength of the bar and the heat treatment did not “interact” in a complicated nonlinear way, the randomization would virtually cancel out any effect due to a strength gradient or other erratic phenomena, and computing estimates as though these did not exist would lead to no real error. In this small problem, the effect may not be cancelled out, but the statistician still has a right to close his eyes to the design actually selected if he is satisfied with “playing fair”. That is, if he instructs an agent to select the design and he analyzes the results, assuming there are no gradients, his conclusions will be unbiased in the sense that a tendency to overestimate is balanced on the average by a tendency to underestimate the desired quantities. However, this tendency may be substantial as measured by the variability of the estimates which will be affected by substantial gradients. On the other hand, following the natural inclination to reject an obviously unsatisfactory design resulting from randomization puts the statistician in the position of not “playing fair”. What is worse for an objective statistician, he has no way of evaluating in advance how good his procedure is if he can change the rules in the middle of the experiment.

The Bayesian statistician, who uses subjective probability and must consider all information, is unsatisfied to simply play fair. When randomization leads to the original unsatisfactory design, he is aware of this information and unwilling to accept the design. In general, the religious Bayesian states that no good and only harm can come from randomized experiments. In principle, he is opposed even to random sampling in opinion polling. However, this principle puts him in untenable computational positions, and a pragmatic Bayesian will often ignore what seems useless design information if there are no obvious quirks in a randomly selected sample.