Machine Learning (Theory)


A Variance only Deviation Bound

At the PAC-Bayes workshop earlier this week, Olivier Catoni described a result that I hadn’t believed was possible: a deviation bound depending only on the variance of a random variable.

For people not familiar with deviation bounds, this may be hard to appreciate. Deviation bounds, are one of the core components for the foundations of machine learning theory, so developments here have a potential to alter our understanding of how to learn and what is learnable. My understanding is that the basic proof techniques started with Bernstein and have evolved into several variants specialized for various applications. All of the variants I knew had a dependence on the range, with some also having a dependence on the variance of an IID or martingale random variable. This one is the first I know of with a dependence on only the variance.

The basic idea is to use a biased estimator of the mean which is not influenced much by outliers. Then, a deviation bound can be proved by using the exponential moment method, with the sum of the bias and the deviation bounded. The use of a biased estimator is clearly necessary, because an unbiased empirical average is inherently unstable—which was precisely the reason I didn’t think this was possible.

Precisely how this is useful for machine learning isn’t clear yet, but it opens up possibilities. For example, it’s common to suffer from large ranges in exploration settings, such as contextual bandits or active learning.


Another 10-year paper in Machine Learning

When I was thinking about the best “10 year paper” for ICML, I also took a look at a few other conferences. Here is one from 10 years ago that interested me:

David McAllester PAC-Bayesian Model Averaging, COLT 1999. 2001 Journal Draft.

Prior to this paper, the only mechanism known for controlling or estimating the necessary sample complexity for learning over continuously parameterized predictors was VC theory and variants, all of which suffered from a basic problem: they were incredibly pessimistic in practice. This meant that only very gross guidance could be provided for learning algorithm design. The PAC-Bayes bound provided an alternative approach to sample complexity bounds which was radically tighter, quantitatively. It also imported and explained many of the motivations for Bayesian learning in a way that learning theory and perhaps optimization people might appreciate. Since this paper came out, there have been a number of moderately successful attempts to drive algorithms directly by the PAC-Bayes bound. We’ve gone from thinking that a bound driven algorithm is completely useless to merely a bit more pessimistic and computationally intense than might be necessary.

The PAC-Bayes bound is related to the “bits-back” argument that Geoff Hinton and Drew van Camp made at COLT 6 years earlier.

What other machine learning or learning theory papers from 10 years ago have had a substantial impact?


Prediction Science

One view of machine learning is that it’s about how to program computers to predict well. This suggests a broader research program centered around the more pervasive goal of simply predicting well.
There are many distinct strands of this broader research program which are only partially unified. Here are the ones that I know of:

  1. Learning Theory. Learning theory focuses on several topics related to the dynamics and process of prediction. Convergence bounds like the VC bound give an intellectual foundation to many learning algorithms. Online learning algorithms like Weighted Majority provide an alternate purely game theoretic foundation for learning. Boosting algorithms yield algorithms for purifying prediction abiliity. Reduction algorithms provide means for changing esoteric problems into well known ones.
  2. Machine Learning. A great deal of experience has accumulated in practical algorithm design from a mixture of paradigms, including bayesian, biological, optimization, and theoretical.
  3. Mechanism Design. The core focus in game theory is on equilibria, mostly typically Nash equilibria, but also many other kinds of equilibria. The point of equilibria, to a large extent, is predicting how agents will behave. When this is employed well, principally in mechanism design for auctions, it can be a very powerful concept.
  4. Prediction Markets. The basic idea in a prediction market is that commodities can be designed so that their buy/sell price reflects a form of wealth-weighted consensus estimate of the probability of some event. This is not simply mechanism design, because (a) the thin market problem must be dealt with and (b) the structure of plausible guarantees is limited.
  5. Predictive Statistics. Part of statistics focuses on prediction, essentially becoming indistinguishable from machine learning. The canonical example of this is tree building algorithms such as CART, random forests, and some varieties of boosting. Similarly the notion of probability, counting, and estimation are all handy.
  6. Robust Search. I have yet to find an example of robust search which isn’t useful—and there are several varieties. This includes active learning, robust min finding, and (more generally) compressed sensing and error correcting codes.

The lack of unification is fertile territory for new research, so perhaps it’s worthwhile to think about how these different research programs might benefit from each other.

  1. Learning Theory. The concept of mechanism design is mostly missing from learning theory, but it is sure to be essential when interactive agents are learning. We’ve found several applications for robust search as well as new settings for robust search such as active learning, and error correcting tournaments, but there are surely others.
  2. Machine Learning and Predictive Statistics. Machine learning has been applied to auction design. There is a strong relationship between incentive compatibility and choice of loss functions, both for choosing proxy losses and approximating the real loss function imposed by the world. It’s easy to imagine designer loss functions from the study of incentive compatibility mechanisms giving learning algorithm an edge. I found this paper thought provoking that way. Since machine learning and information markets share a design goal, are there hybrid approaches which can outperform either?
  3. Mechanism Design. There are some notable similarities between papers in ML and mechanism design. For example there are papers about learning on permutations and pricing in combinatorial markets. I haven’t yet taken the time to study these carefully, but I could imagine that one suggests advances for the other, and perhaps vice versa. In general, the idea of using mechanism design with context information (as is done in machine learning), could also be extremely powerful.
  4. Prediction Markets. Prediction markets are partly an empirical field and partly a mechanism design field. There seems to be relatively little understanding about how well and how exactly information from multiple agents is supposed to interact to derive a good probability estimate. For example, the current global recession reminds us that excess leverage is a very bad idea. The same problem comes up in machine learning and is solved by the weighted majority algorithm (and even more thoroughly by the hedge algorithm). Can an information market be designed with the guarantee that an imperfect but best player decides the vote after not-too-many rounds? How would this scale as a function of the ratio of a participants initial wealth to the total wealth?
  5. Robust Search. Investigations into robust search are extremely diverse, essentially only unified in a mathematically based analysis. For people interested in robust search, machine learning and information markets provide a fertile ground for empirical application and new settings. Can all mechanisms for robust search be done with context information, as is common in learning? Do these approaches work empirically in machine learning or information markets?

There are almost surely many other interesting research topics and borrowable techniques here, and probably even other communities oriented around prediction. While the synthesis of these fields is almost sure to eventually happen, I’d like to encourage it sooner rather than later. For someone working on one of these branches, attending a conference on one of the other branches might be a good start. At a lesser time investment, Oddhead is a good start.


Interesting Papers at COLT 2007

Here are two papers that seem particularly interesting at this year’s COLT.

  1. Gilles Blanchard and François Fleuret, Occam’s Hammer. When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight. A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper.
  2. Adam Tauman Kalai. Learning Nested Halfspaces and Uphill Decision Trees. Classification PAC-learning, where you prove that any problem amongst some set is polytime learnable with respect to any distribution over the input X is extraordinarily challenging as judged by lack of progress over a long period of time. This paper is about regression PAC-learning, and the results appear much more encouraging than exist in classification PAC-learning. Under the assumption that:
    1. The level sets of the correct regressed value are halfspaces.
    2. The level sets obey a Lipschitz condition.

    this paper proves that a good regressor can be PAC-learned using a boosting algorithm. (The “uphill decision trees” part of the paper is about one special case where you don’t need the Lipschitz condition.)


The Missing Bound

Sham Kakade points out that we are missing a bound.

Suppose we have m samples x drawn IID from some distribution D. Through the magic of exponential moment method we know that:

  1. If the range of x is bounded by an interval of size I, a Chernoff/Hoeffding style bound gives us a bound on the deviations like O(I/m0.5) (at least in crude form). A proof is on page 9 here.
  2. If the range of x is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). This can be a huge improvment when the true variance small.

What’s missing here is a bound that depends on the observed variance rather than a bound on the variance. This means that many people attempt to use Bennett’s bound (incorrectly) by plugging the observed variance in as the true variance, invalidating the bound application. Most of the time, they get away with it, but this is a dangerous move when doing machine learning. In machine learning, we are typically trying to find a predictor with 0 expected loss. An observed loss of 0 (i.e. 0 training error) implies an observed variance of 0. Plugging this into Bennett’s bound, you can construct a wildly overconfident bound on the expected loss.

One safe way to apply Bennett’s bound is to use McDiarmid’s inequality to bound the true variance given an observed variance, and then plug this bound on the true variance into Bennett’s bound (making sure to share the confidence parameter between both applications) on the mean. This is a clumsy and relatively inelegant method.

There should exist a better bound. If we let the observed mean of a sample S be u(S) and the observed variance be v(S), there should exist a bound which requires only a bounded range (like Chernoff), yet which is almost as tight as the Bennett bound. It should have the form:

PrS ~ Dm ( Ex~D x <= f(u(S), v(S) ,d)) >= 1 – d

For machine learning, a bound of this form may help design learning algorithms which learn by directly optimizing bounds. However, there are many other applications both within and beyond machine learning.

(*) Incidentally, sometimes people try to apply the Bennett inequality when they only know the range of the random variable by computing the worst case variance within that range. This is never as good as a proper application of the Chernoff/Hoeffding bound.


All Models of Learning have Flaws

Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning.

The point here is not simply “woe unto us”. There are several implications which seem important.

  1. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes there “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students.
  2. Algorithms which conform to multiple approaches can have substantial value. “I don’t really understand it yet, because I only understand it one way”. Reinterpretation alone is not the goal—we want algorithmic guidance.
  3. We need to remain constantly open to new mathematical models of machine learning. It’s common to forget the flaws of the model that you are most familiar with in evaluating other models while the flaws of new models get exaggerated. The best way to avoid this is simply education.
  4. The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.

Here is a summary what is wrong with various frameworks for learning. To avoid being entirely negative, I added a column about what’s right as well.

Name Methodology What’s right What’s wrong
Bayesian Learning You specify a prior probability distribution over data-makers, P(datamaker) then use Bayes law to find a posterior P(datamaker|x). True Bayesians integrate over the posterior to make predictions while many simply use the world with largest posterior directly. Handles the small data limit. Very flexible. Interpolates to engineering.
  1. Information theoretically problematic. Explicitly specifying a reasonable prior is often hard.
  2. Computationally difficult problems are commonly encountered.
  3. Human intensive. Partly due to the difficulties above and partly because “first specify a prior” is built into framework this approach is not very automatable.
Graphical/generative Models Sometimes Bayesian and sometimes not. Data-makers are typically assumed to be IID samples of fixed or varying length data. Data-makers are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist. Relative to pure Bayesian systems, this approach is sometimes computationally tractable. More importantly, the graph language is natural, which aids prior elicitation.
  1. Often (still) fails to fix problems with the Bayesian approach.
  2. In real world applications, true conditional independence is rare, and results degrade rapidly with systematic misspecification of conditional independence.
Convex Loss Optimization Specify a loss function related to the world-imposed loss fucntion which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optima. Mathematically clean solutions where computational tractability is partly taken into account. Relatively automatable.
  1. The temptation to forget that the world imposes nonconvex loss functions is sometimes overwhelming, and the mismatch is always dangerous.
  2. Limited models. Although switching to a convex loss means that some optimizations become convex, optimization on representations which aren’t single layer linear combinations is often difficult.
Gradient Descent Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters. Relatively computationally tractable due to (a) modularity of gradient descent (b) directly optimizing the quantity you want to predict.
  1. Finicky. There are issues with paremeter initialization, step size, and representation. It helps a great deal to have accumulated experience using this sort of system and there is little theoretical guidance.
  2. Overfitting is a significant issue.
Kernel-based learning You chose a kernel K(x,x’) between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning. People often find the specification of a similarity function between objects a natural way to incorporate prior information for machine learning problems. Algorithms (like SVMs) for training are reasonably practical—O(n2) for instance. Specification of the kernel is not easy for some applications (this is another example of prior elicitation). O(n2) is not efficient enough when there is much data.
Boosting You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor. A focus on getting something that works quickly is natural. This approach is relatively automated and (hence) easy to apply for beginners. The boosting framework tells you nothing about how to build that initial algorithm. The weak learning assumption becomes violated at some point in the iterative process.
Online Learning with Experts You make many base predictors and then a master algorithm automatically switches between the use of these predictors so as to minimize regret. This is an effective automated method to extract performance from a pool of predictors. Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction.
Learning Reductions You solve complex machine learning problems by reducing them to well-studied base problems in a robust manner. The reductions approach can yield highly automated learning algorithms. The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm.
PAC Learning You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a near-best hypothesis amongst a given set of hypotheses in a computationally tractable manner. The focus on computation is pretty right-headed, because we are ultimately limited by what we can compute. There are not many substantial positive results, particularly when D is noisy. Data isn’t IID in practice anyways.
Statistical Learning Theory You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a near-best hypothesis from a set of hypotheses. There are substantially more positive results than for PAC Learning, and there are a few examples of practical algorithms directly motivated by this analysis. The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness).
Decision tree learning Learning is a process of cutting up the input space and assigning predictions to pieces of the space. Decision tree algorithms are well automated and can be quite fast. There are learning problems which can not be solved by decision trees, but which are solvable. It’s common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking.
Algorithmic complexity Learning is about finding a program which correctly predicts the outputs given the inputs. Any reasonable problem is learnable with a number of samples related to the description length of the program. The theory literally suggests solving halting problems to solve machine learning.
RL, MDP learning Learning is about finding and acting according to a near optimal policy in an unknown Markov Decision Process. We can learn and act with an amount of summed regret related to O(SA) where S is the number of states and A is the number of actions per state. Has anyone counted the number of states in real world problems? We can’t afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways.
RL, POMDP learning Learning is about finding and acting according to a near optimaly policy in a Partially Observed Markov Decision Process In a sense, we’ve made no assumptions, so algorithms have wide applicability. All known algorithms scale badly with the number of hidden states.

This set is incomplete of course, but it forms a starting point for understanding what’s out there. (Please fill in the what/pro/con of anything I missed.)


Continuizing Solutions

This post is about a general technique for problem solving which I’ve never seen taught (in full generality), but which I’ve found very useful.

Many problems in computer science turn out to be discretely difficult. The best known version of such problems are NP-hard problems, but I mean ‘discretely difficult’ in a much more general way, which I only know how to capture by examples.

  1. ERM In empirical risk minimization, you choose a minimum error rate classifier from a set of classifiers. This is NP hard for common sets, but it can be much harder, depending on the set.
  2. Experts In the online learning with experts setting, you try to predict well so as to compete with a set of (adversarial) experts. Here the alternating quantifiers of you and an adversary playing out a game can yield a dynamic programming problem that grows exponentially.
  3. Policy Iteration The problem with policy iteration is that you learn a new policy with respect to an old policy, which implies that simply adopting the new policy can go very wrong.

For each of these problems, there are “continuized” solutions which can yield smaller computation, more elegant mathematics, or both.

  1. ERM By shifting from choosing a single classifier to choosing a stochastic classifier we can prove a new style of bound which is significantly tighter, easier to state, and easier to understand than traditional bounds in the traditional setting. This is the PAC-Bayes bound idea.
  2. Experts By giving the adversary slightly more power—the ability to split experts and have them fractionally predict one way vs. another, the optimal policy becomes much easier to compute (quadratic in the horizon, or maybe less). This is the continuous experts idea.
  3. Policy Iteration For policy iteration, by stochastically mixing the old and the new policy, we can find a new policy better than the old policy. This is the conservative policy iteration idea.

There is some danger to continuizing. The first and second examples both involve a setting shift, which may not be valid—in general your setting should reflect your real problem rather than the thing which is easy to solve. However, even with the setting shift, the solutions appear so compellingly more elegant that it is hard to not hope to use them in a solution to the original setting.

I have not seen a good formulation of the general approach of continuizing. Nevertheless, I expect to see continuizing in more places and to use it in the future. By making it explicit, perhaps this can be made eaesier.


Regularization = Robustness

The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math,

argmax_p H(p)
s.t. E_p[f_i] = c_i

is given by e^{\sum \lambda_i f_i}/Z. (Z here is the necessary normalization constraint, and the lambdas are free parameters we set to meet the expectation constraints).

A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. (Motivating work in models in text learning and Conditional Random Fields, for instance. ) The result has been demonstrated a number of ways. One of the most elegant is the “geometric” version here.

In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family. It’s a surprising connection and the duality it flows from appears in a wide variety of work. (For instance, Martin Wainwright’s approximate inference techniques rely (in essence) on this result.)

In practice, we know that Maximum Likelihood with a lot of features is bound to overfit. The traditional trick is to pull a sleight of hand in the derivation. We start with the primal entropy problem, move to the dual, and in the dual add a “prior” that penalizes the lambdas. (Typically an l_1 or l_2 penalty or constraint.) This game is played in a variety of papers, and it’s a sleight of hand because the penalties don’t come from the motivating problem (the primal) but rather get tacked on at the end. In short: it’s a hack.

So I realized a few months back, that the primal (entropy) problem that regularization relates to is remarkably natural. Basically, it tells us that regularization in the dual corresponds directly to uncertainty (mini-max) about the constraints in the primal. What we end up with is a distribution p that is robust in the sense that it maximizes the entropy subject to a large set of potential constraints. More recently, I realized that I’m not even close to having been the first to figure that out. Miroslav Dudík, Steven J. Phillips and Robert E. Schapire, have a paper that derives this relation and then goes a step further to show what performance guarantees the method provides. It’s a great paper and I hope you get a chance to check it out:

Performance guarantees for regularized maximum entropy density estimation.

(Even better: if you’re attending ICML this year, I believe you will see Rob Schapire talk about some of this and related material as an invited speaker.)

It turns out the idea generalizes quite a bit. In Robust design of biological experiments. P. Flaherty, M. I. Jordan and A. P. Arkin show a related result where regularization directly follows from a robustness or uncertainty guarantee. And if you want the whole, beautiful framework you’re in luck. Yasemin Altun and Alex Smola have a paper (that I haven’t yet finished, but at least begins very well) that generalizes the regularized maximum entropy duality to a whole class of statistical inference procedures. If you’re at COLT, you can check this out as well.

Unifying Divergence Minimization and Statistical Inference via Convex Duality

The deep, unifying result seems to be what the title of the post says: robustness = regularization. This viewpoint makes regularization seem like much less of a hack, and goes further in suggesting just what range of constants might be reasonable. The work is very relevant to learning, but the general idea goes beyond to various problems where we only approximately know constraints.


Bounds greater than 1

Nati Srebro and Shai Ben-David have a paper at COLT which, in the appendix, proves something very striking: several previous error bounds are always greater than 1.

Background One branch of learning theory focuses on theorems which

  1. Assume samples are drawn IID from an unknown distribution D.
  2. Fix a set of classifiers
  3. Find a high probability bound on the maximum true error rate (with respect to D) as a function of the empirical error rate on the training set.

Many of these bounds become extremely complex and hairy.

Current Everyone working on this subject wants “tighter bounds”, however there are different definitions of “tighter”. Some groups focus on “functional tightness” (getting the right functional dependency between the size of the training set and a parameterization of the hypothesis space) while others focus on “practical tightness” (finding bounds which work well on practical problems). (I am definitely in the second camp.)

One of the dangers of striving for “functional tightness” is that the bound can depend on strangely interrelated parameters. In fact, apparently these strange interrelations can become so complex that they end up always larger than 1 (some bounds here and here).

It seems we should ask the question: “Why are we doing the math?” If it is just done to get a paper accepted under review, perhaps this is unsatisfying. The real value of math comes when it guides us in designing learning algorithms. Math from bounds greater than 1 is a dangerously weak motivation for learning algorithm design.

There is a significant danger in taking this “oops” too strongly.

  1. There exist some reasonable arguments (not made here) for aiming at functional tightness.
  2. The value of the research a person does is more related to the best they have done than the worst.


Multiplication of Learned Probabilities is Dangerous

This is about a design flaw in several learning algorithms such as the Naive Bayes classifier and Hidden Markov Models. A number of people are aware of it, but it seems that not everyone is.

Several learning systems have the property that they estimate some conditional probabilities P(event | other events) either explicitly or implicitly. Then, at prediction time, these learned probabilities are multiplied together according to some formula to produce a final prediction. The Naive Bayes classifier for binary data is the simplest of these, so it seems like a good example.

When Naive Bayes is used, a set of probabilities of the form Pr’(feature i | label) are estimated via counting statistics and some prior. Predictions are made according to the label maximizing:

Pr’(label) * Productfeatures i Pr’(feature i | label)

(The Pr’ notation indicates these are estimated values.)

There is nothing wrong with this method as long as (a) the prior for the sample counts is very strong and (b) the prior (on the conditional independences and the sample counts) is “correct”—the actual problem is drawn from it. However, (a) seems to never be true and (b) is often not true.

At this point, we can think a bit from a estimation perspective. When trying to estimate a coin with bias Pr(feature i | label), after observing n IID samples, the estimate is accurate to (at most) c/m for some constant c. (Actually, it’s c/m0.5 in the general case c/m for coins with bias near 0 or 1.) Given this observation, we should expect the estimates Pr’ to differ by c/m or more when the prior on the sample counts is weak.

The problem to notice is that errors of c/m can quickly accumulate. The final product in the naive bayes classifier is n-way linear in the error terms where n is the number of features. If every features true value happens to be v and we happen to have a 1/2 + 1/n0.5 feature fraction estimate too large and 1/2 – 1/n0.5 fraction estimate too small (as might happen with a reasonable chance), the value of the product might be overestimated by:

(v – c/m)n/2 + n^0.5(v + c/m)n/2 + n^0.5 – vn

When c/m is very small, this approximates as c n0.5 /m, which suggests problems must arise when the number of features n is greater than the number of samples squared n > m2. This can actually happen in the text classification settings where Naive Bayes is often applied.

All of the above is under the assumption that the conditional independences encoded in the Naive Bayes classifier are correct for the problem. When these aren’t correct, as is often true in practice, the estimation errors can be systematic rather than stochastic implying much more brittle behavior.

In all of the above, note that we used Naive bayes as a simple example—this brittleness can be found in a number of other common prediction systems.

An important question is “What can you do about this brittleness?” There are several answers:

  1. Use a different system for prediction (there are many).
  2. Get much more serious about following Bayes law here. (a) The process of integrating over a posterior rather than taking the maximum likelihood element of a posterior tends to reduce the sampling effects. (b) Realize that the conditional independence assumptions producing the multiplication are probably excessively strong and design softer priors which better fit reasonable beliefs.


Fallback Analysis is a Secret to Useful Algorithms

The ideal of theoretical algorithm analysis is to construct an algorithm with accompanying optimality theorems proving that it is a useful algorithm. This ideal often fails, particularly for learning algorithms and theory. The general form of a theorem is:

If preconditions Then postconditions
When we design learning algorithms it is very common to come up with precondition assumptions such as “the data is IID”, “the learning problem is drawn from a known distribution over learning problems”, or “there is a perfect classifier”. All of these example preconditions can be false for real-world problems in ways that are not easily detectable. This means that algorithms derived and justified by these very common forms of analysis may be prone to catastrophic failure in routine (mis)application.

We can hope for better. Several different kinds of learning algorithm analysis have been developed some of which have fewer preconditions. Simply demanding that these forms of analysis be used may be too strong—there is an unresolved criticism that these algorithm may be “too worst case”. Nevertheless, it is possible to have a learning algorithm that simultaneously provides strong postconditions given strong preconditions, reasonable postconditions given reasonable preconditions, and weak postconditions given weak preconditions. Some examples of this I’ve encountered include:

  1. Sham, Matthias and Dean showing that some Bayesian regression is robust in a minimax online learning analysis.
  2. The cover tree which creates an O(n) datastructure for nearest neighbor queries while simultaneously guaranteeing O(log(n)) query time when the metric obeys a dimensionality constraint.

The basic claim is that algorithms with a good fallback analysis are significantly more likely to achieve the theoretical algorithm analysis ideal. Both of the above algorithms have been tested in practice and found capable.

Several significant difficulties occur for anyone working on fallback analysis.

  1. It’s harder. This is probably the most valid reason—people have limited time to do things. Nevertheless, it is reasonable to hope that the core techniques used by many people have had this effort put into them.
  2. It is psychologically difficult to both assume and not assume a precondition, for a researcher. A critical valuable resource here is observing multiple forms of analysis.
  3. It is psychologically difficult for a reviewer to appreciate the value of both assuming and not assuming some precondition. This is a matter of education.
  4. It is neither “sexy” nore straightforward. In particular, theoretically inclined people 1) get great joy from showing that something new is possible and 1) routinely work on papers of the form “here is a better algorithm to do X given the same assumptions”. A fallback analysis requires a change in assumption invalidating (2) and the new thing that it shows for (1) is subtle: that two existing guarantees can hold for the same algorithm. My hope here is that this subtlety becomes better appreciated in time—making useful algorithms has a fundamental sexiness of it’s own.


Complexity: It’s all in your head

One of the central concerns of learning is to understand and to
prevent overfitting. Various notion of “function complexity” often
arise: VC dimension, Rademacher complexity, comparison classes of
experts, and program length are just a few.

The term “complexity” to me seems somehow misleading; the terms never
capture something that meets my intuitive notion of complexity. The
Bayesian notion clearly captures what’s going on. Functions aren’t
“complex”– they’re just “surprising”: we assign to them low
probability. Most (all?) complexity notions I know boil down
to some (generally loose) bound on the prior probability of the function.

In a sense, “complexity” fundementally arises because probability
distributions must sum to one. You can’t believe in all possibilities
at the same time, or at least not equally. Rather you have to
carefully spread the probability mass over the options you’d like to
consider. Large complexity classes means that beliefs are spread
thinly. In it’s simplest form, this phenomenom give the log (1\n) for
n hypotheses in classic PAC bounds.

In fact, one way to think about good learning algorithms is that they
are those which take full advantage of their probability mass.
In the language of Minimum Description Length, they correspond to
“non-defective distributions”.

So this raises a question: are there notions of complexity (preferably finite,
computable ones) that differ fundementally from the notions of “prior”
or “surprisingness”? Game-theoretic setups would seem to be promising,
although much of the work I’m familiar with ties it closely to the notion
of prior as well.


Prediction Bounds as the Mathematics of Science

Tags: Prediction Theory jl@ 3:12 pm

“Science” has many meanings, but one common meaning is “the scientific method” which is a principled method for investigating the world using the following steps:

  1. Form a hypothesis about the world.
  2. Use the hypothesis to make predictions.
  3. Run experiments to confirm or disprove the predictions.

The ordering of these steps is very important to the scientific method. In particular, predictions must be made before experiments are run.

Given that we all believe in the scientific method of investigation, it may be surprising to learn that cheating is very common. This happens for many reasons, some innocent and some not.

  1. Drug studies. Pharmaceutical companies make predictions about the effects of their drugs and then conduct blind clinical studies to determine their effect. Unfortunately, they have also been caught using some of the more advanced techniques for cheating here: including “reprobleming”, “data set selection”, and probably “overfitting by review”. It isn’t too surprising to observe this: when the testers of a drug have $109 or more riding on the outcome the temptation to make the outcome “right” is extreme.
  2. Wrong experiments. When conducting experiments of some new phenomena, it is common for the experimental apparatus to simply not work right. In that setting, throwing out the “bad data” can make the results much cleaner… or it can simply be cheating. Millikan did this in the ‘oil drop’ experiment which measured the electron charge.

Done right, allowing some kinds of “cheating” may be helpful to the progress of science since we can more quickly find the truth about the world. Done wrong, it results in modern nightmares like painkillers that cause heart attacks. (Of course, the more common outcome is that the drugs effectiveness is just overstated.)

A basic question is “How do you do it right?” And a basic answer is “With prediction theory bounds”. Each prediction bound has a number of things in common:

  1. They assume that the data is independently and identically drawn. This is well suited to experimental situations where experimenters work very hard to make different experiments be independent. In fact, this is a better fit than typical machine learning applications where independence of the data is typically more questionable or simply false.
  2. They make no assumption about the distribution that the data is drawn from. This is important for experimental testing of predictions because the distribution that observations are expected to come from is a part of the theory under test.

These two properties above form an ‘equivalence class’ over different mathematical bounds where each bound can be trusted to an equivalent degree. Inside of this equivalent class there are several that may be helpful in determining whether deviations from the scientific method are reasonable or not.

  1. The most basic test set bound corresponds to the scientific method above.
  2. The Occam’s Razor bound allows a careful reordering of steps (1), (2) and step (3). More “interesting” bounds like the VC-bound and the PAC-Bayes bound allow more radical alterations of these steps. Several are discussed here.
  3. The Sample Compression bound allows careful disposal of some datapoints.
  4. Progressive Validation bounds (such as here, here or here) allow hypotheses to be safely reformulated in arbitrary ways as experiments progress.

Scientific experimenters looking for a little extra flexibility in the scientific method may find these approaches useful. (And if they don’t, maybe there is another bound in this equivalence class that needs to be worked out.)


A COLT paper

Tags: Papers,Prediction Theory jl@ 2:22 pm

I found Tong Zhang’s paper on Data Dependent Concentration Bounds for Sequential Prediction Algorithms interesting. Roughly speaking, it states a tight bound on the future error rate for online learning algorithms assuming that samples are drawn independently. This bound is easily computed and will make the progressive validation approaches used here significantly more practical.

The cross validation problem: cash reward

Tags: Prediction Theory,Problems jl@ 2:00 pm

I just presented the cross validation problem at COLT.

The problem now has a cash prize (up to $500) associated with it—see the presentation for details.

The write-up for colt.


Active learning

Tags: Active,Prediction Theory sanjoy@ 9:29 pm

Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework.

Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {hw}:

hw(x) = 1 if x > w, and 0 otherwise.

VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (called the realizable case), then in order to get a classifier with error rate at most e, it is enough to draw m = O(1/e) random labeled examples from P, and to return any classifier consistent with them. But suppose we instead draw m unlabeled samples from P. If we lay these points down on the line, their hidden labels are a sequence of 0′s followed by a sequence of 1′s, and the goal is to discover the point w at which the transition occurs. This can be accomplished with a simple binary search which asks for just log m labels. Thus active learning gives us an exponential improvement in the number of labels needed: by adaptively querying log m labels, we can automatically infer the rest of them.

Unfortunately, not all that much is known beyond this toy example. To date, the single main theoretical result in the field is [FSST97]‘s analysis of the query-by-committee (QBC) learning algorithm. In their model, the learner observes a stream of unlabeled data and makes spot decisions about whether or not to ask for a point’s label. They show that if the data is drawn uniformly from the surface of the d-dimensional unit sphere, and the hidden labels correspond perfectly to a homogeneous (i.e., through the origin) linear separator from this same distribution, then it is possible to achieve generalization error e after seeing O(d/e) points and requesting just O(d log 1/e) labels: an exponential improvement over the usual O(d/e) sample complexity of learning linear separators in a supervised setting. This remarkable result is tempered somewhat by the complexity of the QBC algorithm, which involves computing volumes of intermediate version spaces.

Some recent progress on active learning: [DKM05] show how a simple variant of the perceptron update can be used to achieve these same sample complexity bounds, in the same model. [D04] shows a variety of upper and lower bounds for active learning — for instance, if you allow linear separators which are non-homogeneous then in the above model the sample complexity necessarily shoots up to 1/e.

The theoretical terrain of active learning remains something of an unexplored wilderness. There has, however, been a lot of beautiful theory work (see [A02] for a roundup) on a related model in which the learner is allowed to synthesize query points, rather than simply choosing them from the pool of unlabeled data. This ran into some practical problems: [BL92] found that the resulting synthetic instances were often very difficult for a human to classify!

[A02] D. Angluin. Queries revisited.
[BL92] E. Baum and K. Lang. Query learning can work poorly when a human oracle is used.
[D04] S. Dasgupta. Analysis of a greedy active learning strategy.
[DKM05] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning.
[FSST97] Y. Freund, S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query-by-committee algorithm.


The State of Tight Bounds

Tags: Prediction Theory jl@ 11:49 am

What? Bounds are mathematical formulas relating observations to future error rates assuming that data is drawn independently. In classical statistics, they are calld confidence intervals.

  1. Good Judgement. In many applications of learning, it is desirable to know how well the learned predictor works in the future. This helps you decide if the problem is solved or not.
  2. Learning Essence. The form of some of these bounds helps you understand what the essence of learning is.
  3. Algorithm Design. Some of these bounds suggest, motivate, or even directly imply learning algorithms.

What We Know Now

There are several families of bounds, based on how information is used.

  1. Testing Bounds. These are methods which use labeled data not used in training to estimate the future error rate. Examples include the test set bound, progressive validation also here and here, train and test bounds, and cross-validation (but see the big open problem). These techniques are the best available for goal (1) above, but provide little information towards goals (2) or (3). Some of these techniques are computationally efficient while others are not.
  2. Unlabeled test set bounds Instead of using labeled data to construct a tight bound, it is sometimes possible to use unlabeled data.
  3. Training Bounds These are methods which use labeled data to for both training and testing. These bounds provide insight into goals (2) and (3), but are not very effective for goal (1) above. These bounds teach us about how many independent examples are required to guarantee learning success on different on different representations. Any bound of this sort implies a learning algorithm: “minimize the bound”. The set covering machine is an application of this approach to a (variant) sample compression bound. Most other potential applications aren’t “quite there” yet, although they are close. Here is a list of learning algorithms and bounds that “almost work” for these algorithms.
    1. SVM PAC-Bayes margin bound.
    2. Perceptron PAC-Bayes margin bound, Sample Compression Bound
    3. Neural Network PAC-Bayes bound.
    4. Decision Tree Occam’s razor bound.
    5. Decision Tree Pruning Rademacher Complexity Bounds
    6. Semisupervised Clustering PAC-MDL or transductive PAC-Bayes bound
    7. Nearest neighbor ??(Note that the cross validation approaches work well here.)

The independence assumption is a significant limitation in the applicability of this approach since we often can not believe that independence is satisfied on natural learning problems. Some work has gone into weakening this assumption.


The Big O and Constants in Learning

Tags: Prediction Theory jl@ 12:23 am

The notation g(n) = O(f(n)) means that in the limit as n approaches infinity there exists a constant C such that the g(n) is less than Cf(n).

In learning theory, there are many statements about learning algorithms of the form “under assumptions x,y, and z, the classifier learned has an error rate of at most O(f(m))“.

There is one very good reason to use O(): it helps you understand the big picture and neglect the minor details which are not important in the big picture. However, there are some important reasons not to do this as well.

  1. Unspeedup In algorithm analysis, the use of O() for time complexity is pervasive and well-justified. Determining the exact value of C is inherently computer architecture dependent. (The “C” for x86 processors might differ from the “C” on PowerPC processors.) Since many learning theorists come from a CS theory background, the O() notation is applied to generalization error. The O() abstraction breaks here—you can not generally change learning algorithm and decrease your error rate by some constant factor.
  2. You’re fired When you go and run a learning algorithm to acquire some predictor, the performance it achieves is a key quantity of significant interest. Using an algorithm with merely twice the error rate of a better algorithm can easily make the difference between “it works” and “it doesn’t”. This is not often true when programming, as evidenced by the large number of people who use computationally inefficient languages to solve problems.

We can’t say “never use O()“, because sometimes abstracting away details is the right thing to do. However, any use of O() should be analyzed for “reasonableness” more thoroughly than for computational complexity. Similarly, more interest should be allocated to improving constants in such analysis than is done in algorithms.


Prior, “Prior” and Bias

Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs.

  1. Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data.
  2. “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well.
  3. Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another.

This only scratches the surface—there are yet more subtleties. For example the (as mentioned in meaning of probability) shifts from one viewpoint to another.


Problem: Cross Validation

Tags: Prediction Theory,Problems jl@ 10:47 am

The essential problem here is the large gap between experimental observation and theoretical understanding.

Method K-fold cross validation is a commonly used technique which takes a set of m examples and partitions them into K sets (“folds”) of size m/K. For each fold, a classifier is trained on the other folds and then test on the fold.

Problem Assume only independent samples. Derive a classifier from the K classifiers with a small bound on the true error rate.

Past Work (I’ll add more as I remember/learn.)

  1. Devroye, Rogers, and Wagner analyzed cross validation and found algorithm specific bounds. Not all of this is online, but here is one paper.
  2. Michael Kearns and Dana Ron analyzed cross validation and found that under additional stability assumptions the bound for the classifier which learns on all the data is not much worse than for a test set of size m/K.
  3. Avrim Blum, Adam Kalai, and myself analyzed cross validation and found that you can do at least as well as a test set of size m/K with no additional assumptions using the randomized classifier which draws uniformly from the set of size K.
  4. Yoshua Bengio and Yves Grandvalet analyzed cross validation and concluded that there was no unbiased estimator of variance.
  5. Matti Kääriäinen noted that you can safely derandomize a stochastic classifier (such as one that randomizes over the K folds) using unlabeled data without additional assumptions.

Some Extreme Cases to Sharpen Intuition

  1. Suppose on every fold the learned classifier is the same. Then, the cross-validation error should behave something like a test set of size m. This is radically superior to a test set of size m/K.
  2. Consider leave-one-out cross validation. Suppose we have a “learning” algorithm that uses the classification rule “always predict the parity of the labels on the training set”. Suppose the learning problem is defined by a distribution which picks y=1 with probability 0.5. Then, with probability 0.5, all leave-one-out errors will be 0 and otherwise 1 (like a single coin flip).

(some discussion is here)

Difficulty I consider this a hard problem. I’ve worked on it without success and it’s an obvious problem (due to the pervasive use of cross validation) that I suspect other people have considered. Analyzing the dependency structure of cross validation is quite difficult.

Impact On any individual problem, solving this might have only have a small impact due to slightly improved judgement of success. But, because cross validation is used extensively, the overall impact of a good solution might be very significant.

Powered by WordPress