What? Bounds are mathematical formulas relating observations to future error rates, assuming that data is drawn independently. In classical statistics, they are called confidence intervals.
Why?
- Good Judgement. In many applications of learning, it is desirable to know how well the learned predictor works in the future. This helps you decide if the problem is solved or not.
- Learning Essence. The form of some of these bounds helps you understand what the essence of learning is.
- Algorithm Design. Some of these bounds suggest, motivate, or even directly imply learning algorithms.
What We Know Now
There are several families of bounds, based on how information is used.
- Testing Bounds. These are methods which use labeled data not used in training to estimate the future error rate. Examples include the test set bound, progressive validation (also here and here), train-and-test bounds, and cross-validation (but see the big open problem). These techniques are the best available for goal (1) above, but provide little information towards goals (2) or (3). Some of these techniques are computationally efficient while others are not. (A small numerical sketch of a test set bound appears after the list below.)
- Unlabeled Test Set Bounds. Instead of using labeled data to construct a tight bound, it is sometimes possible to use unlabeled data.
- Training Bounds. These are methods which use labeled data for both training and testing. These bounds provide insight into goals (2) and (3), but are not very effective for goal (1) above. These bounds teach us how many independent examples are required to guarantee learning success with different representations. Any bound of this sort implies a learning algorithm: “minimize the bound”. The set covering machine is an application of this approach to a (variant) sample compression bound. Most other potential applications aren’t “quite there” yet, although they are close. Here is a list of learning algorithms and bounds that “almost work” for these algorithms (a sketch of an Occam’s razor style training bound appears after this list).
- SVM: PAC-Bayes margin bound.
- Perceptron: PAC-Bayes margin bound, sample compression bound.
- Neural Network: PAC-Bayes bound.
- Decision Tree: Occam’s razor bound.
- Decision Tree Pruning: Rademacher complexity bounds.
- Semisupervised Clustering: PAC-MDL or transductive PAC-Bayes bound.
- Nearest Neighbor: ?? (Note that the cross-validation approaches work well here.)
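To make the two families above concrete, here is a minimal numerical sketch (Python, standard library only; written for this post rather than taken from any of the linked papers) of an exact binomial test set bound and an Occam's razor style training bound for a countable hypothesis class with a "prior" p(h). The function names and example numbers are illustrative assumptions.

```python
import math
from math import comb

def binomial_test_set_bound(errors, m, delta):
    """Exact test set bound: an upper confidence bound on the true error rate
    given `errors` mistakes on m held-out examples drawn iid.  Returns
    (approximately) the largest p with P[Binomial(m, p) <= errors] >= delta,
    found by bisection; with probability at least 1 - delta over the draw of
    the held-out set, the true error rate is below the returned value."""
    def lower_tail(p):
        return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(errors + 1))
    lo, hi = errors / m, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if lower_tail(mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi

def occam_style_training_bound(train_error, m, prior_h, delta):
    """Occam's razor style training set bound for a countable hypothesis class:
    with probability >= 1 - delta, simultaneously for every h,
      true_error(h) <= train_error(h) + sqrt((ln(1/prior(h)) + ln(1/delta)) / (2m)).
    `prior_h` is the weight the "prior" assigns to the hypothesis in question."""
    return train_error + math.sqrt((math.log(1 / prior_h) + math.log(1 / delta)) / (2 * m))

# 3 mistakes on 1000 held-out examples, 95% confidence:
print(binomial_test_set_bound(3, 1000, 0.05))                 # roughly 0.008
# a training bound for a hypothesis given weight 2^-20 by the "prior":
print(occam_style_training_bound(0.02, 1000, 2**-20, 0.05))   # roughly 0.11
```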
Limitations
The independence assumption is a significant limitation on the applicability of this approach, since we often cannot believe that independence is satisfied on natural learning problems. Some work has gone into weakening this assumption.
I’ve always been curious about the justification for points (2) and (3) under “Why?”.
In particular, for what class of bounds do we believe that following the credo “minimize the bound” will do anything good? Certainly this approach has been applied in lots of contexts in which the bounds on p(error) are always >1; so merely being a bound isn’t enough. For example, I can construct the bound p(error|theta) <= 1+theta^2 without knowing anything about your problem, your data, or your model. Do you believe that setting theta=0 is a good thing to do? On the other hand, requiring the bounds to be "tight" is unnecessarily restrictive, and would have eliminated a lot of loose bounds that *did* provide very useful guidance in the design of learning algorithms. So what's the "sniff test" for a bound that it is sensible to minimize? There seems to be an implicit extra condition, something like "as long as the bounds were derived using a methodology I find compelling". I'd like to see that implicit condition formalized, in a way that tosses out my 1+theta^2 example but retains PAC-Bayes, margins, or sample compression, even when they are always >1.
I have been entertained by a chapter in “The Mathematics of Generalization” (David H. Wolpert, ed.). It’s titled “Reflections after Refereeing papers for NIPS” by Leo Breiman, and lists the following uses of a theory: 1. comfort (“We knew it worked, but it’s nice to have a proof”), 2. insight (“Aha! So that’s why it works”), 3. innovation (“At last, a mathematically proven idea that applies to data”), 4. suggestion (“Something like this might work with data”) and 5. inquiry (“Sensible and intelligent efforts to understand what is going on”).
Absolute bounds work only if you assume what the “truth” is. Adopting the epistemological stance, we shouldn’t assume what the truth is, so all the comfort is false. I’m highly skeptical of absolute bounds (method A will have the error rate of [x,y]), but I appreciate relative bounds (method A will be better (in a specific sense) than B if ….)
The PAC-Bayes (margin) bound, sample compression bound, PAC-MDL bound and Occam’s razor bound all differ from the 1+theta^2 example in that they are never greater than 1.
I think we are still struggling to understand which bounds are sensible to minimize and which are not. There is potential, there are some initial successful attempts, but there is not a full understanding.
As for the why of (2), I think it’s fair to say that something can be learned about learning from the training set bounds. This doesn’t exclude the possibility of learning other things from other sources.
As for uses of bounds, I am mostly only compelled by (3) or efforts plausibly aimed at achieving (3).
There is one subtle point which I should have pointed out earlier. These bounds are not of the form: “You will do at least this good” after seeing m examples (a standard PAC-learning statement). This sort of statement requires an additional assumption. Instead, these bounds are of the form: assuming past and future observations are iid, you will rarely be wrong in bounding the error rate according to f(observed error rate, rarity).
Neither of these approaches assume the full truth—they assume lesser or greater aspects of the full truth and then reason about the full truth from these aspects. This approach is dangerous (because you reason based upon the unknowable), but it might plausibly lead to better learning algorithms for some situations.
As a practitioner, I have been somewhat worried about several aspects of the derivations (and results) related to these bounds. Perhaps this is not the best place to request clarifications (since I don’t think I understand *all* of the tons of work which has been done in this area, though I have tried to read most of the standard material such as Rademacher theory, VC theory, PAC-Bayes theory, etc.). On the other hand, I strongly believe that such an open discussion forum may well be the best place for a person like me to request help. I would particularly appreciate intuitive counter-arguments that will allow me to see why the concerns I mention below are wrong (if that is the case).
Some of my concerns have been partly touched upon in the previous discussion but perhaps a little bit more elaboration and links to literature may help me understand things better. In any case, here they are:
1. Test set bounds: in essence these attempt to study how well the learned classifier will perform on future test samples, provided they are also drawn i.i.d. from the same data-generating distribution as the validation set used to compute the bound.
While I like this sort of bound better (because it provides tight lower and upper bounds on performance), I still have some concerns which nag me. Perhaps the most important is the assumption of i.i.d. draw of the validation set and the future test sets *from the same distribution*. Let us see two circumstances when this will be violated in practical problems.
In the first situation, the sample used for validation is drawn in a biased way and is not reflective of the full set of environments that may occur in future test samples. For example, while studying whether a prostate cancer detection algorithm will work well when released into the market, we may choose to see how well the algorithm performs on a validation set, and use this as a basis for assessing future performance. However, if the set of patients used in validation is drawn only from a sub-population (e.g., US males), then there is no guarantee that predictions based on this will be reflective of performance on the entire world population. In more formal terms, correcting the bias in the selection of the validation set is essential. How would we ever know whether the validation set is truly drawn from the entire population in an unbiased way? This is very tricky in many practical situations, so we have to use physical intuition to ensure that we have samples from various geographical locations, various sub-types of the cancer, etc. Again, it is critical to evaluate what sorts of sampling mechanisms are used in creating a good validation set: i.e., what should be the ratio of the number of validation samples across the various sub-types of the disease, and what exact effect will slight errors in this ratio have on the accuracy of the bound? This is often much more important in practice than whether we use cross-validation or whether we use the binomial (or, in multi-class cases, multinomial) test set bounds.
Second illustration of practical problems: how do you know that the samples are drawn independently? Even if you know they are not i.i.d., what can you do about it in these bounds? For example, if you collect multiple samples from the same patient (e.g., from multiple locations in his body or on different days), then clearly the samples are not i.i.d. Similarly, the results from any one site are not i.i.d. in a larger sense, because you have to account for site-specific characteristics in the data. This is addressed well when designing classifiers using (for example) random or mixed effects in generalized linear models (or, even more generally, a multi-level hierarchical latent variable model). However, this is not properly addressed in the current test set bounds, at least as far as I know.
2. Training set bounds: Again, a separate litany of problems can be listed for this sort of bound. The most compelling question is whether their use is based on a logical fallacy (fundamentally so, in a way that just can’t be overcome). Perhaps the best one-page explanation of this criticism is slide 22 of Radford Neal’s NIPS tutorial from a few months back. See here for the slides: ftp://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf
For those who don’t want to break the continuity of thought by going to a separate document, the question raised by Neal is this:
The statement from these bounds that

P(|test error rate - training error rate| > \epsilon) < \delta

DOES NOT imply, without logical fallacy, that

P(|test error rate - 0.02| > \epsilon | training error rate = 0.02) < \delta.

He points out that the same arguments which show that conventional statistical hypothesis testing is based on logical fallacies also extend to the training set bounds. Essentially, without logical fallacy, training set bounds don’t mean anything useful.

b. In more poignant and easy to understand terms, let us see how this actually matters in practice. If the upper/lower bound on the error rate (as a function of some regularization parameter, for example) is not "of the same shape" as the true error rate, then we can’t use these bounds for choosing the regularization parameter, or more generally for model selection (when comparing different models such as a linear vs. a higher-order polynomial classifier). Nothing in the nature of these bounds guarantees that the bound is *correlated* with reality! By nature these are probabilistic *bounds* on future performance (not estimators of performance), so this guarantee is missing in a fundamental sense. This is not just idle criticism; e.g., Matthias Seeger pointed out (in the recent NIPS workshop on the (ab)use of bounds) that when he developed these sorts of bounds, the very shape of the bound was wrong and not matched to reality in several cases. This points to a fundamental problem with trying to minimize upper bounds on error rates rather than minimizing estimated future error rates, in some sense.

c. A third criticism I had is the following. The PAC-Bayes theorem can be used with several different priors to get several different bounds. For example, relying on the PAC-Bayes theorem, margin bounds can be derived as the result of using a special form of the prior. However, if I used some other prior, I could derive other alternative forms of bounds, and they are equally valid: these bounds need not depend in any way on the concept of the margin in the data at all. Thus, we can come up with bounds for any "prior intuition" used to derive these bounds (or equivalently, learning algorithms that use these priors). As an example, I did try taking this route for a sparseness-inducing Laplacian prior (i.e., L1 regularization, intuitively analogous to an algorithm using few support vectors, or to using very few input features) to see what the bounds imply in this case: http://www.lx.it.pt/~mtf/Krishnapuram_Carin_Figueiredo_Hartemink_2005.pdf
Thus, you can get a whole slew of bounds, one for each intuition that is used to guide algorithm development. No single intuition is universally better than any other: indeed, no single prior is better than any other in all cases, so this should be obvious. Therefore it appears to me that the whole objective of trying to find bounds and to use these to guide algorithm development is somewhat moot. After all, every intuition you can use results in a different form of bound, so why is one intuition (e.g., maximize the margin, or equivalently minimize the L_2 norm of the SVM weight vector w, because my bound tells me this is a good thing to do in order to get good generalization) better than another (e.g., minimize the L_1 norm of w)? This is also related to the very elegant paper by Thore Graepel and Ralf Herbrich, which raised very strong questions and which I have not seen answered properly since then:
http://www.research.microsoft.com/~rherb/papers/graeherwill00.ps.gz
Before seeing the training data, a priori there is no way to state which algorithm is better. After seeing the training data, if we want to see which of several models performs best, this is a model selection or model comparison problem and should be based not on bounds, but on expectations of performance. Essentially, it may be more worthwhile to ask how well the algorithm models whatever data has been provided (i.e., the Bayesian marginal likelihood, also sometimes referred to as the evidence for the model, e.g., in MacKay’s work) in order to see which algorithm is best: this also avoids the artificial separation of the data into a training set and a separate validation set, which is a good thing because the choice of split is itself arbitrary and introduces variation in the answers, especially when the available data is very limited.
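For what it is worth, here is a toy sketch of what "comparing models by their Bayesian marginal (evidence)" looks like in the simplest conjugate case: a Bernoulli rate scored under two different Beta priors standing in for two models. It is only an illustration of the shape of the idea, not the classifier-evidence computation the comment has in mind; the data and priors are made up.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_evidence(k, n, a, b):
    """Log marginal likelihood ("evidence") of n binary outcomes containing k
    ones, under a Bernoulli model with a Beta(a, b) prior on its parameter:
    p(D | M) = B(a + k, b + n - k) / B(a, b)."""
    return log_beta(a + k, b + n - k) - log_beta(a, b)

# The same data scored under two hypothetical models (two different priors):
k, n = 18, 20                              # 18 ones observed in 20 outcomes
m1 = log_evidence(k, n, 1.0, 1.0)          # model 1: uniform prior on the rate
m2 = log_evidence(k, n, 50.0, 50.0)        # model 2: prior concentrated near 0.5
print(m1, m2)                              # model 1 has the higher evidence here
```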
BTW, yet another criticism of training set bounds raised by Matthias Seeger in the (ab)use of bounds workshop at NIPS was that these training set bounds (as a function of some regularization or other parameter of the learning algorithm) often don’t even have the same *shape* as the true error rate. In this sense, not only are the bounds loose, but often of the wrong shape, so choosing regularization or kernel parameters on the basis of these bounds is *misleading*. I have also observed this in other cases for bounds I derived but did not publish for this very reason.
The reason is that the bound is derived based on some intuition (“prior”) which is either implicit or, in the case of PAC-Bayes theory, explicit. When the intuition is borne out on the training data, the bound is tight, but when the intuition is not supported by the training data, the bound grows large. However, in the latter case, the shape of the bound is not correlated with the shape of the true error. This is a problem: minimizing a bound simply does not minimize the true test error, since the shape itself is different for the two curves (i.e., despite some illustrations to the contrary in the published literature, it is not just a case of large slack with the same shape).
Essentially, nothing in the derivation of these bounds requires the correlation between the true error and the bound to be high over the entire range of scenarios (e.g., the entire range of parameters for which the bound is computed).
Balaji’s comment highlights something that Sam and I have talked about as well in light of Radford’s slides. Certainly the probability of an algorithm going wrong makes sense in many applications of computer science. But in data-driven work like machine learning, don’t we only really care about the conditional probability of future successes given success on the training data? We can concoct algorithms that “succeed” on the training data only when they are getting lucky and would fail spectacularly on future test data. The fact that an ML practitioner, even if the data were really i.i.d., can’t use these bounds to predict future successes, or run what Sam dubbed the “VC insurance company” and be able to guarantee making money with high probability, makes the form of the bounds seem less useful than one might be led to believe.
Some of Balaji’s concerns are addressable and some are not.
1) Lack of independent and identically distributed data in real problems. This is a serious concern which sharply limits the scope of the theory. Many datasets are not independent, while some might (plausibly) be independent, such as the KDD 2004 physics dataset. This concern applies to training set bounds just as strongly as to test set bounds.
2) Misinterpretation of the bounds. Radford Neal’s slide is entirely correct about how people misinterpret, and I think people who think in a Bayesian manner are particularly prone to misinterpret that way. (BTW, this also applies to test set bounds.) Misinterpretation of that sort is clearly a fallacy. The solution for this is “don’t misinterpret bounds that way”.
Part of the problem here is that the thing people want (say, a theory which allows conditioning yet doesn’t make many assumptions) is not the thing that people can get. Drew’s comment points this out.
The way I think about these bounds is that they provide information (about the true error rate) which is not often wrong (over different draws of examples). Suppose you are a manager who has a learning problem that he delegates to some employee. How are you supposed to know the employee succeeded? He might come to you and say “the log probability of your data under my model is very large!”, but this is inherently not convincing because the data might be memorized, with a probabilistic model explaining it with probability 1. Using a confidence interval-like bound provides a mechanism of evaluation for you-the-manager which does not often go wrong.
3) The set of possible bounds is too large so bound minimization is a moot algorithm. The idea of plugging in an arbitrary “prior” and getting a sane algorithm is absurd. The idea of plugging in a good “prior” and getting a sane algorithm is not. If you’ll forgive a self-reference, Section 2.1 of the microchoice bound paper shows that plugging in good information yields a tight bound.
We can compare this to Bayesian learning. Plugging an arbitrary prior into Bayes’ law is absurd. Plugging in a good prior can easily result in a sane algorithm.
The difference between these viewpoints is the difference between a “prior” and a (Bayesian) prior. If you’ll forgive another self-reference, here is some evidence that this distinction is essential: there does not seem to be an easy correct way to transform from a “prior” to a prior.
4) We don’t want to minimize a true error bound, but rather an expected value of the true error rate. This is certainly correct, but it presupposes that we know this expectation, and that knowledge is typically a greater assumption. Even a Bayes-law expert accepts that there are difficulties related to transcribing their prior into a software program. Whether or not it is a good idea to try to add in extra possibly-incorrect information or attempt to perform well given lesser information seems like an empirical question.
5) Bounds can be minimized while error rates increase. This can easily be true, especially with bounds that are not tight. For any bound, the degree to which bound minimization can “go bad” is related to the looseness of the bound, which means that minimizing tight bounds is saner than otherwise.
This issue is real, and I think we have to ask for the indulgence of the learning community at large, while we experiment. We haven’t had tight enough bounds that bound minimization was reasonably sane until just recently, so there is a reasonable hope for future success.
Independent of whether or not bound minimization works out, there is some hope that this line of research can yield better algorithms.
For example, many learning algorithms have the regularization form “minimize error rate + constant*complexity” and there exists a bound of the form: true error rate less than f(error rate, complexity, delta). A study of tight bounds might yield algorithmic speedup because we can more sharply focus the search for the right “constant”.
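As an illustration of this last point, here is a small hedged sketch: instead of searching for the regularization constant, each candidate is scored directly with a generic Occam-style f(error rate, complexity, delta) and the minimizer is taken. The particular form of f, the candidate models, and their complexities below are assumptions made for the example, not a bound from a specific paper.

```python
import math

def occam_style_f(train_error, complexity_nats, m, delta):
    """A generic bound of the form f(error rate, complexity, delta):
    train_error + sqrt((complexity + ln(1/delta)) / (2m)), where
    `complexity_nats` plays the role of ln(1/prior(h))."""
    return train_error + math.sqrt((complexity_nats + math.log(1 / delta)) / (2 * m))

def select_by_bound(candidates, m, delta=0.05):
    """Rather than grid-searching a constant in "error + constant * complexity",
    pick the candidate minimizing the bound; the tradeoff between error and
    complexity is then fixed by the bound itself."""
    return min(candidates, key=lambda c: occam_style_f(c[1], c[2], m, delta))

# Hypothetical candidates: (name, training error, complexity in nats).
models = [("small tree", 0.10, 5.0),
          ("medium tree", 0.05, 40.0),
          ("large tree", 0.01, 400.0)]
print(select_by_bound(models, m=2000))     # the small tree minimizes this bound
```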
I do agree with a good bit of what John says above in comment #11 (or at least I understand where it comes from). Nevertheless, I think there are a couple of points where I would disagree in fundamental terms, or maybe I still don’t get the essence of the argument.
In particular, the first point is about (3): my question is more fundamental, and relates to why any learning-theoretic motivation for an algorithm is good for *general purpose application* across a large set of completely different datasets. Let’s take an example: SVMs maximize the margin (i.e., minimize the L_2 norm of the weight vector w) based on the support of the margin bounds. Now, the margin bound is the result of a certain prior, and if you used another prior, you would get another bound which favours minimizing the training error rate along with the L_1 norm. So what is the right complexity criterion: should we minimize the L_1 norm or the L_2 norm? This should depend on the data, and a priori there is no universal answer that works for all circumstances. This message is completely acceptable, especially when we are told to use a “sensible” prior for whatever data we have at hand. The standard criticism of Bayesian work (and the standard reply) are both still valid here.
However, this clearly points at another facet of the issue: for *general purpose application* to a completely new dataset, people use SVMs just because of the margin bounds. In all the applications of SVMs (that justify themselves based on learning theory) I see in papers, I don’t see a single paper which tries to say why the form of prior which results in this bound is sensible for the data.
My point: when there is no reason why L_2 is more sensible than L_1 for general purpose use, there is absolutely no theoretical reason why they should not instead minimize the L_1 norm. Now, let us extend the argument to its logical conclusion: without designing priors for specific applications and specific data (i.e., without taking into account the characteristics of the data we have), there is absolutely no reason to believe that any one form of algorithm is better. After all, just like the L_1 and L_2 minimization arguments resulting from two different choices of prior, we can easily come up with any number of other choices of prior, and any number of other algorithms.
All this is fair, and exactly the same situation holds for the Bayesian alternatives. However, the fundamental problem: people think that just because there is a large margin bound, the SVM is somehow superior, and don’t even bother to study whether large margin assumptions are appropriate for their data. In general it is very difficult to come up with “sensible priors”, e.g., what should be the sensible prior when my data results from a very complicated and ill-understood physical process? Further, how does one know whether the L_1 or L_2 minimization result from learning theory is better, before seeing the data? Answer: you don’t. There is no good answer I can come up with which can help.
Which brings me to the next point of disagreement:
2. Importance of tight bounds, and an argument about why even this is not good enough. The point I want to make is that even if the bounds are tight, it doesn’t matter! The question is whether the bound has the same shape as the true error. In other words, if the bound were loose, but equally so in all cases, I would be less worried; unfortunately, for some choices of regularization parameter, kernel parameter, etc. the bound is less loose, and for others it is more loose.
In this sense it is not about how loose the bound is, it is about whether the bound is equally loose/tight in all circumstances. If the latter is true, then bound minimization makes sense, even if the bound is loose; otherwise, even if the bound is tighter, it has the wrong shape, and you are in trouble.
My point: PAC-Bayes does improve the tightness as compared to say VC bounds, but it *does not* overcome the fundamental limitation. I doubt if any bound can do so, since it is after all a *bound* on error rate, not an *estimator* of accuracy. The correlation is what matters most, not the slack!
3. We do have some estimators of accuracy, for example the Xi-alpha estimators of Joachims (I think that is what they were called, but I forget the exact name). These are not bounds, but work reasonably, though they still have problems. I have not yet studied their derivations carefully, but they also make some good points, and assume relatively little about the data. My point: it is not correct to decry estimators of error (as a class of work) and claim that bounds are to be preferred.
BTW, John can you please provide links to the literature on handling non-independent samples (which you mentioned at the bottom of your original post). I would like to read those papers if I can get my hands on them.
I’m very interested in these problems and would at least like to know about such work! I think this may be closely related to several other problems which are of interest to the ML community now. For example co-training, multi-view learning (there is a workshop in ICML I think), and multi-task learning (eg Rich Caruana’s thesis work in the 90s) should all benefit from better understanding that may result from such work.
Counterstatement (1) seems to be: bound theory doesn’t tell us which prior or “prior” to use, and that’s the important question in solving learning. You also point out that motivations from bounds are sometimes abused. I agree with both points. With respect to bound minimization, I only hope for guidance about how to tradeoff “complexity” versus error rate. Hoping for an explanation about what the right “complexity” is seems impossible because it is problem dependent.
For counterstatement (2), let’s consider the case of two bounds: one is Sam’s 1+theta^2 bound, and the other is a hypothetical bound which is everywhere tight: for every classifier the bound on the true error rate is within 0.01 of the true error rate (and always larger than the true error rate). Suppose we minimize these two bounds. For Sam’s bound, this minimization could result in a classifier with error rates anywhere in the interval [0,1]. For the other, the error rate of the classifier minimizing the bound must be within 0.01 of the error rate of the best classifier. This is what I mean by “minimizing tight bounds is saner than otherwise”.
It is certainly true that a great quantity to minimize is a bound which happens to be some constant offset from the truth. However, I have no idea how to achieve that sort of behavior, and some idea of how to create tight bounds.
For point (3), I do not mean to decry the entire idea of an accuracy estimator, but rather the implementations I’ve seen and studied. Perhaps my mathematical heritage betrays me, but I find it difficult to feel motivated by an estimator which must fail sometimes and for which I can’t provide a succinct characterization of those times.
You requested pointers. The PAC-MDL bound referred to above can be thought of as applying to a finite exchangeable set (you’ll see what I mean if you read the proof). The same applies to the other work on “Transductive PAC-Bayes bounds”. For another, I found this paper about relaxing IID assumption to a conditional (on y) IID assumption interesting.
Thanks a lot for the clarifications/pointers!
OK, let me see if in my post-UAI haze I can reconstruct the basic idea Drew is referring to.
Imagine we start a “VC Insurance Company” with the following operating policy.
1) We accept clients who have trained a model of known VC complexity on a finite labelled training set, but we require them to certify that they did not look at their training set before selecting their model.
2) We sell them “epsilon-level” insurance for future test set performance, assuming the test set comes from the same distribution as the training set. In other words, if their performance on the test set is more than epsilon worse than their performance on the training set, we pay out $1; otherwise we don’t pay out.
3) We charge a premium “delta” based on the size of their training set and the VC dimension of their model, which we compute based on the VC bounds.
The question is, will the VC insurance company break even in the long run? The answer, unfortunately, is “no”. Why? Because we can’t control who our clients are. What will happen is that people will pick ridiculously simple models, train them on large training sets, and see if they get lucky enough to achieve a low training error. If they don’t get lucky, they will go to the beach. If they do get lucky, they will come to the VC insurance company and buy insurance, correctly certifying that they did not look at the data before choosing the model, that the model has a low VC dimension, and that they achieved excellent training error on a large dataset. They will buy insurance and we will, on average, get taken to the cleaners.
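Here is a toy simulation of that selection effect (my own construction, with purely illustrative parameter values): every candidate client uses the same data-independent constant predictor whose true error rate is 0.5, only the lucky candidates buy insurance, and the unconditional deviation guarantee stays intact while the conditional payout rate among buyers is essentially 1.

```python
import random

def vc_insurance_sim(n_candidates=100_000, m=20, epsilon=0.2,
                     lucky_threshold=0.25, delta=0.05, seed=0):
    """Toy simulation of the selection effect above.  Every candidate client
    uses the same data-independent constant predictor with true error rate 0.5
    and draws m iid training labels.  Only candidates whose observed training
    error is at most `lucky_threshold` buy a $1 policy at premium `delta`; the
    policy pays out when the true error exceeds the training error by more than
    epsilon.  All parameter values are illustrative only."""
    rng = random.Random(seed)
    deviations = buyers = payouts = 0
    for _ in range(n_candidates):
        train_error = sum(rng.random() < 0.5 for _ in range(m)) / m
        deviated = 0.5 > train_error + epsilon      # the true error is exactly 0.5
        deviations += deviated
        if train_error <= lucky_threshold:          # only the lucky buy insurance
            buyers += 1
            payouts += deviated
    print("unconditional deviation rate:", deviations / n_candidates)  # ~0.02 <= delta
    print("payout rate among buyers:    ", payouts / max(buyers, 1))   # ~1.0
    print("company profit ($):          ", buyers * delta - payouts)   # a large loss

vc_insurance_sim()
```

The variant described in the next comment, where the policy is bought before the training sample is drawn, corresponds to dropping the lucky-threshold filter: payouts then occur at the unconditional deviation rate, so a premium of delta suffices.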
There is a variant of the VC insurance company which can survive (at least, it can survive when the data is truly IID). In this variant, the customer comes to the company and buys epsilon-level insurance for some model at some bound-determined rate before the training examples are created.
After the insurance purchase, training examples are sampled and learning is conducted, resulting in some training error rate. The insurance company then insures that the test error rate will not be more than epsilon larger than the training error rate.
The distinction between the business plans of these VC insurance companies is critical to avoiding the standard misunderstanding pointed out by Balaji (and Neal).
Yep, that nails the distinction between conditional and joint right there.
Another way to see this is to construct a classifier/distribution pair that collude so that the classifier performs well on a training set only when it fails miserably on the test set. This totally breaks the conditional interpretation as well.
I agree that conditional probability is generally a more Bayesian way to think about things. But I think this is actually because it’s the way people naturally think about things unless trained otherwise, and the Bayesian formalism supports the natural interpretation. Frequentists are (well, sometimes) trained not to think about the conditional probability, but rather the unconditional probability of failure. I suppose for algorithm design the unconditional is fairly compelling; I’m just not sure it has anything useful to say when you are applying it after the fact to data.
One of the Bayesian ways to avoid giving exact priors or overfitting is setting priors over priors (e.g., techniques like hierarchical Dirichlet processes). John, in your paper about inconsistency of Bayesian and MDL classifiers, you proved conditions under which choosing a bad prior leads to Bayesian, MAP, or MDL classifiers never converging to the required error rate. I would like to know whether having a prior over priors means your hyperpriors can, in a sense, be less sane. The general notion is that when you have priors over priors, you don’t need to be very wary of having exactly the right hyperpriors, or for that matter even hyperparameters. It seems obvious that setting absurd hyperparameters or hyperpriors should have similarly bad results as setting absurd priors (this would certainly be true for a small number of samples m). But does it hold when m tends to infinity, and if so, under what conditions?
The short answer is “I don’t know”.
A more complicated answer is that defining the difference between hyperpriors and priors is very tricky. There are several reasonable definitions for which a hyperprior is just another name for a prior and so any result holding for priors hold for hyperpriors.
A simple criticism of statistical learning theory that I see as fairly strong is the following:
I train two neural networks on a dataset. One neural network has just two hidden units and a small VC dimension; the second neural network has 100 hidden units and a large VC dimension.
Now say that, after training, both neural networks end up representing the same decision boundary. It is of course possible for a 100-unit neural network to implement the same function as a two-hidden-unit model. According to VC theory, the 2-unit model should generalise better than the 100-unit model. However, they implement the same boundary!
The reason this happens is that VC theory violates the likelihood principle. That is, the quality of the decision boundary used is measured by the VC dimension, which is a measure of the decision rules that might have been produced by the algorithm but were not. The likelihood principle states, loosely, that data you might have received but did not is irrelevant.
I should also add that the list of methods that violate the likelihood principle is very long and includes: Monte Carlo, the bootstrap, cross-validation, MacKay’s evidence framework, confidence intervals, p-values, objective Bayes, and of course statistical learning theory. What survives? Only likelihoodist and subjective Bayesian methods.
You might be substantially happier with learning theory if you use a different bound than the VC bound. For example, if you use “Structural Risk Minimization” (a.k.a. a union bound, a.k.a. a Bonferroni correction) over the set of bounds with different VC dimensions, this problem is addressed.
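A small sketch of what that looks like (my own illustration; the penalty form and constants are one textbook variant of a VC deviation term, and the union bound here simply splits delta evenly across the nested classes): the learned boundary is charged the penalty of the smallest class containing it, so the two networks in the example above receive essentially the same guarantee.

```python
import math

def vc_penalty(d, m, delta_i):
    """One common textbook form of a VC deviation term (constants vary)."""
    return math.sqrt((d * (math.log(2 * m / d) + 1) + math.log(4 / delta_i)) / m)

def srm_bound(train_error, boundary_in_class, vc_dims, m, delta=0.05):
    """Structural risk minimization over nested classes H_1, H_2, ...:
    split delta across the classes (the union / Bonferroni correction) and
    charge the learned boundary the penalty of the smallest class containing
    it.  `boundary_in_class[i]` says whether the boundary lies in H_i."""
    per_class_delta = delta / len(vc_dims)
    bounds = [train_error + vc_penalty(d, m, per_class_delta)
              for d, inside in zip(vc_dims, boundary_in_class) if inside]
    return min(bounds)

# The 2-hidden-unit boundary lies in both the small class and the large class,
# so its guarantee is driven by the small class no matter which network found it.
print(srm_bound(0.05, boundary_in_class=[True, True], vc_dims=[30, 1500], m=10_000))
```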