I recently discovered that supervised learning is a controversial term. The two definitions are:
- Known Loss Supervised learning corresponds to the situation where you have unlabeled examples plus knowledge of the loss of each possible predicted choice. This is the definition I’m familiar and comfortable with. One reason to prefer this definition is that the analysis of sample complexity for this class of learning problems are all pretty similar.
- Any kind of signal Supervised learning corresponds to the situation where you have unlabeled examples plus any source of side information about what the right choice is. This notion of supervised learning seems to subsume reinforcement learning, which makes me uncomfortable, because it means there are two words for the same class. This also means there isn’t a convenient word to describe the first definition.
Reviews suggest there are people who are dedicated to the second definition out there, so it can be important to discriminate which you mean.
In many machine learning papers experiments are done and little confidence bars are reported for the results. This often seems quite clear, until you actually try to figure out what it means. There are several different kinds of ‘confidence’ being used, and it’s easy to become confused.
- Confidence = Probability. For those who haven’t worried about confidence for a long time, confidence is simply the probability of some event. You are confident about events which have a large probability. This meaning of confidence is inadequate in many applications because we want to reason about how much more information we have, how much more is needed, and where to get it. As an example, a learning algorithm might predict that the probability of an event is 0.5, but it’s unclear if the probability is 0.5 because no examples have been provided or 0.5 because many examples have been provided and the event is simply fundamentally uncertain.
- Classical Confidence Intervals. These are common in learning theory. The essential idea is that world has some true-but-hidden value, such as the error rate of a classifier. Given observations from the world (such as err-or-not on examples), an interval is constructed around the hidden value. The semantics of the classical confidence interval is: the (random) interval contains the (determistic but unknown) value, with high probability. Classical confidence intervals (as applied in machine learning) typically require that observations are independent. They have some drawbacks discussed previously. One drawback of concern is that classical confidence intervals breakdown rapidly when conditioning on information.
- Bayesian Confidence Intervals. These are common in several machine learning applications. If you have a prior distribution over the way the world creates observations, then you can use Bayes law to construct a posterior distribution over the way the world creates observations. With respect to this posterior distribution, you construct an interval containing the truth with high probability. The semantics of a Bayesian confidence interval is “If the world is drawn from the prior the interval contains the truth with high probability”. No assumption of independent samples is required. Unlike classical confidence intervals, it’s easy to have a statement conditioned on features. For example, “the probability of disease given the observations is in [0.8,1]“. My principal source of uneasiness with respect to Bayesian confidence intervals is the “If the world is drawn from the prior” clause—I believe it is difficult to know and specify a correct prior distribution. Many Bayesians aren’t bothered by this, but the meaning of a Bayesian confidence interval becomes unclear if you work with an incorrect (or subjective) prior.
- Asymptotic Intervals. This is also common in applied machine learning, which I strongly dislike. The basic line of reasoning seems to be: “Someone once told me that if observations are IID, then their average converges to a normal distribution, so let’s use an unbiased estimate of the mean and variance, assume convergence, and then construct a confidence interval for the mean of a gaussian”. Asymptotic intervals are asymptotically equivalent to classical confidence intervals, but they can differ spectacularly with finite sample sizes. The simplest example of this is when a classifier has zero error rate on a test set. A classical confidence interval for the error rate is [0,log(1/d)/n] where n is the size of the test set and d is the probability that the interval contains the truth. For asymptotic intervals you get [0,0] which is bogus in all applications I’ve encountered.
- Internal Confidence Intervals. This is not used much, except in agnostic active learning analysis. The essential idea, is that we cease to make intervals about the world, and instead make intervals around our predictions of the world. The real world might assign label 0 or label 1 given a particular context x, and we could only discover the world’s truth by actually observing x,y labeled examples. Yet, it turns out to sometimes be easy to infer “our learning algorithm will definitely predict label 1 given features x“. This allowed dependence on x means we can efficiently guide exploration. A basic question is: can this notion of internal confidence guide other forms of exploration?
- Gamesman intervals. Vovk and Shafer have been working on new foundations of probability, where everything is stated in terms of games. In this setting, a confidence interval is (roughly) a set of predictions output by an adaptive rule with the property that it contains the true observation a large fraction of the time. This approach has yet to catch on, but it is interesting because it provides a feature dependent confidence interval without making strong assumptions about the world.
It turns out that many different people use the term “Online Learning”, and often they don’t have the same definition in mind. Here’s a list of the possibilities I know of.
- Online Information Setting Online learning refers to a problem in which unlabeled data comes, a prediction is made, and then feedback is acquired.
- Online Adversarial Setting Online learning refers to algorithms in the Online Information Setting which satisfy guarantees of the form: “For all possible sequences of observations, the algorithim has regret at most log ( number of strategies) with respect to the best strategy in a set.” This is sometimes called online learning with experts.
- Online Optimization Constraint Online learning refers to optimizing a predictor via a learning algorithm tunes parameters on a per-example basis. This may or may not be applied in the Online Information Setting, and the strategy may or may not satisfy Adversarial setting theory.
- Online Computational Constraint Online learning refers to an algorithmic constraint that the amount of computation per example is constant as the number of examples increases. Again, this doesn’t imply anything in particular about the Information setting in which it is applied.
- Lifelong Learning Online learning refers to learning in a setting where different tasks come at you over time, and you need to rapidly adapt to past mastered tasks.
In everyday use a model is a system which explains the behavior of some system, hopefully at the level where some alteration of the model predicts some alteration of the real-world system. In machine learning “model” has several variant definitions.
- Everyday. The common definition is sometimes used.
- Parameterized. Sometimes model is a short-hand for “parameterized model”. Here, it refers to a model with unspecified free parameters. In the Bayesian learning approach, you typically have a prior over (everyday) models.
- Predictive. Even further from everyday use is the predictive model. Examples of this are “my model is a decision tree” or “my model is a support vector machine”. Here, there is no real sense in which an SVM explains the underlying process. For example, an SVM tells us nothing in particular about how alterations to the real-world system would create a change.
Which definition is being used at any particular time is important information. For example, if it’s a parameterized or predictive model, this implies some learning is required. If it’s a predictive model, then the set of operations which can be done to the model are restricted with respect to everyday usage. I don’t have any particular advice here other than “watch out”—be aware of the distinctions, watch for this source of ambiguity, and clarify when necessary.
This is about methods for phrasing and think about the scope of some theorems in learning theory. The basic claim is that there are several different ways of quantifying the scope which sound different yet are essentially the same.
- For all sequences of examples. This is the standard quantification in online learning analysis. Standard theorems would say something like “for all sequences of predictions by experts, the algorithm A will perform almost as well as the best expert.”
- For all training sets. This is the standard quantification for boosting analysis such as adaboost or multiclass boosting.
Standard theorems have the form “for all training sets the error rate inequalities … hold”.
- For all distributions over examples. This is the one that we have been using for reductions analysis. Standard theorem statements have the form “For all distributions over examples, the error rate inequalities … hold”.
It is not quite true that each of these is equivalent. For example, in the online learning setting, quantifying “for all sequences of examples” implies “for all distributions over examples”, but not vice-versa.
However, in the context of either boosting or reductions these are equivalent because the algorithms operate in an element-wise fashion. To see the equivalence, note that:
- “For any training set” is equivalent to “For any sequence of examples” because a training set is a sequence and vice versa.
- “For any sequence of examples” is equivalent to “For any distribution over examples” when the theorems are about unconditional example transformations because:
- The uniform distribution over a sufficiently long sequence of examples can approximate any distribution we care about arbitrarily well.
- If the theorem holds “for all distributions”, it holds for the uniform distribution over the elements in any sequence of examples.
The natural debate here is “how should the theorems be quantified?” It is difficult to answer this debate based upon mathematical grounds because we just showed an equivalence. It is nevertheless important because it strongly influences how we think about algorithms and how easy it is to integrate the knowledge across different theories. Here are the arguments I know.
- For all sequences of examples.
- Learning theory people (at least) are used to thinking about “For all sequences of examples”.
- (Applied) Machine learning people are not so familiar with this form of quantification.
- When the algorithm is example-conditional such as in online learning, the quantification is more general than “for all distributions”.
- For all training sets.
- This is very simple.
- It is misleadingly simple. For example, a version of the adaboost theorem also applies to test sets using the test error rates of the base classifiers. It is fairly common for this to be misunderstood.
- For all distributions over examples.
- Distributions over examples is simply how most people think about learning problems.
- “For all distributions over examples” is easily and often confused with “For all distributions over examples accessed by IID draws”. It seems most common to encounter this confusion amongst learning theory folks.
What quantification should be used and why?
(My thanks to Yishay Mansour for clarifying the debate.)