The Stats Handicap

Graduating students in Statistics appear to be at a substantial handicap compared to graduating students in Machine Learning, despite being in substantially overlapping subjects.

The problem seems to be cultural. Statistics comes from a mathematics background which emphasizes long papers published slowly through journal review. Machine Learning comes from a Computer Science background which emphasizes quick publishing at reviewed conferences. This has a number of implications:

  1. Graduating statistics PhDs often have 0-2 publications while graduating machine learning PhDs might have 5-15.
  2. Graduating ML students have had a chance for others to build on their work. Stats students have had no such chance.
  3. Graduating ML students have attended a number of conferences and presented their work, giving them a chance to meet people. Stats students have had fewer chances of this sort.

In short, Stats students have had relatively few chances to distinguish themselves and are heavily reliant on their advisors for jobs afterwards. This is a poor situation, because advisors have a strong incentive to place students well, implying that recommendation letters must always be taken with a grain of salt.

This problem is more or less prevalent depending on which Stats department students go to. In some places the difference is substantial, and in other places not.

One practical implication is that when considering graduating stats PhDs for hire, some amount of affirmative action is in order. At a minimum, this means spending extra time getting to know the candidate and what the candidate can do.

The Meaning of Confidence

In many machine learning papers experiments are done and little confidence bars are reported for the results. This often seems quite clear, until you actually try to figure out what it means. There are several different kinds of ‘confidence’ being used, and it’s easy to become confused.

  1. Confidence = Probability. For those who haven’t worried about confidence for a long time, confidence is simply the probability of some event. You are confident about events which have a large probability. This meaning of confidence is inadequate in many applications because we want to reason about how much more information we have, how much more is needed, and where to get it. As an example, a learning algorithm might predict that the probability of an event is 0.5, but it’s unclear if the probability is 0.5 because no examples have been provided or 0.5 because many examples have been provided and the event is simply fundamentally uncertain.
  2. Classical Confidence Intervals. These are common in learning theory. The essential idea is that the world has some true-but-hidden value, such as the error rate of a classifier. Given observations from the world (such as err-or-not on examples), an interval is constructed around the hidden value. The semantics of the classical confidence interval is: the (random) interval contains the (deterministic but unknown) value, with high probability. Classical confidence intervals (as applied in machine learning) typically require that observations are independent. They have some drawbacks discussed previously. One drawback of concern is that classical confidence intervals break down rapidly when conditioning on information.
  3. Bayesian Confidence Intervals. These are common in several machine learning applications. If you have a prior distribution over the way the world creates observations, then you can use Bayes law to construct a posterior distribution over the way the world creates observations. With respect to this posterior distribution, you construct an interval containing the truth with high probability. The semantics of a Bayesian confidence interval is “If the world is drawn from the prior the interval contains the truth with high probability”. No assumption of independent samples is required. Unlike classical confidence intervals, it’s easy to have a statement conditioned on features. For example, “the probability of disease given the observations is in [0.8,1]”. My principal source of uneasiness with respect to Bayesian confidence intervals is the “If the world is drawn from the prior” clause—I believe it is difficult to know and specify a correct prior distribution. Many Bayesians aren’t bothered by this, but the meaning of a Bayesian confidence interval becomes unclear if you work with an incorrect (or subjective) prior.
  4. Asymptotic Intervals. These are also common in applied machine learning, and I strongly dislike them. The basic line of reasoning seems to be: “Someone once told me that if observations are IID, then their average converges to a normal distribution, so let’s use an unbiased estimate of the mean and variance, assume convergence, and then construct a confidence interval for the mean of a Gaussian”. Asymptotic intervals are asymptotically equivalent to classical confidence intervals, but they can differ spectacularly with finite sample sizes. The simplest example of this is when a classifier has zero error rate on a test set. A classical confidence interval for the error rate is [0,log(1/d)/n] where n is the size of the test set and d is the probability that the interval fails to contain the truth. For asymptotic intervals you get [0,0] which is bogus in all applications I’ve encountered.
  5. Internal Confidence Intervals. This is not used much, except in agnostic active learning analysis. The essential idea is that we cease to make intervals about the world, and instead make intervals around our predictions of the world. The real world might assign label 0 or label 1 given a particular context x, and we could only discover the world’s truth by actually observing x,y labeled examples. Yet, it turns out to sometimes be easy to infer “our learning algorithm will definitely predict label 1 given features x”. This allowed dependence on x means we can efficiently guide exploration. A basic question is: can this notion of internal confidence guide other forms of exploration?
  6. Gamesman intervals. Vovk and Shafer have been working on new foundations of probability, where everything is stated in terms of games. In this setting, a confidence interval is (roughly) a set of predictions output by an adaptive rule with the property that it contains the true observation a large fraction of the time. This approach has yet to catch on, but it is interesting because it provides a feature dependent confidence interval without making strong assumptions about the world.
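To make the zero-error example from item 4 concrete, here is a minimal sketch comparing the classical bound with the asymptotic recipe. The function names are hypothetical, d is treated as the allowed failure probability, and z=1.96 is an assumed choice corresponding to a roughly 95% normal interval.

```python
import math

def classical_zero_error_upper(n, d):
    """Upper end of the classical interval [0, log(1/d)/n] for a classifier
    with zero errors on n independent test examples. d is the probability
    that the interval fails to contain the true error rate."""
    # If the true error rate is e, zero errors occur with probability
    # (1 - e)^n <= exp(-e * n). Setting exp(-e * n) = d gives e = log(1/d)/n.
    return math.log(1.0 / d) / n

def exact_zero_error_upper(n, d):
    """Exact solution of (1 - e)^n = d, never looser than log(1/d)/n."""
    return 1.0 - d ** (1.0 / n)

def asymptotic_interval(errors, n, z=1.96):
    """Normal-approximation interval: mean +/- z * sqrt(variance / n),
    clipped to [0, 1]. z=1.96 is an assumed ~95% setting."""
    p = errors / n
    var = p * (1.0 - p) * n / (n - 1)  # unbiased variance of 0/1 outcomes
    half = z * math.sqrt(var / n)
    return (max(0.0, p - half), min(1.0, p + half))

if __name__ == "__main__":
    n, d = 100, 0.05
    print("classical upper bound:", classical_zero_error_upper(n, d))
    print("exact upper bound:    ", exact_zero_error_upper(n, d))
    print("asymptotic interval:  ", asymptotic_interval(0, n))
```

With zero observed errors the estimated variance is zero, so the asymptotic interval collapses to a single point no matter how small the test set is, while the classical bound shrinks only as 1/n.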

Complexity Illness

One of the enduring stereotypes of academia is that people spend a great deal of intelligence, time, and effort finding complexity rather than simplicity. This is at least anecdotally true in my experience.

  1. Math++ Several people have found that adding useless math makes their paper more publishable as evidenced by a reject-add-accept sequence.
  2. 8 page minimum Who submitted a paper to ICML violating the 8 page minimum? Every author fears that the reviewers won’t take their work seriously unless the allowed length is fully used. The best minimum violation I know is Adam’s paper at SODA on generating random factored numbers, but this is deeply exceptional. It’s a fair bet that 90% of papers submitted are exactly at the page limit. We could imagine that this is because papers naturally take more space, but few people seem to be clamoring for more space.
  3. Journalong Has anyone been asked to review a 100 page journal paper? I have. Journal papers can be nice, because they give an author the opportunity to write without sharp deadlines or page limit constraints, but this can and does go awry.

Complexity illness is a burden on the community. It means authors spend more time filling out papers, reviewers spend more time reviewing, and (most importantly) effort is misplaced on complex solutions over simple solutions, ultimately slowing (sometimes crippling) the long term impact of an academic community.

It’s difficult to imagine an author-driven solution to complexity illness, because the incentives are simply wrong. Reviewing based on solution value rather than complexity is a good way for individual people to reduce the problem. More generally, it would be great to have a system which explicitly encourages research without excessive complexity. The best example of this seems to be education—it’s the great decomplexifier. The process of teaching something greatly encourages teaching the simple solution, because that is what can be understood. This seems to be true both of traditional education and less conventional means such as wikipedia articles. I’m not sure exactly how to use this observation—Is there some way we can shift conference formats towards the process of creating teachable material?

Sufficient Computation

Do we have computer hardware sufficient for AI? This question is difficult to answer, but here’s a try:

One way to achieve AI is by simulating a human brain. A human brain has about 10^15 synapses which operate at about 10^2 per second, implying about 10^17 bit ops per second.

A modern computer runs at 10^9 cycles/second and operates on 10^2 bits per cycle, implying 10^11 bits processed per second.

The gap here is only 6 orders of magnitude, which can be plausibly surpassed via cluster machines. For example, the BlueGene/L operates 10^5 nodes (one order of magnitude short). Its peak recorded performance is about 0.5*10^15 FLOPS, which translates to about 10^16 bit ops per second, which is nearly 10^17.
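The arithmetic behind this estimate can be sketched in a few lines. The figures are the rough ones quoted in the text, the variable names are purely illustrative, and everything is only meant to be read to the nearest order of magnitude.

```python
import math

# Brain estimate (figures from the text, order of magnitude only).
synapses = 1e15            # synapses in a human brain
ops_per_synapse = 1e2      # operations per synapse per second
brain_ops = synapses * ops_per_synapse      # about 10^17 bit ops per second

# Single-computer estimate.
clock_hz = 1e9             # cycles per second
bits_per_cycle = 1e2       # bits operated on per cycle
computer_ops = clock_hz * bits_per_cycle    # about 10^11 bits per second

# Gap between one computer and one brain, in orders of magnitude.
gap = math.log10(brain_ops / computer_ops)

# A cluster of 10^5 such nodes, assuming perfect scaling across nodes.
nodes = 1e5
cluster_ops = nodes * computer_ops          # about 10^16 bit ops per second
shortfall = math.log10(brain_ops / cluster_ops)

if __name__ == "__main__":
    print("gap for one computer (orders of magnitude):", round(gap))
    print("shortfall for the cluster (orders of magnitude):", round(shortfall))
```

The perfect-scaling assumption is exactly what the bandwidth and latency criticism calls into question, so the cluster figure is best read as an upper bound.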

There are many criticisms (both positive and negative) of this argument.

  1. Simulation of a human brain might require substantially more detail. Perhaps an additional 10^2 is required per neuron.
  2. We may not need to simulate a human brain to achieve AI. There are certainly many examples where we have been able to design systems that work much better than evolved systems.
  3. The internet can be viewed as a supercluster with 10^9 or so CPUs, easily satisfying the computational requirements.
  4. Satisfying the computational requirement is not enough—bandwidth and latency requirements must also be satisfied.

These sorts of order-of-magnitude calculations appear sloppy, but they work out a remarkable number of times when tested elsewhere. I wouldn’t be surprised to see it work out here.

Even with sufficient hardware, we are missing a vital ingredient: knowing how to do things.