This post is really for people *not* in machine learning (or related fields). It is about a common misperception which affects people who have not thought about the process of trying to predict something. Hopefully, by precisely stating it, we can remove it.

Suppose we have a set of events, each described by a vector of features.

| feature 1 | feature 2 | feature 3 | feature 4 | feature 5 |
|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 |
| 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 |
| 0 | 1 | 1 | 1 | 0 |

Suppose we want to predict the value of the first feature given the others. One approach is to bin the data by *one* feature. For the above example, we might partition the data according to feature 2, then observe that when feature 2 is 0 the label (feature 1) is mostly 1 (2 of 3 events). On the other hand, when feature 2 is 1, the label (feature 1) is evenly split (2 of 4 events each way), so we predict 0. Using this simple rule we get an observed error rate of 3/7.
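To make the arithmetic concrete, here is a minimal sketch of the binning procedure in Python. The names `DATA` and `binning_errors` are made up for illustration, and column 0 of each row is assumed to hold the label (feature 1).

```python
from collections import Counter, defaultdict

# The seven events from the table above; column 0 is the label (feature 1).
DATA = [
    (0, 1, 0, 1, 1),
    (1, 0, 1, 0, 1),
    (1, 1, 0, 1, 0),
    (0, 0, 1, 1, 1),
    (1, 1, 0, 0, 1),
    (1, 0, 0, 0, 1),
    (0, 1, 1, 1, 0),
]


def binning_errors(data, feature_index, label_index=0):
    """Bin events by one feature, predict the most common label in each bin,
    and return the number of mistakes made on the same data."""
    bins = defaultdict(Counter)
    for row in data:
        bins[row[feature_index]][row[label_index]] += 1
    # Within a bin, every event that does not carry the most common label
    # is counted as an error (a tie costs the same whichever way it breaks).
    return sum(sum(c.values()) - max(c.values()) for c in bins.values())


errors = binning_errors(DATA, feature_index=1)  # feature 2 is column 1
print(f"{errors}/{len(DATA)}")  # prints 3/7
```

The bin where feature 2 is 0 contributes one error and the bin where feature 2 is 1 contributes two, which is where the 3/7 comes from.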

There are two issues here. The first is that this is really a training error rate, and (hence) may be an overoptimistic prediction. This is not a very serious issue as long as there are a reasonable number of representative examples.

The second issue is more serious. A simple rule (if the number of 1's among features 2 through 5 is less than 3, predict 1, else predict 0) achieves error rate 0. By binning the data according to only one feature, the potential of achieving error rate 0 is removed.
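The claim is easy to check directly. The sketch below assumes the count of 1's is taken over features 2 through 5 (the reading under which the rule agrees with the table); `DATA` and `count_rule` are illustrative names, not anything standard.

```python
# The seven events from the table above; column 0 is the label (feature 1).
DATA = [
    (0, 1, 0, 1, 1),
    (1, 0, 1, 0, 1),
    (1, 1, 0, 1, 0),
    (0, 0, 1, 1, 1),
    (1, 1, 0, 0, 1),
    (1, 0, 0, 0, 1),
    (0, 1, 1, 1, 0),
]


def count_rule(row):
    """Predict 1 if fewer than three of features 2 through 5 are 1, else 0."""
    return 1 if sum(row[1:]) < 3 else 0


# Count the events where the rule disagrees with the label.
errors = sum(count_rule(row) != row[0] for row in DATA)
print(f"{errors}/{len(DATA)}")  # prints 0/7
```

The rule uses all four non-label features at once, which is exactly the information that binning on a single feature throws away.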

The reason for binning is often *definitional*. Many people think of probability as an observed (or observable) rate. For these people, the probabilities of events can only be learned by finding a large number of identical events and then *calculating* the observed rate. Constructing “identical events” *always* involves throwing away the unique context of the event. This disposal of information eliminates the possibility of good prediction performance.

The solution to this problem is education. There are other definitions of probability which are more appropriate when every event is unique. One thing which makes people uncomfortable about probabilities over unique events is that probabilities are no longer observable—they are only estimatable. This loss of grounding is a price which must be paid for improved performance. Luckily, we can tell if our prediction performance improves on labeled examples.

I am in your target audience. I understand what you’re saying and it’s very interesting. Can you give a couple of other examples, preferably slightly less obvious, and maybe even from “real life” where this may occur?

A real-world example would be coin flipping. Using only the sequence of past flips may get you close to a 50% probability for each side, but seeing which side the coin started on, or measuring the power and angle with which it was launched, would allow you to better predict the coin's eventual side.

http://www-stat.stanford.edu/~cgates/PERSI/papers/headswithJ.pdf

Suppose you had two paths available for a daily commute, and you wanted to predict which one to take to minimize your time. You might look at the weather and see “cloud” or “clear sky”. You might watch the weatherman and see “rain” or “shine”. You might listen to the radio and hear “jam on route 1” or not, and “jam on route 2” or not. You might ask your friend, who would say “route 1” or “route 2”. You might check whether or not your dog wags his tail that morning. You might consider the past performance of route 1 and route 2 on the days when each of them was taken. You might wake up feeling “alert” or “sleepy”. This can continue indefinitely.

All of these bits of information make each event unique (with respect to description) but not fully determined (with respect to the prediction). If we want to predict which route is best, we can’t hope to simply bin alike events and go with the best choice.