This post is really for people not in machine learning (or related fields). It is about a common misperception which affects people who have not thought about the process of trying to predict somethinng. Hopefully, by precisely stating it, we can remove it.
Suppose we have a set of events, each described by a vector of features.
0 |
1 |
0 |
1 |
1 |
1 |
0 |
1 |
0 |
1 |
1 |
1 |
0 |
1 |
0 |
0 |
0 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
1 |
1 |
0 |
0 |
0 |
1 |
0 |
1 |
1 |
1 |
0 |
Suppose we want to predict the value of the first feature given the others. One approach is to bin the data by one feature. For the above example, we might partition the data according to feature 2, then observe that when feature 2 is 0 the label (feature 1) is mostly 1. On the other hand, when feature 2 is 1, the label (feature 1) is mostly 0. Using this simple rule we get an observed error rate of 3/7.
There are two issues here. The first is that this is really a training error rate, and (hence) may be an overoptimistic prediction. This is not a very serious issue as long as there are a reasonable number of representative examples.
The second issue is more serious. A simple rule (number of 1’s less than 3 implies 1, else 0) achieves error rate 0. By binning the data according to only one feature, the potential of achieving error rate 0 is removed.
The reason for binning is often definitional. Many people think of probability as an observed (or observable) rate. For these people, the probabilities of events can only be learned by finding a large number of identical events and then calculating the observed rate. Constructing “identical events” always involves throwing away the unique context of the event. This disposal of information eliminates the possibility of good prediction performance.
The solution to this problem is education. There are other definitions of probability which are more appropriate when every event is unique. One thing which makes people uncomfortable about probabilities over unique events is that probabilities are no longer observable—they are only estimatable. This loss of grounding is a price which must be paid for improved performance. Luckily, we can tell if our prediction performance improves on labeled examples.