Machine learning has a new kind of “scaling to larger problems” to worry about: scaling with the amount of contextual information. The standard development path for a machine learning application in practice seems to be the following:
- Marginal. In the beginning, there was “majority vote”. At this stage, it isn’t necessary to understand that you have a prediction problem. People just realize that one answer is right sometimes and another answer other times. In machine learning terms, this corresponds to making a prediction without side information.
- First context. A clever person realizes that some bit of information x1 could be helpful. If x1 is discrete, they condition on it and make a predictor h(x1), typically by counting. If they are clever, then they also do some smoothing. If x1 is some real valued parameter, it’s very common to make a threshold cutoff. Often, these tasks are simply done by hand.
- Second. Another clever person (or perhaps the same one) realizes that some other bit of information x2 could be helpful. As long as the space of (x1, x2) remains small and discrete, they continue to form a predictor by counting. When (x1, x2) are real valued, the space remains visualizable, and so a hand crafted decision boundary works fine.
- … The previous step repeats for information x3,…,x100. It’s no longer possible to visualize the data but a human can still function as a learning algorithm by carefully tweaking parameters and testing with the right software support to learn h(x1,…,x100). Graphical models can sometimes help scale up counting based approaches. Overfitting becomes a very serious issue. The “human learning algorithm” approach starts breaking down, because it becomes hard to integrate new information sources in the context of all others.
- Automation. People realize “we must automate this process of including new information to keep up”, and a learning algorithm is adopted. The precise choice is dependent on the characteristics of the learning problem (How many examples are there at training time? Is this online or batch? How fast must it be at test time?) and the familiarity of the people involved. This can be a real breakthrough—automation can greatly ease the inclusion of new information, and sometimes it can even improve results given the limited original information.
Understanding the process of contextual scaling seems particularly helpful for teaching about machine learning. It’s often the case that the switch to the last step could and should have happened before the the 100th bit of information was integrated.
We can also judge learning algorithms according to their ease of contextual scaling. In order from “least” to “most”, we might have:
- Counting based approaches. Number of examples required is generally exponential in the number of features.
- Counting based approaches with smoothing. Still exponential, but with saner defaults.
- Counting based approaches with smoothing and some prior language (graphical models, bayes nets, etc…). Number of examples required is no longer exponential, but can still be intractably large. Prior specification from a human is required.
- Prior based systems (Many Bayesian Learning algorithms). No particular number of examples required, but sane prior specification from a human may be required.
- Similarity based systems (nearest neighbor, kernel based algorithms). A similarity measure is a weaker form of prior information which can be substantially easier to specify.
- Highly automated approaches. “Just throw the new information as a feature and let the learning algorithms sort it out”.
At each step in this order, less effort is required to integrate new information.
Designing a learning algorithm which can be useful in many different contextual scales is fundamentally challenging. Obviously, when specific prior information is available, we want to incorporate it. Equally obviously, when specific prior information is not available, we want to be able to take advantage of new information which happens to be easily useful. When we have so much information that counting could work, a learning algorithm should behave similar to counting (with smoothing).
Your problem reminds me of modeling with interactions. For example, majority vote corresponds to modeling P(Y), first context to modeling P(Y|X1), second context to P(Y|X1,X2), and so on. Any kind of “automation” is about making assumptions about the parametric form of the conditional probability distribution. Logistic regression (largely synonymous with CRF), for example, makes P(Y|X) simple in the sense that the effect of one variable cannot depend on the value of another variable. In some belief nets they used probability estimation trees or rules to “smooth” the conditional probability tables, although Dirichlet priors are most popular.