Deep Learning 2012

2012 was a tumultuous year for me, but it was undeniably a great year for deep learning efforts. Signs of this include:

  1. Winning a Kaggle competition.
  2. Wide adoption of deep learning for speech recognition.
  3. Significant industry support.
  4. Gains in image recognition.

This is a rare event in research: a significant capability breakout. Congratulations are definitely in order for those who managed to achieve it. At this point, deep learning algorithms seem like a choice undeniably worth investigating for real applications with significant data.

Interesting Neural Network Papers at ICML 2011

Maybe it’s too early to call, but with four separate Neural Network sessions at this year’s ICML, it looks like Neural Networks are making a comeback. Here are my highlights of these sessions. In general, my feeling is that these papers both demystify deep learning and show its broader applicability.

The first observation I made is that the once disreputable “Neural” nomenclature is being used again in lieu of “deep learning”. Maybe it’s because Adam Coates et al. showed that single layer networks can work surprisingly well.

Another surprising result out of Andrew Ng’s group comes from Andrew Saxe et al. who show that certain convolutional pooling architectures can obtain close to state-of-the-art performance with random weights (that is, without actually learning).
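
To make the idea concrete, here is a toy sketch, with made-up data and sizes and not the architectures from the paper, of what “random weights” means here: convolve with fixed random filters, pool, and only ever train a linear classifier on top.

    # A toy Python sketch of the "random weights" idea (not the architectures
    # from the paper): convolve with fixed random filters, max-pool, and only
    # ever train a linear classifier on top. Data and sizes are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    n_filters, fsize = 8, 5
    filters = rng.normal(size=(n_filters, fsize, fsize))   # never trained

    def random_conv_pool_features(img):
        """Valid convolution with each random filter, then global max pooling."""
        H, W = img.shape
        feats = []
        for f in filters:
            responses = [
                np.sum(img[i:i + fsize, j:j + fsize] * f)
                for i in range(H - fsize + 1)
                for j in range(W - fsize + 1)
            ]
            feats.append(max(responses))                    # max pooling
        return np.array(feats)

    # Only a linear readout on top of these fixed random features would be
    # trained, e.g. logistic regression on the feature vectors of many images.
    img = rng.normal(size=(28, 28))
    print(random_conv_pool_features(img).shape)             # (8,)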

Of course, in most cases we do want to train these models eventually. There were two interesting papers on the topic of training neural networks. In the first, Quoc Le et al. show that a simple, off-the-shelf L-BFGS optimizer is often preferable to stochastic gradient descent.
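
To make the optimizer comparison concrete, here is a minimal sketch, with synthetic data, arbitrary hyperparameters, and plainly not the experimental setup from the paper, of handing the same logistic regression objective to an off-the-shelf L-BFGS routine from SciPy and to a hand-rolled SGD loop.

    # A minimal sketch (not the setup from the paper): fit a tiny logistic
    # regression with SciPy's off-the-shelf L-BFGS, then with plain SGD.
    # All data and hyperparameters are made up for illustration.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))                     # synthetic features
    true_w = rng.normal(size=10)
    y = (X @ true_w + 0.1 * rng.normal(size=200) > 0).astype(float)

    def loss_and_grad(w):
        """Mean logistic loss and its gradient for weight vector w."""
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        grad = X.T @ (p - y) / len(y)
        return loss, grad

    # L-BFGS: hand it the objective and gradient; it manages step sizes itself.
    result = minimize(loss_and_grad, np.zeros(10), jac=True, method="L-BFGS-B")
    print("L-BFGS loss:", result.fun)

    # Plain SGD on the same objective; the learning rate and number of passes
    # are arbitrary choices that would need tuning in practice.
    w = np.zeros(10)
    for epoch in range(50):
        for i in rng.permutation(len(y)):
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w)))
            w -= 0.1 * (p - y[i]) * X[i]
    print("SGD loss:", loss_and_grad(w)[0])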

Secondly, Martens and Sutskever from Geoff Hinton’s group show how to train recurrent neural networks for sequence tasks that exhibit very long-range dependencies.

It will be interesting to see whether this type of training will allow recurrent neural networks to outperform CRFs on some standard sequence tasks and data sets. It certainly seems possible since even with standard L-BFGS our recursive neural network (see previous post) can outperform CRF-type models on several challenging computer vision tasks such as semantic segmentation of scene images. This common vision task of labeling each pixel with an object class has not received much attention from the deep learning community.
Apart from the vision experiments, this paper further solidifies the trend that neural networks are being used more and more in natural language processing. In our case, the RNN-based model was used for structure prediction. Another neat example of this trend comes from Yann Dauphin et al. in Yoshua Bengio’s group. They present an interesting solution for learning with sparse bag-of-words representations.

Such sparse representations had previously been problematic for neural architectures.
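
As a generic illustration of why sparsity can be exploited rather than merely suffered (this is not the method from the Dauphin et al. paper), note that with a sparse bag-of-words input, the first layer’s activations depend only on the nonzero entries:

    # A generic illustration (not the method from the Dauphin et al. paper):
    # with a sparse bag-of-words input, the first dense layer only needs to
    # touch the nonzero entries, since W^T x is a weighted sum of the rows of
    # W corresponding to the active words.
    import numpy as np

    vocab_size, hidden_size = 50_000, 100
    rng = np.random.default_rng(0)
    W = 0.01 * rng.normal(size=(vocab_size, hidden_size))  # first-layer weights

    # A document as a sparse bag of words: {word index: count}.
    doc = {17: 2, 4203: 1, 31999: 3}

    # Dense view: build the full 50,000-dim vector and multiply (wasteful).
    x = np.zeros(vocab_size)
    for idx, count in doc.items():
        x[idx] = count
    h_dense = x @ W

    # Sparse view: sum only the rows for words that actually occur.
    h_sparse = sum(count * W[idx] for idx, count in doc.items())

    print(np.allclose(h_dense, h_sparse))  # True: same hidden activations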

In summary, these papers have helped us understand a bit better which “deep” or “neural” architectures work, why they work and how we should train them. Furthermore, the scope of problems that these architectures can handle has been widened to harder and more real-life problems.

Of the non-neural papers, two stood out for me:

Contextual Scaling

Machine learning has a new kind of “scaling to larger problems” to worry about: scaling with the amount of contextual information. The standard development path for a machine learning application in practice seems to be the following:

  1. Marginal. In the beginning, there was “majority vote”. At this stage, it isn’t necessary to understand that you have a prediction problem. People just realize that one answer is right sometimes and another answer other times. In machine learning terms, this corresponds to making a prediction without side information.
  2. First context. A clever person realizes that some bit of information x1 could be helpful. If x1 is discrete, they condition on it and make a predictor h(x1), typically by counting. If they are clever, then they also do some smoothing. If x1 is some real-valued parameter, it’s very common to make a threshold cutoff. Often, these tasks are simply done by hand (a toy sketch of these first stages appears after this list).
  3. Second context. Another clever person (or perhaps the same one) realizes that some other bit of information x2 could be helpful. As long as the space of (x1, x2) remains small and discrete, they continue to form a predictor by counting. When (x1, x2) are real-valued, the space remains visualizable, and so a hand-crafted decision boundary works fine.
  4. The previous step repeats for information x3,…,x100. It’s no longer possible to visualize the data, but a human can still function as a learning algorithm by carefully tweaking parameters and testing with the right software support to learn h(x1,…,x100). Graphical models can sometimes help scale up counting-based approaches. Overfitting becomes a very serious issue. The “human learning algorithm” approach starts breaking down, because it becomes hard to integrate new information sources in the context of all others.
  5. Automation. People realize “we must automate this process of including new information to keep up”, and a learning algorithm is adopted. The precise choice is dependent on the characteristics of the learning problem (How many examples are there at training time? Is this online or batch? How fast must it be at test time?) and the familiarity of the people involved. This can be a real breakthrough—automation can greatly ease the inclusion of new information, and sometimes it can even improve results given the limited original information.
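
Here is a toy sketch of the first two stages of this path, with entirely made-up data: a bare majority vote, then a counting-based predictor conditioned on a single discrete x1, with add-one smoothing.

    # A toy sketch of the early stages above, with entirely made-up data:
    # stage 1 is a majority vote, stage 2 conditions a counting-based
    # predictor on a single discrete feature x1 (with add-one smoothing).
    from collections import Counter, defaultdict

    # Made-up historical observations: (x1, outcome) pairs.
    history = [("mobile", 1), ("mobile", 1), ("mobile", 0),
               ("desktop", 0), ("desktop", 0), ("desktop", 1), ("desktop", 0)]

    # Stage 1: "majority vote" -- ignore all context.
    majority = Counter(y for _, y in history).most_common(1)[0][0]
    print("majority vote predicts:", majority)

    # Stage 2: condition on x1 by counting, with add-one smoothing so that
    # unseen values of x1 fall back toward an even guess.
    counts = defaultdict(Counter)
    for x1, y in history:
        counts[x1][y] += 1

    def h(x1):
        """Smoothed estimate of P(outcome = 1 | x1)."""
        c = counts[x1]
        return (c[1] + 1) / (c[1] + c[0] + 2)

    print("h('mobile') =", h("mobile"))
    print("h('tablet') =", h("tablet"))   # unseen value -> 0.5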

Understanding the process of contextual scaling seems particularly helpful for teaching about machine learning. It’s often the case that the switch to the last step could and should have happened before the 100th bit of information was integrated.

We can also judge learning algorithms according to their ease of contextual scaling. In order from “least” to “most”, we might have:

  1. Counting-based approaches. The number of examples required is generally exponential in the number of features, as the small illustration below suggests.
  2. Counting-based approaches with smoothing. Still exponential, but with saner defaults.
  3. Counting-based approaches with smoothing and some prior language (graphical models, Bayes nets, etc…). The number of examples required is no longer exponential, but can still be intractably large. Prior specification from a human is required.
  4. Prior-based systems (many Bayesian learning algorithms). No particular number of examples is required, but sane prior specification from a human may be required.
  5. Similarity-based systems (nearest neighbor, kernel-based algorithms). A similarity measure is a weaker form of prior information which can be substantially easier to specify.
  6. Highly automated approaches. “Just throw the new information as a feature and let the learning algorithms sort it out”.

At each step in this order, less effort is required to integrate new information.
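
As a back-of-the-envelope illustration of the gap between the first and last entries in the list: a pure counting table over d binary features has 2^d cells to fill with examples, while a linear model that simply takes the features as inputs has only about d parameters.

    # Back-of-the-envelope: counting table size vs. a linear model's parameter
    # count as the number of binary features d grows.
    for d in (10, 20, 30):
        print(f"d={d:2d}: counting table cells = {2**d:,}  vs  linear weights = {d + 1}")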

Designing a learning algorithm which can be useful at many different contextual scales is fundamentally challenging. Obviously, when specific prior information is available, we want to incorporate it. Equally obviously, when specific prior information is not available, we want to be able to take advantage of new information which happens to be easily useful. When we have so much information that counting could work, a learning algorithm should behave similarly to counting (with smoothing).

Data Linkage Problems

Data linkage is a problem that seems to come up in many applied machine learning settings. I have heard it mentioned in various data mining contexts, but it seems relatively understudied, for systemic reasons.

A very simple version of the data linkage problem is a cross hospital patient record merge. Suppose a patient (John Doe) is admitted to a hospital (General Health), treated, and released. Later, John Doe is admitted to a second hospital (Health General), treated, and released. Given a large number of records of this sort, it becomes very tempting to try and predict the outcomes of treatments. This is reasonably straightforward as a machine learning problem if there is a shared unique identifier for John Doe used by General Health and Health General along with time stamps. We can merge the records and create examples of the form “Given symptoms and treatment, did the patient come back to a hospital within the next year?” These examples could be fed into a learning algorithm, and we could attempt to predict whether a return occurs.
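
Here is a minimal sketch of that easy case, with fabricated toy records and a hypothetical shared patient_id; the label is just whether any later admission for the same patient falls within 365 days.

    # A minimal sketch of the easy case described above, with fabricated toy
    # records: both hospitals share a unique patient_id, so constructing
    # "did the patient return within a year?" labels is a simple join.
    from datetime import date, timedelta

    general_health = [
        {"patient_id": "JD-001", "admitted": date(2005, 3, 1),
         "symptoms": "chest pain", "treatment": "stent"},
    ]
    health_general = [
        {"patient_id": "JD-001", "admitted": date(2005, 9, 15),
         "symptoms": "chest pain", "treatment": "medication"},
    ]

    all_visits = sorted(general_health + health_general,
                        key=lambda r: (r["patient_id"], r["admitted"]))

    examples = []
    for i, visit in enumerate(all_visits):
        later = [v for v in all_visits[i + 1:]
                 if v["patient_id"] == visit["patient_id"]]
        returned = any(v["admitted"] - visit["admitted"] <= timedelta(days=365)
                       for v in later)
        examples.append(((visit["symptoms"], visit["treatment"]), returned))

    # [(('chest pain', 'stent'), True), (('chest pain', 'medication'), False)]
    print(examples)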

The problem is that General Health and Health General don’t have any shared unique identifier for John Doe. Information is often misspelled (name misspellings are very common), mistyped, changed (people move), and simply not unique (how many people were born on your birthday?).

Although this is just one example, data linkage problems seem to be endemic to learning applications. There seem to be several solutions:

  1. Improved recording. Sometimes minor changes to what information is recorded can strongly disambiguate. For example, there is a big difference between recording the pages visited at a website versus tracking the sequence of pages visited. The essential thing to think about when designing the information to record is: How will I track the consequences of decisions?
  2. Two-stage learning. First predict which records should be linked, based upon a smaller dataset that is hand-checked. Then use your learned predictor to do the linkage, and finally solve your real prediction problem. There are several pitfalls here (a minimal sketch of this two-stage approach appears after this list).
    1. Rarity problems. Links are typically much rarer than nonlinks. The training process needs to take this into account by properly representing the rarity of links relative to nonlinks.
    2. Information interfaces. A hard prediction of “link” or “no link” is too coarse an information source in an inherently noisy environment. Instead, a probability of link may need to be estimated.
    3. Two-stage estimation. A common approach to improving performance is turning a double approximation (given x predict y, given y predict z) into a single approximation (given x predict z). Achieving a single approximation here is tricky, because we have ancillary information about the intermediate prediction.
  3. Customized algorithms. The Bayesian approach of “specify a prior, then use Bayes law to get a posterior, then predict with the posterior” is attractive here because we often have strong prior beliefs about at least the linkage portion of the problem.
  4. Others?
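
Here is a minimal sketch of the two-stage idea in item 2, with fabricated record pairs; the similarity features and the choice of a logistic regression link predictor are illustrative assumptions rather than anything prescribed above.

    # A minimal sketch of the two-stage idea, with fabricated candidate pairs.
    # Features, field choices, and the classifier are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hand-checked training pairs: crude similarity features per candidate pair
    # (name similarity, same birthdate?, same zip code?) and a label saying
    # whether the pair really refers to the same person.
    X_pairs = np.array([
        [0.95, 1, 1],   # near-identical name, same birthdate, same zip -> link
        [0.90, 1, 0],
        [0.40, 0, 1],
        [0.30, 0, 0],
        [0.20, 1, 0],
        [0.10, 0, 0],
    ])
    y_link = np.array([1, 1, 0, 0, 0, 0])   # links are much rarer than nonlinks

    # Stage 1: train a link predictor; class_weight="balanced" is one way to
    # keep the rarity of links from swamping the learner.
    linker = LogisticRegression(class_weight="balanced").fit(X_pairs, y_link)

    # Stage 2 would use these *probabilities* of a link (rather than hard
    # link / no-link decisions) when assembling records for the real
    # outcome-prediction problem.
    new_pair = np.array([[0.85, 1, 1]])
    print("P(link) =", linker.predict_proba(new_pair)[0, 1])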

The data linkage problem also makes very clear the tension between privacy and machine learning. For example, being able to cross-index hospital cases might yield a large jump in our ability to predict outcomes, which might suggest improved treatments (it is only a weak suggestion that must be verified—we must be very careful about applying a predictor to an input distribution it did not learn with respect to). And yet, linking records can result in unexpectedly large pools of information on individuals. Furthermore, explicitly sensitive information (like credit card numbers) might easily be the most useful bit of information for linkage.