Explicit Randomization in Learning Algorithms

There are a number of learning algorithms which explicitly incorporate randomness into their execution. These include, amongst others:

  1. Neural Networks. Neural networks use randomization to assign initial weights.
  2. Boltzmann Machines/Deep Belief Networks. Boltzmann machines are something like a stochastic version of multinode logistic regression. The use of randomness is more essential in Boltzmann machines, because the predicted value at test time also uses randomness.
  3. Bagging. Bagging is a process where a learning algorithm is run several times on different resampled versions of a dataset, creating a final predictor which predicts by majority vote.
  4. Policy descent. Several algorithms in reinforcement learning such as Conservative Policy Iteration use random bits to create stochastic policies.
  5. Experts algorithms. Randomized weighted majority uses random bits as part of the prediction process to achieve better theoretical guarantees (see the sketch following this list).
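
To make the last item concrete, here is a minimal sketch of randomized weighted majority in Python. The function name and interface are our own invention, not taken from any particular library.

```python
import random

def randomized_weighted_majority(expert_predictions, outcomes, beta=0.5, seed=0):
    """Count the mistakes of a randomized weighted majority predictor.

    expert_predictions: for each round, a list of every expert's 0/1 prediction.
    outcomes: the true 0/1 label for each round.
    """
    rng = random.Random(seed)
    weights = [1.0] * len(expert_predictions[0])
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        # The random bits: sample an expert with probability proportional
        # to its weight and use its prediction.
        chosen = rng.choices(range(len(weights)), weights=weights)[0]
        mistakes += preds[chosen] != outcome
        # Multiplicatively penalize every expert that erred this round.
        for i, p in enumerate(preds):
            if p != outcome:
                weights[i] *= beta
    return mistakes

# Three experts over four rounds; expert 0 happens to be always correct.
preds = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
print(randomized_weighted_majority(preds, [0, 1, 0, 1]))
```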

A basic question is: “Should there be explicit randomization in learning algorithms?” It seems perverse to feed extra random bits into your prediction process, since they don’t contain any information about the problem itself. Can we avoid using random numbers? This question is not just philosophy: we might hope that deterministic versions of learning algorithms are both more accurate and faster.

There seem to be several distinct uses for randomization.

  1. Symmetry breaking. In the case of a neural network, if every weight started at 0, the gradient of the loss with respect to every weight would be the same, implying that after updating, all weights remain the same. Using random numbers to initialize weights breaks this symmetry (a numerical sketch follows this list). It is easy to believe that there are good deterministic methods for symmetry breaking.
  2. Overfit avoidance. A basic observation is that deterministic learning algorithms tend to overfit. Bagging avoids this by randomizing the input of these learning algorithms in the hope that directions of overfit for individual predictions cancel out. Similarly, using random bits internally as in a deep belief network avoids overfitting by forcing the algorithm to learn a robust-to-noise set of internal weights, which are then also robust to overfitting. Large margin learning algorithms and maximum entropy learning algorithms can be understood as deterministic operations attempting to achieve the same goal. A significant gap remains between randomized and deterministic learning algorithms: the deterministic versions deal only with linear predictions, while the randomized techniques seem to yield improvements in general.
  3. Continuizing. In reinforcement learning, it’s hard to optimize a policy over multiple timesteps because the optimal decision at timestep 2 depends on the decision at timestep 1 and vice versa. Randomized interpolation of policies offers a method to remove this cyclic dependency. PSDP can be understood as a derandomization of CPI, one which pays for the removal of randomness with increased computation (learning a new predictor for each timestep individually). Whether or not such a tradeoff is avoidable in general is unclear.
  4. Adversary defeating. Some algorithms, such as randomized weighted majority are designed to work against adversaries who know your algorithm, except for random bits. The randomization here is provably essential, but the setting is often far more adversarial than the real world.
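
Item 1 above is easy to demonstrate numerically. Below is a minimal sketch using a tiny one-hidden-layer network of our own invention (tanh hidden units, output weights fixed at 1): with an all-zero initialization every hidden unit receives an identical gradient column, while a random initialization breaks the tie.

```python
import numpy as np

def hidden_gradients(W):
    """Gradient of a squared loss with respect to the input-to-hidden
    weights W of a one-hidden-layer tanh network with output weights 1."""
    x = np.array([1.0, -2.0, 0.5])   # a single example input
    y = 1.0                          # its target value
    h = np.tanh(x @ W)               # hidden activations
    pred = h.sum()                   # output layer: fixed weights of 1
    # Backpropagation for the loss L = (pred - y)^2 / 2.
    return np.outer(x, (pred - y) * (1 - h ** 2))

# All-zero initialization: every column (hidden unit) gets the same gradient,
# so the units remain interchangeable after any number of updates.
print(hidden_gradients(np.zeros((3, 4))))

# Random initialization: the columns differ, and the units can specialize.
# The 0.01 scale is an illustrative choice, not a prescription.
rng = np.random.default_rng(0)
print(hidden_gradients(0.01 * rng.standard_normal((3, 4))))
```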

The current state-of-the-art is that random bits provide performance (computational and predictive) which we don’t know (or at least can’t prove we know) how to achieve without randomization. Can randomization be removed or is it essential to good learning algorithms?

Context and the calculation misperception

This post is really for people not in machine learning (or related fields). It is about a common misperception which affects people who have not thought about the process of trying to predict something. Hopefully, by precisely stating it, we can remove it.

Suppose we have a set of events, each described by a vector of features.

0 1 0 1 1
1 0 1 0 1
1 1 0 1 0
0 0 1 1 1
1 1 0 0 1
1 0 0 0 1
0 1 1 1 0

Suppose we want to predict the value of the first feature given the others. One approach is to bin the data by one feature. For the above example, we might partition the data according to feature 2, then observe that when feature 2 is 0 the label (feature 1) is mostly 1. On the other hand, when feature 2 is 1, the label is 1 exactly half the time, so we (arbitrarily) predict 0. Using this simple rule we get an observed error rate of 3/7.

There are two issues here. The first is that this is really a training error rate, and (hence) may be an overoptimistic prediction. This is not a very serious issue as long as there are a reasonable number of representative examples.

The second issue is more serious. A simple rule (if the number of 1’s among the other features is less than 3, predict 1, else 0) achieves error rate 0. By binning the data according to only one feature, the potential of achieving error rate 0 is removed.
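
Both error rates are easy to check mechanically. Here is a minimal sketch reproducing the 3/7 and 0 figures on the seven rows above; the function names are ours.

```python
# Each row: (label = feature 1, remaining features 2-5).
rows = [
    (0, [1, 0, 1, 1]),
    (1, [0, 1, 0, 1]),
    (1, [1, 0, 1, 0]),
    (0, [0, 1, 1, 1]),
    (1, [1, 0, 0, 1]),
    (1, [0, 0, 0, 1]),
    (0, [1, 1, 1, 0]),
]

def binned_rule(features):
    # Bin on feature 2 alone: predict 1 when it is 0, else predict 0.
    return 1 if features[0] == 0 else 0

def counting_rule(features):
    # Predict 1 when fewer than three of the other features are 1.
    return 1 if sum(features) < 3 else 0

for rule in (binned_rule, counting_rule):
    errors = sum(rule(f) != label for label, f in rows)
    print(rule.__name__, f"error rate = {errors}/{len(rows)}")
# binned_rule error rate = 3/7
# counting_rule error rate = 0/7
```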

The reason for binning is often definitional. Many people think of probability as an observed (or observable) rate. For these people, the probabilities of events can only be learned by finding a large number of identical events and then calculating the observed rate. Constructing “identical events” always involves throwing away the unique context of the event. This disposal of information eliminates the possibility of good prediction performance.

The solution to this problem is education. There are other definitions of probability which are more appropriate when every event is unique. One thing which makes people uncomfortable about probabilities over unique events is that probabilities are no longer observable; they can only be estimated. This loss of grounding is a price which must be paid for improved performance. Luckily, we can tell if our prediction performance improves on labeled examples.

Data Linkage Problems

Data linkage is a problem which seems to come up in many applied machine learning settings. I have heard it mentioned in various data mining contexts, but it seems relatively understudied for systemic reasons.

A very simple version of the data linkage problem is a cross hospital patient record merge. Suppose a patient (John Doe) is admitted to a hospital (General Health), treated, and released. Later, John Doe is admitted to a second hospital (Health General), treated, and released. Given a large number of records of this sort, it becomes very tempting to try and predict the outcomes of treatments. This is reasonably straightforward as a machine learning problem if there is a shared unique identifier for John Doe used by General Health and Health General along with time stamps. We can merge the records and create examples of the form “Given symptoms and treatment, did the patient come back to a hospital within the next year?” These examples could be fed into a learning algorithm, and we could attempt to predict whether a return occurs.

The problem is that General Health and Health General don’t have any shared unique identifier for John Doe. Information is often misspelled (name misspellings are very common), mistyped, changed (people move), or simply not unique (how many people were born on your birthday?).

Although this is just one example, data linkage problems seem to be endemic to learning applications. There seem to be several solutions:

  1. Improved recording. Sometimes minor changes to what information is recorded can strongly disambiguate. For example, there is a big difference between recording the pages visited at a website versus tracking the sequence of pages visited. The essential thing to think about when designing the information to record is: How will I track the consequences of decisions?
  2. Two-stage learning. First predict which records should be linked, based upon a smaller dataset that is hand-checked. Then use the learned predictor to do the linkage, and finally solve the real prediction problem (see the sketch following this list). There are several pitfalls here.
    1. Rarity problems. Links are typically much more rare than nonlinks. The training process needs to take this into account by properly representing the scarcity of links.
    2. Information interfaces. A hard prediction of “link” or “no link” is too coarse an information source in an inherently noisy environment. Instead, a probability of link may need to be estimated.
    3. Two-stage estimation. A common approach to improving performance is turning a double approximation (given x predict y, given y predict z) into a single approximation (given x predict z). Achieving a single approximation here is tricky because we have ancillary information about the intermediate prediction.
  3. Customized algorithms. The Bayesian approach of “specify a prior, then use Bayes law to get a posterior, then predict with the posterior” is attractive here because we often have strong prior beliefs about at least the linkage portion of the problem.
  4. Others?
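
As a rough illustration of the two-stage approach, here is a minimal sketch using scikit-learn's logistic regression. The record fields, similarity features, and hand-checked pairs are all invented for the example; a real system would need far richer features and data.

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def pair_features(rec_a, rec_b):
    # Similarity features for a pair of records (hypothetical fields).
    return [
        SequenceMatcher(None, rec_a["name"], rec_b["name"]).ratio(),
        1.0 if rec_a["birthdate"] == rec_b["birthdate"] else 0.0,
    ]

# Stage 1: fit a link predictor on a small hand-checked set of pairs.
# class_weight="balanced" crudely accounts for links being much rarer
# than nonlinks in realistic data.
pairs = [
    ({"name": "John Doe", "birthdate": "1970-01-01"},
     {"name": "Jon Doe", "birthdate": "1970-01-01"}, 1),
    ({"name": "J. Doe", "birthdate": "1970-01-01"},
     {"name": "John Doe", "birthdate": "1970-01-01"}, 1),
    ({"name": "John Doe", "birthdate": "1970-01-01"},
     {"name": "Jane Roe", "birthdate": "1985-06-30"}, 0),
    ({"name": "Mary Poe", "birthdate": "1962-11-09"},
     {"name": "John Doe", "birthdate": "1970-01-01"}, 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
model = LogisticRegression(class_weight="balanced").fit(X, y)

# Stage 2: emit a link *probability* rather than a hard link/no-link call,
# so the downstream prediction problem sees a richer information source.
a = {"name": "Jhon Doe", "birthdate": "1970-01-01"}
b = {"name": "John Doe", "birthdate": "1970-01-01"}
print(model.predict_proba([pair_features(a, b)])[0, 1])
```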

The data linkage problem also makes very clear the tension between privacy and machine learning. For example, being able to cross-index hospital cases might yield a large jump in our ability to predict outcomes, which might suggest improved treatments (it is only a weak suggestion that must be verified; we must be very careful about applying a predictor to an input distribution it did not learn with respect to). And yet, linking records can result in unexpectedly large pools of information on individuals. Furthermore, explicitly sensitive information (like credit card numbers) might easily be the most useful bit of information for linkage.

2006 NIPS workshops

I expect the NIPS 2006 workshops to be quite interesting, and recommend attending to anyone interested in machine learning research. (Most or all of the workshop webpages can be found two links deep.)

Exemplar programming

There are many different abstractions for problem definition and solution. Here are a few examples:

  1. Functional programming: a set of functions is defined. The composed execution of these functions yields the solution.
  2. Linear programming: a set of constraints and a linear objective function are defined. An LP solver finds the constrained optimum.
  3. Quadratic programming: Like linear programming, but the language is a little more flexible (and the solution slower).
  4. Convex programming: like quadratic programming, but the language is more flexible (and the solutions even slower).
  5. Dynamic programming: a recursive definition of the problem is given and then solved efficiently via caching tricks (see the sketch following this list).
  6. SAT programming: a problem is specified as a satisfiability problem involving a conjunction of disjunctions of boolean variables. A general engine attempts to find a good satisfying assignment. For example, Kautz’s blackbox planner.
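
As a tiny instance of item 5, here is the classic recursive Fibonacci definition made efficient by caching subproblem results; the example is ours, not from the post.

```python
from functools import lru_cache

# Dynamic programming in miniature: a recursive problem definition plus a
# cache over subproblems turns exponential time into linear time.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025, computed with only 51 distinct subproblems
```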

These abstractions have different tradeoffs between ease of use, generality, and the effectiveness of existing solutions.

Machine learning can be thought of as exemplar programming. Exemplar programming is creating examples of (input, output) pairs which are used by an algorithm to predict a function from input to output (a concrete sketch follows the questions below). Basic questions about this abstraction are:

  1. How easy to use is it? An honest answer here is “not very”. There are several subtle issues related to overfitting, independence, and representativeness of the samples which take significant effort to describe to an unfamiliar person. Making this abstraction easier to use via careful language design is an area where effort may pay off.
  2. How effective are the exemplar programming solvers (aka learning algorithms)? There is huge variance with respect to the problem under consideration. For some problems, some learning algorithm is very effective, while for others you might as well have not tried.
  3. How general is exemplar programming? Very general. A great many problems can be phrased as “given this, I want that”.
  4. How effective has exemplar programming been? Increasingly effective.
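
To make the abstraction concrete, here is a minimal sketch in which the exemplars specify XOR and a decision tree plays the role of the solver; the choice of solver and exemplars is ours.

```python
from sklearn.tree import DecisionTreeClassifier

# Exemplars: (input, output) pairs saying "given this, I want that".
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR
X = [x for x, _ in examples]
y = [label for _, label in examples]

# The solver turns the exemplars into a function from input to output.
predict = DecisionTreeClassifier(random_state=0).fit(X, y).predict
print(predict([[0, 1]]))  # -> [1]
```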

Exemplar programming is a viewpoint of machine learning which (mostly) ignores statistics, prior information, and the possibility of overfitting. That’s a great deal to ignore, but there are gains as well.

  1. Exemplar programming creates a split between problem solution and problem formation. This is important because the problem solver can be heavily optimized (for speed, scalability, effectiveness on common problems, etc…), making the process of applying machine learning simply a matter of specifying the exemplars.
  2. The formation/solution split helps us focus on problem formation independent of solution. The big gains in machine learning in the last decade seem to have come from discovering how to apply it to new areas. A significant step in any application to a new area is discovering the right way to formulate the problem.

Exemplar programming seems to be a useful viewpoint for machine learning in “big data” problems with many examples where significant prior information is lacking. When either of these conditions is not met, other viewpoints/approaches to machine learning may be more successful.