Computation – Machine Learning (Theory)

9/18/2009

Necessary and Sufficient Research

Researchers are typically confronted with big problems that they have no idea how to solve. In trying to come up with a solution, a natural approach is to decompose the big problem into a set of subproblems whose solution yields a solution to the larger problem. This approach can go wrong in several ways.

Decomposition failure. The solution to the decomposition does not in fact yield a solution to the overall problem.
Artificial hardness. The subproblems created are sufficient if solved to solve the overall problem, but they are harder than necessary.

As you can see, computational complexity forms a relatively new (in research-history) razor by which to judge an approach sufficient but not necessary.

In my experience, the artificial hardness problem is very common. Many researchers abdicate the responsibility of choosing a problem to work on to other people. This process starts very naturally as a graduate student, when an incoming student might have relatively little idea about how to do research, so they naturally abdicate the problem choice to an advisor. As an inexperienced graduate student, it’s difficult to avoid this, because an advisor often really does know much better about what is easy, what is hard, and how to decompose a complex problem into solvable subproblems. Nevertheless, if your plan in life is to do substantial research, it’s essential even then to question research directions carefully.

In contrast to sufficient subgoals of a greater goal, there are also necessary subgoals. A necessary subgoal is one which must be solved to solve the greater goal. One of the reasons why the artificial hardness problem is so common is that the sufficient subgoals are commonly confused with necessary subgoals. The essential test for a necessary subgoal is whether or not a solution to the global problem can be used as a solution to the subgoal.

My personal greater goal is creating a master machine learning algorithm that can solve any reasonable learning problem where “reasonable” includes at least the set that humans can solve. Relative to this greater goal, many existing research programs do not appear necessary.

The short form of my second comment on Marcus’s post is that I see the sufficiency but not the necessity of competing with all Turing machines.
The necessity of several statistical modeling approaches appears unclear to me, because they often encounter severe computational problems. Perhaps this is an example of creating an artificially hard problem, as empirical experiences with Searn suggest.

What is necessary?

Large data. It is clear that humans solving learning problems have access to large amounts of information which they employ to solve these problems. While we don’t stick children into a sensory deprivation tank to see how much it retards their ability to solve problems when grown, some experiments along these lines have been done with animals yielding obvious ability deficiency.
Online learning. The ability to learn in an online environment with relatively little processing per bit of input is clearly a sufficient approach to solve many problems. We can also argue that it is a computational necessity, as retraining based upon all past information appears computationally infeasible, or at least deeply wasteful.
Interactive learning. It’s clear that many animals use an interactive process to learn about the world. This form of learning is also necessary, because it provides the ability to answer unanticipated questions. We can further argue the necessity by pointing out that interactive proofs appear much more powerful in computational complexity theory than noninteractive proofs. For example, viewed from a learning perspective, much of science is about interactive learning.
Compositional Design of a learning system. The necessity of compositional design in machine learning is not entirely clear. For example, we could imagine that it’s possible to design good learning systems using an evolutionary approach. Nevertheless, since our basic goal in research is a much more efficient and faster design, it seems that the decision to take a research-based approach implies that compositional design is necessary. Restated: we should be able to design the system to learn in components which compose to form an overall solution.
Large contexts. It’s necessary that a learning algorithm be able to use a relatively large number of bits when making a decision. For example, people working on vision have cool examples where people manage to use many different cues to predict what an object is.
Nonlinearity. People can clearly solve learning problems for which no linear representation (of input information) is capable of achieving good performance.

Some of these are criticizable as perhaps unnecessary, and I can easily imagine missing others. If you have other arguments for what is or is not necessary for this greater goal, please speak up.

There are two other categories of subgoal research we could consider. There are subgoals which are necessary and sufficient (in combination) to solve the greater goal. I don’t know of any such arguments for my greater goal.

The fourth category is subgoals which are neither necessary nor sufficient for a greater goal. In my experience such papers are quite common at conferences with types that include:

Work on speeding up a slow algorithm leaving it slower than the state of the art,
Otherwise improving an algorithm which is suboptimal while leaving it suboptimal.

The nitty-gritty of these questions come at review time. Which papers should be accepted? In general, decision making is pretty terrible because greater goals are rarely stated, perhaps as a form of strategic ambiguity. After all, the set of people attending a conference have multiple greater goals. Nevertheless, my personal ordering is:

Necessary and sufficient research directions. An emptyset in my experience.
Necessary research directions. This is typically a small fraction.
Sufficient research directions. This is a greater fraction.
Neither. This is often the majority.
Wrong. Must be rejected.

So the question to periodically ask yourself as a researcher is: What is the greater goal? Is this subgoal necessary? Sufficient? Something else?

3/18/2009

Parallel ML primitives

Previously, we discussed parallel machine learning a bit. As parallel ML is rather difficult, I’d like to describe my thinking at the moment, and ask for advice from the rest of the world. This is particularly relevant right now, as I’m attending a workshop tomorrow on parallel ML.

Parallelizing slow algorithms seems uncompelling. Parallelizing many algorithms also seems uncompelling, because the effort required to parallelize is substantial. This leaves the question: Which one fast algorithm is the best to parallelize? What is a substantially different second?

One compellingly fast simple algorithm is online gradient descent on a linear representation. This is the core of Leon’s sgd code and Vowpal Wabbit. Antoine Bordes showed a variant was competitive in the large scale learning challenge. It’s also a decades old primitive which has been reused in many algorithms, and continues to be reused. It also applies to online learning rather than just online optimization, implying the algorithm can be used in a host of situations where batch algorithms are awkward or unviable.

If we start with a fast learning algorithm as a baseline, there seem to be two directions we can go with parallel ML:

(easier) Try to do more in the same amount of time. For example, Paul and Neil suggest using parallelism to support ensemble methods.
(harder) Try to use parallelism to reduce the amount of time required to effectively learn on large amounts of data. For this approach, bandwidth and latency become uncomfortably critical implementation details. Due to these issues, it appears important to at least loosen the goal to competing with learning on large amounts of data. Alternatively, we could consider this as evidence some other learning primitive is desirable, although I’m not sure which one.

2/4/2009

Optimal Proxy Loss for Classification

Many people in machine learning take advantage of the notion of a proxy loss: A loss function which is much easier to optimize computationally than the loss function imposed by the world. A canonical example is when we want to learn a weight vector w and predict according to a dot product f_w(x)= sum_i w_ix_i
where optimizing squared loss (y-f_w(x))² over many samples is much more tractable than optimizing 0-1 loss I(y = Threshold(f_w(x) – 0.5)).

While the computational advantages of optimizing a proxy loss are substantial, we are curious: which proxy loss is best? The answer of course depends on what the real loss imposed by the world is. For 0-1 loss classification, there are adherents to many choices:

Log loss. If we confine the prediction to [0,1], we can treat it as a predicted probability that the label is 1, and measure loss according to log 1/p'(y|x) where p'(y|x) is the predicted probability of the observed label. A standard method for confining the prediction to [0,1] is logistic regression which exponentiates the dot product and normalizes.
Squared loss. The squared loss approach (discussed above) is also quite common. It shares the same “proper scoring rule” semantics as log loss: the optimal representation-independent predictor is the conditional probability of the label y given the features x.
Hinge loss. For hinge loss, you optimize max(0, 1- 4 (y – 0.5) (f_w(x) – 0.5) ). The form of hinge loss is slightly unfamiliar, because the label is {0,1} rather than {-1,1}. The optimal prediction for hinge loss is not the probability of y given x but rather some value which is at least 1 if the most likely label is 1 and 0 or smaller if the most likely label is 0. Hinge loss was popularized with support vector machines. Hinge loss is not a proper scoring rule for mean, but since it does get the sign right, using it for classification is reasonable.

Many people have made qualitative arguments about why one loss is better than another. For example see Yaroslav’s old post for an argument about the comparison of log loss and hinge loss and why hinge loss might be better. In the following, I make an elementary quantitative argument.

Log loss is qualitatively dissimilar from the other two, because it is unbounded on the range of interest. Restated, there is no reason other than representational convenience that f_w(x) needs to take a value outside of the interval [0,1] for squared loss or hinge loss. In fact, we can freely reduce these losses by considering instead the function f_w‘(x) = max(0,min(1,f_w(x))). The implication is that optimization of log loss can be unstable in ways that optimization of these other losses is not. This can be stated precisely by noting that sample complexity bounds (simple ones here) for 0-1 loss hold for f_w‘(x) under squared or hinge loss, but the same theorem statement does not hold for log loss without additional assumptions. Since stability and convergence are of substantial interest in machine learning, this suggests not using log loss.

For further analysis, we must first define some function converting f_w(x) into label predictions. The only reasonable approach is to threshold at 0.5. For log loss and squared loss, any other threshold is inconsistent. Since the optimal predictor for hinge loss always takes value 0 or 1, there is some freedom in how we convert, but a reasonable approach is to also threshold at 0.5.

Now, we want to analyze the stability of predictions. In other words, if an adversary picks the true conditional probability distribution p(y|x) and the prediction f_w‘(x), how does the proxy loss of f_w‘(x) bound the 0-1 loss? Since we imagine that the conditional distribution is noisy, it’s important to actually consider a regret: how well we do minus the loss of the best possible predictor.

For each of these losses, an optimal strategy of the adversary is to have p(y|x) take value 0.5 – eps and f_w‘(x) = 0.5. The 0-1 regret induced is simply 2 eps, since the best possible predictor has error rate 0.5 – eps while the actual predictor has error rate 0.5 + eps. For hinge loss, the regret is eps and for squared loss the regret is eps². Doing some algebra, this implies that 2 hinge_regret bounds 0-1 regret while 2 squared_regret^0.5 bounds 0-1 regret. Since we are only interested in regrets less than 1, the square root is undesirable, and hinge loss is preferred, because a stronger convergence of squared loss is needed to achieve the same guarantee on 0-1 loss.

Can we improve on hinge loss? I don’t know any proxy loss which is quantitatively better, but generalizations exist. The regret of hinge loss is the same as for absolute value loss |y-f_w‘(x)| since they are identical for 0,1 labels. One advantage of absolute value loss is that it has a known and sometimes useful semantics for values between 0 and 1: the optimal prediction is the median. This makes the work on quantile regression (Two Three) seem particularly relevant for machine learning.

1/28/2008

Sufficient Computation

Do we have computer hardware sufficient for AI? This question is difficult to answer, but here’s a try:

One way to achieve AI is by simulating a human brain. A human brain has about 10¹⁵ synapses which operate at about 10² per second implying about 10¹⁷ bit ops per second.

A modern computer runs at 10⁹ cycles/second and operates on 10² bits per cycle implying 10¹¹ bits processed per second.

The gap here is only 6 orders of magnitude, which can be plausibly surpassed via cluster machines. For example, the BlueGene/L operates 10⁵ nodes (one order of magnitude short). It’s peak recorded performance is about 0.5*10¹⁵ FLOPS which translates to about 10¹⁶ bit ops per second, which is nearly 10¹⁷.

There are many criticisms (both positive and negative) for this argument.

Simulation of a human brain might require substantially more detail. Perhaps an additional 10² is required per neuron.
We may not need to simulate a human brain to achieve AI. There are certainly many examples where we have been able to design systems that work much better than evolved systems.
The internet can be viewed as a supercluster with 10⁹ or so CPUs, easily satisfying the computational requirements.
Satisfying the computational requirement is not enough—bandwidth and latency requirements must also be satisfied.

These sorts of order-of-magnitude calculations appear sloppy, but they work out a remarkable number of times when tested elsewhere. I wouldn’t be surprised to see it work out here.

Even with sufficient harrdware, we are missing a vital ingredient: knowing how to do things.

1/25/20081/26/2008

Turing’s Club for Machine Learning

Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions:

How does the learning algorithm scale with the number of examples m? Any algorithm using all of the data is at least O(m), but in many cases this is O(m²) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled.
The above question can also be asked for test cases. In some applications, test-time performance is of great importance.
How does the algorithm scale with the number of features n per example? Many second order gradient descent algorithms are O(n²) or worse which becomes unacceptable as the number of parameters grows. Nonsparse algorithms applied to sparse datasets have an undefined dependence, which is typically terrible.
What are the memory requirements of the learning algorithm? Something linear in the number of features (or less) is nice. Nearest neighbor and kernel methods can be problematic, because the memory requirement is uncontrolled.

One unfortunate aspect of big-O notation is that it doesn’t give an intuitive good sense of the scale of problems solvable by a machine. A simple trick is to pick a scale, and ask what size problem can be solved given the big-O dependence. For various reasons (memory size, number of web pages, FLOPS of a modern machine), a scale of 10¹⁰ is currently appropriate. Computing scales, you get:

O(m)	O(m log(m))	O(m²)	O(m³)	O(e^m)
10¹⁰	510⁸*	10⁵	210³*	25

There is good reason to stick with big-O notation over the long term, because the scale of problems we tackle keeps growing. Having a good understanding of the implied scale remains very handy for understanding the practicality of algorithms for problems.

There are various depths to which we can care about computation. The Turing’s Razor application would be “a learning algorithm isn’t interesting unless it runs in time linear in the number of bytes input”. This isn’t crazy—for people with a primary interest in large scale learning (where you explicitly have large datasets) or AI (where any effective system must scale to very large amounts of experience), a O(mn log(mn)) or better dependence is the target.

For someone deeply steeped in computer science algorithms and complexity thoery, the application is: “a learning algorithm isn’t interesting unless it has a polynomial dependence on the number of bytes input”. This is mismatched for machine learning. It’s too crude because O(m^9) algorithms are interesting to basically no one. It’s too fine because (a) there are a number of problems of interest with only a small amount of data where algorithms with unquantifiable computation may be of interest (think of Bayesian integration) and (b) some problems simply have no solution yet, so the existence of a solution (which is not necessarily efficient) is of substantial interest.

The right degree of care about computation I’ll call “Turing’s club”. Computation is a primary but not overriding concern. Every algorithm should be accompanied by some statement about it’s computational and space costs. Algorithms in the “no known computational bound” category are of interest if they accomplish something never before done, but are otherwise of little interest. Algorithms with controlled guarantees on computational requirements are strongly preferred. Linear time algorithms are strongly preferred. Restated: there are often many algorithms capable of solving a particular problem reasonably well so fast algorithms with controlled resource guarantees distinguish themselves by requiring less TLC to make them work well.