Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions:

- How does the learning algorithm scale with the number of examples
*m*? Any algorithm using all of the data is at least*O(m)*, but in many cases this is*O(m*(naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled.^{2}) - The above question can also be asked for test cases. In some applications, test-time performance is of great importance.
- How does the algorithm scale with the number of features
*n*per example? Many second order gradient descent algorithms are*O(n*or worse which becomes unacceptable as the number of parameters grows. Nonsparse algorithms applied to sparse datasets have an undefined dependence, which is typically terrible.^{2}) - What are the memory requirements of the learning algorithm? Something linear in the number of features (or less) is nice. Nearest neighbor and kernel methods can be problematic, because the memory requirement is uncontrolled.

One unfortunate aspect of big-O notation is that it doesn’t give an intuitive good sense of the scale of problems solvable by a machine. A simple trick is to pick a scale, and ask what size problem can be solved given the big-O dependence. For various reasons (memory size, number of web pages, FLOPS of a modern machine), a scale of *10 ^{10}* is currently appropriate. Computing scales, you get:

O(m) |
O(m log(m)) |
O(m^{2}) |
O(m^{3}) |
O(e^{m}) |

10^{10} |
5*10^{8} |
10^{5} |
2*10^{3} |
25 |

There is good reason to stick with big-O notation over the long term, because the scale of problems we tackle keeps growing. Having a good understanding of the implied scale remains very handy for understanding the practicality of algorithms for problems.

There are various depths to which we can care about computation. The Turing’s Razor application would be “a learning algorithm isn’t interesting unless it runs in time linear in the number of bytes input”. This isn’t crazy—for people with a primary interest in large scale learning (where you explicitly have large datasets) or AI (where any effective system must scale to very large amounts of experience), a *O(mn log(mn))* or better dependence is the target.

For someone deeply steeped in computer science algorithms and complexity thoery, the application is: “a learning algorithm isn’t interesting unless it has a polynomial dependence on the number of bytes input”. This is mismatched for machine learning. It’s too crude because *O(m^9)* algorithms are interesting to basically no one. It’s too fine because (a) there are a number of problems of interest with only a small amount of data where algorithms with unquantifiable computation may be of interest (think of Bayesian integration) and (b) some problems simply have no solution yet, so the existence of a solution (which is not necessarily efficient) is of substantial interest.

The right degree of care about computation I’ll call “Turing’s club”. Computation is a primary but not overriding concern. Every algorithm should be accompanied by some statement about it’s computational and space costs. Algorithms in the “no known computational bound” category are of interest if they accomplish something never before done, but are otherwise of little interest. Algorithms with controlled guarantees on computational requirements are strongly preferred. Linear time algorithms are strongly preferred. Restated: there are often many algorithms capable of solving a particular problem reasonably well so fast algorithms with controlled resource guarantees distinguish themselves by requiring less TLC to make them work well.

In some cases there is a lack of big-O analysis, in other cases there is confusion with big-Omega. In

> Any algorithm using all of the data is O(m) …

donÃ‚Â´t you mean Omega(m)?

Is it the case in ML research that people simply try to come up with an algorithm to do something, without any sort of care about the computational complexity, and leave it for future researchers to scale it down to a polynomial-time algorithm? I’m not in ML but it seems like if you, as a researcher, have to worry simultaneously about developing an algorithm AND getting it into a reasonable complexity level, it would be hard to get anything done — as opposed to having one genre of research in ML being “theoretical” (= unconcerned about complexity) and another “applied” (= taking other people’s algorithms and refining them).

This is a good point. I tweaked the wording above to say “at least O(m)” for which Omega(m) is what I mean.

The primary focus in algorithm development is on upper bounds (big-O), but lower bounds for classes of problems are of some interest as well.

@author

Was wondering if you can post a blog on the difference between Machine Learning and Pattern Recognition. I never got any good answer either on net or from my professors. It makes me feel that the difference is more political than in definition. Reading both the books made me feel that there isn’t a great difference between these two. Boundaries are very thin.

@Robert Talbert

It isn’t the case that in ML research people simply try to come up with an algorithm to do something. The development of any field is a top down approach rather than bottom up approach. First the philosophy is developed. If it is observed that it can describe good number of observations, a paper is published with a possible solution. And in the next stage a very efficient solution is proposed under the same philosophy. People who have theoretical computer science background find ML an applied field which isn’t true. In TCS almost all the philosophies are fairly defined, the only thing which is left is to solve them. Mind you it take a hell lot of time, energy and thinking to define something. It’s like defining a Turing Machine.

Machine learning is just one approach to pattern recognition. The other common approach is heuristic. For example, you might use hidden Markov models to recognize names of people in text, or you might use regular expressions.

Depending on how broadly you define “pattern”, all of machine learning may be pattern recognition. After all, we’re just talking about English words for broad areas of study here, not technical terms with precisely defined meanings like “non-Abelian group”.

Bringing this back to the topic of the blog post, what differentiates machine learning from statistics is twofold. First, there are many machine learning problems that are data-oriented but not probabilistic (e.g. support vector machines or perceptrons). Second, machine learning is only interesting insofar as it’s connected to computation.

Decent asymptotic worst case complexity is necessary for arbitrary scaling of arbitrary inputs, but not sufficient. The constant factors will get you in trouble (too many cases to even mention).

For applications, what matters is expected complexity and variance for “realistic” inputs.

There are many places where “in the limit” argumentation is used in machine learning.

I always find the following pattern amusing in papers: (1) define a model, (2) prove some theoretical properties about it in the limit (e.g. de Finetti’s theorem is a popular gambit if you’re dealing with mixtures), (3) prove some theoretical properties about it in the frictionless case (e.g. proofs for separable cases in linear classifiers, independent classifiers in co-learning or ensembles, or diagonal covariance in multivariate Gaussians) (4) note that the model as stated in 1 is intractable, (5) develop a bunch of heuristics for which the results in 2 and 3 don’t hold, (6) apply the heuristics to a toy problem, ideally one that’s fixed and widely studied, such as parsing the Penn treebank or recognizing digits in pixel representations, (7) claim both a theoretical and practical advance.

Even decent algorithms may come without any immediate analysis of their complexity class. Of course one should try to document as best as possible the computational complexity required, but sometimes this is just not possible.

To give you an example, after 60 years of intense research we still do not know whether the simplex method used in linear programming is a strongly polynomial algorithm (that is whether there exist a simplex pivoting rule such that the method becomes strongly polynomial). Yet, also since 60 years it is the method of choice and amazingly efficient.

Turing’s club! I love it… hope this catches on. A club is surely the ideal anti-razor.

I think an interesting and slightly poignant contrast can be made with the Kolmogorov complexity version of Occam’s razor applied to universal learning (eg as argued for by Hutter at the the Occam’s razor NIPS workshop in 2001).

Namely that (assuming one-way functions) algorithmic complexity razors turn into clubs (with unwieldy 2^n multiplicative factors) in exactly the polynomial time domain in which we’d be interested in applying them. My favorite recent references for this kind of effect are Hitchcock et al Dimension, Entropy Rates, and Compression and Entropy Rates and Finite-State Dimension.

[…] I would highly recommend that anyone interested in the semantic web and where things are headed read this post. It clarifies a number of issues that are commonly confused or skipped over. Highly recommended. […]

Am I to understand that it is not at all common for researchers in ML to figure out the computational complexity for their algorithms?

I’m no expert in ML, but as I enjoyed the related subjects while studying undergraduate CS, I have recently tried to use ML algorithms on financial data for my master’s thesis (in Economics) – using Weka. I was unpleasantly surprised to find that it was impossible to use most of the algorithms on a dataset with 20 features and 40.000 examples. The learning blows up either with an “out of memory” error or just keeps for hours without showing any kind of progress.

Could you give me any hints from experience what dataset sizes can be processed on an ordinary PC, maybe just for a few algorithms (or algorithm types)? Could you perhaps point me to a web resource or a research paper describing computational complexity for different ML algorithms?

Thank you in advance,

Darko