Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions:
- How does the learning algorithm scale with the number of examples m? Any algorithm using all of the data is at least O(m), but in many cases this is O(m^2) (naive nearest neighbor for self-prediction; see the sketch after this list) or unknown (k-means and many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent, or simply that the amount of computation isn’t controlled.
- The above question can also be asked for test cases. In some applications, test-time performance is of great importance.
- How does the algorithm scale with the number of features n per example? Many second-order gradient descent algorithms are O(n^2) or worse, which becomes unacceptable as the number of parameters grows. Nonsparse algorithms applied to sparse datasets have an undefined dependence, which is typically terrible.
- What are the memory requirements of the learning algorithm? Something linear in the number of features (or less) is nice. Nearest neighbor and kernel methods can be problematic, because the memory requirement is uncontrolled.
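To make the O(m^2) case above concrete, here’s a minimal sketch (my own illustration, not part of any particular system) of naive nearest neighbor self-prediction: every example is compared against every other example, so the number of distance computations grows as m^2 and the entire dataset must be held in memory.

```python
import math

def nearest_neighbor_self_prediction(X, y):
    """Leave-one-out 1-NN prediction: O(m^2) distance computations,
    with the whole dataset kept in memory (O(m*n))."""
    m = len(X)
    predictions = []
    for i in range(m):                        # outer loop over m examples
        best_dist, best_label = float("inf"), None
        for j in range(m):                    # inner loop over m examples
            if i == j:
                continue
            d = math.dist(X[i], X[j])         # O(n) per distance
            if d < best_dist:
                best_dist, best_label = d, y[j]
        predictions.append(best_label)
    return predictions

# toy usage
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
y = [0, 0, 1]
print(nearest_neighbor_self_prediction(X, y))   # [0, 0, 0]
```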
One unfortunate aspect of big-O notation is that it doesn’t give a good intuitive sense of the scale of problems solvable by a machine. A simple trick is to pick a scale and ask what size of problem can be solved given the big-O dependence. For various reasons (memory size, number of web pages, FLOPS of a modern machine), a scale of 10^10 is currently appropriate. Computing the scales, you get:
| O(m) | O(m log(m)) | O(m^2) | O(m^3) | O(e^m) |
|------|-------------|--------|--------|--------|
| 10^10 | 5*10^8 | 10^5 | 2*10^3 | 25 |
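If you want to recheck these numbers, a few lines of code suffice. Here’s a rough sketch (the exact answers depend on constant factors that big-O ignores, so expect only approximate agreement with the rounded entries above):

```python
import math

BUDGET = 1e10  # rough scale: memory size, number of web pages, FLOPS of a modern machine

def max_m(cost, hi=10**12):
    """Largest integer m with cost(m) <= BUDGET, found by binary search."""
    lo = 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if cost(mid) <= BUDGET:
            lo = mid
        else:
            hi = mid - 1
    return lo

costs = {
    "O(m)":        lambda m: m,
    "O(m log m)":  lambda m: m * math.log(m),
    "O(m^2)":      lambda m: m**2,
    "O(m^3)":      lambda m: m**3,
    "O(e^m)":      lambda m: math.exp(min(m, 700)),  # cap the exponent to avoid float overflow
}

for name, cost in costs.items():
    print(f"{name:12s} m ~ {max_m(cost):,}")
```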
There is good reason to stick with big-O notation over the long term, because the scale of problems we tackle keeps growing. Having a good sense of the implied scale remains very handy for judging the practicality of algorithms on real problems.
There are various depths to which we can care about computation. The Turing’s Razor application would be “a learning algorithm isn’t interesting unless it runs in time linear in the number of bytes input”. This isn’t crazy: for people with a primary interest in large scale learning (where you explicitly have large datasets) or AI (where any effective system must scale to very large amounts of experience), an O(mn log(mn)) or better dependence is the target.
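For concreteness, here’s a hypothetical sketch of a learner meeting that target: a single pass of stochastic gradient descent over sparse examples, whose running time is proportional to the total number of nonzero features (roughly the number of input bytes) and whose memory is linear in n.

```python
import math
from typing import Dict, Iterable, List, Tuple

def sgd_pass(examples: Iterable[Tuple[Dict[int, float], int]],
             n_features: int, lr: float = 0.1) -> List[float]:
    """One pass of logistic-regression SGD over sparse examples.

    Time is proportional to the total number of nonzero features
    (roughly the number of input bytes); memory is O(n) for the weights.
    """
    w = [0.0] * n_features                             # O(n) memory
    for x, y in examples:                              # single pass over m examples
        margin = sum(w[i] * v for i, v in x.items())   # O(nnz) per example
        p = 1.0 / (1.0 + math.exp(-margin))            # predicted probability
        g = p - y                                      # log-loss gradient w.r.t. the margin
        for i, v in x.items():                         # O(nnz) update
            w[i] -= lr * g * v
    return w

# toy usage: two sparse examples over 4 features
data = [({0: 1.0, 3: 2.0}, 1), ({1: 1.0, 3: -1.0}, 0)]
weights = sgd_pass(data, n_features=4)
```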
For someone deeply steeped in computer science algorithms and complexity theory, the application is: “a learning algorithm isn’t interesting unless it has a polynomial dependence on the number of bytes input”. This is mismatched for machine learning. It’s too crude because O(m^9) algorithms are interesting to basically no one. It’s too fine because (a) there are a number of problems of interest with only a small amount of data, where algorithms with unquantifiable computation may be of interest (think of Bayesian integration), and (b) some problems simply have no solution yet, so the existence of a solution (which is not necessarily efficient) is of substantial interest.
The right degree of care about computation I’ll call “Turing’s club”. Computation is a primary but not overriding concern. Every algorithm should be accompanied by some statement about its computational and space costs. Algorithms in the “no known computational bound” category are of interest if they accomplish something never before done, but are otherwise of little interest. Algorithms with controlled guarantees on computational requirements are strongly preferred. Linear time algorithms are strongly preferred. Restated: there are often many algorithms capable of solving a particular problem reasonably well, so fast algorithms with controlled resource guarantees distinguish themselves by requiring less TLC to make them work well.