The Big O and Constants in Learning – Machine Learning (Theory)

The notation g(n) = O(f(n)) means that in the limit as n approaches infinity there exists a constant C such that the g(n) is less than Cf(n).

In learning theory, there are many statements about learning algorithms of the form “under assumptions x,y, and z, the classifier learned has an error rate of at most O(f(m))“.

There is one very good reason to use O(): it helps you understand the big picture and neglect the minor details which are not important in the big picture. However, there are some important reasons not to do this as well.

Unspeedup In algorithm analysis, the use of O() for time complexity is pervasive and well-justified. Determining the exact value of C is inherently computer architecture dependent. (The “C” for x86 processors might differ from the “C” on PowerPC processors.) Since many learning theorists come from a CS theory background, the O() notation is applied to generalization error. The O() abstraction breaks here—you can not generally change learning algorithm and decrease your error rate by some constant factor.
You’re fired When you go and run a learning algorithm to acquire some predictor, the performance it achieves is a key quantity of significant interest. Using an algorithm with merely twice the error rate of a better algorithm can easily make the difference between “it works” and “it doesn’t”. This is not often true when programming, as evidenced by the large number of people who use computationally inefficient languages to solve problems.

We can’t say “never use O()“, because sometimes abstracting away details is the right thing to do. However, any use of O() should be analyzed for “reasonableness” more thoroughly than for computational complexity. Similarly, more interest should be allocated to improving constants in such analysis than is done in algorithms.

3 Replies to “The Big O and Constants in Learning”

Nice post. Is this a call for tight bounds with their constants, or something else entirely?

I have a tendency to leave off conclusions, so I added a bit on—thanks.

I think of this more as “Here’s something where applied learning and learning theory people disagree about what is important, and why.”

There’s an even more disturbing reason not to use bigO notation, which was pointed out very aggressively by Marcus Hutter in the context of computation time, not generalization. In many cases, it is possible to “cheat” in a crazy way, using bigO to “hide” massive constant pre-factors and make completely ridiculous algorithms seem reasonable or even optimal.
For example, take the class of all computational problems whose optimal solution can be encoded as a program of length B bits or less on some Turning machine. Here’s a “universally optimal” program for solving ALL such problems: lexographically enumerate all 2^B programs.
Timeslice between them, running each one on the given input. When any of them terminate,
check if the answer they gave is correct. Since one of them will be the optimal program,
this introduces a “constant” slowdown of “only” 1/(2^B), which of course doesn’t depend on the input size and thus such a strategy is bigO optimal for all problems. Foolish, of course, but it points out a huge Achilles’ heel in the whole bigO framework. Marcus actually shows how to clean up the sketched out argument above to include all problems (without the assumption of size <=B) using Levin search, and to make more precise the bit about "checking the answer". Scaarry stuff.

Comments are closed.