Machine Learning – Page 46 – Machine Learning (Theory)

9/16/2007

Optimizing Machine Learning Programs

Machine learning is often computationally bounded which implies that the ability to write fast code becomes important if you ever want to implement a machine learning algorithm. Basic tactical optimizations are covered well elsewhere, but I haven’t seen a reasonable guide to higher level optimizations, which are the most important in my experience. Here are some of the higher level optimizations I’ve often found useful.

Algorithmic Improvement First. This is Hard, but it is the most important consideration, and typically yields the most benefits. Good optimizations here are publishable. In the context of machine learning, you should be familiar with the arguments for online vs. batch learning.
Choice of Language. There are many arguments about the choice of language. Sometimes you don’t have a choice when interfacing with other people. Personally, I favor C/C++ when I want to write fast code. This (admittedly) makes me a slower programmer than when using higher level languages. (Sometimes I prototype in Ocaml.) Choosing the wrong language can result in large slowdowns.
Avoid Pointer-Based Representations. The way your represent information in your program can have a dramatic impact on performance. My rule of thumb is “for fast programs, it’s all arrays in the end”. As an example, consider maps. STL provides map (a tree-based datastructure) and hash_map (an array-of-pointers data structure). Where a hash_map works, it’s common to observe an order-of-magnitude improvement in performance over a map. (Sometimes you must futz with the hash function to observe this). The Google dense_hash_map replaces the array of pointers with a plain old array and (unsurprisingly) is even faster.
What’s fundamentally happening here is locality: dereferencing pointers is a very expensive operation on modern computers because the CPU runs much faster than the latency to RAM. By converting everything into an array, you compute rather than dereference the location of data. Converting things into an array is not always easy, but it is often possible with a little bit of thought and care.
Cached Parsing. Fast algorithms are required for large quantities of data. Unfortunately, the computational process of reading and parsing the data is often intensive. By caching parsed examples in a machine representation format (either in RAM or on disk), a substantial performance boost is achievable. This comes from two sources:
1. You avoid the work of parsing again.
2. The machine representation can be more concise implying improved system caching effects.
Deshuffle. Avoid copying information. It’s easy to end up copying data from one place to another. Commonly, the best implementation avoids this, which has strong implications on representation choice.
Write less Code. There are many reasons to write less code where you can. For the purposes of optimization, having less code in the bottleneck is the surest way to reduce work in the bottleneck. There are lots of techniques for doing this—some of them are purely syntactic transformations while others involve redesigning the algorithm itself.
Don’t trust libraries. In particular, don’t trust library calls inside the bottleneck. It’s often the case that a library function is general purpose while you can get away with a much faster hand-crafted (and inlined) function.
Buffered disk I/O. There is a huge difference in performance between reading and writing directly from the disk and doing this through a buffering layer. In general, it’s not obvious which I/O library functions buffer properly. You can experiment, or implement your own buffering system. C++ I/O libraries seem to handle this better than C libraries in general.
Amortization. Amortization is a very powerful technique for algorithm optimization. The basic idea is to always make sure that one computation (a secondary one) is amortized by another (your primary computation).
Optimize while you wait. There is always a question about how much time should be spent on optimization vs. other aspects of programming. A reasonable rule of thumb is to spend time on optimization when you are waiting for the program to finish running. This is ammortization, applied to yourself.

In all program optimization, it is critical to know where the bottleneck is, and optimize preferentially there. This is fairly tricky on a modern computer because there are many ways that hidden latency can creep in. Tools like gprof and Valgrind can be helpful here, but there is no substitute for having a basic understanding of how a computer works.

8/28/2007

Live ML Class

Davor and Chunnan point out that MLSS 2007 in Tuebingen has live video for the majority of the world that is not there (heh).

8/25/20078/26/2007

The Privacy Problem

Machine Learning is rising in importance because data is being collected for all sorts of tasks where it either wasn’t previously collected, or for tasks that did not previously exist. While this is great for Machine Learning, it has a downside—the massive data collection which is so useful can also lead to substantial privacy problems.

It’s important to understand that this is a much harder problem than many people appreciate. The AOL data release is a good example. To those doing machine learning, the following strategies might be obvious:

Just delete any names or other obviously personally identifiable information. The logic here seems to be “if I can’t easily find the person then no one can”. That doesn’t work as demonstrated by the people who were found circumstantially from the AOL data.
… then just hash all the search terms! The logic here is “if I can’t read it, then no one can”. It’s also trivially broken by a dictionary attack—just hash all the strings that might be in the data and check to see if they are in the data.
… then encrypt all the search terms and throw away the key! This prevents a dictionary analysis, but it is still entirely possible to do a frequency analysis. If 10 terms appear with known relative frequencies in public data, then finding 10 terms encrypted terms with the same relative frequencies might give you very good evidence for what these terms are.

All of these strategies turn out to be broken. For those not familiar with Machine Learning, other obvious strategies turn out to not work that well.

Just don’t collect the data. We are not too far off from a world where setting the “please don’t collect my information” flag in your browser implies that you volunteer to have your searches return less relevant results, to not find friends, to not filter spam, etc… If everyone effectively has that flag set by legislation the effect would be very substantial. Many internet companies run off of advertising so eliminating the ability to do targeted advertising will eliminate the ability of these companies to exist.
…Then just keep aggregations of the data! Aggregating data is very bad for machine learning in general. When we are figuring out how to do machine learning it’s even worse because we don’t know in advance which aggregations would be most useful.
…Then keep just enough data around and throw out everything else! Unfortunately, there is no such thing as “enough data”. More data is always better.

This is a particularly relevant topic right now, because it’s news and because CMU and NSF are organizing a workshop on the topic next month, which I’m planning to attend. However, this is not simply an interest burst—the long term trend of increasing data collection implies this problem will repeatedly come up over the indefinite future.

The privacy problem breaks into at least two parts.

Cultural Norms. Historically, almost no monetary transactions were recorded and there was a reasonable expectation that people would forget a visitor. This is rapidly changing with the rise of credit cards and cameras. This change in what can be expected is profoundly uncomfortable.
Power Balance. Data is power. The ability to collect and analyze large quantities of data which many large organizations now have or are constructing increases their power relative to ordinary people. This power can be used for good (to improve services) or for bad (to maximize monopoly status or for spying).

The cultural norm privacy problem is sometimes solvable by creating an opt-in or opt-out protocol. This is particularly helpful on the internet because a user could simply request “please don’t record my search” or “please don’t record which news articles I read”. Needing to do this for every search or every news article would be annoying. However, this is easily fixed by having a system wide setting—perhaps a special browser cookie which says “please don’t record me” that any site could check. None of this is helpful for cameras (where no interface exists) or monetary transactions (where the transaction itself determines whether or not some item is shipped).

The power balance privacy problem is much more difficult. Some solutions that people attempt are:

Accept the change in power balance. This is the default action. There are plenty of historical examples where large organizations have abused their power, so providing them more power to abuse may be unwise.
Legislate a halt. Forbid cameras in public places. Forbid the collection or retention of data by any organization. The problem with this method is that technology simply isn’t moving in this direction. At some point, we may end up with cameras and storage devices so small, cheap, and portable that forbidding their use is essentially absurd. The other difficulty with this solution is that it keeps good things from happening. For example, a reasonable argument can be made that the British were effective at tracking bomb planters because the cameras of London helped them source attacks.
Legislate an acceleration. Instead of halting the collection of data, open it up to more general use. One example of this is cameras in police cars in the US. Recordings from these cameras can often settle disputes very definitively. As technology improves, it’s reasonable to expect cameras just about anywhere people are in public. Some legislation and good engineering could make these cameras available to anyone. This would involve a substantial shift in cultural norms—essentially people would always be in potential public view when not at home. This directly collides with the “privacy as a cultural norm” privacy problem.

The hardness of the privacy problem mentioned at the post beginning implies difficult tradeoffs.

If you have cultural norm privacy concerns, then you really don’t appreciate method (3) for power balance privacy concerns.
If you value privacy greatly and the default action is taken, then you prefer monopolistic marketplaces. The advantages of a large amount of private data are prohibitive to new market entrance.
If you want the internet to work better, then there are limits on how little data can be collected.

All of the above is even murkier because what can be done with data is not fully known, nor is what can be done in a privacy sensitive way.

8/19/2007

Choice of Metrics

How do we judge success in Machine Learning? As Aaron notes, the best way is to use the loss imposed on you by the world. This turns out to be infeasible sometimes for various reasons. The ones I’ve seen are:

The learned prediction is used in some complicated process that does not give the feedback necessary to understand the prediction’s impact on the loss.
The prediction is used by some other system which expects some semantics to the predicted value. This is similar to the previous example, except that the issue is design modularity rather than engineering modularity.
The correct loss function is simply unknown (and perhaps unknowable, except by experimentation).

In these situations, it’s unclear what metric for evaluation should be chosen. This post has some design advice for this murkier case. I’m using the word “metric” here to distinguish the fact that we are considering methods for evaluating predictive systems rather than a loss imposed by the real world or a loss which is optimized by a learning algorithm.

A good metric satisfies several properties.

The real problem. This property trumps all other concerns and every effort argument that metric A better reflects the real problem than metric B must be carefully evaluated. Unfortunately, this is application specific, so little more advice can be given.
Boundedness. A good metric is bounded. Joaquin ran the Evaluating Predictive Uncertainty Challenge and presented results at a NIPS workshop. In the presentation, there was a slide on “analysis of a disaster”—one of the good contestant entries mispredicted the correct answer with very high confidence resulting in a terrible log loss, which was used to evaluate the contestants. If that single example was removed from the test set, the entry would have won.
An essential question is: is this fair? My belief is no. It’s not generally fair to exclude a method because it errs once, because losses imposed by the real world are often bounded. This is a failure in the choice of metric of evaluation as much as a failure of the prediction system.

Another highly unintuitive property of unbounded losses is nonconvergence. When an IID assumption holds over examples in the test set, bounded losses converge at known rates. Without a bound, there is never convergence. This means, for example, that we can construct examples of systems where a test set of size m typically produces a loss ordering of system A > system B but a test set of size 2m reverses the typical ordering. This ordering can be made to reverse again and again as larger and larger test sets are used. In other words, there is no size of test set such that you can be sure that the order has stabilized.
Atomicity. Another way to suffer the effects of nonconvergence above is to have a metric which uses some formula on an entire set of examples to produce a prediction. A simple example of this is “area under the ROC curve” which becomes very unstable when the set of test examples is “lopsided”—i.e. almost always 1 or almost always 0. (Sample complexity analysis for AUC gets around this by conditioning on the number of 1’s or 0’s. Most sampling processes do not have this conditioning built in.)
Variation. A good metric can vary substantially in value. Precisely defining “vary substantially” is slightly tricky, because we wan’t it to vary substantially in value with respect to the tested system. A reasonable approach is: compare the metric on the best constant predictor to the minimum of the metric.
Variation can always be “improved” by simply doubling the magnitude of the metric. To remove this “improvement” from consideration, normalizing by a bound on the loss appears reasonable. This implies that we measure varation as (loss of best constant predictor – minimal possible loss) / (maximal loss – minimal loss). As an example, squared loss—(y’ – y)² would have variation 0.25 according to this measure while 0/1 loss—I(y’ != y) would have variation 0.5 for binary classification. For k-class classification, the variation would be (k-1)/k. When possible, using the actual distribution over y to compute the loss of the best constant predictor is even better.
Semantics. It’s useful to have a semantics to the metric, because it makes communication of the metric easy.
Simplicity. It’s useful to have the metric be simple because a good metric will be implemented multiple times by multiple people. Simplicity implies that time is saved in implementation and debugging.

Trading off amongst these criteria is not easy in general, but sometimes a good argument can be made. For example, If you care about probabilistic semantics, squared loss seems superior to log loss (i.e. log (1/p(true y) where p() is the predicted value, because squared loss is bounded.

Another example is AUC ordering vs. predicting which of mixed pairs are larger. Predicing mixed pairs correctly has known rates of deviation convergence and is essentially equivalent to AUC optimization.

8/12/2007

Exponentiated Gradient

The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it.

The essential statement holds in the “online learning with an adversary” setting. Initially, there are of set of n weights, which might have values (1/n,…,1/n), (or any other values from a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens:

The world reveals a set of features x in {0,1}ⁿ. In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts).
EG makes a prediction according to y’ = w . x (dot product).
The world reveals the truth y in [0,1].
EG updates the weights according to w_i <- w_ie^{-2 c (y’ – y)x_i}. Here c is a small constant learning rate.
The weights are renormalized to sum to 1.

The 4th line is what gives EG it’s name—exponent of the negative gradient (of squared loss in this case).

The accompanying theoretical statement (in english), is that for all sequences of observations, this algorithm does nearly as well in squared loss as the best convex combination of the features with a regret that is only logarithmic in the number of features. The mathematical theorem statement is: For all T for all T-length sequences of observations,

Sum^T_t=1 (y’_t – y_t)² <= min_{probability distributions q} (2/(2-c)) Sum^T_t=1 (q.x_t – y)² + KL(q||p) / c

Here KL(q||p) = Sum_i q_i ln (q_i / p_i) is the KL-divergence between the distribution q that you compare with and the distribution p that you start with. The KL-divergence is at most log n if p is uniform.

The learning rate c plays a critical role in the theorem, and the best constant setting of c depends on how many total rounds T there are. Tong Zhang likes to think of this algorithm as the stochastic gradient descent with entropy regularization, which makes it clear that when used as an online optimization algorithm, c should be gradually decreased in value.

There are many things right here.

Exponentiated Gradient competes with the best convex predictor with no caveats about how the world produces the data. This is pretty remarkable—it’s much stronger than competing with the best single feature, as many other online adversary algorithms do. The lack of assumptions about the world means this is a pretty universal algorithm.
The notion of competition is logarithmic in the number of features so the algorithm can give good performance even when the number of features is extraordinarily large compared to the number of examples. In contrast gradient descent style algorithms might need a number of examples similar to the number of features. (The same paper also analyzes regret for gradient descent.)
We know from a learning reductions point of view that the ability to optimize squared loss well implies that ability to optimize many other losses fairly well.

A few things aren’t right about EG. A convex combination is not as powerful a representation as a linear combination. There are natural relaxations covered in the EG paper which deal with more general combinations, as well as in some other papers. For more general combinations, the story with respect to gradient descent becomes more mixed. When the best predictor is a linear combination of many features, gradient descent style learning may be superior to EG.

EG-style algorithms are slowly coming out. For example, this paper shows that EG style updates can converge very quickly compared to other methods.