Programming Languages for Machine Learning Implementations

Machine learning algorithms have a much better chance of being widely adopted if they are implemented in some easy-to-use code. There are several important concerns associated with machine learning which stress programming languages on the ease-of-use vs. speed frontier.

  1. Speed The rate at which data sources are growing seems to be outstripping the rate at which computational power is growing, so it is important that we be able to eke out every bit of computational power. Garbage-collected languages (Java, OCaml, Perl, and Python) often have several issues here.
    1. Garbage collection often implies that floating point numbers are “boxed”: every float is represented by a pointer to a float. Boxing can cause an order-of-magnitude slowdown because an extra nonlocalized memory reference is made, and accesses to main memory take many CPU cycles. (A toy illustration of this appears just after this list.)
    2. Garbage collection often implies that considerably more memory is used than is necessary. This has a variable effect: in some circumstances it results in no slowdown, while in others it can cause a 4-order-of-magnitude slowdown. The basic rule here is that you never want to run out of physical memory and start using swap.
    3. Some of these languages are interpreted rather than compiled. As a rule of thumb, interpreted languages are an order of magnitude slower than compiled languages.
    4. Even when these languages are compiled, there are often issues with how well they are compiled. Compiling to a modern desktop processor is a tricky business, and only a few compilers do this well.
  2. Programming Ease Ease of use of a language is very subjective because it is always easiest to use the language you are most familiar with. Nevertheless, significant differences exist.
    1. Syntax Syntax is often overlooked, but it can make a huge difference in the ease of both learning to program and using the language. A good syntax allows you to study and improve the algorithm rather than the program implementing it. (Algorithmic improvements often yield the greatest speedups while optimizing.) The syntax needs to be concise (so that you can view the entire algorithm) and simple (so that it can become second nature).
    2. Library Support Languages vary dramatically in terms of library support, and having the right linear algebra/graphics/IO library can make a task dramatically easier. Perl has a huge number of associated libraries which greatly ease use.
    3. Built in Functionality Languages differ in terms of the primitives that are available. Which of these primitives are useful is a subject of significant debate.
      1. Some form of generic programming (templates, polymorphism, etc…) seems important.
      2. Functions as first-class objects are often very convenient while programming.
      3. Built-in lists and hash tables are often extremely useful primitives. One caveat here is that when you make a speed optimization pass, you often have to avoid these primitives.
      4. Support for modularity is important. Objects (as in object-oriented programming) are one example of this, but there are many others. The essential quantity here seems to be an interface creation mechanism.
    4. Scalability Scalability is where otherwise higher-level languages often break down. A simple example of this is a language with file I/O built in that fails to perform correctly when the file has size 2^31 or 2^32 bytes. I am particularly familiar with OCaml, which has the following scalability issues:
      1. List operations often end up consuming the stack and segfaulting. The Unison crew were annoyed enough by this that they created their own “safelist” library with all the same interfaces as the list type. (A sketch of this workaround appears just after this list.)
      2. Arrays on a 32-bit machine can have only 2^22 - 1 elements due to dynamic type information set aside for the garbage collector. As a substitute, there is a “big array” library. However, having big arrays, arrays, and strings often becomes annoying, because they have different interfaces for objects which are semantically the same.
    5. Familiarity This isn’t just your familiarity, but also the familiarity of other people who might use the code. At one extreme, you can invent your own language (as Yann LeCun has done with Lush). At the other extreme, you can use a language which many people are familiar with such as C or Java.
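
As a toy illustration of the boxing issue mentioned under Speed, the OCaml sketch below sums the same floats stored in an unboxed float array and in a list, whose elements are boxed and reached through an extra pointer per cons cell. The sizes and the timing helper here are my own illustration, not a benchmark from any particular system.

    (* Toy comparison: OCaml stores a float array unboxed (a flat block of
       doubles), while a float list is a chain of cons cells, each pointing
       to a separately boxed float, so traversing it makes extra non-local
       memory references. *)
    let n = 5_000_000

    let time label f =
      let t0 = Sys.time () in
      let r = f () in
      Printf.printf "%s: %.3fs\n" label (Sys.time () -. t0);
      r

    let () =
      let a = Array.make n 1.0 in
      let l = Array.to_list a in
      let s1 = time "unboxed float array sum" (fun () -> Array.fold_left (+.) 0.0 a) in
      let s2 = time "boxed float list sum" (fun () -> List.fold_left (+.) 0.0 l) in
      Printf.printf "sums: %f %f\n" s1 s2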
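
Similarly, here is a minimal sketch of the list scalability problem and the “safelist”-style workaround: the standard List.map is not tail-recursive, so a long enough list can exhaust the stack, while mapping with the tail-recursive List.rev_map and then reversing runs in constant stack space. The name safe_map is my own; the actual Unison library is more extensive.

    (* List.map builds its result with non-tail-recursive calls, so a long
       list can blow the stack. Mapping in reverse and then reversing uses
       only tail calls, at the cost of an extra traversal. *)
    let safe_map f l = List.rev (List.rev_map f l)

    let () =
      let big = Array.to_list (Array.init 1_000_000 (fun i -> i)) in
      (* let _ = List.map (fun x -> 2 * x) big in  (* may raise Stack_overflow *) *)
      let doubled = safe_map (fun x -> 2 * x) big in
      Printf.printf "mapped %d elements\n" (List.length doubled)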

The existing widely used machine learning code bases seem to favor lower-level languages.

  1. Weka is one of the best known general purpose machine learning toolkits. It is implemented in Java and has far more algorithmic coverage than any other system I know of.
  2. LibSVM is implemented in C++ and Java.
  3. SVMlight is implemented in C.
  4. C4.5 is implemented in C.
  5. SNNS is written in C.

Both R and Matlab are also useful languages, although I have not used them.

None of the language choices seem anywhere near ideal. The higher-level languages often can’t execute fast, and the lower-level ones which can are often quite clumsy. The approach I’ve taken is to first implement in a higher-level language (my choice was OCaml) to test ideas, and then reimplement in a lower-level language (C or C++) for speed where the ideas work out. This reimplementation step is quite clumsy, and I would love to find a way to avoid it.

What is missing for online collaborative research?

The internet has recently made the research process much smoother: papers are easy to obtain, citations are easy to follow, and unpublished “tutorials” are often available. Yet, new research fields can look very complicated to outsiders or newcomers. Every paper is like a small piece of an unfinished jigsaw puzzle: to understand just one publication, a researcher without experience in the field will typically have to follow several layers of citations, and many of the papers he encounters have a great deal of repeated information. Furthermore, from one publication to the next, notation and terminology may not be consistent, which can further confuse the reader.

But the internet is now proving to be an extremely useful medium for collaboration and knowledge aggregation. Online forums allow users to ask and answer questions and to share ideas. The recent phenomenon of Wikipedia provides a proof-of-concept for the “anyone can edit” system. Can such models be used to facilitate research and collaboration? This could potentially be extremely useful for newcomers and experts alike. On the other hand, entities of this sort already exist to some extent: Wikipedia::Machine Learning, MLpedia, the discussion boards on kernel-machines.org, Rexa, and the gradual online-ification of paper proceedings to name a few.

None of these have yet achieved takeoff velocity. You’ll know that takeoff velocity has been achieved when these become a necessary part of daily life rather than a frill.

Each of these efforts seems to be missing critical pieces, such as:

  1. A framework for organizing and summarizing information. Wikipedia and MLpedia are good examples, yet this is not as well solved as you might hope, since mathematics on the web is still more awkward than it should be.
  2. A framework for discussion. Kernel-machines.org handles this, but is too area-specific. There does exist a discussion framework on Wikipedia/MLpedia, but the presentation format marginalizes discussion: it is placed on a separate page and generally not viewed by most readers. The discussion, in fact, should be an integral part of the presentation.
  3. Incentives for researchers to contribute. Wikipedia intentionally anonymizes contributors in the presentation, because recognizing them might invite the wrong sort of contributor. Incentives done well, however, are one of the things creating (6). One of the existing constraints within academia is that the basic unit of credit is coauthorship on a peer-reviewed paper. Given this constraint, it would be very handy if a system could automatically translate a subset of an online site into a paper, with authorship automatically summarized. The site itself might also track and display who has contributed how much and who has contributed recently.
  4. Explicit mechanisms for handling disagreements. If you get 3 good researchers on a topic in a room, you might have about 5 distinct opinions. Much of research has to do with thinking carefully about what is important and why, the sorts of topics likely to provoke disagreement. Given that disagreement is a part of the process of research, there needs to be a way to facilitate, and even spotlight, disagreements for a healthy online research mechanism. One crude system for handling disagreements is illustrated by the Linux kernel: “anyone can download and start their own kernel tree”. A more fine-grained version of this may be effective: “anyone can clone a webpage and start their own version of it”. Perhaps this can be coupled with a version voting system, although that is tricky. A fundamental point is: a majority vote does not determine the correctness of a theorem. Integrating a peer review system may work well. None of the existing systems handle this problem effectively.
  5. Low entry costs. Many systems handle this well, but it must be emphasized because small changes in the barrier to entry can have a large effect on (6).
  6. Community buy-in. Wikipedia is the big success story here, but Wikipedia::Machine Learning has had more limited success. There are many techniques which might aid community buy-in, but they may not be enough.

Can a site be created that simultaneously handles all of the necessary pieces for online research?

How to solve an NP hard problem in quadratic time

This title is a lie, but it is a special lie which has a bit of truth.

If n players each play each other, you have a tournament. How do you order the players from weakest to strongest?

The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arc set” problem in the CS theory literature, and it is a well-known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees with it k times (due to noise, for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can do better).

One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith, Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is a 5-approximation to the NP-hard problem.
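
For concreteness, here is a sketch of the win-counting ordering. The representation is my own illustration: the tournament is assumed to be given as a boolean matrix beats, where beats.(i).(j) is true exactly when player i beat player j.

    (* Order players from most to fewest wins, and count how many player
       pairs the resulting ordering disagrees with (the quantity the NP-hard
       minimum feedback arc set formulation minimizes). *)
    let order_by_wins beats =
      let n = Array.length beats in
      let wins =
        Array.init n (fun i ->
          Array.fold_left (fun c b -> if b then c + 1 else c) 0 beats.(i))
      in
      let players = Array.init n (fun i -> i) in
      Array.sort (fun i j -> compare wins.(j) wins.(i)) players;  (* most wins first *)
      players

    let disagreements beats order =
      let n = Array.length order in
      let d = ref 0 in
      for i = 0 to n - 1 do
        for j = i + 1 to n - 1 do
          (* order.(i) is ranked above order.(j); the pair disagrees with the
             tournament if the lower-ranked player actually won their game *)
          if beats.(order.(j)).(order.(i)) then incr d
        done
      done;
      !d

Reading the full tournament to count wins already takes time quadratic in the number of players, and the sort adds only O(n log n), consistent with the quadratic time in the title.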

An even better approach is to realize that the NP-hard problem may not be the real problem. The real problem may be finding a good approximation to the “true” intrinsic ordering given noisy tournament information.

In a learning setting, the simplest form of ranking problem is “bipartite ranking”, where every element has a value of either 0 or 1 and we want to order 0s before 1s. A common way to measure the performance of bipartite ranking is the “area under the ROC curve” (AUC) = 1 - the fraction of out-of-order pairs. Nina, Alina, Greg, and I proved that if we learn a comparison function which errs on k dissimilar pairs, then ordering according to the number of wins yields an order within 4k edge reversals of the original ordering. As a reduction statement (*), this shows that an error rate of e for a learned pairwise binary classifier produces an ordering with an expected AUC of 1 - 4e. The same inequality even holds for a (stronger) regret transform: if r = e - e_min is the regret of the binary pairwise classifier, then the AUC regret is bounded by 4r. (Here e_min is the error rate of the best possible classifier, which predicts knowing the most likely outcome.) The regret result extends to more general measures of ordering than simply AUC.
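
As a small sketch of the AUC measure itself (my own illustration, following the convention above of ordering 0s before 1s):

    (* labels.(item) is 0 or 1; order.(rank) is the item placed at that rank.
       A dissimilar pair is out of order when a 1-labeled item is ranked
       before a 0-labeled item; AUC = 1 - fraction of such pairs. *)
    let auc labels order =
      let n = Array.length order in
      let out_of_order = ref 0 and pairs = ref 0 in
      for i = 0 to n - 1 do
        for j = i + 1 to n - 1 do
          let li = labels.(order.(i)) and lj = labels.(order.(j)) in
          if li <> lj then begin
            incr pairs;
            if li = 1 && lj = 0 then incr out_of_order
          end
        done
      done;
      if !pairs = 0 then 1.0
      else 1.0 -. float_of_int !out_of_order /. float_of_int !pairs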

We were unable to find any examples where ordering according to the degree produced more than 2r AUC regret. Nikhil Bansal, Don, and Greg have worked out a tightened proof which gives exactly this upper bound. At the end of the day, we have an algorithm which satisfies precisely the same guarantee as the NP-hard solution.

There are two lessons here. The learning lesson is that a good pairwise comparator implies the ability to rank well according to AUC. The general research lesson is that an NP-hard problem is not an intrinsic obstacle when only an approximate solution is needed. Sometimes there exist simple tractable algorithms which satisfy the same guarantees as the NP-hard solution.

(*) To prove the reduction, you must make sure that you form pairwise examples in the right way: your source of pairwise ordering examples must be uniform over the dissimilar pairs containing one example with label 1 and one example with label 0.

Objective and subjective interpretations of probability

An amusing tidbit (reproduced without permission) from Herman Chernoff’s delightful monograph, “Sequential analysis and optimal design”:

The use of randomization raises a philosophical question which is articulated by the following probably apocryphal anecdote.

The metallurgist told his friend the statistician how he planned to test the effect of heat on the strength of a metal bar by sawing the bar into six pieces. The first two would go into the hot oven, the next two into the medium oven, and the last two into the cool oven. The statistician, horrified, explained how he should randomize to avoid the effect of a possible gradient of strength in the metal bar. The method of randomization was applied, and it turned out that the randomized experiment called for putting the first two pieces into the hot oven, the next two into the medium oven, and the last two into the cool oven. “Obviously, we can’t do that,” said the metallurgist. “On the contrary, you have to do that,” said the statistician.

What are arguments for and against this design? In a “larger” design or sample, the effect of a reasonable randomization scheme could be such that this obvious difficulty would almost certainly not happen. Assuming that the original strength of the bar and the heat treatment did not “interact” in a complicated nonlinear way, the randomization would virtually cancel out any effect due to a strength gradient or other erratic phenomena, and computing estimates as though these did not exist would lead to no real error. In this small problem, the effect may not be cancelled out, but the statistician still has a right to close his eyes to the design actually selected if he is satisfied with “playing fair”. That is, if he instructs an agent to select the design and he analyzes the results, assuming there are no gradients, his conclusions will be unbiased in the sense that a tendency to overestimate is balanced on the average by a tendency to underestimate the desired quantities. However, this tendency may be substantial as measured by the variability of the estimates which will be affected by substantial gradients. On the other hand, following the natural inclination to reject an obviously unsatisfactory design resulting from randomization puts the statistician in the position of not “playing fair”. What is worse for an objective statistician, he has no way of evaluating in advance how good his procedure is if he can change the rules in the middle of the experiment.

The Bayesian statistician, who uses subjective probability and must consider all information, is unsatisfied to simply play fair. When randomization leads to the original unsatisfactory design, he is aware of this information and unwilling to accept the design. In general, the religious Bayesian states that no good and only harm can come from randomized experiments. In principle, he is opposed even to random sampling in opinion polling. However, this principle puts him in untenable computational positions, and a pragmatic Bayesian will often ignore what seems useless design information if there are no obvious quirks in a randomly selected sample.