$1M Netflix prediction contest

Netflix is running a contest to improve recommender prediction systems. A 10% improvement over their current system yields a $1M prize. Failing that, the best smaller improvement yields a smaller $50K prize. This contest looks quite real, and the $50K prize money is almost certainly achievable with a bit of thought. The contest also comes with a dataset which is apparently 2 orders of magnitude larger than any other public recommendation system datasets.

Programming Languages for Machine Learning Implementations

Machine learning algorithms have a much better chance of being widely adopted if they are implemented in some easy-to-use code. There are several important concerns associated with machine learning which stress programming languages on the ease-of-use vs. speed frontier.

  1. Speed The rate at which data sources are growing seems to be outstripping the rate at which computational power is growing, so it is important that we be able to eak out every bit of computational power. Garbage collected languages (java, ocaml, perl and python) often have several issues here.
    1. Garbage collection often implies that floating point numbers are “boxed”: every float is represented by a pointer to a float. Boxing can cause an order of magnitude slowdown because an extra nonlocalized memory reference is made, and accesses to main memory can are many CPU cycles long.
    2. Garbage collection often implies that considerably more memory is used than is necessary. This has a variable effect. In some circumstances it results in no slowdown while in others it can cause a 4-order of magnitude slowdown. The basic rule here is that you never want to run out of physical memory and use swap.
    3. Some of these languages are interpreted rather than executed. As a rule of thumb, interpreted languages are an order of magnitude slower than an executed languages.
    4. Even when these languages are compiled, there are often issues with how well they are compiled. Compiling to a modern desktop processor is a tricky business, and only a few compilers do this well.
  2. Programming Ease Ease of use of a language is very subjective because it is always easiest to use the language you are most familiar with. Nevertheless, significant differences exist.
    1. Syntax Syntax is often overlooked, but it can make a huge difference in the ease of both learning to program and using the language. A good syntax allows you study and improve the algorithm rather than the program implementing it. (Algorithmic improvements often yield the greatest speedups while optimizing.) The syntax needs to be concise (so that you can view the entire algorithm) and simple (so that it can become second nature).
    2. Library Support Languages vary dramatically in terms of library support, and having the right linear algebre/graphics/IO library can make a task dramatically easier. Perl has a huge number of associated libraries which greatly ease use.
    3. Built in Functionality Languages differ in terms of the primitives that are available. Which of these primiitives are useful is a subject of significant debate.
      1. Some form of generic programming (templates, polymorphism, etc…) seems important.
      2. Functions as first class objects is often very convenient while programming.
      3. Builting lists and hash tables are often extremely useful primitives. One caveat here is that when you make a speed optimization pass, you often have to avoid these primitives.
      4. Support for modularity is important. Objects (as in object oriented programming) is one example of this, but there are many others. The essential quantity here seems to be an interface creation mechanism.
    4. Scalability Scalability is where otherwise higher level languages often break down. A simple example of this is a language with file I/O built in that fails to perform correctly when the file has size 231 or 232. I am particularly familiar with Ocaml which has the following scalability issues:
      1. List operations often end up consuming the stack and segfaulting. The Unison crew were annoyed enough by this that they created their own “safelist” library with all the same interfaces as the list type.
      2. Arrays on a 32 bit machine can have only 222-1 elements due to dynamic type information set aside for the garbage collector. As a substitute, there is a “big array” library. However, having big arrays, arrays, and strings, often becomes annoying because they have different interfaces for objects which are semantically the same.
    5. Familiarity This isn’t just your familiarity, but also the familiarity of other people who might use the code. At one extreme, you can invent your own language (as Yann LeCun has done with Lush). At the other extreme, you can use a language which many people are familiar with such as C or Java.

The existing significantly used machine learning code bases seem to favor lower level languages.

  1. Weka is one of the best known general purpose machine learning toolkits. It is implemented in Java and has far more algorithmic coverage than any other system I know of.
  2. LibSVM is implemented in C++ and Java.
  3. SVMlight is implemented in C.
  4. C4.5 is implemented in C.
  5. SNNS is written in C.

Both R and Matlab are also useful languages, although I have not used them.

None of the language choices seem anywhere near ideal. The higher level languages often can’t execute fast and the lower level ones which can are often quite clumsy. The approach I’ve taken is to first implement in a higher level language (my choice was ocaml) to test ideas and then reimplement in a lower level language (C or C++) for speed where the ideas work out. This reimplementation step is quite clumsy and I would love to find a way to avoid it.

Incentive Compatible Reviewing

Reviewing is a fairly formal process which is integral to the way academia is run. Given this integral nature, the quality of reviewing is often frustrating. I’ve seen plenty of examples of false statements, misbeliefs, reading what isn’t written, etc…, and I’m sure many other people have as well.

Recently, mechanisms like double blind review and author feedback have been introduced to try to make the process more fair and accurate in many machine learning (and related) conferences. My personal experience is that these mechanisms help, especially the author feedback. Nevertheless, some problems remain.

The game theory take on reviewing is that the incentive for truthful reviewing isn’t there. Since reviewers are also authors, there are sometimes perverse incentives created and acted upon. (Incidentially, these incentives can be both positive and negative.)

Setting up a truthful reviewing system is tricky because their is no final reference truth available in any acceptable (say: subyear) timespan. There are several ways we could try to get around this.

  1. We could try to engineer new mechanisms for finding a reference truth into a conference and then use a ‘proper scoring rule’ which is incentive compatible. For example, we could have a survey where conference participants short list the papers which interested them. There are significant problems here:
    1. Conference presentations mostly function as announcements of results. Consequently, the understanding of the paper at the conference is not nearly as deep as, say, after reading through it carefully in a reading group.
    2. This is inherently useless for judging reviews of rejected papers and it is highly biased for judging reviews of papers presented in two different formats (say, a poster versus an oral presentation).
  2. We could ignore the time issue and try to measure reviewer performance based upon (say) long term citation count. Aside from the bias problems above, there is also a huge problem associated with turnover. Who the reviewers are and how an individual reviewer reviews may change drastically in just a 5 year timespan. A system which can provide track records for only a small subset of current reviewers isn’t very capable.
  3. We could try to manufacture an incentive compatible system even when the truth is never known. This paper by Nolan Miller, Paul Resnick, and Richard Zeckhauser discusses the feasibility of this approach. Essentially, the scheme works by rewarding reviewer i according to a proper scoring rule applied to P(reviewer j’s score | reviewer i’s score). (A simple example of a proper scoring rule is log[P()].) This is approach is pretty fresh, so there are lots of problems, some of which may or may not be fundamental difficulties for application in practice. The significant problem I see is that this mechanism may reward joint agreement instead of a good contribution towards good joint decision making.

None of these mechanisms are perfect, but they may each yield a little bit of information about what was or was not a good decision over time. Combining these sources of information to create some reviewer judgement system may yield another small improvement in the reviewing process.

The important thing to remember is that we are the reviewers as well as the authors. Are we interested in tracking our reviewing performance over time in order to make better judgements? Such tracking often happens on an anecdotal or personal basis, but shifting to an automated incentive compatible system would be a big change in scope.