Exemplar programming

There are many different abstractions for problem definition and solution. Here are a few examples:

  1. Functional programming: a set of functions is defined. The composed execution of these functions yields the solution.
  2. Linear programming: a set of constraints and a linear objective function are defined. An LP solver finds the constrained optimum.
  3. Quadratic programming: Like linear programming, but the language is a little more flexible (and the solution slower).
  4. Convex programming: like quadratic programming, but the language is more flexible (and the solutions even slower).
  5. Dynamic programming: a recursive definition of the problem is given and then solved efficiently via caching tricks (see the sketch after this list).
  6. SAT programming: A problem is specified as a satisfiability formula, a conjunction of disjunctions of boolean variables. A general engine attempts to find a good satisfying assignment. For example, Kautz’s Blackbox planner.
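
To make the dynamic programming entry concrete, here is a minimal Python sketch (names and example purely illustrative, not from any particular library): the edit distance between two strings is defined recursively and solved efficiently by caching subproblem results.

```python
from functools import lru_cache

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: a recursive problem definition plus caching of subproblems."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0:
            return j  # insert the remaining j characters of b
        if j == 0:
            return i  # delete the remaining i characters of a
        substitution_cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1,                      # delete from a
                   d(i, j - 1) + 1,                      # insert into a
                   d(i - 1, j - 1) + substitution_cost)  # match or substitute
    return d(len(a), len(b))

print(edit_distance("kitten", "sitting"))  # prints 3
```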

These abstractions have different tradeoffs between ease of use, generality, and the effectiveness of existing solutions.

Machine learning can be thought of as exemplar programming. Exemplar programming is creating examples of (input, output) pairs which are used by an algorithm to learn a function from input to output (a concrete sketch follows the list below). Basic questions about this abstraction are:

  1. How easy to use is it? An honest answer here is “not very”. There are several subtle issues related to overfitting, independence, and representativeness of the samples which take significant effort to describe to an unfamiliar person. Making this abstraction easier to use via careful language design is an area where effort may pay off.
  2. How effective are the exemplar programming solvers (aka learning algorithms)? There is huge variance with respect to the problem under consideration. For some problems, some learning algorithm is very effective, while for others you might as well have not tried.
  3. How general is exemplar programming? Very general. A great many problems can be phrased as “given this, I want that”.
  4. How effective has exemplar programming been? Increasingly effective.
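
For concreteness, here is a minimal Python sketch of the abstraction itself (the solver and all names are illustrative, with one-nearest-neighbor chosen only for brevity): the “program” is nothing but (input, output) exemplars, and a generic solver produces the function from input to output.

```python
def exemplar_solver(exemplars):
    """A generic solver for the exemplar abstraction: given (input, output) pairs,
    return a predicted function from input to output (one-nearest-neighbor here)."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def predict(x):
        _, output = min(exemplars, key=lambda pair: squared_distance(pair[0], x))
        return output

    return predict

# The "program" is nothing but examples of the desired (input, output) behavior.
examples = [((0.0, 0.0), "reject"), ((1.0, 1.0), "accept"), ((0.9, 0.2), "reject")]
f = exemplar_solver(examples)
print(f((0.8, 0.9)))  # prints "accept"
```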

Exemplar programming is a viewpoint of machine learning which (mostly) ignores statistics, prior information, and the possibility of overfitting. That’s a great deal to ignore, but there are gains as well.

  1. Exemplar programming creates a split between problem solution and problem formation. This is important because the problem solver can be heavily optimized (for speed, scalability, effectiveness on common problems, etc…), making the process of applying machine learning simply a matter of specifying the exemplars.
  2. The formation/solution split helps us focus on problem formation independent of solution. The big gains in machine learning in the last decade seem to come from discovering how to apply it to new areas. A significant step in any application to a new area is discovering the right way to formulate the problem.

Exemplar programming seems to be a useful viewpoint for machine learning in “big data” problems with many examples where significant prior information is lacking. When either of these conditions is not met, other viewpoints/approaches to machine learning may be more successful.

Incompatibilities between classical confidence intervals and learning

Classical confidence intervals satisfy a theorem of the form: For some data sources D,

Pr_{S ~ D}(f(D) > g(S)) > 1 - δ

where f is some function of the distribution (such as the mean) and g is some function of the observed sample S. The constraints on D can range from “Independent and identically distributed (IID) samples from a Gaussian with an unknown mean” to “IID samples from an arbitrary distribution D”. There are even some confidence intervals which do not require IID samples.
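
For concreteness, here is a minimal Python sketch of one such g(S) (an illustrative choice, assuming IID samples bounded in [0,1]; the function name is mine): the Hoeffding lower confidence bound, where f(D) is the true mean.

```python
import math

def hoeffding_lower_bound(sample, delta):
    """g(S): with probability at least 1 - delta over the draw of S,
    the true mean f(D) exceeds this value (assumes IID samples in [0, 1])."""
    n = len(sample)
    empirical_mean = sum(sample) / n
    return empirical_mean - math.sqrt(math.log(1.0 / delta) / (2.0 * n))
```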

Classical confidence intervals often confuse people. They do not say “with high probability, for my observed sample, the bound holds”. Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on D are satisfied), then you are not often wrong. Restated, they tell you something about what a safe procedure is in a stochastic world, where δ is the safety parameter.
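
This repeated-use reading can be checked by simulation: over many independent samples S, the event f(D) > g(S) should hold in at least a 1 - δ fraction of draws. A minimal sketch with the Hoeffding bound above, using Bernoulli samples (all names illustrative):

```python
import math
import random

def coverage_frequency(true_mean=0.3, n=100, delta=0.05, trials=2000, seed=0):
    """How often does f(D) > g(S) hold when the procedure is used repeatedly?"""
    rng = random.Random(seed)
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))  # Hoeffding slack term
    hits = 0
    for _ in range(trials):
        sample = [1.0 if rng.random() < true_mean else 0.0 for _ in range(n)]
        g_of_s = sum(sample) / n - slack  # the lower confidence bound g(S)
        if true_mean > g_of_s:            # did the bound hold on this draw?
            hits += 1
    return hits / trials

print(coverage_frequency())  # typically well above 1 - delta; Hoeffding is conservative
```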

There are a number of results in theoretical machine learning which use confidence intervals. For example,

  1. The E3 algorithm uses confidence intervals to learn a near optimal policy for any MDP with high probability.
  2. Set Covering Machines minimize a confidence interval upper bound on the true error rate of a learned classifier.
  3. The A2 algorithm uses confidence intervals to safely deal with arbitrary noise while taking advantage of active learning.

Suppose that we want to generalize these algorithms in a reductive style. The goal would be to train a regressor to predict the output of g(S) for new situations. For example, a good regression prediction of g(S) might allow E3 to be applied to much larger state spaces. Unfortunately, this approach seems to fail badly due to a mismatch between the semantics of learning and the semantics of a classical confidence interval.

  1. It’s difficult to imagine a constructive sampling mechanism. In a large state space, we may never encounter the same state twice, so we can not form meaningful examples of the form “for this state-action, the correct confidence interval is y“.
  2. When we think of successful learning, we typically think of it in an l1 sense: the expected error rate over the data generating distribution is small. Confidence intervals have a much stronger meaning as we would like to apply them: with high probability, in all applications, the confidence interval holds. This mismatch appears unaddressable.

It is tempting to start plugging in other notions such as Bayesian confidence intervals or quantile regression systems. Making these approaches work at a theoretical level on even simple systems is an open problem, but there is plenty of motivation to do so.

$1M Netflix prediction contest

Netflix is running a contest to improve recommender prediction systems. A 10% improvement over their current system yields a $1M prize. Failing that, the best smaller improvement yields a $50K prize. This contest looks quite real, and the $50K prize money is almost certainly achievable with a bit of thought. The contest also comes with a dataset which is apparently 2 orders of magnitude larger than any other public recommendation system dataset.

Programming Languages for Machine Learning Implementations

Machine learning algorithms have a much better chance of being widely adopted if they are implemented in some easy-to-use code. There are several important concerns associated with machine learning which stress programming languages on the ease-of-use vs. speed frontier.

  1. Speed The rate at which data sources are growing seems to be outstripping the rate at which computational power is growing, so it is important that we be able to eke out every bit of computational power. Garbage collected languages (Java, OCaml, Perl, and Python) often have several issues here.
    1. Garbage collection often implies that floating point numbers are “boxed”: every float is represented by a pointer to a float. Boxing can cause an order of magnitude slowdown because an extra nonlocalized memory reference is made, and accesses to main memory take many CPU cycles (a rough timing sketch follows this list).
    2. Garbage collection often implies that considerably more memory is used than is necessary. This has a variable effect. In some circumstances it results in no slowdown, while in others it can cause a four-order-of-magnitude slowdown. The basic rule here is that you never want to run out of physical memory and use swap.
    3. Some of these languages are interpreted rather than executed. As a rule of thumb, interpreted languages are an order of magnitude slower than executed languages.
    4. Even when these languages are compiled, there are often issues with how well they are compiled. Compiling to a modern desktop processor is a tricky business, and only a few compilers do this well.
  2. Programming Ease Ease of use of a language is very subjective because it is always easiest to use the language you are most familiar with. Nevertheless, significant differences exist.
    1. Syntax Syntax is often overlooked, but it can make a huge difference in the ease of both learning to program and using the language. A good syntax allows you to study and improve the algorithm rather than the program implementing it. (Algorithmic improvements often yield the greatest speedups while optimizing.) The syntax needs to be concise (so that you can view the entire algorithm) and simple (so that it can become second nature).
    2. Library Support Languages vary dramatically in terms of library support, and having the right linear algebra/graphics/IO library can make a task dramatically easier. Perl has a huge number of associated libraries which greatly ease use.
    3. Built in Functionality Languages differ in terms of the primitives that are available. Which of these primitives are useful is a subject of significant debate.
      1. Some form of generic programming (templates, polymorphism, etc…) seems important.
      2. Functions as first class objects are often very convenient while programming.
      3. Built-in lists and hash tables are often extremely useful primitives. One caveat here is that when you make a speed optimization pass, you often have to avoid these primitives.
      4. Support for modularity is important. Objects (as in object oriented programming) are one example of this, but there are many others. The essential quantity here seems to be an interface creation mechanism.
    4. Scalability Scalability is where otherwise higher level languages often break down. A simple example of this is a language with file I/O built in that fails to perform correctly when the file has size 2^31 or 2^32. I am particularly familiar with OCaml, which has the following scalability issues:
      1. List operations often end up consuming the stack and segfaulting. The Unison crew were annoyed enough by this that they created their own “safelist” library with all the same interfaces as the list type.
      2. Arrays on a 32 bit machine can have only 2^22 - 1 elements due to dynamic type information set aside for the garbage collector. As a substitute, there is a “big array” library. However, having big arrays, arrays, and strings often becomes annoying because they have different interfaces for objects which are semantically the same.
    5. Familiarity This isn’t just your familiarity, but also the familiarity of other people who might use the code. At one extreme, you can invent your own language (as Yann LeCun has done with Lush). At the other extreme, you can use a language which many people are familiar with such as C or Java.
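
As a rough illustration of the boxing and interpretation issues mentioned above (a sketch assuming NumPy is available; the actual ratio varies a great deal by machine and runtime): summing a list of boxed Python floats with the interpreter versus summing an unboxed float64 array in compiled code.

```python
import timeit
import numpy as np

n = 1_000_000
boxed = [float(i) for i in range(n)]      # a list of separately heap-allocated float objects
unboxed = np.arange(n, dtype=np.float64)  # one contiguous block of raw 64-bit floats

t_boxed = timeit.timeit(lambda: sum(boxed), number=10)       # interpreted loop over boxed floats
t_unboxed = timeit.timeit(lambda: unboxed.sum(), number=10)  # compiled loop over unboxed floats
print(f"boxed: {t_boxed:.3f}s   unboxed: {t_unboxed:.3f}s")
```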

The existing significantly used machine learning code bases seem to favor lower level languages.

  1. Weka is one of the best known general purpose machine learning toolkits. It is implemented in Java and has far more algorithmic coverage than any other system I know of.
  2. LibSVM is implemented in C++ and Java.
  3. SVMlight is implemented in C.
  4. C4.5 is implemented in C.
  5. SNNS is written in C.

Both R and Matlab are also useful languages, although I have not used them.

None of the language choices seem anywhere near ideal. The higher level languages often can’t execute fast, and the lower level ones which can are often quite clumsy. The approach I’ve taken is to first implement in a higher level language (my choice was OCaml) to test ideas and then reimplement in a lower level language (C or C++) for speed where the ideas work out. This reimplementation step is quite clumsy, and I would love to find a way to avoid it.