What can Type Theory teach us about Machine Learning?

This post is some combination of belaboring the obvious and speculating wildly about the future. The basic issue to be addressed is how to think about machine learning in terms given to us by programming language theory.

Types and Reductions

John’s research programme (I feel this should be in British spelling to reflect the grandiosity of the idea…) of machine learning reductions StateOfReduction is at some essential level type-theoretic in nature. The fundamental elements are the classifier, a function f: alpha -> beta, and the corresponding classifier trainer g: List of (alpha,beta) -> (alpha -> beta). The research goal is to create *combinators* that produce new f’s and g’s given existing ones. John (probably quite rightly) seems unwilling at the moment to commit to any notion stronger than that these combinators are correctly typed. One way to see the result of a reduction is something typed like the following. (For those denied the joy of the Hindley-Milner type system, this will probably look wildly wrong.)

g’: List of (gamma,delta) -> (List of (alpha,beta) -> (alpha -> beta)) -> (gamma -> delta)

Perhaps another is to think of the reduction itself type-theoretically as a combinator that takes something typed like g above and spits out a new classifier trainer. (I.e. a reduction “lifts” a lower-level function up to operate on some type it wasn’t originally designed for.)
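To make the typing concrete, here is a minimal Haskell sketch of these types, with a hypothetical one-vs-all combinator as the example reduction; the names and the tie-breaking rule are invented for illustration, not taken from any existing library:

    type Classifier a b = a -> b
    type Trainer    a b = [(a, b)] -> Classifier a b

    -- A reduction lifts a trainer for one problem to a trainer for another.
    type Reduction a b c d = Trainer a b -> Trainer c d

    -- Hypothetical example: reduce k-class classification (labels 0..k-1)
    -- to binary classification, one-vs-all style.
    oneVsAll :: Int -> Reduction a Bool a Int
    oneVsAll k binaryTrainer examples = \x ->
      let classifiers = [ binaryTrainer [ (a, b == i) | (a, b) <- examples ]
                        | i <- [0 .. k - 1] ]
          votes       = [ i | (i, c) <- zip [0 ..] classifiers, c x ]
      in case votes of
           (i : _) -> i     -- first positive vote wins (arbitrary tie-breaking)
           []      -> 0     -- no classifier fired: fall back to label 0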

Many of the things John considers reductions have particular error- and sample-complexity-preserving properties. For instance, a reduction may have the property that small regret of a low-level classifier implies small regret for the result of the reduction. There are a number of interesting issues here:

* Can we “type” these properties so a compiler might check them?
* Is this enough? Some reductions (including, I would argue, some that have been discussed seriously) are “trivial” reductions in the sense that the antecedent never holds. That is, they are combinators that set up learning problems impossible to solve. Can we somehow type this kind of thing so we can separate good from bad reductions, where we try to define bad as meaning something like “creates impossible subproblems where it was possible to do otherwise”? (A speculative sketch of how such typing might start appears after this list.)
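Here is one speculative Haskell sketch of what “typing” a guarantee might look like, using a phantom type tag. Everything here is invented: the tag is assigned by hand after a pencil-and-paper proof, so the compiler tracks the bookkeeping rather than proving the property itself:

    data RegretPreserving        -- empty types used only as type-level tags
    data Unverified

    newtype TaggedTrainer tag a b =
      TaggedTrainer ([(a, b)] -> a -> b)

    -- Composition is only well-typed when both reductions carry the
    -- regret-preserving tag, so an unverified (possibly "trivial")
    -- reduction cannot sneak into a checked pipeline.
    composePreserving
      :: (TaggedTrainer RegretPreserving a b -> TaggedTrainer RegretPreserving c d)
      -> (TaggedTrainer RegretPreserving c d -> TaggedTrainer RegretPreserving e f)
      -> TaggedTrainer RegretPreserving a b
      -> TaggedTrainer RegretPreserving e f
    composePreserving r s = s . r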

RLBench hints at how powerful formalizing interfaces for things like reductions can be. You can chain together powerful commands that leverage existing classifier trainers to learn in a trace-generating model which is adapted (again, essentially a combinator library) from a generative model. Unfortunately, RLBench operates at the command line, which is poorly suited to this kind of composition.

Types and Probabilistic Models

Arguably the graphical models/Bayesian community has plans just as grandiose as John’s, but here the reductions are to learning probabilities in the sense of “beliefs”. Graphical models form a language that allows us to compose together little subgraphs that express our beliefs about some subsystem. Much research has gone into allowing the modeling language to drive inference as well. It’s interesting to explore what type theory can tell us here, too.
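As a small illustration of composing beliefs with types, here is a minimal discrete probability monad in Haskell (a well-known construction; the sensor example and its numbers are made up):

    newtype Dist a = Dist { runDist :: [(a, Double)] }

    instance Functor Dist where
      fmap f (Dist xs) = Dist [ (f x, p) | (x, p) <- xs ]

    instance Applicative Dist where
      pure x = Dist [(x, 1)]
      Dist fs <*> Dist xs = Dist [ (f x, p * q) | (f, p) <- fs, (x, q) <- xs ]

    instance Monad Dist where
      Dist xs >>= f = Dist [ (y, p * q) | (x, p) <- xs, (y, q) <- runDist (f x) ]

    coin :: Double -> Dist Bool
    coin p = Dist [(True, p), (False, 1 - p)]

    -- A noisy sensor reads the true state correctly 90% of the time.
    sensor :: Bool -> Dist Bool
    sensor s = coin (if s then 0.9 else 0.1)

    -- Composing the prior subsystem with the sensor subsystem yields the
    -- marginal belief over the reading: P(True) = 0.7*0.9 + 0.3*0.1.
    reading :: Dist Bool
    reading = coin 0.7 >>= sensor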

Existing work

Lloyd Allison has taken thoughts like this quite seriously in his relatively recent work, “Types and Classes in Machine Learning and Data Mining”. Allison uses Haskell to type a number of ideas including MML and some classifiers, and can check many things statically. Unfortunately, ML is one of those parts of information science where we demand every computational advantage we can get. If we learn to solve some problem with acceptable speed, in no time it is the inner loop of some more sophisticated machine learning algorithm. Despite my appreciation (well, mostly) of Haskell, it simply isn’t practical to write ML algorithms in Haskell. (Although presumably both Allison and …. would disagree with me.)

It may still make a huge amount of sense to think about things in something like the Haskell type system and then translate them to the capable (but gross) type system of, say, C++. Understanding polymorphism and type classes and their relation to machine learning may be a real fundamental breakthrough in making ML widely useful.
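For a flavor of what type classes might buy us, here is a sketch of a learner interface in Haskell; the class and instance names are invented, not any existing library’s API:

    import Data.List (minimumBy)
    import Data.Ord (comparing)

    -- A learner interface: parametrically polymorphic in inputs a and
    -- labels b, with the type class dispatching on the model itself.
    class Learner model where
      train   :: model a b -> [(a, b)] -> model a b
      predict :: model a b -> a -> b

    -- One instance: 1-nearest-neighbor over any user-supplied metric.
    data NearestNeighbor a b = NearestNeighbor (a -> a -> Double) [(a, b)]

    instance Learner NearestNeighbor where
      train (NearestNeighbor d xs) ys  = NearestNeighbor d (xs ++ ys)
      predict (NearestNeighbor d xs) x =
        snd (minimumBy (comparing (d x . fst)) xs)
        -- (errors on an untrained model; acceptable for a sketch)

The point is that wildly different learners can share one calling convention while staying fully polymorphic in their inputs and labels.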

Contributions flowing towards PL theory

Right now, when push comes to shove, all good interfaces between systems basically amount to invoking functions or closures. When you get over “object oriented” and other such hype, this makes sense for structuring work. What’s interesting is that recent machine learning and AI work *can’t* be expressed that way. I think of graphical models as tools for expressing future interfaces because they preserve uncertainty across boundaries. This seems to me where ML people can challenge PL people: help us design a new conception of “interface” that preserves uncertainty between layers. (Say, one that passes probability or “belief messages” back and forth.) Perhaps the probabilistic machinery already exists: we can always define “sampling interfaces” between systems. My guess is that these interfaces are basically multi-directional (unlike functional interfaces). Why? Say I want a program to understand speech, and I build a layered system consisting of “signal processing”, “phoneme recognition”, “word recognition”, “language modeling”, and some form of “semantic understanding”. I can resolve something ambiguous about, say, a phoneme by understanding the higher-level language model. To get the phoneme parsing right, I have to feed back the results from the language layer. In this sense, interfaces need to preserve uncertainty and probably pass information both ways.
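As a toy illustration of an interface that passes belief both ways rather than forcing early commitment, consider the following Haskell fragment; the words, weights, and the crude language model are all invented:

    type Belief a = [(a, Double)]

    normalize :: Belief a -> Belief a
    normalize b = [ (x, p / total) | (x, p) <- b ]
      where total = sum (map snd b)

    -- Upward message: hypotheses from the acoustic/phoneme layer alone.
    acoustic :: Belief String
    acoustic = [("pat", 0.40), ("bat", 0.35), ("bad", 0.25)]

    -- Downward message: how plausible each word is in context.
    languageModel :: String -> Double
    languageModel w = if w == "bat" then 0.8 else 0.1

    -- The interface multiplies the two messages instead of making the
    -- low level commit to a single answer prematurely.
    resolved :: Belief String
    resolved = normalize [ (w, p * languageModel w) | (w, p) <- acoustic ]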

How to start (small)

I think a serious effort should be made to explain things like reductions in terms of PL theory, even if it means that, like Allison, you have to do some explaining of type systems first. I’d love to spend some time hashing some of this out with PL people. (If only there were more hours in the day!)

We should write our libraries to have super-clean functional interfaces and make them as (parametrically) polymorphic as reasonable.

Any other thoughts on ways to proceed on this in the short term?

Academic Mechanism Design

From game theory, there is a notion of “mechanism design”: setting up the structure of the world so that participants have some incentive to do sane things (rather than obviously counterproductive things). Application of this principle to academic research may be fruitful.

What is misdesigned about academic research?

  1. The JMLG guides give many hints.
  2. The common nature of bad reviewing also suggests the system isn’t working optimally.
  3. There are many ways to experimentally “cheat” in machine learning.
  4. Funding Prisoner’s Dilemma. Good researchers often write grant proposals for funding rather than doing research. Since the pool of grant money is finite, this means that grant proposals are often rejected, implying that more must be written. This is essentially a “prisoner’s dilemma”: anyone not writing grant proposals loses, but the entire process of doing research is slowed by the distraction. If everyone wrote half as many grant proposals, roughly the same distribution of funding would occur, and time would be freed for more research.

Mechanism design is not that easy: many counterintuitive effects can occur. Academic mechanism design is a particularly difficult problem because there are many details. Nevertheless, it may be worthwhile because it’s hard to overestimate the value of an improvement in the rate of useful research.

The good news is that not everything needs to be solved at once. For example, on the empirical side, if we set up an easy system allowing anyone to create challenges like KDDCup, we might achieve a better (i.e., less cheat-prone) understanding of what works and what does not.

The Role of Workshops

A good workshop is often far more interesting than the papers at a conference. This happens because a workshop has a much tighter focus than a conference. Since you choose the workshops fitting your interests, the increased relevance can greatly enhance your attention and engagement. Roughly speaking, a workshop program consists of elements related to a subject of your interest, while the main conference program consists of elements related to someone’s interest (which is rarely your own). Workshops are more about doing research, while conferences are more about presenting research.

Several conferences have associated workshop programs, some with deadlines due shortly.

* ICML workshops: due April 1
* IJCAI workshops: deadlines vary
* KDD workshops: not yet finalized

Anyone going to these conferences should examine the workshops and see if any are of interest. (If none are, then maybe you should organize one next year.)

Active learning

Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework.

Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {h_w}:

h_w(x) = 1 if x > w, and 0 otherwise.

VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (called the realizable case), then in order to get a classifier with error rate at most e, it is enough to draw m = O(1/e) random labeled examples from P, and to return any classifier consistent with them. But suppose we instead draw m unlabeled samples from P. If we lay these points down on the line, their hidden labels are a sequence of 0’s followed by a sequence of 1’s, and the goal is to discover the point w at which the transition occurs. This can be accomplished with a simple binary search which asks for just log m labels. Thus active learning gives us an exponential improvement in the number of labels needed: by adaptively querying log m labels, we can automatically infer the rest of them.
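Here is a sketch of that binary search in Haskell; 'oracle' stands in for the expensive labeling call, and the hidden threshold in the example is made up:

    import Data.List (sort)

    -- Find the smallest positively labeled point (the transition from
    -- 0's to 1's) among m sorted unlabeled points, using only O(log m)
    -- calls to the labeling oracle.
    firstPositive :: (Double -> Bool) -> [Double] -> Maybe Double
    firstPositive oracle points = go (sort points) Nothing
      where
        go [] best = best
        go xs best =
          let (ls, rs)       = splitAt (length xs `div` 2) xs
              (pivot : rest) = rs
          in if oracle pivot
               then go ls (Just pivot)   -- positive: look for an earlier one
               else go rest best         -- negative: look strictly later

    -- Example: hidden threshold w* = 0.3 over 1000 unlabeled points;
    -- about 10 label queries suffice. Prints "Just 0.301".
    main :: IO ()
    main = print (firstPositive (> 0.3)
                                [ fromIntegral i / 1000 | i <- [0 .. 999 :: Int] ])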

Unfortunately, not all that much is known beyond this toy example. To date, the single main theoretical result in the field is [FSST97]'s analysis of the query-by-committee (QBC) learning algorithm. In their model, the learner observes a stream of unlabeled data and makes spot decisions about whether or not to ask for a point’s label. They show that if the data is drawn uniformly from the surface of the d-dimensional unit sphere, and the hidden labels correspond perfectly to a homogeneous (i.e., through the origin) linear separator from this same distribution, then it is possible to achieve generalization error e after seeing O(d/e) points and requesting just O(d log 1/e) labels: an exponential improvement over the usual O(d/e) sample complexity of learning linear separators in a supervised setting. This remarkable result is tempered somewhat by the complexity of the QBC algorithm, which involves computing volumes of intermediate version spaces.

Some recent progress on active learning: [DKM05] show how a simple variant of the perceptron update can be used to achieve these same sample complexity bounds, in the same model. [D04] shows a variety of upper and lower bounds for active learning — for instance, if you allow linear separators which are non-homogeneous then in the above model the sample complexity necessarily shoots up to 1/e.

The theoretical terrain of active learning remains something of an unexplored wilderness. There has, however, been a lot of beautiful theory work (see [A02] for a roundup) on a related model in which the learner is allowed to synthesize query points, rather than simply choosing them from the pool of unlabeled data. This ran into some practical problems: [BL92] found that the resulting synthetic instances were often very difficult for a human to classify!

[A02] D. Angluin. Queries revisited.
[BL92] E. Baum and K. Lang. Query learning can work poorly when a human oracle is used.
[D04] S. Dasgupta. Analysis of a greedy active learning strategy.
[DKM05] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning.
[FSST97] Y. Freund, S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query-by-committee algorithm.