What can Type Theory teach us about Machine Learning?

This post is some combination of belaboring the obvious and speculating wildly about the future. The basic issue to be addressed is how to think about machine learning in terms given to us from Programming Language theory.

Types and Reductions

John’s research programme (I feel this should be in British spelling to reflect the grandiousness of the idea…) of machine learning reductions StateOfReduction is at some essential level type-theoretic in nature. The fundamental elements are the classifier, a function f: alpha -> beta, and the corresponding classifier trainer g: List of (alpha,beta) -> (alpha -> beta). The research goal is to create *combinators* that produce new f’s and g’s given existing ones. John (probably quite rightly) seems unwilling at the moment to commit to any notion stronger than these combinators are correctly typed. One way to see the result of a reduction is something typed like: (For those denied the joy of the Hindly-Milner type system, “simple” is probably wildly wrong.)

g’: List of (gamma,delta) -> (List of (alpha,beta) -> (alpha -> beta)) -> (gamma -> delta)

Perhaps another is to think of the reduction itself type-theoretically as a combinator that takes something typed like g above and spits out a new classifier trainer. (I.e. a reduction “lifts” a lower-level function up to operate on some type it wasn’t originally designed for.)

Many of things John considers reductions have particularly error and sample complexity preserving properties. For instance, a reduction may have the property that small regret of a low level classifier implies small regret for the result of the reduction. There are a number of interesting issues here:

* Can we “type” these properties so a compiler might check them?
* Is this enough? Some reductions (including, I would argue, some that have been discussed seriously) are “trivial” reductions in the sense that the antecedent never holds. That is, the are combinators that setup learning problems impossible to solve. Can we somehow type this kind of thing so we can separate good from bad reductions, where we try to define bad as meaning something like “creates impossible subproblems where it was possible to do otherwise”?

RLBench hints at the power that formalizing interfaces for things like reductions can be. You can chain together powerful comands that leverage existing classifier trainers to learn in a trace-generating model which is adapted (again essentially a combinator library) from a generative model. Unfortunately, RLBench operates at the command line which is poorly designed for this kind of composition.

Types and Probabilistic Models

Arguably the graphical models/Bayesian community has just as grandiose plans as John, but here the reductions are to learning probabilities in the sense “beliefs”. Graphical models form a language that allows us to compose together little subgraphs that express our beliefs about some subsystem. Much research has gone into allowing the modeling language to drive inference as well. It’s interesting to explore what type theory can tell us about this as well.

Existing work

Lloyd Allison has taken thoughts like this quite seriously in his relatively recent work, “Types and Classes in Machine Learning and Data Mining” . Allison uses Haskell to type a number of ideas including
MML and some classifiers, and can check many things statically. Unfortunately, ML is one of those parts of information science where we demand every computational advantage we can get. If we learn to solve some problem with acceptable speed, in no time it is the inner loop of some more sophisticated machine learning algorithm. Despite my apppreciation (well, mostly) of Haskell, it simply isn’t practical to write ML algorithms in Haskell. (Although
presumambly both Allison and …. would disagree with me.)

It may still make a huge amount of sense to think about things in something like the Haskell type system and then translate them to the capable (but gross) type system of, say C++. Understanding polymorphism and type classes and there relation with Machine Learning may be a real fundamental breakthrough to making ML widely useful.

Contributions flowing towards PL theory

Right now, when push comes to shove, all good interfaces between systems basically amount to invoking functions or closures. When
you get over “object oriented” and other such hype this makes sense for structuring work. What’s interesting is that recent machine learning and AI work *can’t* be expressed that way. I think of graphical models as tools for expressing future interfaces because they preserve uncertainty across boundaries. This seems to me where ML people can challenge PL people– help us design a new conception of “interface” that preserves uncertainty between layers. (Say, that
passes probability or “belief messages” back and forth.) Perhaps the probabalistic machinery already exist: we can always define “sampling interfaces” between systems. My guess is that these interfaces are basically multi-directional (unlike functional interfaces). Why? Say I want a program to understand speech and I build a layered system that consists of “signaling processing”, “phoneme recognition”, “word recognition”, “language modeling”, and some form of “semantic understanding”. I can resolve something ambigious about, say a phoneme, by understanding the higher level language model. To get the phoneme parsing right I have to feedback the results from language layer. In this sense, interfaces need to preserve uncertainty and probably pass information both ways.

How to start (small)

I think a serious effort should be made to explain things like reductions in terms of PL theory– even if it means that, like Allison, you have to do some explaining type systems first. I’d love to spend some time hashing some of this out with PL people. (If only there were more hours in the day!)

We should write our libraries to have super-clean functional interfaces and make them as (parametrically) polymorphic as reasonable.

Any other thoughts on ways to proceed on this in the short term?

5 Replies to “What can Type Theory teach us about Machine Learning?”

John Langford says:

3/30/2005 at 8:43 pm

I think it’s reasonably easy to type-check reductions at the “alpha” and “beta” levels. I don’t have a real clue about how to type check the error-limiting property or the regret minimizing property. One basic difficulty is that such a type would have to encode a n optimization direction. For example, if we consider a reduction from importance weighted classification (beta’ = {0,1} x R) to binary classification (beta = {0,1}), the type must exclude the possibility that the importance weighted testing procedure negates the prediction of a base classifier. I don’t know how to encode this directionality in a type (but maybe it is possible).

This problem of “reductions might create hard subproblems” is a difficult one. What the mathematics precludes is information-theoretically hard subproblems, but it does not preclude cryptographically hard subproblems, and I don’t see a good way around that.

RLBench still needs work.

I do not understand this: “I think of graphical models as tools for expressing future interfaces because they preserve uncertainty across boundaries.” Can you explain further?

In the short term, I want to create a library of reductions code. This short term plan has been frustrated by distractions for several months.
Andrej Bauer says:

3/31/2005 at 8:42 am

A little bird told me about this discussion. Hi, John.

First I will say that I know next to nothing about machine learning
(perhaps John should tell me what book to read). However, I have spent
considerable time thinking about computable mathematics (“how can we
combine math and programming so that they do not get in each other’s
way?) and it seems to me that you are talking about a special instance
of it. Perhaps you will find my comments useful.

Almost all misconceptions about computing with mathematical entities
(such as real numbers, probability distributions, continuous
functions, etc.) come from the fact that people are trained in
classical logic rather than constructive logic. This however is
another story, so I will just limit myself to one particular instance
of this phenomenon, relevant to the current topic.

Suppose you know how to implement a certain set S of
mathematical objects as a datatype T. Then you narrow your
interest to a specific subset S’ of S and ask yourself
what datatype to use to represent S’. To anyone trained in
classical logic it might seem natural that the elements of S’
should be represented in the same way as the elements of S,
since they are elements of S. However, such a
representation would forget the fact that the elements of S’
have a distinguished property, namely that they are the
elements of S’. This is a source of many misconceptions and
mistakes in design of data structures.

To illuminate what I just said, let us consider an example. Take the
set S of all finite simple directed graphs whose vertices are
labeled by integers. We could represent the elements of S as
pairs (v,e) where v is a list of vertices and e
is a list of edges. In this case the representing datatype for
S is int list * (int * int) list. Now let us
consider the subset S’ of Hamiltonian graphs, i.e., those
graphs that have a cycle passing through each vertex exactly once. As
is well-known, Hamiltonian cycles are not all that easy to compute. It
would make sense to represent Hamiltonian graphs not just as pairs
(v,e) but rather triples (v,e,h) where v and
e are as before, and h is a Hamiltonian cycle. This
would have obvious benefits for performance in any algorithm that
actually uses the fact that a graph is Hamiltonian. Thus the
representing datatype for S’ is int list * (int * int) list * int list. It differs from the original representing
datatype in the fact that it carries more information.

It could be argued in the above example that the extra information
about the Hamiltonian cycle is not strictly necessary since it can be
computed (albeit at a high cost) from v and e. However,
there are (realistic) examples where the missing information cannot be
reconstructed, but I do not want to get into that right now.

Let us now apply this to datatypes in machine learning. Drew pointed
out that a classifier in general is a function a -> b. A
classifier trainer then has the type (a * b) list -> a >
b. If this were really so, it would mean that apart from knowing
that a classifier is a function there is no other information
about it. But without any extra information there is little machine
learning we can do here. Is it not usually the case that both
classifiers and trainers have some notion of confidence, or error
estimation? If so, this should be recorded in the underlying types. I
can well imagine that certain machine learning algorithms can take
advantage of the fact that certain other algorithms produce estimates
about how well they perform.

I guess my point here is that classifiers and trainers do not
necessarily have such simple types as Drew indicated.

There is an automatic way of translating mathematical structures to
their underlying datatypes. It would be interesting to see how
classifiers translate. But first we would have to know what the domain
and codomain of a classifier is. Is it always a metric space, for
example? Is a classifier really just a (special kind of) function, or
does every classifier have a measure of how well it performs? If it
does, they types should reflect this.

P.S. John, why do you despair about most real numbers being
non-computable but the fact that almost all functions are
non-computable does not seem to prevent you from implementing them.
You are right about floating-point being evil, of course, but I am not
sure you are right for the right reason.
Drew Bagnell says:

4/2/2005 at 10:29 am

Hi John,

This problem of “reductions might create hard subproblems” is a difficult one. What the mathematics precludes is information-theoretically hard subproblems, but it does not preclude cryptographically hard subproblems, and I donâ€™t see a good way around that.

The combinatorial difficulty is one thing. I believed at some point in the past that that, for instance, RLTrace could generate information-theoretically hard classification problems. That seems much worse behavior to me. (I haven’t read the draft in a long time and I’m operating from memory though…)

I do not understand this: â€œI think of graphical models as tools for expressing future interfaces because they preserve uncertainty across boundaries.â€ Can you explain further?

Thanks for following up on this one: the comment is rather mysterious. There’s two things I’m trying to get here. The first is that we basically build
systems but making “interfaces” between sub-systems. These interfaces
are essentially functional and aren’t uncertainty preserving. I.e. systems are
built that make hard decisions at one level and pass that result up to another level. Speech recognition is a good example of this behavior, with a distinct hierarchy of processing. This is pretty much the only way we know how to build big systems that do interesting things.

On the other hand, we have a rich (and in an important sense, canonical) language for modeling uncertainty: probability theory. Graphical models present a way to structure this uncertainty– even to allow us to build up layers. A very nice paper that does this is Andrew McCallum and colleagues. (A sort-of discriminative factorial HMM.) Unfortunately, research with such models basically makes one giant structure. That throws away to the sort of abstractions we’d like to build to make big systems work. Graphical models and “message passing” inference have the potential to help us define interfaces themselves. That is, they can serve as the uncertainty-preserving “glue” so that each of these pieces can be implemented separately. I think this requires really thinking about the PL and software engineering implications though, as right now we have to implement it all in one go.

Hi Andrej! Thanks for your thoughts. Any chance you’ll be in the ‘burgh in the near future?

(“how can we
combine math and programming so that they do not get in each otherâ€™s
way?) and it seems to me that you are talking about a special instance
of it.

This might really be it. I’ve been quite intrigued by constructive mathematics in general
and particularly your own work.

On a side note though, I’ve found non-constructive mathematics quite useful in general. Have you looked much at non-standard analysis? I once heard it described as “constructive modulo an ultrafilter”. It gives you the happy feeling of understanding the constructions in analysis (or probability where it most interests me) but without the feeling of giving up useful mathematics. I suppose this conversation needs some kind of side-bar.

Your question about the type of a classifier is interesting. I presented it that way to make it “totally general”. There are multiple frameworks for thinking about machine learning and it’s not clear to me that they would agree as to what the side information should be. (For instance, the Bayesian would rather give you a posterior distribution over functions rather than just the function. Or at least return probabilities instead of raw classifications.) I’m sure your right about my typing being naive, although perhaps it’s force on me by wanting to appeal to multiple frameworks. Does this make any sense?
John Langford says:

4/4/2005 at 10:12 pm

RLTrace could generate information-theoretically hard classification problems.

Not coincidentally, we don’t know a good global performance guarantee for RLTrace.

Is it always a metric space, for example? Is a classifier really just a (special kind of) function, or
does every classifier have a measure of how well it performs?

No, classifiers don’t always operate in metric spaces. Every classifier has a measure of how well it performs, but that measure is typically not evaluatable, except after the fact.

My general impression is that the extra information you want to put into the type is not easily available. For any individual functional form, you can say “the classifier is a decision tree”, “the classifier is a half plane”, etc… But, I don’t know good additional type information for any classifier.
Allan Erskine says:

4/14/2005 at 12:02 am

A nice survey, and an interesting contribution, to PL-theory generalisations of graphical models is Avi Pfeffer’s ‘Stochastic Lambda Calculus’ paper (found here)

The paper itself is light on type theory, but you can get an idea of how such a language might be augmented with a language (O’Caml)’s existing type system by downloading the source to his IBAL language.

Comments are closed.