Universal Learning – Machine Learning (Theory)

5/8/20096/8/2009

Computability in Artificial Intelligence

Normally I do not blog, but John kindly invited me to do so. Since computability issues play a major role in Artificial Intelligence and Machine Learning, I would like to take the opportunity to comment on that and raise some questions.

The general attitude is that AI is about finding efficient smart algorithms. For large parts of machine learning, the same attitude is not too dangerous. If you want to concentrate on conceptual problems, simply become a statistician. There is no analogous escape for modern research on AI (as opposed to GOFAI rooted in logic).

Let me show by analogy why limiting research to computational questions is bad for any field.

Except in computer science, computational aspects play little role in the development of fundamental theories: Consider e.g. set theory with axiom of choice, foundations of logic, exact/full minimax for zero-sum games, quantum (field) theory, string theory, … Indeed, at least in physics, every new fundamental theory seems to be less computable than previous ones. Of course, once a subject has been formalized, further research (a) analyzes the structure of the theory and (b) tries to compute efficient approximations. Only in (b) do computational aspects play a role.

So my question is: Why are computational questions so prevalent in AI research? Here are some unconvincing arguments I’ve heard:

A) Because AI is a subfield of computer science, and the task of computer scientists is to find (efficient) algorithms for well-defined problems?

I think it does not do any (real-world) problem any good to confine it to computer science. Of course, philosophers and cognitive scientists also care about AI, but where are the mathematicians?

B) Because formalizing AI and finding efficient smart programs goes hand-in-hand? Separating these two issues would lead to no, or at best to results which are misleading or useless in the construction of intelligent machines?

I am not aware of any convincing argument that separating the issues of “axiomatizing a field” and “finding efficient solutions” will (likely) fail for AI. The examples above of other fields actually indicate the opposite. Of course, interaction is important to avoid both sides running wild. For instance, von Neumann’s minimax solution for games, albeit infeasible for most games, is the cornerstone of most practical approximations.

C) Because there is some deep connection between intelligence and computation which can not be disentangled?

Sure, you could say that intelligence is by definition about computationally efficient decision making. This is as unconvincing as argument (A). Pointing out that the human brain is a computational device is quite useful in many ways, but doesn’t proves (C) either. Of course, ultimately we want a “fast” smart algorithm. How is AI different from wanting a fast algorithm computing primes, which you derive from a non-algorithmic definition of primes; or drawing fractals?

D) Because AI is trivial if computational issues are ignored? All conceptual problems have already been solved?

Many have expressed ideas that some form of exhaustive search over all possible solutions and picking the “best” one does the job. This works essentially for exactly those problems that are well-defined. For instance, optimal minimax play of a zero-sum game or solving NP complete problems are conceptually trivial, i.e. if computation time is ignored. But in general AI and machine learning, there is not a universally agreed-upon objective function. The Turing test is informal (involves a human judge in the loop), maximizing expected reward (the true distribution is not known, so expectation w.r.t. to what?), etc. The AIXI model, briefly discussed at this blog, is the first complete and formal such criterion, for which, let me phrase it that way, no flaw has yet been identified. Shane Legg’s award-winning thesis gives an informal introduction and contains lots of discussion.

Conceptual and computational problems in AI should be studied jointly as well as separately, but the latter is not (yet) fashionable. When AI was more logic oriented, some good logicians helped develop the foundations of “deductive” AI. Where are the researchers giving modern “inductive” AI its foundation? I am talking about generic learning agents, not classifying i.i.d. data. Reinforcement learners? Well, most of the hard results are from adaptive control theorists, but it’s reassuring to see parts of these communities merging. It’s a pity that so few mathematicians are interested in AI. A field “mathematical AI” with the prestige of “mathematical physics” would be exciting. As a start: 40% of the COLT & ALT papers on generic learning agents, 30% induction, 20% time-series forecasting, 10% i.i.d. Currently it’s reversed.

9/18/20079/18/2007

It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

I have recently completed a 500+ page-book on MDL, the first comprehensive overview of the field (yes, this is a sneak advertisement ).
Chapter 17 compares MDL to a menagerie of other methods and paradigms for learning and statistics. By far the most time (20 pages) is spent on the relation between MDL and Bayes. My two main points here are:

In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. By some theorems due to Andrew Barron, MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). Hence, MDL must be different from Bayes!
In contrast to what has sometimes been claimed, practical MDL algorithms do have a subjective component (which in many, but not all cases, may be implemented by something similar to a Bayesian prior; the interpretation is different though; it is more similar to what has been called a “luckiness function” in the computational learning theory literature).

Both points are explained at length in the book (see esp page 544). Here I’ll merely say a bit more about the first.

MDL is always based on designing a universal code L relative to some given model M. Informally this is a code such that whenever some distribution P in M can be used to compress some data set well, then L will compress this data set well as well (I’ll skip the formal definition here). One method (but by no means the only method) for designing a universal code relative to model M is by taking some prior W on M and using the corresponding Shannon-Fano code, i.e. the code that encodes data z with length

L(z) = – log P_bayes(z),

where P_bayes(.) = \int P(.) d W(P) is the Bayesian marginal distribution for M relative to prior W. If M is parametric, then with just about any ‘smooth’ prior, the Bayesian code with lengths L(z) = – log P_bayes(z) leads to a reasonable universal code. But if M is nonparametric (infinite dimensional, such as in Gaussian process regression, or histogram density estimation with an arbitrary nr of components) then many priors which are perfectly fine according to Bayesian theory are ruled out by MDL theory. The reason is that for some P in M, the Bayesian codes based on such priors do not compress data sampled from P at all, even if the amount of data tends to infinity. One can formally prove that such Bayesian codes are not “universal” according to the standard definition of universality.

Now there exist two theorems by Andrew Barron (from 1991 and 1998, respectively) that directly connect data compression with frequentist statistical consistency. In essence, they imply that estimation based on universal codes must always be statistically consistent (the theorems also directly connect the convergence rates to the amount of compression obtained). For Bayesian inference, there exist various inconsistency results such as those by Diaconis and Freedman (1986) and Barron (1998). These say that, for some nonparametric models M, and with some priors on M, Bayesian inference can be inconsistent, in the sense that for some P in M, if data are i.i.d. sampled from P then even with an infinite amount of data, the posterior puts all its mass on distributions P’ in M that are substantially different from the “true” P. By Barron’s theorems, something like this can never happen for MDL; Diaconis and Freedman use priors which are not allowed according to MDL theory. In fact, MDL-based reasoning can also motivate certain prior choices in nonparametric contexts. For example, if one has little prior knowledge, why would one adopt an RBF kernel in Gaussian process regression? Answer: because the corresponding code has excellent universal coding properties, as shown by Kakade, Seeger and Foster (NIPS 2005): it has only logarithmic coding overhead if the underlying data generating process satisfies some smoothness properties; many other kernels have polynomial overhead. Thus, Gaussian processes combined with RBF kernels lead to substantial compression of the data, and therefore, by Barron’s theorem, predictions based on such Gaussian processes converge fast to the optimal predictions that one could only make make if one had access to the unknown imagined “true” distribution.

In general, it is often thought that different priors on M lead to codes that better compress data for some P in M, and that worse compress data for other P in M. But with nonparametric contexts, it is not like that: then there exist priors with “universally good” and “universally bad” coding properties.

This is not to say that all’s well for MDL in terms of consistency: as John and I showed in a paper that appeared earlier this year (but is really much older), if the true distribution P is not contained in the model class M under consideration but contains a good approximation P’ in M then both MDL and Bayes may become statistically inconsistent in the sense that they don’t necessarily converge to P’ or any other good approximation of P.

Thus: if model M parametric and P in M , then MDL and Bayes consistent. If model M nonparametric and P in M, then MDL consistent, Bayes not necessarily so. If P not in M, then both Bayes and MDL may be inconsistent.

This leaves one more very important case: what if P is in the closure of M, but not in M itself? For example, M is the set of all Gaussian mixtures with arbitrarily many components, and P is not a Gaussian mixture, but can be arbitrarily well-approximated (in the sense of KL divergence) by a sequence of Gaussian mixtures with ever more components? In this case, Bayes will be consistent but it can be too slow, i.e. it needs more data before the posterior converges than some other methods (like leave-one-out-cross-validation combined with ML estimation). In our forthcoming NIPS 2007 paper, Steven de Rooij, Tim van Erven and I provide a universal-coding based procedure which converges faster than Bayes in those cases, but does not suffer from the disadvantages of leave-one-out-cross validation. Since the method is directly based on universal coding, I’m tempted to call it “MDL”, but the fact that nobody in the MDL community has thought about our idea before, makes me hesitate. When I talked about it to the famous Bayesian Jim Berger, I said “it’s MDL Jim, but not as we know it”.

6/19/2007

How is Compressed Sensing going to change Machine Learning ?

Compressed Sensing (CS) is a new framework developed by Emmanuel Candes, Terry Tao and David Donoho. To summarize, if you acquire a signal in some basis that is incoherent with the basis in which you know the signal to be sparse in, it is very likely you will be able to reconstruct the signal from these incoherent projections.

Terry Tao, the recent Fields medalist, does a very nice job at explaining the framework here. He goes further in the theory description in this post where he mentions the central issue of the Uniform Uncertainty Principle. It so happens that random projections are on average incoherent, within the UUP meaning, with most known basis (sines, polynomials, splines, wavelets, curvelets …) and are therefore an ideal basis for Compressed Sensing. [ For more in-depth information on the subject, the Rice group has done a very good job at providing a central library of papers relevant to the growing subject: http://www.dsp.ece.rice.edu/cs/ ]

The Machine Learning community has looked at Random Projections before, for instance:

Experiments with Random Projections for Machine Learning by Fradkin and Madigan (KDD-03.)

Face Recognition Experiments with Random Projection by Goel, Bebis and Nefian

Dimensionality reduction by random mapping: Fast similarity computation for clustering by S. Kaski (Proceedings of IEEE International Joint Conference on Neural Networks, 1998.)

but while they seem to give somewhat comparable results with regards to PCA, the number of contributions on the subject does not seem overwhelming. Maybe one of the reason is that in most papers cited above, the main theoretical reason for using Random projections lies with the Johnson-Lindenstrauss (JL) lemma. As a consequence, most random matrices used in these publications come from the Database world and not from the newer framework of Compressed Sensing (a list of these matrices and their properties can be found in the middle of this page). The uncanny reliance on Random projections within the JL lemma and in the Compressed Sensing setting was explained by Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin in this paper entitled: A simple proof of the restricted isometry property for random matrices. However, the most interesting fallout from this comparison between JL and CS comes in the form of the contribution by Richard Baraniuk and Michael Wakin in Random projections of smooth manifolds. I’ll let the abstract speak for itself:

We propose a new approach for nonadaptive dimensionality reduction of manifold-modeled data, demonstrating that a small number of random linear projections can preserve key information about a manifold-modeled signal……As our main theoretical contribution, we establish a sufficient number M of random projections to guarantee that, with high probability, all pairwise Euclidean and geodesic distances between points on M are well-preserved under the mapping \phi. Our results bear strong resemblance to the emerging theory of Compressed Sensing (CS), in which sparse signals can be recovered from small numbers of random linear measurements. As in CS, the random measurements we propose can be used to recover the original data in RN. Moreover, like the fundamental bound in CS, our requisite M is linear in the Ã¢â‚¬Å“information levelÃ¢â‚¬Â K and logarithmic in the ambient dimension N; we also identify a logarithmic dependence on the volume and curvature of the manifold. In addition to recovering faithful approximations to manifold-modeled signals, however, the random projections we propose can also be used to discern key properties about the manifold. We discuss connections and contrasts with existing techniques in manifold learning, a setting where dimensionality reducing mappings are typically nonlinear and constructed adaptively from a set of sampled training data.

It looks as though, as a result, Universal Dimensionality Reduction is achieved by some properly chosen random projections. In the case of data living in a low dimensional manifold, the JL lemma states that the number of random projections is proportional to the number of points or samples from that manifold, on the other hand, CS seems to show that the number of random projections is proportional to the characteristic of the manifold only.

The results highlighted by Wakin and Baraniuk are very compelling but there is another appealing reason to Random Projections: Robustness. While trying to mimick nature in the learning process, one cannot but be amazed at the reliability of the biological system. On the other hand, even the researchers that model these processes do not realize or point out that this robustness is in part due to random projections. Case in point, the excellent work of Thomas Serre, Aude Oliva and Tomaso Poggio culminating in a paper describing a biology inspired model of brain that shows its ability to process information in a feedforward fashion. The modeling is new in that, in this area of science, there is a central issue as to whether the brain works in a one-way process or with many loops. In the paper, the feature dimension reduction model (which is what this process is) uses random projections as I pointed out recently.

Because of the intrinsic dimension reduction capability, Mike Wakin has also shown efficient nearest neighbor searches using few random projections (see figure 3 of this paper). I could go on but the point is that since CS is a revolution in the world of signal/feature acquisition and processing (see the analog-to-information A2I site ) one cannot but wonder aloud how this will affect Machine Learning in general.

3/3/2007

All Models of Learning have Flaws

Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning.

The point here is not simply “woe unto us”. There are several implications which seem important.

The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes there “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students.
Algorithms which conform to multiple approaches can have substantial value. “I don’t really understand it yet, because I only understand it one way”. Reinterpretation alone is not the goal—we want algorithmic guidance.
We need to remain constantly open to new mathematical models of machine learning. It’s common to forget the flaws of the model that you are most familiar with in evaluating other models while the flaws of new models get exaggerated. The best way to avoid this is simply education.
The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.

Here is a summary what is wrong with various frameworks for learning. To avoid being entirely negative, I added a column about what’s right as well.

Name	Methodology	What’s right	What’s wrong
Bayesian Learning	You specify a prior probability distribution over data-makers, P(datamaker) then use Bayes law to find a posterior P(datamaker\|x). True Bayesians integrate over the posterior to make predictions while many simply use the world with largest posterior directly.	Handles the small data limit. Very flexible. Interpolates to engineering.	Information theoretically problematic. Explicitly specifying a reasonable prior is often hard. Computationally difficult problems are commonly encountered. Human intensive. Partly due to the difficulties above and partly because “first specify a prior” is built into framework this approach is not very automatable.
Graphical/generative Models	Sometimes Bayesian and sometimes not. Data-makers are typically assumed to be IID samples of fixed or varying length data. Data-makers are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist.	Relative to pure Bayesian systems, this approach is sometimes computationally tractable. More importantly, the graph language is natural, which aids prior elicitation.	Often (still) fails to fix problems with the Bayesian approach. In real world applications, true conditional independence is rare, and results degrade rapidly with systematic misspecification of conditional independence.
Convex Loss Optimization	Specify a loss function related to the world-imposed loss fucntion which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optima.	Mathematically clean solutions where computational tractability is partly taken into account. Relatively automatable.	The temptation to forget that the world imposes nonconvex loss functions is sometimes overwhelming, and the mismatch is always dangerous. Limited models. Although switching to a convex loss means that some optimizations become convex, optimization on representations which aren’t single layer linear combinations is often difficult.
Gradient Descent	Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters.	Relatively computationally tractable due to (a) modularity of gradient descent (b) directly optimizing the quantity you want to predict.	Finicky. There are issues with paremeter initialization, step size, and representation. It helps a great deal to have accumulated experience using this sort of system and there is little theoretical guidance. Overfitting is a significant issue.
Kernel-based learning	You chose a kernel K(x,x’) between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning.	People often find the specification of a similarity function between objects a natural way to incorporate prior information for machine learning problems. Algorithms (like SVMs) for training are reasonably practical—O(n²) for instance.	Specification of the kernel is not easy for some applications (this is another example of prior elicitation). O(n²) is not efficient enough when there is much data.
Boosting	You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor.	A focus on getting something that works quickly is natural. This approach is relatively automated and (hence) easy to apply for beginners.	The boosting framework tells you nothing about how to build that initial algorithm. The weak learning assumption becomes violated at some point in the iterative process.
Online Learning with Experts	You make many base predictors and then a master algorithm automatically switches between the use of these predictors so as to minimize regret.	This is an effective automated method to extract performance from a pool of predictors.	Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction.
Learning Reductions	You solve complex machine learning problems by reducing them to well-studied base problems in a robust manner.	The reductions approach can yield highly automated learning algorithms.	The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm.
PAC Learning	You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a near-best hypothesis amongst a given set of hypotheses in a computationally tractable manner.	The focus on computation is pretty right-headed, because we are ultimately limited by what we can compute.	There are not many substantial positive results, particularly when D is noisy. Data isn’t IID in practice anyways.
Statistical Learning Theory	You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a near-best hypothesis from a set of hypotheses.	There are substantially more positive results than for PAC Learning, and there are a few examples of practical algorithms directly motivated by this analysis.	The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness).
Decision tree learning	Learning is a process of cutting up the input space and assigning predictions to pieces of the space.	Decision tree algorithms are well automated and can be quite fast.	There are learning problems which can not be solved by decision trees, but which are solvable. It’s common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking.
Algorithmic complexity	Learning is about finding a program which correctly predicts the outputs given the inputs.	Any reasonable problem is learnable with a number of samples related to the description length of the program.	The theory literally suggests solving halting problems to solve machine learning.
RL, MDP learning	Learning is about finding and acting according to a near optimal policy in an unknown Markov Decision Process.	We can learn and act with an amount of summed regret related to O(SA) where S is the number of states and A is the number of actions per state.	Has anyone counted the number of states in real world problems? We can’t afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways.
RL, POMDP learning	Learning is about finding and acting according to a near optimaly policy in a Partially Observed Markov Decision Process	In a sense, we’ve made no assumptions, so algorithms have wide applicability.	All known algorithms scale badly with the number of hidden states.

This set is incomplete of course, but it forms a starting point for understanding what’s out there. (Please fill in the what/pro/con of anything I missed.)

10/7/200510/8/2005

On-line learning of regular decision rules

Many decision problems can be represented in the form
FOR n=1,2,…:
— Reality chooses a datum x_n.
— Decision Maker chooses his decision d_n.
— Reality chooses an observation y_n.
— Decision Maker suffers loss L(y_n,d_n).
END FOR.
The observation y_n can be, for example, tomorrow’s stock price and the decision d_n the number of shares Decision Maker chooses to buy. The datum x_n ideally contains all information that might be relevant in making this decision. We do not want to assume anything about the way Reality generates the observations and data.

Suppose there is a good and not too complex decision rule D mapping each datum x to a decision D(x). Can we perform as well, or almost as well, as D, without knowing it? This is essentially a special case of the problem of on-line learning.

This is a simple result of this kind. Suppose the data x_n are taken from [0,1] and L(y,d)=|y–d|. A norm ||h|| of a function h on [0,1] is defined by

||h||² = (Integral₀¹ h(t) dt)² + Integral₀¹ (h'(t))² dt. Decision Maker has a strategy that guarantees, for all N and all D with finite ||D||, Sum_n=1^N L(y_n,d_n) is at most Sum_n=1^N L(y_n,D(x_n)) + (||2D–1|| + 1) N^1/2. Therefore, Decision Maker is competitive with D provided the “mean slope” (Integral₀¹ (D'(t))²dt)^1/2 of D is significantly less than N^1/2. This can be extended to general reproducing kernel Hilbert spaces of decision rules and to many other loss functions.

It is interesting that the only known way of proving this result uses a non-standard (although very old) notion of probability. The standard definition (“Kolmogorov’s axiomatic“) is that probability is a normalized measure. The alternative is to define probability in terms of games rather than measure (this was started by von Mises and greatly advanced by Ville, who in 1939 replaced von Mises’s awkward gambling strategies with what he called martingales). This game-theoretic probability allows one to state the most important laws of probability as continuous strategies for gambling against the forecaster: the gambler becomes rich if the forecasts violate the law. In 2004 Akimichi Takemura noticed that for every such strategy the forecaster can play in such a way as not to lose a penny. Takemura’s procedure is simple and constructive: for any continuous law of probability we have an explicit forecasting strategy that satisfies that law perfectly. We call this procedure defensive forecasting. Takemura’s version was about binary observations, but it has been greatly extended since.

Now let us see how we can achieve the goal Sum_n=1 ^N L(y_n,d_n) < Sum_n=1^N L(y_n,D(x_n)) + (something small). This would be easy if we knew the true probabilities for the observations. At each step we would choose the decision with the smallest expected loss and the law of large numbers would allow us to prove that our goal would be achieved. Alas, we do not know the true probabilities (if they exist at all). But with defensive forecasting at our disposal we do not need them: the fake probabilities produced by defensive forecasting are ideal as far as the law of large numbers is concerned. The rest of the proof is technical details.