May 16 in Cambridge, is the New England Machine Learning Day, a first regional workshop/symposium on machine learning. To present a poster, submit an abstract by May 5.
Normally I do not blog, but John kindly invited me to do so. Since computability issues play a major role in Artificial Intelligence and Machine Learning, I would like to take the opportunity to comment on that and raise some questions.
The general attitude is that AI is about finding efficient smart algorithms. For large parts of machine learning, the same attitude is not too dangerous. If you want to concentrate on conceptual problems, simply become a statistician. There is no analogous escape for modern research on AI (as opposed to GOFAI rooted in logic).
Let me show by analogy why limiting research to computational questions is bad for any field.
Except in computer science, computational aspects play little role in the development of fundamental theories: Consider e.g. set theory with axiom of choice, foundations of logic, exact/full minimax for zero-sum games, quantum (field) theory, string theory, … Indeed, at least in physics, every new fundamental theory seems to be less computable than previous ones. Of course, once a subject has been formalized, further research (a) analyzes the structure of the theory and (b) tries to compute efficient approximations. Only in (b) do computational aspects play a role.
So my question is: Why are computational questions so prevalent in AI research? Here are some unconvincing arguments I’ve heard:
A) Because AI is a subfield of computer science, and the task of computer scientists is to find (efficient) algorithms for well-defined problems?
I think it does not do any (real-world) problem any good to confine it to computer science. Of course, philosophers and cognitive scientists also care about AI, but where are the mathematicians?
B) Because formalizing AI and finding efficient smart programs goes hand-in-hand? Separating these two issues would lead to no, or at best to results which are misleading or useless in the construction of intelligent machines?
I am not aware of any convincing argument that separating the issues of “axiomatizing a field” and “finding efficient solutions” will (likely) fail for AI. The examples above of other fields actually indicate the opposite. Of course, interaction is important to avoid both sides running wild. For instance, von Neumann’s minimax solution for games, albeit infeasible for most games, is the cornerstone of most practical approximations.
C) Because there is some deep connection between intelligence and computation which can not be disentangled?
Sure, you could say that intelligence is by definition about computationally efficient decision making. This is as unconvincing as argument (A). Pointing out that the human brain is a computational device is quite useful in many ways, but doesn’t proves (C) either. Of course, ultimately we want a “fast” smart algorithm. How is AI different from wanting a fast algorithm computing primes, which you derive from a non-algorithmic definition of primes; or drawing fractals?
D) Because AI is trivial if computational issues are ignored? All conceptual problems have already been solved?
Many have expressed ideas that some form of exhaustive search over all possible solutions and picking the “best” one does the job. This works essentially for exactly those problems that are well-defined. For instance, optimal minimax play of a zero-sum game or solving NP complete problems are conceptually trivial, i.e. if computation time is ignored. But in general AI and machine learning, there is not a universally agreed-upon objective function. The Turing test is informal (involves a human judge in the loop), maximizing expected reward (the true distribution is not known, so expectation w.r.t. to what?), etc. The AIXI model, briefly discussed at this blog, is the first complete and formal such criterion, for which, let me phrase it that way, no flaw has yet been identified. Shane Legg’s award-winning thesis gives an informal introduction and contains lots of discussion.
Conceptual and computational problems in AI should be studied jointly as well as separately, but the latter is not (yet) fashionable. When AI was more logic oriented, some good logicians helped develop the foundations of “deductive” AI. Where are the researchers giving modern “inductive” AI its foundation? I am talking about generic learning agents, not classifying i.i.d. data. Reinforcement learners? Well, most of the hard results are from adaptive control theorists, but it’s reassuring to see parts of these communities merging. It’s a pity that so few mathematicians are interested in AI. A field “mathematical AI” with the prestige of “mathematical physics” would be exciting. As a start: 40% of the COLT & ALT papers on generic learning agents, 30% induction, 20% time-series forecasting, 10% i.i.d. Currently it’s reversed.
Several talks seem potentially interesting to ML folks at this year’s SODA.
- Maria-Florina Balcan, Avrim Blum, and Anupam Gupta, Approximate Clustering without the Approximation. This paper gives reasonable algorithms with provable approximation guarantees for k-median and other notions of clustering. It’s conceptually interesting, because it’s the second example I’ve seen where NP hardness is subverted by changing the problem definition subtle but reasonable way. Essentially, they show that if any near-approximation to an optimal solution is good, then it’s computationally easy to find a near-optimal solution. This subtle shift bears serious thought. A similar one occurred in our ranking paper with respect to minimum feedback arcset. With two known examples, it suggests that many more NP-complete problems might be finessed into irrelevance in this style.
- Yury Lifshits and Shengyu Zhang, Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates, and Small-World Design. The basic idea of this paper is that actually creating a metric with a valid triangle inequality inequality is hard for real-world problems, so it’s desirable to have a datastructure which works with a relaxed notion of triangle inequality. The precise relaxation is more extreme than you might imagine, implying the associated algorithms give substantial potential speedups in incomparable applications. Yuri tells me that a cover tree style “true O(n) space” algorithm is possible. If worked out and implemented, I could imagine substantial use.
- Elad Hazan and Satyen Kale Better Algorithms for Benign Bandits. The basic idea of this paper is that in real-world applications, an adversary is less powerful than is commonly supposed, so carefully taking into account the observed variance can yield an algorithm which works much better in practice, without sacrificing the worst case performance.
- Kevin Matulef, Ryan O’Donnell, Ronitt Rubinfeld, Rocco Servedio, Testing Halfspaces. The basic point of this paper is that testing halfspaces is qualitatively easier than finding a good half space with respect to 0/1 loss. Although the analysis is laughably far from practical, the result is striking, and it’s plausible that the algorithm works much better than the analysis. The core algorithm is at least conceptually simple: test that two correlated random points have the same sign, with “yes” being evidence of a halfspace and “no” not.
- I also particularly liked Yuval Peres‘s invited talk The Unreasonable Effectiveness of Martingales. Martingale’s are endemic to learning, especially online learning, and I suspect we can tighten and clarify several arguments using some of the techniques discussed.
I’ve had serious conversations with several people who believe that the theory in machine learning is “only useful for getting papers published”. That’s a compelling statement, as I’ve seen many papers where the algorithm clearly came first, and the theoretical justification for it came second, purely as a perceived means to improve the chance of publication.
Naturally, I disagree and believe that learning theory has much more substantial applications.
Even in core learning algorithm design, I’ve found learning theory to be useful, although it’s application is more subtle than many realize. The most straightforward applications can fail, because (as expectation suggests) worst case bounds tend to be loose in practice (*). In my experience, considering learning theory when designing an algorithm has two important effects in practice:
- It can help make your algorithm behave right at a crude level of analysis, leaving finer details to tuning or common sense. The best example I have of this is the Isomap, where the algorithm was informed by the analysis yielding substantial improvements in sample complexity over earlier algorithmic ideas.
- An algorithm with learning theory considered in it’s design can be more automatic. I’ve gained more respect for Rifkin’s claim: that the one-against-all reduction, when tuned well, can often perform as well as other approaches. The “when tuned well” caveat is however substantial, because learning algorithms may be applied by nonexperts or by other algorithms which are computationally constrained. A reasonable and worthwhile hope for other methods of addressing multiclass problems is that they are more automatic and computationally faster. The subtle issue here is: How do you measure “more automatic”?
In my experience, learning theory is most useful in it’s crudest forms. A good example comes in the architecting problem: how do you go about solving a learning problem? I mean this in the broadest sense imaginable:
- Is it a learning problem or not? Many problems are most easily solved via other means such as engineering, because that’s easier, because there is a severe data gathering problem, or because there is so much data that memorization works fine. Learning theory such as statistical bounds and online learning with experts helps substantially here because it provides guidelines about what is possible to learn and what not.
- What type of learning problem is it? Is it a problem where exploration is required or not? Is it a structured learning problem? A multitask learning problem? A cost sensitive learning problem? Are you interested in the median or the mean? Is active learning useable or not? Online or not? Answering these questions correctly can easily make a difference between a succesful application and not. Answering these questions is partly definition checking, and since the answer is often “all of the above”, figuring out which aspect of the problem to address first or next is helpful.
- What is the right learning algorithm to use? Here the relative capacity of a learning algorithm and it’s computational efficiency are most important. If you have few features and many examples, a nonlinear algorithm with more representational capacity is a good idea. If you have many features and little data, linear representations or even exponentiated gradient style algorithms are important. If you have very large amounts of data, the most scalable algorithms (so far) use a linear representation. If you have little data and few features, a Bayesian approach may be your only option. Learning theory can help in all of the above by quantifying “many”, “little”, “most”, and “few”. How do you deal with the overfitting problem? One thing I realized recently is that the overfitting problem can be a concern even with very large natural datasets, because some examples are naturally more important than others.
As might be clear, I think of learning theory as somewhat broader than might be traditional. Some of this is simply education. Many people have only been exposed to one piece of learning theory, often VC theory or it’s cousins. After seeing this, they come to think of it as learning theory. VC theory is a good theory, but it is not complete, and other elements of learning theory seem at least as important and useful. Another aspect is publishability. Simply sampling from the learning theory in existing papers does not necessarily give a good distribution of subjects for teaching, because the goal of impressing reviewers does not necessarily coincide with the clean simple analysis that is teachable.
(*) There is significant investigation into improving the tightness of bounds to the point of usefulness, and maybe it will pay off.
Claude Sammut is attempting to put together an Encyclopedia of Machine Learning. I volunteered to write one article on Efficient RL in MDPs, which I would like to invite comment on. Is something critical missing?
This workshop asks for insights how far we may/can push the theoretical boundary of using data in the design of learning machines. Can we express our classification rule in terms of the sample, or do we have to stick to a core assumption of classical statistical learning theory, namely that the hypothesis space is to be defined independent from the sample? This workshop is particularly interested in – but not restricted to – the ‘luckiness framework’ and the recently introduced notion of ‘compatibility functions’ in a semi-supervised learning context (more information can be found at http://www.kuleuven.be/wehys).
Claire asked me to be on the SODA program committee this year, which was quite a bit of work.
I had a relatively light load—merely 49 theory papers. Many of these papers were not on subjects that I was expert about, so (as is common for theory conferences) I found various reviewers that I trusted to help review the papers. I ended up reviewing about 1/3 personally. There were a couple instances where I ended up overruling a subreviewer whose logic seemed off, but otherwise I generally let their reviews stand.
There are some differences in standards for paper reviews between the machine learning and theory communities. In machine learning it is expected that a review be detailed, while in the theory community this is often not the case. Every paper given to me ended up with a review varying between somewhat and very detailed.
I’m sure not every author was happy with the outcome. While we did our best to make good decisions, they were difficult decisions to make. For example, if there is a well written paper on an interesting topic which analyzes a flawed abstraction of the topic, should it get in? I would rate this a ‘weak accept’.
Here are some observations/thoughts about the process (Several also appear in Claire’s report).
- Better feedback isn’t too hard. The real time sink in reviewing a theory paper is reading it. Leaving a few comments, even if just “I don’t like the model analyzed because it misses important feature X.” is relatively easy. My impression of the last COLT was that COLT had entirely switched from minimal author feedback to substantial author feedback. This year’s SODA was somewhere inbetween, depending on the PC member involved, which is a definite trend towards stronger comments for SODA.
- Normalization There were very substantial differences amongst the PC members in what fraction of papers they wanted to accept, and this leaked into the final decisions. Normalizing reviewer ratings is standard operating procedure at some machine learning conferences, so I helped with that. Even with that help, further efforts at normalization in the future seem like they could help, for example in getting the decision on the paper above right.
- Ordering There were various areas where we tried to order all the reasonable papers and make a decision based on the ordering. Where the papers are sufficiently related, I think this is very helpful, and the act even changed my opinion on some papers a bit by putting them in better context. Not everyone imposed the same ordering, because there are somewhat different tastes: Do you care about the techniques used? (A traditional theory concern) or about the quality of the result? (I’m more focused here.) Nevertheless, it helped reduce the noise. Incidentally, there is substantial theoretical evidence that decisions by ordering are more robust than decisions by absolute score producing an ordering.
- Writing quality I was surprised by the poor writing quality of some SODA papers—several were basically not readable without a thorough understanding of referenced papers, and a substantial ability to infer what was meant rather than what was said. Some of these papers were accepted, which would have been impossible in a conference with double-blind reviewing.
- PC size The tradition in theory conferences is to have a relatively small program committee. I don’t see much advantage to this for SODA. The program committe is small enough and SODA is broad enough that it seems dubious to claim that every PC member is an expert on the subject of all of their papers. Also, (frankly) the highest quality reviews from my batch of papers weren’t written by me, but rather by reviewers that I picked who had the time to really grind through all the nitty-gritty of the paper. It’s easy to imagine that a larger PC would improve reviewing quality by avoiding overload.
Reviewers and students are sometimes greatly concerned by the distinction between:
- An open set and a closed set.
- A Supremum and a Maximum.
- An event which happens with probability 1 and an event that always happens.
I don’t appreciate this distinction in machine learning & learning theory. All machine learning takes place (by definition) on a machine where every parameter has finite precision. Consequently, every set is closed, a maximal element always exists, and probability 1 events always happen.
The fundamental issue here is that substantial parts of mathematics don’t appear well-matched to computation in the physical world, because the mathematics has concerns which are unphysical. This mismatched mathematics makes irrelevant distinctions. We can ask “what mathematics is appropriate to computation?” Andrej has convinced me that a pretty good answer to this question is constructive mathematics.
So, here’s a basic challenge: Can anyone name a situation where any of the distinctions above (or similar distinctions) matter in machine learning?