Is the Google way the way for machine learning?

Urs Hoelzle from Google gave an invited presentation at NIPS. In the presentation, he strongly advocates interacting with data in a particular scalable manner which is something like the following:

  1. Make a cluster of machines.
  2. Build a unified filesystem. (Google uses GFS, but NFS or other approaches work reasonably well for smaller clusters.)
  3. Interact with data via MapReduce.

Creating a cluster of machines is, by this point, relatively straightforward.

Unified filesystems are a little bit tricky—GFS is capable by design of essentially unlimited speed throughput to disk. NFS can bottleneck because all of the data has to move through one machine. Nevertheless, this may not be a limiting factor for smaller clusters.

MapReduce is a programming paradigm. Essentially, it is a combination of a data element transform (map) and an agreggator/selector (reduce). These operations are highly parallelizable and the claim is that they support the forms of data interaction which are necessary.
Apparently, the Nutch project has an open source implementation of mapreduce (but this is clearly the most nonstandard element).

Shifting towards this paradigm has several effects:

  1. It makes “big data” applications more viable.
  2. It makes some learning algorithms more viable than others. One way to think about this is in terms of statistical query learning algorithms. The (generalized) notion of statistical query algorithms is algorithms that rely upon only the results of expections of a (relatively small) number of functions. Any such algorithm can be implemented via mapreduce. The “naive bayes” algorithm and most decision tree algorithms can be easily phrased as statistical query algorithms. Support vector machines can (technically) be phrased as statistical query algorithms, but the number of queries scales with the number of datapoints. Gradient descent algorithms can also be phrased as statistical query algorithms. Learning algorithms which work on one example at a time are not generally statistical query algorithms.

    Another way to think about this is in terms of the complexity of the computation. Roughly speaking, as the amount of data scales, only O(n) or (perhaps) O(n log(n)) algorithms are tractable. This strongly favors online learning algorithms. Decision trees and naive bayes are (again) relatively reasonable. Support vector machines (or gaussian processes) encounter difficulties related to scaling.

There is a reasonable argument that the “low hanging fruit” of machine learning research is in the big data with enabling tools paradigm. This is because (a) the amount of data available has been growing far faster than the amount of computation and (b) we just haven’t had the tools to scale here, until recently.

I expect Urs is right: we should look in this direction.

The Webscience Future

The internet has significantly effected the way we do research but it’s capabilities have not yet been fully realized.

First, let’s acknowledge some known effects.

  1. Self-publishing By default, all researchers in machine learning (and more generally computer science and physics) place their papers online for anyone to download. The exact mechanism differs—physicists tend to use a central repository (Arxiv) while computer scientists tend to place the papers on their webpage. Arxiv has been slowly growing in subject breadth so it now sometimes used by computer scientists.
  2. Collaboration Email has enabled working remotely with coauthors. This has allowed collaborationis which would not otherwise have been possible and generally speeds research.

Now, let’s look at attempts to go further.

  1. Blogs (like this one) allow public discussion about topics which are not easily categorized as “a new idea in machine learning” (like this topic).
  2. Organization of some subfield of research. This includes Satinder Singh’s Reinforcement Learning pages, and, more generally books that have been placed online such as this one.
  3. Discussion Groups The kernel machines discussions provide a good example of some common format allowing discussion.
  4. Class notes have been placed online such as Avrim’s learning theory lecture notes.
  5. Wikipedia has an article on Machine Learning. The article gives a reasonable quick overview and is surely read by a very large number of people.
  6. Online Proceedings are now being used by several conferences such as NIPS.

Now, let’s consider some futures.

  1. Wikifuture Wikipedia becomes better to the point where it is a completely comprehensive listing of current research in machine learning. At some point, we-the-community realize this and begin to emphasize (and credit) information placed in wikipedia. This process reinforces itself to the point where “if it’s not in wikipedia, it doesn’t exist”.

    This future is significantly more probable than most people understand. As evidence compare the machine learning page three years ago (yep, it didn’t exist), two years ago, one year ago, and today. That progression strongly suggests that wikipedia:machine learning will continue to grow into a significant information resource.

    There are fundamental obstacles to the success of the wikipedia future.

    1. credit Wikipedia has only very weak mechanisms for crediting editors. A list of the changes done by one user account is about as much credit as is available. This is not enough to make career-deciding questions on. We could hope for a stronger link between identity and editor along with tools to track the value of particular edits (Think of counting hyperlinks as an analogue for counting citations).
    2. controversy Wikipedia has grown up in a nonmanipulative environment. When it was little known, the incentive to fabricate entries was not great. Now that it is becoming well known that incentive is growing. Character assasination by false article exists. In science, the thing to worry about is misplaced ideas of the importance of your topic of research since it is very difficult to be sufficiently interested in a research topic and simultaneously view it objectively. Research is about creating new ideas, and the location of these ideas in some general organization is in dispute by default.
  2. Evolutionary Progression Consider the following sequence of steps.
    1. Conference Organization We realize that having a list of online papers isn’t nearly as useful as having an organized list of online papers so the conferences which already have online proceedings create an explorable topic hierarchy.
    2. Time Organization We realize that the organization at one particular year’s conference is sketchy—research is a multiyear endeavor. Consequently, we start adding to last years topic hierarchy rather than creating a new one from scratch each year.
    3. Transformation We realize that it is better if papers are done in the language of the web. For example, it’s very handy to be able to hyperlink inside of a paper. A good solution to the math on the web problem would greatly help here.
    4. Consolidation We realize that there is a lot of redundancy in two papers on the same or a similar topic. They share an introduction, motivation, and (often) definitions. By joining the shared pieces, the contents of both papers can be made clearer.

    Each of these individual steps clearly yields something better. At the end of these steps, creating a paper is simply the process of creating a webpage or altering an existing webpage. We can imagine doing all of this while keeping the peer-review mechanisms of science intact, so the resulting process is simply better in all ways. It’s easier to author because for most papers much of the “filler” introduction/motivation/definition can be reused from previous papers. It’s easier to review, because reviewers can consider the result in context. Much of the difficulty of reviewing is simply due to the author and reviewer not being “on the same page” in how they understand things. An organized topic hierarchy greatly aids this.

  3. The unknown It is difficult to anticipate everything. What other futures might exist?

Which future comes about is dependent on many things—the decisions of community leaders, enabling ‘math-on-the-web’ technologies, etc…, so it is difficult to predict which future and when it will come about. Nevertheless, a potential exists and there are several paths leading towards reaching that potential.

The Design of an Optimal Research Environment

How do you create an optimal environment for research? Here are some essential ingredients that I see.

  1. Stability. University-based research is relatively good at this. On any particular day, researchers face choices in what they will work on. A very common tradeoff is between:
    1. easy small
    2. difficult big

    For researchers without stability, the ‘easy small’ option wins. This is often “ok”—a series of incremental improvements on the state of the art can add up to something very beneficial. However, it misses one of the big potentials of research: finding entirely new and better ways of doing things.

    Stability comes in many forms. The prototypical example is tenure at a university—a tenured professor is almost imposssible to fire which means that the professor has the freedom to consider far horizon activities. An iron-clad guarantee of a paycheck is not necessary—industrial research labs have succeeded well with research positions of indefinite duration. Atnt research was a great example of this before they lost their monopoly and had to fire the researchers.

    An inadequate amount of stability is probably given by fixed duration appointments. The appropriate timescale to consider here is one year because there is a strong annual cycle for application. A one year appointment is probably too short while a 10 year appointment may be effectively the same as arbitrary duration.

    One significant part of stability is financial stability of the parent organization. The nature of research implies that what makes ‘funding sense’ depends on the size of the set of people who can benefit from it. Since ‘governments’ are the largest organizations around, they are good candidates, with the caveat that political will has not been known as greatly stable.

  2. Free time. University-based research is relatively terrible about this while industrial labs vary widely.

    For professors at a university, teaching a subject well requires very significant time and energy. Running the university requires very significant time and energy. Writing grant proposals for funding requires very significant time and energy (success in grant writing is typically a necessity to achieve tenure at US research universities). Advising students well requires significant time and energy. Some of these activities are partially synergistic with research, but it seems unarguable that the multiplicity of time demands can greatly slow research.

    In industrial labs (think of “IBM research” as an example), there is an tendency to shift research towards short term rewards. This is partly becaue the research labs have not proved very capable of using the big successes and partly because even large businesses have ups and downs. During a ‘down’ time, it is very tempting to use the reserve of very capable people in a research lab for short duration projects.

    One thing to understand about research is that it is not the case that a person with half as much free time can produce half as much work. The problems encountered might have sufficient intricacy that they require a day or more of overhead to begin making progress.

  3. Problem exposure The set of research problems which people could work on is much larger than the set of research problems which are plausibly useful to the rest of the world (*). The way narrow down on the relevant problems is to give researchers a chance to encounter them.

    Univeristy-based research can have problems here, because no really effective mechanism for doing this exists by default. In industrial labs, researchers often act as “superconsultants” helping to solve problems. This method is effective but perhaps relatively crude because as often instantiated, it conflicts with “free time”.

    (*) There is always someone who says “but at the time number theory was researched, nobody had any practical use, and now it is very useful for crypto”. This is true, but there are several important caveats:

    1. The “hit rate” (in terms of big impact on everyones lives) for unmotivated research is much lower. I don’t have a good measurement of how much lower, but “significantly” seems both plausibly and anecdotally correct.
    2. Number theory was not entirely unmotivated at the time. Afterall, it was clear that numbers were a very important abstraction and so a better understanding of numbers was plausibly useful.
    3. Would development of factoring algorithms and their hardness understanding have been much delayed without number theory? And would that delay have been worth, in exchange, an earlier development of calculus? These are at least hard questions to answer.

    There is a role for research of the “let’s do it because I am curious nature”, especially because interested people do good research. The size of that role must, necessarily, be relatively small.

  4. Benefit exposure Almost all researchers are not well-exposed to the benefits of research. University-based research is may do this best via giving the researchers the ability to form their own companies with partial ownership. However, the overhead associated with “form your own company” is very substantial creating a formidable barrier (some good news, is that the barrier may be decreasing). This fails for many research advances which are obviously useful yet have no reasonable mechanism of capturing it’s profit. In the context of machine learning, spam filtering is of obvious and significant common use, yet a researcher with a better spam filter can not easily create a company around the method. There are several reasons for this.
    1. Machine learning is a victim of it’s common success. It’s hard to develop a learning algorithm which is substantially better than others. This means that anyone wanting to implement spam filtering can do so. Patents are useless here—you can’t patent an entire field (and even if you could it wouldn’t work).
    2. Spam filtering is not easily modularizable. Most people want their email clients to filter spam, but they don’t want to switch email clients. Combined with the first observation, this makes it clear that a “spam filtering company” is going to run into trouble.

    The benefit structure in industrial research labs is (perhaps) often even worse. The primary mechanism is via bonuses, where 10% of base salary might be large. Obviously, keeping the job is of much greater importance. A system where the base salary is bumped by a small (variable) amount each year is common. The implication of this is that rewards are delayed (reducing incentives) compared to a bonus and age discrimination becomes a problem.

    A relative bright spot here is google. Much of what they work on is not research-as-we-know-it, but some of it is, and there seem to be significant mechanisms for rewarding success, even after the IPO.

  5. Devolution of power There is a traditional notion of how things are done which says “everyone must have a manager to tell them what to do”. This works badly in research for several reasons.
    1. People are simply not very good at doing research on a topic which does not interest them. The problem is hard enough that doing it in a disinterested or even not-fully-interested way greatly reduces the chance of success.
    2. Another difficulty here is judgement. People make mistakes, including even managers. If a manager has a good idea about what is an interesting topic of research, then the power to say “work on this” is mildly beneficial over simple persuasion. On the other hand, a mistake is greatly harmful.

    The alternative to a command-and-control notion of research is a persuasion-based system. In a persuasion-based system, people are free to either work with someone or not, as they judge. In a persuasion based system, no compelling idea means that researchers try different problems while, which seems appropriate. And, when something promising comes along, more people can be added as demand and persuasiveness allows. University style research partially achieves this (a notable complaint of students is that they do not have enough freedom). In industrial labs, this varies widely.

    It is worth noting that devolution of power has occurred in all of the following surprisingly effective organizations/endeavors:

    1. Linux kernel development (and more generally open source software). Only a decade ago, most economists would have considered the present success of open source software development as implausible. Anyone who can code can work on the linux kernel if they want.
    2. Wikipedia Traditionaly encyclopedias complain that wikipedia articles can be false since anyone can submit, but this argument vastly underestimates the value of having a comprehensive and uptodate information resource.
    3. Google made it so anyone could advertise in a simple and easy way. This simultaneously lowered the barrier to advertising and sanitized the process. We sometimes forget that google came long after other internet search engines so their success was far from assured.
  6. Student exposure By “student” here, I mean “research student”. Research students are of course much more common on university campuses, although there are sometimes government funded research labs such as MPI in Germany and NICTA in Australia.

    Any reasonable system involving people working together must take into account how new people are brought into the system. The appropriate point seems to be where training in research starts (typically, ‘phd students’). Aside from issues of a self-sustaining system, students can add something significant to the process of research. The effort of explaining a new topic of research often aids research via simplification and consideration in new contexts. Students of course can also contribute substantially to the basic mission of research because they inherently bring in new ideas.

  7. Concentration There are several kinds of concentration, and each of them is important. The first kind is the ‘quality’ kind: there are better and worse researchers. Unfortunately, this is extraordinarily difficult to judge, except after the fact. Another kind is concentration of interests. Having 5 people each interested in the subject working on their own is often substantially less useful than having the 5 working together. The internet has greatly aided concentration because some significant benefit can be derived even from people who are not physically near to each other. Universities have traditionally had difficulty with this because the demands of teaching undergraduates imply a necessity for breadth of subjects. This implies that it is common for universities to not have two people working on a similar subject.

It is interesting that no place manages to capture them all. If all of these elements were to come together, the result might be very impressive.

Prediction Competitions

There are two prediction competitions currently in the air.

  1. The Performance Prediction Challenge by Isabelle Guyon. Good entries minimize a weighted 0/1 loss + the difference between a prediction of this loss and the observed truth on 5 datasets. Isabelle tells me all of the problems are “real world” and the test datasets are large enough (17K minimum) that the winner should be well determined by ability rather than luck. This is due March 1.
  2. The Predictive Uncertainty Challenge by Gavin Cawley. Good entries minimize log loss on real valued output variables for one synthetic and 3 “real” datasets related to atmospheric prediction. The use of log loss (which can be infinite and hence is never convergent) and smaller test sets of size 1K to 7K examples makes the winner of this contest more luck dependent. Nevertheless, the contest may be of some interest particularly to the branch of learning (typically Bayes learning) which prefers to optimize log loss.

May the best predictor win.

The design of a computing cluster

This is about the design of a computing cluster from the viewpoint of applied machine learning using current technology. We just built a small one at TTI so this is some evidence of what is feasible and thoughts about the design choices.

  1. Architecture There are several architectural choices.
    1. AMD Athlon64 based system. This seems to have the cheapest bang/buck. Maximum RAM is typically 2-3GB.
    2. AMD Opteron based system. Opterons provide the additional capability to buy an SMP motherboard with two chips, and the motherboards often support 16GB of RAM. The RAM is also the more expensive error correcting type.
    3. Intel PIV or Xeon based system. The PIV and Xeon based systems are the intel analog of the above 2. Due to architectural design reasons, these chips tend to run a bit hotter and be a bit more expensive.
    4. Dual core chips. Both Intel and AMD have chips that actually have 2 processors embedded in them.
    5. In the end, we decided to go with option (2). Roughly speaking, the AMD system seemed like a better value than Intel. The opteron systems were desirable over the Athlon64 systems because halving the number of nodes aides system setup and maintenance while preserving about the same (or slightly more) cost/cpu. Another concern driving us towards the Opteron system was the ability to expand the RAM at a later date. In the last year or so, CPU speeds have not increased signficantly, instead dual core chips have come out. This trend seems likely to continue which implies the time to obselescence of a machine is driven by the maximum RAM capacity.

  2. Network Gigabit ethernet is cheap, easy, and even built into the motherboard.
  3. Operating System The options are
    1. Windows
    2. Linux

    We chose Linux (and in particular the Fedora Core 3 variant) because Linux means cheaper, less licensing hassles, and you get to work with a system that has been used in clusters for much longer. An additional issue is the version of Linux:

    1. 32bit linux: The advantage here is everything “just works”. The disadvantage is that using more than 4GB of RAM is awkward and you lose out on some minor architectural speedups of 64bit mode.
    2. 64bit linux
    3. We ended up choosing 32bit linux simply for stability and ease-of-setup reasons. The ease-of-setup was particularly compelling with respect to the openMosix choice (below). The exact variant of linux we used was a matter of some initial exploration determined by Don Coleman (TTIs master of machines). It is very plausible that we will want to switch to 64bit linux at some point in the future.

  4. Programming paradigm There are several paradigms for how to use a parallel machine. These are not exclusive.
    1. Use a standard language with a specialized parallel programming library such as mpich. This choice can result in a significant slowdown in programming time, but is necessary to eak every bit of performance out of the system.
    2. Use a batch control system to match jobs with nodes. There are several custom systems around for doing this, and it’s not hard to make up your own script. There is some learning curve here although it is necessarily smaller. With this approach, you can achieve near maximal performance as long as the individual processes do not need to communicate.
    3. Turn the cluster into a large virtual machine via openMosix and then simply launch several processes. This is the worst option performance-wise and the best option convenience-wise. To use it, you simply start processes and the system takes care of distributing them across the cluster. Processes must, however, be designed to minimize disk IO (as well as program IO) in order to achieve high efficiency computation.

    At TTI, we focused on the openMosix approach because this fits well with standard machine learning programs. In addition, starting at the ease-of-use end and then graduating to more difficult paradigms as necessary seems reasonable.

There are a couple things which did not quite work out. Ideally, each of the nodes would be rackmounted (for ease of maintenance) and, except for the “master node”, use ethernet boot on startup. The rackmounting was relatively easy, but the combination of ethernet boot, openmosix, and linux was frustrating. Instead Don ordered some very small hard drives for each node and simply installed linux on them. Another minor surprise is that the opteron motherboard required a video card in order to boot.

In the end, the total cost was about $1000 per CPU and it took (perhaps) a man-week to setup. There are many caveats here—prices change rapidly, software improves, and how you want to use the cluster is important to consider. Nevertheless, this design point is hopefully of some help in calibrating your own designs. (Feel free to add in any of your own experience below.)