The perfect candidate

The last several years have seen phenomenal growth in machine learning, such that this earlier post from 2007 is understated. Machine learning jobs aren’t just growing on trees, they are growing everywhere. The core dynamic is a digitizing world, which makes people who know how to use data effectively a very hot commodity. In the present state, anyone reasonably familiar with some machine learning tools and a master’s level of education can get a good job at many companies, while PhD students coming out sometimes spark bidding wars and many professors have created startups.

Despite this, hiring for good research positions can be challenging. A good research position is one where you can:

  • Spend the majority of your time working on research questions that interest you.
  • Work with other like-minded people.
  • For several years.

I see these as critical—research is hard enough that you cannot expect to succeed without devoting the majority of your time. You cannot hope to succeed without personal interest. Other like-minded people are typically necessary for finding solutions to the hardest problems. And, typically, you must work for several years before seeing significant success. There are exceptions to everything, but these criteria are the working norm of the successful research I see.

The set of good research positions is expanding, but at a much slower pace than the many applied scientist types of positions. This makes good sense as the pool of people able to do interesting research grows only slowly, and anyone funding this should think quite hard before making the necessary expensive commitment for success.

But, with the above said, what makes a good candidate for a research position? People have many diverse preferences, so I can only speak for myself with any authority. There are several things I do and don’t look for.

  1. Something new. Any good candidate should have something worth teaching. For a PhD candidate, the subject of your research is deeply dependent on your advisor. It is not necessary that you do something different from your advisor’s research direction, but it is necessary that you own, and can speak authoritatively about, a significant advance.
  2. Something other than papers. It is quite possible to persist indefinitely in academia while only writing papers, but doing so does not show a real interest in what you are doing beyond survival. Why are you doing it? What is the purpose? Some people code. Some people solve particular applications. There are other things as well, but these are what make the difference.
  3. A difficult long-term goal. A goal suggests interest, but more importantly it makes research accumulate. Some people do research without a goal, solving whatever problems happen to pass by that they can solve. Very smart people can do well in research careers with a random walk amongst research problems. But people with a goal can have their research accumulate in a much stronger fashion than a random walk through research problems. I’m not an extremist here—solving off-goal problems is fine and desirable, but having a long-term goal makes a long-term difference.
  4. A portfolio of coauthors. This shows that you are the sort of person able to and interested in working with other people, as is very often necessary for success. This can be particularly difficult for some phd candidates whose advisors expect them to work exclusively with (or for) them. Summer internships are both a strong tradition and a great opportunity here.
  5. I rarely trust recommendations, because I find them very difficult to interpret. When the candidate selects the writers, the most interesting bit is who the writers are. Letters default to positive, but the degree of default varies from writer to writer. Occasionally, a recommendation says something surprising, but then do you trust the recommender’s judgement? In some cases yes, but in many cases you do not know the writer.

Meeting the above criteria within the context of a phd is extraordinarily difficult. The good news is that you can “fail” with a job that is better in just about every way 🙂

Anytime criteria are discussed, it’s worth asking: should you optimize for them? In another context, lines of code is a terrible metric to optimize when judging programmer productivity. Here, I believe optimizing for (1), (2), (3), and (4) is beneficial and worthwhile for PhD students.

Microsoft Research, New York City

Yahoo! laid off people. Unlike every previous time there have been layoffs, this is serious for Yahoo! Research.

We had advance warning from Prabhakar through the simple act of his leaving. Yahoo! Research was a world-class organization that Prabhakar recruited much of personally, so it is deeply implausible that he would spontaneously decide to leave. My first thought when I saw the news was “Uh-oh, Rob said that he knew it was serious when the head of AT&T Research left.” In this case it was even more significant, because Prabhakar recruited me on the premise that Y!R was an experiment in how research should be done: via a combination of high quality people and high engagement with the company. Prabhakar’s departure is a clear end to that experiment.

The result is ambiguous from a business perspective. Y!R clearly was not capable of saving the company from its illnesses. I’m not privy to the internal accounting of impact and this is the kind of subject where there can easily be great disagreement. Even so, there were several strong direct impacts coming from the machine learning, economics, and algorithms groups.

Y!R clearly was excellent from an academic research perspective. On a per-person basis in relevant subjects, it was outstanding. One way to measure this is by noticing that both ICML and KDD had (co)program chairs from Y!R. It turns out that spending a minority of your time consulting, architecting, and prototyping with the rest of the organization helps research by sharpening the questions you ask more than it hinders by taking up time. The decision to participate in this experiment was a good one for me personally.

It has been clear in silicon valley, academia, and pretty much everywhere else that people at Yahoo! including Yahoo! Research have been looking around for new positions. Maintaining the excellence of Y!R in a company that has been under prolonged stress was challenging leadership-wise. Consequently, the abrupt departure of Prabhakar and an apparent lack of appreciation by the new CEO created a crisis of confidence. Many people who were sitting on strong offers quickly left, and everyone else started looking around.

In this situation, my first concern was for colleagues, both in Machine Learning across the company and the Yahoo! Research New York office.

Machine Learning turns out to be a very hot technology. Every company and government in the world is drowning in data, and Machine Learning is the prime tool for actually using it to do interesting things. More generally, the demand for high quality seasoned machine learning researchers across startups, mature companies, government labs, and academia has been astonishing, and I expect the outcome to reflect that. This is remarkably different from the cuts that hit AT&T Research in late 2001 and early 2002, where the famous machine learning group there took many months to disperse to new positions.

In the New York office, we investigated many possibilities hard enough that it became a news story. While that article is wrong in specifics (we were not fired, for example, although it is difficult to discern cause and effect), we certainly shook the job tree very hard to see what would fall out. To my surprise, amongst all the companies we investigated, Microsoft had a uniquely sufficient agility, breadth of interest, and technical culture, enabling them to make offers that I and a significant fraction of the Y!R-NY lab could not resist. My belief is that the new Microsoft Research New York City lab will become an even greater techhouse than Y!R-NY. At a personal level, it is deeply flattering that they have chosen to create a lab for us on short notice. I will certainly do my part chasing the greatest learning algorithms not yet invented.

In light of this, I would encourage people in academia to consider Yahoo! in as fair a light as possible in the current circumstances. There are and will be some serious hard feelings about the outcome as various top researchers elsewhere in the organization feel compelled to look for jobs and leave. However, Yahoo! took a real gamble supporting a research organization about 7 years ago, and many positive things have come of this gamble from all perspectives. I expect almost all of the people leaving to eventually do quite well, and often even better.

What about ICML? My second thought on hearing about Prabhakar’s departure was “I really need to finish up initial paper/reviewer assignments today before dealing with this”. During the reviewing period, when the program chair load is relatively light, Joelle handled nearly everything. My great distraction ended neatly in time to help with decisions at ICML. I considered all possibilities in accepting the job and was prepared to simply put aside a job search for some time if necessary, but the timing was surreally perfect. All signs so far point towards this being an exceptional ICML, and I plan to do everything that I can to make that happen. The early registration deadline is May 13.

What about KDD? Deepak was sitting on an offer at LinkedIn and simply took it, so the disruption there was even smaller. LinkedIn is a significant surprise winner in this affair.

What about Vowpal Wabbit? Amongst other things, VW is the ultrascale learning algorithm, not the kind of thing that you would want to put aside lightly. I negotiated to continue the project and succeeded. This surprised me greatly—Microsoft has made serious commitments to supporting open source in various ways and that commitment is what sealed the deal for me. In return, I would like to see Microsoft always at or beyond the cutting edge in machine learning technology.

added: crosspost on CACM.
added: Lance, Jennifer, NYTimes, Vader

Hadoop AllReduce and Terascale Learning

Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case, there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy.

Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single-machine learning algorithm. The proof of this is simple: we can output an optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation.

It is necessary but not sufficient to have an effective communication infrastructure. It is necessary but not sufficient to have a decent programming language, because parallel programming is hard. It is necessary but not sufficient to have a good optimization approach. The combination says “yikes”, because you need to know many things to design an effective new system.

For communication infrastructures, the two most prevalent approaches are MPI and MapReduce, both of which have substantial drawbacks for machine learning with lots of data.

  1. MPI suffers because it has no fault tolerance by default and because it has a poor understanding of where data is, implying that data must be either manually placed on local nodes, or the first step in every computation is “partition the data across the cluster” which is very undesirable from a communication complexity and programming complexity standpoint. These significantly limit the scale that you can work at to ~100 nodes in practice, because the economics of clusters make sharing unavoidable at larger scales. When the cluster is shared, preshuffling the data is awkward to impossible and you must expect that some nodes will run slower than others because they will be executing other jobs. This limitation on reliability kicks in much sooner than disk read failures or node failures.
  2. MapReduce suffers because the setup and teardown costs are significant. Measured directly, this is often on the order of a minute, associated with interacting with the scheduler and communicating the program to a large number of nodes. But indirectly, this can be radically worse, as any MapReduce job can be held in limbo while waiting for free nodes to work on it. And commonly we need to execute many MapReduce iterations to achieve high quality prediction.
    MapReduce has another more subtle flaw: using it requires refactoring your code into a sequence of map and reduce operations. This is significantly annoying, because writing good learning algorithms is pretty difficult in the first place. MapReduce has a third flaw: it encourages an inefficient optimization paradigm. In particular, while you can phrase many learning algorithms as statistical query learning algorithms, doing so is energy inefficient, up to O(examples) in extreme cases.

Since the drawbacks of MPI and MapReduce differ, we can try to create a solution which eliminates all of the drawbacks, which a Hadoop-compatible AllReduce does. Cherry picking from each, we get:

  1. MPI: The Allreduce function. The starting state for AllReduce is n nodes each with a number, and the end state is all nodes having the sum of all numbers. (A minimal sketch of these semantics appears after this list.)
  2. MapReduce: Conceptual simplicity. One easy to understand function is enough.
  3. MPI: No need to refactor code. You just sprinkle allreduce in a few locations in your single machine code.
  4. MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
  5. MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
  6. MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
  7. MPI: Fast optimization approaches remain within the conceptual scope. Allreduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be worked around, but that’s annoying.
  8. MapReduce: Robustness. We don’t capture all the robustness of MapReduce, which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution approach to deal with the slow node problem and use delayed initialization to get around all startup failures, giving you something with a >99% success rate and a running time reliable to within a factor of 2.
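
To make the semantics concrete, here is a minimal, purely illustrative Python simulation of the allreduce operation (not the Hadoop-compatible implementation itself): each of n nodes contributes a vector, and after the call every node holds the elementwise sum.

    import numpy as np

    def simulated_allreduce(node_values):
        # Toy allreduce for illustration only: every node contributes a
        # vector, and every node receives the elementwise sum of all
        # contributions. A real implementation (MPI's, or the Hadoop-
        # compatible one shipped with VW) does this over the network with
        # a reduction tree; this just simulates the end state in-process.
        total = np.sum(node_values, axis=0)            # reduce
        return [total.copy() for _ in node_values]     # broadcast

    # Starting state: n nodes, each holding its own numbers.
    nodes = [np.array([1.0, 2.0]), np.array([10.0, 20.0]), np.array([100.0, 200.0])]
    print(simulated_allreduce(nodes))  # every node ends with [111., 222.]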

One function (all_reduce) is not a programming language. But since it’s written in C, it is easily encapsulated and added to any existing programming language giving you a complete language. To test this hypothesis, I visited Clement for a day, where we connected things to make Allreduce work in Lua twice—once with an online approach and once with an LBFGS optimization approach for convolutional neural networks. As a parallel programming paradigm, it’s amazingly easier than many other approaches, because you take your existing code and figure out which pieces of state to synchronize. It’s superior enough that I’ve now eliminated the multithreaded and parallel online learning approaches within Vowpal Wabbit. This approach is also great in terms of the amount of incremental learning required—you just need to learn one function to be able to create useful parallel machine learning algorithms. The only thing easier than learning one function is learning none, which you can do for linear prediction by just using VW. Incidentally, we designed the AllReduce code so that Hadoop is not a requirement—you just need to do a bit of extra scripting and lose some of the benefits discussed above when running this on a workstation cluster or a single machine.
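
As a hedged sketch of what “sprinkling allreduce into single machine code” looks like (the names here are hypothetical, not VW’s actual internals): a synchronous gradient step where the only parallel construct is one allreduce call that sums the per-shard gradients.

    import numpy as np

    def local_gradient(w, X, y):
        # Squared-loss gradient on this node's shard of the data.
        return X.T @ (X @ w - y) / len(y)

    def parallel_gd_step(w, X_local, y_local, allreduce, n_nodes, lr=0.1):
        # One synchronous gradient step. Each node computes the gradient on
        # its own shard; the single allreduce call sums the shards' gradients;
        # every node then applies the identical update, keeping weights in sync.
        g = allreduce(local_gradient(w, X_local, y_local)) / n_nodes
        return w - lr * g

    # Single-process smoke test: with one node, allreduce is the identity.
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
    w = np.zeros(5)
    for _ in range(50):
        w = parallel_gd_step(w, X, y, allreduce=lambda g: g, n_nodes=1)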

You also need to get the optimization approach right. Two canonical but very different optimization algorithms are stochastic gradient descent and LBFGS. Understanding the weaknesses of these algorithms is critical, even though they are often not discussed by their proponents. SGD approaches tend to have two drawbacks. The first is that the right choice of various hyperparameters can be annoying; we’ve mostly eliminated this drawback in VW using a learning rate that is tuned to automatically work in various ways. The other drawback is that they generally aren’t great at dealing with noise. This is tricky to deal with in general, because the algorithms only see one example at a time. Leon Bottou is working to eliminate this last drawback, but my impression is that we’re not quite there yet. LBFGS, on the other hand, is great at dealing with noise but suffers significantly in its early convergence rate, where SGD is extremely effective. Again, we can combine these approaches in an obvious way: use online learning at the beginning to warmstart LBFGS to integrate out the noise. In practice, the online learning gets you 95%-99% of the way there and then LBFGS nails the last bit of performance.
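
Here is a minimal single-machine sketch of that warmstart idea, using squared loss and scipy’s LBFGS as a stand-in for the cluster version; it is an illustration, not the actual VW code.

    import numpy as np
    from scipy.optimize import fmin_l_bfgs_b

    def loss_and_grad(w, X, y):
        # Full-pass squared loss and its gradient, as LBFGS wants.
        r = X @ w - y
        return 0.5 * np.mean(r ** 2), X.T @ r / len(y)

    def sgd_pass(X, y, lr=0.05):
        # One cheap online pass: gets most of the way to the optimum.
        w = np.zeros(X.shape[1])
        for x_i, y_i in zip(X, y):
            w -= lr * (x_i @ w - y_i) * x_i
        return w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

    w0 = sgd_pass(X, y)                                              # online warmstart
    w_opt, f_opt, _ = fmin_l_bfgs_b(loss_and_grad, w0, args=(X, y))  # LBFGS finishes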

For the problem I mentioned at the beginning, we can learn in about an hour using a kilonode, implying an overall throughput of 500 megafeatures/s, which is about a factor of 5 faster than any single network interface (1 gigabit/s). This is substantially greater scaling than any of the other algorithms in the Scaling up Machine Learning book (see here for a comparison).

The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all the other approaches, this approach is a significant step up within its scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:

  1. For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.
  2. For smaller datasets with computationally intense models, GPU approaches seem very compelling.
  3. For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug.
  4. The model needs to fit into memory. If that’s not the case, then other approaches are required.

I also expect Hadoop Allreduce to be useful across many more tasks than just machine learning. Optimization problems are an easy example, but I suspect there are a number of iterative computation problems where allreduce can be very effective. While it might appear a limited operation, you can easily do average, weighted average, max, etc… And, the scope of allreduce is also easily broadened with an arbitrary reduce function, as per MPI’s version. The Allreduce code itself is not yet native in Hadoop, so you’ll need to grab it from the VW source code, which has a BSD license. I’ve been encouraged by discussions with Milind suggesting it may become native soon.
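
For instance, assuming an allreduce-with-sum primitive like the toy one sketched earlier (a hypothetical helper, not a specific library API), a weighted average is just two calls, and a max needs only a different reduction operator.

    import numpy as np

    def weighted_average(local_value, local_weight, allreduce_sum):
        # Weighted average via two allreduce-sum calls: sum the weighted
        # values and the weights separately across nodes, then divide.
        # With all weights equal to 1 this is a plain average.
        num = allreduce_sum(np.asarray(local_weight * local_value, dtype=float))
        den = allreduce_sum(np.asarray(local_weight, dtype=float))
        return num / den

    # A max is a single allreduce with max substituted for sum, which is
    # exactly what an arbitrary-reduce-function allreduce provides.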

Update: CACM Crosspost.

Giving Thanks

Thanksgiving is perhaps my favorite holiday, because pausing your life and giving thanks provides a needed moment of perspective.

As a researcher, I am most thankful for my education, without which I could not function. I want to share this, because it provides some sense of how a researcher starts.

  1. My long term memory seems to function particularly well, which makes any education I get particularly useful.
  2. I am naturally obsessive, which makes me chase down details until I fully understand things. Natural obsessiveness can go wrong, of course, but it’s a great ally when you absolutely must get things right.
  3. My childhood was all in one hometown, which was a conscious sacrifice on the part of my father that spared me the disruptions of moving around. I’m not sure how important this was, since travel has its own benefits, but it bears thought.
  4. I had several great teachers in grade school, and naturally gravitated towards teachers over classmates, as they seemed more interesting. I particularly remember Mr. Cox, who read Watership Down 10 minutes a day. The frustration of not getting to the ending drove me into reading books on my own, including just about every science fiction book in Lebanon Oregon.
  5. I spent a few summers picking strawberries and blueberries. It’s great motivation to not do that sort of thing for a living.
  6. Lebanon school district was willing to bend the rules for me, so I could skip unnecessary math classes. I ended up a year advanced, taking math from our local community college during senior year in high school.
  7. College applications were a very nervous time, because high quality colleges cost much more than we could reasonably expect to pay. I was very lucky to get into Caltech here. Caltech should not be thought of as a university—instead, it’s a research lab which happens to have a few undergraduate students running around. I understand from Preston that the operating budget is about 4% tuition these days. This showed in the financial aid package, where they basically let me attend for the cost of room & board. Between a few scholarships and plentiful summer research opportunities, I managed to graduate debt free. Caltech was also an exceptional place to study, because rules like “no taking two classes at the same time” were never enforced. The only limits on what you could learn were your own.
  8. Graduate school was another big step. Here, I think Avrim must have picked out my application to Carnegie Mellon, which was a good fit for me. At the time, I knew I wanted to do research in some sort of ML/AI subject area, but not really what, so the breadth of possibilities at CMU was excellent. In graduate school, your advisor is much more important, and between Avrim and Sebastian, I learned quite a bit. The funding which made this all work out was mostly hidden from me at CMU, but there was surely a strong dependence on NSF and DARPA. Tom Siebel also directly covered my final year as a Siebel Scholar.
  9. Figuring out what to do next was again a nervous time, but it did work out, first in a summer postdoc with Michael Kearns, then at IBM research as a Herman Goldstine Fellow, then at TTI-Chicago, and now at Yahoo! Research for the last 5 years.

My life is just one anecdote, from which it’s easy to be misled. But trying to abstract the details, it seems like the critical elements for success are a good memory, an interest in getting the details right, motivation, and huge amounts of time to learn and then to do research. Given that many of the steps in this process winnow out large fractions of people, some amount of determination and sheer luck is involved. Does the right person manage to see you as a good possibility?

But mostly I’d like to give thanks for the “huge amounts of time”, which in practical terms translates into access to other smart people and funding. In education and research, funding is something like oxygen—you really miss it when it’s not there, so Thanksgiving is a good time to remember it.

Research Directions for Machine Learning and Algorithms

Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment.

There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively.

  1. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation (think about the n^2 dependence on the number of states in a Hidden Markov Model), and Kernelized Support Vector Machines, where optimization is typically quadratic or worse. There are two basic reasons why linear time matters. The most obvious is that linear time allows you to deal with large datasets. A less obvious but critical point is that a superlinear time algorithm is inherently buggy—it has an unreliable running time that can easily explode if you accidentally give it too much input.
  2. Almost anything worth doing requires many people working together. This happens for many reasons. One is the time-critical aspect of development—in many places it really is worthwhile to pay more to have something developed faster. Another is that real projects are simply much bigger than you might otherwise expect. A third is that real organizations have people coming and going, and any project that belongs to just one person withers when they leave. This observation means that developing systems with clean abstractions can be extraordinarily helpful, as it allows people to work independently. It also means that simple, widely applicable tricks (for example the hashing trick; a small sketch appears after this list) can be broadly helpful.
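
As a small illustration of the hashing trick mentioned above (a generic sketch, not VW’s actual hash function or feature format): features named by arbitrary strings are mapped into a fixed-size vector by hashing, so separate components never need to build or synchronize a shared dictionary.

    import numpy as np
    from hashlib import md5

    def hash_features(named_features, num_bits=18):
        # Map (name, value) pairs into a fixed 2**num_bits dimensional
        # vector. There is no dictionary to build, store, or ship around;
        # hash collisions are simply accepted as a small amount of noise.
        dim = 1 << num_bits
        x = np.zeros(dim)
        for name, value in named_features.items():
            idx = int(md5(name.encode()).hexdigest(), 16) % dim
            x[idx] += value
        return x

    x = hash_features({"word=machine": 1.0, "word=learning": 1.0, "bias": 1.0})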

A good way to phrase research directions is with questions. Here are a few of my natural questions.

  1. How do we efficiently learn in settings where exploration is required? These are settings where the choice of action you take influences the observed reward—ad display and medical testing are two good scenarios. This is deeply critical to many applications, because the learning-with-exploration setting is inherently more natural than the standard supervised learning setting. The tutorial we did details much of the state of the art here, but very significant questions remain. How can we do efficient offline evaluation of algorithms (see here for a first attempt, and the sketch after this list)? How can we be efficient in both sample complexity and computational complexity? Several groups are interested in sampling from a Bayesian posterior to solve these sorts of problems. Where and when can this be proved to work? (There is essentially no analysis.) What is a maximally distributed and incentive compatible algorithm that remains efficient? The last question is very natural for marketplace design. How can we best construct reward functions operating on different time scales? What is the relationship between the realizable and agnostic versions of this setting, and how can we construct an algorithm which smoothly interpolates between the two?
  2. How can we learn from lots of data? We’ll be presenting a KDD survey tutorial about what’s been done. Some of the larger scale learning problems have been addressed effectively using MapReduce. The best example here that I know of is Ozgur Cetin’s algorithm at Y!—it’s preconditioned conjugate gradient with a Newton stepsize using two passes over examples per step. (A non-Hadoop version is implemented in VW for reference.) But linear predictors are not enough—we would like learning algorithms that can, for example, learn from all the images in the world. Doing this well plausibly requires a new approach and new learning algorithms. A key observation here is that the bandwidth required by the learning algorithm cannot be too great.
  3. How can we learn to index efficiently? The standard solution in information retrieval is to evaluate (or approximately evaluate) all objects in a database, returning the elements with the largest score according to some learned or constructed scoring function. This is an inherently O(n) operation, which is frustrating when it’s plausible that an exponentially faster O(log(n)) solution exists. A good solution involves both theory and empirical work here, as we need to think about how to think about the problem, and of course we need to solve it.
  4. What is a flexible, inherently efficient language for architecting representations for learning algorithms? Right now, graphical models often get (mis)used for this purpose. It’s easy and natural to pose a computationally intractable graphical model, implying many real applications involve approximations. A better solution would be a different representation language which is always computationally tractable yet flexible enough to solve real-world problems. One starting point for this is Searn. Another general approach was the topic of the Coarse to Fine workshop. These are inherently related, as coarse-to-fine is a pruned breadth first search. Restated, it is not enough to have a language for specifying your prior structural beliefs—we must have such a language which results in computationally tractable solutions.
  5. The Deep Learning problem remains interesting. How do you effectively learn complex nonlinearities capable of better performance than a basic linear predictor? An effective solution avoids feature engineering. Right now, this is almost entirely dealt with empirically, but theory could easily have a role to play in phrasing appropriate optimization algorithms, for example.
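
On the offline evaluation question in (1), one standard estimator is inverse propensity scoring; the sketch below illustrates that generic idea and is not necessarily the exact method of the linked first attempt. From logged exploration data of the form (context, action taken, observed reward, probability of that action under the logging policy), the value of a new deterministic policy can be estimated without ever running it.

    def ips_estimate(logged, policy):
        # Offline (off-policy) value estimate of a new policy from logged
        # exploration data via inverse propensity scoring. Each record is
        # (context, action, reward, prob) where prob is the probability the
        # logging policy assigned to the action it actually took. The estimate
        # is unbiased when those probabilities are correct and nonzero
        # wherever the new policy acts.
        total = 0.0
        for context, action, reward, prob in logged:
            if policy(context) == action:
                total += reward / prob
        return total / len(logged)

    # Example: logged data from a uniform-random logger over 2 actions.
    logged = [({"x": 1}, 0, 1.0, 0.5), ({"x": 2}, 1, 0.0, 0.5), ({"x": 3}, 0, 1.0, 0.5)]
    estimate = ips_estimate(logged, policy=lambda context: 0)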

Good solutions to each of the research directions above would result in revolutions in their area, and every one of them would plausibly see wide applicability.

What’s missing?

Cross-posted at CACM.