Machine Learning (Theory)


Fact over Fiction

Tags: politics,Research jl@ 1:23 pm

Politics is a distracting affair which I generally believe it’s best to stay out of if you want to be able to concentrate on research. Nevertheless, the US presidential election looks like something that directly politicizes the idea and process of research by damaging the association of scientists & students, funding for basic research, and creating political censorship.

A core question here is: What to do? Today’s March for Science is a good step, but I’m not sure it will change many minds. Unlike most scientists, I grew up in a a county (Linn) which voted overwhelmingly for Trump. As a consequence, I feel like I must translate the mindset a bit. For the median household left behind over my lifetime a march by relatively affluent people protesting the government cutting expenses will not elicit much sympathy. Discussion about the overwhelming value of science may also fall on deaf ears simply because they have not seen the economic value personally. On the contrary, they have seen their economic situation flat or worsening for 4 decades with little prospect for things getting better. Similarly, I don’t expect history lessons on anti-intellectualism to make much of a dent. Fundamentally, scientists and science fans are a small fraction of the population.

What’s needed is a campaign that achieves broad agreement across the population and which will help. One of the roots of the March for Science is a belief in facts over fiction which may have the requisite scope. In particular, there seems to be a good case that the right to engage in mass disinformation has been enormously costly to the United States and is now a significant threat to civil stability. Internally, disinformation is a preferred tool for starting wars or for wealthy companies to push a deadly business model. Externally, disinformation is now being actively used to sway elections and is self-funding.

The election outcome is actually less important than the endemic disagreement that disinformation creates. When people simply believe in different facts about the world how can you expect them to agree? There probably are some good uses of mass disinformation somewhere, but I’m extremely skeptical the value exceeds the cost.

Is opposition to mass disinformation broad enough that it makes a good organizing principle? If mass disinformation was eliminated or greatly reduced it would be an enormous value to society, particularly to the disinformed. It would not address the fundamental economic stagnation of the median household in the United States, but it would remove a significant threat to civil society which may be necessary for such progress. Given a choice between the right to mass disinform and democracy, I choose democracy.

A real question is “how”? We are discussing an abridgment of freedom of speech so from a legal perspective the basis must rest on the balance between freedom of speech and other constitutional rights. Many abridgements exist like censuring a yell of “fire” in a crowded theater unnecessarily.

Voluntary efforts (as Facebook and Twitter have undertaken) are a start, but it seems unlikely to go far enough as many other “news” organizations have made no such commitments. A system where companies commit to informing over disinforming and in return become both more trusted and simultaneously liable for disinformation damages (due to the disinformed) as assessed by civil law may make sense. Right now organizations are mostly free to engage in disinformation as long as it is not directed at an individual where libel laws apply. Penalizing an organization for individual mistakes seems absurd, but a pattern of errors backed by scientific surveys verifying an anomalously misinformed status of viewers/readers/listeners is cause for action. Getting this right is obviously a tricky thing—we want a solution that a real news organization with an existing mimetic immune system prefers to the status quo because it curbs competitors that disinform. At the same time, there must be enough teeth to make disinformation uneconomical or the problem only grows.

Should disinformation have criminal penalties? One existing approach here uses RICO laws to counter disinformation from Tobacco companies. Reading the history, this took an amazing amount of time—enough that it was ineffective for a generation. It seems plausible that an act directly addressing disinformation may be helpful.

What about technical solutions? Technical solutions seem necessary for success, perhaps with changes to law incentivizing this. It’s important to understand that something going before the courts is inherently slow, particularly because courts tend to be deeply overloaded. A significant increase in the number of cases going before courts makes an approach nonviable in practice.

Would we regret this? There is a long history of governments abusing laws to censor inconvenient news sources so caution is warranted. Structuring new laws in a manner such that they cannot be abused is an important consideration. It is obviously important to leave satire fully intact which seems entirely possibly by making the fact that it is satire unmistakable. This entire discussion is also not relevant to individuals speaking to other individuals—that is not what creates a problem.

Is this possible? It might seem obvious that mass disinformation should be curbed but there should be no doubt that powerful forces will work to preserve mass disinformation by subtle and unethical means.

Overall, I fundamentally believe that people in a position to inform or disinform have a responsibility to inform. If they don’t want that responsibility, then they should abdicate the position to someone who does, similar in effect to the proposed fiduciary rule for investments. I’m open to any approach towards achieving this.

Edit: also at CACM.


No more MSR Silicon Valley

Tags: Research jl@ 6:44 pm

This news report is correct, the Microsoft Research Silicon Valley center has been cut. The New York lab has not been directly affected although obviously cross-lab collaborations are impacted, and we sympathize deeply with those involved. Most of the rest of MSR is not directly affected.

I’m not privy to the rationale behind the decision, but in my opinion there are some very strong people in the various groups (Algorithms, Architecture, Distributed Systems, Privacy, Software tools, Web Search), and I expect offers have started raining on them. In my experience, this is harrowing in the short term, yet I know that most of my previous colleagues ended up happier after the troubles hit Yahoo! Research 2 1/2 years ago.


The perfect candidate

The last several years have seen a phenomonal growth in machine learning, such that this earlier post from 2007 is understated. Machine learning jobs aren’t just growing on trees, they are growing everywhere. The core dynamic is a digitizing world, which makes people who know how to use data effectively a very hot commodity. In the present state, anyone reasonably familiar with some machine learning tools and a master’s level of education can get a good job at many companies while Phd students coming out sometimes have bidding wars and many professors have created startups.

Despite this, hiring in good research positions can be challenging. A good research position is one where you can:

  • Spend the majority of your time working on research questions that interest.
  • Work with other like-minded people.
  • For several years.

I see these as critical—research is hard enough that you cannot expect to succeed without devoting the majority of your time. You cannot hope to succeed without personal interest. Other like-minded people are typically necessary in finding the solutions of the hardest problems. And, typically you must work for several years before seeing significant success. There are exceptions to everything, but these criteria are the working norm of successful research I see.

The set of good research positions is expanding, but at a much slower pace than the many applied scientist types of positions. This makes good sense as the pool of people able to do interesting research grows only slowly, and anyone funding this should think quite hard before making the necessary expensive commitment for success.

But, with the above said, what makes a good candidate for a research position? People have many diverse preferences, so I can only speak for myself with any authority. There are several things I do and don’t look for.

  1. Something new. Any good candidate should have something worth teaching. For a phd candidate, the subject of your research is deeply dependent on your advisor. It is not necessary that you do something different from your advisor’s research direction, but it is necessary that you own (and can speak authoritatively) about a significant advance.
  2. Something other than papers. It is quite possible to persist indefinitely in academia while only writing papers, but it does not show a real interest in what you are doing beyond survival. Why are you doing it? What is the purpose? Some people code. Some people solve particular applications. There are other things as well, but these make the difference.
  3. A difficult long-term goal. A goal suggests interest, but more importantly it makes research accumulate. Some people do research without a goal, solving whatever problems happen to pass by that they can solve. Very smart people can do well in research careers with a random walk amongst research problems. But people with a goal can have their research accumulate in a much stronger fashion than a random walk through research problems. I’m not an extremist here—solving off goal problems is fine and desirable, but having a long-term goal makes a long-term difference.
  4. A portfolio of coauthors. This shows that you are the sort of person able to and interested in working with other people, as is very often necessary for success. This can be particularly difficult for some phd candidates whose advisors expect them to work exclusively with (or for) them. Summer internships are both a strong tradition and a great opportunity here.
  5. I rarely trust recommendations, because I find them very difficult to interpret. When the candidate selects the writers, the most interesting bit is who the writers are. Letters default positive, but the degree of default varies from writer to writer. Occasionally, a recommendation says something surprising, but do you trust the recommender’s judgement? In some cases yes, but in many cases you do not know the writer.

Meeting the above criteria within the context of a phd is extraordinarily difficult. The good news is that you can “fail” with a job that is better in just about every way :-)

Anytime criteria are discussed, it’s worth asking: should you optimize for them? In another context, Lines of code is a terrible metric to optimize when judging programmer productivity. Here, I believe optimizing for (1), (2), (3), and (4) are all beneficial and worthwhile for phd students.


Microsoft Research, New York City

Tags: Machine Learning,Research jl@ 5:34 am

Yahoo! laid off people. Unlike every previous time there have been layoffs, this is serious for Yahoo! Research.

We had advanced warning from Prabhakar through the simple act of leaving. Yahoo! Research was a world class organization that Prabhakar recruited much of personally, so it is deeply implausible that he would spontaneously decide to leave. My first thought when I saw the news was “Uhoh, Rob said that he knew it was serious when the head of ATnT Research left.” In this case it was even more significant, because Prabhakar recruited me on the premise that Y!R was an experiment in how research should be done: via a combination of high quality people and high engagement with the company. Prabhakar’s departure is a clear end to that experiment.

The result is ambiguous from a business perspective. Y!R clearly was not capable of saving the company from its illnesses. I’m not privy to the internal accounting of impact and this is the kind of subject where there can easily be great disagreement. Even so, there were several strong direct impacts coming from the machine learning, economics, and algorithms groups.

Y!R clearly was excellent from an academic research perspective. On a per person basis in relevant subjects, it was outstanding. One way to measure this is by noticing that both ICML and KDD had (co)program chairs from Y!R. It turns out that talking to the rest of the organization doing consulting, architecting, and prototyping on a minority basis helps research by sharpening the questions you ask more than it hinders by taking up time. The decision to participate in this experiment was a good one for me personally.

It has been clear in silicon valley, academia, and pretty much everywhere else that people at Yahoo! including Yahoo! Research have been looking around for new positions. Maintaining the excellence of Y!R in a company that has been under prolonged stress was challenging leadership-wise. Consequently, the abrupt departure of Prabhakar and an apparent lack of appreciation by the new CEO created a crisis of confidence. Many people who were sitting on strong offers quickly left, and everyone else started looking around.

In this situation, my first concern was for colleagues, both in Machine Learning across the company and the Yahoo! Research New York office.

Machine Learning turns out to be a very hot technology. Every company and government in the world is drowning in data, and Machine Learning is the prime tool for actually using it to do interesting things. More generally, the demand for high quality seasoned machine learning researchers across startups, mature companies, government labs, and academia has been astonishing, and I expect the outcome to reflect that. This is remarkably different from the cuts that hit ATnT research in late 2001 and early 2002 where the famous machine learning group there took many months to disperse to new positions.

In the New York office, we investigated many possibilities hard enough that it became a news story. While that article is wrong in specifics (we ended up not fired for example, although it is difficult to discern cause and effect), we certainly shook the job tree very hard to see what would fall out. To my surprise, amongst all the companies we investigated, Microsoft had a uniquely sufficient agility, breadth of interest, and technical culture, enabling them to make offers that I and a significant fraction of the Y!R-NY lab could not resist. My belief is that the new Microsoft Research New York City lab will become an even greater techhouse than Y!R-NY. At a personal level, it is deeply flattering that they have chosen to create a lab for us on short notice. I will certainly do my part chasing the greatest learning algorithms not yet invented.

In light of this, I would encourage people in academia to consider Yahoo! in as fair a light as possible in the current circumstances. There are and will be some serious hard feelings about the outcome as various top researchers elsewhere in the organization feel compelled to look for jobs and leave. However, Yahoo! took a real gamble supporting a research organization about 7 years ago, and many positive things have come of this gamble from all perspectives. I expect almost all of the people leaving to eventually do quite well, and often even better.

What about ICML? My second thought on hearing about Prabhakar’s departure was “I really need to finish up initial paper/reviewer assignments today before dealing with this”. During the reviewing period where the program chair load is relatively light, Joelle handled nearly everything. My great distraction ended neatly in time to help with decisions at ICML. I considered all possibilities in accepting the job and was prepared to simply put aside a job search for some time if necessary, but the timing was surreally perfect. All signs so far point towards this ICML being an exceptional ICML, and I plan to do everything that I can to make that happen. The early registration deadline is May 13.

What about KDD? Deepak was sitting on an offer at Linkedin and simply took it, so the disruption there was even more minimal. Linkedin is a significant surprise winner in this affair.

What about Vowpal Wabbit? Amongst other things, VW is the ultrascale learning algorithm, not the kind of thing that you would want to put aside lightly. I negotiated to continue the project and succeeded. This surprised me greatly—Microsoft has made serious commitments to supporting open source in various ways and that commitment is what sealed the deal for me. In return, I would like to see Microsoft always at or beyond the cutting edge in machine learning technology.

added: crosspost on CACM.
added: Lance, Jennifer, NYTimes, Vader


Hadoop AllReduce and Terascale Learning

Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case, there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy.

Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single machine learning algorithm. The proof of this is simple: We can output a optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation.

It is necessary but not sufficient to have an effective communication infrastructure. It is necessary but not sufficient to have a decent programming language, because parallel programming is hard. It is necessary but not sufficient to have a good optimization approach. The combination says “yikes”, because you need to know many things to design an effective new system.

For communication infrastructures, the two most prevalent approaches are MPI and MapReduce, both of which have substantial drawbacks for machine learning with lots of data.

  1. MPI suffers because it has no fault tolerance by default and because it has a poor understanding of where data is, implying that data must be either manually placed on local nodes, or the first step in every computation is “partition the data across the cluster” which is very undesirable from a communication complexity and programming complexity standpoint. These significantly limit the scale that you can work at to ~100 nodes in practice, because the economics of clusters make sharing unavoidable at larger scales. When the cluster is shared, preshuffling the data is awkward to impossible and you must expect that some nodes will run slower than others because they will be executing other jobs. This limitation on reliability kicks in much sooner than disk read failures or node failures.
  2. MapReduce suffers because the setup and teardown costs are significant. Measured directly, this is often on the order of a minute, associated with interacting with the scheduler and communicating the program to a large number of nodes. But indirectly, this can be radically worse, as any map-reduce job can be held in limbo while waiting for free nodes to work on. And commonly we need to execute many MapReduce iterations to achieve high quality prediction.
    MapReduce has another more subtle flaw: using it requires refactoring your code into a sequence of map and reduce operations. This is significantly annoying, because right good learning algorithms is pretty difficult in the first place. MapReduce has a third flaw: it encourages inefficient optimization paradigm. In particular, while you can phrase many learning algorithms as statistical query learning algorithms, doing so is energy inefficient, up to O(examples) in extreme cases.

Since the drawbacks of MPI and MapReduce differ, we can try to create a solution which eliminates all of drawbacks, which a Hadoop-compatible AllReduce does. Cherry picking from each we get:

  1. MPI: The Allreduce function. The starting state for AllReduce is n nodes each with a number, and the end state is all nodes having the sum of all numbers.
  2. MapReduce: Conceptual simplicity. One easy to understand function is enough.
  3. MPI: No need to refactor code. You just sprinkle allreduce in a few locations in your single machine code.
  4. MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
  5. MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
  6. MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
  7. MPI: Fast optimization approaches remain within the conceptual scope. Allreduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be walked around, but that’s annoying.
  8. MapReduce: Robustness. We don’t captures all the robustness of MapReduce which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution approach to deal with the slow node problem and use delayed initialization to get around all startup failures giving you something with >99% success rate on a running time reliable to within a factor of 2.

One function (all_reduce) is not a programming language. But since it’s written in C, it is easily encapsulated and added to any existing programming language giving you a complete language. To test this hypothesis, I visited Clement for a day, where we connected things to make Allreduce work in Lua twice—once with an online approach and once with an LBFGS optimization approach for convolutional neural networks. As a parallel programming paradigm, it’s amazingly easier than many other approaches, because you take your existing code and figure out which pieces of state to synchronize. It’s superior enough that I’ve now eliminated the multithreaded and parallel online learning approaches within Vowpal Wabbit. This approach is also great in terms of the amount of incremental learning required—you just need to learn one function to be able to create useful parallel machine learning algorithms. The only thing easier than learning one function is learning none, which you can do for linear prediction by just using VW. Incidentally, we designed the AllReduce code so that Hadoop is not a requirement—you just need to do a bit of extra scripting and lose some of the benefits discussed above when running this on a workstation cluster or a single machine.

You also need to get optimization approaches right. Two canonical but very different optimization algorithms are stochastic gradient descent and LBFGS. Understanding the weaknesses of these algorithm is critical even though often not discussed by their proponents. SGD approaches tend to have two drawbacks: the right choice of various hyperparameters can be annoying. We’ve mostly eliminated this drawback in VW using a learning rate that is tuned to automatically work in various ways. The other drawback is that they generally aren’t great at dealing with noise. This is tricky to deal with in general, because the algorithms only see one example at a time. Leon Bottou is working to eliminate this last drawback, but my impression is that we’re not quite there yet. LBFGS on the other hand is great at dealing with noise but suffers significantly in it’s early convergence rate where SGD is extremely effective. Again, we can combine these approaches in an obvious way: use online learning at the beginning to warmstart LBFGS to integrate out the noise. In practice, the online learning gets you 95%-99% of the way there and then LBFGS nails the last bit of performance.

For the problem I mentioned at the beginning, we can learn in about an hour using a kilonode, implying an overall throughput of 500 megafeatures/s, which is about a factor of 5 faster than any single network interface (1 gigabit/s). This is substantially greater scaling than any of the other algorithms in the Scaling up Machine Learning book (see here for a comparison).

The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all other approaches, this approach is a significant step up within it’s scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:

  1. For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.
  2. For smaller datasets with computationally intense models, GPU approaches seem very compelling.
  3. For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug.
  4. The model needs to fit into memory. If that’s not the case, then other approaches are required.

I also expect Hadoop Allreduce is useful across many more tasks than just machine learning. Optimization problems are an easy example, but I suspect there are a number of iterative computation problems where allreduce can be very effective. While it might appear a limited operation, you can easily do average, weighted average, max, etc… And, the scope of allreduce is also easily broadened with an arbitrary reduce function, as per MPI’s version. The Allreduce code itself is not yet native in Hadoop, so you’ll need to grab it from the VW source code which has a BSD license. I’ve been encouraged by discussions with Milind suggesting it may become native soon.

Update: CACM Crosspost.

Older Posts »

Powered by WordPress