If you are interested, please email msrnycrsvp at microsoft.com and say “I want to come” so we can get a count of attendees for refreshments.
The last several years have seen a phenomonal growth in machine learning, such that this earlier post from 2007 is understated. Machine learning jobs aren’t just growing on trees, they are growing everywhere. The core dynamic is a digitizing world, which makes people who know how to use data effectively a very hot commodity. In the present state, anyone reasonably familiar with some machine learning tools and a master’s level of education can get a good job at many companies while Phd students coming out sometimes have bidding wars and many professors have created startups.
Despite this, hiring in good research positions can be challenging. A good research position is one where you can:
- Spend the majority of your time working on research questions that interest.
- Work with other like-minded people.
- For several years.
I see these as critical—research is hard enough that you cannot expect to succeed without devoting the majority of your time. You cannot hope to succeed without personal interest. Other like-minded people are typically necessary in finding the solutions of the hardest problems. And, typically you must work for several years before seeing significant success. There are exceptions to everything, but these criteria are the working norm of successful research I see.
The set of good research positions is expanding, but at a much slower pace than the many applied scientist types of positions. This makes good sense as the pool of people able to do interesting research grows only slowly, and anyone funding this should think quite hard before making the necessary expensive commitment for success.
But, with the above said, what makes a good candidate for a research position? People have many diverse preferences, so I can only speak for myself with any authority. There are several things I do and don’t look for.
- Something new. Any good candidate should have something worth teaching. For a phd candidate, the subject of your research is deeply dependent on your advisor. It is not necessary that you do something different from your advisor’s research direction, but it is necessary that you own (and can speak authoritatively) about a significant advance.
- Something other than papers. It is quite possible to persist indefinitely in academia while only writing papers, but it does not show a real interest in what you are doing beyond survival. Why are you doing it? What is the purpose? Some people code. Some people solve particular applications. There are other things as well, but these make the difference.
- A difficult long-term goal. A goal suggests interest, but more importantly it makes research accumulate. Some people do research without a goal, solving whatever problems happen to pass by that they can solve. Very smart people can do well in research careers with a random walk amongst research problems. But people with a goal can have their research accumulate in a much stronger fashion than a random walk through research problems. I’m not an extremist here—solving off goal problems is fine and desirable, but having a long-term goal makes a long-term difference.
- A portfolio of coauthors. This shows that you are the sort of person able to and interested in working with other people, as is very often necessary for success. This can be particularly difficult for some phd candidates whose advisors expect them to work exclusively with (or for) them. Summer internships are both a strong tradition and a great opportunity here.
- I rarely trust recommendations, because I find them very difficult to interpret. When the candidate selects the writers, the most interesting bit is who the writers are. Letters default positive, but the degree of default varies from writer to writer. Occasionally, a recommendation says something surprising, but do you trust the recommender’s judgement? In some cases yes, but in many cases you do not know the writer.
Meeting the above criteria within the context of a phd is extraordinarily difficult. The good news is that you can “fail” with a job that is better in just about every way
Anytime criteria are discussed, it’s worth asking: should you optimize for them? In another context, Lines of code is a terrible metric to optimize when judging programmer productivity. Here, I believe optimizing for (1), (2), (3), and (4) are all beneficial and worthwhile for phd students.
This year’s ICML had several papers which I want to read through more carefully and understand better.
- Chun-Liang Li, Hsuan-Tien Lin, Condensed Filter Tree for Cost-Sensitive Multi-Label Classification. Several tricks accumulate to give a new approach for addressing cost sensitive multilabel classification.
- Nikos Karampatziakis and Paul Mineiro, Discriminative Features via Generalized Eigenvectors. An efficient, effective eigenvalue solution for supervised learning yields compelling nonlinear performance on several datasets.
- Nir Ailon, Zohar Karnin, Thorsten Joachims, Reducing Dueling Bandits to Cardinal Bandits. An effective method for reducing dueling bandits to normal bandits that extends to contextual situations.
- Pedro Pinheiro, Ronan Collobert, Recurrent Convolutional Neural Networks for Scene Labeling. Image parsing remains a challenge, and this is plausibly a step forward.
- Cicero Dos Santos, Bianca Zadrozny, Learning Character-level Representations for Part-of-Speech Tagging. Word morphology is clearly useful information, and yet almost all ML-for-NLP applications ignore it or hard-code it (by stemming).
- Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, Robert Schapire, Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. Statistically efficient interactive learning is now computationally feasible. I wish this one had been done in time for the NIPS tutorial
- David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller, Deterministic Policy Gradient Algorithms. A reduction in variance from working out the deterministic limit of policy gradient make policy gradient approaches look much more attractive.
Edit: added one that I forgot.
- Rayid Ghani (Chief Scientist at Obama 2012)
- Brian Kingsbury (Speech Recognition @ IBM)
- Jorge Nocedal (who did LBFGS)
We’ve been somewhat disorganized in advertising this. As a consequence, anyone who has not submitted an abstract but would like to do so may send one directly to me (email@example.com title NYASMLS) by Friday March 14. I will forward them to the rest of the committee for consideration.
In recent years, there’s been an explosion of free educational resources that make high-level knowledge and skills accessible to an ever-wider group of people. In your own field, you probably have a good idea of where to look for the answer to any particular question. But outside your areas of expertise, sifting through textbooks, Wikipedia articles, research papers, and online lectures can be bewildering (unless you’re fortunate enough to have a knowledgeable colleague to consult). What are the key concepts in the field, how do they relate to each other, which ones should you learn, and where should you learn them?
Courses are a major vehicle for packaging educational materials for a broad audience. The trouble is that they’re typically meant to be consumed linearly, regardless of your specific background or goals. Also, unless thousands of other people have had the same background and learning goals, there may not even be a course that fits your needs. Recently, we (Roger Grosse and Colorado Reed) have been working on Metacademy, an open-source project to make the structure of a field more explicit and help students formulate personal learning plans.
Metacademy is built around an interconnected web of concepts, each one annotated with a short description, a set of learning goals, a (very rough) time estimate, and pointers to learning resources. The concepts are arranged in a prerequisite graph, which is used to generate a learning plan for a concept. In this way, Metacademy serves as a sort of “package manager for knowledge.”
Metacademy also has wiki-like documents called roadmaps, which briefly overview key concepts in a field and explain why you might want to learn about them; here’s one we wrote for Bayesian machine learning.
Many ingredients of Metacademy are drawn from pre-existing systems, including Khan Academy, saylor.org, Connexions, and many intelligent tutoring systems. We’re not trying to be the first to do any particular thing; rather, we’re trying to build a tool that we personally wanted to exist, and we hope others will find it useful as well.
Granted, if you’re reading this blog, you probably have a decent grasp of most of the concepts we’ve annotated. So how can Metacademy help you? If you’re teaching an applied course and don’t want to re-explain Gibbs sampling, you can simply point your students to the concept on Metacademy. Or, if you’re writing a textbook or teaching a MOOC, Metacademy can help potential students find their way there. Don’t worry about self-promotion: if you’ve written something you think people will find useful, feel free to add a pointer!
We are hoping to expand the content beyond machine learning, and we welcome contributions. You can create a roadmap to help people find their way around a field. We are currently working on a GUI for editing the concepts and the graph connecting them (our current system is based on Github pull requests), and we’ll send an email to our registered users once this system is online. If you find Metacademy useful or want to contribute, let us know at feedback _at_ metacademy _dot_ org.
At NIPS I’m giving a tutorial on Learning to Interact. In essence this is about dealing with causality in a contextual bandit framework. Relative to previous tutorials, I’ll be covering several new results that changed my understanding of the nature of the problem. Note that Judea Pearl and Elias Bareinboim have a tutorial on causality. This might appear similar, but is quite different in practice. Pearl and Bareinboim’s tutorial will be about the general concepts while mine will be about total mastery of the simplest nontrivial case, including code. Luckily, they have the right order. I recommend going to both
I also just released version 7.4 of Vowpal Wabbit. When I was a frustrated learning theorist, I did not understand why people were not using learning reductions to solve problems. I’ve been slowly discovering why with VW, and addressing the issues. One of the issues is that machine learning itself was not automatic enough, while another is that creating a very low overhead process for doing learning reductions is vitally important. These have been addressed well enough that we are starting to see compelling results. Various changes:
- The internal learning reduction interface has been substantially improved. It’s now pretty easy to write new learning reduction. binary.cc provides a good example. This is a very simple reduction which just binarizes the prediction. More improvements are coming, but this is good enough that other people have started contributing reductions.
- Zhen Qin had a very productive internship with Vaclav Petricek at eharmony resulting in several systemic modifications and some new reductions, including:
- A direct hash inversion implementation for use in debugging.
- A holdout system which takes over for progressive validation when multiple passes over data are used. This keeps the printouts ‘honest’.
- An online bootstrap mechanism system which efficiently provides some understanding of prediction variations and which can sometimes effectively trade computational time for increased accuracy via ensembling. This will be discussed at the biglearn workshop at NIPS.
- A top-k reduction which chooses the top-k of any set of base instances.
- Hal Daume has a new implementation of Searn (and Dagger, the codes are unified) which makes structured prediction solutions far more natural. He has optimized this quite thoroughly (exercising the reduction stack in the process), resulting in this pretty graph.
Here, CRF++ is commonly used conditional random field code, SVMstruct is an SVM-style approach to classification, and CRF SGD is an online learning CRF approach. All of these methods use the same features. Fully optimized code is typically rough, but this one is less than 100 lines.
I’m trying to put together a tutorial on these things at NIPS during the workshop break on the 9th and will add details as that resolves for those interested enough to skip out on skiing
Edit: The VW tutorial will take place during the break at the big learning workshop from 1:30pm – 3pm at Harveys Emerald Bay B.
Various news stories have coverage, but the synopsis is that he had a heart attack on Sunday and is survived by his wife Anat and daughter Aviv. There is discussion of creating a memorial fund for them, which I hope comes to fruition, and plan to contribute to.
I will remember Ben as someone who thought carefully and comprehensively about new ways to do things, then fought hard and successfully for what he believed in. It is an ideal we strive for, that Ben accomplished.
Several strong graduates are on the job market this year.
- Alekh Agarwal made the most scalable public learning algorithm as an intern two years ago. He has a deep and broad understanding of optimization and learning as well as the ability and will to make things happen programming-wise. I’ve been privileged to have Alekh visiting me in NY where he will be sorely missed.
- John Duchi created Adagrad which is a commonly helpful improvement over online gradient descent that is seeing wide adoption, including in Vowpal Wabbit. He has a similarly deep and broad understanding of optimization and learning with significant industry experience at Google. Alekh and John have often coauthored together.
- Stephane Ross visited me a year ago over the summer, implementing many new algorithms and working out the first scale free online update rule which is now the default in Vowpal Wabbit. Stephane is not on the market—Google robbed the cradle successfully I’m sure that he will do great things.
- Anna Choromanska visited me this summer, where we worked on extreme multiclass classification. She is very good at focusing on a problem and grinding it into submission both in theory and in practice—I can see why she wins awards for her work. Anna’s future in research is quite promising.
I also wanted to mention some postdoc openings in machine learning.
- In New York Leon Bottou, Miro Dudik, and I are looking for someone. The deadline is December 13.
- In New England, Sham Kakade and Adam Kalai are looking for someone. The deadline is December 13.
- Also in the New York area, Daniel Hsu and Tong Zhang may both be considering a postdoc with no particular deadline.
- In England, Peter Flach is looking for two postdocs on a health & machine learning project with a deadline of December 2. I consider machine learning for healthcare of critical importance in the future.
There will be no New York ML Symposium this year. The core issue is that NYAS is disorganized by people leaving, pushing back the date, with the current candidate a spring symposium on March 28. Gunnar and I were outvoted here—we were gung ho on organizing a fall symposium, but the rest of the committee wants to wait.
In some good news, most of the ICML 2012 videos have been restored from a deep backup.
Manik and I are organizing the extreme classification workshop at NIPS this year. We have a number of good speakers lined up, but I would further encourage anyone working in the area to submit an abstract by October 9. I believe this is an idea whose time has now come.
The NIPS website doesn’t have other workshops listed yet, but I expect several others to be of significant interest.
A big ouch—all the videos for ICML 2012 were lost in a shuffle. Rajnish sends the below, but if anyone can help that would be greatly appreciated.
Sincere apologies to ICML community for loosing 2012 archived videos
What happened: In order to publish 2013 videos, we decided to move 2012 videos to another server. We have a weekly backup service from the provider but after removing the videos from the current server, when we tried to retrieve the 2012 videos from backup service, the backup did not work because of provider-specific requirements that we had ignored while removing the data from previous server.
What are we doing about this: At this point, we are still looking into raw footage to find if we can retrieve some of the videos, but following are the steps we are taking to make sure this does not happen again in future:
(1) We are going to create a channel on Vimeo (and potentially on YouTube) and we will publish there the p-in-p- or slide-versions of the videos. This will be available by the beginning of Oct 2013.
(2) We are going to provide download links from TechTalks so that the slide-version (of p-in-p- version if availbale) of the videos can be directly downloaded by viewers.This feature will be available by Aug 4th 2013.
(3) Of course we are now creating regular backups that do not depend on our service provider.
How can you help: If you have downloaded from TechTalks the ICML 2012 videos using external tools, we will really appreciate if you can provide us the videos, please email at firstname.lastname@example.org .
The large scale machine learning class I taught with Yann LeCun has finished. As I expected, it took quite a bit of time :-). We had about 25 people attending in person on average and 400 regularly watching the recorded lectures which is substantially more sustained interest than I expected for an advanced ML class. We also had some fun with class projects—I’m hopeful that several will eventually turn into papers.
I expect there are a number of professors interested in lecturing on this and related topics. Everyone will have their personal taste in subjects of course, but hopefully there will be some convergence to common course materials as well. To help with this, I am making the sources to my presentations available. Feel free to use/improve/embelish/ridicule/etc… in the pursuit of the perfect course.
ICML registration is also available, at about an x3 higher cost. My understanding is that this is partly due to the costs of a larger conference being harder to contain, partly due to ICML lasting twice as long with tutorials and workshops, and partly because the conference organizers were a bit over-conservative in various ways.
Adam Kalai points out the New England Machine Learning Day May 1 at MSR New England. There is a poster session with abstracts due April 19. I understand last year’s NEML went well and it’s great to meet your neighbors at regional workshops like this.
Sebastien Bubeck has a new ML blog focused on optimization and partial feedback which may interest people.
Yann and I have arranged so that people who are interested in our large scale machine learning class and not able to attend in person can follow along via two methods.
- Videos will be posted with about a 1 day delay on techtalks. This is a side-by-side capture of video+slides from Weyond.
- We are experimenting with Piazza as a discussion forum. Anyone is welcome to subscribe to Piazza and ask questions there, where I will be monitoring things. update2: Sign up here.
The first lecture is up now, including the revised version of the slides which fixes a few typos and rounds out references.
Yann LeCun and I are coteaching a class on Large Scale Machine Learning starting late January at NYU. This class will cover many tricks to get machine learning working well on datasets with many features, examples, and classes, along with several elements of deep learning and support systems enabling the previous.
This is not a beginning class—you really need to have taken a basic machine learning class previously to follow along. Students will be able to run and experiment with large scale learning algorithms since Yahoo! has donated servers which are being configured into a small scale Hadoop cluster. We are planning to cover the frontier of research in scalable learning algorithms, so good class projects could easily lead to papers.
For me, this is a chance to teach on many topics of past research. In general, it seems like researchers should engage in at least occasional teaching of research, both as a proof of teachability and to see their own research through that lens. More generally, I expect there is quite a bit of interest: figuring out how to use data to make predictions well is a topic of growing interest to many fields. In 2007, this was true, and demand is much stronger now. Yann and I also come from quite different viewpoints, so I’m looking forward to learning from him as well.
We plan to videotape lectures and put them (as well as slides) online, but this is not a MOOC in the sense of online grading and class certificates. I’d prefer that it was, but there are two obstacles: NYU is still figuring out what to do as a University here, and this is not a class that has ever been taught before. Turning previous tutorials and class fragments into coherent subject matter for the 50 students we can support at NYU will be pretty challenging as is. My preference, however, is to enable external participation where it’s easily possible.
Suggestions or thoughts on the class are welcome
2012 was a tumultuous year for me, but it was undeniably a great year for deep learning efforts. Signs of this include:
- Winning a Kaggle competition.
- Wide adoption of deep learning for speech recognition.
- Significant industry support.
- Gains in image recognition.
This is a rare event in research: a significant capability breakout. Congratulations are definitely in order for those who managed to achieve it. At this point, deep learning algorithms seem like a choice undeniably worth investigating for real applications with significant data.
The main program will feature invited talks from Peter Bartlett, William Freeman, and Vladimir Vapnik, along with numerous spotlight talks and a poster session. Following the main program, hackNY and Microsoft Research are sponsoring a networking hour with talks from machine learning practitioners at NYC startups (specifically bit.ly, Buzzfeed, Chartbeat, and Sense Networks, Visual Revenue). This should be of great interest to everyone considering working in machine learning.