May 16 in Cambridge, is the New England Machine Learning Day, a first regional workshop/symposium on machine learning. To present a poster, submit an abstract by May 5.
has died. He lived a full life. I know him personally as a founder of the Center for Computational Learning Systems and the New York Machine Learning Symposium, both of which have sheltered and promoted the advancement of machine learning. I expect much of the New York area machine learning community will miss him, as well as many others around the world.
Sasha is the open problems chair for both COLT and ICML. Open problems will be presented in a joint session in the evening of the COLT/ICML overlap day. COLT has a history of open sessions, but this is new for ICML. If you have a difficult theoretically definable problem in machine learning, consider submitting it for review, due March 16. You’ll benefit three ways:
- The effort of writing down a precise formulation of what you want often helps you understand the nature of the problem.
- Your problem will be officially published and citable.
- You might have it solved by some very intelligent bored people.
The general idea could easily be applied to any problem which can be crisply stated with an easily verifiable solution, and we may consider expanding this in later years, but for this year all problems need to be of a theoretical variety.
Joelle and I (and Mahdi, and Laurent) finished an initial assignment of Program Committee and Area Chairs to papers. We’ll be updating instructions for the PC and ACs as we field questions. Feel free to comment here on things of plausible general interest, but email us directly with specific concerns.
For graduate students, the Yahoo! Key Scientific Challenges program including in machine learning is on again, due March 9. The application is easy and the $5K award is high quality “no strings attached” funding. Consider submitting.
The From Data to Knowledge workshop May 7-11 at Berkeley should be of interest to the many people encountering streaming data in different disciplines. It’s run by a group of astronomers who encounter streaming data all the time. I met Josh Bloom recently and he is broadly interested in a workshop covering all aspects of Machine Learning on streaming data. The hope here is that techniques developed in one area turn out useful in another which seems quite plausible. Particularly if you are in the bay area, consider checking it out.
- The cluster parallel learning code better supports multiple simultaneous runs, and other forms of parallelism have been mostly removed. This incidentally significantly simplifies the learning core.
- The online learning algorithms are more general, with support for l1 (via a truncated gradient variant) and l2 regularization, and a generalized form of variable metric learning.
- There is a solid persistent server mode which can train online, as well as serve answers to many simultaneous queries, either in text or binary.
This should be a very good release if you are just getting started, as we’ve made it compile more automatically out of the box, have several new examples and updated documentation.
- Miro will cover the L-BFGS implementation, which he created from scratch. We have found this works quite well amongst batch learning algorithms.
- Alekh will cover how to do cluster parallel learning. If you have access to a large cluster, VW is orders of magnitude faster than any other public learning system accomplishing linear prediction. And if you are as impatient as I am, it is a real pleasure when the computers can keep up with you.
This will be recorded, so it will hopefully be available for viewing online before too long.
I hope to see you soon
Everyone should have received notice for NY ML Symposium abstracts. Check carefully, as one was lost by our system.
The event itself is October 21, next week. Leon Bottou, Stephen Boyd, and Yoav Freund are giving the invited talks this year, and there are many spotlights on local work spread throughout the day. Chris Wiggins has setup 6(!) ML-interested startups to follow the symposium, which should be of substantial interest to the employment interested.
I also wanted to give an update on ICML 2012. Unlike last year, our deadline is coordinated with AIStat (which is due this Friday). The paper deadline for ICML has been pushed back to February 24 which should allow significant time for finishing up papers after the winter break. Other details may interest people as well:
- We settled on using CMT after checking out the possibilities. I wasn’t looking for this, because I’ve often found CMT clunky in terms of easy access to the right information. Nevertheless, the breadth of features and willingness to support new/better approaches to reviewing was unrivaled. We are also coordinating with Laurent, Rich, and CMT to enable their paper/reviewer recommendation system. The outcome should be a standardized interface in CMT for any recommendation system, which others can then code to if interested.
- Area chairs have been picked. The list isn’t sacred, so if we discover significant holes in expertise we’ll deal with it. We expect to start inviting PC members in a little while. Right now, we’re looking into invited talks. If you have any really good suggestions, they could be considered.
- CCC is interested in sponsoring travel costs for any climate/environment related ML papers, which seems great to us. In general, this seems like an area of growing interest.
- We now have a permanent server and the beginnings of the permanent website setup. Much more work needs to be done here.
- We haven’t settled yet on how videos will work. Last year, ICML experimented with Weyond with results here. Previously, ICML had used videolectures, which is significantly more expensive. If you have an opinion about cost/quality tradeoffs or other options, speak up.
- Plans for COLT have shifted slightly—COLT will start a day early, overlap with tutorials, then overlap with a coordinated first day of ICML conference papers.
Various people want to use hunch.net to announce things. I’ve generally resisted this because I feared hunch becoming a pure announcement zone while I am much more interested contentful posts and discussion personally. Nevertheless there is clearly some value and announcements are easy, so I’m planning to summarize announcements on Mondays.
- D. Sculley points out an interesting Semisupervised feature learning competition, with a deadline of October 17.
- Lihong Li points out the webscope user interaction dataset which is the first high quality exploration dataset I’m aware of that is publicly available.
- Seth Rogers points out CrossValidated which looks similar in conception to metaoptimize, but directly using the stackoverflow interface and with a bit more of a statistics twist.
Many Machine Learning related events are coming up this fall.
- September 9, abstracts for the New York Machine Learning Symposium are due. Send a 2 page pdf, if interested, and note that we:
- widened submissions to be from anybody rather than students.
- set aside a larger fraction of time for contributed submissions.
- September 15, there is a machine learning meetup, where I’ll be discussing terascale learning at AOL.
- September 16, there is a CS&Econ day at New York Academy of Sciences. This is not ML focused, but it’s easy to imagine interest.
- September 23 and later NIPS workshop submissions start coming due. As usual, there are too many good ones, so I won’t be able to attend all those that interest me. I do hope some workshop makers consider ICML this coming summer, as we are increasing to a 2 day format for you. Here are a few that interest me:
- Big Learning is about dealing with lots of data. Abstracts are due September 30.
- The Bayes Bandits workshop. Abstracts are due September 23.
- The Personalized Medicine workshop
- The Learning Semantics workshop. Abstracts are due September 26.
- The ML Relations workshop. Abstracts are due September 30.
- The Hierarchical Learning workshop. Challenge submissions are due October 17, and abstracts are due October 21.
- The Computational Tradeoffs workshop. Abstracts are due October 17.
- The Model Selection workshop. Abstracts are due September 24.
- October 16-17 is the Singularity Summit in New York. This is for the AIists and only peripherally about ML.
- October 16-21 is a Predictive Analytics World in New York. As machine learning goes industrial, we see industrial-style conferences rapidly developing.
- October 21, there is the New York ML Symposium. In addition to what’s there, Chris Wiggins is looking into setting up a session for startups and those interested in them to get to know each other, as last year.
- Decembr 16-17 NIPS workshops in Granada, Spain.
Ron Bekkerman initiated an effort to create an edited book on parallel machine learning that Misha and I have been helping with. The breadth of efforts to parallelize machine learning surprised me: I was only aware of a small fraction initially.
This put us in a unique position, with knowledge of a wide array of different efforts, so it is natural to put together a survey tutorial on the subject of parallel learning for KDD, tomorrow. This tutorial is not limited to the book itself however, as several interesting new algorithms have come out since we started inviting chapters.
This tutorial should interest anyone trying to use machine learning on significant quantities of data, anyone interested in developing algorithms for such, and of course who has bragging rights to the fastest learning algorithm on planet earth
(Also note the Modeling with Hadoop tutorial just before ours which deals with one way of trying to speed up learning algorithms. We have almost no overlap.)
I just released Vowpal Wabbit 6.0. Since the last version:
- VW is now 2-3 orders of magnitude faster at linear learning, primarily thanks to Alekh. Given the baseline, this is loads of fun, allowing us to easily deal with terafeature datasets, and dwarfing the scale of any other open source projects. The core improvement here comes from effective parallelization over kilonode clusters (either Hadoop or not). This code is highly scalable, so it even helps with clusters of size 2 (and doesn’t hurt for clusters of size 1). The core allreduce technique appears widely and easily reused—we’ve already used it to parallelize Conjugate Gradient, LBFGS, and two variants of online learning. We’ll be documenting how to do this more thoroughly, but for now “README_cluster” and associated scripts should provide a good starting point.
- The new LBFGS code from Miro seems to commonly dominate the existing conjugate gradient code in time/quality tradeoffs.
- The new matrix factorization code from Jake adds a core algorithm.
- We finally have basic persistent daemon support, again with Jake’s help.
- Adaptive gradient calculations can now be made dimensionally correct, following up on Paul’s post, yielding a better algorithm. And Nikos sped it up further with SSE native inverse square root.
- The LDA core is perhaps twice as fast after Paul educated us about SSE and representational gymnastics.
All of the above was done without adding significant new dependencies, so the code should compile easily.
The VW mailing list has been slowly growing, and is a good place to ask questions.
Unfortunately, I ended up sick for much of this ICML. I did manage to catch one interesting paper:
Joelle and I are program chairs for ICML 2012 in Edinburgh, which I previously enjoyed visiting in 2005. This is a huge responsibility, that we hope to accomplish well. A part of this (perhaps the most fun part), is imagining how we can make ICML better. A key and critical constraint is choosing things that can be accomplished. So far we have:
- Colocation. The first thing we looked into was potential colocations. We quickly discovered that many other conferences precomitted their location. For the future, getting a colocation with ACL or SIGIR, seems to require more advanced planning. If that can be done, I believe there is substantial interest—I understand there was substantial interest in the joint symposium this year. What we did manage was achieving a colocation with COLT and there is an outside chance that a machine learning summer school will precede the main conference. The colocation with COLT is in both time and space, with COLT organized as (essentially) a separate track in a nearby building. We look forward to organizing a joint invited session or two with the COLT program chairs.
- Tutorials. We don’t have anything imaginative here, except for pushing for quality tutorials, probably through a mixture of invitations and a call. There is a small chance we’ll be able to organize a machine learning summer school as a prequel, which would be quite cool, but several things have to break right for this to occur.
- Conference. We are considering a few tinkerings with the conference format.
- Shifting a conference banquet to be during the workshops, more tightly integrating the workshops.
- Having 3 nights of posters (1 per day) rather than 2 nights. This provides more time/poster, and avoids halving talks and posters appear on different days.
- Having impromptu sessions in the evening. Two possibilities here are impromptu talks and perhaps a joint open problems session with COLT. I’ve made sure we have rooms available so others can organize other things.
- We may go for short presentations (+ a poster) for some papers, depending on how things work out schedulewise. My opinions on this are complex. ICML is traditionally multitrack with all papers having a 25 minute-ish presentation. As a mechanism for research, I believe this is superior to a single track conference of a similar size because:
- Typically some talk of potential interest can always be found by participants avoiding the boredom problem which comes up at a single track conference
- My experience is that program organizers have a limited ability to foresee which talks are of most interest, commonly creating a misallocation of attention.
On the other hand, there are clearly limits to the number of tracks that are reasonable, and I feel like ICML (especially with COLT cotimed) is near the upper limit. There are also some papers which have a limited scope of interest, for which a shorter presentation is reasonable.
- Workshops. A big change here—we want to experiment with 2 days of workshops rather than 1. There seems to be demand for it, as the number of workshops historically is about 10, enough that it’s easy to imagine people commonly interested in 2 workshops. It’s also the case that NIPS has had to start rejecting a substantial fraction of workshop submissions for space reasons. I am personally a big believer in workshops as a mechanism for further research, so I hope this works out well.
Journal integration. I tend to believe that we should be shifting to a journal format for ICML papers, as per many past discussions. After thinking about this the easiest way seems to be simply piggybacking on existing journals such as JMLR and MLJ by essentially declaring that people could submit there first, and if accepted, and not otherwise presented at a conference, present at ICML. This was considered too large a change, so it is not happening. Nevertheless, it is a possible tweak that I believe should be considered for the future. My best guess is that this would never displace the baseline conference review process, but it would help some papers that don’t naturally fit into a conference format while keeping quality high.
- Reviewing. Drawing on plentiful experience with what goes wrong, I think we can create the best reviewing system for conferences. We are still debating exact details here while working through what is possible in different conference systems. Nevertheless, some basic goals are:
- Double Blind [routine now] Two identical papers with different authors should have the same chance of success. In terms of reviewing quality, I think double blind makes little difference in the short term, but the public commitment to fair reviewing makes a real difference in the long term.
- Author Feedback [routine now] Author feedback makes a difference in only a small minority of decisions, but I believe its effect is larger as (a) reviewer quality improves and (b) reviewer understanding improves. Both of these are silent improvers of quality. Somewhat less routine, we are seeking a mechanism for authors to be able to provide feedback if additional reviews are requested, as I’ve become cautious of the late-breaking highly negative review.
- Paper Editing. Geoff Gordon tweaked AIStats this year to allow authors to revise papers during feedback. I think this is helpful, because it encourages authors to fix clarity issues immediately, rather than waiting longer. This helps with some things, but it is not a panacea—authors still have to convince reviewers their paper is worthwhile, and given the way people are first impressions are lasting impressions.
- Multisource reviewing. We want all of the initial reviews to be assigned by good yet different mechanisms. In the past, I’ve observed that the source of reviewer assignments can greatly bias the decision outcome, all the way from “accept with minor revisions” to “reject” in the case of a JMLR submission that I had. Our plan at the moment is that one review will be assigned by bidding, one by a primary area chair, and one by a secondary area chair.
- No single points of failure. When Bob Williamson and I were PC members for learning theory at NIPS, we each came to a decisions given reviews and then reconciled differences. This made a difference on about 5-10% of decisions, and (I believe) improved overall quality a bit. More generally, I’ve seen instances where an area chair has an unjustifiable dislike for a paper and kills it off, which this mechanism avoids.
- Speed. In general, I believe speed and good decision making are antagonistic. Nevertheless, we believe it is important to try to do the reviewing both quickly and well. Doing things quickly implies that we can push the submission deadline back later, providing authors more time to make quality papers. Key elements of doing things well fast are: good organization (that’s all on us), light loads for everyone involved (i.e. not too many papers), crowd sourcing (i.e. most decisions made by area chairs), and some amount of asynchrony. Altogether, we believe at the moment that two weeks can be shaved from our reviewing process.
- Website. Traditionally at ICML, every new local organizer was responsible for creating a website. This doesn’t make sense anymore, because substantial work is required there, which can and should be amortized across the years so that the website can evolve to do more for the community. We plant to create a permanent website, based around some combination of icml.cc and machinelearning.org. I think this just makes sense.
- Publishing. We are thinking about strongly encouraging authors to use arxiv for final submissions. This provides a lasting backing store for ICML papers, as well as a mechanism for revisions. The reality here is that some mistakes get into even final drafts, so a way to revise for the long term is helpful. We are also planning to videotape and make available all talks, although a decision between videolectures and Weyond has not yet been made.
Implementing all the changes above is ambitious, but I believe feasible and that each is individually beneficial and to some extent individually evaluatable. I’d like to hear any thoughts you have on this. It’s also not too late if you have further suggestions of your own.
Shravan and Alex‘s LDA code is released. On a single machine, I’m not sure how it currently compares to the online LDA in VW, but the ability to effectively scale across very many machines is surely interesting.
Alina and Jake point out the COLT Call for Open Questions due May 11. In general, this is cool, and worth doing if you can come up with a crisp question. In my case, I particularly enjoyed crafting an open question with precisely a form such that a critic targeting my papers would be forced to confront their fallacy or make a case for the reward. But less esoterically, this is a way to get the attention of some very smart people focused on a problem that really matters, which is the real value.
- There is now a mailing list, which I and several other developers are subscribed to.
- The main website has shifted to the wiki on github. This means that anyone with a github account can now edit it.
- I’m planning to give a tutorial tomorrow on it at eHarmony/the LA machine learning meetup at 10am. Drop by if you’re interested.
The status of VW amongst other open source projects has changed. When VW first came out, it was relatively unique amongst existing projects in terms of features. At this point, many other projects have started to appreciate the value of the design choices here. This includes:
- Mahout, which now has an SGD implementation.
- Shogun, where Soeren is keen on incorporating features.
- LibLinear, where they won the KDD best paper award for out-of-core learning.
This is expected—any open source approach which works well should be widely adopted. None of these other projects yet have the full combination of features, so VW still offers something unique. There are also more tricks that I haven’t yet had time to implement, and I look forward to discovering even more.
Yahoo!’s Key Scientific Challenges for Machine Learning grant applications are due March 11. If you are a student working on relevant research, please consider applying. It’s for $5K of unrestricted funding.