Alekh, John, Ofer, and I are organizing a workshop at NIPS this year on learning in parallel and distributed environments. The general interest level in parallel learning seems to be growing rapidly, so I expect quite a bit of attendance. Please join us if you are parallel-interested.
And, if you are working in the area of parallel learning, please consider submitting an abstract due Oct. 17 for presentation at the workshop.
Joseph Turian has created MetaOptimize for discussion of NLP and ML on big datasets. This includes a blog, but perhaps more importantly a question and answer section. I’m hopeful it will take off.
The second Netflix prize is canceled due to privacy problems. I continue to believe my original assessment of this paper, that the privacy break was somewhat overstated. I still haven’t seen any serious privacy failures on the scale of the AOL search log release.
I expect privacy concerns to continue to be a big issue when dealing with data releases by companies or governments. The theory of maintaining privacy while using data is improving, but it is not yet in a state where the limits of what’s possible are clear let alone how to achieve these limits in a manner friendly to a prediction competition.
Yahoo! is sponsoring two machine learning events that might interest people.
- The Key Scientific Challenges program (due March 5) for Machine Learning and Statistics offers $5K (plus bonuses) for graduate students working on a core problem of interest to Y! If you are already working on one of these problems, there is no reason not to submit, and if you aren’t you might want to think about it for next year, as I am confident they all press the boundary of the possible in Machine Learning. There are 7 days left.
- The Learning to Rank challenge (due May 31) offers an $8K first prize for the best ranking algorithm on a real (and really used) dataset for search ranking, with presentations at an ICML workshop. Unlike the Netflix competition, there are prizes for 2nd, 3rd, and 4th place, perhaps avoiding the heartbreak the ensemble encountered. If you think you know how to rank, you should give it a try, and we might all learn something. There are 3 months left.
and I can’t help but remember him.
I first met Sam as an undergraduate at Caltech where he was TA for Hopfield’s class, and again when I visited Gatsby, when he invited me to visit Toronto, and at too many conferences to recount. His personality was a combination of enthusiastic and thoughtful, with a great ability to phrase a problem so its solution must be understood. With respect to my own work, Sam was the one who advised me to make my first tutorial, leading to others, and to other things, all of which I’m grateful to him for. In fact, my every interaction with Sam was positive, and that was his way.
His death is being called a suicide, which is so incompatible with my understanding of Sam that it strains my credulity. But we know that his many responsibilities were great, and it is well understood that basically all sane researchers have legions of inner doubts. Having been depressed now and then myself, it’s helpful to understand at least intellectually that the true darkness of the now is overestimated, and that you have more friends than you think. Sam was one of mine, and I’ll miss him.
My last interaction with Sam, last week, was discussing a new research direction that interested him, optimizing the cost of acquiring feature information in the learning algorithm. This problem is endemic to real-world applications, and has been studied to some extent elsewhere, but I expect that in our unwritten future history, we’ll discover that further study of this problem is more helpful than almost anyone realizes. The reply that I owed him feels heavy, and an incompleteness is hanging. For his wife and children it is surely so incomparably greater that I lack words.
(Added) Others: Fernando, Kevin McCurley, Danny Tarlow, David Hogg, Yisong Yue, Lance Fortnow on Sam, a Memorial site, and a Memorial Fund
I’d like to point out Inherent Uncertainty, which I’ve added to the ML blog post scanner on the right. My understanding from Jake is that the intention is to have a multiauthor blog which is more specialized towards learning theory/game theory than this one. Nevertheless, several of the posts seem to be of wider interest.
Several events are happening in the NY area.
- Barriers in Computational Learning Theory Workshop, Aug 28. That’s tomorrow near Princeton. I’m looking forward to speaking at this one on “Getting around Barriers in Learning Theory”, but several other talks are of interest, particularly to the CS theory inclined.
- Claudia Perlich is running the INFORMS Data Mining Contest with a deadline of Sept. 25. This is a contest using real health record data (they partnered with HealthCare Intelligence) to predict transfers and mortality. In the current US health care reform debate, the case studies of high costs we hear strongly suggest machine learning & statistics can save many billions.
- The Singularity Summit October 3&4. This is for the AIists out there. Several of the talks look interesting, although unfortunately I’ll miss it for ALT.
- Predictive Analytics World, Oct 20-21. This is stretching the definition of “New York Area” a bit, but the train to DC is reasonable. This is a conference of case studies of applications of ML to real-world problems.
- Machine Learning Symposium, Friday Nov. 6. I’m on the committee again this year. The abstract deadline is Sept. 30, and we already have several speakers lined up.
There are at least 3 summer schools related to machine learning this summer.
- The first is at University of Chicago June 1-11 organized by Misha Belkin, Partha Niyogi, and Steve Smale. Registration is closed for this one, meaning they met their capacity limit. The format is essentially an extended Tutorial/Workshop. I was particularly interested to see Valiant amongst the speakers. I’m also presenting Saturday June 6, on logarithmic time prediction.
- Praveen Srinivasan points out the second at Peking University in Beijing, China, July 20-27. This one differs substantially, as it is about vision, machine learning, and their intersection. The deadline for applications is June 10 or 15. This is also another example of the growth of research in China, with active support from NSF.
- The third one is at Cambridge, England, August 29-September 10. It’s in the MLSS series. Compared to the Chicago one, this one is more about the Bayesian side of ML, although effort has been made to create a good cross section of topics. It’s also more focused on tutorials over workshop-style talks.
Mark Reid has set up a discussion site for ICML papers again this year and Monica Dinculescu has linked it in from the ICML site. Last year’s attempt appears to have been an acceptable but not wild success, as a little bit of fruitful discussion occurred. I’m hoping this year will be a bit more of a success, so please don’t be shy.
I’d like to also point out that ICML’s early registration deadline has a few hours left, while UAI’s and COLT’s are in a week.
This post is partly meant as an advertisement for the reductions tutorial Alina, Bianca, and I are planning to do at ICML. Please come, if you are interested.
Many research programs can be thought of as finding and building new useful abstractions. The running example I’ll use is learning reductions where I have experience. The basic abstraction here is that we can build a learning algorithm capable of solving classification problems up to a small expected regret. This is used repeatedly to solve more complex problems.
In working on a new abstraction, I think you typically run into many substantial problems of understanding, which make publishing particularly difficult.
- It is difficult to seriously discuss the reason behind or mechanism for abstraction in a conference paper with small page limits. People rarely see such discussions and hence have little basis on which to think about new abstractions. Another difficulty is that when building an abstraction, you often don’t know the right way to state things.
Here’s my current attempt: The process of abstraction for learning reductions can start with sample complexity bounds (or online learning against an adversary analysis). A very simple sample complexity bound is that for all sets of hypotheses $H$, for all distributions $D$ on examples $(x,y)$, and for all confidence parameters $d$:

$$\Pr_{(x,y)^m \sim D^m}\left( \forall h \in H:\ \left| e(h,D) - e(h,(x,y)^m) \right| < \left( \frac{\ln(|H|/d)}{m} \right)^{0.5} \right) > 1 - d$$

Here $(x,y)^m$ is a sequence of $m$ IID samples, $e(h,D)$ is the error rate of $h$ on $D$, and $e(h,(x,y)^m)$ is the empirical error rate of $h$ on the set of IID samples.
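To get a feel for the numbers, here is a minimal sketch that evaluates the width $(\ln(|H|/d)/m)^{0.5}$ from the bound above. The function names `deviation_bound` and `samples_needed` are made up for illustration, and the example values of $|H|$, $m$, and $d$ are arbitrary assumptions.

```python
import math


def deviation_bound(num_hypotheses: int, m: int, d: float) -> float:
    """Width sqrt(ln(|H|/d)/m) of the uniform deviation bound stated above."""
    if num_hypotheses < 1 or m < 1 or not 0.0 < d < 1.0:
        raise ValueError("need |H| >= 1, m >= 1, and 0 < d < 1")
    return math.sqrt(math.log(num_hypotheses / d) / m)


def samples_needed(num_hypotheses: int, width: float, d: float) -> int:
    """Smallest m for which the bound's width is at most `width`."""
    if width <= 0.0:
        raise ValueError("width must be positive")
    return math.ceil(math.log(num_hypotheses / d) / width ** 2)


if __name__ == "__main__":
    # Arbitrary illustrative values: 1000 hypotheses, 10000 samples.
    for d in (0.05, 0.005, 0.0005):
        print(d, deviation_bound(1000, 10000, d))
    print(samples_needed(1000, 0.05, 0.05))
```

Shrinking $d$ by a factor of ten moves the width only slightly, since $d$ enters through a logarithm.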
The previous bound is a very simple example, and yet remarkably complex both to state and to interpret—many people have been lost by the meaning of d. The impact of this complexity is that it is difficult to effectively use these bounds in practical learning algorithm design, particularly in solving more complex learning problems where much more than one bit of prediction is required. This was a central frustration that I ran into in my thesis work. Some progress has been made since then, but it is still quite difficult. The abstraction in the learning reduction setting is:
- You throw away d, because it only has a logarithmic dependence anyways.
- You eliminate H and m on the theory that intelligent choices for H and m are made in practice.
- You eliminate the IID assumption, because it is no longer needed to define things
The statement then is

$$e(A((x,y)^m),D) - e(h^*,D) < \epsilon$$

where $A(\cdot)$ is the hypothesis output by the learning algorithm, $h^*$ is the best possible predictor, and $\epsilon$ is used to parameterize the theorems. This abstraction is radical in some sense, but something radical was needed to yield tractable and useful analysis on the complex problems people need to solve in practice.
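As a concrete illustration of the regret quantity, here is a small sketch on a toy distribution with finite support. The distribution `D`, the predictor `always_zero`, and all function names are hypothetical, chosen only so the quantities can be computed exactly. Note that nothing here refers to a hypothesis set, a sample size, or a confidence parameter.

```python
from typing import Callable, Dict, Hashable, Tuple

Distribution = Dict[Tuple[Hashable, int], float]

# Hypothetical toy distribution over (x, y) pairs; probabilities sum to 1.
D: Distribution = {
    ("a", 0): 0.40, ("a", 1): 0.10,
    ("b", 0): 0.15, ("b", 1): 0.35,
}


def error_rate(h: Callable[[Hashable], int], dist: Distribution) -> float:
    """e(h, D): probability under dist that h(x) differs from y."""
    return sum(p for (x, y), p in dist.items() if h(x) != y)


def best_predictor(dist: Distribution) -> Callable[[Hashable], int]:
    """h*: for each x, predict the label carrying the most probability mass."""
    mass: Dict[Hashable, Dict[int, float]] = {}
    for (x, y), p in dist.items():
        mass.setdefault(x, {}).setdefault(y, 0.0)
        mass[x][y] += p
    table = {x: max(labels, key=labels.get) for x, labels in mass.items()}
    return lambda x: table[x]


def regret(h: Callable[[Hashable], int], dist: Distribution) -> float:
    """e(h, D) - e(h*, D), the left-hand side of the statement above."""
    return error_rate(h, dist) - error_rate(best_predictor(dist), dist)


def always_zero(x: Hashable) -> int:
    """A deliberately weak predictor used for comparison."""
    return 0


if __name__ == "__main__":
    print(regret(always_zero, D))
    print(regret(best_predictor(D), D))
```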
- A consequence of lack of familiarity is that people often misread. In reading a paper, there is a temptation to not read carefully and fill in your understanding of things. Most of the time this works out well, but not here. For example, we saw many instances where people inserted IID sample assumptions or other things that simply weren’t there.
- Once you get past the lack of familiarity and misunderstandings, there is a feeling that the new abstraction is cheating. To some extent I understand, as I remember learning about abstractions in class, and I remember feeling that they were in some real sense cheating by dropping important details. For example:
- Big-O notation provides an upper bound specified up to constants. For example, O(log n) computational complexity means there exists a constant c such that the number of operations required is less than c log n. Big-O can be abused by hiding “constants” larger than the plausible values for the parameters. In machine learning, a particularly egregious case occurs in Bandit analysis where the punchline of some papers is “logarithmic regret”, hiding an arbitrarily large problem dependent constant.
- TCP provides a mechanism for reliable transport over an unreliable network. It is a very commonly used mechanism for sending information over the internet—you used TCP in reading this. TCP is both a programming construct and a mechanism for abstracting communicating over a network. The TCP abstraction is broken when the network is too unreliable for it to recover, such as on sketchy wireless networks where the programmer built for the TCP abstraction which wasn’t delivered.
- Dimensional analysis is a technique for quick analysis in physics. The basic idea is to just look at the units when estimating some quantity and combine them to get the right unit answer. For example, to compute the distance $d$ traveled after time $t$ with acceleration $a$, you simply use $at^2$, since that formula is the only way to combine $a$ with units of distance/time$^2$ and $t$ with units of time to get units of distance. This answer is off by a factor of 2 from what a more detailed analysis using integration yields, which is typical. Dimensional analysis can be misleading when the constants are very large. One example is in Gravitation where there is a table with time and distance equated since they are related by a constant, the speed of light $3 \times 10^8$ m/s. For example, $E=mc^2$ becomes $E=m$.
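The factor of 2 in the dimensional analysis example can be recovered with a short integration. This is a sketch assuming constant acceleration $a$ and a start from rest at position zero, which the example above leaves implicit.

```latex
v(t) = \int_0^t a \, ds = a t,
\qquad
d(t) = \int_0^t v(s) \, ds = \int_0^t a s \, ds = \tfrac{1}{2} a t^2 .
```

So the units-only estimate $at^2$ has the right dependence on $a$ and $t$ and overshoots by exactly the constant $2$.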
Although the above breakages are real, the usefulness of these abstractions, in terms of allowing us to quickly think about and make decisions more than offsets the drawbacks. Indeed, even the breakages stated above are thought provoking or useful enough that I can’t even say it is wrong to consider them. This property that abstractions can be abused is generically essential to the process of abstraction itself. Abstraction is about neglecting details, and when these details are not neglectable, the abstraction is abused or ineffective. Because of this, any abstraction is insufficient for analyzing and solving real problems where the neglected details matter.
Just as for these abstractions, the learning reduction abstraction can be abused—the chosen learning algorithm can be pathetic yielding vacuous bounds, or the reduction can scramble the feature information with an encryption algorithm making it so no reasonable learning algorithm could yield other than pathetic performance. Similarly, there are situations in which I don’t know how to effectively use a learning reduction to build a learning algorithm, and it seems implausible that observation changes as more is learned in the future.
For a good abstraction, the drawbacks are matched by the advantages. The principal advantage is that there is a new way to examine and solve problems. This has several interesting effects.
- A good abstraction can capture a more complete specification of the problem. As an example, the sample complexity view of learning is broken in practice, because when insufficient performance is achieved people choose a different set of hypotheses H by throwing in additional features or choosing a different learning algorithm. Capturing this process in the sample complexity view requires an additional level of complexity. In the reduction view, this is entirely natural, because any means for achieving a better generalization—more/better features, a better learning algorithm, a better prior, sticking a human in the learning process, etc… are legitimate. This is particularly powerful when architecting solutions, providing a partial answer to the “What?” question Yehuda pointed out.
- A higher level abstraction can let you accidentally solve problems in other areas as well. A good example of this is error correcting tournaments which are useful for tournament design to select the best player/team/paper in real tournaments. Recently, I was amused to learn that a standard betting procedure for basketball tournaments exactly mirrors the importance weights suggested for the final elimination of ECTs. The first phase of ECTs provides a sound and practical method to seed a final elimination tournament, eliminating the need for (and biases of) a committee.
- Perhaps the most interesting effect is that the new abstraction can aid you in finding effective solutions to new problems. For learning reductions, there are about 3 compelling instances I’ve seen so far.
- Given training-time access to a good policy oracle, Searn provides a method for decomposing any complex prediction problem into simple problems, such that low regret solutions to the simple problems imply a low regret solution to the original problem. While Searn competes well (computationally and prediction-wise) with existing methods for linear chain style structured prediction, it really shines on more complex problems. Hal used Searn for automatic document summarization (see section 6.2) which previously wasn’t really solved via ML. More generally, when I learn about the details of other complex prediction systems for machine translation or vision, the base algorithms are tweaked, typically in ways that Searn would suggest. This suggests that Searn formalizes and automates the intuitions of practical people.
- The “one step RL” reduction in Bianca’s thesis (page 119) provided tractable and effective approaches to learning in partial feedback problems where only the loss of a chosen label is learned. An even simpler reduction exists as a matter of folklore: estimate the value of each label and then take an argmax. However, we have found classification approaches generally work better, where applicable, and as the theory suggests.
- Many commonly used algorithms for prediction have a running time linear (or worse) in the number of labels, with decision trees a good exception. While simply predicting faster isn’t normally solving a “new problem”, an exponential improvement in computational time seems to merit this description because it allows entirely new kinds of applications. It turns out that it is both very easy to do logarithmic time prediction wrong, and that this problem is often fixable. Furthermore, it appears logarithmic time prediction can really work in practice over very many labels.
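The folklore reduction for partial feedback mentioned above (estimate the value of each label, then take an argmax) can be sketched as follows. This is only an illustration under loud assumptions: the "regressor" is a plain per-(context, label) average rather than a real learning algorithm, feedback is recorded as a reward (higher is better) rather than a loss, and the names `fit_values`, `predict`, and `log` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# One logged event: (context x, chosen label, observed reward for that label).
Event = Tuple[Hashable, int, float]
Values = Dict[Tuple[Hashable, int], float]


def fit_values(events: Iterable[Event]) -> Values:
    """Estimate the mean reward of each (context, label) pair seen in the log."""
    totals: Dict[Tuple[Hashable, int], float] = defaultdict(float)
    counts: Dict[Tuple[Hashable, int], int] = defaultdict(int)
    for x, label, reward in events:
        totals[(x, label)] += reward
        counts[(x, label)] += 1
    return {key: totals[key] / counts[key] for key in totals}


def predict(values: Values, x: Hashable, labels: List[int],
            default: float = 0.0) -> int:
    """Take an argmax over the estimated label values for context x.

    Labels never observed with x fall back to `default` (an assumption).
    """
    return max(labels, key=lambda label: values.get((x, label), default))


# Hypothetical log: only the reward of the chosen label is ever observed.
log: List[Event] = [
    ("sunny", 0, 1.0), ("sunny", 1, 0.0), ("sunny", 0, 0.0),
    ("rainy", 1, 1.0), ("rainy", 0, 0.0),
]

if __name__ == "__main__":
    values = fit_values(log)
    print(predict(values, "sunny", [0, 1]), predict(values, "rainy", [0, 1]))
```

The classification approaches referred to above replace the value-estimation step with a classifier trained on the same kind of log; that is not shown here.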
When we started working on learning reductions, I had no idea what either the difficulties or rewards were going to be—it simply seemed like a natural and compelling direction of investigation. Given the substantial difficulties encountered, it’s not at all clear that this pursuit was personally worthwhile. It has cost much time which could have been put to good use in other ways.
On the other hand, the advantages are also substantial. I’ve learned something about architecting solutions to problems, both expanding the domain of application for the field and providing a personal edge that I can bring to many conversations about ML. It’s also progress towards the AI goal, which interests me. When I think of what I could have worked on instead to achieve these goals, I don’t have any more compelling answer yet. Learning reductions seem to have accomplished more per unit thought than any other theoretical approach I can identify over the last 5 or 6 years. Furthermore, they are composable by design, so they should stay relevant (and perhaps even become more so), when people use an online active deep semisupervised probabilistic convolutional algorithm to solve a problem, particularly for complex problems.
As I said at the beginning, please join us for the tutorial, if you are interested.
Jonathan Chang has a research blog on aspects of machine learning.
Adam Klivans points out the COLT call for papers. The important points are:
- Due Feb 13.
- Montreal, June 18-21.
- This year, there is author feedback.
We’d like to invite hunch.net readers to participate in the NIPS 2008 workshop on kernel learning. While the main focus is on automatically learning kernels from data, we are also looking at the broader questions of feature selection, multi-task learning and multi-view learning. There are no restrictions on the learning problem being addressed (regression, classification, etc), and both theoretical and applied work will be considered. The deadline for submissions is October 24.
More detail can be found here.
Corinna Cortes, Arthur Gretton, Gert Lanckriet, Mehryar Mohri, Afshin Rostamizadeh
If you are in the New York area and interested in machine learning, consider submitting a 2 page abstract to the ML symposium by tomorrow (Sept 5th) midnight. It’s a fun one day affair on October 10 in an awesome location overlooking the world trade center site.
A bit further off (but a real conference) is the AI and Stats deadline on November 5, to be held in Florida April 16-19.
Here are some papers from ICML 2008 that I found interesting.
- Risi Kondor and Karsten Borgwardt, The Skew Spectrum of Graphs. This paper is about a new family of functions on graphs which is invariant under node label permutation. They show that these quantities appear to yield good features for learning.
- Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. This is the first published practical consistent active learning algorithm. The abstract is also pretty impressive.
- Lihong Li, Michael Littman, and Thomas Walsh, Knows What It Knows: A Framework For Self-Aware Learning. This is an attempt to create learning algorithms that know when they err (other work includes Vovk). It’s not yet clear to me what the right model for feature-dependent confidence intervals is.
- Novi Quadrianto, Alex Smola, Tiberio Caetano, and Quoc Viet Le, Estimating Labels from Label Proportions. This is an example of learning in a specialization of the offline contextual bandit setting.
- Filip Radlinski, Robert Kleinberg and Thorsten Joachims, Learning Diverse Rankings with Multi-Armed Bandits. Learning should be used to solve the diversity problem, and doing it in an online bandit-like setting is quite natural. I believe the setting can be generalized to a setting with features without too much work.
- Rich Caruana, Nikos Karampatziakis, Ainur Yessenalina, An Empirical Evaluation of Supervised Learning in High Dimensions. This paper doesn’t need an abstract given the title. I hadn’t previously appreciated how well a random forest works in high dimensions.
- Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Efficient Bandit Algorithms for Online Multiclass Prediction. A paper about an online contextual bandit setting specialized to a multiclass realizable yet otherwise adversarial setting that yields a practical algorithm.
I’d like to add that I thought the conference organization (and the colocation with COLT and UAI) was particularly well done, the best I’ve seen. The key seems to be tight integration of the colocating conference programs and hordes of local volunteers making sure everything is working. I was also happy to see a 10-year award for the best paper from 10 years ago.
COLT has a call for open problems due March 21. I encourage anyone with a specifiable open problem to write it down and send it in. Just the effort of specifying an open problem precisely and concisely has been very helpful for my own solutions, and there is a substantial chance others will solve it. To increase the chance someone will take it up, you can even put a bounty on the solution. (Perhaps I should raise the $500 bounty on the K-fold cross-validation problem as it hasn’t yet been solved).
Helsinki is a fun place to visit.
IMLS (which is the nonprofit running ICML) has set up a new mailing list for Machine Learning News. The list address is MLemail@example.com, and signup requires a google account (which you can create). Only members can send messages.