Machine Learning (Theory)

8/20/2011

The Large Scale Learning Survey Tutorial

Ron Bekkerman initiated an effort to create an edited book on parallel machine learning that Misha and I have been helping with. The breadth of efforts to parallelize machine learning surprised me: I was only aware of a small fraction initially.

This put us in a unique position, with knowledge of a wide array of different efforts, so it is natural to put together a survey tutorial on the subject of parallel learning for KDD, tomorrow. This tutorial is not limited to the book itself however, as several interesting new algorithms have come out since we started inviting chapters.

This tutorial should interest anyone trying to use machine learning on significant quantities of data, anyone interested in developing algorithms for such, and of course who has bragging rights to the fastest learning algorithm on planet earth :-)

(Also note the Modeling with Hadoop tutorial just before ours which deals with one way of trying to speed up learning algorithms. We have almost no overlap.)

8/15/2011

Vowpal Wabbit 6.0

Tags: Announcements, Code, Machine Learning jl@ 9:46 pm

I just released Vowpal Wabbit 6.0. Since the last version:

  1. VW is now 2-3 orders of magnitude faster at linear learning, primarily thanks to Alekh. Given the baseline, this is loads of fun, allowing us to easily deal with terafeature datasets, and dwarfing the scale of any other open source projects. The core improvement here comes from effective parallelization over kilonode clusters (either Hadoop or not). This code is highly scalable, so it even helps with clusters of size 2 (and doesn’t hurt for clusters of size 1). The core allreduce technique appears widely and easily reused—we’ve already used it to parallelize Conjugate Gradient, LBFGS, and two variants of online learning. We’ll be documenting how to do this more thoroughly, but for now “README_cluster” and associated scripts should provide a good starting point.
  2. The new LBFGS code from Miro seems to commonly dominate the existing conjugate gradient code in time/quality tradeoffs.
  3. The new matrix factorization code from Jake adds a core algorithm.
  4. We finally have basic persistent daemon support, again with Jake’s help.
  5. Adaptive gradient calculations can now be made dimensionally correct, following up on Paul’s post, yielding a better algorithm. And Nikos sped it up further with SSE native inverse square root.
  6. The LDA core is perhaps twice as fast after Paul educated us about SSE and representational gymnastics.

All of the above was done without adding significant new dependencies, so the code should compile easily.

The VW mailing list has been slowly growing, and is a good place to ask questions.

Enjoy.

8/6/2011

Interesting thing at UAI 2011

Tags: Conferences, Papers, Reinforcement jl@ 3:44 pm

I had a chance to attend UAI this year, where several papers interested me, including:

  1. Hoifung Poon and Pedro Domingos Sum-Product Networks: A New Deep Architecture. We’ve already discussed this one, but in a nutshell, they identify a large class of efficiently normalizable distributions and do learning with it.
  2. Yao-Liang Yu and Dale Schuurmans, Rank/norm regularization with closed-form solutions: Application to subspace clustering. This paper is about matrices, and in particular they prove that certain matrices are the solution of matrix optimizations. I’m not matrix inclined enough to fully appreciate this one, but I believe many others may be, and anytime closed form solutions come into play, you get 2 order of magnitude speedups, as they show experimentally.
  3. Laurent Charlin, Richard Zemel and Craig Boutilier, A Framework for Optimizing Paper Matching. This is about what works in matching papers to reviewers, as has been tested at several previous NIPS. We are looking into using this system for ICML 2012.

In addition I wanted to comment on Karl Friston’s invited talk. At the outset, he made a claim that seems outlandish to me: The way the brain works is to minimize surprise as measured by a probabilistic model. The majority of the talk was not actually about this—instead it was about how probabilistic models can plausibly do things that you might not have thought possible, such as birdsong. Nevertheless, I think several of us in the room ended up stuck on the claim in questions afterward.

My personal belief is that world modeling (probabilistic or not) is a useful subroutine for intelligence, but it could not possibly be the entirety of intelligence. A key reason for this is the bandwidth of our senses—we simply take in too much information to model everything with equal attention. It seems critical for the efficient functioning of intelligence that only things which might plausibly matter are modeled, and only to the degree that matters. In other words, I do not model the precise placement of items on my desk, or even the precise content of my desk, because these details simply do not matter.

This argument can be made in another way. Suppose for the moment that all the brain does is probabilistic modeling. Then, the primary notion of failure to model is “surprise”, which is low probability events occurring. Surprises (stumbles, car wrecks, and other accidents) certainly can be unpleasant, but this could be correct if modeling is a subroutine as well. The clincher is that there are many unpleasant things which are not surprises, including keeping your head under water, fasting, and self-inflicted wounds.

Accounting for the unpleasantness of these events requires more than probabilistic modeling. In other words, it requires rewards, which is why reinforcement learning is important. As a byproduct, rewards also naturally create a focus of attention, addressing the computational efficiency issue. Believing that intelligence is just probabilistic modeling is another example of simple wrong answer.

8/1/2011

Interesting papers at COLT 2011

Since John did not attend COLT this year, I have been volunteered to report back on the hot stuff at this year’s meeting. The conference seemed to have pretty high quality stuff this year, and I found plenty of interesting papers on all the three days. I’m gonna pick some of my favorites going through the program in a chronological order.

The first session on matrices seemed interesting for two reasons. First, the papers were quite nice. But more interestingly, this is a topic that has had a lot of presence in Statistics and Compressed sensing literature recently. So it was good to see high-dimensional matrices finally make their entry at COLT. The paper of Ohad and Shai on Collaborative Filtering with the Trace Norm: Learning, Bounding, and Transducing provides non-trivial guarantees on trace norm regularization in an agnostic setup, while Rina and Nati show how Rademacher averages can be used to get sharper results for matrix completion problems in their paper Concentration-Based Guarantees for Low-Rank Matrix Reconstruction. Both the papers seemed to share the flavor of a learning theorists’ take at compressed sensing that I enjoyed seeing.

The best student paper by Amit, Sivan and Shai2 on Multiclass Learnability and the ERM principle showed a crucial distinction between binary and multiclass classification. Every ERM procedure is not equally good for multiclass classification. In particular, there are multiclass problems where some ERM learners succeed while others are inconsistent, in sharp contrast to the binary case. They also present some intuition on what characterizes a good ERM procedure for the multiclass setting.

I enjoyed all the three papers in the online learning session quite a bit. Jake, Elad and Peter show the equivalence of Blackwell approachability and low regret in their paper Blackwell Approachability and No-Regret Learning are Equivalent, with applications to efficient algorithms for calibration. Sasha, Karthik and Ambuj won the best paper award for their paper Online Learning: Beyond Regret which shows how the tools like sequential Rademacher averages and sequential covering numbers can be used to capture the minimax value of a large class of games, beyond just external regret settings. Their paper with Dean on Complexity-Based Approach to Calibration with Checking Rules showed a nice application of these techniques to the calibration problem.

The impromptu session was quite a hit this year with ~15 talks. I was quite disappointed to see none of them turn up on my NIPS review stack :)

Rob managed to save some money by solving his open problem from last COLT, together with Indraneel and Cynthia. Their paper The Rate of Convergence of AdaBoost was interesting as it made me realize how much difference the boundedness of rates can make to the theoretical properties of an algorithm. Adaboost is greedy coordinate descent, for which convergence over a compact domain is well-studied, but what makes this challenging here is that Adaboost doesn’t impose any bound on the weights. The way this paper gets around and pays a penalty for these issues seemed quite interesting.

I also liked the paper by Gabor, David and Csaba on Minimax Regret of Finite Partial-Monitoring Games in Stochastic Environments. This paper shows a tight characterization of partial regret games that have an optimal regret of 0, T1/2, T2/3 or T. While some sufficient conditions one way or the other were known before, theirs is a first complete characterization in my knowledge.

Overall, this was a thoroughly enjoyable COLT, both in the technical content and in the choice of venue.

7/11/2011

Interesting Neural Network Papers at ICML 2011

Maybe it’s too early to call, but with four separate Neural Network sessions at this year’s ICML, it looks like Neural Networks are making a comeback. Here are my highlights of these sessions. In general, my feeling is that these papers both demystify deep learning and show its broader applicability.

The first observation I made is that the once disreputable “Neural” nomenclature is being used again in lieu of “deep learning”. Maybe it’s because Adam Coates et al. showed that single layer networks can work surprisingly well.

Another surprising result out of Andrew Ng’s group comes from Andrew Saxe et al. who show that certain convolutional pooling architectures can obtain close to state-of-the-art performance with random weights (that is, without actually learning).

Of course, in most cases we do want to train these models eventually. There were two interesting papers on the topic of training neural networks. In the first, Quoc Le et al. show that a simple, off-the-shelf L-BFGS optimizer is often preferable to stochastic gradient descent.

Secondly, Martens and Sutskever from Geoff Hinton’s group show how to train recurrent neural networks for sequence tasks that exhibit very long range dependencies:

It will be interesting to see whether this type of training will allow recurrent neural networks to outperform CRFs on some standard sequence tasks and data sets. It certainly seems possible since even with standard L-BFGS our recursive neural network (see previous post) can outperform CRF-type models on several challenging computer vision tasks such as semantic segmentation of scene images. This common vision task of labeling each pixel with an object class has not received much attention from the deep learning community.
Apart from the vision experiments, this paper further solidifies the trend that neural networks are being used more and more in natural language processing. In our case, the RNN-based model was used for structure prediction. Another neat example of this trend comes from Yann Dauphin et al. in Yoshua Bengio’s group. They present an interesting solution for learning with sparse bag-of-word representations.

Such sparse representations had previously been problematic for neural architectures.

In summary, these papers have helped us understand a bit better which “deep” or “neural” architectures work, why they work and how we should train them. Furthermore, the scope of problems that these architectures can handle has been widened to harder and more real-life problems.

Of the non-neural papers, these two papers stood out for me:

7/10/2011

ICML 2011 and the future

Unfortunately, I ended up sick for much of this ICML. I did manage to catch one interesting paper:

Richard Socher, Cliff Lin, Andrew Y. Ng, and Christopher D. Manning Parsing Natural Scenes and Natural Language with Recursive Neural Networks.

I invited Richard to share his list of interesting papers, so hopefully we’ll hear from him soon. In the meantime, Paul and Hal have posted some lists.

the future

Joelle and I are program chairs for ICML 2012 in Edinburgh, which I previously enjoyed visiting in 2005. This is a huge responsibility, that we hope to accomplish well. A part of this (perhaps the most fun part), is imagining how we can make ICML better. A key and critical constraint is choosing things that can be accomplished. So far we have:

  1. Colocation. The first thing we looked into was potential colocations. We quickly discovered that many other conferences precomitted their location. For the future, getting a colocation with ACL or SIGIR, seems to require more advanced planning. If that can be done, I believe there is substantial interest—I understand there was substantial interest in the joint symposium this year. What we did manage was achieving a colocation with COLT and there is an outside chance that a machine learning summer school will precede the main conference. The colocation with COLT is in both time and space, with COLT organized as (essentially) a separate track in a nearby building. We look forward to organizing a joint invited session or two with the COLT program chairs.
  2. Tutorials. We don’t have anything imaginative here, except for pushing for quality tutorials, probably through a mixture of invitations and a call. There is a small chance we’ll be able to organize a machine learning summer school as a prequel, which would be quite cool, but several things have to break right for this to occur.
  3. Conference. We are considering a few tinkerings with the conference format.
    1. Shifting a conference banquet to be during the workshops, more tightly integrating the workshops.
    2. Having 3 nights of posters (1 per day) rather than 2 nights. This provides more time/poster, and avoids halving talks and posters appear on different days.
    3. Having impromptu sessions in the evening. Two possibilities here are impromptu talks and perhaps a joint open problems session with COLT. I’ve made sure we have rooms available so others can organize other things.
    4. We may go for short presentations (+ a poster) for some papers, depending on how things work out schedulewise. My opinions on this are complex. ICML is traditionally multitrack with all papers having a 25 minute-ish presentation. As a mechanism for research, I believe this is superior to a single track conference of a similar size because:
      1. Typically some talk of potential interest can always be found by participants avoiding the boredom problem which comes up at a single track conference
      2. My experience is that program organizers have a limited ability to foresee which talks are of most interest, commonly creating a misallocation of attention.

      On the other hand, there are clearly limits to the number of tracks that are reasonable, and I feel like ICML (especially with COLT cotimed) is near the upper limit. There are also some papers which have a limited scope of interest, for which a shorter presentation is reasonable.

  4. Workshops. A big change here—we want to experiment with 2 days of workshops rather than 1. There seems to be demand for it, as the number of workshops historically is about 10, enough that it’s easy to imagine people commonly interested in 2 workshops. It’s also the case that NIPS has had to start rejecting a substantial fraction of workshop submissions for space reasons. I am personally a big believer in workshops as a mechanism for further research, so I hope this works out well.
  5. Journal integration. I tend to believe that we should be shifting to a journal format for ICML papers, as per many past discussions. After thinking about this the easiest way seems to be simply piggybacking on existing journals such as JMLR and MLJ by essentially declaring that people could submit there first, and if accepted, and not otherwise presented at a conference, present at ICML. This was considered too large a change, so it is not happening. Nevertheless, it is a possible tweak that I believe should be considered for the future. My best guess is that this would never displace the baseline conference review process, but it would help some papers that don’t naturally fit into a conference format while keeping quality high.
  6. Reviewing. Drawing on plentiful experience with what goes wrong, I think we can create the best reviewing system for conferences. We are still debating exact details here while working through what is possible in different conference systems. Nevertheless, some basic goals are:
    1. Double Blind [routine now] Two identical papers with different authors should have the same chance of success. In terms of reviewing quality, I think double blind makes little difference in the short term, but the public commitment to fair reviewing makes a real difference in the long term.
    2. Author Feedback [routine now] Author feedback makes a difference in only a small minority of decisions, but I believe its effect is larger as (a) reviewer quality improves and (b) reviewer understanding improves. Both of these are silent improvers of quality. Somewhat less routine, we are seeking a mechanism for authors to be able to provide feedback if additional reviews are requested, as I’ve become cautious of the late-breaking highly negative review.
    3. Paper Editing. Geoff Gordon tweaked AIStats this year to allow authors to revise papers during feedback. I think this is helpful, because it encourages authors to fix clarity issues immediately, rather than waiting longer. This helps with some things, but it is not a panacea—authors still have to convince reviewers their paper is worthwhile, and given the way people are first impressions are lasting impressions.
    4. Multisource reviewing. We want all of the initial reviews to be assigned by good yet different mechanisms. In the past, I’ve observed that the source of reviewer assignments can greatly bias the decision outcome, all the way from “accept with minor revisions” to “reject” in the case of a JMLR submission that I had. Our plan at the moment is that one review will be assigned by bidding, one by a primary area chair, and one by a secondary area chair.
    5. No single points of failure. When Bob Williamson and I were PC members for learning theory at NIPS, we each came to a decisions given reviews and then reconciled differences. This made a difference on about 5-10% of decisions, and (I believe) improved overall quality a bit. More generally, I’ve seen instances where an area chair has an unjustifiable dislike for a paper and kills it off, which this mechanism avoids.
    6. Speed. In general, I believe speed and good decision making are antagonistic. Nevertheless, we believe it is important to try to do the reviewing both quickly and well. Doing things quickly implies that we can push the submission deadline back later, providing authors more time to make quality papers. Key elements of doing things well fast are: good organization (that’s all on us), light loads for everyone involved (i.e. not too many papers), crowd sourcing (i.e. most decisions made by area chairs), and some amount of asynchrony. Altogether, we believe at the moment that two weeks can be shaved from our reviewing process.
  7. Website. Traditionally at ICML, every new local organizer was responsible for creating a website. This doesn’t make sense anymore, because substantial work is required there, which can and should be amortized across the years so that the website can evolve to do more for the community. We plant to create a permanent website, based around some combination of icml.cc and machinelearning.org. I think this just makes sense.
  8. Publishing. We are thinking about strongly encouraging authors to use arxiv for final submissions. This provides a lasting backing store for ICML papers, as well as a mechanism for revisions. The reality here is that some mistakes get into even final drafts, so a way to revise for the long term is helpful. We are also planning to videotape and make available all talks, although a decision between videolectures and Weyond has not yet been made.

Implementing all the changes above is ambitious, but I believe feasible and that each is individually beneficial and to some extent individually evaluatable. I’d like to hear any thoughts you have on this. It’s also not too late if you have further suggestions of your own.

6/22/2011

Ultra LDA

Shravan and Alex’s LDA code is released. On a single machine, I’m not sure how it currently compares to the online LDA in VW, but the ability to effectively scale across very many machines is surely interesting.

5/16/2011

Research Directions for Machine Learning and Algorithms

Tags: Machine Learning, Questions, Research jl@ 9:15 pm

Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came though, I bargained for a post instead, which provides a chance for many other people to comment.

There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively.

  1. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation—think about the n2 dependence on the number of states in a Hidden Markov Model and Kernelized Support Vector Machines where optimization is typically quadratic or worse. There are two basic reasons for this. The most obvious is that linear time allows you to deal with large datasets. A less obvious but critical point is that a superlinear time algorithm is inherently buggy—it has an unreliable running time that can easily explode if you accidentally give it too much input.
  2. Almost anything worth doing requires many people working together. This happens for many reasons. One is the time-critical aspect of development—in many places it really is worthwhile to pay more to have something developed faster. Another is that real projects are simply much bigger than you might otherwise expect. A third is that real organizations have people coming and going, and any project that is just by one person whithers when they leave. This observation is what means that development of systems with clean abstractions can be extraordinarily helpful, as it allows people to work independently. This observation also means that simple widely applicable tricks (for example the hashing trick) can be broadly helpful.

A good way to phrase research directions is with questions. Here are a few of my natural questions.

  1. How do we efficiently learn in settings where exploration is required? These are settings where the choice of action you take influences the observed reward—ad display and medical testing are two good scenarios. This is deeply critical to many applications, because the learning with exploration setting is inherently more natural than the standard supervised learning setting. The tutorial we did details much of the state of the art here, but very significant questions remain. How can we do efficient offline evaluation of algorithms (see here for a first attempt)? How can we be both efficient in sample complexity and computational complexity? Several groups are interested in sampling from a Bayesian posterior to solve these sorts of problems. Where and when can this be proved to work? (There is essentially no analysis.) What is a maximally distributed and incentive compatible algorithm that remains efficient? The last question is very natural for market place design. How can we best construct reward functions operating on different time scales? What is the relationship between the realizable and agnostic versions of this setting, and how can we construct an algorithm which smoothly interpolates between the two?
  2. How can we learn from lots of data? We’ll be presenting a KDD survey tutorial about what’s been done. Some of the larger scale learning problems have been addressed effectively using MapReduce. The best example here I know is known as Ozgur Cetin’s algorithm at Y!—It’s preconditioned conjugate gradient with a Newton stepsize using two passes over examples per step. (A nonHadoop version is implemented in VW for reference.) But linear predictors are not enough—we would like learning algorithms that can for example learn from all the images in the world. Doing this well plausibly requires a new approach and new learning algorithms. A key observation here is that the bandwidth required by the learning algorithm can not be too great.
  3. How can we learn to index efficiently? The standard solution in information retrieval is to evaluate (or approximately evaluate) all objects in a database returning the elements with the largest score according to some learned or constructed scoring function. This is an inherently O(n) operation, which is frustrating when it’s plausible that an exponentially faster O(log(n)) solution exists. A good solution involves both theory and empirical work here, as we need to think about how to think about how to solve the problem, and of course we need to solve it.
  4. What is a flexible inherently efficient language for architecting representations for learning algorithms? Right now, graphical models often get (mis)used for this purpose. It’s easy and natural to pose a computationally intractable graphical model, implying many real applications involve approximations. A better solution would be to use a different representation language which was always computationally tractable yet flexible enough to solve real-world problems. One starting point for this is Searn. Another general approach was the topic of the Coarse to fine workshop. These are inherently related as Coarse to fine is a pruned breadth first search. Restated, it is not enough to have a language for specifying your prior structural beliefs—instead we must have a language for such which results in computationally tractable solutions.
  5. The Deep Learning problem remains interesting. How do you effectively learn complex nonlinearities capable of better performance than a basic linear predictor? An effective solution avoids feature engineering. Right now, this is almost entirely dealt with empirically, but theory could easily have a role to play in phrasing appropriate optimization algorithms, for example.

Good solutions to each of the research directions above would result in revolutions in their area, and everyone of them would plausibly see wide applicability.

What’s missing?

Cross-posted at CACM.

5/9/2011

CI Fellows, again

Tags: Announcements jl@ 2:05 pm

Lev and Hal point out the CI Fellows program is on again for this year. Lev visited me for a year under this program, and I quite enjoyed it. Due May 31.

4/23/2011

ICML workshops due

Tags: Announcements, Machine Learning jl@ 10:53 am

Lihong points out that ICML workshop submissions are due April 29.

4/20/2011

The End of the Beginning of Active Learning

Tags: Active, Machine Learning Daniel@ 12:18 pm

This post is by Daniel Hsu and John Langford.

In selective sampling style active learning, a learning algorithm chooses which examples to label. We now have an active learning algorithm that is:

  1. Efficient in label complexity, unlabeled complexity, and computational complexity.
  2. Competitive with supervised learning anywhere that supervised learning works.
  3. Compatible with online learning, with any optimization-based learning algorithm, with any loss function, with offline testing, and even with changing learning algorithms.
  4. Empirically effective.

The basic idea is to combine disagreement region-based sampling with importance weighting: an example is selected to be labeled with probability proportional to how useful it is for distinguishing among near-optimal classifiers, and labeled examples are importance-weighted by the inverse of these probabilities. The combination of these simple ideas removes the sampling bias problem that has plagued many previous heuristics for active learning, and yet leads to a general and flexible method that enjoys the desirable traits described above. None of these desirable criteria are individually sufficient, but the simultaneous satisfaction of all criteria by one algorithm is compelling.

This combination of traits implies that active learning is a viable choice in most places where supervised learning is a viable choice. 6 years ago, we didn’t know how to deal with adversarial label noise, and didn’t know how to characterize where active learning helped over supervised learning. 5.5 years ago we had the first breakthroughs in characterization and learning with adversarial label noise. Several more substantial improvements occurred, leading to a tutorial 2 years ago and discussion about what’s next. Since then, we cracked question (2) here and applied it to get an effective absurdly efficient active learning algorithm. Directions for experimenting with it are here, page 47.

As research programs go, we’d like to declare victory, but can’t. Victory in research is when ideas get used, becoming part of the standard repertoire of tools and ways people think. Instead, it seems fair to declare “theory victory” which is more like a milestone in the grand scheme. We’ve hit a point where anyone versed in these results can comfortably and effectively apply active learning instead of supervised learning (caveats below). Whether or not this leads to a real victory depends a great deal on how this gets used. In achieving “theory victory”, the key people were:

  1. Nina Balcan*
  2. Alina Beygelzimer
  3. Sanjoy Dasgupta
  4. Steve Hanneke*
  5. Daniel Hsu*
  6. Matti Kääriäinen
  7. Nikos Karampatziakis
  8. [Added from Daniel] Vladimir Koltchinskii
  9. John Langford
  10. Claire Monteleoni*
  11. Tong Zhang

(*)=thesis work.

Naturally, there are many caveats to the above. We moved as fast as we could towards an effective sound useful algorithm according to the standard IID-only assumption criteria of supervised learning. This means that our understanding is not particularly broad, and a number of questions remain including 1,3,4 here.

  1. The existing solution is a simple algorithm with a complex analysis. Simplifying and tightening this analysis would be quite helpful—it’s not even clear we have the best-possible functional form yet. In addition, we rely on the abstraction of a learning algorithm as an effective ERM algorithm when applying the algorithm in practice. That’s plausibly reasonable, but it may also breakdown, implying that experiments beyond linear and decision tree architectures could be helpful.
  2. The extreme efficiency of active learning achieved opens up the possibility of using it for efficient parallel learning and other information-distillation purposes.
  3. Our understanding of the limits of active learning is not complete: various lower bounds have been identified, but a gap remains relative to the known upper bounds. Are the label complexity upper bounds achieved by the general scheme tight for a general class of active learning algorithms, or can the algorithms be improved using new techniques without sacrificing consistency?
  4. Can active learning succeed under different generative / statistical assumptions (e.g., fully-adversarial data, alternative labeling oracles)? Some recent progress on fully-adversarial active learning has been made by Nicolo Cesa-Bianchi, Claudio Gentile, and Francesco Orabona and Ofer Dekel, Claudio Gentile, and Karthik Sridharan, with the latter also providing a solution for learning with multiple heterogeneous labeling oracles (e.g. Mechanical Turk). At the other end of the spectrum, the work of Andrew Guillory and Jeff Bilmes and Daniel Golovin and Andreas Krause (building on some earlier findings of Sanjoy Dasgupta) looks at average-case / Bayesian analyses of greedy algorithms. The resulting approximation factor-type guarantees can be reassuring to the practitioner when justifying a particular choice of sampling strategy; the trade-off here is that the guarantee holds in a somewhat limited sense.
  5. Can active learning be effectively combined with semi-supervised and unsupervised learning? It is known from the rich area of semi-supervised learning that unlabeled data can suggest learning biases (e.g., large margin separators, low dimensional structure) that may improve performance over supervised learning, especially when labeled data are few. When these biases are not aligned with reality, however, performance can be significantly degraded; this is a common but serious criticism of semi-supervised learning. A basic observation is that active learning provides the opportunity to validate or refute these biases using label queries, and also to subsequently revise them. Thus, it seems that active learners ought to be able to pursue learning biases much more aggressively than passive learners. A few works on cluster-based sampling and multi-view active learning have appeared, but much remains to be discovered.

4/18/2011

A paper not at Snowbird

Tags: Conferences, Machine Learning, Papers jl@ 11:20 am

Unfortunately, a scheduling failure meant I missed all of AIStat and most of the learning workshop, otherwise known as Snowbird, when it’s at Snowbird.

At snowbird, the talk on Sum-Product networks by Hoifung Poon stood out to me (Pedro Domingos is a coauthor.). The basic point was that by appropriately constructing networks based on sums and products, the normalization problem in probabilistic models is eliminated, yielding a highly tractable yet flexible representation+learning algorithm. As an algorithm, this is noticeably cleaner than deep belief networks with a claim to being an order of magnitude faster and working better on an image completion task.

Snowbird doesn’t have real papers—just the abstract above. I look forward to seeing the paper. (added: Rodrigo points out the deep learning workshop draft.)

4/11/2011

The Heritage Health Prize

Tags: Competitions, Machine Learning jl@ 9:39 am

The Heritage Health Prize is potentially the largest prediction prize yet at $3M, which is sure to get many people interested. Several elements of the competition may be worth discussing.

  1. The most straightforward way for HPN to deploy this predictor is in determining who to cover with insurance. This might easily cover the costs of running the contest itself, but the value to the health system of a whole is minimal, as people not covered still exist. While HPN itself is a provider network, they have active relationships with a number of insurance companies, and the right to resell any entrant. It’s worth keeping in mind that the research and development may nevertheless end up being useful in the longer term, especially as entrants also keep the right to their code.
  2. The judging metric is something I haven’t seen previously. If a patient has probability 0.5 of being in the hospital 0 days and probability 0.5 of being in the hospital ~53.6 days, the optimal prediction in expectation is ~6.4 days. This is evidence against point (1) above, since cost is probably closer to linear in the number of hospital days. As a starting point, I suspect many people will simply optimize conditional squared loss and then back out an inferred prediction according to p=ex-1, with clipping. The standard approach of ensembling should be effective.
  3. The team structure seems a bit strange to me. I’m not sure there is a good reason for it from a prediction point of view and 8 may be too hard a limit on team size, imposing bin packing problems on the entrants.
  4. Privacy is clearly a huge concern. They anonymized the data, require entrants to protect the data, and admonish people to not try to break privacy. Despite that, the data will be released to large numbers of people, so I wouldn’t be surprised if someone attempts a join attack of some sort. Whether or not a join attack succeeds could make a huge difference in how this contest is viewed in the long term.
  5. The Accuracy Threshold is a big deal. If they set it at an out-of-reach point (which they could easily do), the size of the prize becomes 0.5M. This part of the contest is supposed to be determined next month.

This contest is not a slam-dunk, but is has the potential to become one, and I’ll be interested to see how it turns out.

4/6/2011

COLT open questions

Tags: Announcements, Machine Learning jl@ 6:53 am

Alina and Jake point out the COLT Call for Open Questions due May 11. In general, this is cool, and worth doing if you can come up with a crisp question. In my case, I particularly enjoyed crafting an open question with precisely a form such that a critic targeting my papers would be forced to confront their fallacy or make a case for the reward. But less esoterically, this is a way to get the attention of some very smart people focused on a problem that really matters, which is the real value.

3/27/2011

Vowpal Wabbit, v5.1

Tags: Announcements, Code jl@ 8:30 pm

I just created version 5.1 of vowpal wabbit. This almost entirely a bugfix release, so it’s an easy upgrade from v5.0.

In addition:

  1. There is now a mailing list, which I and several other developers are subscribed to.
  2. The main website has shifted to the wiki on github. This means that anyone with a github account can now edit it.
  3. I’m planning to give a tutorial tomorrow on it at eHarmony/the LA machine learning meetup at 10am. Drop by if you’re interested.

The status of VW amongst other open source projects has changed. When VW first came out, it was relatively unique amongst existing projects in terms of features. At this point, many other projects have started to appreciate the value of the design choices here. This includes:

  1. Mahout, which now has an SGD implementation.
  2. Shogun, where Soeren is keen on incorporating features.
  3. LibLinear, where they won the KDD best paper award for out-of-core learning.

This is expected—any open source approach which works well should be widely adopted. None of these other projects yet have the full combination of features, so VW still offers something unique. There are also more tricks that I haven’t yet had time to implement, and I look forward to discovering even more.

I’m indebted to many people at this point who have helped with this project. I particularly point out Daniel and Nikos, who have spent quite a bit of time over the last few months working on things.

3/20/2011

KDD Cup 2011

Yehuda points out KDD-Cup 2011 which Markus and Gideon helped setup. This is a prediction and recommendation contest for music. In addition to being a fun chance to show your expertise, there are cash prizes of $5K/$2K/$1K.

3/19/2011

The Ideal Large Scale Learning Class

Tags: Machine Learning, Teaching jl@ 4:24 pm

At NIPS, Andrew Ng asked me what should be in a large scale learning class. After some discussion with him and Nando and mulling it over a bit, these are the topics that I think should be covered.

There are many different kinds of scaling.

  1. Scaling in examples This is the most basic kind of scaling.
    1. Online Gradient Descent This is an old algorithm—I’m not sure if anyone can be credited with it in particular. Perhaps the Perceptron is a good precursor, but substantial improvements come from the notion of a loss function of which squared loss, logistic loss, Hinge Loss, and Quantile Loss are all worth covering. It’s important to cover the semantics of these loss functions as well. Vowpal Wabbit is a reasonably fast codebase implementing these.
    2. Second Order Gradient Descent methods For some problems, methods taking into account second derivative information can be more effective. I’ve seen preconditioned conjugate gradient work well, for which Jonathan Shewchuck’s writeup is reasonable. Nando likes L-BFGS which I don’t have much experience with.
    3. Map-Reduce I have a love-hate relationship with the Map-Reduce framework. In my experience, it’s an excellent filesystem, but it’s quite frustrating to do machine learning with, since it encourages the parallelization of slow learning algorithms. I liked what Markus said at the LCCC workshop: nobody wants to give up on the idea of distributed fault tolerant storage and moving small amounts of code to large amounts of data rather than vice-versa. The best way to use this for Machine Learning isn’t yet clear to me. Hadoop is probably the most commonly used open source implementation of Map-Reduce.
  2. Feature Scaling—what do you do when you have very many features?
    1. Hashing approaches are surprisingly effective in my experience. It’s a good idea to also present Bloom Filters, as they help with the intuition of where this works substantially.
    2. Online l1 regularization is via truncated gradient. See Bob Carpenter’s discussion. John Duchi’s composite mirror descent generalization is also a useful general treatment.
    3. Boosting based approaches can also be effective, although training time can become problematic. This is partially mitigated by parallelization algorithms as discussed at the LCCC workshop See Jerry Ye’s talk and Krysta’s talk.. A really good public implementation of this is so far missing, as far as I know.
  3. Test-time Evaluation Ultrafast and efficient test-time evaluation seems to be a goal independent of training.
    1. Indicies One way to speed things up is with inverted indicies. Aside from the basic datastructure, I’d cover WAND and predictive indexing.
    2. GPU The use of GPU’s to make evaluation both more efficient and fast seems to make sense in many applications.
  4. Labels
    1. Sourcing Just acquiring sufficient label information can be problematic.

      1. Mechanical Turk can be an efficient approach. The basic approach can be improved with some work.
      2. When you are paying directly for labels, active learning approaches can substantially cut your costs. Burr Settles active learning survey is pretty comprehensive, although if I was to cover one algorithm, it would be this one which enjoys a compelling combination of strong theoretical guarantees, computational tractability, empirical performance, and generality.
      3. The other common approach is user-feedback information where bias and exploration effects becomes a critical concern. The tutorial Alina and I did on learning and exploration is critical here.
    2. Many Labels It’s common to need to make a complex prediction.
      1. Label Tree based approaches are a good starting point. I’d discuss the inconsistency of the naive approach and the Filter Tree, discussed here. Online tree building and conditional probability trees are also potentially extremely useful. Building smarter trees can help, such as with a confusion matrix or in an iterative fashion.
      2. Label Tree approaches breakdown when the number of labels becomes so large that filtering eliminates too many examples. Here Structured Prediction techniques become particularly important. I’d cover Searn as well as some of Drew Bagnell’s work such as this one. Many other people are still interested in CRFs or Max-Margin Markov Networks which I find somewhat less compelling for computational reasons.
      3. Cascade Learning is also a compelling approach. The canonical paper on this is the Viola-Jones Face Detector. I’m sure there’s much other vision-related work on cascades that I’m unfamiliar. A more recent instance is the structured prediction cascade.

What else is essential and missing?

2/25/2011

Yahoo! Machine Learning grant due March 11

Tags: Announcements, Machine Learning jl@ 9:56 am

Yahoo!’s Key Scientific Challenges for Machine Learning grant applications are due March 11. If you are a student working on relevant research, please consider applying. It’s for $5K of unrestricted funding.

2/17/2011

What does Watson mean?

Tags: AI, Competitions jl@ 10:37 pm

Watson convincingly beat the best champion Jeopardy! players. The apparent significance of this varies hugely, depending on your background knowledge about the related machine learning, NLP, and search technology. For a random person, this might seem evidence of serious machine intelligence, while for people working on the system itself, it probably seems like a reasonably good assemblage of existing technologies with several twists to make the entire system work.

Above all, I think we should congratulate the people who managed to put together and execute this project—many years of effort by a diverse set of highly skilled people were needed to make this happen. In academia, it’s pretty difficult for one professor to assemble that quantity of talent, and in industry it’s rarely the case that such a capable group has both a worthwhile project and the support needed to pursue something like this for several years before success.

Alina invited me to the Jeopardy watching party at IBM, which was pretty fun, and it gave me a chance to talk to several people, principally Gerry Tesauro (2nd from the right). It’s cool to see people asking for autographs :)

I wasn’t surprised to see Watson win. Partly, this is simply because when a big company does a publicity stunt like this, it’s with a pretty solid expectation of victory. Partly, this is because I already knew that computers could answer trivia questions moderately well(*), so the question was just how far this could be improved. Gerry tells me that although Watson’s error rate is still significant, one key element is the ability to estimate with high accuracy when they can answer with high accuracy. Gerry also tells me the Watson papers will be coming out later this summer, with many more details.

What happens next? I don’t expect the project to be shelved like deep blue was, for two reasons. The first is that there is clearly very substantial room for improvement, and the second is that having a natural language question/answering device that can quickly search and respond from large sets of text is obviously valuable. The first means that researchers are interested, and the second that the money to support them can probably be found. The history of textual entailment challenges is another less centralized effort in about the same direction.

In the immediate future (next few years), applications in semi-open domains may become viable, particularly when a question/answer device knows when to answer “I don’t know”. Fully conversational speech recognition working in an open domain should take somewhat longer, because speech recognition software has additional error points, conversational systems aren’t so easy to come by, and in a fully open domain the error rates will be higher. Getting the error rate on questions down to the level that a human with access to the internet has difficulty beating is the tricky challenge which has not yet been addressed. It’s a worthy goal to work towards.

Many people believe in human exceptionalism, so when seeing a computer beat Jeopardy, they are surprised that humans aren’t exceptional there. We should understand that this has happened many times before, with chess and mathematical calculation being two areas where computers now dominate, but which were once thought to be the essence of intelligence by some. Similarly, it is not difficult to imagine automated driving (after all, animals can do it), gross object recognition, etc…

To avert surprise in the future, human exceptionalists should understand what the really hard things for an AI to do are. It’s important to understand that there are various levels of I in AI. A few I think about are:

  1. Animal Intelligence. The ability to understand your place in the world, navigate the world, and accomplish something. Some of these tasks are solved, but many others are not yet. This level implies that routine tasks can be automated. Automated driving, farming, factories, etc…
  2. Turing Test Intelligence. The ability to mimic a typical human well-enough to fool a typical human in open conversation. Watson doesn’t achieve this, but the thrust of the research is in this direction as open domain question answering is probably necessary for this. Nonroutine noncreative tasks might be accomplished by the computer. Think of an automated secretary.
  3. Pandora’s box Intelligence. The ability to efficiently self-program in an open domain so as to continuously improve. At this level human exceptionalism fails, and it is difficult to predict what happens next.

So, serious evidence of (2) or (3) is what I watch for.

(*) About 10 years ago, I had a friend2 on WWTBAM who called the friend for help on a question, who typed the question and multiple choice answers into CMU’s Zephyr system, where a bot I made queried (question,answer) pairs on Google to discover which had the most web pages. It worked.

2/2/2011

User preferences for search engines

I want to comment on the “Bing copies Google” discussion here, here, and here, because there are data-related issues which the general public may not understand, and some of the framing seems substantially misleading to me.

As a not-distant-outsider, let me mention the sources of bias I may have. I work at Yahoo!, which has started using Bing. This might predispose me towards Bing, but on the other hand I’m still at Yahoo!, and have been using Linux exclusively as an OS for many years, including even a couple minor kernel patches. And, on the gripping hand, I’ve spent quite a bit of time thinking about the basic principles of incorporating user feedback in machine learning. Also note, this post is not related to official Yahoo! policy, it’s just my personal view.

The issue Google engineers inserted synthetic responses to synthetic queries on google.com, then executed the synthetic searches on google.com using Internet Explorer with the Bing toolbar and later noticed some synthetic responses from Bing with the synthetic queries.

There are two kinds of disagreement which people might have with this.

One is the privacy disagreementBig Brother Microsoft is looking at what I search and using it”. I’m sympathetic on this count, but also sympathetic to the counter argument, that the data collected has value and can enhance the results for all users. In the end, I think companies should simply do their best to accept a user’s wishes, so those who want privacy can have it, and those who want to contribute their data towards improving a search engine can do so. The precise manner for achieving this by opt-in, opt-out, differential privacy, anonymization or other techniques is not entirely clear to me.

Let’s assume the privacy issue is dealt with. This is at least partly and possibly grossly untrue, but I want to focus on the other issue, and this assumption simplifies it’s discussion because a user and their internet browser are synonymous when the privacy issue is dealt with, as the agent’s actions are a true reflection of the user’s preferences.

The other issue is an originality disagreement, which much of the discussion focuses on. What I believe happened was a user feedback process, where users queried Google, clicked on a result, informed Microsoft/Bing of the query and clicked result, and their preference was used to promote the search result within Bing. Now, there is a slippery-slope of questions. Should a user be allowed to:

  1. Reveal to their chosen search engine their preferred result?
  2. Reveal to a competitor’s search engine their preferred result?

If you answer ‘no’ to the first, you are deeply against user freedom in a manner I can’t sympathize with. If you answer ‘yes’ to the first, and ‘no’ to the second, then you are still somewhat against user freedom. This isn’t too crazy a stance, as various people sell information and require of their users that it not be retransmitted. One of the more famous examples of this is the Bloomberg Terminal. However, in all instances I’m aware of, users knowingly agree to a contract providing access to the information with limitations. Google never entered into such a contract with it’s users, and I don’t know a sound basis for even an implicit contract. So, my answer are “yes, and yes” here.

But this doesn’t entirely deal with the issue of originality. You could argue that it’s ok for Microsoft to take advantage of revealed user interaction, but it’s still a matter of following rather than leading. This argument is simplistic and wrong, as I expect all informed parties already understand. A basic truth seen in many ways, is that the proper incorporation of new sources of information always improves results. This is true in machine learning where sample complexity results and cotraining formalize mechanisms and values of incorporating additional information, and it was heavily used by all competitive teams in the Netflix Competition. More generally, it’s true in basic knowledge engineering, where people fuse sources of information to create a better system, and I’m virtually certain it’s true of the ranking algorithms behind Google and Bing, which are surely complex beasts taking into account many sources of information. I know no details about the algorithm which Microsoft is using, but it’s quite plausible that they incorporated this information well enough to improve the quality of their results, perhaps in some instances so they are better than Google’s or the earlier version of Bing’s. If that’s the case, Google will either follow Microsoft’s lead taking into account user feedback as Microsoft does, or risk becoming obsolete.

We can also think about things in terms of the future. A basic truth, is that building a successful search engine is extraordinarily difficult. This is revealed by search market share, but also by simply thinking about the logistics involved. You need to crawl the web, have server farms all over the world (because the speed of light just isn’t fast enough), and incorporate many sources of information in just the right way in order to succeed, all while adversaries try to corrupt your results. If we prefer a future where there is a healthy competition amongst search engines, then it’s important to lower these barriers to entry so new people with new ideas can more easily test them out. One way to lower the barrier to entry is to accept that users can share their interaction, even with a competitor’s search engine.

Perhaps it’s inevitable that Amit Singhal has a viewpoint driving towards a monopoly on internet search. However, Google has generally been relatively good about supporting a rich ecosystem of innovation for information technology development, so I am still somewhat surprised. I would be more sympathetic to a position for allowing users of Internet Explorer a built-in means to choose to share their search behavior with Google or other search engines on an equal footing.

« Newer PostsOlder Posts »

Powered by WordPress