Machine Learning (Theory)


Vowpal Wabbit, version 5.0, and the second heresy

I’ve released version 5.0 of the Vowpal Wabbit online learning software. The major number has changed since the last release because I regard all earlier versions as obsolete—there are several new algorithms & features including substantial changes and upgrades to the default learning algorithm.

The biggest changes are new algorithms:

  1. Nikos and I improved the default algorithm. The basic update rule still uses gradient descent, but the size of the update is carefully controlled so that it’s impossible to overrun the label. In addition, the normalization has changed. Computationally, these changes are virtually free and yield better results, sometimes much better. Less careful updates can be reenabled with –loss_function classic, although results are still not identical to previous due to normalization changes.
  2. Nikos also implemented the per-feature learning rates as per these two papers. Often, this works better than the default algorithm. It isn’t the default because it isn’t (yet) as adaptable in terms of learning rate decay. This is enabled with –adaptive and learned regressors are compatible with the default. Computationally, you might see a factor of 4 slowdown if using ‘-q’. Nikos noticed that the phenomenal quake inverse square root hack applies making this substantially faster than a naive implementation.
  3. Nikos and Daniel also implemented active learning derived from this paper, usable via –active_simulation (to test parameters on an existing supervised dataset) or –active_learning (to do the real thing). This runs at full speed which is much faster than is reasonable in any active learning scenario. We see this approach dominating supervised learning on all classification datasets so far, often with far fewer labeled examples required, as the theory predicts. The learned predictor is compatible with the default.
  4. Olivier helped me implement preconditioned conjugate gradient based on Jonathan Shewchuk‘s tutorial. This is a batch algorithm and hence requires multiple passes over any dataset to do something useful. Each step of conjugate gradient requires 2 passes. The advantage of cg is that it converges relatively quickly via the use of second derivative information. This can be particularly helpful if your features are of widely differing scales. The use of –regularization 0.001 (or smaller) is almost required with –conjugate_gradient as it will otherwise overfit hard. This implementation has two advantages over the basic approach: it implicitly computes a Hessian in O(n) time where n is the number of features and it operates out of core, hence making it applicable to datasets that don’t conveniently fit in RAM. The learned predictor is compatible with the default, although you’ll notice that a factor of 8 more RAM is required when learning.
  5. Matt Hoffman and I implemented Online Latent Dirichlet Allocation. This code is still experimental and likely to change over the next week. It really does a minibatch update under the hood. The code appears to be substantially faster than Matt’s earlier python implementation making this probably the most efficient LDA anywhere. LDA is still much slower than online linear learning as it is quite computationally heavy in comparison—perhaps a good candidate for GPU optimization.
  6. Nikos, Daniel, and I have been experimenting with more online cluster parallel learning algorithms (–corrective, –backprop, –delayed_global). We aren’t yet satisfied with these although they are improving. Details are at the LCCC workshop.

In addition, Ariel added a test suite, Shravan helped with ngrams, and there are several other minor new features and bug fixes including a very subtle one caught by Vaclav.

The documentation on the website hasn’t kept up with the code. I’m planning to rectify that over the next week, and have a new tutorial starting at 2pm in the LCCC room for those interested. Yes, I’ll not be skiing :)


Traffic Prediction Problem

Slashdot points out the Traffic Prediction Challenge which looks pretty fun. The temporal aspect seems to be very common in many real-world problems and somewhat understudied.


Partha Niyogi has died

from brain cancer. I asked Misha who worked with him to write about it.

Partha Niyogi, Louis Block Professor in Computer Science and Statistics at the University of Chicago passed away on October 1, 2010, aged 43.

I first met Partha Niyogi almost exactly ten years ago when I was a graduate student in math and he had just started as a faculty in Computer Science and Statistics at the University of Chicago. Strangely, we first talked at length due to a somewhat convoluted mathematical argument in a paper on pattern recognition. I asked him some questions about the paper, and, even though the topic was new to him, he had put serious thought into it and we started regular meetings. We made significant progress and developed a line of research stemming initially just from trying to understand that one paper and to simplify one derivation. I think this was typical of Partha, showing both his intellectual curiosity and his intuition for the serendipitous; having a sense and focus for inquiries worth pursuing, no matter how remote or challenging, and bringing his unique vision to new areas. We had been working together continuously from that first meeting until he became too sick to continue. Partha had been a great adviser and a close friend for me; I am very much thankful to him for his guidance, intellectual inspiration and friendship.

Partha had a broad range of interests in research centered around the problem of learning, which had been his interest since he was an undergraduate at the Indian Institute of Technology. His research had three general themes: geometric methods in machine learning, particularly manifold methods; language evolution and language learning (he recently published a 500-page monograph on it) and speech analysis and recognition. I will not talk about his individual works, a more in-depth summary of his research is in the University of Chicago Computer Science department obituary. It is enough to say that his work has been quite influential and widely followed up. In every one of these areas he had his own approach, distinct, clear, and not afraid to challenge unexamined conventional wisdom. To lose this intellectually rigorous but open-minded vision is not just a blow to those of us who knew him and worked with him, but to the field of machine learning itself.

I owe a lot to Partha; to his insight and thoughtful attitude to research and every aspect of life. It had been a great privilege to be Partha’s student, collaborator and friend; his passing away leaves deep sadness and emptiness. It is hard to believe Partha is no longer with us, but his friendship and what I learned from him will stay with me for the rest of my life.

More from Jake, Suresh, and Lance.


Machined Learnings

Paul Mineiro has started Machined Learnings where he’s seriously attempting to do ML research in public. I personally need to read through in greater detail, as much of it is learning reduction related, trying to deal with the sorts of complex source problems that come up in practice.


New York Area Machine Learning Events

On Sept 21, there is another machine learning meetup where I’ll be speaking. Although the topic is contextual bandits, I think of it as “the future of machine learning”. In particular, it’s all about how to learn in an interactive environment, such as for ad display, trading, news recommendation, etc…

On Sept 24, abstracts for the New York Machine Learning Symposium are due. This is the largest Machine Learning event in the area, so it’s a great way to have a conversation with other people.

On Oct 22, the NY ML Symposium actually happens. This year, we are expanding the spotlights, and trying to have more time for posters. In addition, we have a strong set of invited speakers: David Blei, Sanjoy Dasgupta, Tommi Jaakkola, and Yann LeCun. After the meeting, a late hackNY related event is planned where students and startups can meet.

I’d also like to point out the related CS/Econ symposium as I have interests there as well.



Tags: Announcements,Conferences jl@ 5:35 pm

Geoff Gordon points out AIStats 2011 in Ft. Lauderdale, Florida. The call for papers is now out, due Nov. 1. The plan is to experiment with the review process to encourage quality in several ways. I expect to submit a paper and would encourage others with good research to do likewise.


Alex Smola starts a blog

Adventures in Data Land.


Rob Schapire at NYC ML Meetup

I’ve been wanting to attend the NYC ML Meetup for some time and hope to make it next week on the 25th. Rob Schapire is talking about “Playing Repeated Games”, which in my experience is far more relevant to machine learning than the title might indicate.


The Workshop on Cores, Clusters, and Clouds

Tags: Announcements,Workshop jl@ 8:47 am

Alekh, John, Ofer, and I are organizing a workshop at NIPS this year on learning in parallel and distributed environments. The general interest level in parallel learning seems to be growing rapidly, so I expect quite a bit of attendance. Please join us if you are parallel-interested.

And, if you are working in the area of parallel learning, please consider submitting an abstract due Oct. 17 for presentation at the workshop.



Tags: Announcements,Machine Learning jl@ 12:39 am

Joseph Turian creates MetaOptimize for discussion of NLP and ML on big datasets. This includes a blog, but perhaps more importantly a question and answer section. I’m hopeful it will take off.


Netflix Challenge 2 Canceled

Tags: Announcements,Competitions jl@ 6:33 pm

The second Netflix prize is canceled due to privacy problems. I continue to believe my original assessment of this paper, that the privacy break was somewhat overstated. I still haven’t seen any serious privacy failures on the scale of the AOL search log release.

I expect privacy concerns to continue to be a big issue when dealing with data releases by companies or governments. The theory of maintaining privacy while using data is improving, but it is not yet in a state where the limits of what’s possible are clear let alone how to achieve these limits in a manner friendly to a prediction competition.


Yahoo! ML events

Yahoo! is sponsoring two machine learning events that might interest people.

  1. The Key Scientific Challenges program (due March 5) for Machine Learning and Statistics offers $5K (plus bonuses) for graduate students working on a core problem of interest to Y! If you are already working on one of these problems, there is no reason not to submit, and if you aren’t you might want to think about it for next year, as I am confident they all press the boundary of the possible in Machine Learning. There are 7 days left.
  2. The Learning to Rank challenge (due May 31) offers an $8K first prize for the best ranking algorithm on a real (and really used) dataset for search ranking, with presentations at an ICML workshop. Unlike the Netflix competition, there are prizes for 2nd, 3rd, and 4th place, perhaps avoiding the heartbreak the ensemble encountered. If you think you know how to rank, you should give it a try, and we might all learn something. There are 3 months left.


Sam Roweis died

and I can’t help but remember him.

I first met Sam as an undergraduate at Caltech where he was TA for Hopfield‘s class, and again when I visited Gatsby, when he invited me to visit Toronto, and at too many conferences to recount. His personality was a combination of enthusiastic and thoughtful, with a great ability to phrase a problem so it’s solution must be understood. With respect to my own work, Sam was the one who advised me to make my first tutorial, leading to others, and to other things, all of which I’m grateful to him for. In fact, my every interaction with Sam was positive, and that was his way.

His death is being called a suicide which is so incompatible with my understanding of Sam that it strains my credibility. But we know that his many responsibilities were great, and it is well understood that basically all sane researchers have legions of inner doubts. Having been depressed now and then myself, it’s helpful to understand at least intellectually that the true darkness of the now is overestimated, and that you have more friends than you think. Sam was one of mine, and I’ll miss him.

My last interaction with Sam, last week, was discussing a new research direction that interested him, optimizing the cost of acquiring feature information in the learning algorithm. This problem is endemic to real-world applications, and has been studied to some extent elsewhere, but I expect that in our unwritten future history, we’ll discover that further study of this problem is more helpful than almost anyone realizes. The reply that I owed him feels heavy, and an incompleteness is hanging. For his wife and children it is surely so incomparably greater that I lack words.

(Added) Others: Fernando, Kevin McCurley, Danny Tarlow, David Hogg, Yisong Yue, Lance Fortnow on Sam, a Memorial site, and a Memorial Fund


Inherent Uncertainty

Tags: Announcements,Machine Learning jl@ 12:01 pm

I’d like to point out Inherent Uncertainty, which I’ve added to the ML blog post scanner on the right. My understanding from Jake is that the intention is to have a multiauthor blog which is more specialized towards learning theory/game theory than this one. Nevertheless, several of the posts seem to be of wider interest.


NIPS workshops

Many of the NIPS workshops have a deadline about now, and the NIPS early registration deadline is Nov. 6. Several interest me:

  1. Adaptive Sensing, Active Learning, and Experimental Design due 10/27.
  2. Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra, due Nov. 6.
  3. Large-Scale Machine Learning: Parallelism and Massive Datasets, due 10/23 (i.e. past)
  4. Analysis and Design of Algorithms for Interactive Machine Learning, due 10/30.

And I’m sure many of the others interest others. Workshops are great as a mechanism for research, so take a look if there is any chance you might be interested.


New York Area Machine Learning Events

Several events are happening in the NY area.

  1. Barriers in Computational Learning Theory Workshop, Aug 28. That’s tomorrow near Princeton. I’m looking forward to speaking at this one on “Getting around Barriers in Learning Theory”, but several other talks are of interest, particularly to the CS theory inclined.
  2. Claudia Perlich is running the INFORMS Data Mining Contest with a deadline of Sept. 25. This is a contest using real health record data (they partnered with HealthCare Intelligence) to predict transfers and mortality. In the current US health care reform debate, the case studies of high costs we hear strongly suggest machine learning & statistics can save many billions.
  3. The Singularity Summit October 3&4. This is for the AIists out there. Several of the talks look interesting, although unfortunately I’ll miss it for ALT.
  4. Predictive Analytics World, Oct 20-21. This is stretching the definition of “New York Area” a bit, but the train to DC is reasonable. This is a conference of case studies of applications of ML to real-world problems.
  5. Machine Learning Symposium, Friday Nov. 6. I’m on the committee again this year. The abstract deadline is Sept. 30, and we already have several speakers lined up.


The Machine Learning Forum

Dear Fellow Machine Learners,

For the past year or so I have become increasingly frustrated with the peer review system in our field. I constantly get asked to review papers in which I have no interest. At the same time, as an action editor in JMLR, I constantly have to harass people to review papers. When I send papers to conferences and to journals I often get rejected with reviews that, at least in my mind, make no sense. Finally, I have a very hard time keeping up with the best new work, because I don’t know where to look for it…

I decided to try an do something to improve the situation. I started a new web site, which I decided to call “The machine learning forum” the URL is

The main idea behind this web site is to remove anonymity from the review process. In this site, all opinions are attributed to the actual person that expressed them. I expect that this will improve the quality of the reviews. An obvious other effect is that there will be fewer negative reviews, weak papers will tend not to get reviewed at all, but then again, is that such a bad thing?

If you have any interest in this endeavor, please register to the web site and please submit a photo of yourself. Based on the information on your web site I will decide whether to grant you “author” privileges that would allow you to write reviews and overviews. Anybody can submit pointers to publications that they would like somebody to review. Anybody can participate in the discussion forum that is a fancy message board with threads etc.

Right now the main contribution I am looking for are “overviews”.

Overviews are pages written by somebody who is an authority in some area (for example, Kamalika Chaudhuri is an authority on mixture models) in which they list the main papers in the area and five a high level description for how the papers relate. These overviews are intended to serve as an entry point for somebody that wants to learn about that subfield. Overviews *can* reference the work of the author of the overview. This is unlike reviews, in which the reviewer cannot be the author of the reviewed paper.

I hope you are interested enough to give this a try!

Comments are very welcome.


Yoav Freund (


Many ways to Learn this summer

There are at least 3 summer schools related to machine learning this summer.

  1. The first is at University of Chicago June 1-11 organized by Misha Belkin, Partha Niyogi, and Steve Smale. Registration is closed for this one, meaning they met their capacity limit. The format is essentially an extended Tutorial/Workshop. I was particularly interested to see Valiant amongst the speakers. I’m also presenting Saturday June 6, on logarithmic time prediction.
  2. Praveen Srinivasan points out the second at Peking University in Beijing, China, July 20-27. This one differs substantially, as it is about vision, machine learning, and their intersection. The deadline for applications is June 10 or 15. This is also another example of the growth of research in China, with active support from NSF.
  3. The third one is at Cambridge, England, August 29-September 10. It’s in the MLSS series. Compared to the Chicago one, this one is more about the Bayesian side of ML, although effort has been made to create a good cross section of topics. It’s also more focused on tutorials over workshop-style talks.


2009 ICML discussion site

Mark Reid has setup a discussion site for ICML papers again this year and Monica Dinculescu has linked it in from the ICML site. Last year’s attempt appears to have been an acceptable but not wild success as a little bit of fruitful discussion occurred. I’m hoping this year will be a bit more of a success—please don’t be shy :-)

I’d like to also point out that ICML‘s early registration deadline has a few hours left, while UAI‘s and COLT‘s are in a week.


Wielding a New Abstraction

This post is partly meant as an advertisement for the reductions tutorial Alina, Bianca, and I are planning to do at ICML. Please come, if you are interested.

Many research programs can be thought of as finding and building new useful abstractions. The running example I’ll use is learning reductions where I have experience. The basic abstraction here is that we can build a learning algorithm capable of solving classification problems up to a small expected regret. This is used repeatedly to solve more complex problems.

In working on a new abstraction, I think you typically run into many substantial problems of understanding, which make publishing particularly difficult.

  1. It is difficult to seriously discuss the reason behind or mechanism for abstraction in a conference paper with small page limits. People rarely see such discussions and hence have little basis on which to think about new abstractions. Another difficulty is that when building an abstraction, you often don’t know the right way to state things.

    Here’s my current attempt: The process of abstraction for learning reductions can start with sample complexity bounds (or online learning against an adversary analysis). A very simple sample complexity bound is that for all sets of hypotheses H, for all distributions D on examples (x,y), and for all confidence parameters d

    Pr(x,y)m~Dm(for all h in H: |e(h,D)-e(h,(x,y)m)| < (ln( |H|/ d )/m)0.5 ) > 1 – d

    Here (x,y)m is a sequence of m IID samples, e(h,D) is the error rate of h on D and e(h,(x,y)m) is the empirical error rate of h on the set of IID samples.

    The previous bound is a very simple example, and yet remarkably complex both to state and to interpret—many people have been lost by the meaning of d. The impact of this complexity is that it is difficult to effectively use these bounds in practical learning algorithm design, particularly in solving more complex learning problems where much more than one bit of prediction is required. This was a central frustration that I ran into in my thesis work. Some progress has been made since then, but it is still quite difficult. The abstraction in the learning reduction setting is:

    1. You throw away d, because it only has a logarithmic dependence anyways.
    2. You eliminate H and m on the theory that intelligent choices for H and m are made in practice.
    3. You eliminate the IID assumption, because it is no longer needed to define things

    The statement then is

    e(A((x,y)m),D)-e(h*,D) < eps

    where A() is the hypothesis output by the learning algorithm, h* is the best possible predictor, and eps is used to parameterize the theorems. This abstraction is radical in some sense, but something radical was needed to yield tractable and useful analysis on the complex problems people need to solve in practice.

  2. A consequence of lack of familiarity, is that people often misread. In reading a paper, there is a temptation to not read carefully and fill in your understanding of things. Most of the time this works out well, but not here. For example, we saw many instances where people inserted IID sample assumptions or other things that simply weren’t there.
  3. Once you get past the lack of familiarity and misunderstandings, there is a feeling that the new abstraction is cheating. To some extent I understand, as I remember learning about abstractions in class, and I remember feeling that they were in some real sense cheating by dropping important details. For example:
    1. Big-O notation provides an upper bound specified up to constants. For example O(log n) computational complexity means there exists a constant c such that the number of operations requires is less than c log n. Big-O can be abused by hiding “constants” larger than the plausible values for the parameters. In machine learning, a particularly egregious case occurs in Bandit analysis where the punchline of some papers is “logarithmic regret”, hiding an arbitrarily large problem dependent constant.
    2. TCP provides a mechanism for reliable transport over an unreliable network. It is a very commonly used mechanism for sending information over the internet—you used TCP in reading this. TCP is both a programming construct and a mechanism for abstracting communicating over a network. The TCP abstraction is broken when the network is too unreliable for it to recover, such as on sketchy wireless networks where the programmer built for the TCP abstraction which wasn’t delivered.
    3. Dimensional analysis is a technique for quick analysis in physics. The basic idea is to just look at the units when estimating some quantity and combine them to get the right unit answer. For example, to compute the distance d traveled after time t with acceleration a, you simply use at2, since that formula is the only way to combine a with units of distance/time2 and t with units of time to get units of distance. This answer is off by a factor of 2 from what a more detailed analysis using integration yields, which is typical. Dimensional analysis can be misleading when the constants are very large. One example is in Gravitation where there is a table with time and distance equated since they are related by a constant—the speed of light 3*108 m/s. For example, E=mc2 becomes E=m.

    Although the above breakages are real, the usefulness of these abstractions, in terms of allowing us to quickly think about and make decisions more than offsets the drawbacks. Indeed, even the breakages stated above are thought provoking or useful enough that I can’t even say it is wrong to consider them. This property that abstractions can be abused is generically essential to the process of abstraction itself. Abstraction is about neglecting details, and when these details are not neglectable, the abstraction is abused or ineffective. Because of this, any abstraction is insufficient for analyzing and solving real problems where the neglected details matter.

    Just as for these abstractions, the learning reduction abstraction can be abused—the chosen learning algorithm can be pathetic yielding vacuous bounds, or the reduction can scramble the feature information with an encryption algorithm making it so no reasonable learning algorithm could yield other than pathetic performance. Similarly, there are situations in which I don’t know how to effectively use a learning reduction to build a learning algorithm, and it seems implausible that observation changes as more is learned in the future.

For a good abstraction, the drawbacks are matched by the advantages. The principle advantage is that there is a new way to examine and solve problems. This has several interesting effects.

  1. A good abstraction can capture a more complete specification of the problem. As an example, the sample complexity view of learning is broken in practice, because when insufficient performance is achieved people choose a different set of hypotheses H by throwing in additional features or choosing a different learning algorithm. Capturing this process in the sample complexity view requires an additional level of complexity. In the reduction view, this is entirely natural, because any means for achieving a better generalization—more/better features, a better learning algorithm, a better prior, sticking a human in the learning process, etc… are legitimate. This is particularly powerful when architecting solutions, providing a partial answer to the “What?” question Yehuda pointed out.
  2. A higher level abstraction can let you accidentally solve problems in other areas as well. A good example of this is error correcting tournaments which are useful for tournament design to select the best player/team/paper in real tournaments. Recently, I was amused to learn that a standard betting procedure for basketball tournaments exactly mirrors the importance weights suggested for the final elimination of ECTs. The first phase of ECTs provides a sound and practical method to seed a final elimination tournament, eliminating the need for (and biases of) a committee.
  3. Perhaps the most interesting effect is that the new abstraction can aid you in finding effective solutions to new problems. For learning reductions, there are about 3 compelling instances I’ve seen so far.
    1. Given training-time access to a good policy oracle, Searn provides a method for decomposing any complex prediction problem into simple problems, such that low regret solutions to the simple problems imply a low regret solution to the original problem. While Searn competes well (computationally and prediction-wise) with existing methods for linear chain style structured prediction, it really shines on more complex problems. Hal used Searn for automatic document summarization (see section 6.2) which previously wasn’t really solved via ML. More generally, when I learn about the details of other complex prediction systems for machine translation or vision, the base algorithms are tweaked, typically in ways that Searn would suggest. This suggests that Searn formalizes and automates the intuitions of practical people.
    2. The “one step RL” reduction in Bianca‘s thesis (page 119) provided tractable and effective approaches to learning in partial feedback problems where only the loss of a chosen label is learned. An even simpler reduction exists as a matter of folklore—estimate the the value of each label and then take an argmax. However, we have found classification approaches generally work better, where applicable, and as the theory suggests.
    3. Many commonly used algorithms for prediction have a running time linear (or worse) in the number of labels with decision trees a good exception. While simply predicting faster isn’t normally solving a “new problem”, an exponential improvement in computational time seems to merit this description because it allows entirely new kinds of applications. It turns out that it is both very easy to do logarithmic time prediction wrong, and that this problem is often fixable. Furthermore, it appears logarthmic time prediction can really work in practice over very many labels.

When we started working on learning reductions, I had no idea what either the difficulties or rewards were going to be—it simply seemed like a natural and compelling direction of investigation. Given the substantial difficulties encountered, it’s not at all clear that this pursuit was personally worthwhile. It has cost much time which could have been put to good use in other ways.

On the other hand, the advantages are also substantial. I’ve learned something about architecting solutions to problems, both expanding the domain of application for the field and providing a personal edge that I can bring to many conversations about ML. It’s also progress towards the AI goal, which interests me. When I think of what I could have worked on instead to achieve these goals, I don’t have any more compelling answer yet. Learning reductions seem to have accomplished more per unit thought than any other theoretical approach I can identify over the last 5 or 6 years. Furthermore, they are composable by design, so they should stay relevant (and perhaps even become more so), when people use an online active deep semisupervised probabilistic convolutional algorithm to solve a problem, particularly for complex problems.

As I said at the beginning, please join us for the tutorial, if you are interested.

« Newer PostsOlder Posts »

Powered by WordPress