Machine Learning (Theory)


Interesting papers at ICML 2014

This year’s ICML had several papers which I want to read through more carefully and understand better.

  1. Chun-Liang Li, Hsuan-Tien Lin, Condensed Filter Tree for Cost-Sensitive Multi-Label Classification. Several tricks accumulate to give a new approach for addressing cost sensitive multilabel classification.
  2. Nikos Karampatziakis and Paul Mineiro, Discriminative Features via Generalized Eigenvectors. An efficient, effective eigenvalue solution for supervised learning yields compelling nonlinear performance on several datasets.
  3. Nir Ailon, Zohar Karnin, Thorsten Joachims, Reducing Dueling Bandits to Cardinal Bandits. An effective method for reducing dueling bandits to normal bandits that extends to contextual situations.
  4. Pedro Pinheiro, Ronan Collobert, Recurrent Convolutional Neural Networks for Scene Labeling. Image parsing remains a challenge, and this is plausibly a step forward.
  5. Cicero Dos Santos, Bianca Zadrozny, Learning Character-level Representations for Part-of-Speech Tagging. Word morphology is clearly useful information, and yet almost all ML-for-NLP applications ignore it or hard-code it (by stemming).
  6. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, Robert Schapire, Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. Statistically efficient interactive learning is now computationally feasible. I wish this one had been done in time for the NIPS tutorial :-)
  7. David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller, Deterministic Policy Gradient Algorithms. A reduction in variance from working out the deterministic limit of policy gradient make policy gradient approaches look much more attractive.

Edit: added one that I forgot.


2011 ML symposium and the bears

The New York ML symposium was last Friday. Attendance was 268, significantly larger than last year. My impression was that the event mostly still fit the space, although it was crowded. If anyone has suggestions for next year, speak up.

The best student paper award went to Sergiu Goschin for a cool video of how his system learned to play video games (I can’t find the paper online yet). Choosing amongst the submitted talks was pretty difficult this year, as there were many similarly good ones.

By coincidence all the invited talks were (at least potentially) about faster learning algorithms. Stephen Boyd talked about ADMM. Leon Bottou spoke on single pass online learning via averaged SGD. Yoav Freund talked about parameter-free hedging. In Yoav’s case the talk was mostly about a better theoretical learning algorithm, but it has the potential to unlock an exponential computational complexity improvement via oraclization of experts algorithms… but some serious thought needs to go in this direction.

Unrelated, I found quite a bit of truth in Paul’s talking bears and Xtranormal always adds a dash of funny. My impression is that the ML job market has only become hotter since 4 years ago. Anyone who is well trained can find work, with the key limiting factor being “well trained”. In this environment, efforts to make ML more automatic and more easily applied are greatly appreciated. And yes, Yahoo! is still hiring too :)


KDD and MUCMD 2011

At KDD I enjoyed Stephen Boyd‘s invited talk about optimization quite a bit. However, the most interesting talk for me was David Haussler‘s. His talk started out with a formidable load of biological complexity. About half-way through you start wondering, “can this be used to help with cancer?” And at the end he connects it directly to use with a call to arms for the audience: cure cancer. The core thesis here is that cancer is a complex set of diseases which can be distentangled via genetic assays, allowing attacking the specific signature of individual cancers. However, the data quantity and complex dependencies within the data require systematic and relatively automatic prediction and analysis algorithms of the kind that we are best familiar with.

Some of the papers which interested me are:

  1. Kai-Wei Chang and Dan Roth, Selective Block Minimization for Faster Convergence of Limited Memory Large-Scale Linear Models, which is about effectively using a hard-example cache to speedup learning.
  2. Leland Wilkinson, Anushka Anand, and Dang Nhon Tuan, CHIRP: A New Classifier Based on Composite Hypercubes on Iterated Random Projections. The bar on creating new classifiers is pretty high. The approach here uses a combination of random projection and partition which appears to be compelling for some nonlinear and relatively high computation settings. They do a more thorough empirical evaluation than most papers.
  3. Zhuang Wang, Nemanja Djuric, Koby Crammer, and Slobodan Vucetic Trading Representability for Scalability: Adaptive Multi-Hyperplane Machine for Nonlinear Classification. The paper explores an interesting idea: having lots of weight vectors (effectively infinity) associated with a particular label, showing that algorithms on this representation can deal with lots of data as per linear predictors, but with superior-to-linear performance. The authors don’t use the hashing trick, but their representation is begging for it.
  4. Michael Bruckner and Tobias Scheffer, Stackelberg Games for Adversarial Prediction Problem. This is about email spam filtering, where the authors use a theory of adversarial equilibria to construct a more robust filter, at least in some cases. Demonstrating this on noninteractive data is inherently difficult.

There were also three papers that were about creating (or perhaps composing) learning systems to do something cool.

  1. Gideon Dror, Yehuda Koren, Yoelle Maarek, and Idan Szpektor, I Want to Answer, Who Has a Question? Yahoo! Answers Recommender System. This is about how to learn to route a question to the appropriate answerer automatically.
  2. Yehuda Koren, Edo Liberty, Yoelle Maarek, and Roman Sandler, Automatically Tagging Email by Leveraging Other Users’ Folders. This is about helping people organize their email with machine learning.
  3. D. Sculley, Matthew Eric Otey, Michael Pohl, Bridget Spitznagel, John Hainsworth, Yunkai Zhou, Detecting Adversarial Advertisements in the Wild. The title is an excellent abstract here, and there are quite a few details about the implementation.

I also attended MUCMD, a workshop on the Meaningful Use of Complex Medical Data shortly afterwards. This workshop is about the emergent area of using data to improve medicine. The combination of electronic health records, the economic importance of getting medicine right, and the relatively weak use of existing data implies there is much good work to do.

This finally gave us a chance to discuss radically superior medical trial designs based on work in exploration and learning :)

Jeff Hammerbacher‘s talk was a hilarilously blunt and well stated monologue about the need and how to gather data in a usable way.

Amongst the talks on using medical data, Suchi Saria‘s seemed the most mature. They’ve constructed a noninvasive test for problem infants which is radically superior to the existing Apgar score according to leave-one-out cross validation.

From the doctor’s side, there was discussion of the deep balkanization of data systems within hospitals, efforts to overcome that, and the (un)trustworthiness of data. Many issues clearly remain here, but it also looks like serious progress is being made.

Overall, the workshop went well, with the broad cross-section of talks providing quite a bit of extra context you don’t normally see. It left me believing that a community centered on MUCMD is rising now, with attendant workshops, conferences, etc… to be expected.


Interesting thing at UAI 2011

Tags: Conferences,Papers,Reinforcement jl@ 3:44 pm

I had a chance to attend UAI this year, where several papers interested me, including:

  1. Hoifung Poon and Pedro Domingos Sum-Product Networks: A New Deep Architecture. We’ve already discussed this one, but in a nutshell, they identify a large class of efficiently normalizable distributions and do learning with it.
  2. Yao-Liang Yu and Dale Schuurmans, Rank/norm regularization with closed-form solutions: Application to subspace clustering. This paper is about matrices, and in particular they prove that certain matrices are the solution of matrix optimizations. I’m not matrix inclined enough to fully appreciate this one, but I believe many others may be, and anytime closed form solutions come into play, you get 2 order of magnitude speedups, as they show experimentally.
  3. Laurent Charlin, Richard Zemel and Craig Boutilier, A Framework for Optimizing Paper Matching. This is about what works in matching papers to reviewers, as has been tested at several previous NIPS. We are looking into using this system for ICML 2012.

In addition I wanted to comment on Karl Friston‘s invited talk. At the outset, he made a claim that seems outlandish to me: The way the brain works is to minimize surprise as measured by a probabilistic model. The majority of the talk was not actually about this—instead it was about how probabilistic models can plausibly do things that you might not have thought possible, such as birdsong. Nevertheless, I think several of us in the room ended up stuck on the claim in questions afterward.

My personal belief is that world modeling (probabilistic or not) is a useful subroutine for intelligence, but it could not possibly be the entirety of intelligence. A key reason for this is the bandwidth of our senses—we simply take in too much information to model everything with equal attention. It seems critical for the efficient functioning of intelligence that only things which might plausibly matter are modeled, and only to the degree that matters. In other words, I do not model the precise placement of items on my desk, or even the precise content of my desk, because these details simply do not matter.

This argument can be made in another way. Suppose for the moment that all the brain does is probabilistic modeling. Then, the primary notion of failure to model is “surprise”, which is low probability events occurring. Surprises (stumbles, car wrecks, and other accidents) certainly can be unpleasant, but this could be correct if modeling is a subroutine as well. The clincher is that there are many unpleasant things which are not surprises, including keeping your head under water, fasting, and self-inflicted wounds.

Accounting for the unpleasantness of these events requires more than probabilistic modeling. In other words, it requires rewards, which is why reinforcement learning is important. As a byproduct, rewards also naturally create a focus of attention, addressing the computational efficiency issue. Believing that intelligence is just probabilistic modeling is another example of simple wrong answer.


ICML 2011 and the future

Unfortunately, I ended up sick for much of this ICML. I did manage to catch one interesting paper:

Richard Socher, Cliff Lin, Andrew Y. Ng, and Christopher D. Manning Parsing Natural Scenes and Natural Language with Recursive Neural Networks.

I invited Richard to share his list of interesting papers, so hopefully we’ll hear from him soon. In the meantime, Paul and Hal have posted some lists.

the future

Joelle and I are program chairs for ICML 2012 in Edinburgh, which I previously enjoyed visiting in 2005. This is a huge responsibility, that we hope to accomplish well. A part of this (perhaps the most fun part), is imagining how we can make ICML better. A key and critical constraint is choosing things that can be accomplished. So far we have:

  1. Colocation. The first thing we looked into was potential colocations. We quickly discovered that many other conferences precomitted their location. For the future, getting a colocation with ACL or SIGIR, seems to require more advanced planning. If that can be done, I believe there is substantial interest—I understand there was substantial interest in the joint symposium this year. What we did manage was achieving a colocation with COLT and there is an outside chance that a machine learning summer school will precede the main conference. The colocation with COLT is in both time and space, with COLT organized as (essentially) a separate track in a nearby building. We look forward to organizing a joint invited session or two with the COLT program chairs.
  2. Tutorials. We don’t have anything imaginative here, except for pushing for quality tutorials, probably through a mixture of invitations and a call. There is a small chance we’ll be able to organize a machine learning summer school as a prequel, which would be quite cool, but several things have to break right for this to occur.
  3. Conference. We are considering a few tinkerings with the conference format.
    1. Shifting a conference banquet to be during the workshops, more tightly integrating the workshops.
    2. Having 3 nights of posters (1 per day) rather than 2 nights. This provides more time/poster, and avoids halving talks and posters appear on different days.
    3. Having impromptu sessions in the evening. Two possibilities here are impromptu talks and perhaps a joint open problems session with COLT. I’ve made sure we have rooms available so others can organize other things.
    4. We may go for short presentations (+ a poster) for some papers, depending on how things work out schedulewise. My opinions on this are complex. ICML is traditionally multitrack with all papers having a 25 minute-ish presentation. As a mechanism for research, I believe this is superior to a single track conference of a similar size because:
      1. Typically some talk of potential interest can always be found by participants avoiding the boredom problem which comes up at a single track conference
      2. My experience is that program organizers have a limited ability to foresee which talks are of most interest, commonly creating a misallocation of attention.

      On the other hand, there are clearly limits to the number of tracks that are reasonable, and I feel like ICML (especially with COLT cotimed) is near the upper limit. There are also some papers which have a limited scope of interest, for which a shorter presentation is reasonable.

  4. Workshops. A big change here—we want to experiment with 2 days of workshops rather than 1. There seems to be demand for it, as the number of workshops historically is about 10, enough that it’s easy to imagine people commonly interested in 2 workshops. It’s also the case that NIPS has had to start rejecting a substantial fraction of workshop submissions for space reasons. I am personally a big believer in workshops as a mechanism for further research, so I hope this works out well.
  5. Journal integration. I tend to believe that we should be shifting to a journal format for ICML papers, as per many past discussions. After thinking about this the easiest way seems to be simply piggybacking on existing journals such as JMLR and MLJ by essentially declaring that people could submit there first, and if accepted, and not otherwise presented at a conference, present at ICML. This was considered too large a change, so it is not happening. Nevertheless, it is a possible tweak that I believe should be considered for the future. My best guess is that this would never displace the baseline conference review process, but it would help some papers that don’t naturally fit into a conference format while keeping quality high.
  6. Reviewing. Drawing on plentiful experience with what goes wrong, I think we can create the best reviewing system for conferences. We are still debating exact details here while working through what is possible in different conference systems. Nevertheless, some basic goals are:
    1. Double Blind [routine now] Two identical papers with different authors should have the same chance of success. In terms of reviewing quality, I think double blind makes little difference in the short term, but the public commitment to fair reviewing makes a real difference in the long term.
    2. Author Feedback [routine now] Author feedback makes a difference in only a small minority of decisions, but I believe its effect is larger as (a) reviewer quality improves and (b) reviewer understanding improves. Both of these are silent improvers of quality. Somewhat less routine, we are seeking a mechanism for authors to be able to provide feedback if additional reviews are requested, as I’ve become cautious of the late-breaking highly negative review.
    3. Paper Editing. Geoff Gordon tweaked AIStats this year to allow authors to revise papers during feedback. I think this is helpful, because it encourages authors to fix clarity issues immediately, rather than waiting longer. This helps with some things, but it is not a panacea—authors still have to convince reviewers their paper is worthwhile, and given the way people are first impressions are lasting impressions.
    4. Multisource reviewing. We want all of the initial reviews to be assigned by good yet different mechanisms. In the past, I’ve observed that the source of reviewer assignments can greatly bias the decision outcome, all the way from “accept with minor revisions” to “reject” in the case of a JMLR submission that I had. Our plan at the moment is that one review will be assigned by bidding, one by a primary area chair, and one by a secondary area chair.
    5. No single points of failure. When Bob Williamson and I were PC members for learning theory at NIPS, we each came to a decisions given reviews and then reconciled differences. This made a difference on about 5-10% of decisions, and (I believe) improved overall quality a bit. More generally, I’ve seen instances where an area chair has an unjustifiable dislike for a paper and kills it off, which this mechanism avoids.
    6. Speed. In general, I believe speed and good decision making are antagonistic. Nevertheless, we believe it is important to try to do the reviewing both quickly and well. Doing things quickly implies that we can push the submission deadline back later, providing authors more time to make quality papers. Key elements of doing things well fast are: good organization (that’s all on us), light loads for everyone involved (i.e. not too many papers), crowd sourcing (i.e. most decisions made by area chairs), and some amount of asynchrony. Altogether, we believe at the moment that two weeks can be shaved from our reviewing process.
  7. Website. Traditionally at ICML, every new local organizer was responsible for creating a website. This doesn’t make sense anymore, because substantial work is required there, which can and should be amortized across the years so that the website can evolve to do more for the community. We plant to create a permanent website, based around some combination of and I think this just makes sense.
  8. Publishing. We are thinking about strongly encouraging authors to use arxiv for final submissions. This provides a lasting backing store for ICML papers, as well as a mechanism for revisions. The reality here is that some mistakes get into even final drafts, so a way to revise for the long term is helpful. We are also planning to videotape and make available all talks, although a decision between videolectures and Weyond has not yet been made.

Implementing all the changes above is ambitious, but I believe feasible and that each is individually beneficial and to some extent individually evaluatable. I’d like to hear any thoughts you have on this. It’s also not too late if you have further suggestions of your own.


A paper not at Snowbird

Unfortunately, a scheduling failure meant I missed all of AIStat and most of the learning workshop, otherwise known as Snowbird, when it’s at Snowbird.

At snowbird, the talk on Sum-Product networks by Hoifung Poon stood out to me (Pedro Domingos is a coauthor.). The basic point was that by appropriately constructing networks based on sums and products, the normalization problem in probabilistic models is eliminated, yielding a highly tractable yet flexible representation+learning algorithm. As an algorithm, this is noticeably cleaner than deep belief networks with a claim to being an order of magnitude faster and working better on an image completion task.

Snowbird doesn’t have real papers—just the abstract above. I look forward to seeing the paper. (added: Rodrigo points out the deep learning workshop draft.)


A Variance only Deviation Bound

At the PAC-Bayes workshop earlier this week, Olivier Catoni described a result that I hadn’t believed was possible: a deviation bound depending only on the variance of a random variable.

For people not familiar with deviation bounds, this may be hard to appreciate. Deviation bounds, are one of the core components for the foundations of machine learning theory, so developments here have a potential to alter our understanding of how to learn and what is learnable. My understanding is that the basic proof techniques started with Bernstein and have evolved into several variants specialized for various applications. All of the variants I knew had a dependence on the range, with some also having a dependence on the variance of an IID or martingale random variable. This one is the first I know of with a dependence on only the variance.

The basic idea is to use a biased estimator of the mean which is not influenced much by outliers. Then, a deviation bound can be proved by using the exponential moment method, with the sum of the bias and the deviation bounded. The use of a biased estimator is clearly necessary, because an unbiased empirical average is inherently unstable—which was precisely the reason I didn’t think this was possible.

Precisely how this is useful for machine learning isn’t clear yet, but it opens up possibilities. For example, it’s common to suffer from large ranges in exploration settings, such as contextual bandits or active learning.


ALT 2009

Tags: Conferences,Online,Papers jl@ 2:58 pm

I attended ALT (“Algorithmic Learning Theory”) for the first time this year. My impression is ALT = 0.5 COLT, by attendance and also by some more intangible “what do I get from it?” measure. There are many differences which can’t quite be described this way though. The program for ALT seems to be substantially more diverse than COLT, which is both a weakness and a strength.

One paper that might interest people generally is:

Alexey Chernov and Vladimir Vovk, Prediction with Expert Evaluators’ Advice. The basic observation here is that in the online learning with experts setting you can simultaneously compete with several compatible loss functions simultaneously. Restated, debating between competing with log loss and squared loss is a waste of breath, because it’s almost free to compete with them both simultaneously. This might interest anyone who has run into “which loss function?” debates that come up periodically.


Another 10-year paper in Machine Learning

When I was thinking about the best “10 year paper” for ICML, I also took a look at a few other conferences. Here is one from 10 years ago that interested me:

David McAllester PAC-Bayesian Model Averaging, COLT 1999. 2001 Journal Draft.

Prior to this paper, the only mechanism known for controlling or estimating the necessary sample complexity for learning over continuously parameterized predictors was VC theory and variants, all of which suffered from a basic problem: they were incredibly pessimistic in practice. This meant that only very gross guidance could be provided for learning algorithm design. The PAC-Bayes bound provided an alternative approach to sample complexity bounds which was radically tighter, quantitatively. It also imported and explained many of the motivations for Bayesian learning in a way that learning theory and perhaps optimization people might appreciate. Since this paper came out, there have been a number of moderately successful attempts to drive algorithms directly by the PAC-Bayes bound. We’ve gone from thinking that a bound driven algorithm is completely useless to merely a bit more pessimistic and computationally intense than might be necessary.

The PAC-Bayes bound is related to the “bits-back” argument that Geoff Hinton and Drew van Camp made at COLT 6 years earlier.

What other machine learning or learning theory papers from 10 years ago have had a substantial impact?


Interesting papers at UAICMOLT 2009

Here’s a list of papers that I found interesting at ICML/COLT/UAI in 2009.

  1. Elad Hazan and Comandur Seshadhri Efficient learning algorithms for changing environments at ICML. This paper shows how to adapt learning algorithms that compete with fixed predictors to compete with changing policies. The definition of regret they deal with seems particularly useful in many situation.
  2. Hal Daume, Unsupervised Search-based Structured Prediction at ICML. This paper shows a technique for reducing unsupervised learning to supervised learning which (a) make a fast unsupervised learning algorithm and (b) makes semisupervised learning both easy and highly effective.
  3. There were two papers with similar results on active learning in the KWIK framework for linear regression, both reducing the sample complexity to . One was Nicolo Cesa-Bianchi, Claudio Gentile, and Francesco Orabona Robust Bounds for Classification via Selective Sampling at ICML and the other was Thomas Walsh, Istvan Szita, Carlos Diuk, Michael Littman Exploring compact reinforcement-learning representations with linear regression at UAI. The UAI paper covers application to RL as well.
  4. Ping Li, Improving Compressed Counting at UAI. This paper talks about how to keep track of the moments in a datastream with very little space and computation. I’m not sure I have a use for it yet, but it seems like a cool piece of basic technology.
  5. Mark Reid and Bob Williamson Surrogate Regret Bounds for Proper Losses at ICML. This paper points out that via the integral characterization of proper losses, proper scoring rules can be reduced to binary classification. The results unify and generalize the Probing and Quanting reductions we worked on previously. This paper is also related to Nicolas Lambert‘s work, which is quite thought provoking in terms of specifying what is learnable.
  6. Daniel Hsu, Sham M. Kakade and Tong Zhang. A Spectral Algorithm for Learning Hidden Markov Models COLT. This paper shows that a subset of HMMs can be learned using an SVD-based algorithm.
  7. Samory Kpotufe, Escaping the curse of dimensionality with a tree-based regressor at COLT. This paper shows how to directly applying regression in high dimensional vector spaces and have it succeed anyways because the data is naturally low-dimensional.
  8. Shai Ben-David, David Pal and Shai Shalev-Shwartz. Agnostic Online Learning at COLT. This paper characterizes the ability to learn when an adversary is choosing features in the online setting as the “Littlestone dimension”.


Interesting Presentations at Snowbird

Here are a few of presentations interesting me at the snowbird learning workshop (which, amusingly, was in Florida with AIStat).

  1. Thomas Breuel described machine learning problems within OCR and an open source OCR software/research platform with modular learning components as well has a 60Million size dataset derived from Google‘s scanned books.
  2. Kristen Grauman and Fei-Fei Li discussed using active learning with different cost labels and large datasets for image ontology. Both of them used Mechanical Turk as a labeling system, which looks to become routine, at least for vision problems.
  3. Russ Tedrake discussed using machine learning for control, with a basic claim that it was the way to go for problems involving a medium Reynold’s number such as in bird flight, where simulation is extremely intense.
  4. Yann LeCun presented a poster on an FPGA for convolutional neural networks yielding a factor of 100 speedup in processing. In addition to the graphics processor approach Rajat has worked on, this seems like an effective approach to deal with the need to compute many dot products.


Interesting Papers at SODA 2009

Tags: Conferences,Papers,Theory jl@ 10:35 am

Several talks seem potentially interesting to ML folks at this year’s SODA.

  1. Maria-Florina Balcan, Avrim Blum, and Anupam Gupta, Approximate Clustering without the Approximation. This paper gives reasonable algorithms with provable approximation guarantees for k-median and other notions of clustering. It’s conceptually interesting, because it’s the second example I’ve seen where NP hardness is subverted by changing the problem definition subtle but reasonable way. Essentially, they show that if any near-approximation to an optimal solution is good, then it’s computationally easy to find a near-optimal solution. This subtle shift bears serious thought. A similar one occurred in our ranking paper with respect to minimum feedback arcset. With two known examples, it suggests that many more NP-complete problems might be finessed into irrelevance in this style.
  2. Yury Lifshits and Shengyu Zhang, Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates, and Small-World Design. The basic idea of this paper is that actually creating a metric with a valid triangle inequality inequality is hard for real-world problems, so it’s desirable to have a datastructure which works with a relaxed notion of triangle inequality. The precise relaxation is more extreme than you might imagine, implying the associated algorithms give substantial potential speedups in incomparable applications. Yuri tells me that a cover tree style “true O(n) space” algorithm is possible. If worked out and implemented, I could imagine substantial use.
  3. Elad Hazan and Satyen Kale Better Algorithms for Benign Bandits. The basic idea of this paper is that in real-world applications, an adversary is less powerful than is commonly supposed, so carefully taking into account the observed variance can yield an algorithm which works much better in practice, without sacrificing the worst case performance.
  4. Kevin Matulef, Ryan O’Donnell, Ronitt Rubinfeld, Rocco Servedio, Testing Halfspaces. The basic point of this paper is that testing halfspaces is qualitatively easier than finding a good half space with respect to 0/1 loss. Although the analysis is laughably far from practical, the result is striking, and it’s plausible that the algorithm works much better than the analysis. The core algorithm is at least conceptually simple: test that two correlated random points have the same sign, with “yes” being evidence of a halfspace and “no” not.
  5. I also particularly liked Yuval Peres‘s invited talk The Unreasonable Effectiveness of Martingales. Martingale’s are endemic to learning, especially online learning, and I suspect we can tighten and clarify several arguments using some of the techniques discussed.


A NIPS paper

I’m skipping NIPS this year in favor of Ada, but I wanted to point out this paper by Andriy Mnih and Geoff Hinton. The basic claim of the paper is that by carefully but automatically constructing a binary tree over words, it’s possible to predict words well with huge computational resource savings over unstructured approaches.

I’m interested in this beyond the application to word prediction because it is relevant to the general normalization problem: If you want to predict the probability of one of a large number of events, often you must compute a predicted score for all the events and then normalize, a computationally inefficient operation. The problem comes up in many places using probabilistic models, but I’ve run into it with high-dimensional regression.

There are a couple workarounds for this computational bug:

  1. Approximate. There are many ways. Often the approximations are uncontrolled (i.e. can be arbitrarily bad), and hence finicky in application.
  2. Avoid. You don’t really want a probability, you want the most probable choice which can be found more directly. Energy based model update rules are an example of that approach and there are many other direct methods from supervised learning. This is great when it applies, but sometimes a probability is actually needed.

This paper points out that a third approach can be viable empirically: use a self-normalizing structure. It seems highly likely that this is true in other applications as well.


How do we get weak action dependence for learning with partial observations?

Tags: Machine Learning,Papers,Problems jl@ 9:53 am

This post is about contextual bandit problems where, repeatedly:

  1. The world chooses features x and rewards for each action r1,…,rk then announces the features x (but not the rewards).
  2. A policy chooses an action a.
  3. The world announces the reward ra

The goal in these situations is to learn a policy which maximizes ra in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction.

One common drawback of all algorithms for solving this setting, is that they have a poor dependence on the number of actions. For example if k is the number of actions, EXP4 (page 66) has a dependence on k0.5, epoch-greedy (and the simpler epsilon greedy) have a dependence on k1/3, and the offset tree has a dependence on k-1. These results aren’t directly comparable because different things are being analyzed. The fact that all analyses have poor dependence on k is troublesome. The lower bounds in the EXP4 paper and the Offset Tree paper demonstrate that this isn’t a matter of lazy proof writing or a poor choice of algorithms: it’s essential to the nature of the problem.

In supervised learning, it’s typical to get no dependence or very weak dependence on the number of actions/choices/labels. For example, if we do empirical risk minimization over a finite hypothesis space H, the dependence is at most ln |H| using an Occam’s Razor bound. Similarly, the PECOC algorithm (page 12) has dependence bounded by a constant. This kind of dependence is great for the feasibility of machine learning: it means that we can hope to tackle seemingly difficult problems.

Why is there such a large contrast between these settings? At the level of this discussion, they differ only in step 3, where for supervised learning, all of the rewards are revealed instead of just one.

One of the intuitions you develop after working with supervised learning is that holistic information is often better. As an example, given a choice between labeling the same point multiple times (perhaps revealing and correcting noise) or labeling other points once, an algorithm with labels other points typically exists and typically yields as good or better performance in theory and in practice. This appears untrue when we have only partial observations.

For example, consider the following problem(*): “Find an action with average reward greater than 0.5 with probability at least 0.99″ and consider two algorithms:

  1. Sample actions at random until we can prove (via Hoeffding bounds) that one of them has large reward.
  2. Pick an action at random, sample it 100 times, and if we can prove (via a Hoeffding bound) that it has large average reward return it, otherwise pick another action randomly and repeat.

When there are 1010 actions and 109 of them have average reward 0.6, it’s easy to prove that algorithm 2 is much better than algorithm 1.

Lower bounds for the partial observation settings imply that more tractable algorithms only exist under additional assumptions. Two papers which do this without context features are:

  1. Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandit problems in metric spaces, STOC 2008. Here the idea is that you have access to a covering oracle on the actions where actions with similar average rewards cover each other.
  2. Deepak Agarwal, , and Deepayan Chakrabati, Multi-armed Bandit Problems with Dependent Arms, ICML 2007. Here the idea is that the values of actions are generated recursively, preserving structure through the recursion.

Basic questions: Are there other kinds of natural structure which allows a good dependence on the total number of actions? Can these kinds of structures be extended to the setting with features? (Which seems essential for real applications.)

(*) Developed in discussion with Yisong Yue and Bobby Kleinberg.


Interesting papers at COLT (and a bit of UAI & workshops)

Here are a few papers from COLT 2008 that I found interesting.

  1. Maria-Florina Balcan, Steve Hanneke, and Jenn Wortman, The True Sample Complexity of Active Learning. This paper shows that in an asymptotic setting, active learning is always better than supervised learning (although the gap may be small). This is evidence that the only thing in the way of universal active learning is us knowing how to do it properly.
  2. Nir Ailon and Mehryar Mohri, An Efficient Reduction of Ranking to Classification. This paper shows how to robustly rank n objects with n log(n) classifications using a quicksort based algorithm. The result is applicable to many ranking loss functions and has implications for others.
  3. Michael Kearns and Jennifer Wortman. Learning from Collective Behavior. This is about learning in a new model, where the goal is to predict how a collection of interacting agents behave. One claim is that learning in this setting can be reduced to IID learning.

Due to the relation with Metric-E3, I was particularly interested in a couple other papers on reinforcement learning in navigation-like spaces.
I also particularly enjoyed Dan Klein‘s talk, which was the most impressive application of graphical model technology I’ve seen.

I also attended the large scale learning challenge workshop and enjoyed Antoine Bordes talk about a fast primal space algorithm that won by a hair over other methods in the wild track. Ronan Collobert‘s talk was also notable in that they are doing relatively featuritis-free NLP.


Interesting Papers at ICML 2007

Here are a few of the papers I enjoyed at ICML.

  1. Steffen Bickel, Michael Brüeckner, Tobias Scheffer, Discriminative Learning for Differing Training and Test Distributions There is a nice trick in this paper: they predict the probability that an unlabeled sample is in the training set vs. the test set, and then use this prediction to importance weight labeled samples in the training set. This paper uses a specific parametric model, but the approach is easily generalized.
  2. Steve Hanneke A Bound on the Label Complexity of Agnostic Active Learning This paper bounds the number of labels required by the A2 algorithm for active learning in the agnostic case. Last year we figured out agnostic active learning was possible. This year, it’s quantified. Hopefull soon, it will be practical.
  3. Sylvian Gelly, David Silver Combining Online and Offline Knowledge in UCT. This paper is about techniques for improving MoGo with various sorts of learning. MoGo has a fair claim at being the world’s best Go algorithm.

There were also a large number of online learning papers this year, especially if you count papers which use online learning techniques for optimization on batch datasets (as I do). This is expected, because larger datasets are becoming more common, and online learning makes more sense the larger the dataset. Many of these papers are of interest if your goal is learning fast while others are about extending online learning into new domains.

(Feel free to add any other papers of interest in the comments.)


Interesting Papers at COLT 2007

Here are two papers that seem particularly interesting at this year’s COLT.

  1. Gilles Blanchard and François Fleuret, Occam’s Hammer. When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight. A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper.
  2. Adam Tauman Kalai. Learning Nested Halfspaces and Uphill Decision Trees. Classification PAC-learning, where you prove that any problem amongst some set is polytime learnable with respect to any distribution over the input X is extraordinarily challenging as judged by lack of progress over a long period of time. This paper is about regression PAC-learning, and the results appear much more encouraging than exist in classification PAC-learning. Under the assumption that:
    1. The level sets of the correct regressed value are halfspaces.
    2. The level sets obey a Lipschitz condition.

    this paper proves that a good regressor can be PAC-learned using a boosting algorithm. (The “uphill decision trees” part of the paper is about one special case where you don’t need the Lipschitz condition.)


Conditional Tournaments for Multiclass to Binary

This problem has been cracked (but not quite completely solved) by Alina, Pradeep, and I. The problem is essentially finding a better way to reduce multiclass classification to binary classification. The solution is to use a carefully crafted tournament, the simplest version of which is a single elimination tournament where the “players” are the different classes. An example of the structure is here:

For the single elimination tournament, we can prove that:
For all multiclass problems D, for all learned binary classifiers c, the regret of an induced multiclass classifier is bounded by the regret of the binary classifier times log2 k. Restated:

regmulticlass(D,Filter_tree_test(c)) <= regbinary (Filter_tree_train(D),c)


  1. Filter_tree_train(D) is the induced binary classification problem
  2. Filter_tree_test(c) is the induced multiclass classifier.
  3. regmulticlass is the multiclass regret (= difference between error rate and minimum possible error rate)
  4. regbinary is the binary regret

This result has a slight dependence on k which we suspect is removable. The current conjecture is that this dependence can be removed by using higher order tournaments such as double elimination, triple elimination, up to log2 k-elimination.

The key insight which makes the result possible is conditionally defining the prediction problems at interior nodes. In essence, we use the learned classifiers from the first level of the tree to filter the distribution over examples reaching the second level of the tree. This process repeats, until the root node is reached. Further details, including a more precise description and some experimental results are in the draft paper.


What to do with an unreasonable conditional accept

Last year about this time, we received a conditional accept for the searn paper, which asked us to reference a paper that was not reasonable to cite because there was strictly more relevant work by the same authors that we already cited. We wrote a response explaining this, and didn’t cite it in the final draft, giving the SPC an excuse to reject the paper, leading to unhappiness for all.

Later, Sanjoy Dasgupta suggested that an alternative was to talk to the PC chair instead, as soon as you see that a conditional accept is unreasonable. William Cohen and I spoke about this by email, the relevant bit of which is:

If an SPC asks for a revision that is inappropriate, the correct
action is to contact the chairs as soon as the decision is made,
clearly explaining what the problem is, so we can decide whether or
not to over-rule the SPC. As you say, this is extra work for us
chairs, but that’s part of the job, and we’re willing to do that sort
of work to improve the overall quality of the reviewing process and
the conference. In short, Sanjoy was right.

At the time, I operated under the belief that the PC chair’s job was simply too heavy to bother with something like this, but that was wrong. William invited me to post this, and I hope we all learn a little bit from it. Obviously, this should only be used if there is a real flaw in the conditions for a conditional accept paper.


The Forgetting

How many papers do you remember from 2006? 2005? 2002? 1997? 1987? 1967? One way to judge this would be to look at the citations of the papers you write—how many came from which year? For myself, the answers on recent papers are:

year 2006 2005 2002 1997 1987 1967
count 4 10 5 1 0 0

This spectrum is fairly typical of papers in general. There are many reasons that citations are focused on recent papers.

  1. The number of papers being published continues to grow. This is not a very significant effect, because the rate of publication has not grown nearly as fast.
  2. Dead men don’t reject your papers for not citing them. This reason seems lame, because it’s a distortion from the ideal of science. Nevertheless, it must be stated because the effect can be significant.
  3. In 1997, I started as a PhD student. Naturally, papers after 1997 are better remembered because they were absorbed in real time. A large fraction of people writing papers and attending conferences haven’t been doing it for 10 years.
  4. Old papers aren’t on the internet. This is huge effect for any papers prior to 1995 (or so). The ease of examining a paper greatly influences the ability of an author to read and understand it. There are a number of journals which essentially have “internet access for the privileged elite who are willing to pay”. In my experience, this is only marginally better than having them stuck in the library.
  5. The recent past is more relevant to the present than the far past. There is a lot of truth in this—people discover and promote various problems or techniques which take off for awhile, until their turn to be forgotten arrives.

Should we be disturbed by this forgetting? There are a few good effects. For example, when people forget, they reinvent, and sometimes they reinvent better. Nevertheless, it seems like the effect of forgetting is bad overall, because it causes wasted effort. There are two implications:

  1. For paper writers, it is very common to overestimate the value of a paper, even though we know that the impact of most papers is bounded in time. Perhaps by looking at those older papers, we can get an idea of what is important in the long term. For example, looking at my own older citations, simplicity is it. If you want a paper to have a long term impact, it needs to have a simple algorithm, analysis method, or setting. Fundamentally, only those things which are teachable survive. Was your last paper simple? Could you teach it in a class? Are other people going to start doing so? Are the review criteria promoting the papers which a hope of survival?
  2. For conference organizers, it’s important to understand the way science has changed. Originally, you had to be a giant to succeed at science. Then, you merely had to stand on the shoulders of giants to succeed. Now, it seems that even the ability to peer over the shoulders of people standing on the shoulders of giants might be helpful. This is generally a good thing, because it means more people can help on a very hard task. Nevertheless, it seems that much of this effort is getting wasted in forgetting, because we do not have the right mechanisms to remember the information. Which is going to be the first conference to switch away from an ordered list of papers to something with structure? Wouldn’t it be great if all the content at a conference was organized in a wikipedia-like easy-for-outsiders-to-understand style?
Older Posts »

Powered by WordPress