Sebastien Bubeck has a new ML blog focused on optimization and partial feedback which may interest people.

## 3/22/2013

## 1/31/2013

### Remote large scale learning class participation

Yann and I have arranged so that people who are interested in our large scale machine learning class and not able to attend in person can follow along via two methods.

- Videos will be posted with about a 1 day delay on techtalks. This is a side-by-side capture of video+slides from Weyond.
- We are experimenting with Piazza as a discussion forum. Anyone is welcome to subscribe to Piazza and ask questions there, where I will be monitoring things.
**update2**: Sign up here.

The first lecture is up now, including the revised version of the slides which fixes a few typos and rounds out references.

## 2/20/2012

### Berkeley Streaming Data Workshop

The From Data to Knowledge workshop May 7-11 at Berkeley should be of interest to the many people encountering streaming data in different disciplines. It’s run by a group of astronomers who encounter streaming data all the time. I met Josh Bloom recently and he is broadly interested in a workshop covering all aspects of Machine Learning on streaming data. The hope here is that techniques developed in one area turn out useful in another which seems quite plausible. Particularly if you are in the bay area, consider checking it out.

## 12/2/2011

### Hadoop AllReduce and Terascale Learning

Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case, there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy.

Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than *any future* single machine learning algorithm. The proof of this is simple: We can output a optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation.

It is necessary but not sufficient to have an effective communication infrastructure. It is necessary but not sufficient to have a decent programming language, because parallel programming is hard. It is necessary but not sufficient to have a good optimization approach. The combination says “yikes”, because you need to know many things to design an effective new system.

For communication infrastructures, the two most prevalent approaches are MPI and MapReduce, both of which have substantial drawbacks for machine learning with lots of data.

- MPI suffers because it has no fault tolerance by default and because it has a poor understanding of where data is, implying that data must be either manually placed on local nodes, or the first step in every computation is “partition the data across the cluster” which is very undesirable from a communication complexity and programming complexity standpoint. These significantly limit the scale that you can work at to ~100 nodes in practice, because the economics of clusters make sharing unavoidable at larger scales. When the cluster is shared, preshuffling the data is awkward to impossible and you must expect that some nodes will run slower than others because they will be executing other jobs. This limitation on reliability kicks in much sooner than disk read failures or node failures.
- MapReduce suffers because the setup and teardown costs are significant. Measured directly, this is often on the order of a minute, associated with interacting with the scheduler and communicating the program to a large number of nodes. But indirectly, this can be radically worse, as any map-reduce job can be held in limbo while waiting for free nodes to work on. And commonly we need to execute many MapReduce iterations to achieve high quality prediction.

MapReduce has another more subtle flaw: using it requires refactoring your code into a sequence of map and reduce operations. This is significantly annoying, because right good learning algorithms is pretty difficult in the first place. MapReduce has a third flaw: it encourages inefficient optimization paradigm. In particular, while you can phrase many learning algorithms as statistical query learning algorithms, doing so is energy inefficient, up to O(examples) in extreme cases.

Since the drawbacks of MPI and MapReduce differ, we can try to create a solution which eliminates all of drawbacks, which a Hadoop-compatible AllReduce does. Cherry picking from each we get:

- MPI: The Allreduce function. The starting state for AllReduce is
*n*nodes each with a number, and the end state is all nodes having the sum of all numbers. - MapReduce: Conceptual simplicity. One easy to understand function is enough.
- MPI: No need to refactor code. You just sprinkle allreduce in a few locations in your single machine code.
- MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
- MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
- MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
- MPI: Fast optimization approaches remain within the conceptual scope. Allreduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be walked around, but that’s annoying.
- MapReduce: Robustness. We don’t captures all the robustness of MapReduce which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution approach to deal with the slow node problem and use delayed initialization to get around all startup failures giving you something with >99% success rate on a running time reliable to within a factor of 2.

One function (all_reduce) is not a programming language. But since it’s written in C, it is easily encapsulated and added to any existing programming language giving you a complete language. To test this hypothesis, I visited Clement for a day, where we connected things to make Allreduce work in Lua twice—once with an online approach and once with an LBFGS optimization approach for convolutional neural networks. As a parallel programming paradigm, it’s amazingly easier than many other approaches, because you take your existing code and figure out which pieces of state to synchronize. It’s superior enough that I’ve now eliminated the multithreaded and parallel online learning approaches within Vowpal Wabbit. This approach is also great in terms of the amount of incremental learning required—you just need to learn one function to be able to create useful parallel machine learning algorithms. The only thing easier than learning one function is learning none, which you can do for linear prediction by just using VW. Incidentally, we designed the AllReduce code so that Hadoop is not a requirement—you just need to do a bit of extra scripting and lose some of the benefits discussed above when running this on a workstation cluster or a single machine.

You also need to get optimization approaches right. Two canonical but very different optimization algorithms are stochastic gradient descent and LBFGS. Understanding the weaknesses of these algorithm is critical even though often not discussed by their proponents. SGD approaches tend to have two drawbacks: the right choice of various hyperparameters can be annoying. We’ve mostly eliminated this drawback in VW using a learning rate that is tuned to automatically work in various ways. The other drawback is that they generally aren’t great at dealing with noise. This is tricky to deal with in general, because the algorithms only see one example at a time. Leon Bottou is working to eliminate this last drawback, but my impression is that we’re not quite there yet. LBFGS on the other hand is great at dealing with noise but suffers significantly in it’s early convergence rate where SGD is extremely effective. Again, we can combine these approaches in an obvious way: use online learning at the beginning to warmstart LBFGS to integrate out the noise. In practice, the online learning gets you 95%-99% of the way there and then LBFGS nails the last bit of performance.

For the problem I mentioned at the beginning, we can learn in about an hour using a kilonode, implying an overall throughput of 500 megafeatures/s, which is about a factor of 5 faster than any single network interface (1 gigabit/s). This is substantially greater scaling than any of the other algorithms in the Scaling up Machine Learning book (see here for a comparison).

The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all other approaches, this approach is a significant step up within it’s scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:

- For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.
- For smaller datasets with computationally intense models, GPU approaches seem very compelling.
- For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug.
- The model needs to fit into memory. If that’s not the case, then other approaches are required.

I also expect Hadoop Allreduce is useful across many more tasks than just machine learning. Optimization problems are an easy example, but I suspect there are a number of iterative computation problems where allreduce can be very effective. While it might appear a limited operation, you can easily do average, weighted average, max, etc… And, the scope of allreduce is also easily broadened with an arbitrary reduce function, as per MPI’s version. The Allreduce code itself is not yet native in Hadoop, so you’ll need to grab it from the VW source code which has a BSD license. I’ve been encouraged by discussions with Milind suggesting it may become native soon.

Update: CACM Crosspost.

## 9/28/2011

### Somebody’s Eating Your Lunch

Since we last discussed the other online learning, Stanford has very visibly started pushing mass teaching in AI, Machine Learning, and Databases. In retrospect, it’s not too surprising that the next step up in serious online teaching experiments are occurring at the computer science department of a university embedded in the land of startups. Numbers on the order of 100000 are quite significant—similar in scale to the number of computer science undergraduate students/year in the US. Although these populations surely differ, the fact that they *could* overlap is worth considering for the future.

It’s too soon to say how successful these classes will be and there are many easy criticisms to make:

**Registration != Learning**… but if only 1/10th complete these classes, the scale of teaching still surpasses the scale of any traditional process.**1st year excitement != nth year routine**… but if only 1/10th take future classes, the scale of teaching still surpasses the scale of any traditional process.**Hello, cheating**… but teaching is much harder than testing in general, and we already have recognized systems for mass testing.**Online misses out**… sure, but for students not enrolled in a high quality university program, this is simply not a relevant comparison. There are also benefits to being online as well, as your time might be better focused. Anecdotally, at Caltech, they let us take two classes at the same time, which I did a few times. Typically, I had a better grade in the class that I skipped as the instructor had to go through things rather slowly.**Where’s the beef?**The hard nosed will want to know how to make money, which is always a concern. But, a decent expectation is that if you first figure out how to create value, you’ll find some way to make money. And, if you first wait until it’s clear how to make money, you won’t make any.

My belief is that this project will pan out, with allowances for the expected inevitable adjustments that you learn to make from experience. I think the critics miss an understanding of what’s possible and what motivates people.

The prospect of teaching 1 student means you might review some notes. The prospect of teaching ~10 students means you prepare some slides. The prospect of teaching ~100 students means you polish your slides well, trying to anticipate questions, and hopefully drawing on experience from previous presentations. I’ve never directly taught ~1000 students, but at that scale you must try very hard to make the presentation perfect, including serious testing with dry runs. 10^{5} students must make getting out of bed in the morning quite easy.

Stanford has a significant first-mover advantage amongst top research universities, but it’s easy to imagine a few other (but not many) universities operating at a similar scale. Those that have the foresight to start a serious online teaching program soon will have a chance of being among the few. For other research universities, we can expect boutique traditional classes to continue for some time. These boutique classes may have some significant social value, because it’s easy to imagine that the few megaclasses miss important things in developing research areas. And for everyone working at teaching universities, someone is eating your lunch.

## 9/21/2010

### Regretting the dead

Nikos pointed out this new york times article about poor clinical design killing people. For those of us who study learning from exploration information this is a reminder that low regret algorithms are particularly important, as regret in clinical trials is measured by patient deaths.

Two obvious improvements on the experimental design are:

- With reasonable record keeping of existing outcomes for the standard treatments, there is no need to explicitly assign people to a control group with the standard treatment, as that approach is effectively explored with great certainty. Asserting otherwise would imply that the nature of effective treatments for cancer has changed between now and a year ago, which denies the value of any clinical trial.
- An optimal experimental design will smoothly phase between exploration and exploitation as evidence for a new treatment shows that it can be effective. This is old tech, for example in the EXP3.P algorithm (page 12 aka 59) although I prefer the generalized and somewhat clearer analysis of EXP4.P.

Done the right way, the clinical trial for a successful treatment would start with some initial small pool (equivalent to “phase 1″ in the article) and then simply expanded the pool of participants over time as it proved superior to the existing treatment, until the pool is everyone. And as a bonus, you can even compete with policies on treatments rather than raw treatments (i.e. personalized medicine).

Getting from here to there seems difficult. It’s been 15 years since EXP3.P was first published, and the progress in clinical trial design seems glacial to us outsiders. Partly, I think this is a communication and education failure, but partly, it’s also a failure of imagination within our own field. When we design algorithms, we often don’t think about all the applications, where a little massaging of the design in obvious-to-us ways so as to suit these applications would go a long ways. Getting this right here has a substantial moral aspect, potentially saving millions of lives over time through more precise and fast deployments of new treatments.

## 6/13/2010

### The Good News on Exploration and Learning

Consider the contextual bandit setting where, repeatedly:

- A context
*x*is observed. - An action
*a*is taken given the context*x*. - A reward
*r*is observed, dependent on*x*and*a*.

Where the goal of a learning agent is to find a policy for step 2 achieving a large expected reward.

This setting is of obvious importance, because in the real world we typically make decisions based on some set of information and then get feedback only about the single action taken. It also fundamentally differs from supervised learning settings because knowing the value of one action is not equivalent to knowing the value of all actions.

A decade ago the best machine learning techniques for this setting where implausibly inefficient. Dean Foster once told me he thought the area was a research sinkhole with little progress to be expected. Now we are on the verge of being able to routinely attack these problems, in almost exactly the same sense that we routinely attack bread and butter supervised learning problems. Just as for supervised learning, we know how to create and reuse datasets, how to benchmark algorithms, how to reuse existing supervised learning algorithms in this setting, and how to achieve optimal rates of learning quantitatively similar to supervised learning.

This is also one of the times when understanding the basic theory can make a huge difference in your success. There are many wrong ways to attack contextual bandit problems or prepare datasets, and taking a wrong turn can easily mean the difference between failure and success. Understanding how contextual bandit problems differ from basic supervised learning problems is critical to routine success here.

All of the above is not meant to claim that everything is done research-wise here so we’ll try to outline where the current boundary of research lies as best we can. However, we are surely at a point both in terms of application demand (especially for internet applications of search, advertising, page optimization, but also medical applications and surely others) and methodology supply (with basic reliable techniques now easily available or created) where these techniques are shifting from theory esoterica to required education.

Given the above, Alina and I decided to prepare a tutorial to be given at Yahoo! Labs summer school (my first India trip!), ICML, KDD, and hopefully videolectures.net. Please join us. The subjects we plan to cover are essentially the keys to the kingdom of solving shallow interactive learning problems.

## 1/24/2010

### Specializations of the Master Problem

One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as:

- The world announces an observation
*x*. - The agent makes a choice
*a*. - The world announces a reward
*r*.

The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating *x* to *a* or *a* to *r* is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of the master problem general enough to provide an effective solution in real problems?

The process of specializing is a tricky business, as you want to simultaneously achieve tractable analysis, sufficient generality to be useful, and yet capture a new aspect of the master problem not otherwise addressed. Consider: How is it even possible to choose a setting where analysis is tractable before you even try to analyze it? What follows is my mental map of different specializations.

### Online Learning

The online learning setting is perhaps the most satisfying specialization more general than standard batch learning at present, because it turns out to additionally provide tractable algorithms for many batch learning settings.

Standard online learning models specialize in two ways: You assume that the choice of action in step 2 does not influence future observations and rewards, and you assume additional information is available in step 3, a retrospectively available reward for each action. The algorithm for an agent in this setting typically has a given name—gradient descent, weighted majority, Winnow, etc…

The general algorithm here is a more refined version of follow-the-leader than in batch learning, with online update rules. An awesome discovery about this setting is that it’s possible to compete with a set of predictors even when the world is totally adversarial, substantially strengthening our understanding of what learning is and where it might be useful. For this adversarial setting, the algorithm alters into a form of follow-the-perturbed leader, where the learning algorithm randomizes it’s action amongst the set of plausible alternatives in order to defeat an adversary.

The standard form of argument in this setting is a potential argument, where at each step you show that if the learning algorithm performs badly, there is some finite budget from which an adversary deducts it’s ability. The form of the final theorem is that you compete with the accumulated reward of a set any one-step policies *h:X – > A*, with a dependence log(#policies) or weaker in regret, a measure of failure to compete.

A good basic paper to read here is:

Nick Littlestone and Manfred Warmuth, The Weighted Majority Algorithm, which shows the basic information-theoretic claim clearly. Vovk‘s page on aggregating algorithms is also relevant, although somewhat harder to read.

Provably computationally tractable special cases all have linear structure, either on rewards or policies. Good results are often observed empirically by applying backpropagation for nonlinear architectures, with the danger of local minima understood.

### Bandit Analysis

In the bandit setting, step 1 is omitted, and the difficulty of the problem is weakened by assuming that action in step (2) don’t alter future rewards. The goal is generally to compete with all constant arm strategies.

Analysis in this basic setting started very specialized with Gittin’s Indicies and gradually generalized over time to include IID and fully adversarial settings, with EXP3 a canonical algorithm. If there are *k* strategies available, the standard theorem states that you can compete with the set of all constant strategies up to regret *k*. The most impressive theoretical discovery in this setting is that the dependence on *T*, the number of timesteps, is not substantially worse than supervised learning despite the need to explore.

Given the dependence on *k* all of these algorithms are computationally tractable.

However, the setting is flawed, because the set of constant strategies is inevitably too weak in practice—it’s an example of optimal decision making given that you ignore almost all information. Adding back the observation in step 1 allows competing with a large set of policies, while the regret grows only as log(#policies) or weaker. Canonical algorithms here are EXP4 (computationally intractable, but information theoretically near-optimal), Epoch-Greedy (computationally tractable given an oracle optimizer), and the Offset Tree providing a reduction to supervised binary classification.

### MDP analysis

A substantial fraction of reinforcement learning has specialized on the Markov Decision Process setting, where the observation *x* is a state *s*, which is a sufficient statistic for predicting all future observations. Compared to the previous settings, dealing with time dependence is explicitly required, but learning typically exists in only primitive forms.

The first work here was in the 1950′s where the actual MDP was assumed known and the problem was simply computing a good policy, typically via dynamic programming style solutions. More recently, principally in the 1990′s, the setting where the MDP was not assumed known was analyzed. A very substantial theoretical advancement was the *E ^{3}* algorithm which requires only

*O(S*experience to learn a near-optimal policy where the world is an MDP with

^{2}A)*S*state and

*A*actions per state. A further improvement on this is Delayed Q-Learning, where only

*O(SA)*experience is required. There are many variants on the model-based approach and not much for the model-free approach. Lihong Li‘s thesis probably has the best detailed discussion at present.

There are some unsatisfactory elements of the analysis here. First, I’ve suppressed the dependence on the definition of “approximate” and the typical time horizon, for which the dependence is often bad and the optimality is unclear. The second is the dependence on *S*, which is intuitively unremovable, with this observation formalized in the lower bound Sham and I worked on (section 8.6 of Sham’s thesis). Empirically, these and related algorithms are often finicky, because in practice the observation isn’t a sufficient statistic and the number of states isn’t small, so approximating things as such is often troublesome.

A very different variant of this setting is given by Control theory, which I know less about than I should. The canonical setting for control theory is with a known MDP having linear transition dynamics. More exciting are the system identification problems where the system must be first identified. I don’t know any good relatively assumption free results for this setting.

### Oracle Advice Shortcuts

Techniques here specialize the setting to situations in which some form of oracle advice is available when a policy is being learned. A good example of this is an oracle which provides samples from the distribution of observations visited by a good policy. Using this oracle, conservative policy iteration is guaranteed to perform well, so long as a base learning algorithm can predict well. This algorithm was refined and improved a bit by PSDP, which works via dynamic programming, improving guarantees to work with regret rather than errors.

An alternative form of oracle is provide by access to a good policy at training time. In this setting, Searn has similar provable guarantees with a similar analysis.

The oracle based algorithms appear to work well anywhere these oracles are available.

### Uncontrolled Delay

In the uncontrolled delay setting, step (2) is removed, and typically steps (1) and (3) are collapsed into one observation, where the goal becomes state tracking. Most of the algorithms for state tracking are heavily model dependent, implying good success within particular domains. Examples include Kalman filters, hidden markov models, and particle filters which typical operate according to an explicit probabilistic model of world dynamics.

Relatively little is known for a nonparametric version of this problem. One observation is that the process of predicting adjacent observations well forms states as a byproduct when the observations are sufficiently rich as detailed here.

A basic question is: What’s missing from the above? A good answer is worth a career.

## 12/7/2009

### Vowpal Wabbit version 4.0, and a NIPS heresy

I’m releasing version 4.0(tarball) of Vowpal Wabbit. The biggest change (by far) in this release is experimental support for cluster parallelism, with notable help from Daniel Hsu.

I also took advantage of the major version number to introduce some incompatible changes, including switching to murmurhash 2, and other alterations to cachefiles. You’ll need to delete and regenerate them. In addition, the precise specification for a “tag” (i.e. string that can be used to identify an example) changed—you can’t have a space between the tag and the ‘|’ at the beginning of the feature namespace.

And, of course, we made it faster.

For the future, I put up my todo list outlining the major future improvements I want to see in the code. I’m planning to discuss the current mechanism and results of the cluster parallel implementation at the large scale machine learning workshop at NIPS later this week. Several people have asked me to do a tutorial/walkthrough of VW, which is arranged for friday 2pm in the workshop room—no skiing for me Friday. Come join us if this heresy interests you as well

## 11/15/2009

### The Other Online Learning

If you search for “online learning” with any major search engine, it’s interesting to note that zero of the results are for online machine learning. This may not be a mistake if you are committed to a global ordering. In other words, the number of people specifically interested in the least interesting top-10 online human learning result might exceed the number of people interested in online machine learning, even given the presence of the other 9 results. The essential observation here is that the process of human learning is a big business (around 5% of GDP) effecting virtually everyone.

The internet is changing this dramatically, by altering the economics of teaching. Consider two possibilities:

- The classroom-style teaching environment continues as is, with many teachers for the same subject.
- All the teachers for one subject get together, along with perhaps a factor of 2 more people who are experts in online delivery. They spend a factor of 4 more time designing the perfect lecture & learning environment as verified by extensive study.

These two approaches have a similar economic cost, with the additional effort in the second approach being offset by the fact that it is a one-time effort rather than an annual effort.

I’m sure many people prefer the classroom approach, because it’s traditional, because a teacher can adjust dynamically and intelligently to the student, and because a teacher provides ancillary benefits such as day care and child abuse detection. Nevertheless, the second approach represents a compelling alternative addressing education. For classes commonly taught through high school, it’s difficult to imagine how good a learning experience could be after millions of hours spent refining to create the perfect approach. Imagine repeating a lecture over-and-over, testing the resulting student understanding a {day, week, month, year, decade} later to such an extent that every slide, every sentence, and every exercise is optimized for excellent learning. We could even imagine adapting the lecture to the learning style of each student.

The process of converting to the second approach has been slow, but it seems to be picking up. This suggests we can expect several things:

**Shakeout**Like all new approaches, there is room for early adopters to win while the established old order suffers. We can expect the most severe impact on pure teaching institutions which do not adopt the newer approaches. Research universities will be insulated in two ways: much of their revenue comes from research grants anyways while the new approach creates a flight to excellence, which the research universities can lay some claim to. At one extreme, I understand that only 4-5% of the operating budget for Caltech comes from student tuition.**Centralized Testing**. Although class lessons can be taught at a distance, and exercises worked out by students, there is great room for cheating. The remedy for this is a strong centralized testing service. This already exists in the form of SAT, GRE, and AP tests, because grade inflation and nonuniform standards are common across schools. If a student can ace these tests after taking online learning classes, then there is a real sense in which colleges accepting students are satisfied by their qualifications. We can expect this to become more true, and perhaps to see more employer-oriented tests. We can also expect that testable subjects have an inherent advantage in online learning. As centralized testing is a difficult market to break into, the existing systems have a substantial advantage here.**Digitization**. Doing online learning brings all the advantages and disadvantages of any other digital media. These include perfect replicability, essentially free distribution, and difficult economics—on one hand the approach could be vastly valuable while on the other it’s difficult to charge someone for something they can get free. The economics imply that there is room for a major charity or state government to accomplish a great deal which might be difficult to accomplish in a business model.**Gaps**. There are areas of teaching which are not amenable to online instruction. For example, teaching people to do research remains in the apprentice model. Similarly, letters of recommendation remain an aspect of the apprentice model. Subjects of relatively small interest such as individual research directions may not merit the effort of a highly polished online instruction system. Similarly, many elements of our current education system are not related to formal education, but rather are about students meeting students, teachers acting as daycare for students, or simply structuring the day for learning. Mechanisms achieving the same ends with online human learning systems are necessary, and the conflation of goals represented by the traditional education approach will retard (but not stop) the adoption of online learning approaches. This process has already taken a decade, and we can expect more decades to come.

For those of us interested in online machine learning, it’s natural to question the relationship with online human learning. The practices differ entirely, but the theory still applies, as there are no clauses in the theorem statements of the form “if the learning agent is not a human then…” When you examine the theorem statements for applicability to online human learning, there are a few ideas which may transfer well. One of these is the necessity and techniques for handling exploration problems. If there are two ways to teach a subject, then you could simply try both and take the best. But if your resources are limited then a UCB approach provides a more efficient mechanism for doing this testing. Similarly, if a student has a set of known attributes, contextual-bandit approaches suggest a sound mechanism for personalization of lessons.

Much of our other theory about the process of online learning may be helpful in a heuristic-motivating manner, but it appears typically too pessimistic to accurately capture what is possible. For example, a common technique to explain an idea when teaching is to simply cover a few extreme cases from which all others are some interpolation. The closest common machine learning analogue to this is some active learning algorithms, where a learning algorithm chooses which examples to label. But, of course, this is not an accurate model, because it’s not the student, but rather the teacher which is choosing the examples. A setting more suitable for student and teacher has been studied in learning theory (see the bibliography here for a link into the citation tree). However, these results are typically rather brittle, so it’s not clear yet that we have understood the right way to formalize this process.

## 10/10/2009

### ALT 2009

I attended ALT (“Algorithmic Learning Theory”) for the first time this year. My impression is ALT = 0.5 COLT, by attendance and also by some more intangible “what do I get from it?” measure. There are many differences which can’t quite be described this way though. The program for ALT seems to be substantially more diverse than COLT, which is both a weakness and a strength.

One paper that might interest people generally is:

Alexey Chernov and Vladimir Vovk, Prediction with Expert Evaluators’ Advice. The basic observation here is that in the online learning with experts setting you can simultaneously compete with several compatible loss functions simultaneously. Restated, debating between competing with log loss and squared loss is a waste of breath, because it’s almost free to compete with them both simultaneously. This might interest anyone who has run into “which loss function?” debates that come up periodically.

## 7/31/2009

### Vowpal Wabbit Open Source Project

Today brings a new release of the Vowpal Wabbit fast online learning software. This time, unlike the previous release, the project itself is going open source, developing via github. For example, the lastest and greatest can be downloaded via:

git clone git://github.com/JohnLangford/vowpal_wabbit.git

If you aren’t familiar with git, it’s a distributed version control system which supports quick and easy branching, as well as reconciliation.

This version of the code is confirmed to compile without complaint on at least some flavors of OSX as well as Linux boxes.

As much of the point of this project is pushing the limits of fast and effective machine learning, let me mention a few datapoints from my experience.

- The program can effectively scale up to batch-style training on sparse terafeature (i.e. 10
^{12}sparse feature) size datasets. The limiting factor is typically i/o. - I started using the the real datasets from the large-scale learning workshop as a convenient benchmark. The largest dataset takes about 10 minutes. (This is using the native features that the organizers intended as a starting point, yet all contestants used. In some cases, that admittedly gives you performance nowhere near to optimal.)
- After using this program for awhile, it’s substantially altered my perception of what is a large-scale learning problem. This causes confusion when people brag about computational performance on tiny datasets with only 10
^{5}examples

I would also like to emphasize that this is intended as an open source *project* rather than merely a code drop, as occurred last time. What I think this project has to offer researchers is an infrastructure for implementing fast online algorithms. It’s reasonably straightforward to implant your own tweaked algorithm, automatically gaining the substantial benefits of the surrounding code that deals with file formats, file caching, buffering, etc… In a very real sense, most of the code is this surrounding stuff, which you don’t have to rewrite to benefit from. For people applying machine learning, there is some obvious value in getting very fast feedback in a batch setting, as well as having an algorithm that actually works in a real online setting.

As one example of the ability to reuse the code for other purposes, an effective general purpose online implementation of the Offset Tree is included. I haven’t seen any other implementation of an algorithm for learning in the agnostic partial label setting, so this code may be of substantial interest for people encountering these sorts of problems.

The difference between this version and the previous is a nearly total rewrite. Some bigger changes are:

- We dropped SEG for now, because of code complexity reasons.
- Multicore parallelization proceeds in a different fashion—parallelization over features instead of examples. This works better with caches. Note that all parallelization of the core algorithm is meaningless unless you use the
**-q**flag, because otherwise you are i/o bound. - The code is more deeply threaded, with a separate thread for parsing.
- There is support for several different loss functions, and it’s easy to add your own.

I’m interested in any bug reports or suggestions for the code. I have substantial confidence that this code can do interesting and useful things, but improving it is a constant and ongoing process.

## 4/2/2009

### Asymmophobia

One striking feature of many machine learning algorithms is the gymnastics that designers go through to avoid symmetry breaking. In the most basic form of machine learning, there are labeled examples composed of features. Each of these can be treated symmetrically or asymmetrically by algorithms.

**feature symmetry**Every feature is treated the same. In gradient update rules, the same update is applied whether the feature is first or last. In metric-based predictions, every feature is just as important in computing the distance.**example symmetry**Every example is treated the same. Batch learning algorithms are great exemplars of this approach.**label symmetry**Every label is treated the same. This is particularly noticeable in multiclass classification systems which predict according to*arg max*but it occurs in many other places as well._{l}w_{l}x

Empirically, breaking symmetry well seems to yield great algorithms.

**feature asymmetry**For those who like the “boosting is stepwise additive regression on exponential loss” viewpoint (I don’t entirely), boosting is an example of symmetry breaking on features.**example asymmetry**Online learning introduces an example asymmetry. Aside from providing a mechanism for large scale learning, it also enables learning in entirely new (online) settings.**label asymmetry**Tree structured algorithms are good instances of example asymmetry. This includes both the older decision tree approaches like C4.5 and some newer ones we’ve worked on. These approaches are exponentially faster in the number of labels than more symmetric approaches.

The examples above are notably important, with good symmetry breaking approaches yielding substantially improved prediction or computational performance. Given such strong evidence that symmetry breaking is a desirable property, a basic question is: Why isn’t it more prevalent, and more thoroughly studied? One reasonable answer is that doing symmetry breaking well requires more serious thought about learning algorithm design, so researchers simply haven’t gotten to it. This answer appears incomplete.

A more complete answer is that many researchers seem to reflexively avoid symmetry breaking. A simple reason for this is the now pervasive use of Matlab in machine learning. Matlab is a handy tool for fast prototyping of learning algorithms, but it has an intrinsic language-level bias towards symmetric approaches since there are builtin primitives for matrix operations. A more complex reason is a pervasive reflex belief in fairness. While this is admirable when reviewing papers, it seems less so when designing learning algorithms. A third related reason seems to be a fear of doing unmotivated things. Anytime symmetry breaking is undertaken, the method for symmetry breaking is in question, and many people feel uncomfortable without a theorem suggesting the method is the right one. Since there are few theorems motivating symmetry breaking methods, it is often avoided.

What methods for symmetry breaking exist?

- Randomization. Neural Network learning algorithms which initialize the weights randomly exemplify this. I consider the randomization approach particularly weak. It makes experiments non-repeatable, and it seems like the sort of solution that someone with asymmophobia would come up with if they were forced to do something asymmetric.
- Arbitrary. Arbitrary symmetry breaking is something like random, except there is no randomness—you simply declare this feature/label/example comes first and that one second. This seems mildly better than the randomized approach, but still not inspiring.
- Data-driven. Boosting is a good example where a data-driven approach drives symmetry breaking (over features). Data-driven approaches for symmetry breaking seem the most sound, as they can result in improved performance.

While there are examples of learning algorithms doing symmetry breaking for features, labels, and examples individually, there aren’t any I know which do all three, well. What would such an algorithm look like?

## 2/18/2009

### Decision by Vetocracy

Few would mistake the process of academic paper review for a fair process, but sometimes the unfairness seems particularly striking. This is most easily seen by comparison:

Paper | Banditron | Offset Tree | Notes |

Problem Scope | Multiclass problems where only the loss of one choice can be probed. | Strictly greater: Cost sensitive multiclass problems where only the loss of one choice can be probed. | Often generalizations don’t matter. That’s not the case here, since every plausible application I’ve thought of involves loss functions substantially different from 0/1. |

What’s new | Analysis and Experiments | Algorithm, Analysis, and Experiments | As far as I know, the essence of the more general problem was first stated and analyzed with the EXP4 algorithm (page 16) (1998). It’s also the time horizon 1 simplification of the Reinforcement Learning setting for the random trajectory method (page 15) (2002). The Banditron algorithm itself is functionally identical to One-Step RL with Traces (page 122) (2003) in Bianca‘s thesis with the epsilon greedy strategy and a multiclass perceptron with update scaled by the importance weight. |

Computational Time | O(k) per example where k is the number of choices |
O(log k) per example |
Lower bounds on the sample complexity of learning in this setting are a factor of k worse than for supervised learning, implying that many more examples may be needed in practice. Consequently, learning algorithm speed is more important than in standard supervised learning. |

Analysis | Incomparable. An online regret analysis showing that if a small hinge loss predictor exists, a bounded number of mistakes occur. Also, an algorithm independent analysis of the fully realizable case. | Incomparable. A learning reduction analysis showing how the regret of any base classifier bounds policy regret. Also contains a lower bound and comparable analysis of all plausible alternative reductions. | |

Experiments | 1 dataset, comparing with no other approaches to solving the problem. | 13 datasets, comparing with 2 other approaches to solve the problem. | |

Outcome | Accepted at ICML | Rejected at ICML, NIPS, UAI, and NIPS. |

The reviewers of the Banditron paper made the right call. The subject is interesting, and analysis of a new learning domain is of substantial interest. Real advances in machine learning often come as new domains of application. The talk was well attended and generated substantial interest. It’s also important to remember the reviewers of the two papers probably did not overlap, so there was no explicit preference for A over B.

Why was the Offset Tree rejected? One of these rejections is easily explained as a fluke—we ran into a reviewer at UAI who believes that learning by memorization is the way to go. I, and virtually all machine learning people, disagree but some reviewers at UAI aren’t interested or expert in machine learning.

The striking thing about the other 3 rejects is that they all contain a reviewer who doesn’t read the paper. Instead, the reviewer asserts that learning reductions are bogus because for an alternative notion of learning reduction, made up by the reviewer, an obviously useless approach yields a factor of 2 regret bound. I believe this is the same reviewer each time, because the alternative theorem statement drifted over the reviews fixing bugs we pointed out in the author response.

The first time we encountered this review, we assumed the reviewer was just cranky that day—maybe we weren’t quite clear enough in explaining everything as it’s always difficult to get every detail clear in new subject matter. I have sometimes had a very strong negative impression of a paper which later turned out to be unjustified upon further consideration. Sometimes when a reviewer is cranky, they change their mind after the authors respond, or perhaps later, or perhaps never but you get a new set of reviewers the next time.

The second time the review came up, we knew there was a problem. If we are generous to the reviewer, and taking into account the fact that learning reduction analysis is a relatively new form of analysis, the fear that because an alternative notion of reduction is vacuous our notion of reduction might also be vacuous isn’t too outlandish. Fortunately, there is a way to completely address that—we added an algorithm independent lower bound to the draft (which was the only significant change in content over the submissions). This lower bound conclusively proves that our notion of learning reduction is not vacuous as is the reviewer’s notion of learning reduction.

The review came up a third time. Despite pointing out the lower bound quite explicitly, the reviewer simply ignored it. This more-or-less confirms our worst fears. Some reviewer is bidding for the paper with the intent to torpedo review it. They are uninterested in and unwiling to read the content itself.

**Shouldn’t author feedback address this?** Not if the reviewer ignores it.

**Shouldn’t Double Blind reviewing help?** Not if the paper only has one plausible source. The general problem area and method of analysis were freely discussed on hunch.net. We withheld public discussion of the algorithm itself for much of the time (except for a talk at CMU) out of respect for the review process.

**Why doesn’t the area chair/program chair catch it?** It took us 3 interactions to get it, so it seems unrealistic to expect someone else to get it in one interaction. In general, these people are strongly overloaded and the reviewer wasn’t kind enough to boil down the essence of the stated objection as I’ve done above. Instead, they phrase it as an example and do not clearly state the theorem they have in mind or distinguish the fact that the quantification of that theorem differs from the quantification of our theorems. More generally, my observation is that area chairs rarely override negative reviews because:

- It risks their reputation since defending a criticized work requires the kind of confidence that can only be inspired by a thorough personal review they don’t have time for.
- They may offend the reviewer they invited to review and personally know.
- They figure that the average review is similar to the average perception/popularity by the community anyways.
- Even if they don’t agree with the reviewer, it’s hard to fully discount the review in their consideration.

I’ve seen these effects create substantial mental gymnastics elsewhere.

**Maybe you just ran into a cranky reviewer 3 times randomly** Maybe so. However, the odds seem low enough and the 1/2 year cost of getting another sample high enough, that going with the working hypothesis seems indicated.

**Maybe the writing needs improving**. Often that’s a reasonable answer for a rejection, but in this case I believe not. We’ve run the paper by several people, who did not have substantial difficulties understanding it. They even understand the draft well enough to make a suggestion or two. More generally, no paper is harder to read than the one you picked because you want to reject it.

**What happens next?** With respect to the Offset Tree, I’m hopeful that we eventually find reviewers who appreciate an exponentially faster algorithm, good empirical results, or the very tight and elegant analysis, or even all three. For the record, I consider the Offset Tree a great paper. It remains a substantial advance on the state of the art, even 2 years later, and as far as I know the Offset Tree (or the Realizable Offset Tree) consistently beat all reasonable contenders both in prediction and computational performance. This is rare and precious, as many papers tradeoff one for the other. It yields a practical algorithm applicable to real problems. It substantially addresses the RL to classification reduction problem. It also has the first nonconstant algorithm independent lower bound for learning reductions.

With respect to the reviewer, I expect remarkably little. The system is designed to protect reviewers, so they have virtually no responsibility for their decisions. This reviewer has a demonstrated capability to sabotage the review process at ICML and NIPS and a demonstrated willingness to continue doing so indefinitely. The process of bidding for papers and making up reasons to reject them seems tedious, but there is no fundamental reason why they can’t continue doing so for several decades if they remain active in academia.

This experience has substantially altered my understanding and appreciation of the review process at conferences. The bidding mechanism commonly used, coupled with responsibility-free reviewing is an invitation to abuse. A clever abusive reviewer can sabotage perhaps 5 papers per conference (out of 8 reviewed), while maintaining a typical average score. While I don’t believe most people choose papers with intent to sabotage, the capability is there and used by at least one person and possibly others. If, for example, 5% of reviewers are willing to abuse the process this way and there are 100 reviewers, every paper must survive 5 vetoes. If there are 200 reviewers, every paper must survive 10 vetoes. And if there are 400 reviewers, every paper must survive 20 vetoes. This makes publishing any paper that offends someone difficult. The surviving papers are typically inoffensive or part of a fad strong enough that vetoes are held back. Neither category is representative of high quality decision making. These observations suggest that the conference with the most reviewers tend strongly toward faddy and inoffensive papers, both of which often lack impact in the long term. Perhaps this partly explains why NIPS is so weak when people start citation counting. Conversely, this would suggest that smaller conferences and workshops have a natural advantage. Similarly, the reviewing style in theory conferences seems better—the set of bidders for any paper is substantially smaller, implying papers must survive fewer vetos.

This decision making process can be modeled as a group of *n* decision makers, each of which has the opportunity to veto any action. When *n* is relatively small, this decision making process might work ok, depending on the decision makers, but as *n* grows larger, it’s difficult to imagine a worse decision making process. The closest representatives outside of academia I know are deeply bureacratic governments and other large organizations where many people must sign off on something before it takes place. These vetocracies are universally frustrating to interact with. A reasonable conjecture is that any decision making process with a large veto number has poor characteristics.

A basic question is: Is a vetocracy inevitable for large organizations? I believe the answer is no. The basic observation is that the value of *n* can be logarithmic in the number of participants in an organization rather than linear, as per reviewing under a bidding process. An essential force driving vetocracy creation is a desire to offload responsibility for decisions, so there is no clear decision maker. A large organization not deciding by vetocracy must have a very different structure, with clearly dilineated responsibility.

NIPS provides an almost perfect natural experiment in it’s workshop organization, which involves the very same community of people and subject matter, yet works in a very different manner. There are one or two workshop chairs who are responsible for selecting amongst workshop proposals, after which the content of the workshop is entirely up to the workshop organizers. If a workshop is rejected, it’s clear who is at fault, and if a workshop presentation is rejected, it is often clear by who. Some workshop chairs use a small set of reviewers, but even then the effective veto number remains small. Similarly, if a workshop ends up a flop, it’s relatively easy to see who to blame—either the workshop chair for not predicting it, or the organizers for failing to organize. I can’t think of a single time when I attended both the workshops and the conference that the workshops were less interesting than the conference. My understanding is that this observation is common. Given this discussion, it will be particularly interesting to see how the review process Michael and Leon setup for ICML this year pans out, as it is a system with notably more responsibility assignment than in previous years.

Journals end up looking relatively good with respect to vetocracy avoidance. The ones I’m familiar with have a chief editor who bears responsibility for routing papers to an action editor, who bears responsibility for choosing good reviewers. Every agent except the reviewers is often known by the authors, and the reviewers don’t act as additional vetoers in nearly as strong a manner as reviewers with the opportunity to bid.

This experience has also altered my view of blogging and research. On one hand, I’m very enthusiastic about research in general, and my research in particular, where we are regularly cracking conventionally impossible problems. On the other hand, it seems that some small number of people viewing a discussion silently decide they don’t like it, and veto it given the opportunity. It only takes one to turn strong paper into a years-long odyssey, so public discussion of research directions and topics in a vetocracy is akin to voluntarily wearing a “kick me” sign. While this a problem for me, I expect it to be even worse for the members of a vetocracy in the long term.

It’s hard to imagine any research community surviving without a serious online presence. When a prospective new researcher looks around at existing research, if they don’t find serious online discussion, they’ll assume it doesn’t exist under the “not on the internet so it doesn’t exist” principle. This will starve a field of new people. More generally, there is an opportunity to get feedback about research directions and problems much more rapidly than is otherwise possible, allowing us to avoid research on dead end topics which are pervasive. At some point, it may even seem that people not willing to discuss their research simply avoid doing so because it is critically lacking in one way or another. Since a vetocracy creates a substantial disincentive to discuss research directions online, we can expect that communities sticking with decision by vetocracy to be at a substantial disadvantage.

## 7/26/2008

### Compositional Machine Learning Algorithm Design

There were two papers at ICML presenting learning algorithms for a contextual bandit-style setting, where the loss for all labels is not known, but the loss for one label is known. (The first might require a exploration scavenging viewpoint to understand if the experimental assignment was nonrandom.) I strongly approve of these papers and further work in this setting and its variants, because I expect it to become more important than supervised learning. As a quick review, we are thinking about situations where repeatedly:

- The world reveals feature values (aka context information).
- A policy chooses an action.
- The world provides a reward.

Sometimes this is done in an online fashion where the policy can change based on immediate feedback and sometimes it’s done in a batch setting where many samples are collected before the policy can change. If you haven’t spent time thinking about the setting, you might want to because there are many natural applications.

I’m going to pick on the Banditron paper (second one), which attacks the special case of the contextual bandit setting where exactly one of the rewards is *1* and all other actions result in reward *0*, and show that (a) similar performance is achievable via a simple combination of existing modular technologies and (b) superior performance is achievable by optimizing some existing modular technologies for the realizable case. This algorithm is the hardest of the two to compete with, because it explicitly deals with the explore/exploit tradeoff. Note that I’m *definitely* not trying to minimize the paper—there is analysis in that paper which remains interesting to me and isn’t covered by what follows. I am happy that it was published. The point of this post is showing that a modular approach to building learning algorithms is a strong contender when we encounter new learning problems.

Given the problem statement, my approach to solving the problem would be to compose some modular technologies I know.

**Perceptron learning algorithm**We chose a Binary Perceptron as a classification algorithm. This choice is intrinsically motivated by the great computational performance of a Perceptron. It’s also the closest binary supervised learning algorithm to the Banditron, which eliminates a source of variation in comparison. We could have easily chosen a different base learning algorithm, and in many applications this is highly desirable.**Offset Tree Reduction**The offset tree is a newer machine learning reduction from the contextual bandit setting to the standard supervised learning setting. It more robustly transforms a supervised learner’s performance into good policy performance than any other reduction. The offset tree also has good computational properties, since it produces at most*log*binary examples per train or test event, where_{2}k*k*is the number of actions. In some sense it’s unfair to include the offset tree because it hasn’t yet been formally published. In another sense, that’s what this post is about.**Epoch-Greedy exploration**. The Epoch-greedy approach shows how to handle the explore/exploit tradeoff for learning in a contextual bandit setting as a function of a sample complexity bound. For common sample complexity bounds, we get an*O(T*online regret where^{2/3})*T*is the total number of timesteps.**Occam’s Razor Bound**The Occam’s Razor bound limits the regret of an empirical error minimizing perceptron as function of the number of examples. The bound (and it’s many cousins) are often loose, so the only thing we’ll really use is the denominator which says that regret scales as 1/sqrt(number of training examples) in the worst case. Applying Epoch-Greedy to the Occam’s Razor bound gives you an exploration probability of about C/t^{1/3}where*t*is the round number.

Each component above has been analyzed *in isolation* and is at least a reasonable approach (some are the best possible). Each of these components is also composable. Fitting these pieces together, we get an online learning algorithm (agnostic offset-tree perceptron) that chooses to explore uniformly at random amongst the actions with probability about *1/t ^{1/3}*. How well does it perform? On the 4 class reuters based dataset used in the Banditron paper, we get the following accumulated average error rates with some code.:

The right plot is from the Banditron paper. The Perceptron line in both plots is for an algorithm which learns knowing the full loss function of each example, so it represents an ideal we don’t expect to achieve here. There are three results in the left plot:

- The blue line is a version of the component set where the Occam’s Razor bound and the Offset Tree reduction have been optimized for the realizable case. This was the first thing we tested (and it’s the result I mentioned at Shai‘s ICML talk). It turns out this approach works substantially better than the Banditron, achieving an error rate about halfway between the Banditron and the Multiclass Perceptron. The two components that we tweaked are:
**Realizable case Bound**It’s well known that in the realizable case the regret of a chosen classifier should scale as*1/t*rather than*1/t*. Plugging this into epoch-greedy, we get that the probability of exploration should be about^{0.5}*1/t*.^{0.5}**Realizable case Offset Tree**A basic observation is that in the realizable setting, every observation should create an example to tune the learning algorithm. In the context of the perceptron, this implies every error creates an example. The offset tree reduction can be altered to take this into account by eliminating the importance weight from all updates, and updating even for exploitation examples which are not drawn uniform randomly.

- The red line is what you get with exactly the component set stated above. We were curious about the degree to which a general purpose algorithm can perform well on this application as the realizable case algorithm is definitely broken when the problem is inherently noisy. The performance is somewhat worse than Banditron. I believe this is because it explores only about 1% of the time while the Banditron plot comes from exploring about 5% of the time.
- The green line is from a component set where epoch greedy and the offset tree have been tweaked to keep track of and use the distribution over actions at every timestep. This tweaks allows the amount of exploration as measured by the sum of importance weights of training examples to almost double. As we see, this approach improves performance, almost as if we doubled the number of examples, giving it similar performance to the Banditron. The tweaks used for the component set are:
**Stochastic Epoch-Greedy**Instead of deterministically exploring every 1/(bound_gap) times, choose to explore with probability 1/(bound_gap), and pass this probability to the offset tree reduction.**Nonuniform Offset Tree**Tweak the Offset Tree in the obvious way to take into account nonuniform exploration. In particular, 1/2 is replaced with K/p(a) where p(a) is the probability of the action taken conditioned on one of two actions being taken. We set K so that this value is 1 when a nonexploit action is taken, which implies the importance weight is*p(a)/(2-p(a))*when the exploit action is taken.**Importance Weighted Perceptron**We dealt with importance weights generated by the offset tree reduction by scaling any update by the importance weight.

People may be dissatisfied with the component assembly approach to learning algorithm design, because all of the pieces are not analyzed together. For example, the Banditron paper essentially proves that certain conditions are sufficient for the Banditron algorithm to perform well. This is a more complete guarantee than “all the pieces we stuck together are known to work well when analyzed in isolation”, but this style of guarantee has limitations which are both obvious and often overlooked. These guarantees do not show *necessity* of these conditions, and hence characterize only a subset of settings where the Banditron works. Unless you are lucky enough to know that your application satisfies the sufficient conditions, you’ll basically have to try the Banditron and see if it works for any particular application. Another drawback is that the complexity of analyzing all the different pieces simultaneously means that it’s difficult to use the best elements together. This last point is what leaves me dissatisfied with the complete analysis approach—it produces algorithms which are simple to analyze rather than optimized for performance.

If you are thinking “I have a better algorithm for solving one problem”, then you’ve missed the point of this post. The point of this post is that compositional design is good for solving many learning problems. This post is about one example of that approach in action. We start by assembling a set of components which we know work well from individual component analysis. Then, we try to optimize the performance of the assembly by swapping components or improving individual components to address known properties of the problem or observed deficiencies. In this particular case, we end up with a better performing algorithm *and* better components (such as stochastic epoch greedy) which are directly reusable in other settings. The essence of this approach is understanding that there is a real vocabulary of interchangeable components and actively using it in designing learning algorithms.

I would like to thank Alina who helped substantially with this post. I would also like to thank Shai for providing data and helping setup a clean comparison and Sham for helpful discussion.

## 7/6/2008

### To Dual or Not

Yoram and Shai‘s online learning tutorial at ICML brings up a question for me, “Why use the dual?”

The basic setting is learning a weight vector *w _{i}* so that the function

*f(x)= sum*optimizes some convex loss function.

_{i}w_{i}x_{i}The functional view of the dual is that instead of (or in addition to) keeping track of *w _{i}* over the feature space, you keep track of a vector

*a*over the examples and define

_{j}*w*.

_{i}= sum_{j}a_{j}x_{ji}The above view of duality makes operating in the dual appear unnecessary, because in the end a weight vector is always used. The tutorial suggests that thinking about the dual gives a unified algorithmic font for deriving online learning algorithms. I haven’t worked with the dual representation much myself, but I have seen a few examples where it appears helpful.

**Noise**When doing online optimization (i.e. online learning where you are allowed to look at individual examples multiple times), the dual representation may be helpful in dealing with noisy labels.**Rates**One annoyance of working in the primal space with a gradient descent style algorithm is that finding the right learning rate can be a bit tricky.**Kernelization**Instead of explicitly computing the weight vector, the representation of the solution can be left in terms of*a*, which is handy when the weight vector is only implicitly defined._{i}

A substantial drawback of dual analysis is that it doesn’t seem to apply well in nonconvex situations. There is a reasonable (but not bullet-proof) argument that learning over nonconvex functions is unavoidable in solving some problems.

## 4/26/2008

### Eliminating the Birthday Paradox for Universal Features

I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text.

The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down).

A central concern for this approach is collisions, which create a loss of information. If you use *m* features in an index space of size *n* the birthday paradox suggests a collision if *m > n ^{0.5}*, essentially because there are

*m*pairs. This is pretty bad, because it says that with a vocabulary of

^{2}*10*features, you might need to have

^{5}*10*entries in your table.

^{10}It turns out that redundancy is great for dealing with collisions. Alex and I worked out a couple cases, the most extreme example of which is when you simply duplicate the base word and add a symbol before hashing, creating two entries in your weight array corresponding to the same word. We can ask: what is the probability(*) that there exists a word where both entries collide with an entry for some other word? Answer: about *4m ^{3}/n^{2}*. Plugging in numbers, we see that this implies perhaps only

*n=10*entries are required to avoid a collision. This number can be further reduced to

^{8}*10*by increasing the degree of duplication to 4 or more.

^{7}The above is an analysis of explicit duplication. In a real world dataset with naturally redundant features, you can have the same effect implicitly, allowing for tolerance of a large number of collisions.

This argument is information theoretic, so it’s possible that rates of convergence to optimal predictors are slowed by collision, even if the optimal predictor is unchanged. To think about this possibility, analysis particular to specific learning algorithms is necessary. It turns out that many learning algorithms are inherently tolerant of a small fraction of collisions, including large margin algorithms.

(*) As in almost all hash function analysis, the randomization is over the choice of (random) hash function.

## 3/23/2008

### Interactive Machine Learning

A new direction of research seems to be arising in machine learning: Interactive Machine Learning. This isn’t a familiar term, although it does include some familiar subjects.

**What is Interactive Machine Learning?** The fundamental requirement is (a) learning algorithms which interact with the world and (b) learn.

For our purposes, let’s define learning as efficiently competing with a large set of possible predictors. Examples include:

- Online learning against an adversary (Avrim’s Notes). The interaction is almost trivial: the learning algorithm makes a prediction and then receives feedback. The learning is choosing based upon the advice of many experts.
**Active Learning**. In active learning, the interaction is choosing which examples to label, and the learning is choosing from amongst a large set of hypotheses.**Contextual Bandits**. The interaction is choosing one of several actions and learning only the value of the chosen action (weaker than active learning feedback).

More forms of interaction will doubtless be noted and tackled as time progresses. I created a webpage for my own research on interactive learning which helps define the above subjects a bit more.

**What isn’t Interactive Machine Learning?**

There are several learning settings which fail either the interaction or the learning test.

- Supervised Learning doesn’t fit. The basic paradigm in supervised learning is that you ask experts to label examples, and then you learn a predictor based upon the predictions of these experts. This approach has essentially no interaction.
- Semisupervised Learning doesn’t fit. Semisupervised learning is almost the same as supervised learning, except that you also throw in many unlabeled examples.
- Bandit algorithms don’t fit. They have the interaction, but not much learning happens because the sample complexity results only allow you to choose from amongst a small set of strategies. (One exception is EXP4 (page 66), which can operate in the contextual bandit setting.)
- MDP learning doesn’t fit. The interaction is there, but the set of policies learned over is still too limited—essentially the policies just memorize what to do in each state.
- Reinforcement learning may or may not fit, depending on whether you think of it as MDP learning or in a much broader sense.

All of these not-quite-interactive-learning topics are of course very useful background information for interactive machine learning.

**Why now?** Because it’s time, of course.

- We know from other fields and various examples that interaction is very powerful.
- From online learning against an adversary, we know that independence of samples is unnecessary in an interactive setting—in fact you can even function against an adversary.
- From active learning, we know that interaction sometimes allows us to use exponentially fewer labeled samples than in supervised learning.
- From context bandits, we gain the ability to learn in settings where traditional supervised learning just doesn’t apply.
- From complexity theory we have “IP=PSPACE” roughly: interactive proofs are as powerful as polynomial space algorithms, which is a strong statement about the power of interaction.

- We know that this analysis is often tractable. For example, since Sanjoy‘s post on Active Learning, much progress has been made. Several other variations of interactive settings have been proposed and analyzed. The older online learning against an adversary work is essentially completely worked out for the simpler cases (except for computational issues).
- Real world problems are driving it. One of my favorite problems at the moment is the ad display problem—How do you learn which ad is most likely to be of interest? The contextual bandit problem is a big piece of this problem.
- It’s more fun. Interactive learning is essentially a wide-open area of research. There are plenty of kinds of natural interaction which haven’t been formalized or analyzed. This is great for beginnners, because it means the problems are simple, and their solution does not require huge prerequisites.
- It’s a step closer to AI. Many people doing machine learning want to reach AI, and it seems clear that any AI must engage in interactive learning. Mastering this problem is a next step.

**Basic Questions**

- For natural interaction form [insert yours here], how do you learn? Some of the techniques for other methods of interactive learning may be helpful.
- How do we blend interactive and noninteractive learning? In many applications, there is already a pool of supervised examples around.
- Are there general methods for reducing interactive learning problems to supervised learning problems (which we know better)?

## 12/21/2007

### Vowpal Wabbit Code Release

We are releasing the Vowpal Wabbit (Fast Online Learning) code as open source under a BSD (revised) license. This is a project at Yahoo! Research to build a useful large scale learning algorithm which Lihong Li, Alex Strehl, and I have been working on.

To appreciate the meaning of “large”, it’s useful to define “small” and “medium”. A “small” supervised learning problem is one where a human could use a labeled dataset and come up with a reasonable predictor. A “medium” supervised learning problem dataset fits into the RAM of a modern desktop computer. A “large” supervised learning problem is one which does not fit into the RAM of a normal machine. VW tackles large scale learning problems by this definition of large. I’m not aware of any other open source Machine Learning tools which can handle this scale (although they may exist). A few close ones are:

- IBM’s Parallel Machine Learning Toolbox isn’t quite open source. The approach used by this toolbox is essentially map-reduce style computation, which doesn’t seem amenable to online learning approaches. This is significant, because the fastest learning algorithms without parallelization tend to be online learning algorithms.
- Leon Bottou‘s sgd implementation first loads data into RAM, then learns. Leon’s code is a great demonstrator of how fast and effective online learning approaches (specifically stochastic gradient descent) can be. VW is about a factor of 3 faster on my desktop, and yields a lower error rate solution.

There are several other features such as feature pairing, sparse features, and namespacing that are often handy in practice.

At present, VW optimizes squared loss via gradient descent or exponentiated gradient descent over a linear representation.

This code is free to use, incorporate, and modify as per the BSD (revised) license. The project is ongoing inside of Yahoo. We will gladly incorporate significant improvements from other people, and I believe any significant improvements are of substantial research interest.

## 10/17/2007

### Online as the new adjective

Online learning is in vogue, which means we should expect to see in the near future:

- Online boosting.
- Online decision trees.
- Online SVMs. (actually, we’ve already seen)
- Online deep learning.
- Online parallel learning.
- etc…

There are three fundamental drivers of this trend.

- Increasing size of datasets makes online algorithms attractive.
- Online learning can simply be more efficient than batch learning. Here is a picture from a class on online learning:

The point of this picture is that even in 3 dimensions and even with linear constraints, finding the minima of a set in an online fashion can be typically faster than finding the minima in a batch fashion. To see this, note that there is a minimal number of gradient updates (i.e. 2) required in order to reach the minima in the typical case. Given this, it’s best to do these updates as quickly as possible, which implies doing the first update online (i.e. before seeing all the examples) is preferred. Note that this is the simplest possible setting—making the constraints nonlinear, increasing the dimensionality, introducing noise, etc… all make online learning preferred. - Online learning is conceptually harder, and hence more researchable. The essential difficulty is that there is no well defined static global optima, as exists for many other optimization-based learning algorithms. Grappling with this lack is of critical importance.