Machine Learning (Theory)


Metacademy: a package manager for knowledge

In recent years, there’s been an explosion of free educational resources that make high-level knowledge and skills accessible to an ever-wider group of people. In your own field, you probably have a good idea of where to look for the answer to any particular question. But outside your areas of expertise, sifting through textbooks, Wikipedia articles, research papers, and online lectures can be bewildering (unless you’re fortunate enough to have a knowledgeable colleague to consult). What are the key concepts in the field, how do they relate to each other, which ones should you learn, and where should you learn them?

Courses are a major vehicle for packaging educational materials for a broad audience. The trouble is that they’re typically meant to be consumed linearly, regardless of your specific background or goals. Also, unless thousands of other people have had the same background and learning goals, there may not even be a course that fits your needs. Recently, we (Roger Grosse and Colorado Reed) have been working on Metacademy, an open-source project to make the structure of a field more explicit and help students formulate personal learning plans.

Metacademy is built around an interconnected web of concepts, each one annotated with a short description, a set of learning goals, a (very rough) time estimate, and pointers to learning resources. The concepts are arranged in a prerequisite graph, which is used to generate a learning plan for a concept. In this way, Metacademy serves as a sort of “package manager for knowledge.”

Currently, most of our content is related to machine learning and probabilistic AI; for instance, here are the learning plan and graph for deep belief nets.

the learning plan for deep belief nets part of the learning graph for deep belief nets

Metacademy also has wiki-like documents called roadmaps, which briefly overview key concepts in a field and explain why you might want to learn about them; here’s one we wrote for Bayesian machine learning.

Many ingredients of Metacademy are drawn from pre-existing systems, including Khan Academy,, Connexions, and many intelligent tutoring systems. We’re not trying to be the first to do any particular thing; rather, we’re trying to build a tool that we personally wanted to exist, and we hope others will find it useful as well.

Granted, if you’re reading this blog, you probably have a decent grasp of most of the concepts we’ve annotated. So how can Metacademy help you? If you’re teaching an applied course and don’t want to re-explain Gibbs sampling, you can simply point your students to the concept on Metacademy. Or, if you’re writing a textbook or teaching a MOOC, Metacademy can help potential students find their way there. Don’t worry about self-promotion: if you’ve written something you think people will find useful, feel free to add a pointer!

We are hoping to expand the content beyond machine learning, and we welcome contributions. You can create a roadmap to help people find their way around a field. We are currently working on a GUI for editing the concepts and the graph connecting them (our current system is based on Github pull requests), and we’ll send an email to our registered users once this system is online. If you find Metacademy useful or want to contribute, let us know at feedback _at_ metacademy _dot_ org.


The Large Scale Learning class notes

The large scale machine learning class I taught with Yann LeCun has finished. As I expected, it took quite a bit of time :-). We had about 25 people attending in person on average and 400 regularly watching the recorded lectures which is substantially more sustained interest than I expected for an advanced ML class. We also had some fun with class projects—I’m hopeful that several will eventually turn into papers.

I expect there are a number of professors interested in lecturing on this and related topics. Everyone will have their personal taste in subjects of course, but hopefully there will be some convergence to common course materials as well. To help with this, I am making the sources to my presentations available. Feel free to use/improve/embelish/ridicule/etc… in the pursuit of the perfect course.


Remote large scale learning class participation

Tags: Machine Learning,Online,Teaching jl@ 9:40 pm

Yann and I have arranged so that people who are interested in our large scale machine learning class and not able to attend in person can follow along via two methods.

  1. Videos will be posted with about a 1 day delay on techtalks. This is a side-by-side capture of video+slides from Weyond.
  2. We are experimenting with Piazza as a discussion forum. Anyone is welcome to subscribe to Piazza and ask questions there, where I will be monitoring things. update2: Sign up here.

The first lecture is up now, including the revised version of the slides which fixes a few typos and rounds out references.


NYU Large Scale Machine Learning Class

Yann LeCun and I are coteaching a class on Large Scale Machine Learning starting late January at NYU. This class will cover many tricks to get machine learning working well on datasets with many features, examples, and classes, along with several elements of deep learning and support systems enabling the previous.

This is not a beginning class—you really need to have taken a basic machine learning class previously to follow along. Students will be able to run and experiment with large scale learning algorithms since Yahoo! has donated servers which are being configured into a small scale Hadoop cluster. We are planning to cover the frontier of research in scalable learning algorithms, so good class projects could easily lead to papers.

For me, this is a chance to teach on many topics of past research. In general, it seems like researchers should engage in at least occasional teaching of research, both as a proof of teachability and to see their own research through that lens. More generally, I expect there is quite a bit of interest: figuring out how to use data to make predictions well is a topic of growing interest to many fields. In 2007, this was true, and demand is much stronger now. Yann and I also come from quite different viewpoints, so I’m looking forward to learning from him as well.

We plan to videotape lectures and put them (as well as slides) online, but this is not a MOOC in the sense of online grading and class certificates. I’d prefer that it was, but there are two obstacles: NYU is still figuring out what to do as a University here, and this is not a class that has ever been taught before. Turning previous tutorials and class fragments into coherent subject matter for the 50 students we can support at NYU will be pretty challenging as is. My preference, however, is to enable external participation where it’s easily possible.

Suggestions or thoughts on the class are welcome :-)


Patterns for research in machine learning

There are a handful of basic code patterns that I wish I was more aware of when I started research in machine learning. Each on its own may seem pointless, but collectively they go a long way towards making the typical research workflow more efficient. Here they are:

  1. Separate code from data.
  2. Separate input data, working data and output data.
  3. Save everything to disk frequently.
  4. Separate options from parameters.
  5. Do not use global variables.
  6. Record the options used to generate each run of the algorithm.
  7. Make it easy to sweep options.
  8. Make it easy to execute only portions of the code.
  9. Use checkpointing.
  10. Write demos and tests.

Click here for discussion and examples for each item. Also see Charles Sutton’s and HackerNews’ thoughts on the same topic.

My guess is that these patterns will not only be useful for machine learning, but also any other computational work that involves either a) processing large amounts of data, or b) algorithms that take a significant amount of time to execute. Share this list with your students and colleagues. Trust me, they’ll appreciate it.



Yaser points out some nicely videotaped machine learning lectures at Caltech. Yaser taught me machine learning, and I always found the lectures clear and interesting, so I expect many people can benefit from watching. Relative to Andrew Ng‘s ML class there are somewhat different areas of emphasis but the topic is the same, so picking and choosing the union may be helpful.


Somebody’s Eating Your Lunch

Tags: CS,Online,Teaching jl@ 1:36 pm

Since we last discussed the other online learning, Stanford has very visibly started pushing mass teaching in AI, Machine Learning, and Databases. In retrospect, it’s not too surprising that the next step up in serious online teaching experiments are occurring at the computer science department of a university embedded in the land of startups. Numbers on the order of 100000 are quite significant—similar in scale to the number of computer science undergraduate students/year in the US. Although these populations surely differ, the fact that they could overlap is worth considering for the future.

It’s too soon to say how successful these classes will be and there are many easy criticisms to make:

  1. Registration != Learning … but if only 1/10th complete these classes, the scale of teaching still surpasses the scale of any traditional process.
  2. 1st year excitement != nth year routine … but if only 1/10th take future classes, the scale of teaching still surpasses the scale of any traditional process.
  3. Hello, cheating … but teaching is much harder than testing in general, and we already have recognized systems for mass testing.
  4. Online misses out … sure, but for students not enrolled in a high quality university program, this is simply not a relevant comparison. There are also benefits to being online as well, as your time might be better focused. Anecdotally, at Caltech, they let us take two classes at the same time, which I did a few times. Typically, I had a better grade in the class that I skipped as the instructor had to go through things rather slowly.
  5. Where’s the beef? The hard nosed will want to know how to make money, which is always a concern. But, a decent expectation is that if you first figure out how to create value, you’ll find some way to make money. And, if you first wait until it’s clear how to make money, you won’t make any.

My belief is that this project will pan out, with allowances for the expected inevitable adjustments that you learn to make from experience. I think the critics miss an understanding of what’s possible and what motivates people.

The prospect of teaching 1 student means you might review some notes. The prospect of teaching ~10 students means you prepare some slides. The prospect of teaching ~100 students means you polish your slides well, trying to anticipate questions, and hopefully drawing on experience from previous presentations. I’ve never directly taught ~1000 students, but at that scale you must try very hard to make the presentation perfect, including serious testing with dry runs. 105 students must make getting out of bed in the morning quite easy.

Stanford has a significant first-mover advantage amongst top research universities, but it’s easy to imagine a few other (but not many) universities operating at a similar scale. Those that have the foresight to start a serious online teaching program soon will have a chance of being among the few. For other research universities, we can expect boutique traditional classes to continue for some time. These boutique classes may have some significant social value, because it’s easy to imagine that the few megaclasses miss important things in developing research areas. And for everyone working at teaching universities, someone is eating your lunch.

(Cross posted at CACM.)


The Large Scale Learning Survey Tutorial

Ron Bekkerman initiated an effort to create an edited book on parallel machine learning that Misha and I have been helping with. The breadth of efforts to parallelize machine learning surprised me: I was only aware of a small fraction initially.

This put us in a unique position, with knowledge of a wide array of different efforts, so it is natural to put together a survey tutorial on the subject of parallel learning for KDD, tomorrow. This tutorial is not limited to the book itself however, as several interesting new algorithms have come out since we started inviting chapters.

This tutorial should interest anyone trying to use machine learning on significant quantities of data, anyone interested in developing algorithms for such, and of course who has bragging rights to the fastest learning algorithm on planet earth :-)

(Also note the Modeling with Hadoop tutorial just before ours which deals with one way of trying to speed up learning algorithms. We have almost no overlap.)


The Ideal Large Scale Learning Class

Tags: Machine Learning,Teaching jl@ 4:24 pm

At NIPS, Andrew Ng asked me what should be in a large scale learning class. After some discussion with him and Nando and mulling it over a bit, these are the topics that I think should be covered.

There are many different kinds of scaling.

  1. Scaling in examples This is the most basic kind of scaling.
    1. Online Gradient Descent This is an old algorithm—I’m not sure if anyone can be credited with it in particular. Perhaps the Perceptron is a good precursor, but substantial improvements come from the notion of a loss function of which squared loss, logistic loss, Hinge Loss, and Quantile Loss are all worth covering. It’s important to cover the semantics of these loss functions as well. Vowpal Wabbit is a reasonably fast codebase implementing these.
    2. Second Order Gradient Descent methods For some problems, methods taking into account second derivative information can be more effective. I’ve seen preconditioned conjugate gradient work well, for which Jonathan Shewchuck‘s writeup is reasonable. Nando likes L-BFGS which I don’t have much experience with.
    3. Map-Reduce I have a love-hate relationship with the Map-Reduce framework. In my experience, it’s an excellent filesystem, but it’s quite frustrating to do machine learning with, since it encourages the parallelization of slow learning algorithms. I liked what Markus said at the LCCC workshop: nobody wants to give up on the idea of distributed fault tolerant storage and moving small amounts of code to large amounts of data rather than vice-versa. The best way to use this for Machine Learning isn’t yet clear to me. Hadoop is probably the most commonly used open source implementation of Map-Reduce.
  2. Feature Scaling—what do you do when you have very many features?
    1. Hashing approaches are surprisingly effective in my experience. It’s a good idea to also present Bloom Filters, as they help with the intuition of where this works substantially.
    2. Online l1 regularization is via truncated gradient. See Bob Carpenter’s discussion. John Duchi’s composite mirror descent generalization is also a useful general treatment.
    3. Boosting based approaches can also be effective, although training time can become problematic. This is partially mitigated by parallelization algorithms as discussed at the LCCC workshop See Jerry Ye’s talk and Krysta’s talk.. A really good public implementation of this is so far missing, as far as I know.
  3. Test-time Evaluation Ultrafast and efficient test-time evaluation seems to be a goal independent of training.
    1. Indicies One way to speed things up is with inverted indicies. Aside from the basic datastructure, I’d cover WAND and predictive indexing.
    2. GPU The use of GPU’s to make evaluation both more efficient and fast seems to make sense in many applications.
  4. Labels
    1. Sourcing Just acquiring sufficient label information can be problematic.

      1. Mechanical Turk can be an efficient approach. The basic approach can be improved with some work.
      2. When you are paying directly for labels, active learning approaches can substantially cut your costs. Burr Settles active learning survey is pretty comprehensive, although if I was to cover one algorithm, it would be this one which enjoys a compelling combination of strong theoretical guarantees, computational tractability, empirical performance, and generality.
      3. The other common approach is user-feedback information where bias and exploration effects becomes a critical concern. The tutorial Alina and I did on learning and exploration is critical here.
    2. Many Labels It’s common to need to make a complex prediction.
      1. Label Tree based approaches are a good starting point. I’d discuss the inconsistency of the naive approach and the Filter Tree, discussed here. Online tree building and conditional probability trees are also potentially extremely useful. Building smarter trees can help, such as with a confusion matrix or in an iterative fashion.
      2. Label Tree approaches breakdown when the number of labels becomes so large that filtering eliminates too many examples. Here Structured Prediction techniques become particularly important. I’d cover Searn as well as some of Drew Bagnell‘s work such as this one. Many other people are still interested in CRFs or Max-Margin Markov Networks which I find somewhat less compelling for computational reasons.
      3. Cascade Learning is also a compelling approach. The canonical paper on this is the Viola-Jones Face Detector. I’m sure there’s much other vision-related work on cascades that I’m unfamiliar. A more recent instance is the structured prediction cascade.

What else is essential and missing?


The Other Online Learning

Tags: Machine Learning,Online,Teaching jl@ 3:07 pm

If you search for “online learning” with any major search engine, it’s interesting to note that zero of the results are for online machine learning. This may not be a mistake if you are committed to a global ordering. In other words, the number of people specifically interested in the least interesting top-10 online human learning result might exceed the number of people interested in online machine learning, even given the presence of the other 9 results. The essential observation here is that the process of human learning is a big business (around 5% of GDP) effecting virtually everyone.

The internet is changing this dramatically, by altering the economics of teaching. Consider two possibilities:

  1. The classroom-style teaching environment continues as is, with many teachers for the same subject.
  2. All the teachers for one subject get together, along with perhaps a factor of 2 more people who are experts in online delivery. They spend a factor of 4 more time designing the perfect lecture & learning environment as verified by extensive study.

These two approaches have a similar economic cost, with the additional effort in the second approach being offset by the fact that it is a one-time effort rather than an annual effort.

I’m sure many people prefer the classroom approach, because it’s traditional, because a teacher can adjust dynamically and intelligently to the student, and because a teacher provides ancillary benefits such as day care and child abuse detection. Nevertheless, the second approach represents a compelling alternative addressing education. For classes commonly taught through high school, it’s difficult to imagine how good a learning experience could be after millions of hours spent refining to create the perfect approach. Imagine repeating a lecture over-and-over, testing the resulting student understanding a {day, week, month, year, decade} later to such an extent that every slide, every sentence, and every exercise is optimized for excellent learning. We could even imagine adapting the lecture to the learning style of each student.

The process of converting to the second approach has been slow, but it seems to be picking up. This suggests we can expect several things:

  1. Shakeout Like all new approaches, there is room for early adopters to win while the established old order suffers. We can expect the most severe impact on pure teaching institutions which do not adopt the newer approaches. Research universities will be insulated in two ways: much of their revenue comes from research grants anyways while the new approach creates a flight to excellence, which the research universities can lay some claim to. At one extreme, I understand that only 4-5% of the operating budget for Caltech comes from student tuition.
  2. Centralized Testing. Although class lessons can be taught at a distance, and exercises worked out by students, there is great room for cheating. The remedy for this is a strong centralized testing service. This already exists in the form of SAT, GRE, and AP tests, because grade inflation and nonuniform standards are common across schools. If a student can ace these tests after taking online learning classes, then there is a real sense in which colleges accepting students are satisfied by their qualifications. We can expect this to become more true, and perhaps to see more employer-oriented tests. We can also expect that testable subjects have an inherent advantage in online learning. As centralized testing is a difficult market to break into, the existing systems have a substantial advantage here.
  3. Digitization. Doing online learning brings all the advantages and disadvantages of any other digital media. These include perfect replicability, essentially free distribution, and difficult economics—on one hand the approach could be vastly valuable while on the other it’s difficult to charge someone for something they can get free. The economics imply that there is room for a major charity or state government to accomplish a great deal which might be difficult to accomplish in a business model.
  4. Gaps. There are areas of teaching which are not amenable to online instruction. For example, teaching people to do research remains in the apprentice model. Similarly, letters of recommendation remain an aspect of the apprentice model. Subjects of relatively small interest such as individual research directions may not merit the effort of a highly polished online instruction system. Similarly, many elements of our current education system are not related to formal education, but rather are about students meeting students, teachers acting as daycare for students, or simply structuring the day for learning. Mechanisms achieving the same ends with online human learning systems are necessary, and the conflation of goals represented by the traditional education approach will retard (but not stop) the adoption of online learning approaches. This process has already taken a decade, and we can expect more decades to come.

For those of us interested in online machine learning, it’s natural to question the relationship with online human learning. The practices differ entirely, but the theory still applies, as there are no clauses in the theorem statements of the form “if the learning agent is not a human then…” When you examine the theorem statements for applicability to online human learning, there are a few ideas which may transfer well. One of these is the necessity and techniques for handling exploration problems. If there are two ways to teach a subject, then you could simply try both and take the best. But if your resources are limited then a UCB approach provides a more efficient mechanism for doing this testing. Similarly, if a student has a set of known attributes, contextual-bandit approaches suggest a sound mechanism for personalization of lessons.

Much of our other theory about the process of online learning may be helpful in a heuristic-motivating manner, but it appears typically too pessimistic to accurately capture what is possible. For example, a common technique to explain an idea when teaching is to simply cover a few extreme cases from which all others are some interpolation. The closest common machine learning analogue to this is some active learning algorithms, where a learning algorithm chooses which examples to label. But, of course, this is not an accurate model, because it’s not the student, but rather the teacher which is choosing the examples. A setting more suitable for student and teacher has been studied in learning theory (see the bibliography here for a link into the citation tree). However, these results are typically rather brittle, so it’s not clear yet that we have understood the right way to formalize this process.


Many ways to Learn this summer

There are at least 3 summer schools related to machine learning this summer.

  1. The first is at University of Chicago June 1-11 organized by Misha Belkin, Partha Niyogi, and Steve Smale. Registration is closed for this one, meaning they met their capacity limit. The format is essentially an extended Tutorial/Workshop. I was particularly interested to see Valiant amongst the speakers. I’m also presenting Saturday June 6, on logarithmic time prediction.
  2. Praveen Srinivasan points out the second at Peking University in Beijing, China, July 20-27. This one differs substantially, as it is about vision, machine learning, and their intersection. The deadline for applications is June 10 or 15. This is also another example of the growth of research in China, with active support from NSF.
  3. The third one is at Cambridge, England, August 29-September 10. It’s in the MLSS series. Compared to the Chicago one, this one is more about the Bayesian side of ML, although effort has been made to create a good cross section of topics. It’s also more focused on tutorials over workshop-style talks.


Wielding a New Abstraction

This post is partly meant as an advertisement for the reductions tutorial Alina, Bianca, and I are planning to do at ICML. Please come, if you are interested.

Many research programs can be thought of as finding and building new useful abstractions. The running example I’ll use is learning reductions where I have experience. The basic abstraction here is that we can build a learning algorithm capable of solving classification problems up to a small expected regret. This is used repeatedly to solve more complex problems.

In working on a new abstraction, I think you typically run into many substantial problems of understanding, which make publishing particularly difficult.

  1. It is difficult to seriously discuss the reason behind or mechanism for abstraction in a conference paper with small page limits. People rarely see such discussions and hence have little basis on which to think about new abstractions. Another difficulty is that when building an abstraction, you often don’t know the right way to state things.

    Here’s my current attempt: The process of abstraction for learning reductions can start with sample complexity bounds (or online learning against an adversary analysis). A very simple sample complexity bound is that for all sets of hypotheses H, for all distributions D on examples (x,y), and for all confidence parameters d

    Pr(x,y)m~Dm(for all h in H: |e(h,D)-e(h,(x,y)m)| < (ln( |H|/ d )/m)0.5 ) > 1 – d

    Here (x,y)m is a sequence of m IID samples, e(h,D) is the error rate of h on D and e(h,(x,y)m) is the empirical error rate of h on the set of IID samples.

    The previous bound is a very simple example, and yet remarkably complex both to state and to interpret—many people have been lost by the meaning of d. The impact of this complexity is that it is difficult to effectively use these bounds in practical learning algorithm design, particularly in solving more complex learning problems where much more than one bit of prediction is required. This was a central frustration that I ran into in my thesis work. Some progress has been made since then, but it is still quite difficult. The abstraction in the learning reduction setting is:

    1. You throw away d, because it only has a logarithmic dependence anyways.
    2. You eliminate H and m on the theory that intelligent choices for H and m are made in practice.
    3. You eliminate the IID assumption, because it is no longer needed to define things

    The statement then is

    e(A((x,y)m),D)-e(h*,D) < eps

    where A() is the hypothesis output by the learning algorithm, h* is the best possible predictor, and eps is used to parameterize the theorems. This abstraction is radical in some sense, but something radical was needed to yield tractable and useful analysis on the complex problems people need to solve in practice.

  2. A consequence of lack of familiarity, is that people often misread. In reading a paper, there is a temptation to not read carefully and fill in your understanding of things. Most of the time this works out well, but not here. For example, we saw many instances where people inserted IID sample assumptions or other things that simply weren’t there.
  3. Once you get past the lack of familiarity and misunderstandings, there is a feeling that the new abstraction is cheating. To some extent I understand, as I remember learning about abstractions in class, and I remember feeling that they were in some real sense cheating by dropping important details. For example:
    1. Big-O notation provides an upper bound specified up to constants. For example O(log n) computational complexity means there exists a constant c such that the number of operations requires is less than c log n. Big-O can be abused by hiding “constants” larger than the plausible values for the parameters. In machine learning, a particularly egregious case occurs in Bandit analysis where the punchline of some papers is “logarithmic regret”, hiding an arbitrarily large problem dependent constant.
    2. TCP provides a mechanism for reliable transport over an unreliable network. It is a very commonly used mechanism for sending information over the internet—you used TCP in reading this. TCP is both a programming construct and a mechanism for abstracting communicating over a network. The TCP abstraction is broken when the network is too unreliable for it to recover, such as on sketchy wireless networks where the programmer built for the TCP abstraction which wasn’t delivered.
    3. Dimensional analysis is a technique for quick analysis in physics. The basic idea is to just look at the units when estimating some quantity and combine them to get the right unit answer. For example, to compute the distance d traveled after time t with acceleration a, you simply use at2, since that formula is the only way to combine a with units of distance/time2 and t with units of time to get units of distance. This answer is off by a factor of 2 from what a more detailed analysis using integration yields, which is typical. Dimensional analysis can be misleading when the constants are very large. One example is in Gravitation where there is a table with time and distance equated since they are related by a constant—the speed of light 3*108 m/s. For example, E=mc2 becomes E=m.

    Although the above breakages are real, the usefulness of these abstractions, in terms of allowing us to quickly think about and make decisions more than offsets the drawbacks. Indeed, even the breakages stated above are thought provoking or useful enough that I can’t even say it is wrong to consider them. This property that abstractions can be abused is generically essential to the process of abstraction itself. Abstraction is about neglecting details, and when these details are not neglectable, the abstraction is abused or ineffective. Because of this, any abstraction is insufficient for analyzing and solving real problems where the neglected details matter.

    Just as for these abstractions, the learning reduction abstraction can be abused—the chosen learning algorithm can be pathetic yielding vacuous bounds, or the reduction can scramble the feature information with an encryption algorithm making it so no reasonable learning algorithm could yield other than pathetic performance. Similarly, there are situations in which I don’t know how to effectively use a learning reduction to build a learning algorithm, and it seems implausible that observation changes as more is learned in the future.

For a good abstraction, the drawbacks are matched by the advantages. The principle advantage is that there is a new way to examine and solve problems. This has several interesting effects.

  1. A good abstraction can capture a more complete specification of the problem. As an example, the sample complexity view of learning is broken in practice, because when insufficient performance is achieved people choose a different set of hypotheses H by throwing in additional features or choosing a different learning algorithm. Capturing this process in the sample complexity view requires an additional level of complexity. In the reduction view, this is entirely natural, because any means for achieving a better generalization—more/better features, a better learning algorithm, a better prior, sticking a human in the learning process, etc… are legitimate. This is particularly powerful when architecting solutions, providing a partial answer to the “What?” question Yehuda pointed out.
  2. A higher level abstraction can let you accidentally solve problems in other areas as well. A good example of this is error correcting tournaments which are useful for tournament design to select the best player/team/paper in real tournaments. Recently, I was amused to learn that a standard betting procedure for basketball tournaments exactly mirrors the importance weights suggested for the final elimination of ECTs. The first phase of ECTs provides a sound and practical method to seed a final elimination tournament, eliminating the need for (and biases of) a committee.
  3. Perhaps the most interesting effect is that the new abstraction can aid you in finding effective solutions to new problems. For learning reductions, there are about 3 compelling instances I’ve seen so far.
    1. Given training-time access to a good policy oracle, Searn provides a method for decomposing any complex prediction problem into simple problems, such that low regret solutions to the simple problems imply a low regret solution to the original problem. While Searn competes well (computationally and prediction-wise) with existing methods for linear chain style structured prediction, it really shines on more complex problems. Hal used Searn for automatic document summarization (see section 6.2) which previously wasn’t really solved via ML. More generally, when I learn about the details of other complex prediction systems for machine translation or vision, the base algorithms are tweaked, typically in ways that Searn would suggest. This suggests that Searn formalizes and automates the intuitions of practical people.
    2. The “one step RL” reduction in Bianca‘s thesis (page 119) provided tractable and effective approaches to learning in partial feedback problems where only the loss of a chosen label is learned. An even simpler reduction exists as a matter of folklore—estimate the the value of each label and then take an argmax. However, we have found classification approaches generally work better, where applicable, and as the theory suggests.
    3. Many commonly used algorithms for prediction have a running time linear (or worse) in the number of labels with decision trees a good exception. While simply predicting faster isn’t normally solving a “new problem”, an exponential improvement in computational time seems to merit this description because it allows entirely new kinds of applications. It turns out that it is both very easy to do logarithmic time prediction wrong, and that this problem is often fixable. Furthermore, it appears logarthmic time prediction can really work in practice over very many labels.

When we started working on learning reductions, I had no idea what either the difficulties or rewards were going to be—it simply seemed like a natural and compelling direction of investigation. Given the substantial difficulties encountered, it’s not at all clear that this pursuit was personally worthwhile. It has cost much time which could have been put to good use in other ways.

On the other hand, the advantages are also substantial. I’ve learned something about architecting solutions to problems, both expanding the domain of application for the field and providing a personal edge that I can bring to many conversations about ML. It’s also progress towards the AI goal, which interests me. When I think of what I could have worked on instead to achieve these goals, I don’t have any more compelling answer yet. Learning reductions seem to have accomplished more per unit thought than any other theoretical approach I can identify over the last 5 or 6 years. Furthermore, they are composable by design, so they should stay relevant (and perhaps even become more so), when people use an online active deep semisupervised probabilistic convolutional algorithm to solve a problem, particularly for complex problems.

As I said at the beginning, please join us for the tutorial, if you are interested.


MLSS 2008

… is in Kioloa, Australia from March 3 to March 14. It’s a great chance to learn something about Machine Learning and I’ve enjoyed several previous Machine Learning Summer Schools.

The website has many more details, but registration is open now for the first 80 to sign up.


The Forgetting

How many papers do you remember from 2006? 2005? 2002? 1997? 1987? 1967? One way to judge this would be to look at the citations of the papers you write—how many came from which year? For myself, the answers on recent papers are:

year 2006 2005 2002 1997 1987 1967
count 4 10 5 1 0 0

This spectrum is fairly typical of papers in general. There are many reasons that citations are focused on recent papers.

  1. The number of papers being published continues to grow. This is not a very significant effect, because the rate of publication has not grown nearly as fast.
  2. Dead men don’t reject your papers for not citing them. This reason seems lame, because it’s a distortion from the ideal of science. Nevertheless, it must be stated because the effect can be significant.
  3. In 1997, I started as a PhD student. Naturally, papers after 1997 are better remembered because they were absorbed in real time. A large fraction of people writing papers and attending conferences haven’t been doing it for 10 years.
  4. Old papers aren’t on the internet. This is huge effect for any papers prior to 1995 (or so). The ease of examining a paper greatly influences the ability of an author to read and understand it. There are a number of journals which essentially have “internet access for the privileged elite who are willing to pay”. In my experience, this is only marginally better than having them stuck in the library.
  5. The recent past is more relevant to the present than the far past. There is a lot of truth in this—people discover and promote various problems or techniques which take off for awhile, until their turn to be forgotten arrives.

Should we be disturbed by this forgetting? There are a few good effects. For example, when people forget, they reinvent, and sometimes they reinvent better. Nevertheless, it seems like the effect of forgetting is bad overall, because it causes wasted effort. There are two implications:

  1. For paper writers, it is very common to overestimate the value of a paper, even though we know that the impact of most papers is bounded in time. Perhaps by looking at those older papers, we can get an idea of what is important in the long term. For example, looking at my own older citations, simplicity is it. If you want a paper to have a long term impact, it needs to have a simple algorithm, analysis method, or setting. Fundamentally, only those things which are teachable survive. Was your last paper simple? Could you teach it in a class? Are other people going to start doing so? Are the review criteria promoting the papers which a hope of survival?
  2. For conference organizers, it’s important to understand the way science has changed. Originally, you had to be a giant to succeed at science. Then, you merely had to stand on the shoulders of giants to succeed. Now, it seems that even the ability to peer over the shoulders of people standing on the shoulders of giants might be helpful. This is generally a good thing, because it means more people can help on a very hard task. Nevertheless, it seems that much of this effort is getting wasted in forgetting, because we do not have the right mechanisms to remember the information. Which is going to be the first conference to switch away from an ordered list of papers to something with structure? Wouldn’t it be great if all the content at a conference was organized in a wikipedia-like easy-for-outsiders-to-understand style?


Presentation of Proofs is Hard.

Tags: Research,Teaching jl@ 7:38 pm

When presenting part of the Reinforcement Learning theory tutorial at ICML 2006, I was forcibly reminded of this.

There are several difficulties.

  1. When creating the presentation, the correct level of detail is tricky. With too much detail, the proof takes too much time and people may be lost to boredom. With too little detail, the steps of the proof involve too-great a jump. This is very difficult to judge.

    1. What may be an easy step in the careful thought of a quiet room is not so easy when you are occupied by the process of presentation.
    2. What may be easy after having gone over this (and other) proofs is not so easy to follow in the first pass by a viewer.

    These problems seem only correctable by process of repeated test-and-revise.

  2. When presenting the proof, simply speaking with sufficient precision is substantially harder than in normal conversation (where precision is not so critical). Practice can help here.
  3. When presenting the proof, going at the right pace for understanding is difficult. When we use a blackboard/whiteboard, a natural reasonable pace is imposed by the process of writing. Unfortunately, writing doesn’t scale well to large audiences for vision reasons, losing this natural pacing mechanism.
  4. It is difficult to entertain with a proof—there is nothing particularly funny about it. This particularly matters for a large audience which tends to naturally develop an expectation of being entertained.

Given all these difficulties, it is very tempting to avoid presenting proofs. Avoiding the proof in any serious detail is fairly reasonable in a conference presentation—the time is too short and the people viewing are too heavily overloaded to follow the logic well. The “right” level of detail is often the theorem statement.

Nevertheless, avoidance is not always possible because the proof is one of the more powerful mechanisms we have for doing research.

Powered by WordPress