Machine Learning (Theory)

1/24/2010

Specializations of the Master Problem

One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as:

  1. The world announces an observation x.
  2. The agent makes a choice a.
  3. The world announces a reward r.

The goal here is to maximize the sum of the rewards over the time of the agent. No particular structure relating x to a or a to r is implied by this setting so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—How can we find tractable specializations of the master problem general enough to provide an effective solution in real problems?

The process of specializing is a tricky business, as you want to simultaneously achieve tractable analysis, sufficient generality to be useful, and yet capture a new aspect of the master problem not otherwise addressed. Consider: How is it even possible to choose a setting where analysis is tractable before you even try to analyze it? What follows is my mental map of different specializations.

Online Learning

The online learning setting is perhaps the most satisfying specialization more general than standard batch learning at present, because it turns out to additionally provide tractable algorithms for many batch learning settings.

Standard online learning models specialize in two ways: You assume that the choice of action in step 2 does not influence future observations and rewards, and you assume additional information is available in step 3, a retrospectively available reward for each action. The algorithm for an agent in this setting typically has a given name—gradient descent, weighted majority, Winnow, etc…

The general algorithm here is a more refined version of follow-the-leader than in batch learning, with online update rules. An awesome discovery about this setting is that it’s possible to compete with a set of predictors even when the world is totally adversarial, substantially strengthening our understanding of what learning is and where it might be useful. For this adversarial setting, the algorithm alters into a form of follow-the-perturbed leader, where the learning algorithm randomizes it’s action amongst the set of plausible alternatives in order to defeat an adversary.

The standard form of argument in this setting is a potential argument, where at each step you show that if the learning algorithm performs badly, there is some finite budget from which an adversary deducts it’s ability. The form of the final theorem is that you compete with the accumulated reward of a set any one-step policies h:X – > A, with a dependence log(#policies) or weaker in regret, a measure of failure to compete.

A good basic paper to read here is:
Nick Littlestone and Manfred Warmuth, The Weighted Majority Algorithm, which shows the basic information-theoretic claim clearly. Vovk’s page on aggregating algorithms is also relevant, although somewhat harder to read.

Provably computationally tractable special cases all have linear structure, either on rewards or policies. Good results are often observed empirically by applying backpropagation for nonlinear architectures, with the danger of local minima understood.

Bandit Analysis

In the bandit setting, step 1 is omitted, and the difficulty of the problem is weakened by assuming that action in step (2) don’t alter future rewards. The goal is generally to compete with all constant arm strategies.

Analysis in this basic setting started very specialized with Gittin’s Indicies and gradually generalized over time to include IID and fully adversarial settings, with EXP3 a canonical algorithm. If there are k strategies available, the standard theorem states that you can compete with the set of all constant strategies up to regret k. The most impressive theoretical discovery in this setting is that the dependence on T, the number of timesteps, is not substantially worse than supervised learning despite the need to explore.

Given the dependence on k all of these algorithms are computationally tractable.

However, the setting is flawed, because the set of constant strategies is inevitably too weak in practice—it’s an example of optimal decision making given that you ignore almost all information. Adding back the observation in step 1 allows competing with a large set of policies, while the regret grows only as log(#policies) or weaker. Canonical algorithms here are EXP4 (computationally intractable, but information theoretically near-optimal), Epoch-Greedy (computationally tractable given an oracle optimizer), and the Offset Tree providing a reduction to supervised binary classification.

MDP analysis

A substantial fraction of reinforcement learning has specialized on the Markov Decision Process setting, where the observation x is a state s, which is a sufficient statistic for predicting all future observations. Compared to the previous settings, dealing with time dependence is explicitly required, but learning typically exists in only primitive forms.

The first work here was in the 1950’s where the actual MDP was assumed known and the problem was simply computing a good policy, typically via dynamic programming style solutions. More recently, principally in the 1990’s, the setting where the MDP was not assumed known was analyzed. A very substantial theoretical advancement was the E3 algorithm which requires only O(S2A) experience to learn a near-optimal policy where the world is an MDP with S state and A actions per state. A further improvement on this is Delayed Q-Learning, where only O(SA) experience is required. There are many variants on the model-based approach and not much for the model-free approach. Lihong Li’s thesis probably has the best detailed discussion at present.

There are some unsatisfactory elements of the analysis here. First, I’ve suppressed the dependence on the definition of “approximate” and the typical time horizon, for which the dependence is often bad and the optimality is unclear. The second is the dependence on S, which is intuitively unremovable, with this observation formalized in the lower bound Sham and I worked on (section 8.6 of Sham’s thesis). Empirically, these and related algorithms are often finicky, because in practice the observation isn’t a sufficient statistic and the number of states isn’t small, so approximating things as such is often troublesome.

A very different variant of this setting is given by Control theory, which I know less about than I should. The canonical setting for control theory is with a known MDP having linear transition dynamics. More exciting are the system identification problems where the system must be first identified. I don’t know any good relatively assumption free results for this setting.

Oracle Advice Shortcuts

Techniques here specialize the setting to situations in which some form of oracle advice is available when a policy is being learned. A good example of this is an oracle which provides samples from the distribution of observations visited by a good policy. Using this oracle, conservative policy iteration is guaranteed to perform well, so long as a base learning algorithm can predict well. This algorithm was refined and improved a bit by PSDP, which works via dynamic programming, improving guarantees to work with regret rather than errors.

An alternative form of oracle is provide by access to a good policy at training time. In this setting, Searn has similar provable guarantees with a similar analysis.

The oracle based algorithms appear to work well anywhere these oracles are available.

Uncontrolled Delay

In the uncontrolled delay setting, step (2) is removed, and typically steps (1) and (3) are collapsed into one observation, where the goal becomes state tracking. Most of the algorithms for state tracking are heavily model dependent, implying good success within particular domains. Examples include Kalman filters, hidden markov models, and particle filters which typical operate according to an explicit probabilistic model of world dynamics.

Relatively little is known for a nonparametric version of this problem. One observation is that the process of predicting adjacent observations well forms states as a byproduct when the observations are sufficiently rich as detailed here.

A basic question is: What’s missing from the above? A good answer is worth a career.

1/19/2010

Deadline Season, 2010

Tags: Machine Learning jl@ 5:37 pm

Many conference deadlines are coming soon.

Deadline Double Blind / Author Feedback Time/Place
ICML January 18((workshops) / February 1 (Papers) / February 13 (Tutorials) Y/Y Haifa, Israel, June 21-25
KDD February 1(Workshops) / February 2&5 (Papers) / February 26 (Tutorials & Panels)) / April 17 (Demos) N/S Washington DC, July 25-28
COLT January 18 (Workshops) / February 19 (Papers) N/S Haifa, Israel, June 25-29
UAI March 11 (Papers) N?/Y Catalina Island, California, July 8-11

ICML continues to experiment with the reviewing process, although perhaps less so than last year.

The S “sort-of” for COLT is because author feedback occurs only after decisions are made.

KDD is notable for being the most comprehensive in terms of {Tutorials, Workshops, Challenges, Panels, Papers (two tracks), Demos}. The S for KDD is because there is sometimes author feedback at the decision of the SPC.

The (past) January 18 deadline for workshops at ICML is nominal, as I (as workshop chair) almost missed it myself and we have space for a few more workshops. If anyone is thinking “oops, I missed the deadline”, send in your proposal by Friday the 22nd.

This year, I’m an area chair for ICML and on the SPC for KDD. I hope to see interesting papers on plausibly useful learning theory (broadly interpreted) at each conference, as I did last year.

1/13/2010

Sam Roweis died

Tags: Announcements, Machine Learning jl@ 7:02 pm

and I can’t help but remember him.

I first met Sam as an undergraduate at Caltech where he was TA for Hopfield’s class, and again when I visited Gatsby, when he invited me to visit Toronto, and at too many conferences to recount. His personality was a combination of enthusiastic and thoughtful, with a great ability to phrase a problem so it’s solution must be understood. With respect to my own work, Sam was the one who advised me to make my first tutorial, leading to others, and to other things, all of which I’m grateful to him for. In fact, my every interaction with Sam was positive, and that was his way.

His death is being called a suicide which is so incompatible with my understanding of Sam that it strains my credibility. But we know that his many responsibilities were great, and it is well understood that basically all sane researchers have legions of inner doubts. Having been depressed now and then myself, it’s helpful to understand at least intellectually that the true darkness of the now is overestimated, and that you have more friends than you think. Sam was one of mine, and I’ll miss him.

My last interaction with Sam, last week, was discussing a new research direction that interested him, optimizing the cost of acquiring feature information in the learning algorithm. This problem is endemic to real-world applications, and has been studied to some extent elsewhere, but I expect that in our unwritten future history, we’ll discover that further study of this problem is more helpful than almost anyone realizes. The reply that I owed him feels heavy, and an incompleteness is hanging. For his wife and children it is surely so incomparably greater that I lack words.

(Added) Others: Fernando, Kevin McCurley, Danny Tarlow, David Hogg, Yisong Yue, Lance Fortnow on Sam, a Memorial site, and a Memorial Fund

12/27/2009

Interesting things at NIPS 2010

Tags: Machine Learning jl@ 3:40 pm

Several papers at NIPS caught my attention.

  1. Elad Hazan and Satyen Kale, Online Submodular Optimization They define an algorithm for online optimization of submodular functions with regret guarantees. This places submodular optimization roughly on par with online convex optimization as tractable settings for online learning.
  2. Elad Hazan and Satyen Kale On Stochastic and Worst-Case Models of Investing. At it’s core, this is yet another example of modifying worst-case online learning to deal with variance, but the application to financial models is particularly cool and it seems plausibly superior other common approaches for financial modeling.
  3. Mark Palatucci, Dean Pomerlau, Tom Mitchell, and Geoff Hinton Zero Shot Learning with Semantic Output Codes The goal here is predicting a label in a multiclass supervised setting where the label never occurs in the training data. They have some basic analysis and also a nice application to FMRI brain reading.
  4. Shobha Venkataraman, Avrim Blum, Dawn Song, Subhabrata Sen, and Oliver Spatscheck, Tracking Dynamic Sources of Malicious Activity at Internet Scales. This is a plausible combination of worst-case learning algorithms in a tree-like structure over IP space to track and predict bad IPs. Their empirical results look quite good to me and there are many applications where this prediction problem needs to be solved.
  5. Kamalika Chaudhuri, Daniel Hsu, and Yoav Freund, A Parameter Free Hedging Algorithm This paper is about eliminating the learning rate parameter from online learning algorithms. While that’s certainly useful, the approach taken involves a double-exponential rather than a single exponential potential, which is strange and potentially useful in many other places.
  6. Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Corinna Cortes, Polynomial Semantic Indexing This is about an empirically improved algorithm for learning ranking functions based on (query,document) content. The sexy Semantic name is justified because it is not based on syntactic matching of query to document.

I also found the future publication models discussion interesting. The follow-up post here has details and further discussion.

At the workshops, I was deeply confronted with the problem of too many interesting workshops to attend in the given amount of time. Two talks stood out for me:

  1. Carlos Guestrin gave a talk in the interactive machine learning workshop on Turning Down the Noise in the Blogosphere by Khalid El-Arini, Gaurav Veda, Dafna Shahaf, and Carlos Guestrin which I missed at KDD this year. The paper discusses the use exponential weight online learning algorithms to rerank blog posts based on user-specific interests. It comes with a demonstration website where you can test it out.
  2. Leslie Valiant gave a talk on representations and operations on concepts in a brain-like fashion. The style of representation and algorithm involves distributed representations on sparse graphs, an approach which is relatively unfamiliar. Bloom filters and in machine learning experience with learning through hashing functions has sharpened my intuition a bit. The talk seemed to cover Memorization and Association on a Realistic Neural Model at Neural Computation as well as A First Experimental Demonstration of Massive Knowledge Infusion at KR.

12/24/2009

Top graduates this season

Tags: Graduates, Machine Learning jl@ 4:55 pm

I would like to point out 3 graduates this season as having my confidence they are capable of doing great things.

  1. Daniel Hsu has diverse papers with diverse coauthors on {active learning, mulitlabeling, temporal learning, …} each covering new algorithms and methods of analysis. He is also a capable programmer, having helped me with some nitty-gritty details of cluster parallel Vowpal Wabbit this summer. He has an excellent tendency to just get things done.
  2. Nicolas Lambert doesn’t nominally work in machine learning, but I’ve found his work in elicitation relevant nevertheless. In essence, elicitable properties are closely related to learnable properties, and the elicitation complexity is related to a notion of learning complexity. See the Surrogate regret bounds paper for some related discussion. Few people successfully work at such a general level that it crosses fields, but he’s one of them.
  3. Yisong Yue is deeply focused on interactive learning, which he has attacked at all levels: theory, algorithm adaptation, programming, and popular description. I’ve seen a relentless multidimensional focus on a new real-world problem be an excellent strategy for research and expect he’ll succeed.

The obvious caveat applies—I don’t know or haven’t fully appreciated everyone’s work so I’m sure I missed people. I’d like to particularly point out Percy Liang and David Sontag as plausibly such whom I’m sure others appreciate a great deal.

12/9/2009

Inherent Uncertainty

Tags: Announcements, Machine Learning jl@ 12:01 pm

I’d like to point out Inherent Uncertainty, which I’ve added to the ML blog post scanner on the right. My understanding from Jake is that the intention is to have a multiauthor blog which is more specialized towards learning theory/game theory than this one. Nevertheless, several of the posts seem to be of wider interest.

Future Publication Models @ NIPS

Yesterday, there was a discussion about future publication models at NIPS. Yann and Zoubin have specific detailed proposals which I’ll add links to when I get them (Yann’s proposal and Zoubin’s proposal).

What struck me about the discussion is that there are many simultaneous concerns as well as many simultaneous proposals, which makes it difficult to keep all the distinctions straight in a verbal conversation. It also seemed like people were serious enough about this that we may see some real movement. Certainly, my personal experience motivates that as I’ve posted many times about the substantial flaws in our review process, including some very poor personal experiences.

Concerns include the following:

  1. (Several) Reviewers are overloaded, boosting the noise in decision making.
  2. (Yann) A new system should run with as little built-in delay and friction to the process of research as possible.
  3. (Hanna Wallach(updated)) Double-blind review is particularly important for people who are unknown or from an unknown institution.
  4. (Several) But, it’s bad to take double blind so seriously as to disallow publishing on arxiv or personal webpages.
  5. (Yann) And double-blind is bad when it prevents publishing for substantial periods of time. Apparently, this comes up in CVPR.
  6. (Zoubin) Any new system should appear to outsiders as if it’s the old system, or a journal, because it’s already hard enough to justify CS tenure cases to other disciplines.
  7. (Fernando) There shouldn’t be a big change with a complex bureaucracy, but rather a smaller changes which are obviously useful or at least worth experimenting with.

There were other concerns as well, but these are the ones that I remember.

Elements of proposals include:

  1. (Yann) Everything should go to Arxiv or an arxiv-like system first, as per physics or mathematics. This addresses (1), because it delinks dissemination from review, relieving some of the burden of reviewing. It also addresses (2) since with good authors they can immediately begin building on each other’s work. It conflicts with (3), because Arxiv does not support double-blind submission. It does not conflict if we build our own system.
  2. (Fernando) Create a conference coincident journal in which people can publish at any time. VLDB has apparently done this. It can be done smoothly by allowing submission in either conference deadline mode or journal mode. This proposal addresses (1) by reducing peak demand on reviewing. It also addresses (6) above.
  3. (Daphne) Perhaps we should have a system which only reviews papers for correctness, which is not nearly as subjective as for novelty or interestingness. This addresses (1), by eliminating some concerns for the reviewer. It is orthogonal to the double blind debate. In biology, such a journal exists (pointer updated), because delays were becoming absurd and intolerable.
  4. (Yann) There should be multiple publishing entities (people or groups of people) that can bless a paper as interesting. This addresses (1).

There are many other proposal elements (too many for my memory), which hopefully we’ll see in particular proposals. If other people have concrete proposals, now is probably the right time to formalize them.

12/7/2009

Vowpal Wabbit version 4.0, and a NIPS heresy

Tags: Code, Machine Learning, Online jl@ 12:42 pm

I’m releasing version 4.0(tarball) of Vowpal Wabbit. The biggest change (by far) in this release is experimental support for cluster parallelism, with notable help from Daniel Hsu.

I also took advantage of the major version number to introduce some incompatible changes, including switching to murmurhash 2, and other alterations to cachefiles. You’ll need to delete and regenerate them. In addition, the precise specification for a “tag” (i.e. string that can be used to identify an example) changed—you can’t have a space between the tag and the ‘|’ at the beginning of the feature namespace.

And, of course, we made it faster.

For the future, I put up my todo list outlining the major future improvements I want to see in the code. I’m planning to discuss the current mechanism and results of the cluster parallel implementation at the large scale machine learning workshop at NIPS later this week. Several people have asked me to do a tutorial/walkthrough of VW, which is arranged for friday 2pm in the workshop room—no skiing for me Friday. Come join us if this heresy interests you as well :)

11/15/2009

The Other Online Learning

Tags: Machine Learning, Online, Teaching jl@ 3:07 pm

If you search for “online learning” with any major search engine, it’s interesting to note that zero of the results are for online machine learning. This may not be a mistake if you are committed to a global ordering. In other words, the number of people specifically interested in the least interesting top-10 online human learning result might exceed the number of people interested in online machine learning, even given the presence of the other 9 results. The essential observation here is that the process of human learning is a big business (around 5% of GDP) effecting virtually everyone.

The internet is changing this dramatically, by altering the economics of teaching. Consider two possibilities:

  1. The classroom-style teaching environment continues as is, with many teachers for the same subject.
  2. All the teachers for one subject get together, along with perhaps a factor of 2 more people who are experts in online delivery. They spend a factor of 4 more time designing the perfect lecture & learning environment as verified by extensive study.

These two approaches have a similar economic cost, with the additional effort in the second approach being offset by the fact that it is a one-time effort rather than an annual effort.

I’m sure many people prefer the classroom approach, because it’s traditional, because a teacher can adjust dynamically and intelligently to the student, and because a teacher provides ancillary benefits such as day care and child abuse detection. Nevertheless, the second approach represents a compelling alternative addressing education. For classes commonly taught through high school, it’s difficult to imagine how good a learning experience could be after millions of hours spent refining to create the perfect approach. Imagine repeating a lecture over-and-over, testing the resulting student understanding a {day, week, month, year, decade} later to such an extent that every slide, every sentence, and every exercise is optimized for excellent learning. We could even imagine adapting the lecture to the learning style of each student.

The process of converting to the second approach has been slow, but it seems to be picking up. This suggests we can expect several things:

  1. Shakeout Like all new approaches, there is room for early adopters to win while the established old order suffers. We can expect the most severe impact on pure teaching institutions which do not adopt the newer approaches. Research universities will be insulated in two ways: much of their revenue comes from research grants anyways while the new approach creates a flight to excellence, which the research universities can lay some claim to. At one extreme, I understand that only 4-5% of the operating budget for Caltech comes from student tuition.
  2. Centralized Testing. Although class lessons can be taught at a distance, and exercises worked out by students, there is great room for cheating. The remedy for this is a strong centralized testing service. This already exists in the form of SAT, GRE, and AP tests, because grade inflation and nonuniform standards are common across schools. If a student can ace these tests after taking online learning classes, then there is a real sense in which colleges accepting students are satisfied by their qualifications. We can expect this to become more true, and perhaps to see more employer-oriented tests. We can also expect that testable subjects have an inherent advantage in online learning. As centralized testing is a difficult market to break into, the existing systems have a substantial advantage here.
  3. Digitization. Doing online learning brings all the advantages and disadvantages of any other digital media. These include perfect replicability, essentially free distribution, and difficult economics—on one hand the approach could be vastly valuable while on the other it’s difficult to charge someone for something they can get free. The economics imply that there is room for a major charity or state government to accomplish a great deal which might be difficult to accomplish in a business model.
  4. Gaps. There are areas of teaching which are not amenable to online instruction. For example, teaching people to do research remains in the apprentice model. Similarly, letters of recommendation remain an aspect of the apprentice model. Subjects of relatively small interest such as individual research directions may not merit the effort of a highly polished online instruction system. Similarly, many elements of our current education system are not related to formal education, but rather are about students meeting students, teachers acting as daycare for students, or simply structuring the day for learning. Mechanisms achieving the same ends with online human learning systems are necessary, and the conflation of goals represented by the traditional education approach will retard (but not stop) the adoption of online learning approaches. This process has already taken a decade, and we can expect more decades to come.

For those of us interested in online machine learning, it’s natural to question the relationship with online human learning. The practices differ entirely, but the theory still applies, as there are no clauses in the theorem statements of the form “if the learning agent is not a human then…” When you examine the theorem statements for applicability to online human learning, there are a few ideas which may transfer well. One of these is the necessity and techniques for handling exploration problems. If there are two ways to teach a subject, then you could simply try both and take the best. But if your resources are limited then a UCB approach provides a more efficient mechanism for doing this testing. Similarly, if a student has a set of known attributes, contextual-bandit approaches suggest a sound mechanism for personalization of lessons.

Much of our other theory about the process of online learning may be helpful in a heuristic-motivating manner, but it appears typically too pessimistic to accurately capture what is possible. For example, a common technique to explain an idea when teaching is to simply cover a few extreme cases from which all others are some interpolation. The closest common machine learning analogue to this is some active learning algorithms, where a learning algorithm chooses which examples to label. But, of course, this is not an accurate model, because it’s not the student, but rather the teacher which is choosing the examples. A setting more suitable for student and teacher has been studied in learning theory (see the bibliography here for a link into the citation tree). However, these results are typically rather brittle, so it’s not clear yet that we have understood the right way to formalize this process.

11/9/2009

NYAS ML Symposium this year.

Tags: Machine Learning, Workshop jl@ 9:24 am

The NYAS ML symposium grew again this year to 170 participants, despite the need to outsmart or otherwise tunnel through a crowd.

Perhaps the most distinct talk was by Bob Bell on various aspects of the Netflix prize competition. I also enjoyed several student posters including Matt Hoffman’s cool examples of blind source separation for music.

I’m somewhat surprised how much the workshop has grown, as it is now comparable in size to a small conference, although in style more similar to a workshop. At some point as an event grows, it becomes owned by the community rather than the organizers, so if anyone has suggestions on improving it, speak up and be heard.

11/6/2009

Yisong Yue on Self-improving Systems

Tags: Machine Learning jl@ 7:48 pm

I’d like to point out Yisong Yue’s post on Self-improving systems, which is a nicely readable description of the necessity and potential of interactive learning to deal with the information overload problem that is endemic to the modern internet.

10/3/2009

Static vs. Dynamic multiclass prediction

Tags: Machine Learning, Problems jl@ 7:01 am

I have had interesting discussions about distinction between static vs. dynamic classes with Kishore and Hal.

The distinction arises in multiclass prediction settings. A static set of classes is given by a set of labels {1,…,k} and the goal is generally to choose the most likely label given features. The static approach is the one that we typically analyze and think about in machine learning.

The dynamic setting is one that is often used in practice. The basic idea is that the number of classes is not fixed, varying on a per example basis. These different classes are generally defined by a choice of features.

The distinction between these two settings as far as theory goes, appears to be very substantial. For example, in the static setting, in learning reductions land, we have techniques now for robust O(log(k)) time prediction in many multiclass setting variants. In the dynamic setting, the best techniques known are O(k), and furthermore this exponential gap may be essential, at least without further assumptions.

Are there techniques for converting from dynamic multiclass to static multiclass? For example, we could embed a dynamic set of classes within a much larger static set ranging over all possible dynamic classes while eliminating all class-dependent features. In some cases, this approach may work well, but I’ve also seen it fail, with the basic problem being that a learning algorithm might easily choose an invalid class. We could of course force a learning algorithm to choose amongst the dynamically valid set, but I don’t know a general way to do that without making the running time at least scale with the number of valid classes.

So, a basic question that’s bothering me is: When and how can we effectively predict amongst a set of dynamically defined classes in sublinear time? A quick answer is “it’s not possible because simply reading off the set of dynamically defined classes require O(class count) time”. This answer isn’t satisfying, because there are many ways to implicitly specify a set in sublinear time. So the modified question is “Are there natural ways to dynamically define classes in sublinear time? And can these be used for sublinear prediction?”

9/29/2009

Machine Learning Protests at the G20

Tags: Machine Learning jl@ 12:59 pm

The machine learning department at CMU turned out en masse to protest the G20 summit in Pittsburgh. Arthur Gretton uploaded some great photos covering the event :-)

9/21/2009

Netflix finishes (and starts)

Tags: Competitions, Machine Learning jl@ 6:11 pm

I attended the Netflix prize ceremony this morning. The press conference part is covered fine elsewhere, with the basic outcome being that BellKor’s Pragmatic Chaos won over The Ensemble by 15-20 minutes, because they were tied in performance on the ultimate holdout set. I’m sure the individual participants will have many chances to speak about the solution. One of these is Bell at the NYAS ML symposium on Nov. 6.

Several additional details may interest ML people.

  1. The degree of overfitting exhibited by the difference in performance on the leaderboard test set and the ultimate hold out set was small, but determining at .02 to .03%.
  2. A tie was possible, because the rules cut off measurements below the fourth digit based on significance concerns. In actuality, of course, the scores do differ before rounding, but everyone I spoke to claimed not to know how. The complete dataset has been released on UCI, so each team could compute their own score to whatever accuracy desired.
  3. I was impressed by the slick systematic uses of SVD mentioned in the technical presentation, as implied by the first comment here.
  4. The amount of programming and time which went into this contest was pretty shocking. I was particularly impressed with the amount of effort that went into various techniques for blending results from different systems. In this respect, the lack of release of the source code is a little bit disappointing.
  5. I forgot to ask explicitly, but no one mentioned using any joins of the data to external databases. That’s somewhat surprising if you think about it given how much other information is available about movies.
  6. I hadn’t previously convexity functioning as a tool for social engineering so explicitly. Because squared loss is convex, any two different solutions of similar performance can be linearly blended to yield a mixed solution of superior performance. The implications of this observation were on display.

Netflix also announced a plan for a new contest, which will focus on using features of users, and predicting well for the (presumably large number of) users who rate very few movies. I hope they get the anonymization on this data right, as it’s obviously important.

This brings up a basic issue: How should a contest be designed? In the main, the finished Netflix contest seems to have been well designed. For example, the double holdout set approach nicely prevents overfitting, which has been a bugaboo of some previous contests. One improvement they are already implementing is asymptopia removal—the contest will award $0.5M in 6 months, and $0.5M more in 18 months. Nevertheless, we might imagine better contests, and perhaps it’s worthwhile to do so given the amount of attention devoted.

  1. Metric One criticism is that squared loss does not very directly reflect the actual value to Netflix of a particular set of recommendations. This seems like a fair criticism, although if you believe ranking according to the optimal expected conditional ratings is the best possible, it is at least consistent. The degree to which suboptimal squared loss prediction controls suboptimality of a recommendation loss is weak, but it should kick in when squared loss is deeply optimized as happened in this contest.

    What you really want is something like “Did the user pick the recommended movie?” This would provide a qualitative leap in the fidelity of the metric to the true underlying problem. Unfortunately, doing this properly is difficult, as you need to cope with exploration issues, which must be done at the time of data collection. So my basic take is that the squared loss metric seems “ok”, with the proviso that it could be done better if you start the data collection with some amount of random exploration.

  2. Prize distribution In a race as tight as this one, it must feel pretty rough for the members of The Ensemble to put so much effort in and then win nothing. A good case can be made that this isn’t optimal design for a contest where we are trying to learn new things. For example, it seems quite plausible that there was some interesting technique used in The Ensemble yet not used by the winner. A case can also be made based on online learning with experts theory, which generally says that the right way to reward a stable of experts is via an exponential weighting scheme. This essentially corresponds to having a “softmax” prize distribution where the distribution to a participant p is according to e-C(winner – p) where C is a problem dependent constant. This introduces the possibility of a sybil attack, but that appears acceptably controllable, especially if the prize distribution is limited to the top few participants.
  3. Source Code After the Netflix prize was going for some time, the programming-time complexity of entering the contest became very formidable. The use of convex loss function and requiring participants to publish helped some with this, but it remained quite difficult. If the contest required the release of source code as well, I could imagine both lowering the barrier to late entry, and helping advance the field a bit more. Of course, it’s hard to go halfway with this—if you really want to guarantee that the source code works, you need to make the information exchange interface be the source code itself (which is then compiled and run in a sandbox), rather than labels.

One last question to consider is: Is it good for the research community to have contests? My general belief on this is a definite “yes”, as it gives people who know how to do things a chance to distinguish themselves. For the Netflix contest in particular, the contest has educated me a bit about ensemble and SVD-style techniques, and I’m sure it’s generally helped crystallize out a set of applicable ML technologies for many people, which I expect to see widely used elsewhere in the future.

9/18/2009

Necessary and Sufficient Research

Researchers are typically confronted with big problems that they have no idea how to solve. In trying to come up with a solution, a natural approach is to decompose the big problem into a set of subproblems whose solution yields a solution to the larger problem. This approach can go wrong in several ways.

  1. Decomposition failure. The solution to the decomposition does not in fact yield a solution to the overall problem.
  2. Artificial hardness. The subproblems created are sufficient if solved to solve the overall problem, but they are harder than necessary.

As you can see, computational complexity forms a relatively new (in research-history) razor by which to judge an approach sufficient but not necessary.

In my experience, the artificial hardness problem is very common. Many researchers abdicate the responsibility of choosing a problem to work on to other people. This process starts very naturally as a graduate student, when an incoming student might have relatively little idea about how to do research, so they naturally abdicate the problem choice to an advisor. As an inexperienced graduate student, it’s difficult to avoid this, because an advisor often really does know much better about what is easy, what is hard, and how to decompose a complex problem into solvable subproblems. Nevertheless, if your plan in life is to do substantial research, it’s essential even then to question research directions carefully.

In contrast to sufficient subgoals of a greater goal, there are also necessary subgoals. A necessary subgoal is one which must be solved to solve the greater goal. One of the reasons why the artificial hardness problem is so common is that the sufficient subgoals are commonly confused with necessary subgoals. The essential test for a necessary subgoal is whether or not a solution to the global problem can be used as a solution to the subgoal.

My personal greater goal is creating a master machine learning algorithm that can solve any reasonable learning problem where “reasonable” includes at least the set that humans can solve. Relative to this greater goal, many existing research programs do not appear necessary.

  1. The short form of my second comment on Marcus’s post is that I see the sufficiency but not the necessity of competing with all Turing machines.
  2. The necessity of several statistical modeling approaches appears unclear to me, because they often encounter severe computational problems. Perhaps this is an example of creating an artificially hard problem, as empirical experiences with Searn suggest.

What is necessary?

  1. Large data. It is clear that humans solving learning problems have access to large amounts of information which they employ to solve these problems. While we don’t stick children into a sensory deprivation tank to see how much it retards their ability to solve problems when grown, some experiments along these lines have been done with animals yielding obvious ability deficiency.
  2. Online learning. The ability to learn in an online environment with relatively little processing per bit of input is clearly a sufficient approach to solve many problems. We can also argue that it is a computational necessity, as retraining based upon all past information appears computationally infeasible, or at least deeply wasteful.
  3. Interactive learning. It’s clear that many animals use an interactive process to learn about the world. This form of learning is also necessary, because it provides the ability to answer unanticipated questions. We can further argue the necessity by pointing out that interactive proofs appear much more powerful in computational complexity theory than noninteractive proofs. For example, viewed from a learning perspective, much of science is about interactive learning.
  4. Compositional Design of a learning system. The necessity of compositional design in machine learning is not entirely clear. For example, we could imagine that it’s possible to design good learning systems using an evolutionary approach. Nevertheless, since our basic goal in research is a much more efficient and faster design, it seems that the decision to take a research-based approach implies that compositional design is necessary. Restated: we should be able to design the system to learn in components which compose to form an overall solution.
  5. Large contexts. It’s necessary that a learning algorithm be able to use a relatively large number of bits when making a decision. For example, people working on vision have cool examples where people manage to use many different cues to predict what an object is.
  6. Nonlinearity. People can clearly solve learning problems for which no linear representation (of input information) is capable of achieving good performance.

Some of these are criticizable as perhaps unnecessary, and I can easily imagine missing others. If you have other arguments for what is or is not necessary for this greater goal, please speak up.

There are two other categories of subgoal research we could consider. There are subgoals which are necessary and sufficient (in combination) to solve the greater goal. I don’t know of any such arguments for my greater goal.

The fourth category is subgoals which are neither necessary nor sufficient for a greater goal. In my experience such papers are quite common at conferences with types that include:

  1. Work on speeding up a slow algorithm leaving it slower than the state of the art,
  2. Otherwise improving an algorithm which is suboptimal while leaving it suboptimal.

The nitty-gritty of these questions come at review time. Which papers should be accepted? In general, decision making is pretty terrible because greater goals are rarely stated, perhaps as a form of strategic ambiguity. After all, the set of people attending a conference have multiple greater goals. Nevertheless, my personal ordering is:

  1. Necessary and sufficient research directions. An emptyset in my experience.
  2. Necessary research directions. This is typically a small fraction.
  3. Sufficient research directions. This is a greater fraction.
  4. Neither. This is often the majority.
  5. Wrong. Must be rejected.

So the question to periodically ask yourself as a researcher is: What is the greater goal? Is this subgoal necessary? Sufficient? Something else?

8/27/2009

New York Area Machine Learning Events

Tags: Announcements, Machine Learning jl@ 6:46 am

Several events are happening in the NY area.

  1. Barriers in Computational Learning Theory Workshop, Aug 28. That’s tomorrow near Princeton. I’m looking forward to speaking at this one on “Getting around Barriers in Learning Theory”, but several other talks are of interest, particularly to the CS theory inclined.
  2. Claudia Perlich is running the INFORMS Data Mining Contest with a deadline of Sept. 25. This is a contest using real health record data (they partnered with HealthCare Intelligence) to predict transfers and mortality. In the current US health care reform debate, the case studies of high costs we hear strongly suggest machine learning & statistics can save many billions.
  3. The Singularity Summit October 3&4. This is for the AIists out there. Several of the talks look interesting, although unfortunately I’ll miss it for ALT.
  4. Predictive Analytics World, Oct 20-21. This is stretching the definition of “New York Area” a bit, but the train to DC is reasonable. This is a conference of case studies of applications of ML to real-world problems.
  5. Machine Learning Symposium, Friday Nov. 6. I’m on the committee again this year. The abstract deadline is Sept. 30, and we already have several speakers lined up.

8/26/2009

Another 10-year paper in Machine Learning

When I was thinking about the best “10 year paper” for ICML, I also took a look at a few other conferences. Here is one from 10 years ago that interested me:

David McAllester PAC-Bayesian Model Averaging, COLT 1999. 2001 Journal Draft.

Prior to this paper, the only mechanism known for controlling or estimating the necessary sample complexity for learning over continuously parameterized predictors was VC theory and variants, all of which suffered from a basic problem: they were incredibly pessimistic in practice. This meant that only very gross guidance could be provided for learning algorithm design. The PAC-Bayes bound provided an alternative approach to sample complexity bounds which was radically tighter, quantitatively. It also imported and explained many of the motivations for Bayesian learning in a way that learning theory and perhaps optimization people might appreciate. Since this paper came out, there have been a number of moderately successful attempts to drive algorithms directly by the PAC-Bayes bound. We’ve gone from thinking that a bound driven algorithm is completely useless to merely a bit more pessimistic and computationally intense than might be necessary.

The PAC-Bayes bound is related to the “bits-back” argument that Geoff Hinton and Drew van Camp made at COLT 6 years earlier.

What other machine learning or learning theory papers from 10 years ago have had a substantial impact?

7/31/2009

Vowpal Wabbit Open Source Project

Tags: Code, Machine Learning, Online jl@ 12:51 pm

Today brings a new release of the Vowpal Wabbit fast online learning software. This time, unlike the previous release, the project itself is going open source, developing via github. For example, the lastest and greatest can be downloaded via:

git clone git://github.com/JohnLangford/vowpal_wabbit.git

If you aren’t familiar with git, it’s a distributed version control system which supports quick and easy branching, as well as reconciliation.

This version of the code is confirmed to compile without complaint on at least some flavors of OSX as well as Linux boxes.

As much of the point of this project is pushing the limits of fast and effective machine learning, let me mention a few datapoints from my experience.

  1. The program can effectively scale up to batch-style training on sparse terafeature (i.e. 1012 sparse feature) size datasets. The limiting factor is typically i/o.
  2. I started using the the real datasets from the large-scale learning workshop as a convenient benchmark. The largest dataset takes about 10 minutes. (This is using the native features that the organizers intended as a starting point, yet all contestants used. In some cases, that admittedly gives you performance nowhere near to optimal.)
  3. After using this program for awhile, it’s substantially altered my perception of what is a large-scale learning problem. This causes confusion when people brag about computational performance on tiny datasets with only 105 examples :)

I would also like to emphasize that this is intended as an open source project rather than merely a code drop, as occurred last time. What I think this project has to offer researchers is an infrastructure for implementing fast online algorithms. It’s reasonably straightforward to implant your own tweaked algorithm, automatically gaining the substantial benefits of the surrounding code that deals with file formats, file caching, buffering, etc… In a very real sense, most of the code is this surrounding stuff, which you don’t have to rewrite to benefit from. For people applying machine learning, there is some obvious value in getting very fast feedback in a batch setting, as well as having an algorithm that actually works in a real online setting.

As one example of the ability to reuse the code for other purposes, an effective general purpose online implementation of the Offset Tree is included. I haven’t seen any other implementation of an algorithm for learning in the agnostic partial label setting, so this code may be of substantial interest for people encountering these sorts of problems.

The difference between this version and the previous is a nearly total rewrite. Some bigger changes are:

  1. We dropped SEG for now, because of code complexity reasons.
  2. Multicore parallelization proceeds in a different fashion—parallelization over features instead of examples. This works better with caches. Note that all parallelization of the core algorithm is meaningless unless you use the -q flag, because otherwise you are i/o bound.
  3. The code is more deeply threaded, with a separate thread for parsing.
  4. There is support for several different loss functions, and it’s easy to add your own.

I’m interested in any bug reports or suggestions for the code. I have substantial confidence that this code can do interesting and useful things, but improving it is a constant and ongoing process.

7/11/2009

Interesting papers at KDD

Tags: Machine Learning jl@ 1:35 am

I attended KDD this year. The conference has always had a strong grounding in what works based on the KDDcup, but it has developed a halo of workshops on various subjects. It seems that KDD has become a place where the economy meets machine learning in a stronger sense than many other conferences.

There were several papers that other people might like to take a look at.

  1. Yehuda Koren Collaborative Filtering with Temporal Dynamics. This paper describes how to incorporate temporal dynamics into a couple of collaborative filtering approaches. This was also a best paper award.
  2. D. Sculley, Robert Malkin, Sugato Basu, Roberto J. Bayardo, Predicting Bounce Rates in Sponsored Search Advertisements. The basic claim of this paper is that the probability people immediately leave (”bounce”) after clicking on an advertisement is predictable.
  3. Frank McSherry and Ilya Mironov Differentially Private Recommender Systems: Building Privacy into the Netflix Prize Contenders. The basic claim here is that it is possible to beat the baseline system in Netflix and preserve a nontrivial amount of user privacy. It’s the first demonstration I’ve seen of this sort, and it’s particularly impressive they used a strong algorithm-independent definition of privacy which Cynthia Dwork first stated.

KDD also experimented this year with crowdvine which was interesting. Compared to Mark Reid’s efforts with ICML, they managed to get substantially more participation. There seemed to be two reasons: the conference organizers more deeply integrated and encouraged the use of crowdvine, and crowdvine has certain handy additional uses—you can create your own personal schedule for instance, which incidentally provides some vague global notion of the popularity of various papers. The biggest drawback I found was that the papers themselves were not integrated into the website.

7/9/2009

The Machine Learning Forum

Dear Fellow Machine Learners,

For the past year or so I have become increasingly frustrated with the peer review system in our field. I constantly get asked to review papers in which I have no interest. At the same time, as an action editor in JMLR, I constantly have to harass people to review papers. When I send papers to conferences and to journals I often get rejected with reviews that, at least in my mind, make no sense. Finally, I have a very hard time keeping up with the best new work, because I don’t know where to look for it…

I decided to try an do something to improve the situation. I started a new web site, which I decided to call “The machine learning forum” the URL is http://themachinelearningforum.org

The main idea behind this web site is to remove anonymity from the review process. In this site, all opinions are attributed to the actual person that expressed them. I expect that this will improve the quality of the reviews. An obvious other effect is that there will be fewer negative reviews, weak papers will tend not to get reviewed at all, but then again, is that such a bad thing?

If you have any interest in this endeavor, please register to the web site and please submit a photo of yourself. Based on the information on your web site I will decide whether to grant you “author” privileges that would allow you to write reviews and overviews. Anybody can submit pointers to publications that they would like somebody to review. Anybody can participate in the discussion forum that is a fancy message board with threads etc.

Right now the main contribution I am looking for are “overviews”.

Overviews are pages written by somebody who is an authority in some area (for example, Kamalika Chaudhuri is an authority on mixture models) in which they list the main papers in the area and five a high level description for how the papers relate. These overviews are intended to serve as an entry point for somebody that wants to learn about that subfield. Overviews *can* reference the work of the author of the overview. This is unlike reviews, in which the reviewer cannot be the author of the reviewed paper.

I hope you are interested enough to give this a try!

Comments are very welcome.

Cheers

Yoav Freund (yfreund@ucsd.edu)

Older Posts »

Powered by WordPress