Machine Learning (Theory)

6/29/2005

Not EM for clustering at COLT

Tags: Papers, Unsupervised jl@ 10:06 am

One standard approach for clustering data with a set of gaussians is using EM. Roughly speaking, you pick a set of k random guassians and then use alternating expectation maximization to (hopefully) find a set of guassians that “explain” the data well. This process is difficult to work with because EM can become “stuck” in local optima. There are various hacks like “rerun with t different random starting points”.

One cool observation is that this can often be solved via other algorithm which do not suffer from local optima. This is an early paper which shows this. Ravi Kannan presented a new paper showing this is possible in a much more adaptive setting.

A very rough summary of these papers is that by projecting into a lower dimensional space, it is computationally tractable to pick out the gross structure of the data. It is unclear how well these algorithms work in practice, but they might be effective, especially if used as a subroutine of the form:

  1. Project to low dimensional space.
  2. Pick out gross structure.
  3. Project gross structure into the high dimensional space.
  4. Run EM (or some other local improvement algorithm) to find a final fit.

The effects of steps 1-3 is to “seed” the local optimization algorithm in a good place from which a global optima is plausibly reachable.

6/28/2005

A COLT paper

Tags: Papers, Prediction Theory jl@ 2:22 pm

I found Tong Zhang’s paper on Data Dependent Concentration Bounds for Sequential Prediction Algorithms interesting. Roughly speaking, it states a tight bound on the future error rate for online learning algorithms assuming that samples are drawn independently. This bound is easily computed and will make the progressive validation approaches used here significantly more practical.

The cross validation problem: cash reward

Tags: Prediction Theory, Problems jl@ 2:00 pm

I just presented the cross validation problem at COLT.

The problem now has a cash prize (up to $500) associated with it—see the presentation for details.

The write-up for colt.

6/22/2005

Languages of Learning

Tags: Organization jl@ 1:08 pm

A language is a set of primitives which can be combined to succesfully create complex objects. Languages arise in all sorts of situations: mechanical construction, martial arts, communication, etc… Languages appear to be the key to succesfully creating complex objects—it is difficult to come up with any convincing example of a complex object which is not built using some language. Since languages are so crucial to success, it is interesting to organize various machine learning research programs by language.

The most common language in machine learning are languages for representing the solution to machine learning. This includes:

  1. Bayes Nets and Graphical Models A language for representing probability distributions. The key concept supporting modularity is conditional independence. Michael Kearns has been working on extending this to game theory.
  2. Kernelized Linear Classifiers A language for representing linear separators, possibly in a large space. The key form of modularity here is kernelization.
  3. Neural Networks A language for representing and learning functions. The key concept supporting modularity is backpropagation. (Yann LeCun gave some very impressive demos at the Chicago MLSS.)
  4. Decision Trees Another language for representing and learning functions. The key concept supporting modularity is partitioning the input space.

Many other learning algorithms can be seen as falling into one of the above families.

In addition there are languages related to various aspects of learning.

  1. Reductions A language for translating between varying real-world losses and core learning algorithm optimizations.
  2. Feature Languages Exactly how features are specified varies from on learning algorithm to another. Several people have been working on languages for features that cope with sparsity or the cross-product nature of databases.
  3. Data interaction languages The statistical query model of learning algorithms provides a standardized interface between data and learning algorithm.

These lists surely miss some languages—feel free to point them out below.

With respect to research “interesting” language-related questions include:

  1. For what aspects of learning is a language missing? Anytime adhocery is encountered, this suggests that there is room for a language. Finding what is not there is both hard and valuable.
  2. Are any of these languages fundamentally flawed or fundamentally advantageous with respect to another language?
  3. What are the most easy to use and effective primitives for these languages?

6/18/2005

Lower Bounds for Learning Reductions

Tags: Problems, Reductions jl@ 10:40 pm

Learning reductions transform a solver of one type of learning problem into a solver of another type of learning problem. When we analyze these for robustness we can make statement of the form “Reduction R has the property that regret r (or loss) on subproblems of type A implies regret at most f ( r ) on the original problem of type B“.

A lower bound for a learning reduction would have the form “for all reductions R, there exists a learning problem of type B and learning algorithm for problems of type A where regret r on induced problems implies at least regret f ( r ) for B“.

The pursuit of lower bounds is often questionable because, unlike upper bounds, they do not yield practical algorithms. Nevertheless, they may be helpful as a tool for thinking about what is learnable and how learnable it is. This has already come up here and here.

At the moment, there is no coherent theory of lower bounds for learning reductions, and we have little understanding of how feasible they are or which techniques may be useful in proving them. Here is a rough summary of what I know:

  1. For structured prediction, we have a partially worked out lower bound for all reductions using the structure to only carry single bits. A proof for reductions using the structure in others ways seems tricky at the moment.
  2. For Reinforcement learning it may (this is unclear) be possible to prove a lower bound showing that prediction ability alone can not solve RL well.
  3. There are various results which can be thought of as lower bounds for more limited families of reductions. One example is analyzing exactly how badly margin optimization can underperform for 0-1 loss when there is noise.

Overall, this is a moderately interesting direction of research which has not been much investigated.

6/17/2005

Reopening RL->Classification

Tags: Problems, Reductions, Reinforcement jl@ 9:00 am

In research, it’s often the case that solving a problem helps you realize that it wasn’t the right problem to solve. This is the case for the “reduce RL to classification” problem with the solution hinted at here and turned into a paper here.

The essential difficulty is that the method of stating and analyzing reductions ends up being nonalgorithmic (unlike previous reductions) unless you work with learning from teleoperated robots as Greg Grudic does. The difficulty here is due to the reduction being dependent on the optimal policy (which a human teleoperator might simulate, but which is otherwise unavailable).

So, this problem is “open” again with the caveat that this time we want a more algorithmic solution.

Whether or not this is feasible at all is still unclear and evidence in either direction would greatly interest me. A positive answer might have many practical implications in the long run.

6/13/2005

Wikis for Summer Schools and Workshops

Tags: Organization dinoj@ 4:52 pm

Chicago ‘05 ended a couple of weeks ago. This was the sixth Machine Learning Summer School, and the second one that used a wiki. (The first was Berder ‘04, thanks to Gunnar Raetsch.) Wikis are relatively easy to set up, greatly aid social interaction, and should be used a lot more at summer schools and workshops. They can even be used as the meeting’s webpage, as a permanent record of its participants’ collaborations — see for example the wiki/website for last year’s NVO Summer School.

A basic wiki is a collection of editable webpages, maintained by software called a wiki engine. The engine used at both Berder and Chicago was TikiWiki — it is well documented and gets you something running fast. It uses PHP and MySQL, but doesn’t require you to know either. Tikiwiki has far more features than most wikis, as it is really a full Content Management System. (My thanks to Sebastian Stark for pointing this out.) Here are the features we found most useful:

  • Bulletin boards, or forums. The most-used one was the one for social events, which allowed participants to find company for doing stuff without requiring organizer assistance. While conferences, by their inherently less interactive nature, don’t usually benefit from all aspects of wikis, this is one feature worth adding to every one. [Example]

    Other useful forums to set up are “Lost and Found”, and discussion lists for lectures — although the latter only work if the lecturer is willing to actively answer questions arising on the forum. You can set forums up so that all posts to them are immediately emailed to someone.

  • Editable pages. For example, we set up pages for each lecture that we were able to edit easily later as more information (e.g. slides) became available. Lecturers who wanted to modify their pages could do so without requiring organizer help or permission. (Not that most of them actually took advantage of this in practice… but this will happen in time, as the wiki meme infects academia.) [Example]

  • Sign-up sheets. Some tutorials or events were only open to a limited number of people. Having editable pages means that people can sign up themselves. [Example]

  • FAQs. You can set up general categories, and add questions, and place the same question in different categories. We set most of this up before the summer school, with directions of how to get there from the airport, what to bring, etc. We also had volunteers post answers to anticipated FAQs like the location of local restaurants and blues clubs. [Example]

  • Menus. You can set up the overall layout of the webpage, by specifying the locations and contents menus on the left and right of a central `front page’. This is done via the use of `modules’, and makes it possible for your wiki pages to completely replace the webpages — if you are willing to make some aesthetic sacrifices.

  • Different levels of users: The utopian wiki model of having ‘all pages editable by everyone’ is … well, utopian. You can set up different groups of users with different permissions.

  • Calendars. Useful for scheduling, and for changes to schedules. (With the number of changes we had, we really needed this.) You can have multiple calendars e.g. one for lectures, another for practical sessions, and another for social events — and users can overlay them on each other. [Example]

A couple of other TikiWiki features that we didn’t get working at Chicago, but would have been nice to have, are these:

  • Image Galleries. Gunnar got this working at Berder, where it was a huge success. Photographs are great icebreakers, even the ones that don’t involve dancing on tables.

  • Surveys. These are easy to set up, and have option for participants to see, or not to see, the results of surveys — useful when asking people to rate lectures.

TikiWiki also has several features that we didn’t use, such as blogs and RSS feeds. It also has a couple of bugs (and features that are bad enough to be called bugs), such as permission issues and the inability to print calendars neatly. These will doubtless get cleaned up in due course.

Finally, owing to much prodding from John and some other MLSS participants, I’ve written up my experiences in using TikiWiki @ Chicago ‘05 on my website, including installation instructions and a list of “Good Things to Do”. This documentation is meant to be a survival guide complementary to the existing TikiWiki documentation, which can sometimes be overwhelming.

6/10/2005

Workshops are not Conferences

Tags: General jl@ 9:09 am

… and you should use that fact.

A workshop differs from a conference in that it is about a focused group of people worrying about a focused topic. It also differs in that a workshop is typically a “one-time affair” rather than a series. (The Snowbird learning workshop counts as a conference in this respect.)

A common failure mode of both organizers and speakers at a workshop is to treat it as a conference. This is “ok”, but it is not really taking advantage of the situation. Here are some things I’ve learned:

  1. For speakers: A smaller audience means it can be more interactive. Interactive means a better chance to avoid losing your audience and a more interesting presentation (because you can adapt to your audience). Greater focus amongst the participants means you can get to the heart of the matter more easily, and discuss tradeoffs more carefully. Unlike conferences, relevance is more valued than newness.
  2. For organizers: Not everything needs to be in a conference style presentation format (i.e. regularly spaced talks of 20-30 minute duration). Significant (and variable) question time, different talk durations, flexible rescheduling, and panel discussions can all work well.

6/8/2005

Question: “When is the right time to insert the loss function?”

Tags: Bayesian, General jl@ 9:15 am

Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time?

When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set.

The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields best performance. This question is related to a basic principle which both Yann LeCun(applied) and Vladimir Vapnik(theoretical) advocate: “solve the simplest prediction problem that solves the problem”. (One difficulty with this principle is that ’simplest’ is difficult to define in a satisfying way.)

One reason why it’s unclear is that optimizing an arbitrary loss is not an easy thing for a learning algorithm to cope with. Learning reductions (which I am a big fan of) give a mechanism for doing this, but they are new and relatively untried.

Drew Bagnell adds: Another approach to integrating loss functions into learning is to try to re-derive ideas about probability theory appropriate for other loss functions. For instance, Peter Grunwald and A.P. Dawid present a variant on maximum entropy learning. Unfortunately, it’s even less clear how often these approaches lead to efficient algorithms.

6/6/2005

Exact Online Learning for Classification

Tags: Solutions jl@ 9:43 am

Jacob Abernethy and I have found a computationally tractable method for computing an optimal (or near optimal depending on setting) master algorithm combining expert predictions addressing this open problem. A draft is here.

The effect of this improvement seems to be about a factor of 2 decrease in the regret (= error rate minus best possible error rate) for the low error rate situation. (At large error rates, there may be no significant difference.)

There are some unfinished details still to consider:

  1. When we remove all of the approximation slack from online learning, is the result a satisfying learning algorithm, in practice? I consider online learning is one of the more compelling methods of analyzing and deriving algorithms, but that expectation must be either met or not by this algorithm
  2. Some extra details: The algorithm is optimal given a small amount of side information (k in the draft). What is the best way to remove this side information? The removal is necessary for a practical algorithm. One mechanism may be the k->infinity limit.

Powered by WordPress