Workshop Schedule

The general plan is to alternate between 30-minute talks and 15-minute breaks for discussion, bathroom, email, etc.

Breakfast at the Quad Club, lunch at TTI, dinner in groups.

Thursday

9:00 am Breakfast @ Quad Club
Overview Talks
9:30 am Nick Hopper Machine Learning and Reductions in Cryptography Modern cryptography can be viewed as the complement of machine learning: given apparently strong primitives (concept classes which appear hard to learn, even weakly), the goal of a cryptographer is to construct a cryptographic component from these primitives so that finding a weakness in the new component (weakly learning some supposedly hidden information) is provably equivalent to finding a weakness in the primitives used to construct it. Thus, in a sense, every cryptographic security proof is a reduction between weakly learning two related concept classes. Over the past twenty-five years, the cryptographic community has developed a set of tools for thinking about these reductions, and its own set of answers to the central questions of this workshop. This talk will highlight some of the famous cryptographic reductions, along with some cryptographic tools and approaches to thinking about reductions.
10:15 am John Langford Classification as an Atom Several fields, such as digital circuit design and 3-D graphics, have greatly benefited from a reductionist approach: find the simplest element capable of constructing the desired behaviors, and use it. This approach is typically fruitful because it becomes unnecessary to design a new element for each new problem. Instead, you reuse old components to construct solutions to new problems.

Is classification an atom for machine learning? We look at the set of problems reducible to classification and the advantages of using classification relative to other primitives.
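As a concrete illustration of classification used as an atom (an illustrative sketch, not material from the talk): one of the simplest reductions builds a multi-class learner out of a binary one via the one-versus-all construction. The `binary_learner` interface here is an assumption: any procedure that maps a binary-labeled sample to a real-valued scorer will do.

```python
# One-versus-all: reduce k-class classification to k binary problems.
# Illustrative sketch; "binary_learner" is any procedure that takes a
# binary-labeled sample and returns a real-valued scoring function.

def one_vs_all_train(binary_learner, examples, num_classes):
    """Train one binary scorer per class: class i versus the rest."""
    scorers = []
    for i in range(num_classes):
        relabeled = [(x, 1 if y == i else 0) for x, y in examples]
        scorers.append(binary_learner(relabeled))
    return scorers

def one_vs_all_predict(scorers, x):
    """Predict the class whose binary scorer is most confident."""
    scores = [s(x) for s in scorers]
    return max(range(len(scores)), key=lambda i: scores[i])
```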

11:00 am Imre Risi Kondor To the RKHS it's all the same It is now generally recognized that one of the principal advantages of kernel-based learning is that the same algorithms can support a wide variety of input and output spaces. I will be talking about how a wide variety of learning problems, involving learning on manifolds, groups, etc., can be reduced to the beloved SVM or Gaussian processes. In particular, I am going to talk about solving ranking problems with kernels.
11:45 am Lunch @ TTI-C
Reinforcement Learning Talks
1:30pm Alan Fern and Robert Givan Reinforcement Learning Viewed as Classification: Empirical Results for Large Relational Markov Decision Processes We present and empirically evaluate a heuristic reduction from reinforcement learning to cost-sensitive classification. Existing reductions that are based on approximate policy iteration typically represent policies via cost functions and produce a sequence of cost-function regression problems. Instead, our approach represents policies directly as state-action mappings and produces a sequence of cost-sensitive classification problems. We argue that reducing to classification rather than regression can have practical benefits as it is often easier to specify an effective policy space via a direct policy-language bias rather than a cost-function bias. Empirically our technique can select good stationary policies for very large relational Markov decision processes (MDPs) where previous techniques fail. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. paper
2:15pm Michail Lagoudakis (joint with Ron Parr) Reinforcement Learning as Classification: Leveraging Modern Classifiers The basic tools of machine learning appear in the inner loop of most reinforcement learning algorithms, typically in the form of Monte Carlo methods or function approximation techniques. To a large extent, however, current reinforcement learning algorithms draw upon machine learning techniques that are at least ten years old and, with a few exceptions, very little has been done to exploit recent advances in classification learning for the purposes of reinforcement learning. We use a variant of approximate policy iteration based on rollouts that allows us to use a pure classification learner, such as a support vector machine (SVM), in the inner loop of the algorithm. We argue that the use of SVMs, particularly in combination with the kernel trick, can make it easier to apply reinforcement learning as an ``out-of-the-box'' technique, without extensive feature engineering. Our approach opens the door to modern classification methods, but does not preclude the use of classical methods. We present experimental results in the pendulum balancing and bicycle riding domains using both SVMs and neural networks for classifiers. ICML 2003 paper
3:00pm Drew Bagnell Policy Search by Dynamic Programming Policy search approaches to reinforcement learning represent a promising method for solving POMDPs and large MDPs. Unfortunately, many methods in common use suffer from poor scaling in required computational and sample resources. We show that if a ``baseline distribution'' is given (indicating roughly how often we expect a good policy to visit each state), then we can derive an efficient policy search algorithm for which we can provide non-trivial performance guarantees. Our work leverages the central insight of dynamic programming, the principle of optimality, without recourse to finding fixed points of the Bellman functional equation. In particular, our algorithm, which we call Policy Search by Dynamic Programming (PSDP), is in the spirit of the traditional dynamic programming approach to solving MDPs where values are ``backed up.'' In PSDP, it is the policy which is backed up. Each backup is in essence a weighted classification problem; in this way we are able to reduce reinforcement learning with a baseline distribution to a finite sequence of supervised learning problems. We will conclude with applications of the algorithm to robotic planning and control in partially observed domains.

This is joint work with Sham Kakade, Andrew Ng and Jeff Schneider.

paper
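The policy backup described in the abstract can be sketched for a small finite-horizon MDP. This is a minimal illustration under simplifying assumptions: expectations are computed exactly instead of by rollouts, every state is enumerated instead of sampled from a baseline distribution, and a lookup table stands in for the weighted classifier. The `P`/`R` interface is assumed for the sketch.

```python
# Policy Search by Dynamic Programming (PSDP), minimal sketch.
# Assumed interface: P[s][a] is a dict of next-state probabilities,
# R[s][a] is the immediate reward.

def psdp(states, actions, P, R, horizon):
    """Back up policies (not values): pi[t] maps state -> action."""
    pi = [dict() for _ in range(horizon)]

    def value(s, t):
        # Exact value of following the already-chosen policies pi[t:].
        if t == horizon:
            return 0.0
        a = pi[t][s]
        return R[s][a] + sum(p * value(s2, t + 1)
                             for s2, p in P[s][a].items())

    for t in reversed(range(horizon)):
        for s in states:  # with a baseline distribution, states would
            best_a, best_v = None, float("-inf")  # be sampled here
            for a in actions:
                v = R[s][a] + sum(p * value(s2, t + 1)
                                  for s2, p in P[s][a].items())
                if v > best_v:
                    best_a, best_v = a, v
            pi[t][s] = best_a  # the "classification" step, as a table
    return pi
```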
3:45pm John Langford Families of Reductions of Reinforcement Learning to Classification There are several ways to reduce reinforcement learning to classification in the presence of some form of side information about the world. Some of these methods have advantages in precision, some in computational complexity, and some in other resource requirements. I will review and compare them all at a high level. paper paper
4:30pm Adam Klivans In Search of New Algorithms for PAC Learning A concept class has polynomial threshold function degree d if for every element of the concept class there exists a real-valued, multivariate polynomial of degree d such that the polynomial is positive on exactly the set of inputs for which the concept is TRUE (and negative otherwise).

Recent results for classical problems such as learning DNF or learning intersections of halfspaces suggest that giving a reduction from a concept class to a polynomial threshold function is a powerful approach for designing algorithms.

Unfortunately, it also seems to be the only approach for designing PAC learning algorithms. We will show that, with the notable exception of the concept class of parity functions, essentially all known PAC algorithms (in the distribution-free model) can be rephrased as reductions to polynomial threshold functions. We will describe recent polynomial threshold function constructions for concept classes such as DNF formulae, intersections of halfspaces, and decision lists, and show that even minor variations of this approach can yield interesting new algorithms.

(Joint work with Ryan O'Donnell and Rocco Servedio)
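To make the definition concrete (a textbook-style example, not from the talk): the conjunction x1 AND x2 AND x3 over {0,1}^3 has polynomial threshold function degree 1, witnessed by p(x) = x1 + x2 + x3 - 2.5, which is positive exactly on the all-ones input. Parity is the standard contrasting example, requiring degree n.

```python
# Verify that a degree-1 polynomial threshold function represents the
# conjunction of three bits: sign(p) agrees with AND on every input.
from itertools import product

def p(x):
    # degree-1 polynomial: positive only when all three bits are 1
    return sum(x) - 2.5

def conjunction(x):
    return all(x)

agree = all((p(x) > 0) == bool(conjunction(x))
            for x in product([0, 1], repeat=3))
```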

5:15pm Dinner

Friday

8:30 am Breakfast @ Quad Club
Theory
9:30 am Shai Ben-David An Informational Complexity Notion of Reduction for Concept Classes Several parameters defined for a concept class may be viewed as reflecting the informational complexity of the class. The VC-dimension is probably the prime example of such a parameter. Some other examples are:
  • The sample complexity of learning the concept class from random labeled examples.
  • The size of the minimal compression scheme for the class.
  • The optimal mistake bound for learning the class online (or the query complexity of learning this class using membership and equivalence queries).
We introduce an embedding-based partial order over classes that refines each of (the partial orders over classes induced by) the above parameters. Our notion is similar to the Pitt-Warmuth 'prediction preserving reductions' stripped of its computational complexity aspects. We introduce the resulting notion of universal concept classes that play a role analogous to that played in computational complexity by languages that are complete for a given complexity class. We show the existence of such complete classes for the family of 'geometric concept classes' and use this completeness to prove the existence of strong compression schemes.

This talk is based on joint work with Ami Litman.

10:15 am Phil Long Weighing Evidence in the Absence of a Gold Standard Many propositions in biology can be supported by a variety of different sorts of evidence. It is often useful to collect large numbers of such propositions, together with the evidence supporting them, into databases to be used in other analyses. In that case, it becomes necessary to decide automatically which propositions are supported well enough to be included. This can involve weighing evidence of varying strength. In some important cases, there is effectively no definitive source of evidence, no "gold standard," against which the others can be evaluated. Examples of problems with this property are (1) pairing up equivalent genes between species, and (2) predicting which pairs of proteins interact.

Previously, Mitina and Vereshchagin proposed and analyzed a theoretical model of a problem with some key properties in common with these applications. To an extent, a more detailed theoretical formulation of the above problems can be reduced to a problem similar to theirs, to which their techniques can be applied. This leads naturally to an algorithm, though some further work was needed to make it practical.

I will describe simulation results for this algorithm on artificial data, evaluation of the method through its application to mapping orthologs, and some open problems.

11:00 am Guy Lebanon The Ranking Poset Framework for Classification and Ranking We introduce a new framework for combining rankers that generalizes several known models, such as discrete AdaBoost and logistic regression, error-correcting output codes, and the Mallows model. Using the new framework, we study properties of the above models and derive a previously unknown Bayesian interpretation. In addition, the framework suggests new models for ranking and classification. One such model, cranking, is demonstrated in the context of meta-search and classification. Other proposed models include an extension of the above framework to incomplete and categorical rankings and an application of the framework to the collaborative filtering problem. NIPS 2002 paper
11:45 am Lunch @ TTI-C
Instantiations
1:30pm Adam Kalai Boosting with Noise One of the most common complaints about boosting is that it seems to perform poorly on noisy data sets. We show theoretically exactly how far one can boost in the presence of random classification noise. Under a slightly different model of weak learning, we show that one can boost to arbitrary accuracy in the presence of noise.

In newer results, we give a noise-tolerant boosting algorithm that has the same theoretical efficiency as mainstream boosting algorithms such as AdaBoost.

Joint work with Rocco Servedio.

paper
2:15pm Bianca Zadrozny Reductions as Changes in the Distribution of Examples Recent reductions from cost-sensitive learning and reinforcement learning to zero/one-loss classifier learning involve, as a key component, a change in the distribution of examples. The boosting reduction from strong classifier learning to weak classifier learning is also carried out as a series of distribution changes. Besides being interesting theoretically as a possible way of unifying different reductions into a common framework, the observation that many reductions in machine learning involve distribution changes has implications for the use of reductions in practice. In order to achieve good results through the use of reductions, our "core learning algorithm" (the one we reduce to) must be able to learn effectively with respect to a distribution of examples that is different from the one from which we can directly draw examples. In this talk, I will compare different ways of implementing this with existing classifier learners and present a general ensemble learning scheme based on rejection sampling that yields very good results in practice for a variety of base classifier learners.

This talk is based on joint work with John Langford and Naoki Abe.
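The rejection-sampling idea behind such a scheme can be sketched as follows (a minimal, single-sample illustration; the function name and interface are assumptions, and an ensemble scheme would repeat this to train several base classifiers): keep each example with probability proportional to its importance weight, then hand the accepted subset to any ordinary classifier learner.

```python
# Simulate a change of distribution by rejection sampling: accept each
# example with probability weight / max_weight, so the accepted sample
# is distributed as if examples had been drawn with the new weights.
import random

def rejection_sample(examples, weights, max_weight, rng=random.random):
    """examples: list of (x, y); weights[i] >= 0; max_weight >= max(weights)."""
    accepted = []
    for (x, y), w in zip(examples, weights):
        if rng() < w / max_weight:  # accept with prob. proportional to w
            accepted.append((x, y))
    return accepted
```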

3:00 pm David McAllester Collective Classification The collective classification problem arises when a set of items is to be simultaneously classified, taking into account correlations between them. Examples include labeling the words in a sentence with their parts of speech, or classifying the web pages of a department web site into professors and students. We review some existing approaches to this problem and consider a particular reduction to classical binary classification.
3:45pm Naoki Abe Reduction of Unsupervised Learning to Classification with Application to Outlier Detection We present a formulation of unsupervised learning which can be reduced to classification; namely, a classification algorithm with a performance guarantee can be transformed into an unsupervised learning algorithm with the same guarantee. While the reduction is valid for various forms of unsupervised learning, including clustering and density estimation, it is most natural to apply it to the problem of outlier detection. We demonstrate its practical significance by applying it to a prototypical application of outlier detection: intrusion detection. Our experiments show that our method based on the proposed reduction attains excellent predictive performance compared to the best-performing methods for this problem in the literature.

Joint work with John Langford.
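One common way to instantiate a reduction of this flavor (an illustrative sketch, not necessarily the construction of the talk) is to contrast the data against an artificial background sample: label the real data 1, draw a synthetic sample labeled 0, train any classifier on the mixture, and flag as outliers the points the classifier scores as background-like. The `classifier_trainer` interface is assumed here.

```python
# Reduce outlier detection to classification by contrast with an
# artificial uniform background sample (1-d data for simplicity).
# "classifier_trainer" is any procedure that maps labeled data to a
# function estimating p(real | x).
import random

def outlier_scores(data, lo, hi, classifier_trainer, rng=random.Random(0)):
    """data: list of 1-d points in [lo, hi]. Returns background scores."""
    background = [rng.uniform(lo, hi) for _ in data]
    labeled = [(x, 1) for x in data] + [(x, 0) for x in background]
    model = classifier_trainer(labeled)  # returns p(real | x)
    return [1.0 - model(x) for x in data]  # high score = outlier-like
```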

4:30pm Valentina Bayer Cost-sensitive Learning: Reduction from Supervised Learning to an Acyclic MDP We define cost-sensitive learning with attribute costs and misclassification costs as the problem of learning diagnostic policies that minimize expected total cost.

In this formulation, cost-sensitive learning is harder to solve than standard supervised learning. Special cases of cost-sensitive learning problems include classifiers minimizing 0/1 loss, classifiers sensitive only to attribute costs, and classifiers sensitive only to misclassification costs.

This cost-sensitive learning problem can be reduced to an acyclic MDP with a single start state (the state in which no attributes have been measured). Model-free reinforcement learning algorithms take too long to converge, due to the stochastic nature of classification actions. Dynamic programming methods can be used to solve this MDP, but it is more efficient to perform heuristic search.

We are interested in learning cost-sensitive diagnostic policies from datasets of labeled examples. We show on realistic problems that AO*-based systematic search learns good diagnostic policies when additional safeguards against overfitting are provided.

paper
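The acyclic structure makes the exact dynamic program easy to state (a minimal sketch under assumed interfaces, not the paper's AO* system): a state is the set of attributes measured so far, and the best policy at each state either classifies now, paying the expected misclassification cost, or pays to measure one more attribute.

```python
# Exact DP over the acyclic diagnostic MDP: since attributes are never
# re-measured, a depth-first recursion computes the minimum expected
# total cost. Interfaces are assumed for this sketch: examples is a
# list of (features_dict, label), attr_costs maps attribute -> cost,
# mis_cost(pred, actual) is the misclassification cost.

def expected_cost(examples, attr_costs, mis_cost, measured=frozenset()):
    """Minimum expected total cost of a diagnostic policy from here."""
    n = len(examples)
    # Option 1: stop and classify with the label minimizing expected
    # misclassification cost over the examples reaching this state.
    labels = {y for _, y in examples}
    best = min(sum(mis_cost(pred, y) for _, y in examples) / n
               for pred in labels)
    # Option 2: measure one more attribute (acyclic: never re-measure).
    for a, c in attr_costs.items():
        if a in measured:
            continue
        by_value = {}
        for ex in examples:
            by_value.setdefault(ex[0][a], []).append(ex)
        sub = sum(len(g) / n * expected_cost(g, attr_costs, mis_cost,
                                             measured | {a})
                  for g in by_value.values())
        best = min(best, c + sub)
    return best
```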
5:15pm Dinner