Atomic Learning Workshop

Location: TTI-Chicago on the second floor of the Press building

Friday, March 24 : Structured Learning Day

Time	speaker	title	abstract
8:30 am - 9 am	Breakfast @ TTI
9 am - 9:50am	Charles Sutton	Sparse Probabilistic Communication in Atomic Learning	Probabilistic methods for atomic learning provide a principled framework for communication among atoms during training and prediction, and (at least theoretically) not requiring a calibration step. But in many applications, especially in language, the outcome space is very large. For example, complex tasks in information extraction, speech processing, and machine translation are solved by a pipeline of subtasks, each of which is a structured prediction task in itself. In such cases, when each atom has many outcomes, communicating a full probability distribution can be more information than is necessary, reducing efficiency and preventing training from scaling to large data sets. I will discuss recent work in communicating less often during training (that is, training atoms separately) and communicating more sparsely (that is, beam search-like methods). For example, recently we have studied a sparse forward-backward algorithm that interacts well with CRF training. In linear chains, we achieve a fourfold reduction in training time with no loss in accuracy.	CRF tutorial, paper: Sparse Forward-Backward, paper: Piecewise Training
10 am - 10:50am	Yasemin Altun	Supervised and Semi-Supervised Learning for Structured Variables	Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. I will talk about these approached and then investigate structured classification in a semi-supervised setting. I'll present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and present a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, this formulation naturally extends to new test points.
11 am - 11:50am	Ben Taskar	Structured prediction using tractable and intractable models	I will talk about the range of models (Markov random fields, context free grammars, matchings and graph-cuts) that can be estimated efficiently and exactly using the large-margin criterion and simple gradient-based methods. I will also discuss ongoing experimental and and theoretical work on learning intractable models using relaxations.
noon	lunch at TTI
1pm - 1:50pm	Drew Bagnell	Subgradient Structured Learning with Applications in Robotics	Imitation learning of sequential, goal-directed behavior by standard supervised techniques is often difficult. We frame learning such behaviors as a maximum margin structured prediction problem over a space of policies. Iâ€™ll demonstrate our approach applied to route planning for outdoor mobile robots, where the behavior a designer wishes a planner to execute is often clear, while specifying cost functions that engender such behavior is a much more difficult task. Further, we demonstrate a simple, provably efficient approach to structured maximum margin learning, based on the subgradient method, that leverages existing fast algorithms for inference. Although the technique is general, it is particularly relevant in problems where A* and dynamic programming approaches make learning policies tractable in problems beyond the limitations of QP formulations. I will show a few additional applications of the subgradient approach, and new regret and generalization bounds that take advantage of an online version of the algorithm. Finally, I will discuss where approximate inference procedures can be applied in the subgradient method for structured learning.
2pm-2:50pm	Vasin Punyakanok	Global Inference in Learning for Natural Language Processing	Natural language processing (NLP) problems conceptually often involve assigning values to multiple variables over some complex structure. Learning classifiers to predict the values of these variables may be done independently or all together, but in both cases, combining their outcomes requires a global inference to ensure the final global coherency. For a specific example, I will show an approach to use global inference for semantic role labeling problem in NLP. Then, I will discuss some related questions to this framework of utilizing global inference in learning.	slides
3pm - 3:50pm	Hal Daume III	Learning in Search	Modern machine learning methods are great for producing simple output, but extensions to structured outputs are typically limited to models with nice comptuational properties (sequences and trees under strict locality assumptions). Many problems (especially from natural language) do not fit this mold, and complex search techniques must be employed (eg., machine translation). I am interested in learning techniques that account for and exploit this search process for producing better outputs with better guarantees. In this framework, the "atoms" are the individual pieces of substructure that are predicted in each step, and search is the "glue" that combines them together.	slides Open Office
3:50pm	Discussion

Saturday, March 25 : The Broader Atomic Learning Day

Time	speaker	title	abstract	material
8:30 am - 9 am	Breakfast @ TTI
9:00am - 9:50am	John Langford	The Method of Reduction in Learning	Are all learning problems fundamentally "the same", or are they different? This basic question has dramatic impact. An answer of "different" implies that the way we should approach problem solving on a case-by-case basis creating customized algorithms for individual applications. An answer of "the same" implies that we might be able to reuse core learning algorithms repeatedly to solve many (or perhaps all) learning problems. I will discuss evidence that the answer is "the same". This includes new kinds of theorem statements for the relationship between learning problems and empirical evidence that this approach works in practice.	Slides, website
10am - 10:50am	Matti Kaariainen	Lower bounds for reductions	We present some results on lower bounds for reductions. These include a lower bound for reducing class probability estimation to binary classification that matches the upper bound provided by the probing reduction, and a lower bound for reducing a structural sequence prediction problem to binary classification. In order to be able to prove the lower bounds, we make some restricting assumptions on the reductions considered. While the assumptions we rely on at this point may be more restrictive than necessary, we show that some restrictions are needed, since otherwise virtually any learning problem can be solved by a (very unnatural and useless) reduction to binary classification. How to exclude such unnatural reductions in a natural way seems to be a difficult but important question both for lower bounds and in general, so we will devote some time to discussing it.	slides
11am - 11:50am	Rich Caruana	Inductive Transfer and Model Compression	Inductive Transfer (a.k.a. learning-to-learn, multitask learning, representation learning, ...) is the process of transfering what is learned for one problem to other problems so that they can be learned better. Through inductive transfer the attention of the learner can be focused towards important aspects of the problem it otherwise might miss. This "shaping" often makes the learned models much more complete and complex. In this talk I'll briefly review inductive transfer in neural nets, k-nearest neighbor, and bayes nets. I'll also discuss recent work in model compression which allows large, complex models to be "compressed" into smaller, faster models. One interesting use of model compression is to separate the tasks learned by multitask learning into smaller, task specific models. Model compression is important when the models learned by inductive transfer or ensemble methods become so large and complex that they are impractical to use in applications where memory is limited or predictions must be made in real-time. An useful side-effect of learning compressed models is that it gives some information about how complex the function learned for different problems really is.	slides
Noon	Lunch at TTI
1pm-1:50pm	Nic Schraudolph	Rapid Stochastic Gradient Descent for Atomic Learning	Learning in large systems composed of simple "atoms" has been hampered by a lack of efficient, scalable optimization algorithms for this pur- pose: second-order methods are too expensive for large, nonlinear models, conjugate gradient does not tolerate the noise inherent in online learning, and simple gradient descent, evolutionary algorithms, etc., are unacceptably slow to converge. I am addressing this problem by developing new ways to accelerate stochastic gradient descent, using second-order gradient informationbtained cheaply and locally through fast curvature matrix-vector products. In the stochastic meta-descent (SMD) algorithm, this cheap curvature information is built up iteratively into a stochastic approximation of Levenberg-Marquardt second-order gradient steps, which are then used to adapt individual gradient step sizes. SMD handles noisy, correlated, non-stationary signals well, and approaches the rapid convergence of second-order methods at only linear cost per iteration, thus scaling up to extremely large nonlinear systems. To date it has greatly accelerated adaptive applications in computational fluid dynamics, computer vision, and reinforcement learning. Our most recent development is a version of SMD operating in reproducing kernel Hilbert space.	slides
2pm-2:50pm	Simon Osindero	Learning in Deep Belief Networks	We show how to use ``complementary priors'' to eliminate the explaining away effects that make inference difficult in densely-connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The solutions returned by this greedy algorithm can be used as initilization points for other learning procedures that refine the weights. For example, we have used the fast, greedy algorithm tonitialize a slow fine-tuning procedure using a contrastive version of the wake-sleep algorithm. After such refinement, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model givesetter digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modelled by long ravines in the free-energy landscape of the top-level associative memory and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind. We have recently applied the same idea to describe the joint distribution of a set of face images, and we are able to learn a generative model that captures much of the structure in the dataset.
3pm-3:50pm	David McAllester	A* Instead of Dynamic Programming, or, Inference Rules as Atoms of Learning	We generalize Dijkstra's shortest path algorithm, A* search, and heuristics derived from abstractions to the problem of computing lightest weight derivations for general weighted logic programs. This generalization gives an essentially mechanical way of improving the efficiency of a wide variety of dynamic programing algorithms. We discuss examples in natural language processing and computer vision. As a demonstration we develop a novel algorithm for finding contours in images. The algorithms described here also allow for a clean approach to the pipeline problem --- the problem of passing information back and forth between various levels of processing in perceptual inference.
3:50pm	Final discussion