Machine Learning (Theory)

5/16/2011

Research Directions for Machine Learning and Algorithms

Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came, though, I bargained for a post instead, which provides a chance for many other people to comment.

There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them as people in academia may not yet understand them intuitively.

  1. Almost all the big impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc… and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear time algorithms and languages. Two very common examples of this are graphical models, where inference is often a superlinear operation (think about the n² dependence on the number of states in a Hidden Markov Model; a minimal sketch of this follows the list), and kernelized Support Vector Machines, where optimization is typically quadratic or worse. There are two basic reasons for this. The most obvious is that linear time allows you to deal with large datasets. A less obvious but critical point is that a superlinear time algorithm is inherently buggy—it has an unreliable running time that can easily explode if you accidentally give it too much input.
  2. Almost anything worth doing requires many people working together. This happens for many reasons. One is the time-critical aspect of development—in many places it really is worthwhile to pay more to have something developed faster. Another is that real projects are simply much bigger than you might otherwise expect. A third is that real organizations have people coming and going, and any project that belongs to just one person withers when they leave. This observation means that development of systems with clean abstractions can be extraordinarily helpful, as it allows people to work independently. It also means that simple, widely applicable tricks (for example the hashing trick, sketched after this list) can be broadly helpful.
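
As a concrete illustration of the superlinear-cost point in item 1, here is a minimal sketch of the forward pass for a hidden Markov model, assuming dense numpy arrays for the parameters (the function and variable names are just illustrative). The nested sum over states makes each time step cost O(S²) in the number of states S, so doubling the state count quadruples the work.

```python
import numpy as np

def hmm_forward(init, trans, emit, observations):
    """Forward algorithm: probability of an observation sequence under an HMM.

    init:  (S,)   initial state distribution
    trans: (S, S) trans[i, j] = P(next state j | current state i)
    emit:  (S, V) emit[s, o]  = P(observation o | state s)
    Each time step costs O(S^2): every state sums over every predecessor state.
    """
    alpha = init * emit[:, observations[0]]
    for obs in observations[1:]:
        # O(S^2) work per observation, hidden inside the matrix-vector product.
        alpha = (alpha @ trans) * emit[:, obs]
    return alpha.sum()
```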
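
And as a sketch of the kind of simple, widely applicable trick mentioned in item 2, here is one minimal version of the hashing trick: named features are mapped into a fixed-size index space without any shared dictionary, so independently written feature extractors compose without coordination. The hash function, dimension, and signed-hash detail below are illustrative choices, not a prescription.

```python
import hashlib

def hash_features(named_features, num_bits=18):
    """Map {feature_name: value} into a sparse vector of dimension 2**num_bits."""
    dim = 1 << num_bits
    indexed = {}
    for name, value in named_features.items():
        digest = int(hashlib.md5(name.encode("utf8")).hexdigest(), 16)
        index = digest % dim
        sign = 1 if (digest >> 127) & 1 else -1  # signed hash reduces collision bias
        indexed[index] = indexed.get(index, 0.0) + sign * value
    return indexed

# Example: two feature extractors can hash independently into the same space.
x = hash_features({"query:machine": 1.0, "title:learning": 1.0, "length": 0.37})
```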

A good way to phrase research directions is with questions. Here are a few of my natural questions.

  1. How do we efficiently learn in settings where exploration is required? These are settings where the choice of action you take influences the observed reward—ad display and medical testing are two good scenarios. This is deeply critical to many applications, because the learning-with-exploration setting is inherently more natural than the standard supervised learning setting. The tutorial we did details much of the state of the art here, but very significant questions remain. How can we do efficient offline evaluation of algorithms (see here for a first attempt, and the sketch after this list)? How can we be efficient in both sample complexity and computational complexity? Several groups are interested in sampling from a Bayesian posterior to solve these sorts of problems. Where and when can this be proved to work? (There is essentially no analysis.) What is a maximally distributed and incentive-compatible algorithm that remains efficient? The last question is very natural for marketplace design. How can we best construct reward functions operating on different time scales? What is the relationship between the realizable and agnostic versions of this setting, and how can we construct an algorithm which smoothly interpolates between the two?
  2. How can we learn from lots of data? We’ll be presenting a KDD survey tutorial about what’s been done. Some of the larger scale learning problems have been addressed effectively using MapReduce. The best example I know of here is Ozgur Cetin’s algorithm at Y!—preconditioned conjugate gradient with a Newton stepsize, using two passes over the examples per step. (A non-Hadoop version is implemented in VW for reference.) But linear predictors are not enough—we would like learning algorithms that can, for example, learn from all the images in the world. Doing this well plausibly requires a new approach and new learning algorithms. A key observation here is that the bandwidth required by the learning algorithm cannot be too great.
  3. How can we learn to index efficiently? The standard solution in information retrieval is to evaluate (or approximately evaluate) all objects in a database, returning the elements with the largest score according to some learned or constructed scoring function. This is an inherently O(n) operation, which is frustrating when it’s plausible that an exponentially faster O(log(n)) solution exists. A good solution here involves both theory and empirical work, as we need to think about how to think about the problem, and of course we need to solve it. (One family of sublinear approaches is sketched after this list.)
  4. What is a flexible, inherently efficient language for architecting representations for learning algorithms? Right now, graphical models often get (mis)used for this purpose. It’s easy and natural to pose a computationally intractable graphical model, implying that many real applications involve approximations. A better solution would be to use a different representation language which is always computationally tractable yet flexible enough to solve real-world problems. One starting point for this is Searn. Another general approach was the topic of the Coarse to Fine workshop. These are inherently related, as coarse-to-fine is a pruned breadth first search. Restated, it is not enough to have a language for specifying your prior structural beliefs—instead we must have a language that results in computationally tractable solutions.
  5. The Deep Learning problem remains interesting. How do you effectively learn complex nonlinearities capable of better performance than a basic linear predictor? An effective solution avoids feature engineering. Right now, this is almost entirely dealt with empirically, but theory could easily have a role to play in phrasing appropriate optimization algorithms, for example.
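
On the offline-evaluation question in item 1, the following is a minimal sketch of the standard inverse-propensity-score estimator. It assumes the exploration data logged the probability with which each action was chosen, and the names are illustrative rather than a reference implementation of any particular system.

```python
def ips_value(logged, policy):
    """Estimate the value of `policy` from exploration logs.

    logged: iterable of (context, action, reward, prob) tuples, where `prob`
            is the probability the logging policy chose `action` in `context`.
    policy: function mapping a context to an action.
    Unbiased when every action the new policy takes had prob > 0 when logged.
    """
    total, n = 0.0, 0
    for context, action, reward, prob in logged:
        n += 1
        if policy(context) == action:
            total += reward / prob  # importance weight corrects for exploration bias
    return total / n if n else 0.0
```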
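
For the indexing question in item 3, one known family of sublinear approaches is locality-sensitive hashing; the sketch below uses random-hyperplane (cosine) hashes so that a query is matched against a single bucket of candidates rather than all n items. This is a sketch of an existing technique, not a solution to the learning-to-index problem; a real system would use several tables and rerank the candidates with the learned scoring function.

```python
import numpy as np
from collections import defaultdict

class CosineLSH:
    """Random-hyperplane LSH: nearby vectors tend to land in the same bucket."""

    def __init__(self, dim, num_planes=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((num_planes, dim))
        self.buckets = defaultdict(list)

    def _key(self, x):
        # Sign pattern of random projections serves as the bucket key.
        return tuple(int(s) for s in (self.planes @ x) > 0)

    def add(self, item_id, x):
        self.buckets[self._key(x)].append(item_id)

    def candidates(self, x):
        # Only items sharing the query's bucket are scored, not all n items.
        return self.buckets[self._key(x)]
```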

Good solutions to each of the research directions above would result in revolutions in their area, and every one of them would plausibly see wide applicability.

What’s missing?

Cross-posted at CACM.

9/28/2010

Machined Learnings

Paul Mineiro has started Machined Learnings, where he’s seriously attempting to do ML research in public. I personally need to read through it in greater detail, as much of it is learning-reduction related, dealing with the sorts of complex source problems that come up in practice.

9/18/2009

Necessary and Sufficient Research

Researchers are typically confronted with big problems that they have no idea how to solve. In trying to come up with a solution, a natural approach is to decompose the big problem into a set of subproblems whose solution yields a solution to the larger problem. This approach can go wrong in several ways.

  1. Decomposition failure. The solution to the decomposition does not in fact yield a solution to the overall problem.
  2. Artificial hardness. The subproblems created are sufficient, if solved, to solve the overall problem, but they are harder than necessary.

Computational complexity forms a relatively new (in research history) razor by which to judge an approach sufficient but not necessary.

In my experience, the artificial hardness problem is very common. Many researchers abdicate the responsibility of choosing a problem to work on to other people. This process starts naturally in graduate school, when an incoming student might have relatively little idea of how to do research, so they abdicate the problem choice to an advisor. As an inexperienced graduate student, it’s difficult to avoid this, because an advisor often really does know much better what is easy, what is hard, and how to decompose a complex problem into solvable subproblems. Nevertheless, if your plan in life is to do substantial research, it’s essential even then to question research directions carefully.

In contrast to sufficient subgoals of a greater goal, there are also necessary subgoals. A necessary subgoal is one which must be solved to solve the greater goal. One of the reasons why the artificial hardness problem is so common is that the sufficient subgoals are commonly confused with necessary subgoals. The essential test for a necessary subgoal is whether or not a solution to the global problem can be used as a solution to the subgoal.

My personal greater goal is creating a master machine learning algorithm that can solve any reasonable learning problem where “reasonable” includes at least the set that humans can solve. Relative to this greater goal, many existing research programs do not appear necessary.

  1. The short form of my second comment on Marcus’s post is that I see the sufficiency but not the necessity of competing with all Turing machines.
  2. The necessity of several statistical modeling approaches appears unclear to me, because they often encounter severe computational problems. Perhaps this is an example of creating an artificially hard problem, as empirical experiences with Searn suggest.

What is necessary?

  1. Large data. It is clear that humans solving learning problems have access to large amounts of information which they employ to solve these problems. While we don’t stick children into a sensory deprivation tank to see how much it retards their ability to solve problems when grown, some experiments along these lines have been done with animals, yielding obvious deficits in ability.
  2. Online learning. The ability to learn online, with relatively little processing per bit of input, is clearly a sufficient approach for solving many problems. We can also argue that it is a computational necessity, as retraining based upon all past information appears computationally infeasible, or at least deeply wasteful. (A minimal sketch of the constant-work-per-example property follows this list.)
  3. Interactive learning. It’s clear that many animals use an interactive process to learn about the world. This form of learning is also necessary, because it provides the ability to answer unanticipated questions. We can further argue the necessity by pointing out that interactive proofs appear much more powerful in computational complexity theory than noninteractive proofs. For example, viewed from a learning perspective, much of science is about interactive learning.
  4. Compositional Design of a learning system. The necessity of compositional design in machine learning is not entirely clear. For example, we could imagine that it’s possible to design good learning systems using an evolutionary approach. Nevertheless, since our basic goal in research is a much more efficient and faster design, it seems that the decision to take a research-based approach implies that compositional design is necessary. Restated: we should be able to design the system to learn in components which compose to form an overall solution.
  5. Large contexts. It’s necessary that a learning algorithm be able to use a relatively large number of bits when making a decision. For example, people working on vision have cool examples where humans manage to use many different cues to predict what an object is.
  6. Nonlinearity. People can clearly solve learning problems for which no linear representation (of input information) is capable of achieving good performance.
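
As promised in item 2, here is a minimal sketch of online learning with constant work per example: a single pass of stochastic gradient descent for logistic regression over a stream of sparse examples. The learning rate and data format are illustrative assumptions.

```python
import math

def online_logistic_sgd(examples, dim, learning_rate=0.1):
    """Single pass, constant work per example: no retraining on past data.

    examples: iterable of (features, label) with sparse features {index: value}
              and label in {0, 1}.
    """
    w = [0.0] * dim
    for features, label in examples:
        margin = sum(w[i] * v for i, v in features.items())
        prediction = 1.0 / (1.0 + math.exp(-margin))
        error = prediction - label
        for i, v in features.items():
            w[i] -= learning_rate * error * v  # gradient step on this example only
    return w
```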

Some of these are criticizable as perhaps unnecessary, and I can easily imagine missing others. If you have other arguments for what is or is not necessary for this greater goal, please speak up.

There are two other categories of subgoal research we could consider. There are subgoals which are necessary and sufficient (in combination) to solve the greater goal. I don’t know of any such arguments for my greater goal.

The fourth category is subgoals which are neither necessary nor sufficient for a greater goal. In my experience such papers are quite common at conferences, with types that include:

  1. Work on speeding up a slow algorithm leaving it slower than the state of the art,
  2. Otherwise improving an algorithm which is suboptimal while leaving it suboptimal.

The nitty-gritty of these questions comes at review time. Which papers should be accepted? In general, decision making is pretty terrible because greater goals are rarely stated, perhaps as a form of strategic ambiguity. After all, the people attending a conference have multiple greater goals. Nevertheless, my personal ordering is:

  1. Necessary and sufficient research directions. An empty set in my experience.
  2. Necessary research directions. This is typically a small fraction.
  3. Sufficient research directions. This is a greater fraction.
  4. Neither. This is often the majority.
  5. Wrong. Must be rejected.

So the question to periodically ask yourself as a researcher is: What is the greater goal? Is this subgoal necessary? Sufficient? Something else?

8/3/2009

Carbon in Computer Science Research

Tags: CS, Research

Al Gore’s film and gradually more assertive and thorough science have managed to mostly shift the debate on climate change from “Is it happening?” to “What should be done?” In that context, it’s worthwhile to think a bit about what can be done within computer science research.

There are two things we can think about:

  1. Doing Research. At a cartoon level, computer science research consists of some combination of commuting to and from work, writing programs, running them on computers, writing papers, and presenting them at conferences. A typical computer has a power usage on the order of 100 watts, which works out to about 2.4 kilowatt-hours/day. Looking up David MacKay’s reference on power usage per person, it becomes clear that this is a relatively minor part of the lifestyle, although it could become substantial if many more computers are required. Much larger costs are associated with commuting (which researchers share with many other people) and attending conferences. Since local commuting is common across many people, and there are known approaches (typically public transportation) for making it more efficient, I expect researchers can piggyback on improvements in public transportation to reduce commuting costs. In fact, the situation for researchers may be better in general, as the nature of the job may make commuting avoidable, at least on some days.

    Presenting at conferences is the remaining problem area, essentially due to travel by airplane to and from a conference. Travel by airplane has an energy cost similar to travel by car over the same distance, but we typically take airplanes for very long distances. Unlike cars, typical airplane usage requires stored energy in a dense form. For example, there are no serious proposals I’m aware of for battery-powered airplanes, because all existing rechargeable batteries have an energy density around 1/10th that of hydrocarbon fuel (which makes sense given that about 3/4 of the mass in a hydrocarbon fire is oxygen from the air). This suggests airplane transport may be particularly difficult to adapt towards low or zero carbon usage. The plausible approaches I know of involve either using electricity (from where?) to inefficiently crack water for hydrogen, or the biofuel approach where hydrocarbons are made by plants, with neither of these approaches particularly far along in development. If these aren’t developed, we should expect fewer conferences, more regional conferences, Europe with its extensive fast train network to be less impacted, and more serious effort towards distributed conferences. For the last, it’s easy to imagine, with existing technology, simultaneous regional conferences which are mutually videoconferenced, and we aren’t far from being able to handle a fully interactive video broadcast amongst an indefinitely large number of participants. As a corollary of fewer conferences, other interactive mechanisms (for example research blogs) seem likely to grow.

  2. Research Topics. The keyword for research topics is efficiency, and it is not a trivial concern on a global scale. In computer science, there have been a few algorithms (such as quicksort and hashing) which substantially and broadly improved real-world efficiency, but the real driver of efficiency so far has been hardware development, which has improved efficiency phenomenally for several decades.

    Many of the efficiency improvements are sure to remain hardware based, but software is becoming an essential component. One basic observation about efficient algorithms is that for problems admitting an efficient parallel solution (counting is a great example; a minimal sketch appears at the end of this post), the parallel algorithm is generally more efficient, because energy use is typically superlinear in clock speed. As an extreme example, the human brain, which is deeply optimized by evolution for energy efficiency, typically runs at 100Hz or 100KHz.

    Although efficiency suggests parallel algorithms, this should not be done blindly. For example, in machine learning the evidence I’ve seen so far suggests that online learning (which is admittedly harder to parallelize) is substantially more efficient than batch-style learning, implying that I expect online approaches to be more efficient than the MapReduce-based machine learning typically seen in the Mahout project.

    A substantial difficulty with parallel algorithms is the programming itself. In this regard, there is plenty of room for programming language work as well.
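
As a concrete instance of the counting example above, here is a minimal parallel-counting sketch using Python’s multiprocessing module: chunks are counted independently and the partial counts are merged, so adding workers reduces wall-clock time while the total work stays nearly the same. The chunking scheme and worker count are illustrative.

```python
from collections import Counter
from multiprocessing import Pool

def count_chunk(chunk):
    """Count word occurrences in one chunk of the input (runs in a worker process)."""
    return Counter(word for line in chunk for word in line.split())

def parallel_word_count(lines, workers=4):
    # Split the input into roughly equal chunks, one per worker.
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with Pool(workers) as pool:
        partials = pool.map(count_chunk, chunks)
    total = Counter()
    for partial in partials:  # cheap reduce step: merge the partial counts
        total.update(partial)
    return total

if __name__ == "__main__":
    print(parallel_word_count(["a b a", "b c", "a"]))
```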

6/1/2009

Multitask Poisoning

Tags: Mathematics, Research

There are many ways that interesting research gets done. For example it’s common at a conference for someone to discuss a problem with a partial solution, and for someone else to know how to solve a piece of it, resulting in a paper. In some sense, these are the easiest results we can achieve, so we should ask: Can all research be this easy?

The answer is certainly no for fields where research inherently requires experimentation to discover how the real world works. However, mathematics (including the parts of physics, computer science, statistics, etc. that are effectively mathematics) doesn’t require experimentation. In effect, a paper can simply be a pure expression of thinking. Can all mathematical-style research be this easy?

What’s going on here is research-by-communication. Someone knows something, someone else knows something else, and as soon as one person knows both things, a problem is solved. The interesting thing about research-by-communication is that it is becoming radically easier with improved technology. It’s already possible to communicate with nearly anyone anywhere in the world via {voice, video, text, shared program/website/blog}. As these mechanisms become cheaper and easier to use, we’ll surely see their adoption in many activities. If all mathematical-style research can be sped up by communication, that’s great news for research.

However, I don’t believe it. The essential difficulty is that doing good research often requires the simultaneous understanding of several different things—the problem, the broken approaches to solving it, why they break, and some hint about where to look for a solution. Absorbing and simultaneously keeping in mind this information requires a substantial amount of effort.

The extent of effort required is perhaps most easily understood in collaboration with yourself. Often, a problem is not immediately solved the first time it is thought of, instead a researcher must attack it again and again, until either giving up or finding a solution. A basic parameter in attacking a problem is: How hard is it to resume where you left off?

For many of the problems I’ve worked on, the answer is pretty hard. After weeks of not working on a problem, it might take several hours to refresh my memory with all the simultaneous concerns that go into a solution. Even if I worked on such a problem yesterday, it might take a half hour or an hour to reach a state where I’m prepared to make progress.

Given the difficulty of research, I (and many other people) often struggle in dealing with interruptions. Modern technology has made communication very easy, implying a stream of potential interruptions throughout the day, some of which are undeniably fruitful. And yet, an interruption means the overhead of getting back to thinking must be paid yet again.

Trading off properly between the value of avoiding the overhead and the value of dealing with interruptions is a constant struggle which did not exist before modern communication technology made interruptions prevalent. I think it is common to give in to the interrupts, and effectively cease to be able to do research other than research-by-communication. I’m not ready to do that, so I’ve found it essential to arrange large unused and uninterrupted periods of time for the purpose of doing research.

Although the solution is simple, I’ve often found implementing it challenging in several ways. There is the standard problem that people you deal with don’t understand the overhead of switching tasks in research. However, there is also a more insidious problem. Coping with the modern world requires that at least some portion of our time be devoted to interrupts, which are almost always easier to deal with than research. Dealing with these interrupts therefore can create a bad habit where you seek interrupts to achieve the (short term, easy) gratification of dealing with them. Thus, multitasking creates an internal expectation of multitasking, which makes multitasking preferred, eliminating the ability to do careful research. This is what I think of as multitask poisoning—it’s not just the time lost to swapping some context back in, but also the bad habits of self-interruption that it creates.

I don’t have a good answer to this problem, other than continuing the struggle to preserve substantial contiguous chunks of time for thinking. An alternative, sometimes-applicable solution is to reduce the overhead of starting to solve a problem by decomposing it into subproblems. Where possible, this is of course valuable, but mathematical research is often almost uniquely resistant to this, because the solution is unknown, and hence so is any decomposition of it. Restated, the real problem is finding a valid decomposition.

I self-interrupted at least 3 times in making this post.
