Machine Learning (Theory)

1/30/2006

Should the Input Representation be a Vector?

Tags: Machine Learning jl@ 12:10 am

Let’s suppose that we are trying to create a general purpose machine learning box. The box is fed many examples of the function it is supposed to learn and (hopefully) succeeds.

To date, most such attempts to produce a box of this form take a vector as input. The elements of the vector might be bits, real numbers, or ‘categorical’ data (a discrete set of values).

On the other hand, there are a number of succesful applications of machine learning which do not seem to use a vector representation as input. For example, in vision, convolutional neural networks have been used to solve several vision problems. The input to the convolutional neural network is essentially the raw camera image as a matrix. In learning for natural languages, several people have had success on problems like parts-of-speech tagging using predictors restricted to a window surrounding the word to be predicted.

A vector window and a matrix both imply a notion of locality which is being actively and effectively used by these algorithms. In contrast, common algorithms like support vector machines, neural networks, and decision trees do not use or take advantage of locality in the input representation. For example, many of these algorithms are (nearly) invariant under permutation of input vector features.

A basic question we should ask is: “Does it really matter whether or not a learning algorithm knows the locality structure of the input?” Consider a simplistic example where we have n input bits as features and the function f to learn uses k of hte n bits. Suppose we also know that the function is one of two candidates: f1 and f2, but that these are two otherwise complicated functions. Then, finding which of the k to use might require an O(n choose k) = O(n!/k!(n-k)!) computation and O(k log (n/k)) samples when k is much smaller than n. On the other hand, when we know which k are relevant, “learning” is trivial. This suggests that telling the algorithm which k features are “near” to other features can be very helpful in solving the problem, at least computationally.

There are several natural questions which arise:

  1. How do we specify locality? Using a neighborhood graph? A weighted neighborhood graph?
  2. What are natural algorithms subject to locality? It seems most practical existing learning algorithms using locality have some form of sweeping operation over local neighborhoods. Is that it, or is there more?
  3. Can a general purpose locality aware algorithm perform well on a broad variety of different tasks?

1/25/2006

1 year

Tags: Research jl@ 12:34 am

At the one year (+5 days) anniversary, the natural question is: “Was it helpful for research?”

Answer: Yes, and so it shall continue.

Some evidence is provided by noticing that I am about a factor of 2 more overloaded with paper ideas than I’ve ever previously been. It is always hard to estimate counterfactual worlds, but I expect that this is also a factor of 2 more than “What if I had not started the blog?”

As for “Why?”, there seem to be two primary effects.

  1. A blog is a mechanism for connecting with people who either think like you or are interested in the same problems. This allows for concentration of thinking which is very helpful in solving problems.
  2. The process of stating things you don’t understand publicly is very helpful in understanding them. Sometimes you are simply forced to express them in a way which aids understanding. Sometimes someone else says something which helps. And sometimes you discover that someone else has already solved the problem.

There are drawbacks which should not be minimized.

  1. A great deal of accumulated time and effort goes into writing posts.
  2. Stress. Telling coauthors “I’m sorry, but I don’t have much time to actually write.” is not at all fun. And the wrists are hurting.

One of the things that I thought would be a problem was running out of ideas for posts, but this just didn’t happen. I’m hoping to have more posts by others over the next year to help relieve (1). (Also, I often find the posts of occasional posters more interesting.) Problem (2) is just an ill of success that must be coped with.

Some statistics:

  1. posts 1 every 2 days, on average (150)
  2. comments 3/post on average (492)
  3. posters 10
  4. registered users 72
  5. visits per day About 2000 (some uncertainty due to sharing of webserver with less-used sites).
  6. unique IP addreses per month Perhaps 7000 (uncertainty due to same source).

I’ve been surprised by the growth of traffic to the site. It is odd to realize that a post here is seen by more people than a talk at even the largest machine learning conference. Radically more effort goes into any talk at nearly any conference.

Please comment on any particular thoughts, suggestions, changes, for the new year or the last.

1/23/2006

On Coding via Mutual Information & Bayes Nets

Say we have two random variables X,Y with mutual information I(X,Y). Let’s say we want to represent them with a bayes net of the form X< -M->Y, such that the entropy of M equals the mutual information, i.e. H(M)=I(X,Y). Intuitively, we would like our hidden state to be as simple as possible (entropy wise). The data processing inequality means that H(M)>=I(X,Y), so the mutual information is a lower bound on how simple the M could be. Furthermore, if such a construction existed it would have a nice coding interpretation — one could jointly code X and Y by first coding the mutual information, then coding X with this mutual info (without Y) and coding Y with this mutual info (without X).

It turns out that such a construction does not exist in general (Thx Alina Beygelzimer for a counterexample! see below for the sketch).

What are the implications of this? Well, it’s hard for me to say, but it does suggest to me that the ‘generative’ model philosophy might be burdened with a harder modeling task. If all we care about is a information theoretic, compact hidden state, then constructing an accurate Bayes net might be harder, due to the fact that it takes more bits to specify the distribution of the hidden state. In fact, since we usually condition on the data, it seems odd that we should bother specifying a (potentially more complex) generative model. What are the alternatives? The information bottleneck seems interesting, though this has peculiarities of its own.

Alina’s counterexample:

Here is the joint distribution P(X,Y). Sample binary X from an unbiased coin. Now choose Y to be the OR function of X and some other ‘hidden’ random bit (uniform). So the joint is:

P(0,0)=1/4
P(0,1)=1/4
P(1,0)=0
P(1,1)=1/2

Note P(X=1)=1/2 and P(Y=1)=3/4. Here,

I(X,Y)= 3/4 log (4/3) ~= 0.31

The rest of the proof showing that this is not achievable in a ‘compact’ Bayes net is in a comment.

1/18/2006

Is Multitask Learning Black-Boxable?

Multitask learning is the learning to predict multiple outputs given the same input. Mathematically, we might think of this as trying to learn a function f:X -> {0,1}n. Structured learning is similar at this level of abstraction. Many people have worked on solving multitask learning (for example Rich Caruana) using methods which share an internal representation. On other words, the the computation and learning of the ith prediction is shared with the computation and learning of the jth prediction. Another way to ask this question is: can we avoid sharing the internal representation?

For example, it might be feasible to solve multitask learning by some process feeding the ith prediction f(x)i into the jth predictor f(x,f(x)i)j,

If the answer is “no”, then it implies we can not take binary classification as a basic primitive in the process of solving prediction problems. If the answer is “yes”, then we can reuse binary classification algorithms to solve multitask learning problems.

Finding a satisfying answer to this question at a theoretical level appears tricky. If you consider the infinite data limit with IID samples for any finite X, the answer is “yes” because any function can be learned. However, this does not take into account two important effects:

  1. Using a shared representation alters the bias of the learning process. What this implies is that fewer examples may be required to learn all of the predictors. Of course, changing the input features also alters the bias of the learning process. Comparing these altered biases well enough to distinguish their power seems very tricky. For reference, Jonathon Baxter has done some related analysis (which still doesn’t answer the question).
  2. Using a shared representation may be computationally cheaper.

One thing which can be said about multitask learning (in either black-box or shared representation form), is that it can make learning radically easier. For example, predicting the first bit output by a cryptographic circuit is (by design) extraordinarily hard. However, predicting the bits of every gate in the circuit (including the first bit output) is easily done based upon a few examples.

1/13/2006

Benchmarks for RL

Tags: Machine Learning, Reinforcement jl@ 12:54 am

A couple years ago, Drew Bagnell and I started the RLBench project to setup a suite of reinforcement learning benchmark problems. We haven’t been able to touch it (due to lack of time) for a year so the project is on hold. Luckily, there are several other projects such as CLSquare and RL-Glue with a similar goal, and we strongly endorse their continued development.

I would like to explain why, especially in the context of criticism of other learning benchmarks. For example, sometimes the UCI Machine Learning Repository is criticized. There are two criticisms I know of:

  1. Learning algorithms have overfit to the problems in the repository. It is easy to imagine a mechanism for this happening unintentionally. Strong evidence of this would be provided by learning algorithms which perform great on the UCI machine learning repository but very badly (relative to other learning algorithms) on non-UCI learning problems. I have seen little evidence of this but it remains a point of concern. There is a natural order to how much we trust performance results:
    1. Competitions. A well run prediction competition provides some of the best evidence available about which learning algorithms work well. There are still some mechanisms for being misled (too many entries, not enough testing data, unlucky choice of learning problem that your algorithm just has the wrong bias for), but they are more controlled than anywhere else.
    2. Benchmarks. Benchmarks provide a “reasonable sanity check” that the algorithm will do something well in practice. Compared to competitions, there are additional ways to be misled and (unfortunately) to mislead. Overfitting becomes a serious issue.
    3. Simulated data Simulated data can be very misleading. Typically, there is little reason to believe that simulated data reflects the structure of real-world problems.

    This ordering should always be kept in mind when considering empirical results.

  2. The repository enforced a certain interface which many natural learning problems did not fit into. Some interface must be enforced in order for a benchmark/repository to be useful. This implies that problems fitting the interface are preferred both in submission of new problems and in choice of algorithms to solve. This is a reasonable complaint, but it’s basically a complaint that the UCI machine learning repository was too succesful. The good news here is that the internet has made it so anyone can setup a repository. Just put up a webserver and announce the dataset.

It is important to consider the benefits of a benchmark suite as well. The standard in reinforcement learning for experimental studies in reinforcement learning algorithms has been showing that some new algorithm does something reasonable on one or two reinforcement learning problems. Naturally, what this implies in practice is that the algorithms were hand tuned to work on these one-or-two problems, and there was relatively little emphasis on building a general purpose reinforcement learning algorithm.

This is strongly at odds with the end goal of the reinforcement learning community! The way to avoid this is by setting up a benchmark suite (the more problems, the better). With a large set of problems that people can easily test on, the natural emphasis in empirical algorithm development shifts towards general purpose algorithms.

Another drawback of the “one or two problems” approach is incomparability of results. When people reimplement simulators for reinforcement learning problems, it turns out these reimplementations typically differ in the details, and these details can radically alter the difficulty of the problem. For example, there are many ways to implement the “car on the hill” problem, and some of them are much more difficult to solve than others. Again, a large set of problems people can easily test on solves this problem because the precise definition of the problem is shared.

One sometimes reasonable view of the world is that if you define a problem publicly, it is as good as solved since there are plenty of smart people out there eager to prove themselves. According to this view of the world, setting up a benchmark is a vital task.

1/8/2006

Debugging Your Brain

Tags: Research jl@ 12:34 am

One part of doing research is debugging your understanding of reality. This is hard work: How do you even discover where you misunderstand? If you discover a misunderstanding, how do you go about removing it?

The process of debugging computer programs is quite analogous to debugging reality misunderstandings. This is natural—a bug in a computer program is a misunderstanding between you and the computer about what you said. Many of the familiar techniques from debugging have exact parallels.

  1. Details When programming, there are often signs that some bug exists like: “the graph my program output is shifted a little bit” = maybe you have an indexing error. In debugging yourself, we often have some impression that something is “not right”. These impressions should be addressed directly and immediately. (Some people have the habit of suppressing worries in favor of excess certainty. That’s not healthy for research.)
  2. Corner Cases A “corner case” is an input to a program which is extreme in some way. We can often concoct our own corner cases and solve them. If the solution doesn’t match our (mis)understanding, a bug has been found.
  3. Warnings On The compiler “gcc” has the flag “-Wall” which means “turn all warnings about odd program forms on”. You should always compile with “-Wall” as you immediately realize if you compare the time required to catch a bug that “-Wall” finds with the time required to debug the hard way.

    The equivalent for debugging yourself is listening to others carefully. In research, some people have the habit of wanting to solve everything before talking to others. This is usually unhealthy. Talking about the problem that you want to solve is much more likely to lead to either solving it or discovering the problem is uninteresting and moving on.

  4. Debugging by Design When programming, people often design the process of creating the program so that it is easy to debug. The analogy for us is stepwise mastery—first master your understanding of something basic. Then take the next step, the next, etc…
  5. Isolation When a bug is discovered, the canonical early trouble shooting step is isolating the bug. For a parse error, what is the smallest program exhibiting the error? For a compiled program: what are the simplest set of inputs which exhibit the bug? For research, what is the simplest example that you don’t understand?
  6. Representation Change When programming, sometimes a big program simply becomes too unwieldy to debug. In these cases, it is often a good idea to rethink the problem the program is trying to solve. How can you better structure the program to avoid this unwieldiness?

    The issue of how to represent the problem is perhaps even more important in research since human brains are not as adept as computers at shifting and using representations. Significant initial thought on how to represent a research problem is helpful. And when it’s not going well,
    changing representations can make a problem radically simpler.

Some aspects of debugging a reality misunderstanding don’t have a good analogue for programming because debugging yourself often involves social interactions. One basic principle is that your ego is unhelpful. Everyone (including me) dislikes having others point out when they are wrong so there is a temptation to avoid admitting it (to others, or more harmfully to yourself). This temptation should be actively fought . With respect to others, admitting you are wrong allows a conversation to move on to other things. With respect to yourself, admitting you are wrong allows you to move on to other things. A good environment can help greatly with this problem. There is an immense difference in how people behave under “you lose your job if wrong” and “great, let’s move on”.

What other debugging techniques exist?

1/6/2006

MLTV

As part of a PASCAL project, the Slovenians have been filming various machine learning events and placing them on the web here. This includes, for example, the Chicago 2005 Machine Learning Summer School as well as a number of other summer schools, workshops, and conferences.

There are some significant caveats here—for example, I can’t access it from Linux. Based upon the webserver logs, I expect that is a problem for most people—Computer scientists are particularly nonstandard in their choice of computing platform.

Nevertheless, the core idea here is excellent and details of compatibility can be fixed later. With modern technology toys, there is no fundamental reason why the process of announcing new work at a conference should happen only once and only for the people who could make it to that room in that conference. The problems solved include:

  1. The multitrack vs. single-track debate. (”Sometimes the single track doesn’t interest me” vs. “When it’s multitrack I miss good talks”
  2. “I couldn’t attend because I was giving birth/going to a funeral/a wedding”
  3. “What was that? I wish there was a rewind on reality.”

There are some fears here too. For example, maybe a shift towards recording and placing things on the web will result in lower attendance at a conference. Such a fear is confused in a few ways:

  1. People go to conferences for many more reasons than just announcing new work. Other goals include doing research, meeting old friends, worrying about job openings, skiing, and visiting new places. There also a subtle benefit of going to a conference: it represents a commitment of time to research. It is this commitment which makes two people from the same place start working together at a conference. Given all these benefits of going to a conference, there is plenty of reason for them to continue to exist.
  2. It is important to remember that a conference is a process in aid of research. Recording and making available for download the presentations at a conference makes research easier by solving all the problems listed above.
  3. This is just another new information technology. When the web came out, computer scientists and physicists quickly adopted a “place any paper on your webpage” style even when journals forced them to sign away the rights of the paper to publish. Doing this was simply healthy for the researcher because his papers were more easily readable. The same logic applies to making presentations at a conference available on the web.

Powered by WordPress