Debugging Your Brain

One part of doing research is debugging your understanding of reality. This is hard work: How do you even discover where you misunderstand? If you discover a misunderstanding, how do you go about removing it?

The process of debugging computer programs is quite analogous to debugging reality misunderstandings. This is natural—a bug in a computer program is a misunderstanding between you and the computer about what you said. Many of the familiar techniques from debugging have exact parallels.

  1. Details. When programming, there are often signs that some bug exists, like “the graph my program output is shifted a little bit”, which suggests an indexing error. When debugging yourself, you often have some impression that something is “not right”. These impressions should be addressed directly and immediately. (Some people have the habit of suppressing worries in favor of excess certainty. That’s not healthy for research.)
  2. Corner Cases. A “corner case” is an input to a program which is extreme in some way. We can often concoct our own corner cases and solve them by hand. If the solution doesn’t match our (mis)understanding, a bug has been found (see the small sketch after this list).
  3. Warnings On. The compiler “gcc” has the flag “-Wall”, which means “turn on all warnings about odd program forms”. You should always compile with “-Wall”: the time required to fix a bug that “-Wall” flags is tiny compared to the time required to debug it the hard way.

    The equivalent for debugging yourself is listening to others carefully. In research, some people have the habit of wanting to solve everything before talking to others. This is usually unhealthy. Talking about the problem that you want to solve is much more likely to lead to either solving it or discovering the problem is uninteresting and moving on.

  4. Debugging by Design. When programming, people often design the process of creating the program so that it is easy to debug. The analogy for us is stepwise mastery: first master your understanding of something basic, then take the next step, and the next, etc.
  5. Isolation. When a bug is discovered, the canonical early troubleshooting step is isolating the bug. For a parse error, what is the smallest program exhibiting the error? For a compiled program, what is the simplest set of inputs which exhibits the bug? For research, what is the simplest example that you don’t understand?
  6. Representation Change. When programming, a big program sometimes simply becomes too unwieldy to debug. In these cases, it is often a good idea to rethink the problem the program is trying to solve. How can you better structure the program to avoid this unwieldiness?

    The issue of how to represent the problem is perhaps even more important in research, since human brains are not as adept as computers at shifting and using representations. Significant initial thought on how to represent a research problem is helpful, and when things are not going well, changing representations can make the problem radically simpler.
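
To make the programming side of items 1 and 2 concrete, here is a small illustrative sketch (not from the post itself): a corner case with a known answer exposes an off-by-one indexing bug of exactly the “shifted output” kind.

```python
def moving_average(xs, window=3):
    # Intended: mean of xs[i], ..., xs[i + window - 1] for each valid i.
    # Buggy: the slice starts one element too late, so every output is shifted
    # and the final window runs off the end of the list.
    return [sum(xs[i + 1:i + 1 + window]) / window
            for i in range(len(xs) - window + 1)]

# Corner case with a known answer: a constant sequence must have a constant
# moving average.  The mismatch is the "shifted graph" symptom made concrete.
print(moving_average([1.0] * 5))  # [1.0, 1.0, 0.666...] instead of [1.0, 1.0, 1.0]
```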

Some aspects of debugging a reality misunderstanding don’t have a good analogue in programming because debugging yourself often involves social interactions. One basic principle is that your ego is unhelpful. Everyone (including me) dislikes having others point out when they are wrong, so there is a temptation to avoid admitting it (to others, or more harmfully, to yourself). This temptation should be actively fought. With respect to others, admitting you are wrong allows a conversation to move on to other things. With respect to yourself, admitting you are wrong allows you to move on to other things. A good environment can help greatly with this problem. There is an immense difference in how people behave under “you lose your job if wrong” and “great, let’s move on”.

What other debugging techniques exist?

MLTV

As part of a PASCAL project, the Slovenians have been filming various machine learning events and placing them on the web here. This includes, for example, the Chicago 2005 Machine Learning Summer School as well as a number of other summer schools, workshops, and conferences.

There are some significant caveats here—for example, I can’t access it from Linux. Based upon the webserver logs, I expect that is a problem for most people—computer scientists are particularly nonstandard in their choice of computing platform.

Nevertheless, the core idea here is excellent and details of compatibility can be fixed later. With modern technology toys, there is no fundamental reason why the process of announcing new work at a conference should happen only once, and only for the people who could make it into that room at that conference. The problems solved include:

  1. The multi-track vs. single-track debate. (“Sometimes the single track doesn’t interest me” vs. “When it’s multi-track I miss good talks.”)
  2. “I couldn’t attend because I was giving birth/going to a funeral/a wedding”
  3. “What was that? I wish there was a rewind on reality.”

There are some fears here too. For example, maybe a shift towards recording and placing things on the web will result in lower attendance at a conference. Such a fear is confused in a few ways:

  1. People go to conferences for many more reasons than just announcing new work. Other goals include doing research, meeting old friends, worrying about job openings, skiing, and visiting new places. There is also a subtle benefit of going to a conference: it represents a commitment of time to research. It is this commitment which makes two people from the same place start working together at a conference. Given all these benefits, there is plenty of reason for conferences to continue to exist.
  2. It is important to remember that a conference is a process in aid of research. Recording and making available for download the presentations at a conference makes research easier by solving all the problems listed above.
  3. This is just another new information technology. When the web came out, computer scientists and physicists quickly adopted a “place any paper on your webpage” style, even when journals forced them to sign away the rights to the paper in order to publish. Doing this was simply healthy for the researcher because their papers were more easily readable. The same logic applies to making presentations at a conference available on the web.

Deadline Season

Many different paper deadlines are coming up soon, so I made a little reference table. Out of curiosity, I also computed the interval (in days) between the final submission deadline and the start of the conference.

Conference   Location      Date           Deadline                 Interval (days)
COLT         Pittsburgh    June 22-25     January 21               152
ICML         Pittsburgh    June 26-28     January 30/February 6    140
UAI          MIT           July 13-16     March 9/March 16         119
AAAI         Boston        July 16-20     February 16/21           145
KDD          Philadelphia  August 23-26   March 3/March 10         166
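
The interval column can be reproduced with a few lines of code. The sketch below assumes these are 2006 dates and counts from the later of the listed deadlines to the first day of the conference.

```python
from datetime import date

# (later deadline, first conference day), assuming 2006 for every event
events = {
    "COLT": (date(2006, 1, 21), date(2006, 6, 22)),
    "ICML": (date(2006, 2, 6),  date(2006, 6, 26)),
    "UAI":  (date(2006, 3, 16), date(2006, 7, 13)),
    "AAAI": (date(2006, 2, 21), date(2006, 7, 16)),
    "KDD":  (date(2006, 3, 10), date(2006, 8, 23)),
}
for name, (deadline, start) in events.items():
    print(name, (start - deadline).days)   # 152, 140, 119, 145, 166
```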

It looks like the northeastern US is the big winner as far as location this year.

Yet More NIPS Thoughts

I only managed to make it out to the NIPS workshops this year, so I’ll give my comments on what I saw there.

The Learning and Robotics workshop lives again. I hope it continues and gets more high-quality papers in the future. The most interesting talk for me was Larry Jackel’s on the LAGR program (see John’s previous post on said program). I got some ideas as to what progress has been made. Larry really explained the types of benchmarks and the tradeoffs that had to be made to make the goals achievable but challenging.

Hal Daume gave a very interesting talk about structured prediction using RL techniques, something near and dear to my own heart. He achieved rather impressive results using only a very greedy search.

The non-parametric Bayes workshop was great. I enjoyed the entire morning session I spent there, and particularly the (usually desultory) discussion periods. One interesting topic was the Gibbs/Variational inference divide. I won’t try to summarize, especially as no conclusion was reached. It was interesting to note that samplers are competitive with the variational approaches for many Dirichlet process problems. One open question I left with was whether the fast variants of Gibbs sampling could be made multi-processor, as the naive variants can.

I also have a reading list of sorts from the main conference. Most of the papers mentioned in previous posts on NIPS are on that list, as well as these (in no particular order):

The Information-Form Data Association Filter
Sebastian Thrun, Brad Schumitsch, Gary Bradski, Kunle Olukotun

Divergences, surrogate loss functions and experimental design
XuanLong Nguyen, Martin Wainwright, Michael Jordan

Generalization to Unseen Cases
Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri

Gaussian Process Dynamical Models
David Fleet, Jack Wang, Aaron Hertzmann

Convex Neural Networks
Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte

Describing Visual Scenes using Transformed Dirichlet Processes
Erik Sudderth, Antonio Torralba, William Freeman, Alan Willsky

Learning vehicular dynamics, with application to modeling helicopters
Pieter Abbeel, Varun Ganapathi, Andrew Ng

Tensor Subspace Analysis
Xiaofei He, Deng Cai, Partha Niyogi

Automated Labeling

One of the common trends in machine learning has been an emphasis on the use of unlabeled data. The argument goes something like “there aren’t many labeled web pages out there, but there are a huge number of web pages, so we must find a way to take advantage of them.” There are several standard approaches for doing this:

  1. Unsupervised Learning. You use only unlabeled data. In a typical application, you cluster the data and hope that the clusters somehow correspond to what you care about.
  2. Semisupervised Learning. You use both unlabeled and labeled data to build a predictor. The unlabeled data influences the learned predictor in some way.
  3. Active Learning. You have unlabeled data and access to a labeling oracle. You interactively choose which examples to label so as to optimize prediction accuracy.

It seems there is a fourth approach worth serious investigation—automated labeling. The approach goes as follows (a minimal sketch appears after the list):

  1. Identify some subset of observed values to predict from the others.
  2. Build a predictor.
  3. Use the output of the predictor to define a new prediction problem.
  4. Repeat…
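
Here is a minimal sketch of this loop. The nearest-centroid learner and the two-views-per-stage setup are illustrative assumptions, not part of any particular system; any supervised learner could be substituted.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    # Toy supervised learner chosen only for illustration: predict the class
    # whose centroid is closest to the query point.
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return lambda Q: classes[((Q[:, None, :] - centroids[None, :, :]) ** 2)
                             .sum(axis=2).argmin(axis=1)]

def automated_labeling(stages, seed_labels):
    # stages[0] is (ignored, X0): X0 is the feature view paired with the real
    # seed_labels.  Each later stage is (X_prev_view, X_new_view): the current
    # predictor labels the examples through the view it already understands,
    # and a new predictor is trained to predict those labels from the new
    # view, which defines the next prediction problem.
    _, X0 = stages[0]
    predictor = nearest_centroid_fit(X0, seed_labels)
    for X_prev_view, X_new_view in stages[1:]:
        inferred = predictor(X_prev_view)                  # automated labels
        predictor = nearest_centroid_fit(X_new_view, inferred)
    return predictor
```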

Examples of this sort seem to come up very naturally in robotics. An extreme version of this (made concrete in the code below) is:

  1. Predict nearby things given touch sensor output.
  2. Predict medium distance things given the nearby predictor.
  3. Predict far distance things given the medium distance predictor.
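
For concreteness, the chain above might be fed into the sketch like this, with purely synthetic stand-ins for the touch labels and the three feature views:

```python
# All data here is made up; it only shows the shape of the chained problem.
rng = np.random.default_rng(0)
n, d = 200, 5
touch_labels = rng.integers(0, 2, size=n)                 # obstacle vs. free
near, medium, far = [rng.normal(size=(n, d)) + touch_labels[:, None]
                     for _ in range(3)]
far_predictor = automated_labeling(
    stages=[(None, near), (near, medium), (medium, far)],
    seed_labels=touch_labels)
print((far_predictor(far) == touch_labels).mean())        # sanity check
```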

Some of the participants in the LAGR project are using this approach.

A less extreme version was the DARPA Grand Challenge winner, where the output of a laser range finder was used to form a road-or-not predictor for a camera image.
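
Here is a cartoon of that idea, reusing the toy learner from the earlier sketch; the flatness test and all numbers are invented for illustration.

```python
rng = np.random.default_rng(1)
n_cells = 500
laser_height = rng.normal(scale=0.1, size=n_cells)        # laser height estimates
road = (np.abs(laser_height) < 0.05).astype(int)          # flat terrain ~ "road"
camera = rng.normal(size=(n_cells, 8)) + road[:, None]    # camera features per cell
# Train a camera-only road detector on the laser-derived labels.
road_detector = nearest_centroid_fit(camera, road)
print((road_detector(camera) == road).mean())             # training accuracy
```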

These automated labeling techniques transform an unsupervised learning problem into a supervised learning problem, which has huge implications: we understand supervised learning much better and can bring to bear a host of techniques.

The set of work on automated labeling is sketchy—right now it is mostly just an observed-as-useful technique for which we have no general understanding. Some relevant bits of algorithm and theory are:

  1. Reinforcement learning to classification reductions, which convert rewards into labels.
  2. Cotraining, which considers a setting containing multiple data sources. When predictors using different data sources agree on unlabeled data, an inferred label is automatically created (a small sketch follows this list).
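
To make the cotraining item concrete, here is a minimal sketch of the agreement rule, again built on the toy nearest-centroid learner from above. It illustrates the idea only; real cotraining algorithms also track confidence and limit how many inferred labels are added per round.

```python
def cotrain(X1, X2, y, labeled, rounds=5):
    # X1, X2: two feature views of the same examples.  y holds labels that are
    # trusted only where `labeled` is True; everything else starts unlabeled.
    y, labeled = y.copy(), labeled.copy()
    for _ in range(rounds):
        f1 = nearest_centroid_fit(X1[labeled], y[labeled])
        f2 = nearest_centroid_fit(X2[labeled], y[labeled])
        p1, p2 = f1(X1), f2(X2)
        agree = ~labeled & (p1 == p2)      # cross-view agreement on unlabeled data
        if not agree.any():
            break
        y[agree] = p1[agree]               # the automatically created labels
        labeled |= agree
    return y, labeled
```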

It’s easy to imagine that undiscovered algorithms and theory exist to guide and use this empirically useful technique.