Thinking the Unthought

One thing common to much research is that the researcher must be the first person ever to have some thought. How do you think of something that has never been thought of? There seems to be no methodical manner of doing this, but there are some tricks.

  1. The easiest method is to have some connection simply come to you. There is a trick here, however: write the idea down and fill it out immediately, because it can vanish just as easily.
  2. A harder method is to set aside a block of time and simply think about an idea. Eliminating distractions is essential here because thinking the unthought is hard work that your mind will try to avoid.
  3. Another common method is conversation. Sometimes the process of verbalizing brings up new ideas, and sometimes whoever you are talking to replies in just the right way. This method is dangerous though: you must speak with someone who helps you think rather than someone who merely occupies your thoughts.
  4. Try to rephrase the problem so that the answer is simple. This is one aspect of giving up, and failing fast is better than failing slowly.

There are also general ‘context development’ techniques which are not specifically helpful for your problem, but which are generally helpful for related problems.

  1. Understand the multiple motivations for working on some topic, when they exist.
  2. Question the “rightness” of every related thing. This is fundamental to developing good judgement about what to work on.
  3. Let a little bit of chaos into your life. Once in a while, attend a random conference, talk to people you would not otherwise talk to, etc.

The Limits of Learning Theory

Suppose we had an infinitely powerful mathematician sitting in a room and proving theorems about learning. Could he solve machine learning?

The answer is “no”. This answer is both obvious and sometimes underappreciated.

There are several ways to conclude that some bias is necessary in order to successfully learn. For example, suppose we are trying to solve classification: at prediction time, we observe some features X and want to predict either 0 or 1. Bias is what makes us prefer one answer over the other based on past experience. In order to learn, we must:

  1. Have a bias. A predictor which always considers 0 as likely as 1 is useless.
  2. Have the “right” bias. Predicting 1 when the answer is 0 is also not helpful.
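
To see both requirements concretely, here is a minimal no-free-lunch style sketch in Python. It is a toy construction of mine, not from the text: averaged over every possible target function, any fixed learner achieves exactly chance accuracy on unseen points, so only a bias matched to the actual world helps.

```python
import itertools

inputs = list(itertools.product([0, 1], repeat=3))  # all 8 binary feature vectors
train, test = inputs[:4], inputs[4:]                # an arbitrary train/test split

def learner(train_labels):
    """Memorize the training set; predict 0 on anything unseen (one fixed bias)."""
    table = dict(zip(train, train_labels))
    return lambda x: table.get(x, 0)

correct = total = 0
for labels in itertools.product([0, 1], repeat=8):  # every possible target function
    target = dict(zip(inputs, labels))
    predict = learner([target[x] for x in train])
    for x in test:
        correct += predict(x) == target[x]
        total += 1

print(correct / total)  # exactly 0.5: without the "right" bias, accuracy is chance
```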

The implication of “have a bias” is that we cannot design effective learning algorithms with “a uniform prior over all possibilities”. The implication of “have the ‘right’ bias” is that our mathematician fails, since “right” is defined with respect to the solutions of problems encountered in the real world. The same effect occurs in various sciences such as physics: a mathematician cannot solve physics because the “right” answer is defined by the world.

A similar question is “Can an entirely empirical approach solve machine learning?” The answer to this is “yes”, as long as we count the evolution of humans as an empirical process and accept human-level learning ability as a “solution” to machine learning.

A related question is then “Is mathematics useful in solving machine learning?” I believe the answer is “yes”. Although mathematics cannot tell us what the “right” bias is, it can:

  1. Give us computational shortcuts relevant to machine learning.
  2. Abstract empirical observations of what a good bias is, allowing transfer to new domains.

There is a reasonable hope that solving the mathematics related to learning lets us reach a good machine learning system in less time than the evolution of a human took.

All of these observations imply that the process of solving machine learning must be partially empirical. (What works on real problems?) Anyone hoping to solve it must either engage in real-world experiments or listen carefully to people who do. A reasonable model here is physics, which has benefited from a combined mathematical and empirical study.

The Health of COLT

The health of COLT (the Conference on Learning Theory, or Computational Learning Theory, depending on who you ask) has been questioned over the last few years. Low points included the merger of EuroCOLT into COLT in 2001 and record-low attendance at the 2002 Sydney COLT. This occurred in a general context of machine learning conferences growing in both number and size over the last decade.

Any discussion of why COLT has had difficulties is inherently controversial, as is any story about well-intentioned people making the wrong decisions. Nevertheless, it may be worth discussing for the sake of general understanding and in the hope of avoiding similar problems in the future. In any such discussion there is a strong tendency to identify with a conference or community in a patriotic manner that is detrimental to thinking. Keep in mind that conferences exist to further research.

My understanding (I wasn’t around) is that COLT started as a subcommunity of the computer science theory community. This implies several things:

  1. There was a basic tension facing authors: do you submit to COLT, or to FOCS or STOC, the “big” theory conferences?
  2. The research programs in COLT were motivated by theoretical concerns (rather than, say, practical experience). This includes motivations like understanding the combinatorics of some models of learning and the relationship with cryptography.

This worked well in the beginning when new research programs were being defined and new learning models were under investigation. What went wrong from there is less clear.

  1. Perhaps the community shifted focus from thinking about new learning models to simply trying to find solutions in older models, and this went stale.
  2. Perhaps some critical motivations were left out. Many of the learning models under investigation at COLT strike empirically motivated people as unlikely to ever be useful.
  3. Perhaps the conference/community was not inviting enough to new forms of learning theory. Many pieces of learning theory have not appeared at COLT over the last 20 years.

These concerns have been addressed since COLT’s low point, but the long-term health is still questionable: ICML has been accepting learning theory with plausible empirical motivations, and a mathematical learning theory conference has appeared, so authors now have several venues to choose from.

The good news is that this year’s COLT appeared healthy. The topics covered by the program were diverse and often interesting. Several of the papers seem quite relevant to the practice of machine learning. Perhaps an even better measure is that there were many younger people in attendance.

The Role of Impromptu Talks

COLT had an impromptu session which seemed as interesting as, or more interesting than, any other single technical session (despite being only an hour long). There are several roles that an impromptu session can play, including:

  1. Announcing new work completed since the paper deadline. Letting this happen sooner rather than later aids the process of research.
  2. Discussing a paper that was rejected. Reviewers sometimes err, and an impromptu session provides a means to remedy that.
  3. Entertainment. We all like to have a bit of fun.

For design, the following seem important:

  1. Impromptu speakers should not have much time. At COLT, it was 8 minutes, but I have seen even 5 work well.
  2. The entire impromptu session should not last too long because the format is dense and promotes restlessness. A half hour or hour can work well.

Impromptu talks are a mechanism to let a little bit of chaos into the schedule. They will be chaotic in content, presentation, and usefulness. The fundamental advantage of this chaos is that it provides a means for covering material that the planned program did not (or could not). This seems like a “bargain use of time” considering the short duration. One caveat is that it is unclear how well this mechanism can scale to large conferences.

Not EM for clustering at COLT

One standard approach to clustering data with a set of Gaussians is EM. Roughly speaking, you pick k random Gaussians and then alternate expectation and maximization steps to (hopefully) find a set of Gaussians that “explain” the data well. This process is difficult to work with because EM can become “stuck” in local optima. There are various hacks like “rerun with t different random starting points”.
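
As a concrete illustration of this standard approach and the restart hack, here is a minimal sketch using scikit-learn’s GaussianMixture (assuming numpy and scikit-learn are available); the two-cluster data is a hypothetical toy example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),   # cluster near the origin
                  rng.normal(5.0, 1.0, size=(100, 2))])  # cluster near (5, 5)

# n_init=10 is the "rerun with t different random starting points" hack:
# EM runs from 10 random initializations and the best-scoring fit is kept.
gmm = GaussianMixture(n_components=2, n_init=10, random_state=0)
gmm.fit(data)
print(gmm.means_)  # hopefully close to the true cluster centers
```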

One cool observation is that this can often be solved via other algorithms which do not suffer from local optima. An early paper shows this, and Ravi Kannan presented a new paper showing it is possible in a much more adaptive setting.

A very rough summary of these papers is that by projecting into a lower dimensional space, it becomes computationally tractable to pick out the gross structure of the data. It is unclear how well these algorithms work in practice, but they might be effective, especially if used as a subroutine of the following form (sketched in code below):

  1. Project to low dimensional space.
  2. Pick out gross structure.
  3. Project gross structure into the high dimensional space.
  4. Run EM (or some other local improvement algorithm) to find a final fit.

The effect of steps 1-3 is to “seed” the local optimization algorithm at a good starting point, from which a global optimum is plausibly reachable.
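
Here is a minimal sketch of that subroutine in Python, assuming numpy and scikit-learn are available. It is my own construction rather than the algorithms from the papers: PCA stands in for the projection step and k-means picks out the gross structure.

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def seeded_gmm(data, k, low_dim):
    # 1. Project to a low dimensional space.
    pca = PCA(n_components=low_dim)
    low = pca.fit_transform(data)
    # 2. Pick out the gross structure (k-means is one simple choice).
    centers_low = KMeans(n_clusters=k, n_init=10).fit(low).cluster_centers_
    # 3. Project the gross structure back into the high dimensional space.
    centers_high = pca.inverse_transform(centers_low)
    # 4. Run EM from that seed to find the final fit.
    return GaussianMixture(n_components=k, means_init=centers_high).fit(data)
```

Any dimensionality reduction could be substituted in step 1 and any coarse clustering method in step 2; the essential design choice is that only the final EM step runs in the original high dimensional space.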