One of the confusing things about research is that progress is very hard to measure. One of the consequences of being in a hard-to-measure environment is that the wrong things are often measured.
- Lines of Code: The classical example of this phenomenon is the old lines-of-code-produced metric for programming. It is easy to imagine systems for producing many lines of code with very little work that accomplish very little.
- Paper count: In academia, a “paper count” is an analog of “lines of code”, and it suffers from the same failure modes. The obvious failure mode here is that we end up with a large number of uninteresting papers since people end up spending a lot of time optimizing this metric.
- Complexity: Another metric is the “complexity” (in the eye of a reviewer) of a paper. There is a common temptation to make a method appear more complex than it is in order for reviewers to judge it worthy of publication. The failure mode here is unclean thinking. Simple, effective methods are often overlooked in favor of complex, relatively ineffective ones. This is simply wrong for any field. (Discussion at Lance’s blog.)
- Acceptance Rate: “Acceptance rate” is the number of papers accepted divided by the number of papers submitted. A low acceptance rate is often considered desirable for a conference. But:
- It’s easy to skew an acceptance rate by adding (or inviting) many weak or bogus papers.
- It’s very difficult to judge what, exactly, is good work in the long term. Consequently, a low acceptance rate can retard progress by simply raising the bar too high for what turns out to be a good idea when it is more fully developed. (Consider the limit where only one paper is accepted per year…)
- Accept/reject decisions can become more “political” and less about judging the merits of a paper/idea. With a low acceptance ratio, a strong objection by any one of several reviewers might torpedo a paper. The consequence of this is that papers become noncontroversial with a tendency towards incremental improvements.
- A low acceptance rate tends to spawn a multiplicity of conferences in one area. There is a strong multiplicity of learning-related conferences.
(see also How to increase the acceptance ratios at top conferences?)
- Citation count: Counting citations is somewhat better than counting papers because it is some evidence that an idea is actually useful. This has been particularly aided by automated citation counting systems like scholar.google.com and http://citeseer.ist.psu.edu/. However, there are difficulties: citation counts can be optimized using self-citation and “societies of mutual admiration” (groups of people who agree implicitly or explicitly to cite each other). Citations are also sometimes negative, of the form “here we fix bad idea X”.
- See also the Academic Mechanism Design post for other ideas.
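To make the acceptance-rate point concrete, here is a minimal sketch of the arithmetic. All numbers are hypothetical and chosen only for illustration; the function name `acceptance_rate` is likewise an assumption, not something from any conference system.

```python
def acceptance_rate(accepted: int, submitted: int) -> float:
    """Fraction of submitted papers that were accepted."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    return accepted / submitted


# Hypothetical numbers, for illustration only.
accepted = 100
real_submissions = 400
padding = 600  # weak or bogus submissions, all of which are rejected

honest_rate = acceptance_rate(accepted, real_submissions)
padded_rate = acceptance_rate(accepted, real_submissions + padding)

print(f"without padding: {honest_rate:.0%}")  # 25%
print(f"with padding:    {padded_rate:.0%}")  # 10%
```

The same 100 papers are accepted in both cases; only the denominator changed, which is why the rate alone says little about the quality of what was accepted.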
These metrics do have some meaning. A programmer who writes no lines of code isn’t very good. An academic who produces no papers isn’t very good. A conference that doesn’t aid information filtration isn’t helpful. Hard problems often require complex solutions. Important papers are often cited.
Nevertheless, optimizing these metrics is not beneficial for a field of research. In thinking about this, we must clearly differentiate 1) what is good for a field of research (solving important problems) and 2) what is good for individual researchers (getting jobs). The essential point here is that there is a disparity.
Any individual in academia cannot avoid being judged by these metrics. Attempts by an individual or a small group of individuals to ignore these metrics are unlikely to change the system (and are likely to result in the individual or small group being judged badly).
I don’t believe there is an easy fix to this problem. The best we can hope for is incremental progress which takes the form of the leadership in the academic community introducing new, saner metrics. This is a difficult thing, particularly because any academic leader must have succeeded in the old system. Nevertheless, it must happen if academic-style research is to flourish.
In the spirit of being constructive, I’ll make one proposal which may address the “complexity” problem: judge the importance of a piece of work independent of the method. For a conference paper this might be done by changing the review process to have one “technical reviewer” and several “importance reviewers” rather than 3 or 4 reviewers. The “importance reviewer” job is easier than the current standard: they must simply understand the problem being solved and rate how important this problem is. The technical reviewer’s job is harder than the current standard: they must verify that all claims of solution to the problem are met. Overall, the amount of work by reviewers would stay constant, and perhaps we would avoid the preference for complex solutions.
A recent discussion indicated that one goal of this blog might be to allow people to post comments about recent papers that they liked. I think this could potentially be very useful, especially for those with diverse interests but only finite time to read through conference proceedings. ACL 2005 recently completed, and here are four papers from that conference that I thought were either good or perhaps of interest to a machine learning audience.
David Chiang, A Hierarchical Phrase-Based Model for Statistical Machine Translation. (Best paper award.) This paper takes the standard phrase-based MT model that is popular in our field (basically, translate a sentence by individually translating phrases and reordering them according to a complicated statistical model) and extends it to take into account hierarchy in phrases, so that you can learn things like “X ’s Y” -> “Y de X” in Chinese, where X and Y are arbitrary phrases. This takes a step toward linguistic syntax for MT, which our group is working strongly on, but doesn’t require any linguists to sit down and write out grammars or parse sentences.
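The reordering rule mentioned above can be illustrated with a toy sketch. This is not the paper's model (which learns and weights its rules statistically); it only shows what it means for X and Y to be arbitrary, possibly nested, phrases. The function name and the tokenization (a separate `'s` token) are assumptions, and the words themselves are left untranslated.

```python
def apply_possessive_rule(tokens):
    """Rewrite "X 's Y" as "Y de X", where X and Y are arbitrary sub-phrases."""
    if "'s" not in tokens:
        return list(tokens)
    i = tokens.index("'s")
    x, y = tokens[:i], tokens[i + 1:]
    # X and Y are phrases in their own right, so the rule may apply inside them.
    return apply_possessive_rule(y) + ["de"] + apply_possessive_rule(x)


print(" ".join(apply_possessive_rule("the president 's speech".split())))
# speech de the president
print(" ".join(apply_possessive_rule("the president 's advisor 's speech".split())))
# speech de advisor de the president
```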
Rie Kubota Ando and Tong Zhang, A High-Performance Semi-Supervised Learning Method for Text Chunking. This is more of a machine learning style paper, where they improve a sequence labeling task by augmenting it with models from related tasks for which data is free. For example, I might train a model that, given a context with a missing word, will predict the word (e.g., for “The ____ gave a speech” it should insert “president”). By doing so, you can use these other models to give additional useful information to your main task.
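A minimal sketch of why data for such a related task is “free”: any unlabeled sentence yields (context, missing word) training pairs with no annotation effort. This covers only the data-creation step; how the resulting models feed into the main task is not shown, and the function name and mask token are assumptions.

```python
def masked_word_examples(sentence, mask="____"):
    """Turn one unlabeled sentence into (context, missing word) training pairs."""
    tokens = sentence.split()
    examples = []
    for i, word in enumerate(tokens):
        # Hide the i-th word; the hidden word becomes the label to predict.
        context = tokens[:i] + [mask] + tokens[i + 1:]
        examples.append((" ".join(context), word))
    return examples


for context, word in masked_word_examples("The president gave a speech"):
    print(f"{context!r} -> {word!r}")
```

Each sentence of length n gives n labeled examples, so the amount of training data for the auxiliary task is limited only by the amount of raw text available.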
Noah A. Smith and Jason Eisner, Contrastive Estimation: Training Log-Linear Models on Unlabeled Data. This paper talks about training sequence labeling models in an unsupervised fashion, basically by contrasting what the model does on the correct string with what the model does on a corrupted version of the string. They get significantly better results than just by using EM in an HMM, and the idea is pretty nice.
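As an illustration of what a “corrupted version of the string” could look like, here is a sketch of one hypothetical corruption: swapping a pair of adjacent words. The choice of corruption and the function name are assumptions made for this example, and the training objective that contrasts the observed string with these corruptions is not shown.

```python
def adjacent_swaps(tokens):
    """All strings obtained by swapping one pair of adjacent tokens."""
    original = list(tokens)
    neighbors = []
    for i in range(len(original) - 1):
        swapped = list(original)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        # Swapping two identical words changes nothing, so skip that case.
        if swapped != original:
            neighbors.append(swapped)
    return neighbors


for neighbor in adjacent_swaps("the dog barks".split()):
    print(" ".join(neighbor))
# dog the barks
# the barks dog
```

The intuition is that a good model should prefer the observed sentence to these nearby, mostly ungrammatical alternatives.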
Patrick Pantel, Inducing Ontological Co-occurrence Vectors. This is a pretty neat idea (though I’m biased — Patrick is a friend) where one attempts to come up with feature vectors that describe nodes in a semantic hierarchy (ontology) that could enable you to figure out where to insert new words that are not in your ontology. The results are pretty good, and the method is fairly simple; I’d imagine that a more complex model/learning framework could improve the model even further.
This is the six-month point in the “run a research blog” experiment, so it seems like a good point to take stock and assess.
One fundamental question is: “Is it worth it?” The idea of running a research blog will never become widely popular and useful unless it actually aids research. On the negative side, composing ideas for a post and maintaining a blog takes a significant amount of time. On the positive side, the process might yield better research because there is an opportunity for better, faster feedback implying better, faster thinking.
My answer at the moment is a provisional “yes”. Running the blog has been incidentally helpful in several ways:
- It is sometimes educational (example).
- More often, the process of composing thoughts well enough to post simply aids thinking. This has resulted in a couple of solutions to problems of interest (and perhaps more over time). If you really want to solve a problem, letting the world know is helpful. This isn’t necessarily because the world will help you solve it, but it’s helpful nevertheless.
- In addition, posts by others have helped frame thinking about “What are important problems people care about?”, and why. In the end, working on the right problem is invaluable.
I wanted to expand on this post and some of the previous problems/research directions about where learning theory might make large strides.
- Why theory? The essential reason for theory is “intuition extension”. A very good applied learning person can master some particular application domain yielding the best computer algorithms for solving that problem. A very good theory can take the intuitions discovered by this and other applied learning people and extend them to new domains in a relatively automatic fashion. To do this, we take these basic intuitions and try to find a mathematical model that:
- Explains the basic intuitions.
- Makes new testable predictions about how to learn.
- Succeeds in so learning.
This is “intuition extension”: taking what we have learned somewhere else and applying it in new domains. It is fundamentally useful to everyone because it increases the level of automation in solving problems.
- Where next for learning theory? I like the analogy with physics. Back before we-the-humans knew much, people would experiment occasionally and learn to design new things by slow evolution. At some point the physics model arose: you try to build mathematical models of what is happening and then make predictions based on the models. This was wildly successful for physics. For machine learning, it has only been moderately successful. We have some formalisms which are of some use in addressing novel learning problems, but the overall process of doing machine learning is not very close to “automatic”. The good news is that over the last 20 years a much richer set of positive examples of successful applied machine learning has developed. Thus, there are many good intuitions from which we can hope to generalize. In the physics analogy, the year is (perhaps) 1900. Here are a few specific issues:
- What is the “right” mathematical model of learning? (In analogy: what is the “right” mathematical model of physical phenomena?) The models we currently use have their compelling points but typically fail to capture all of the relevant details. This is a very hard question to address, but it should be actively considered and any progress may be very helpful. Examples of this include:
- What is the “right” model of active learning? We know almost nothing except there is great potential.
- What is the “right” model of reinforcement learning? Again, we know very little in comparison to what we want to know: a fully automatic general RL solver.
The notion of “right” here is partially theoretical (can we derive efficient algorithms?) and partially empirical (do they actually work?).
- How do we refine the empirical observations and intuitions of applied learning?
- How should we think about “prior”? The Bayesian answer seems unconvincing. At a minimum, information used to create a Bayesian prior often does not come in the form of a Bayesian prior, and so some translation system must be developed.
- How can we develop big learning systems that solve big problems? Some form of structure seems necessary, but the right form is still unclear. What theory governs the design of such systems?
- How do we take existing theoretical insights and translate them into practical algorithms?
- The method of linear projection into spaces has been studied theoretically. Is it useful empirically?
- The online learning setting seems theoretically compelling and, at least sometimes, empirically validated. What concerns remain to be addressed to make this a useful technology?
We should keep in mind that there is a real chance the limits of machine learning are lower bounded by human learning. Getting from here to there of course will require a bit of work, some of which might be greatly aided by mathematical consideration.
Rajat Raina presented a paper on the technique they used for the PASCAL Recognizing Textual Entailment challenge.
“Text entailment” is the problem of deciding if one sentence implies another. For example, the previous sentence entails:
- Text entailment is a decision problem.
- One sentence can imply another.
The challenge was of the form: given an original sentence and another sentence, predict whether there was an entailment. All current techniques for predicting correctness of an entailment are at the “flail” stage, with accuracies of around 58% where humans could achieve near 100% accuracy, so there is much room to improve. Apparently, there may be another PASCAL challenge on this problem in the near future.
Some of the “sister conference” presentations at AAAI have been great. Roughly speaking, the conference organizers asked other conference organizers to come give a summary of their conference. Many different AI-related conferences accepted. The presenters typically discuss some of the background and goals of the conference then mention the results from a few papers they liked. This is great because it provides a mechanism to get a digested overview of the work of several thousand researchers—something which is simply available nowhere else.
Based on these presentations, it looks like there is a significant component of (and opportunity for) applied machine learning in AIIDE, IUI, and ACL.
There was also some discussion of having a super-colocation event similar to FCRC, but centered on AI & Learning. This seems like a fine idea. The field is fractured across so many different conferences that the mixing of a super-colocation seems likely to be helpful for research.
The AAAI conference is running a student blog which looks like a fun experiment.
One thing common to much research is that the researcher must be the first person ever to have some thought. How do you think of something that has never been thought of? There seems to be no methodical manner of doing this, but there are some tricks.
- The easiest method is to just have some connection come to you. There is a trick here however: you should write it down and fill out the idea immediately because it can just as easily go away.
- A harder method is to set aside a block of time and simply think about an idea. Distraction elimination is essential here because thinking about the unthought is hard work which your mind will avoid.
- Another common method is in conversation. Sometimes the process of verbalizing causes new ideas to come up, and sometimes whoever you are talking to replies in just the right way. This method is dangerous though: you must speak to someone who helps you think rather than someone who occupies your thoughts.
- Try to rephrase the problem so the answer is simple. This is one aspect of giving up. Failing fast is better than failing slow.
There are also general ‘context development’ techniques which are not specifically helpful for your problem, but which are generally helpful for related problems.
- Understand the multiple motivations for working on some topic, when they exist.
- Question the “rightness” of every related thing. This is fundamental to finding good judgement in what you work on.
- Let a little bit of chaos into your life. Once in a while, attend a random conference, talk to people who you would not otherwise talk to, etc…
Suppose we had an infinitely powerful mathematician sitting in a room and proving theorems about learning. Could he solve machine learning?
The answer is “no”. This answer is both obvious and sometimes underappreciated.
There are several ways to conclude that some bias is necessary in order to successfully learn. For example, suppose we are trying to solve classification. At prediction time, we observe some features X and want to make a prediction of either 0 or 1. Bias is what makes us prefer one answer over the other based on past experience. In order to learn we must:
- Have a bias. Always predicting that 0 is as likely as 1 is useless.
- Have the “right” bias. Predicting 1 when the answer is 0 is also not helpful.
The implication of “have a bias” is that we can not design effective learning algorithms with “a uniform prior over all possibilities”. The implication of “have the ‘right’ bias” is that our mathematician fails since “right” is defined with respect to the solutions to problems encountered in the real world. The same effect occurs in various sciences such as physics—a mathematician can not solve physics because the “right” answer is defined by the world.
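The “uniform prior over all possibilities” point can be checked by brute force on a toy problem. The sketch below assumes a tiny domain of 4 points (2 seen in training, 2 unseen) and averages over every possible 0/1 labeling with equal weight; the domain size, the two learners, and all names are assumptions made for illustration.

```python
from itertools import product


def average_unseen_accuracy(learner, n_points=4, n_train=2):
    """Accuracy on unseen points, averaged uniformly over all 0/1 labelings."""
    points = list(range(n_points))
    train, test = points[:n_train], points[n_train:]
    total, count = 0.0, 0
    for labels in product([0, 1], repeat=n_points):
        training_set = [(x, labels[x]) for x in train]
        predict = learner(training_set)
        correct = sum(predict(x) == labels[x] for x in test)
        total += correct / len(test)
        count += 1
    return total / count


def majority_learner(training_set):
    """Predict the most common training label everywhere (ties go to 1)."""
    ones = sum(y for _, y in training_set)
    guess = 1 if 2 * ones >= len(training_set) else 0
    return lambda x: guess


def always_zero_learner(training_set):
    """Ignore the training data and always predict 0."""
    return lambda x: 0


print(average_unseen_accuracy(majority_learner))     # 0.5
print(average_unseen_accuracy(always_zero_learner))  # 0.5
```

When every labeling is equally likely, the training labels carry no information about the unseen points, so a learner that looks at the data does no better than one that ignores it. Doing better than chance requires assuming that some labelings are more likely than others, which is exactly a bias.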
A similar question is “Can an entirely empirical approach solve machine learning?”. The answer to this is “yes”, as long as we accept the evolution of humans and that a “solution” to machine learning is human-level learning ability.
A related question is then “Is mathematics useful in solving machine learning?” I believe the answer is “yes”. Although mathematics can not tell us what the “right” bias is, it can:
- Give us computational shortcuts relevant to machine learning.
- Abstract empirical observations of what an empirically good bias is, allowing transference to new domains.
There is a reasonable hope that solving mathematics related to learning implies we can reach a good machine learning system in time shorter than the evolution of a human.
All of these observations imply that the process of solving machine learning must be partially empirical. (What works on real problems?) Anyone hoping to do so must either engage in real-world experiments or listen carefully to people who engage in real-world experiments. A reasonable model here is physics, which has benefited from a combined mathematical and empirical study.
The health of COLT (Conference on Learning Theory or Computational Learning Theory depending on who you ask) has been questioned over the last few years. Low points for the conference occurred when EuroCOLT merged with COLT in 2001, and the attendance at the 2002 Sydney COLT fell to a new low. This occurred in the general context of machine learning conferences rising in both number and size over the last decade.
Any discussion of why COLT has had difficulties is inherently controversial as is any story about well-intentioned people making the wrong decisions. Nevertheless, this may be worth discussing in the hope of avoiding problems in the future and general understanding. In any such discussion there is a strong tendency to identify with a conference/community in a patriotic manner that is detrimental to thinking. Keep in mind that conferences exist to further research.
My understanding (I wasn’t around) is that COLT started as a subcommunity of the computer science theory community. This implies several things:
- There was a basic tension facing authors: Do you submit to COLT or to FOCS or STOC which are the “big” theory conferences?
- The research programs in COLT were motivated by theoretical concerns (rather than, say, practical experience). This includes motivations like understanding the combinatorics of some models of learning and the relationship with crypto.
This worked well in the beginning when new research programs were being defined and new learning models were under investigation. What went wrong from there is less clear.
- Perhaps the community shifted focus from thinking about new learning models to simply trying to find solutions in older models, and this went stale.
- Perhaps some critical motivations were left out. Many of the learning models under investigation at COLT strike empirically motivated people as unlikely to be useful.
- Perhaps the conference/community was not inviting enough to new forms of learning theory. Many pieces of learning theory have not appeared at COLT over the last 20 years.
These concerns have been addressed since the low point of COLT, but the long term health is still questionable: ICML has been accepting learning theory with plausible empirical motivations and a mathematical learning theory conference has appeared so there are several choices of venue available to authors.
The good news is that this year’s COLT appeared healthy. The topics covered by the program were diverse and often interesting. Several of the papers seem quite relevant to the practice of machine learning. Perhaps an even better measure is that there were many younger people in attendance.
COLT had an impromptu session which seemed as interesting as, or more interesting than, any other single technical session (despite being only an hour long). There are several roles that an impromptu session can play including:
- Announcing new work since the paper deadline. Letting this happen now rather than later aids the process of research.
- Discussing a paper that was rejected. Reviewers err sometimes and an impromptu session provides a means to remedy that.
- Entertainment. We all like to have a bit of fun.
For design, the following seem important:
- Impromptu speakers should not have much time. At COLT, it was 8 minutes, but I have seen even 5 work well.
- The entire impromptu session should not last too long because the format is dense and promotes restlessness. A half hour or hour can work well.
Impromptu talks are a mechanism to let a little bit of chaos into the schedule. They will be chaotic in content, presentation, and usefulness. The fundamental advantage of this chaos is that it provides a means for covering material that the planned program did not (or could not). This seems like a “bargain use of time” considering the short duration. One caveat is that it is unclear how well this mechanism can scale to large conferences.