Machine Learning (Theory)

9/30/2005

Research in conferences

Tags: General jl@ 12:52 pm

Conferences exist as part of the process of doing research. They provide many roles including “announcing research”, “meeting people”, and “point of reference”. Not all conferences are alike so a basic question is: “to what extent do individual conferences attempt to aid research?” This question is very difficult to answer in any satisfying way. What we can do is compare details of the process across multiple conferences.

  1. Comments The average quality of comments across conferences can vary dramatically. At one extreme, the tradition in CS theory conferences is to provide essentially zero feedback. At the other extreme, some conferences have a strong tradition of providing detailed constructive feedback. Detailed feedback can give authors significant guidance about how to improve research. This is the most subjective entry.
  2. Blind Virtually all conferences offer single blind review where authors do not know reviewers. Some also provide double blind review where reviewers do not know authors. The intention with double blind reviewing is to make the conference more approachable to first-time authors.
  3. Author Feedback Author feedback is a mechanism where authors can provide feedback to reviewers (and, to some extent, complain). Providing an author feedback mechanism provides an opportunity for the worst reviewing errors to be corrected.
  4. Conditional Accepts A conditional accept is some form of “we will accept this paper if conditions X,Y, and Z are met”. A conditional accept allows reviewers to demand different experiments or other details they need in order to make a decision. This might speed up research significantly because otherwise good papers need not wait another year.
  5. Papers/PC member How many papers can one person actually review well? When there is an incredible load of papers to review, it becomes very tempting to make snap decisions without a thorough attempt at understanding. Snap decisions are often wrong. These numbers are based on the number of submissions with a computer science standard of 3 reviews per paper.

Each of these “options” make reviewing more difficult by requiring more reviewer work. There is a basic trade-off between the amount of time spent reviewing vs. working on new research and the speed of the review process itself. It is unclear where this optimal trade-off point lies, but the easy default is “not enough time spent reviewing” because reviewing is generally an unrewarding job.

It seems reasonable to cross reference these options with some measures of ‘conference impact’. For each of these, it’s important to realize these are not goal metrics and so their meaning is unclear. The best that can be said is that it is not bad to do well. Also keep in mind that measurements of “impact” are inherently “trailing indicators” which are not necessarily relevant to the way the conference is currently run.

  1. average citations Citeseer has been used to estimate the average impact of a conference’s papers here using the average number of citations per paper.
  2. max citations A number of people believe that the maximum number of citations given to any one paper is a strong indicator of the success of the conference. This can be measured by going to scholar.google.com and using ‘advanced search’ for the conference name.
Conference Comments blindness author feedback conditional accepts Reviews/PC member log(average citations per paper+1) max citations
ICML Sometimes Helpful Double Yes Yes 8 2.12 1079
AAAI Sometimes Helpful Double Yes No 8 1.87 650
COLT Sometimes Helpful Single No No 15? 1.49 710
NIPS Sometimes Helpful/Sometimes False Single Yes No 113(*) 1.06 891
CCC Sometimes Helpful Single No No 24 1.25 142
STOC Not Helpful Single No No 41 1.69 611
SODA Not Helpful Single No No 56 1.51 175

(*) To some extent this is a labeling problem. NIPS has an organized process of finding reviewers very similar to ICML. They are simply not called PC members.

Keep in mind that the above is a very incomplete list (it only includes the conferences that I interacted with) and feel free to add details in the comments.

9/26/2005

Prediction Bounds as the Mathematics of Science

Tags: Prediction Theory jl@ 3:12 pm

“Science” has many meanings, but one common meaning is “the scientific method” which is a principled method for investigating the world using the following steps:

  1. Form a hypothesis about the world.
  2. Use the hypothesis to make predictions.
  3. Run experiments to confirm or disprove the predictions.

The ordering of these steps is very important to the scientific method. In particular, predictions must be made before experiments are run.

Given that we all believe in the scientific method of investigation, it may be surprising to learn that cheating is very common. This happens for many reasons, some innocent and some not.

  1. Drug studies. Pharmaceutical companies make predictions about the effects of their drugs and then conduct blind clinical studies to determine their effect. Unfortunately, they have also been caught using some of the more advanced techniques for cheating here: including “reprobleming”, “data set selection”, and probably “overfitting by review”. It isn’t too surprising to observe this: when the testers of a drug have $109 or more riding on the outcome the temptation to make the outcome “right” is extreme.
  2. Wrong experiments. When conducting experiments of some new phenomena, it is common for the experimental apparatus to simply not work right. In that setting, throwing out the “bad data” can make the results much cleaner… or it can simply be cheating. Millikan did this in the ‘oil drop’ experiment which measured the electron charge.

Done right, allowing some kinds of “cheating” may be helpful to the progress of science since we can more quickly find the truth about the world. Done wrong, it results in modern nightmares like painkillers that cause heart attacks. (Of course, the more common outcome is that the drugs effectiveness is just overstated.)

A basic question is “How do you do it right?” And a basic answer is “With prediction theory bounds”. Each prediction bound has a number of things in common:

  1. They assume that the data is independently and identically drawn. This is well suited to experimental situations where experimenters work very hard to make different experiments be independent. In fact, this is a better fit than typical machine learning applications where independence of the data is typically more questionable or simply false.
  2. They make no assumption about the distribution that the data is drawn from. This is important for experimental testing of predictions because the distribution that observations are expected to come from is a part of the theory under test.

These two properties above form an ‘equivalence class’ over different mathematical bounds where each bound can be trusted to an equivalent degree. Inside of this equivalent class there are several that may be helpful in determining whether deviations from the scientific method are reasonable or not.

  1. The most basic test set bound corresponds to the scientific method above.
  2. The Occam’s Razor bound allows a careful reordering of steps (1), (2) and step (3). More “interesting” bounds like the VC-bound and the PAC-Bayes bound allow more radical alterations of these steps. Several are discussed here.
  3. The Sample Compression bound allows careful disposal of some datapoints.
  4. Progressive Validation bounds (such as here, here or here) allow hypotheses to be safely reformulated in arbitrary ways as experiments progress.

Scientific experimenters looking for a little extra flexibility in the scientific method may find these approaches useful. (And if they don’t, maybe there is another bound in this equivalence class that needs to be worked out.)

9/20/2005

Workshop Proposal: Atomic Learning

Tags: General jl@ 5:18 pm

This is a proposal for a workshop. It may or may not happen depending on the level of interest. If you are interested, feel free to indicate so (by email or comments).

Description:
Assume(*) that any system for solving large difficult learning problems must decompose into repeated use of basic elements (i.e. atoms). There are many basic questions which remain:

  1. What are the viable basic elements?
  2. What makes a basic element viable?
  3. What are the viable principles for the composition of these basic elements?
  4. What are the viable principles for learning in such systems?
  5. What problems can this approach handle?

Hal Daume adds:

  1. Can composition of atoms be (semi-) automatically constructed[?]
  2. When atoms are constructed through reductions, is there some notion of the “naturalness” of the created leaning problems?
  3. Other than Markov fields/graphical models/Bayes nets, is there a good language for representing atoms and their compositions?

The answer to these and related questions remain unclear to me. A workshop gives us a chance to pool what we have learned from some very different approaches to tackling this same basic goal.

(*) As a general principle, it’s very difficult to conceive of any system for solving any large problem which does not decompose.

Plan Sketch:

  1. A two day workshop with unhurried presentations and discussion seems appropriate, especially given the diversity of approaches.
  2. TTI-Chicago may be able to help with costs.

The above two points suggest having a workshop on a {Friday, Saturday} or {Saturday, Sunday} at TTI-Chicago.

9/19/2005

NIPS Workshops

Tags: General jl@ 3:46 pm

Attendance at the NIPS workshops is highly recommended for both research and learning. Unfortunately, there does not yet appear to be a public list of workshops. However, I found the following workshop webpages of interest:

  1. Machine Learning in Finance
  2. Learning to Rank
  3. Foundations of Active Learning
  4. Machine Learning Based Robotics in Unstructured Environments

There are many more workshops. In fact, there are so many that it is not plausible anyone can attend every workshop they are interested in. Maybe in future years the organizers can spread them out over more days to reduce overlap.

Many of these workshops are accepting presentation proposals (due mid-October).

9/14/2005

The Predictionist Viewpoint

Tags: General jl@ 12:54 pm

Virtually every discipline of significant human endeavor has a way explaining itself as fundamental and important. In all the cases I know of, they are both right (they are vital) and wrong (they are not solely vital).

  1. Politics. This is the one that everyone is familiar with at the moment. “What could be more important than the process of making decisions?”
  2. Science and Technology. This is the one that we-the-academics are familiar with. “The loss of modern science and technology would be catastrophic.”
  3. Military. “Without the military, a nation will be invaded and destroyed.”
  4. (insert your favorite here)

Within science and technology, the same thing happens again.

  1. Mathematics. “What could be more important than a precise language for establishing truths?”
  2. Physics. “Nothing is more fundamental than the laws which govern the universe. Understanding them is the key to understanding everything else.”
  3. Biology. “Without life, we wouldn’t be here, so clearly the study of life is fundamental.”
  4. Computer Science. “Everything is a computer. Controlling computation is fundamental to controlling the world.”

This post is a “me too” for machine learning. The basic claim is that all problems can be rephrased as prediction problems. In particular, for any agent (human or machine), there are things which are sensed and the goal is make good predictions about which actions to take. Here are some examples:

  1. Soccer. Playing soccer with Peter Stone is interesting because he sometimes reacts to a pass before it is made. The ability to predict what will happen in the future is a huge edge in games.
  2. Defensive Driving is misnamed. It’s really predictive driving. You, as a driver, attempt to predict how the other cars around you can mess up, and take that into account in your own driving style.
  3. Predicting well can make you very wealthy by playing the stock market. Some companies have been formed around the idea of automated stock picking, with partial success. More generally, the idea of prediction as the essential ingredient is very common when gambling with stocks.
  4. Information markets generalize the notion of stock picking to make predictions about arbitrary facts.

Prediction problems are prevalent throughout our lives so studying the problems and their solution, which is a core goal of machine learning, is essential. From the predictionist viewpoint, it is not about what you know, what you can prove or infer, who your friends are, or how much wealth you have. Instead, it’s about how well you can predict (and act on predictions of) the future.

9/12/2005

Fast Gradient Descent

Tags: Papers, Supervised jl@ 9:27 am

Nic Schaudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD).

Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent.

  1. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”.
  2. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable.
  3. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be is unclear. Many people have developed many algorithms for adjusting the step size (and to some extent the step direction). Unfortunately, many of the more sophisticated algorithms are not robust to noise, scale badly with the number of parameters (Anything worse than O(n) is unacceptable for big applications) or both. Consequently, many people simply use gradient descent where the step size is adjusted by a simple momentum heuristic.

Many people would add point (4): gradient descent on many architectures does not result in a global optima. This seems like a confusion of goals to me. The goal is good performance on future examples in learning rather than achieving a global optima on the training set.

SMD addresses point (3). It is an O(n) algorithm for gradient descent that can compete with the sophisticed methods where the sophisticated methods work but remains fairly robust to noise. Exactly how well it addresses point (3) is not entirely clear, but a few interesting problems have been solved with the algorithm, and perhaps we will see more evidence in the near future.

9/10/2005

“Failure” is an option

Tags: General jl@ 1:14 pm

This is about the hard choices that graduate students must make.

The cultural definition of success in academic research is to:

  1. Produce good research which many other people appreciate.
  2. Produce many students who go on to do the same.

There are fundamental reasons why this is success in the local culture. Good research appreciated by others means access to jobs. Many students succesful in the same way implies that there are a number of people who think in a similar way and appreciate your work.

In order to graduate, a phd student must live in an academic culture for a period of several years. It is common to adopt the culture’s definition of success during this time. It’s also common for many phd students discover they are not suited to an academic research lifestyle. This collision of values and abilities naturally results in depression.

The most fundamental advice when this happens is: change something. Pick a new advisor. Pick a new research topic. Or leave the program (and do something else with your life).

The first two are relatively easy, but “Do something else with your life” is a hard choice for a phd student to make because they are immersed in and adopt a value system that does not value that choice. Remember here that the academic value system is not a universal value system. For example, many people want to do something that is immediately constructive and find this at odds with academic research (which is almost defined by “not immediate”). The world is big enough and diverse enough to support multiple value systems. Realizing this may be the key to making very good decisions in your life. A number of my friends made this decision and went to google or investment banking places where they are deliriously happier (and more productive) than in their former lives.

9/8/2005

Online Learning as the Mathematics of Accountability

Tags: Online jl@ 9:59 am

Accountability is a social problem. When someone screws up, do you fire them? Or do you accept the error and let them continue? This is a very difficult problem and we all know of stories where the wrong decision was made.

Online learning (as meant here), is a subfield of learning theory which analyzes the online learning model.

In the online learning model, there are a set of hypotheses or “experts”. On any instantance x, each expert makes a prediction y. A master algorithm A uses these predictions to form it’s own prediction yA and then learns the correct prediction y*. This process repeats.

The goal of online learning is to find a master algorithm A which uses the advice of the experts to make good predictions. In particular, we typically want to guarantee that the master algorithm performs almost as well as the best expert. If L(e) is the loss of expert e and L(A) is the loss of the master algorithm, it is often possible to prove:

L(A) less than mine L(e) + log(number of experts)

over all sequences.

In particular, there is no assumption of independent samples and there is no assumption that the experts perform well (or can perform well). This is not a high probability statement: it simply always holds. These assumption-free qualities are very important for application to the accountability problem, because the experts really can be adversarial.

In any situation where we have a set of human experts giving advice on the same subject, we can hope to apply online learning algorithms to better distill collective advice into single prediction. Examples include:

  1. stock picking Many of the big stocks have ‘analysts’ who predict whether they will go up or down. For an example, look here.
  2. surveys It is common to see things like “A gain of 1.4 percent was expected, based on the median estimate in a Bloomberg survey of 53 economists” in news articles. Presumably, these economists are reused frequently implying they have a record to which an online algorithm could be applied.

This application of online learning isn’t trivial. Even for the above examples, it isn’t clear how to handle issues like:

  1. A new expert starts making predictions. There are some reasonable ad-hoc mechanisms for coping with this in the context of particular algorithms.
  2. An expert declines to make a prediction. The modified “sleeping experts” setting handles this, but the results are not quite as tight.
  3. The loss associated with individual predictions is highly variable rather than something simple like “0/1-loss” or “squared error loss”. One approach to this is to combine regret minimizing learning reductions with online learning algorithms (drawback: the reduced predictions may not be of intuitive things). Another approach is simply trying to make very flexible master algorithms (drawback: flexibility often comes with a weakening in the theoretical guarantees).
  4. In the real world, we may not have feedback about a prediction until after the next 10 predictions (or more) need to be made.
  5. In the real world, there may be uncertainty about the measured truth. Quantifying GDP growth requires a lot of work and has some fundamental uncertainty associated with it, especially when immediate feedback is required.

9/6/2005

A link

Tags: General jl@ 2:48 pm

I read through some of the essays of Michael Nielsen today, and recommend them. Principles of Effective Research and Extreme Thinking are both relevant to several discussions here.

9/5/2005

Site Update

Tags: General site admin@ 9:48 pm

I tweaked the site in a number of ways today, including:

  1. Updating to Wordpress 1.5.
  2. Installing and heavily tweaking the Geekniche theme. Update: I switched back to a tweaked version of the old theme.
  3. Adding the Customizable Post Listings plugin.
  4. Installing the StatTraq plugin.
  5. Updating some of the links. I particularly recommend looking at the computer research policy blog.
  6. Adding threaded comments. This doesn’t thread old comments obviously, but the extra structure may be helpful for new ones.

Overall, I think this is an improvement, and it addresses a few of my earlier problems. If you have any difficulties or anything seems “not quite right”, please speak up. A few other tweaks to the site may happen in the near future.

9/4/2005

Science in the Government

Tags: General jl@ 9:18 am

I found the article on “Political Science” at the New York Times interesting. Essentially the article is about allegations that the US government has been systematically distorting scientific views. With a petition by some 7000+ scientists alleging such behavior this is clearly a significant concern.

One thing not mentioned explicitly in this discussion is that there are fundamental cultural differences between academic research and the rest of the world. In academic research, careful, clear thought is valued. This value is achieved by both formal and informal mechanisms. One example of a formal mechanism is peer review.

In contrast, in the land of politics, the basic value is agreement. It is only with some amount of agreement that a new law can be passed or other actions can be taken. Since Science (with a capitol ‘S’) has accomplished many things, it can be a significant tool in persuading people. This makes it compelling for a politician to use science as a mechanism for pushing agreement on their viewpoint.

Most scientists would not mind if their research is used in a public debate. The difficulty arises when the use of science is not representative of the beliefs of scientists. This can happen in many ways. For example, agreement is uncommon in research which implies that it is almost always possible, by carefully picking and choosing, to find one scientist who supports almost any viewpoint.

Such misrepresentations of scientific beliefs about the world violate the fundamental value of “careful, clear thought”, so they are regarded as fundamentally dangerous to the process of research. Naturally, fundamentally dangerous things are sensitive issues which can easily lead to large petitions.

This combination of mismatched values is what appears to be happening. It is less clear what should be done about it.

One response has been (as the article title suggests) politicization of science and scientists. For example the Union of Concerned Scientists (which organized the petition) has a viewpoint and is pushing it. As another example, anecdotal evidence suggests a strong majority of scientists in the US voted against Bush in the last presidential election.

I would prefer a different approach, which is essentially a separation of responsibilities. Given a sufficient separation of powers, scientists should be the most reliable source for describing and predicting the outcomes of some courses of action and the impact of new technologies. What is done with such information is up to the rest of the world. This style of “sharply defined well-separated powers” has worked fairly well elsewhere. Supreme court judges (who specialize in interpretation of law) are, by design, relatively unaffectable by the rest of politics. A newer example is the federal reserve board who have been relatively unaffected by changes in politics, even though it is easy to imagine their powers could dramatically effect election outcomes. This last example is a matter of custom rather than constitutional law.

Neither of the above examples are perfect—the separation of powers has failed on multiple occasions. Nevertheless, it seems to be a useful ideal.

Powered by WordPress