New Models

How should we, as researchers in machine learning, organize ourselves?

The most immediate measurable objective of computer science research is publishing a paper. The most difficult aspect of publishing a paper is having reviewers accept and recommend it for publication. The simplest mechanism for doing this is to show theoretical progress on some standard, well-known, easily understood problem.

In doing this, we often fall into a local minimum of the research process. The basic problem in machine learning is that it is very unclear whether the mathematical model is the right one for the (or some) real problem. A good mathematical model in machine learning should have one fundamental trait: it should aid the design of effective learning algorithms. To date, our ability to solve interesting learning problems (speech recognition, machine translation, object recognition, etc.) remains limited (although improving), so the “rightness” of our models is in doubt.

If our mathematical models are bad, the simple mechanism of research above cannot yield the end goal. (This should be agreed on even by people who disagree about what the end goal of machine learning is!) Instead, research which proposes and investigates new mathematical models for machine learning might yield the end goal. Doing this is hard.

  1. Coming up with a new mathematical model is just plain not easy. Some sources of inspiration include:
    1. Watching carefully: what happens successfully in practice can often be abstracted into a mathematical model.
    2. Swapping fields: In other fields (for example crypto), other methods of analysis have been developed. Sometimes, these methods can be transferred.
    3. Model repair: Existing mathematical models often have easily comprehensible failure modes. By thinking about how to avoid such failure modes, we can sometimes produce a new mathematical model.
  2. Speaking about a new model is hard. The difficulty starts with explaining it yourself: when trying to converge on a new model, we often think of it in terms of its difference from an older model, leading to a tangled explanation. The difficulty continues with other people (in particular: reviewers) reading it. For a reviewer with limited time, it is very tempting to assume that any particular paper is operating in some familiar model and judge it as failing on that basis. The best approach here seems to be super explicitness. You can’t be too blunt about saying “this isn’t the model you are thinking about”.
  3. Succeeding with new models is also hard. When people don’t have a reference frame to understand the new model, they are unlikely to follow up, as is necessary for success in academia.

The good news here is that a successful new model can be a big win. I wish it were an easier win: the barriers to success are formidably high, and it seems we should do everything possible to lower them for the sake of improving research.

The Stock Prediction Machine Learning Problem

…is discussed in this nytimes article. I generally expect such approaches to become more common since computers are getting faster, machine learning is getting better, and data is becoming more plentiful. This is another example where machine learning technology may have a huge economic impact. Some side notes:

  1. We in research know almost nothing about how these things are done (because it is typically a corporate secret).
  2. … but the limited discussion in the article seems naive from a machine learning viewpoint.
    1. The learning process used apparently often fails to take into account transaction costs (a toy sketch of why this matters appears after this list).
    2. What little is said about the approaches appears model-based. It seems plausible that more direct prediction methods can yield an edge.
  3. One difficulty with stock picking as a research topic is that it is inherently a zero-sum game (for every winner, there is a loser). Much of the rest of research is positive-sum (basically, everyone wins).
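
As a concrete illustration of point 2.1, here is a toy sketch (it assumes numpy; the returns, the prediction signal, and the cost model are all made up, and this is not a description of any real trading system) showing how transaction costs can swamp the gross return of a frequently switching strategy:

    import numpy as np

    rng = np.random.default_rng(0)
    returns = rng.normal(0.0005, 0.01, size=250)    # made-up daily asset returns
    signal = np.sign(rng.normal(size=250))          # made-up long/short predictions (+1 or -1)

    gross = np.sum(signal * returns)                # return if trading were free
    position_change = np.abs(np.diff(signal, prepend=0.0))
    cost_per_unit = 0.001                           # made-up cost per unit of position changed
    net = gross - cost_per_unit * position_change.sum()
    print(f"gross: {gross:.3f}   net after transaction costs: {net:.3f}")

Any learning process whose objective ignores the second term is optimizing the wrong quantity.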

MaxEnt contradicts Bayes Rule?

A few weeks ago I read this. David Blei and I spent some time thinking hard about this a few years back (thanks to Kary Myers for pointing us to it):

In short, I was thinking that “Bayesian belief updating” and “maximum entropy” were two orthogonal principles. But it appears that they are not, and that they can even be in conflict!
Example (from Kass 1996): consider a die (6 sides) and the prior knowledge E[X] = 3.5.
Maximum entropy leads to P(X) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
Now consider a new piece of evidence A = “X is an odd number”.
The Bayesian posterior P(X|A) ∝ P(A|X) P(X) is (1/3, 0, 1/3, 0, 1/3, 0).
But MaxEnt with the constraints E[X] = 3.5 and E[indicator function of A] = 1 leads to approximately (.22, 0, .32, 0, .47, 0)!! (Note that E[indicator function of A] = P(A).)
Indeed, for MaxEnt, because there is no more ‘6’, big numbers must be more probable to ensure an average of 3.5. For Bayesian updating, P(X|A) doesn’t have to have expectation 3.5; P(X) and P(X|A) are different distributions.
Conclusion? MaxEnt and Bayesian updating are two different principles leading to different belief distributions. Am I right?

I don’t believe there is any paradox at all between MaxEnt (perhaps more generally, MinRelEnt) and Bayesian updates. Here, straight MaxEnt makes no sense. The implication of the problem is that the ensemble average 3.5 is no longer an active constraint. That is, we no longer believe the constraint E[X]=3.5 once we have the additional data that X is an odd number. The sequential update using minimum relative entropy is identical to Bayes rule and produces the correct answer. These two answers are simply (correct) answers to different questions.
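
As a sanity check on the numbers in the example, here is a minimal numerical sketch of both computations (it assumes numpy; the bisection solver and variable names are mine, not part of the original discussion):

    import numpy as np

    faces = np.arange(1, 7)

    # Bayesian update: uniform prior on six faces, condition on A = "X is odd".
    prior = np.full(6, 1.0 / 6.0)
    likelihood = (faces % 2 == 1).astype(float)     # P(A|X) is 0 or 1 here
    posterior = prior * likelihood
    posterior /= posterior.sum()
    print("Bayes:  ", posterior)                    # (1/3, 0, 1/3, 0, 1/3, 0)

    # MaxEnt with constraints E[X] = 3.5 and P(A) = 1.
    # The second constraint collapses the support to {1, 3, 5}; the MaxEnt
    # solution is then p_i proportional to exp(lam * x_i), with lam chosen so
    # that E[X] = 3.5. Solve for lam by bisection (the mean increases with lam).
    odd = np.array([1, 3, 5])

    def mean_for(lam):
        w = np.exp(lam * odd)
        return (w * odd).sum() / w.sum()

    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if mean_for(mid) < 3.5:
            lo = mid
        else:
            hi = mid
    w = np.exp(lo * odd)
    maxent = np.zeros(6)
    maxent[odd - 1] = w / w.sum()
    print("MaxEnt: ", maxent)                       # roughly (.22, 0, .32, 0, .47, 0)

The two outputs differ, which is exactly the point: they answer different questions, one conditioning on new evidence and the other re-imposing the old moment constraint on the reduced support.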

Some recent papers

It was a fine time for learning in Pittsburgh. John and Sam mentioned some of my favorites. Here are a few more worth checking out:

Online Multitask Learning
Ofer Dekel, Phil Long, Yoram Singer
This is on my reading list. Definitely an area I’m interested in.

Maximum Entropy Distribution Estimation with Generalized Regularization
Miroslav Dudík, Robert E. Schapire

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path
András Antos, Csaba Szepesvári, Rémi Munos
Again, on the list to read. I saw Csaba and Remi talk about this and related work at an ICML Workshop on Kernel Reinforcement Learning. The big question in my head is how this compares/contrasts with existing work in reductions to reinforcement learning. Are there advantages/disadvantages?

Higher Order Learning On Graphs by Sameer Agarwal, Kristin Branson, and Serge Belongie looks to be interesting. They seem to pooh-pooh “tensorization” of existing graph algorithms.

Cover Trees for Nearest Neighbor (Alina Beygelzimer, Sham Kakade, John Langford) finally seems to have gotten published. It’s an embarrassment to the community that it took this long, and a reminder of how diligent one has to be in ensuring good work gets published. This seems to happen on a regular basis. (See A New View of EM.)

Finally, I thought this one was very cool:
Constructing Informative Priors by Rajat Raina, Andrew Y. Ng, Daphne Koller.
Same interest as the first paper on the list.
Check them out!

Branch Prediction Competition

Alan Fern points out the second branch prediction challenge (due September 29), which is a follow-up to the first branch prediction competition. Branch prediction is one of the fundamental learning problems of the computer age: without it, our computers might run an order of magnitude slower. This is a tough problem since there are sharp constraints on time and space complexity in an online environment. For machine learning, the “idealistic track” may fit well. Essentially, they remove these constraints to gain a weak upper bound on what might be done.
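
For readers unfamiliar with the problem, here is a minimal sketch of one classical baseline predictor: a table of two-bit saturating counters indexed by a hash of the branch address. This is purely illustrative Python (the class, table size, and trace are my own invention), not the competition’s simulator or interface, but it gives a feel for how small the state and per-branch work of a realistic predictor must be:

    class TwoBitPredictor:
        """A table of two-bit saturating counters indexed by a hash of the branch address."""

        def __init__(self, table_bits=12):
            self.mask = (1 << table_bits) - 1
            # Counters live in {0, 1, 2, 3}; start at 1 (weakly not-taken).
            self.table = [1] * (1 << table_bits)

        def predict(self, pc):
            # Predict taken when the counter is in the upper half of its range.
            return self.table[pc & self.mask] >= 2

        def update(self, pc, taken):
            i = pc & self.mask
            if taken:
                self.table[i] = min(3, self.table[i] + 1)
            else:
                self.table[i] = max(0, self.table[i] - 1)

    # Usage on a made-up trace of (branch address, outcome) pairs.
    trace = [(0x400123, True), (0x400123, True), (0x400200, False), (0x400123, True)]
    predictor = TwoBitPredictor()
    correct = 0
    for pc, taken in trace:
        correct += int(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    print(f"accuracy: {correct}/{len(trace)}")

The “idealistic track” essentially asks what a learner could do if freed from this kind of tight per-prediction budget.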