Watchword: Probability

Probability is one of the most confusingly used words in machine learning. There are at least 3 distinct ways the word is used.

  1. Bayesian: The Bayesian notion of probability is a ‘degree of belief’. The degree of belief that some event (e.g. “stock goes up” or “stock goes down”) occurs can be measured by asking a sequence of questions of the form “Would you bet the stock goes up or down at Y to 1 odds?” A consistent bettor will switch from ‘for’ to ‘against’ at some single value of Y. The probability is then Y/(Y+1). Bayesian probabilities express lack of knowledge rather than randomization. They are useful in learning because we often lack knowledge, and expressing that lack flexibly makes learning algorithms work better. Bayesian Learning uses ‘probability’ in this way exclusively.
  2. Frequentist: The Frequentist notion of probability is a rate of occurrence. A rate of occurrence can be measured by doing an experiment many times. If an event occurs k times in n experiments then it has probability about k/n. Frequentist probabilities can be used to measure how sure you are about something. They may be appropriate in a learning context for measuring confidence in various predictors. The frequentist notion of probability is common in physics, other sciences, and computer science theory.
  3. Estimated: The estimated notion of probability is measured by running some learning algorithm which predicts the probability of events rather than the events themselves. I tend to dislike this use of the word because it confuses the world with the model of the world.
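
The first two notions reduce to simple arithmetic, sketched here in Python. The function names are made up purely for illustration, and the sketch assumes that “Y to 1” means odds in favor of the event, which is the reading under which Y/(Y+1) comes out right.

```python
def odds_to_probability(y):
    """Degree of belief implied by switching sides at Y-to-1 odds: Y / (Y + 1).

    Assumes 'Y to 1' are odds in favor of the event (an illustrative convention).
    """
    if y < 0:
        raise ValueError("odds must be non-negative")
    return y / (y + 1)


def empirical_frequency(outcomes):
    """Frequentist estimate k / n: the fraction of experiments in which the event occurred."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one experiment")
    k = sum(1 for occurred in outcomes if occurred)
    return k / len(outcomes)


# A bettor who switches from 'for' to 'against' at 3-to-1 odds.
belief = odds_to_probability(3)

# An event observed in 3 of 4 repeated experiments.
rate = empirical_frequency([True, False, True, True])

print(belief, rate)
```

The two numbers coincide here, but they answer different questions: the first describes a bettor's state of knowledge, the second a rate observed by repeating an experiment.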

To avoid confusion, be careful to understand what other people mean by this word. It is helpful to always be explicit about which variables are randomized and which are constant whenever probability is used, because Bayesian and Frequentist probabilities commonly switch these roles.

Holy grails of machine learning?

Let me kick things off by posing this question to ML researchers:

What do you think are some important holy grails of machine learning?

For example:
– “A classifier with SVM-level performance but much more scalable”
– “Practical confidence bounds (or learning bounds) for classification”
– “A reinforcement learning algorithm that can handle the ___ problem”
– “Understanding theoretically why ___ works so well in practice”

I pose this question because I believe that goals stated explicitly and well, rather than left implicit, are likely to be achieved much more quickly: stating them provides clarity and opens up the problems to more people. I would also like to know more about the internal goals of the various machine learning sub-areas (theory, kernel methods, graphical models, reinforcement learning, etc.) as stated by people in these respective areas. This could help people cross sub-areas.

The Humanloop Spectrum of Machine Learning

All branches of machine learning seem to be united in the idea of using data to make predictions. However, people disagree to some extent about what this means. One way to categorize these different goals is on an axis, where one extreme is “tools to aid a human in using data to do prediction” and the other extreme is “tools to do prediction with no human intervention”. Here is my estimate of where various elements of machine learning fall on this spectrum.

Human necessary → Human partially necessary → Human unnecessary
Clustering, data visualization → Bayesian Learning, Probabilistic Models, Graphical Models → Kernel Learning (SVMs, etc.) → Decision Trees? → Reinforcement Learning

The exact position of each element is of course debatable. My reasoning is that clustering and data visualization are nearly useless for prediction without a human in the loop. Bayesian/probabilistic models/graphical models generally require a human to sit and think about what is a good prior/structure. Kernel learning approaches have a few standard kernels which often work on simple problems, although sometimes significant kernel engineering is required. I’ve been impressed of late by how ‘black box’ decision trees or boosted decision trees are. The goal of reinforcement learning (rather than perhaps the reality) is designing completely automated agents.

The position in this spectrum provides some idea of what the state of progress is. Things at the ‘human necessary’ end have been successfully used by many people to solve many learning problems. At the ‘human unnecessary’ end, the systems are finicky and often just won’t work well.

I am most interested in the ‘human unnecessary’ end.

Why I decided to run a weblog.

I have decided to run a weblog on machine learning and learning theory research. Here are some reasons:

1) Weblogs enable new functionality:

  • Public comment on papers. No mechanism for this exists at conferences and most journals. I have encountered it once for a science paper. Some communities have mailing lists supporting this, but not machine learning or learning theory. I have often read papers and found myself wishing there was some way to see others’ questions and read the replies.
  • Conference shortlists. One of the most common conversations at a conference is “what did you find interesting?” There is no explicit mechanism for sharing this information at conferences, and it’s easy to imagine that it would be handy to do so.
  • Evaluation and comment on research directions. Papers are almost exclusively about new research, rather than evaluation (and consideration) of research directions. This last role is satisfied by funding agencies to some extent, but that is a private debate of a subset of the community. It’s easy to imagine that a public debate would be more thorough and thoughtful, producing better decisions.
  • Public Collaboration. It may be feasible to use a weblog as a mechanism for public research on a scale smaller than a paper. Currently, most research in machine learning is done by one or a few closely working and privately communicating authors. Weblogs provide a natural generalization where anyone who is interested may be able to contribute.
  • The things not thought of. Weblogs provide new capabilities, and it is natural to miss the impact of these capabilities until a number of people have thought about and used them.

I intend to experiment with these capabilities.

2) Weblogs have the potential to be revolutionary. Here is a comparison of the different mechanisms of communication in a table.

| mechanism | speed | scope | permanency | information filtration |
|---|---|---|---|---|
| journal papers | 6 months to years | Anyone with interest and access | Very permanent | reviewed |
| conference papers | 4-6 months | Attendees (and often any with interest) | Permanent | reviewed |
| workshops | 1-6 months | Attendees | Typically transient | inspected |
| mailing lists | a few days | Anyone subscribed (or reading archives) | Semipermanent (with archives) | inspected |
| personal discussion | thought speed | Whoever is there then | Transient | not reviewed |
| weblog | thought speed | Anyone with interest | Semipermanent | not reviewed |

Weblogs achieve “best we can imagine” in every category except permanency and quality control. Furthermore, the weaknesses are not inherent to the medium, and are being actively addressed.

Permalinks are the equivalent of a citation, providing a semipermanent pointer to a piece of content. This is only ‘semi’ because the _author_ of the content can typically revise the content at any moment in the future, and the pointer is only permanent up to the permanence of the website.
Trackback is an explicit method for creating the reverse lookup table of citations: who cites this?
In addition, there are several mechanisms for information filtration such as “post is reposted in another weblog” and experimental moderation schemes.

The same forces driving academia into desiring permanent indelible records and very careful information filtration exist for blogs. These forces may produce the ‘missing pieces’, making weblogs very compelling for academic purposes.

3) Lance Fortnow told me so.