Use of Notation

For most people, a mathematical notation is like a language: you learn it and stick with it. For people doing mathematical research, however, this is not enough: they must design new notations for new problems. The design of good notation is both hard and worthwhile since a bad initial notation can retard a line of research greatly.

Before we had mathematical notation, equations were all written out in language. Since words have multiple meanings and variable precedences, long equations written out in language can be extraordinarily difficult to parse and sometimes fundamentally ambiguous. A good representative example of this is the legalese in the tax code. Since we want greater precision and clarity, we adopt mathematical notation.

One fundamental thing to understand about mathematical notation is that humans, as logic verifiers, are barely capable. This is the fundamental reason why one notation can be much better than another. This observation is easier to miss than you might expect because, for a problem you are working on, you have already expended the effort to reach an understanding.

I don’t know of any systematic method for designing notation, but there is a set of heuristics, learned over time, which may be more widely helpful.

  1. Notation should be minimized. If there are two ways to express things, then choose the (objectively, by symbol count) simpler one. If notation is only used once, it should be removable (this often arises in presentations).
  2. Notation divergence should be minimized. If the people working on some problem have a standard notation, then sticking with it is easier. For example, in machine learning x is almost always a set of features from which predictions are made.
  3. A reasonable mechanism for notation design is to first name and define the quantities you are working with (for example, reward r and time t), and then make derived quantities by combination (for example, r_t is the reward at time t).
  4. Variables should be alliterated. Time is t, reward is r, cost is c, hypothesis is h.
  5. Name collisions (or near collisions) should be avoided. E and p are terrible variable names in some contexts.
  6. Sub-sub-scripts should be avoided. It is often possible to change a sub-sub-script into a sub-script by redefinition.
  7. Superscripts are dangerous because of overloading with exponentiation.
  8. Inessential dependences should be suppressed in the definition. (For example, in reinforcement learning the MDP M you are working with is often suppressible because it never changes.)
  9. A dependence must be either uniformly suppressed or uniformly explicit.
  10. Short theorem statements are very nice. There seem to be two styles of theorem statements: long, including all definitions, and short, with definitions made before the statement. As computer scientists, we have to prefer “short” because long is nonmodular. As human readers, short is also easier to follow.
  11. It is very easy to forget the quantification of a variable (“for all” or “there exists”) when you are working on a theorem, and it is essential for readers that you specify it explicitly.
  12. Avoid strange alphabets. It is hard for people to think with unfamiliar symbols. English lowercase > English uppercase > Greek lowercase > Greek uppercase > Hebrew > other strange things.
  13. The definitions section of a paper often should not contain all of the paper’s definitions. Instead, it should cover the universally used definitions; others can be introduced just before they are used.

These heuristics often come into conflict, which can be hard to resolve. When trying to resolve such a conflict, it’s important to understand that it’s easy to misjudge how a notation will work out in practice. Trying out different notations and comparing them is reasonable.

Are there other useful heuristics for notation design? (Or disagreements with the above heuristics?)

“Structural” Learning

Fernando Pereira pointed out Ando and Zhang’s paper on “structural” learning. Structural learning is multitask learning on subproblems created from unlabeled data.

The basic idea is to look at the unlabeled data and create many supervised problems from it. On text data, which they test on, these subproblems might be of the form “Given the surrounding words, predict the middle word”. The hope is that predicting well on these subproblems is relevant to predicting well on your core problem.
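As a concrete illustration, here is a minimal Python sketch (my own, not code from the paper) of how such subproblems might be manufactured from unlabeled text; the tokenization and window size are arbitrary assumptions.

```python
# Hypothetical sketch: build auxiliary supervised problems from unlabeled
# text by holding out each middle word and asking a predictor to recover
# it from the surrounding words.

def make_auxiliary_problems(sentences, window=2):
    """Yield (context_words, held_out_word) training pairs."""
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(window, len(tokens) - window):
            context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
            yield context, tokens[i]

unlabeled = ["the cat sat on the mat", "a dog slept on the rug"]
for context, target in make_auxiliary_problems(unlabeled):
    print(context, "->", target)
```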

In the long run, the precise mechanism used (essentially, linear predictors with parameters tied by a common matrix) and the precise problems formed may not be critical. What seems critical is that the hope is realized: the technique provides a significant edge in practice.

Some basic questions about this approach are:

  1. Are there effective automated mechanisms for creating the subproblems?
  2. Is it necessary to use a shared representation?

Why do people count for learning?

This post is about a confusion of mine with respect to many commonly used machine learning algorithms.

A simple example where this comes up is Bayes net prediction. A Bayes net is a directed acyclic graph over a set of nodes, where each node is associated with a variable and the edges indicate dependence. The joint probability distribution over the variables is given by a set of conditional probabilities. For example, a very simple Bayes net might express:
P(A,B,C) = P(A | B,C)P(B)P(C)

What I don’t understand is the mechanism commonly used to estimate P(A | B, C). If we let N(A,B,C) be the number of instances of A,B,C then people sometimes form an estimate according to:

P'(A | B,C) = [N(A,B,C)/N] / [(N(B)/N)(N(C)/N)] = N(A,B,C) N / [N(B) N(C)]

… in other words, people just estimate P'(A | B,C) according to observed relative frequencies. This is a reasonable technique when you have a large number of samples compared to the size of the space A x B x C, but it (naturally) falls apart when this is not the case, as typically happens with “big” learning problems such as machine translation, vision, etc…
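For concreteness, here is a small sketch of that relative-frequency estimate; the data and variable names are made up purely for illustration.

```python
from collections import Counter

# Toy observations of (A, B, C); purely illustrative.
data = [(1, 0, 1), (0, 0, 1), (1, 1, 0), (1, 0, 1)]
N = len(data)
N_abc = Counter(data)
N_b = Counter(b for _, b, _ in data)
N_c = Counter(c for _, _, c in data)

def p_freq(a, b, c):
    """Relative-frequency estimate P'(A=a | B=b, C=c) = N(a,b,c) N / (N(b) N(c))."""
    return N_abc[(a, b, c)] * N / (N_b[b] * N_c[c])

print(p_freq(1, 0, 1))  # 2 * 4 / (3 * 3) ≈ 0.89
```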

To compensate, people often try to pick some prior (such as a Dirichlet prior with one “virtual count” per joint parameter setting) to provide a reasonable default value for the count. Naturally, in the “big learning” situations where this applies, the precise choice of prior can greatly affect system performance, leading to finicky tuning of various sorts. It’s also fairly common to fit some parametric model (such as a Gaussian) in an attempt to predict A given B and C.
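Continuing the counting sketch above, one standard form of that smoothing adds α virtual counts for each value of A and conditions on the joint count N(b,c); α = 1 corresponds to one virtual count per joint parameter setting. This is a generic add-α (Dirichlet) sketch, not any particular system’s estimator.

```python
# Add-alpha (Dirichlet) smoothing for P(A | B, C), reusing `data`, `Counter`,
# and `N_abc` from the sketch above. `num_a_values` is |A| (binary here).
N_bc = Counter((b, c) for _, b, c in data)
num_a_values = 2

def p_smoothed(a, b, c, alpha=1.0):
    return (N_abc[(a, b, c)] + alpha) / (N_bc[(b, c)] + alpha * num_a_values)

print(p_smoothed(1, 1, 1))  # unseen (B, C) pair falls back to 1 / |A| = 0.5
```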

Stepping back a bit, we can think of the estimation of P(A | B, C) as a simple self-contained prediction (sub)problem. Why don’t we use existing technology for doing this prediction? Viewed as a learning algorithm, “counting with a Dirichlet prior” is exactly memorizing the training set and then predicting according to (precisely) matching training set elements, or falling back to a default when there is no match. It’s hard to imagine a more primitive learning algorithm.
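As a sketch of the alternative, here is what treating the estimation of P(A | B, C) as an ordinary prediction subproblem might look like; logistic regression is my arbitrary stand-in for “existing technology”, applied to the same toy data as above.

```python
from sklearn.linear_model import LogisticRegression

# Features are the parent settings (B, C); the label is A. Any general
# purpose probabilistic classifier could be plugged in here.
X = [[b, c] for _, b, c in data]  # `data` from the counting sketch above
y = [a for a, _, _ in data]

model = LogisticRegression()
model.fit(X, y)
# Estimated [P(A=0 | B=1, C=1), P(A=1 | B=1, C=1)]
print(model.predict_proba([[1, 1]]))
```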

There seems to be little downside to trying this approach. In low count situations, a general purpose prediction algorithm has a reasonable hope of performing well. In a high count situation, any reasonable general purpose algorithm converges to the same estimate as above. In either case something reasonable happens.

Using a general purpose probabilistic prediction algorithm isn’t a new idea (for example, see page 57), but it appears greatly underutilized. This is a very small modification of existing systems with a real hope of dealing with low counts in {speech recognition, machine translation, vision}. It seems that using a general purpose probabilistic prediction algorithm should be the default rather than the exception.

The Peekaboom Dataset

Luis von Ahn’s Peekaboom project has yielded data (830MB).

Peekaboom is the second attempt (after Espgame) to produce a dataset which is useful for learning to solve vision problems based on voluntary game play. As a second attempt, it is meant to address all of the shortcomings of the first attempt. In particular:

  1. The locations of specific objects are provided by the data.
  2. The data collection is far more complete and extensive.

The data consists of:

  1. The source images. (1 file per image, just short of 60K images.)
  2. The in-game events. (1 file per image, in a lispy syntax.)
  3. A description of the event language.

There is a great deal of very specific and relevant data here, so the hope that this will help solve vision problems seems quite reasonable.

A Fundamentalist Organization of Machine Learning

There are several different flavors of Machine Learning classes. Many classes are of the ‘zoo’ sort: many different learning algorithms are presented. Others avoid the zoo by not covering the full scope of machine learning.

This is my view of what makes a good machine learning class, along with why. I’d like to specifically invite comment on whether things are missing, misemphasized, or misplaced.

| Phase | Subject | Why? |
|-------|---------|------|
| Introduction | What is a machine learning problem? | A good understanding of the characteristics of machine learning problems seems essential. Characteristics include: a data source, some hope the data is predictive, and a need for generalization. This is probably best taught in a case study manner: lay out the specifics of some problem and then ask “Is this a machine learning problem?” |
| Introduction | Machine Learning Problem Identification | Identification and recognition of the type of learning problem is (obviously) a very important step in solving such problems. People need to be familiar with the concepts of ‘regression’, ‘classification’, ‘cost sensitive classification’, ‘reinforcement learning’, etc… A good organization of these things is possible, but not yet well done. |
| Introduction | Example algorithm 1 | To really understand machine learning, a couple of learning algorithms must be understood in detail. |
| Introduction | Example algorithm 2 | Ditto. The reason the number is “2” and not “1” or “3” is that 2 is the minimum number required to make people naturally aware of the degrees of freedom available in learning algorithm design. |
| Analysis | Bias for Learning | The need for a good bias is one of the defining characteristics of learning. This includes discussing the means to specify bias (via Bayesian priors, choice of features, graphical models, etc…). This statement is generic, so it will always apply to one degree or another. |
| Analysis | Learning can be boosted | This is the boosting observation: that it is possible to bootstrap predictive ability to create a better overall system. This statement is similarly generic. |
| Analysis | Learning can be transformed | This is the reductions observation: that the ability to solve one kind of learning problem implies the ability to solve other kinds of learning problems. This statement is similarly generic. |
| Analysis | Learning can be preserved | This is the online learning with experts observation: that we can have a master algorithm which preserves the best learning performance of subalgorithms. This statement is again generic. |
| Analysis | Overfitting | Learning algorithms can easily overfit to existing training data. How to analyze this (with an IID assumption) and how to avoid it are very important for success. |
| Analysis | Hardness of Learning | It turns out that there are several different ways in which machine learning can be hard, including computational and information theoretic hardness. Some of PAC learning is relevant here. An understanding of how and why learning algorithms can fail seems important to understanding the process. |
| Applications | Vision | One example of how learning is applied to solve vision problems. |
| Applications | Language | Ditto for language problems. |
| Applications | Robotics | Ditto for robotics. |
| Applications | Speech | Ditto for speech. |
| Applications | Businesses | Ditto for businesses. |
| | Where is machine learning going? | Insert predictions of the future here. It should be understood that the field of machine learning is changing rapidly. |

The emphasis here is on fundamentals: generally applicable mathematical statements and understandings of the learning problem. Given that emphasis, the ‘applications’ section could be cut without harming the integrity of the purpose.