Fallback Analysis is a Secret to Useful Algorithms

The ideal of theoretical algorithm analysis is to construct an algorithm with accompanying optimality theorems proving that it is a useful algorithm. This ideal often fails, particularly for learning algorithms and learning theory. The general form of such a theorem is:

If preconditions Then postconditions
When we design learning algorithms, it is very common to come up with precondition assumptions such as “the data is IID”, “the learning problem is drawn from a known distribution over learning problems”, or “there is a perfect classifier”. All of these example preconditions can be false for real-world problems in ways that are not easily detectable. This means that algorithms derived and justified by these very common forms of analysis may be prone to catastrophic failure in routine (mis)application.

We can hope for better. Several different kinds of learning algorithm analysis have been developed, some of which have fewer preconditions. Simply demanding that these forms of analysis be used may be too strong—there is an unresolved criticism that such algorithms may be “too worst case”. Nevertheless, it is possible to have a learning algorithm that simultaneously provides strong postconditions given strong preconditions, reasonable postconditions given reasonable preconditions, and weak postconditions given weak preconditions. Some examples of this I’ve encountered include:

  1. Sham, Matthias and Dean showing that some Bayesian regression is robust in a minimax online learning analysis.
  2. The cover tree, which creates an O(n)-space data structure for nearest neighbor queries while simultaneously guaranteeing O(log n) query time when the metric obeys a dimensionality constraint.

The basic claim is that algorithms with a good fallback analysis are significantly more likely to achieve the theoretical algorithm analysis ideal. Both of the above algorithms have been tested in practice and found capable.

Several significant difficulties occur for anyone working on fallback analysis.

  1. It’s harder. This is probably the most valid reason—people have limited time to do things. Nevertheless, it is reasonable to hope that the core techniques used by many people have had this effort put into them.
  2. For a researcher, it is psychologically difficult to both assume and not assume a precondition. Seeing multiple forms of analysis applied to the same algorithm is a valuable aid here.
  3. It is psychologically difficult for a reviewer to appreciate the value of both assuming and not assuming some precondition. This is a matter of education.
  4. It is neither “sexy” nor straightforward. In particular, theoretically inclined people 1) get great joy from showing that something new is possible and 2) routinely work on papers of the form “here is a better algorithm to do X given the same assumptions”. A fallback analysis requires a change in assumption, invalidating (2), and the new thing that it shows for (1) is subtle: that two existing guarantees can hold for the same algorithm. My hope here is that this subtlety becomes better appreciated in time—making useful algorithms has a fundamental sexiness of its own.

On-line learning of regular decision rules

Many decision problems can be represented in the form
FOR n = 1, 2, …:
— Reality chooses a datum x_n.
— Decision Maker chooses his decision d_n.
— Reality chooses an observation y_n.
— Decision Maker suffers loss L(y_n, d_n).
END FOR.
The observation y_n can be, for example, tomorrow’s stock price and the decision d_n the number of shares Decision Maker chooses to buy. The datum x_n ideally contains all information that might be relevant in making this decision. We do not want to assume anything about the way Reality generates the observations and data.
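To make the protocol concrete, here is a minimal Python sketch of the loop above. The `reality` and `decision_maker` objects and their method names are assumptions made purely for illustration; the absolute loss matches the example used below.

```python
# A minimal sketch of the decision protocol above. The `reality` and
# `decision_maker` interfaces are hypothetical; nothing here is part of
# the formal result, it only spells out the loop.

def absolute_loss(y, d):
    return abs(y - d)

def run_protocol(reality, decision_maker, rounds, loss=absolute_loss):
    total_loss = 0.0
    for n in range(1, rounds + 1):
        x_n = reality.choose_datum(n)        # Reality chooses a datum x_n
        d_n = decision_maker.decide(x_n)     # Decision Maker chooses d_n
        y_n = reality.choose_observation(n)  # Reality chooses an observation y_n
        total_loss += loss(y_n, d_n)         # Decision Maker suffers loss L(y_n, d_n)
        decision_maker.observe(x_n, y_n)     # any learning happens here
    return total_loss
```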

Suppose there is a good and not too complex decision rule D mapping each datum x to a decision D(x). Can we perform as well, or almost as well, as D, without knowing it? This is essentially a special case of the problem of on-line learning.

Here is a simple result of this kind. Suppose the data x_n are taken from [0,1] and L(y,d) = |y − d|. A norm ||h|| of a function h on [0,1] is defined by

||h||^2 = (∫_0^1 h(t) dt)^2 + ∫_0^1 (h'(t))^2 dt.
Decision Maker has a strategy that guarantees, for all N and all D with finite ||D||,
∑_{n=1}^N L(y_n, d_n) ≤ ∑_{n=1}^N L(y_n, D(x_n)) + (||2D − 1|| + 1) N^{1/2}.
Therefore, Decision Maker is competitive with D provided the “mean slope” (∫_0^1 (D'(t))^2 dt)^{1/2} of D is significantly less than N^{1/2}. This can be extended to general reproducing kernel Hilbert spaces of decision rules and to many other loss functions.
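As a concrete instance of the bound, take the identity decision rule D(t) = t. Then 2D − 1 = 2t − 1, so ∫_0^1 (2t − 1) dt = 0 and ∫_0^1 2^2 dt = 4, giving ||2D − 1|| = 2 and the guarantee ∑_{n=1}^N L(y_n, d_n) ≤ ∑_{n=1}^N L(y_n, D(x_n)) + 3 N^{1/2}.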

It is interesting that the only known way of proving this result uses a non-standard (although very old) notion of probability. The standard definition (“Kolmogorov’s axiomatic”) is that probability is a normalized measure. The alternative is to define probability in terms of games rather than measure (this was started by von Mises and greatly advanced by Ville, who in 1939 replaced von Mises’s awkward gambling strategies with what he called martingales). This game-theoretic probability allows one to state the most important laws of probability as continuous strategies for gambling against the forecaster: the gambler becomes rich if the forecasts violate the law. In 2004 Akimichi Takemura noticed that for every such strategy the forecaster can play in such a way as not to lose a penny. Takemura’s procedure is simple and constructive: for any continuous law of probability we have an explicit forecasting strategy that satisfies that law perfectly. We call this procedure defensive forecasting. Takemura’s version was about binary observations, but it has been greatly extended since.

Now let us see how we can achieve the goal ∑_{n=1}^N L(y_n, d_n) < ∑_{n=1}^N L(y_n, D(x_n)) + (something small). This would be easy if we knew the true probabilities for the observations. At each step we would choose the decision with the smallest expected loss and the law of large numbers would allow us to prove that our goal would be achieved. Alas, we do not know the true probabilities (if they exist at all). But with defensive forecasting at our disposal we do not need them: the fake probabilities produced by defensive forecasting are ideal as far as the law of large numbers is concerned. The rest of the proof is technical details.
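To make the “smallest expected loss” step concrete, here is a tiny Python sketch for binary observations: given a forecast probability p for y = 1 (supplied, in the actual argument, by defensive forecasting), it returns the decision minimizing expected loss over a grid of candidate decisions. The grid and the default absolute loss are illustrative assumptions, not part of the proof.

```python
# Sketch: choose the decision with smallest expected loss under a forecast
# probability p for the event y = 1 (binary observations, y in {0, 1}).
# In the argument above p would come from defensive forecasting; here it
# is simply an input, and the candidate grid is an arbitrary illustration.

def expected_loss(p, d, loss):
    return p * loss(1, d) + (1 - p) * loss(0, d)

def best_decision(p, loss=lambda y, d: abs(y - d), candidates=None):
    if candidates is None:
        candidates = [k / 100 for k in range(101)]  # decisions in [0, 1]
    return min(candidates, key=lambda d: expected_loss(p, d, loss))

# With absolute loss the minimizer is a threshold rule:
# best_decision(0.3) -> 0.0, best_decision(0.8) -> 1.0
```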

Online Learning as the Mathematics of Accountability

Accountability is a social problem. When someone screws up, do you fire them? Or do you accept the error and let them continue? This is a very difficult problem and we all know of stories where the wrong decision was made.

Online learning (as meant here) is a subfield of learning theory which analyzes the online learning model.

In the online learning model, there is a set of hypotheses or “experts”. On any instance x, each expert makes a prediction y. A master algorithm A uses these predictions to form its own prediction y_A and then learns the correct prediction y*. This process repeats.

The goal of online learning is to find a master algorithm A which uses the advice of the experts to make good predictions. In particular, we typically want to guarantee that the master algorithm performs almost as well as the best expert. If L(e) is the loss of expert e and L(A) is the loss of the master algorithm, it is often possible to prove:

L(A) ≤ min_e L(e) + log(number of experts)

over all sequences.

In particular, there is no assumption of independent samples and there is no assumption that the experts perform well (or can perform well). This is not a high probability statement: it simply always holds. These assumption-free qualities are very important for application to the accountability problem, because the experts really can be adversarial.
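One standard way to obtain a bound of exactly this form is a Bayes-mixture (exponential weights) master for log loss. The sketch below is one such instance; the interface is assumed purely for illustration, and other losses use closely related weighting schemes with somewhat different overhead terms.

```python
import math

# Minimal sketch of a Bayes-mixture master for log loss on binary outcomes.
# Each expert supplies a probability (strictly between 0 and 1) that y = 1;
# the master predicts with the weight-averaged probability, and each weight
# shrinks by that expert's own loss. For this setup the cumulative log loss
# of the master is at most min_e L(e) + ln(number of experts), for any
# sequence of outcomes and expert predictions.

def log_loss(p, y):
    return -math.log(p if y == 1 else 1.0 - p)

def run_master(expert_probs, outcomes):
    """expert_probs: per round, a list of per-expert probabilities of y = 1.
    outcomes: the corresponding 0/1 labels. Returns the master's total loss."""
    log_weights = [0.0] * len(expert_probs[0])  # uniform prior over experts
    master_loss = 0.0
    for probs, y in zip(expert_probs, outcomes):
        total = sum(math.exp(lw) for lw in log_weights)
        p_master = sum(math.exp(lw) * p for lw, p in zip(log_weights, probs)) / total
        master_loss += log_loss(p_master, y)
        # Bayes update: each expert's log-weight drops by its loss this round.
        log_weights = [lw - log_loss(p, y) for lw, p in zip(log_weights, probs)]
    return master_loss
```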

In any situation where we have a set of human experts giving advice on the same subject, we can hope to apply online learning algorithms to better distill the collective advice into a single prediction. Examples include:

  1. Stock picking: Many of the big stocks have ‘analysts’ who predict whether they will go up or down. For an example, look here.
  2. Surveys: It is common to see things like “A gain of 1.4 percent was expected, based on the median estimate in a Bloomberg survey of 53 economists” in news articles. Presumably, these economists are reused frequently, implying they have a record to which an online algorithm could be applied.

This application of online learning isn’t trivial. Even for the above examples, it isn’t clear how to handle issues like:

  1. A new expert starts making predictions. There are some reasonable ad-hoc mechanisms for coping with this in the context of particular algorithms.
  2. An expert declines to make a prediction. The modified “sleeping experts” setting handles this, but the results are not quite as tight.
  3. The loss associated with individual predictions is highly variable rather than something simple like “0/1-loss” or “squared error loss”. One approach to this is to combine regret minimizing learning reductions with online learning algorithms (drawback: the reduced predictions may not be of intuitively meaningful quantities). Another approach is simply trying to make very flexible master algorithms (drawback: flexibility often comes with a weakening in the theoretical guarantees).
  4. In the real world, we may not have feedback about a prediction until after the next 10 predictions (or more) need to be made.
  5. In the real world, there may be uncertainty about the measured truth. Quantifying GDP growth requires a lot of work and has some fundamental uncertainty associated with it, especially when immediate feedback is required.

Structured Regret Minimization

Geoff Gordon gave an interesting presentation at the Snowbird learning workshop discussing the use of no-regret algorithms for several robot-related learning problems. There seems to be a draft here. This seems interesting in two ways:

  1. Drawback removal: One of the significant problems with these online algorithms is that they can’t cope with structure very easily. This drawback is addressed for certain structures.
  2. Experiments: One criticism of such algorithms is that they are too “worst case”. Several experiments suggest that protecting yourself against this worst case does not necessarily incur a great loss.

Binomial Weighting

Suppose we have a set of classifiers c making binary predictions from an input x, and we see examples in an online fashion. In particular, we repeatedly see an unlabeled example x, make a prediction y’ (possibly based on the classifiers c), and then see the correct label y.

When one of these classifiers is perfect, there is a great algorithm available: predict according to the majority vote over every classifier consistent with every previous example. This is called the Halving algorithm. It makes at most log_2 |c| mistakes since on any mistake, at least half of the classifiers are eliminated.
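Here is a minimal Python sketch of the Halving algorithm, assuming each classifier is a callable mapping an input to a 0/1 prediction; the function name is chosen for illustration.

```python
# Minimal sketch of the Halving algorithm described above. Each classifier
# is assumed to be a callable mapping an input x to a 0/1 prediction.

def halving_run(classifiers, stream):
    """stream yields (x, y) pairs online; returns the number of mistakes.
    If some classifier is perfect, the count is at most log_2(len(classifiers))."""
    consistent = list(classifiers)
    mistakes = 0
    for x, y in stream:
        votes = sum(c(x) for c in consistent)
        prediction = 1 if 2 * votes >= len(consistent) else 0  # majority vote
        if prediction != y:
            mistakes += 1  # at least half of `consistent` was wrong here
        # Keep only classifiers consistent with every example seen so far.
        consistent = [c for c in consistent if c(x) == y]
    return mistakes
```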

Obviously, we can’t generally hope that there exists a classifier which never errs. The Binomial Weighting algorithm is an elegant technique allowing a variant of the Halving algorithm to cope with errors by creating a set of virtual classifiers for every classifier which occasionally disagree with the original classifier. The Halving algorithm on this set of virtual classifiers satisfies a theorem of the form:

errors of Binomial Weighting algorithm ≤ min_c f(number of errors of c, number of experts)

The Binomial Weighting algorithm takes as a parameter an upper bound on the number of mistakes made by the best classifier. By introducing a “prior” over the number of mistakes, it can be made parameter free. Similarly, introducing a “prior” over the set of classifiers is easy and makes the algorithm sufficiently flexible for common use.

However, there is a problem. The minimal value of f() is 2 times the number of errors of any classifier, regardless of the number of classifiers. This is frustrating because a parameter-free learning algorithm taking an arbitrary “prior” and achieving good performance on an arbitrary (not even IID) set of examples would be compelling for implementation and use, if only we had a good technique for removing the factor of 2. How can we do that?

See the weighted majority algorithm for an example of a similar algorithm which can remove the factor of 2 using randomization, at the expense of introducing a parameter. There are known techniques for eliminating this parameter, but they appear not as tight (and therefore not as practically useful) as introducing a “prior” over the number of errors.
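For comparison, here is a minimal Python sketch of the randomized weighted majority idea referred to above, with the parameter left explicit; the interface is an assumption made for illustration only.

```python
import random

# Minimal sketch of randomized weighted majority. Each expert is a callable
# mapping x to a 0/1 prediction; the master follows a randomly sampled expert
# (probability proportional to its weight), then multiplies the weight of
# every wrong expert by beta. beta in (0, 1) is the parameter discussed above.

def rwm_run(experts, stream, beta=0.5):
    weights = [1.0] * len(experts)
    mistakes = 0
    for x, y in stream:
        predictions = [e(x) for e in experts]
        # Follow a randomly chosen expert, sampled in proportion to weight.
        i = random.choices(range(len(experts)), weights=weights)[0]
        if predictions[i] != y:
            mistakes += 1
        # Penalize every expert that was wrong on this round.
        weights = [w * (beta if p != y else 1.0)
                   for w, p in zip(weights, predictions)]
    return mistakes
```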