On-line learning of regular decision rules

Many decision problems can be represented in the form
FOR n=1,2,…:
— Reality chooses a datum x_n.
— Decision Maker chooses his decision d_n.
— Reality chooses an observation y_n.
— Decision Maker suffers loss L(y_n,d_n).
END FOR.
The observation y_n can be, for example, tomorrow’s stock price and the decision d_n the number of shares Decision Maker chooses to buy. The datum x_n ideally contains all information that might be relevant in making this decision. We do not want to assume anything about the way Reality generates the observations and data.

Suppose there is a good and not too complex decision rule D mapping each datum x to a decision D(x). Can we perform as well, or almost as well, as D, without knowing it? This is essentially a special case of the problem of on-line learning.

This is a simple result of this kind. Suppose the data x_n are taken from [0,1] and L(y,d)=|y–d|. A norm ||h|| of a function h on [0,1] is defined by

||h||² = (Integral₀¹ h(t) dt)² + Integral₀¹ (h'(t))² dt. Decision Maker has a strategy that guarantees, for all N and all D with finite ||D||, Sum_n=1^N L(y_n,d_n) is at most Sum_n=1^N L(y_n,D(x_n)) + (||2D–1|| + 1) N^1/2. Therefore, Decision Maker is competitive with D provided the “mean slope” (Integral₀¹ (D'(t))²dt)^1/2 of D is significantly less than N^1/2. This can be extended to general reproducing kernel Hilbert spaces of decision rules and to many other loss functions.

It is interesting that the only known way of proving this result uses a non-standard (although very old) notion of probability. The standard definition (“Kolmogorov’s axiomatic“) is that probability is a normalized measure. The alternative is to define probability in terms of games rather than measure (this was started by von Mises and greatly advanced by Ville, who in 1939 replaced von Mises’s awkward gambling strategies with what he called martingales). This game-theoretic probability allows one to state the most important laws of probability as continuous strategies for gambling against the forecaster: the gambler becomes rich if the forecasts violate the law. In 2004 Akimichi Takemura noticed that for every such strategy the forecaster can play in such a way as not to lose a penny. Takemura’s procedure is simple and constructive: for any continuous law of probability we have an explicit forecasting strategy that satisfies that law perfectly. We call this procedure defensive forecasting. Takemura’s version was about binary observations, but it has been greatly extended since.

Now let us see how we can achieve the goal Sum_n=1 ^N L(y_n,d_n) < Sum_n=1^N L(y_n,D(x_n)) + (something small). This would be easy if we knew the true probabilities for the observations. At each step we would choose the decision with the smallest expected loss and the law of large numbers would allow us to prove that our goal would be achieved. Alas, we do not know the true probabilities (if they exist at all). But with defensive forecasting at our disposal we do not need them: the fake probabilities produced by defensive forecasting are ideal as far as the law of large numbers is concerned. The rest of the proof is technical details.

4 Replies to “On-line learning of regular decision rules”

Teemu Roos says:

10/11/2005 at 1:18 am

Some questions:
1) Forgive my ignorance, but what counts as a law of probability? Is there is a p such that 1/N Sum_{n=1}^N x_n -> p, a law, or do we get a different law for each p?

2) Using Takemura’s procedure, the forecaster can play without losing a penny. Does this hold only for a single gambling strategy (i.e., a single law of probability) at a time? What happens when we combine several laws? Can the forecaster still avoid losing anything?

3) For the sake of illustration, could you shortly describe (or give a pointer where I can read about) the Decision Maker’s competitive strategy when the set of decision rules D is the set of constant functions D(x)=p for 0 is binary; and the loss is measured by log-loss? Could we then replace ‘something small’ in the loss bound by a function that depends on N but not on the norm of D?
Teemu Roos says:

10/11/2005 at 1:23 am

Hmm… I must have used some HTML control sequences. In 3) I tried to say ‘…D is the set of constant functions D(x)=p with p in [0,1]; y is binary; and the loss is measured by log-loss?’.
Volodya says:

10/12/2005 at 8:57 pm

To avoid confusion, I will repeat your questions before answering them (correcting a typo in question 1, splitting question 2 in two, and merging the two pieces of question 3).

1) Forgive my ignorance, but what counts as a law of probability? Is there is a p such that 1/N Sum_n=1^N y_n -> p, a law, or do we get a different law for each p?

A short answer would be to say that for each p we get a different law. In fact, the law you suggest, (1/N) Sum_n=1^N y_n → p, would not be useful for our purpose, since the forecaster is so severely restricted (to giving the same prediction p on each round of the forecasting game). Even the more general law (1/N) Sum_n=1^N (y_n – p_n) → 0 is not useful: it can be (too) easily enforced by the forecaster by choosing p₁=1/2 and p_n = y_n-1 for n>1; it is clear that this forecasting strategy is not doing anything useful. A more useful law of probability would be (1/N) Sum_n=1^N (y_n – p_n) F(p_n,x_n) → 0 (with a known convergence rate), where F is a mapping into some feature space (a vector, typically Hilbert, space). The proof (which can be found here) of the proposition that I stated in the original posting (for future reference, let me call it Proposition O) uses similar but more specialized laws.

2a) Using Takemura’s procedure, the forecaster can play without losing a penny. Does this hold only for a single gambling strategy (i.e., a single law of
probability) at a time?

Yes, the procedure is applicable only to a single gambling strategy. If you want your forecasts to satisfy several laws, the corresponding gambling strategies have to be combined first.

2b) What happens when we combine several laws? Can the forecaster still avoid losing anything?

Suppose we have two gambling strategies, S1 and S2, that avoid bankruptcy when used with the initial capital of 1 dollar. The most natural combined strategy is (S1+S2)/2. If the forecaster avoids losing anything against the combined strategy (which he can do), either strategy can make at most 2 dollars out of 1 for the gambler. Therefore, the forecaster still does not lose much. Only when a large number of gambling strategies are combined there is a danger of a big loss when playing against one of these strategies. In fact, the proof referred to in my answer to question 1 combines two gambling strategies (which are called “strategies for Skeptic” in that paper).

I will answer question 3 in a day or two, when I come home after the ALT conference.
Volodya says:

10/14/2005 at 7:27 pm

You might prefer to read my original posting, first reply, and this second reply in pdf set in LATEX.

3) For the sake of illustration, could you shortly describe (or give a pointer where I can read about) the Decision Maker’s competitive strategy when the set of decision rules D is the set of constant functions D(x)=p with p in [0,1]; y is binary; and the loss is measured by log-loss?. Could we then replace “something small” in the loss bound by a function that depends on N but not on the norm of D?

Such a strategy and a performance guarantee for it are described in this paper (already mentioned earlier); see Corollary 3 and Section 7 (these numbers refer to the current version, 3, of the paper). This is done for arbitrary reproducing kernel Hilbert spaces (RKHSs), but the strategy and gurantee simplify very little in the special case that you describe (the function space of constants). In the case of Proposition O (which considers the absolute loss rather than the log loss function) the dependence on D can be avoided since ||2D-1|| is bounded above by 1. For the function space of constants, the log-loss analogue of Proposition O is not as good as what can be obtained using the “Aggregating Algorithm” (AA; in the log-loss case, the AA is just the Bayes rule). Yoav Freund in his paper “Predicting a binary sequence almost as well as the optimal biased coin” found that the AA (which he calls the “exponential weights” algorithm) gives “something small” equal to (1/2) ln(N+1) + 1. This is better than Proposition O in two respects: there is no dependence on the norm of D (which cannot be avoided in the log-loss version of Proposition O) and the dependence on N is logarithmic. In general, the logarithmic dependence on N can be expected for “regular” finite-dimensional (say, K-dimensional) benchmark classes of decision rules: cf. the ubiquitous term (K/2) ln N in Rissanen’s estimates of stochastic complexity or, in a context closer to Yoav’s paper, Section 4.5 of my “Competitive on-line statistics” (International Statistical Review, 2001).

I can see two ways to eliminate the existing mismatch between Proposition O and Yoav’s result. First, one can try to improve Proposition O by applying the AA to RKHSs of decision rules. Second, one can try to improve the use of defensive forecasting in the case of finite-dimensional benchmark classes (perhaps by finding laws of probability better suited to such classes). I do not plan to investigate this further in the near future (I have already started some other work, such as decision making using defensive forecasting when the loss functions depends on several future observations and/or is not convex, and it will take some time). I will be very glad if you or somebody else find this problem interesting and try to solve it.

Comments are closed.