The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It's a fine algorithm accompanied by a remarkable theoretical statement.
The essential statement holds in the "online learning with an adversary" setting. Initially, there is a set of n weights, which might be initialized to (1/n, …, 1/n) (or any other probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens:
- The world reveals a set of features x in {0,1}^n. In the online learning with an adversary literature, the features are called "experts" and thought of as subpredictors, but this interpretation isn't necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts).
- EG makes a prediction according to y’ = w . x (dot product).
- The world reveals the truth y in [0,1].
- EG updates the weights according to w_i <- w_i e^{-2c(y' - y)x_i}. Here c is a small constant learning rate.
- The weights are renormalized to sum to 1.
The 4th step is what gives EG its name—exponent of the negative gradient (of squared loss in this case).
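For concreteness, here is a minimal sketch of the loop above in Python. The function name `eg_online`, the feature stream `rounds`, and the value `c=0.1` are placeholders, not anything specified in the algorithm itself:

```python
import numpy as np

def eg_online(rounds, n, c=0.1):
    """Run the EG loop over an iterable of (x, y) pairs with x in {0,1}^n."""
    w = np.full(n, 1.0 / n)        # start from the uniform distribution
    for x, y in rounds:
        y_pred = w @ x             # predict with the dot product w . x
        w = w * np.exp(-2.0 * c * (y_pred - y) * x)  # exponentiated negative gradient
        w = w / w.sum()            # renormalize so the weights sum to 1
    return w
```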
The accompanying theoretical statement (in English) is that, for all sequences of observations, this algorithm does nearly as well in squared loss as the best convex combination of the features, with a regret that is only logarithmic in the number of features. The mathematical theorem statement is: for all T, for all T-length sequences of observations, and for every convex combination q,

Sum_{t=1}^T (y'_t - y_t)^2 <= a Sum_{t=1}^T (q . x_t - y_t)^2 + b KL(q||p),

where a and b are constants determined by the learning rate c (with a approaching 1 as c shrinks).
Here KL(q||p) = Sum_i q_i ln(q_i / p_i) is the KL-divergence between the distribution q that you compare with and the distribution p that you start with. The KL-divergence is at most ln n if p is uniform.
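The ln n bound is a one-line calculation: with uniform p (so p_i = 1/n),

```latex
\mathrm{KL}(q\|p) \;=\; \sum_i q_i \ln\frac{q_i}{1/n} \;=\; \ln n - H(q) \;\le\; \ln n
```

since the entropy H(q) = -Sum_i q_i ln q_i is nonnegative.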
The learning rate c plays a critical role in the theorem, and the best constant setting of c depends on how many total rounds T there are. Tong Zhang likes to think of this algorithm as stochastic gradient descent with entropy regularization, which makes it clear that when used as an online optimization algorithm, c should be gradually decreased in value.
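One way to make that interpretation concrete: a single EG step can be viewed as the mirror descent problem of minimizing a linearized loss plus a KL penalty over the probability simplex Delta_n,

```latex
w_{t+1} \;=\; \arg\min_{w \in \Delta_n} \Big( c\,\nabla \ell_t(w_t) \cdot w \;+\; \mathrm{KL}(w \,\|\, w_t) \Big)
\qquad\Longrightarrow\qquad
w_{t+1,i} \;\propto\; w_{t,i}\, e^{-c\, \partial_i \ell_t(w_t)}
```

Taking \ell_t(w) = (w \cdot x_t - y_t)^2 gives \partial_i \ell_t(w_t) = 2(y'_t - y_t) x_{t,i}, which is exactly the update above; decreasing c over time corresponds to strengthening the regularization.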
There are many things right here.
- Exponentiated Gradient competes with the best convex predictor with no caveats about how the world produces the data. This is pretty remarkable—it’s much stronger than competing with the best single feature, as many other online adversary algorithms do. The lack of assumptions about the world means this is a pretty universal algorithm.
- The regret is logarithmic in the number of features, so the algorithm can give good performance even when the number of features is extraordinarily large compared to the number of examples. In contrast, gradient descent style algorithms might need a number of examples similar to the number of features. (The same paper also analyzes regret for gradient descent.)
- We know from a learning reductions point of view that the ability to optimize squared loss well implies the ability to optimize many other losses fairly well.
A few things aren't right about EG. A convex combination is not as powerful a representation as a linear combination. There are natural relaxations covered in the EG paper (such as EG±, which maintains paired positive and negative weights) as well as in some other papers, which deal with more general combinations. For more general combinations, the story with respect to gradient descent becomes more mixed: when the best predictor is a linear combination of many features, gradient descent style learning may be superior to EG.
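As a sketch of how the linear case can be handled with the paired-weights idea, here is one round in Python. The scale parameter `U` bounding the linear combination and the function name are illustrative assumptions, not details from the post:

```python
import numpy as np

def eg_pm_step(w_pos, w_neg, x, y, c=0.1, U=1.0):
    """One EG+- round: the linear predictor is U * (w_pos - w_neg) . x,
    where (w_pos, w_neg) jointly form a distribution over 2n weights."""
    y_pred = U * (w_pos - w_neg) @ x
    g = 2.0 * (y_pred - y) * x            # gradient of squared loss in w
    w_pos = w_pos * np.exp(-c * U * g)    # promote/demote multiplicatively,
    w_neg = w_neg * np.exp(+c * U * g)    # in opposite directions
    z = w_pos.sum() + w_neg.sum()         # joint renormalization
    return w_pos / z, w_neg / z
```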
EG-style algorithms are slowly coming out. For example, this paper shows that EG style updates can converge very quickly compared to other methods.
Great post John — this is a class of algorithm I’ve worked on as a practical approach to some online learning problems but I confess I didn’t know that it had such solid theoretical underpinnings. I will read both papers closely. And, there are some settings in which convex combinations are more natural — for example, in tactical asset allocation, we seek an allocation of our portfolio in N assets. In general, this is a nice approach where we have to specify a budget as the output.
The essence of the result is reminiscent of the well-known property that the Rademacher complexity of a set of functions (features) is unchanged when considering the convex hull of said set. For batch learning theory using a finite set of candidates and a Lipschitz loss, we therefore get a log-cardinality complexity term for the initial set of functions as well as for its convex hull.
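For reference, the property being invoked is

```latex
\hat{\mathfrak{R}}_S(\mathrm{conv}\,\mathcal{F}) \;=\; \hat{\mathfrak{R}}_S(\mathcal{F})
```

which holds because the supremum of a linear functional over a convex hull is attained at an extreme point.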
“Tong Zhang likes to think of this algorithm as the stochastic gradient descent with entropy regularization”
May I ask which paper is it?
I don’t know if it’s in a paper—email Tong.
EG could be considered an aggressive version of Winnow, when EG adopts the hinge loss function for binary classification problems. The relation between EG and Winnow is similar to the relation between Online Gradient Descent and the Perceptron.
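A sketch of the correspondence this comment describes, assuming labels y in {-1,+1} and a unit margin (the threshold and learning rate here are illustrative):

```python
import numpy as np

def eg_hinge_step(w, x, y, c=0.5):
    """EG step on the hinge loss max(0, 1 - y * (w . x)).

    When the margin is violated, the multiplicative update is a
    normalized Winnow-style promotion/demotion; otherwise w is left
    unchanged, mirroring how the Perceptron relates to OGD."""
    if y * (w @ x) < 1.0:              # hinge loss is active
        w = w * np.exp(c * y * x)      # exp of -c * gradient; gradient is -y * x here
        w = w / w.sum()
    return w
```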