Exponentiated Gradient

The Exponentiated Gradient algorithm by Manfred Warmuth and Jyrki Kivinen came out just as I was starting graduate school, so I missed it both at a conference and in class. It’s a fine algorithm which has a remarkable theoretical statement accompanying it.

The essential statement holds in the “online learning with an adversary” setting. Initially, there is a set of n weights, which might have values (1/n, …, 1/n) (or any other values forming a probability distribution). Everything happens in a round-by-round fashion. On each round, the following happens:

  1. The world reveals a set of features x in {0,1}^n. In the online learning with an adversary literature, the features are called “experts” and thought of as subpredictors, but this interpretation isn’t necessary—you can just use feature values as experts (or maybe the feature value and the negation of the feature value as two experts).
  2. EG makes a prediction according to y’ = w . x (dot product).
  3. The world reveals the truth y in [0,1].
  4. EG updates the weights according to w_i <- w_i e^{-2c(y' - y)x_i}. Here c is a small constant learning rate.
  5. The weights are renormalized to sum to 1.

The 4th step is what gives EG its name: the exponent of the negative gradient (of squared loss in this case).
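To make the round structure concrete, here is a minimal numpy sketch of the algorithm above. The learning rate, the number of rounds, and the toy world (where the truth happens to equal the first feature) are illustrative assumptions, not choices from the paper.

```python
import numpy as np

def eg_round(w, x, y, c=0.1):
    """One EG round: predict, observe the truth, update, renormalize."""
    y_pred = np.dot(w, x)                         # step 2: y' = w . x
    w = w * np.exp(-2.0 * c * (y_pred - y) * x)   # step 4: exponentiated negative gradient
    return w / w.sum()                            # step 5: renormalize to sum to 1

n = 5
w = np.ones(n) / n  # start from the uniform distribution (1/n, ..., 1/n)
rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.integers(0, 2, size=n).astype(float)  # step 1: features in {0,1}^n
    y = x[0]  # toy world (an assumption): the first feature is the truth
    w = eg_round(w, x, y)
print(w)
```

On this toy stream the weight on the first feature grows toward 1, since that feature alone predicts the truth exactly.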

The accompanying theoretical statement (in English) is that, for all sequences of observations, this algorithm does nearly as well in squared loss as the best convex combination of the features, with a regret that is only logarithmic in the number of features. The mathematical theorem statement is: for all T and all T-length sequences of observations,

Sum_{t=1}^{T} (y'_t - y_t)^2 <= min_{probability distributions q} [ (2/(2-c)) Sum_{t=1}^{T} (q . x_t - y_t)^2 + KL(q||p) / c ]

Here KL(q||p) = Sum_i q_i ln(q_i / p_i) is the KL-divergence between the distribution q that you compare with and the distribution p that you start with. The KL-divergence is at most ln n if p is uniform.
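To see the last claim, note that with uniform p every p_i = 1/n, so the bound follows in one line:

```latex
\mathrm{KL}(q\|p) = \sum_i q_i \ln\frac{q_i}{1/n}
                  = \ln n + \sum_i q_i \ln q_i
                  = \ln n - H(q) \le \ln n
```

since the entropy H(q) = -Sum_i q_i ln q_i is nonnegative.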

The learning rate c plays a critical role in the theorem, and the best constant setting of c depends on how many total rounds T there are. Tong Zhang likes to think of this algorithm as stochastic gradient descent with entropy regularization, which makes it clear that when used as an online optimization algorithm, c should be gradually decreased in value.

There are many things right here.

  1. Exponentiated Gradient competes with the best convex predictor with no caveats about how the world produces the data. This is pretty remarkable—it’s much stronger than competing with the best single feature, as many other online adversary algorithms do. The lack of assumptions about the world means this is a pretty universal algorithm.
  2. The regret is only logarithmic in the number of features, so the algorithm can give good performance even when the number of features is extraordinarily large compared to the number of examples. In contrast, gradient-descent-style algorithms might need a number of examples similar to the number of features. (The same paper also analyzes regret for gradient descent.)
  3. We know from a learning reductions point of view that the ability to optimize squared loss well implies the ability to optimize many other losses fairly well.

A few things aren’t right about EG. A convex combination is not as powerful a representation as a linear combination. There are natural relaxations covered in the EG paper which deal with more general combinations, as well as in some other papers. For more general combinations, the story with respect to gradient descent becomes more mixed. When the best predictor is a linear combination of many features, gradient descent style learning may be superior to EG.

EG-style algorithms are slowly coming out. For example, this paper shows that EG style updates can converge very quickly compared to other methods.

Motivation should be the Responsibility of the Reviewer

The prevailing wisdom in machine learning seems to be that motivating a paper is the responsibility of the author. I think this is a harmful view—instead, it’s healthier for the community to regard this as the responsibility of the reviewer.

There are lots of reasons to prefer a reviewer-responsibility approach.

  1. Authors are the most biased possible source of information about the motivation of the paper. Systems which rely upon very biased sources of information are inherently unreliable.
  2. Authors are highly variable in their ability and desire to express motivation for their work. This adds greatly to the variance in acceptance of an idea, and it can systematically hinder or accentuate careers. It’s great if you have a career accentuated by awesome wording choice, but wise decision making by reviewers is important for the field.
  3. The motivation section in a paper doesn’t do anything in some sense—it’s there to get the paper in. Reading the motivation of a paper is of little use in helping the reader solve new problems.
  4. Many motivation sections are a waste of time. The 30th paper on a subject should not require a motivation as if it’s the first paper on a subject, and requiring or expecting this of authors is an exercise in busy work by the research community.

Some caveats to make sure I’m understood:

  1. I’m not advocating the complete removal of a motivation section (motivectomy?), which would be absurd (and frankly harmful to your career). A paragraph describing common examples where the problem addressed comes up is desirable for readers who are not specialists. This paragraph should not be in the abstract, where it seems to often sneak in.
  2. I’m also not arguing against discussion of motivations. I regard discussion of motivations as quite important, and totally unsuited to the paper format. It’s hard to imagine any worse method for discussion than one with a year-size latency where quasi-anonymous people are quasi-randomly paired and each attempts to accomplish several different tasks one of which happens to be a one-sided discussion of motivation. A blog can work much better for this sort of thing, and I definitely invite discussion on motivational questions.

So, how do we change the prevailing wisdom? The answer is always “gradually”, but there are a number of steps we can take.

  1. As an author, one clever technique is to pass serious discussion of motivation by reference. “For a general discussion and motivation of this problem see [].” This would save space in the large number of papers which attempt to address an old problem better than previous approaches.
  2. Participate in public discussion of motivations. We need to encourage a real mechanism for discussion. Until these alternative (and far better) formats for discussion are developed, the problem of “who motivates” will always exist.
  3. Have private discussions about motivation where you can. Random conversations at conferences are great for this, and the process often sharpens your appreciation.
  4. Learn to take responsibility for motivation as a reviewer. This might sound hard, but it’s actually somewhat easier than careful evaluation of technical content in my experience.
    1. The first step is to disbelieve all the motivational parts of a paper by default. As mentioned above, the authors are not a reliable source anyways. Skip it and move on.
    2. Make sure you understand the problem being addressed.
    3. Make sure you understand how well the problem is addressed, relative to previous work.
    4. Think about how important that increment is. This is not equivalent to asking “how many people will appreciate the increment?” which is a popularity question. Frankly, all of Machine Learning fails the popularity test in a wider sense, even though many people appreciate the fruits of machine learning on a daily basis. First, think about the problem.
      1. How many people might a solution to the problem help? 0 is fairly common amongst submitted papers.
      2. How much would it help them? If it’s “a lot”, then that should add a bit to the importance of the paper.
      3. How familiar are you with the problem? If not very, then it’s appropriate to give the benefit of the doubt to the authors.

      Think about the solution.

      1. This solution might be useful to some other researchers who come up with something useful. This is a warning sign.
      2. This solution might be useful to me in coming up with a useful algorithm for solving problems.
      3. This paper improves an algorithm. This is also fairly common. It should be improving an algorithm with a reasonable claim at being the best method for solving some problem.
      4. This paper can provide improvements to many algorithms. Theory papers often fall here, but they can also fall under (1) or (2) easily.

      Now, take these considerations into account in forming your own opinion about how motivated the paper is.

  5. Go multimodel. If you only know one model of what machine learning is, you don’t really know machine learning. Learn multiple ideas of what machine learning is, and actively consider their merits and downsides.

The View From China

I’m visiting Beijing for the Pao-Lu Hsu Statistics Conference on Machine Learning.

I had several discussions about the state of Chinese research. Given the large population and economy, you might expect substantial research—more than has been observed at international conferences. The fundamental problem seems to be the Cultural Revolution, which lobotomized higher education and the research associated with it. There has been a process of slow recovery since then, which has begun to be felt in the research world via increased participation in international conferences and (now) conferences in China.

The amount of effort going into construction in Beijing is very impressive—people are literally building a skyscraper at night outside the window of the hotel I’m staying at (and this is not unusual). If a small fraction of this effort is later focused onto supporting research, the effect could be very substantial. General growth in China’s research portfolio should be expected.

Idempotent-capable Predictors

One way to distinguish different learning algorithms is by their ability or inability to easily use an input variable as the predicted output. This is desirable for at least two reasons:

  1. Modularity If we want to build complex learning systems via reuse of a subsystem, it’s important to have compatible I/O.
  2. “Prior” knowledge Machine learning is often applied in situations where we do have some knowledge of what the right solution is, often in the form of an existing system. In such situations, it’s good to start with a learning algorithm that can be at least as good as any existing system.

When doing classification, most learning algorithms can do this. For example, a decision tree can split on a feature, and then classify. The real differences come up when we attempt regression. Many of the algorithms we know and commonly use are not idempotent predictors.

  1. Logistic regressors cannot be idempotent, because all input features are mapped through a nonlinearity.
  2. Linear regressors can be idempotent—they just set the weight on one input feature to 1 and all other features to 0 (see the sketch after this list).
  3. Regression trees are not idempotent, or (at least) not easily idempotent. In order to predict the same as an input feature, that input feature must be split many times.
  4. Bayesian approaches may or may not be easily idempotent, depending on the structure of the Bayesian Prior.
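To make items 1 and 2 concrete, here is a minimal numpy sketch; the particular weight and feature vectors are hypothetical.

```python
import numpy as np

# A linear regressor is idempotent-capable: the weight vector
# (1, 0, ..., 0) reproduces the first input feature exactly.
def linear_predict(w, x):
    return np.dot(w, x)

w = np.array([1.0, 0.0, 0.0, 0.0])
x = np.array([0.7, 0.2, 0.9, 0.4])
assert linear_predict(w, x) == x[0]

# A logistic regressor maps w . x through a sigmoid, so no finite
# weight setting can reproduce an input feature in [0,1] exactly.
def logistic_predict(w, x):
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

print(logistic_predict(w, x), "vs", x[0])  # about 0.668 vs 0.7
```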

It isn’t clear how important the idempotent-capable property is. Successive approximation approaches such as boosting can approximate it in a fairly automatic manner (a sketch below illustrates this). It may be of substantial importance for large modular systems where efficiency is important.
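As an illustration of the successive approximation point, here is a hedged sketch using scikit-learn (a library choice of mine, not the post’s): shallow regression trees boosted on residuals come close to the identity map y = x that a single small tree cannot represent.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1000, 1))
y = X[:, 0]  # target equals the input feature: the idempotent case

pred = np.zeros_like(y)
for _ in range(20):  # 20 rounds is an arbitrary illustrative choice
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += tree.predict(X)

print(np.mean((y - pred) ** 2))  # squared error shrinks with each round
```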