One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talk and several others may be approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: Given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance.
It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down.
- Sometimes there are clever approximations of the gradient. (Simon Osindero)
- Sometimes we can compute constrained gradients via iterated gradient/project steps. (Ben Taskar)
- Sometimes we can compute gradients anyways over mildly nondifferentiable functions. (Drew Bagnell)
- Even given a gradient, the choice of update is unclear, and might be cleverly chosen (Nic Schraudolph)
Perhaps a more extreme example of this is Adaboost which repeatedly reuses a classifier learner to implicitly optimize a gradient. Viewed as a gradient optimization algorithm, Adaboost is a sublinear algorithm (in the number of implicit parameters) when applied to decision trees.