Fast Gradient Descent

Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD).

Gradient descent is currently untrendy in the machine learning community, but many people still use it on neural networks and other architectures dating from its heyday in the early 1990s. There are three problems with gradient descent.

  1. Gradient descent does not necessarily produce easily reproducible results. Typical algorithms start with “set the initial parameters to small random values”, so two runs with different seeds can land in different solutions.
  2. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires expertise that has not been distilled into an easily applied form.
  3. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used, and what exactly that step size should be is unclear. Many algorithms have been developed for adjusting the step size (and, to some extent, the step direction), but many of the more sophisticated ones are not robust to noise, scale badly with the number of parameters (anything worse than O(n) is unacceptable for big applications), or both. Consequently, many people simply use gradient descent with a step size adjusted by a simple momentum heuristic (a minimal sketch follows this list).
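
For concreteness, the momentum heuristic in point (3) amounts to something like the following sketch; the step size and momentum constants here are illustrative choices, not recommendations.

```python
import numpy as np

def sgd_momentum(grad, w, lr=0.01, beta=0.9, steps=1000):
    """Gradient descent with the simple momentum heuristic.
    grad(w) returns the (possibly stochastic) gradient at w."""
    velocity = np.zeros_like(w)
    for _ in range(steps):
        # Exponentially decayed average of past steps: consistent
        # gradients build up speed, oscillating ones largely cancel.
        velocity = beta * velocity - lr * grad(w)
        w = w + velocity
    return w
```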

Many people would add point (4): gradient descent on many architectures does not reach a global optimum. This seems like a confusion of goals to me. The goal of learning is good performance on future examples, not a global optimum on the training set.

SMD addresses point (3). It is an O(n) gradient descent algorithm that can compete with the sophisticated methods where those methods work, while remaining fairly robust to noise. Exactly how well it addresses point (3) is not yet clear, but a few interesting problems have been solved with the algorithm, and perhaps we will see more evidence in the near future.
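
To make the O(n) claim concrete, here is a rough Python/NumPy sketch of the SMD update as I understand it from Schraudolph's papers. The sign conventions (written here for minimization), the clipping constant, and the default values of the meta-rate mu and decay lambda are my assumptions rather than a definitive implementation; the key point is that every operation, including the Hessian-vector product (computable with Pearlmutter's R-operator), is O(n) in the number of parameters.

```python
import numpy as np

def smd(grad, hvp, w, eta0=0.05, mu=0.1, lam=0.99, steps=1000):
    """Sketch of Stochastic Meta-Descent (minimization convention).

    grad(w)   -> stochastic gradient of the loss at w (mini-batch is fine)
    hvp(w, v) -> Hessian-vector product H(w) @ v; computable in O(n)
                 via Pearlmutter's R-operator, without ever forming H.
    """
    eta = np.full_like(w, eta0)   # per-parameter gains (local step sizes)
    v = np.zeros_like(w)          # v approximates d w / d log(eta)
    for _ in range(steps):
        g = grad(w)
        Hv = hvp(w, v)
        # Meta-step: raise a gain when its parameter has been moving
        # consistently downhill, lower it when steps overshoot; the
        # max(0.5, ...) clip keeps one noisy sample from collapsing a gain.
        eta *= np.maximum(0.5, 1.0 - mu * g * v)
        w = w - eta * g                      # ordinary SGD step, local gains
        v = lam * v - eta * (g + lam * Hv)   # sensitivity to the log-gains
    return w
```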

14 Replies to “Fast Gradient Descent”

  1. Not related at all. CCCP is a way to globally optimize some non-convex functions by turning them into a series (in fact, two nested loops, in the general case) of convex optimization problems. It uses a standard local optimizer such as CG under the hood to solve those convex problems (see the sketch at the end of this comment).

    SMD, by contrast, is a highly scalable (O(n) cost per iteration) local optimizer, that is, an alternative to methods like CG. It shines when gradients are stochastically approximated (“online” learning, not “batch”), where CG methods break down – see these tutorial slides.

    To my knowledge nobody has yet looked at whether/how either or both of CCCP’s loops could be stochastically approximated without destroying its overall convergence properties. If that can be done, SMD would be a natural choice for putting under the hood of such an “online CCCP” algorithm.

    – nic

    PS: I’ve yet to attend NIPS this century – I stopped going in 1999 out of frustration over the bad quality of their refereeing for anything outside an inbred set of core areas.
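
    For readers who haven't met CCCP, here is a minimal sketch of the nested-loop structure described above; `convex_part`, `concave_grad`, and `local_minimize` are hypothetical stand-ins for the problem's convex/concave decomposition and for whatever local optimizer (CG, L-BFGS, ...) sits under the hood.

    ```python
    def cccp(convex_part, concave_grad, local_minimize, x, iters=50):
        """Sketch of the concave-convex procedure (CCCP).

        Minimizes f(x) = u(x) + v(x), u convex and v concave, by
        repeatedly linearizing v at the current iterate and handing the
        resulting convex surrogate to a standard local optimizer.
        """
        for _ in range(iters):
            s = concave_grad(x)  # gradient of the concave part at x_k
            # Outer loop: u(y) + s @ y upper-bounds f(y) up to a constant,
            # so each outer iteration cannot increase f.
            # Inner loop: local_minimize (CG, L-BFGS, ...) runs in here.
            x = local_minimize(lambda y: convex_part(y) + s @ y, x)
        return x
    ```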

  2. It’s amazing to me that people still use gradient descent (including “backpropagation,” by which I refer to gradient descent with the adjoint method for neural networks, not just the adjoint method) instead of quasi-Newton optimization. I have never heard of a situation where gradient descent is better than quasi-Newton, and it is often much, much worse. Implementation shouldn’t be a factor, since the Fortran L-BFGS library is excellent; I use it all the time, with a C++ wrapper. Using an optimization package means you get all the benefits of line-search parameters etc. that have been carefully tweaked by numerical analysis wizards so that you don’t have to (a minimal usage sketch follows this comment). There are some okay optimizers for MATLAB (like NETLAB’s SCG), but if performance is an issue, then using MATLAB is not necessarily a good idea.

    Most numerical optimization algorithms have issues with local minima, and some with parameter tuning (including quasi-Newton methods, EM and variational learning, annealing, etc.). But holding up gradient descent as *the* search method seems crazy.

    I like the idea of SMD in general. However, a student of mine tried to use something like SMD for a shape-from-video project, but couldn’t get it to help. Each evaluation of the objective function and its gradient was very expensive. Using SMD with fixed step sizes didn’t converge to a good result, and using line search meant many redundant objective function evaluations, whereas L-BFGS with the full objective function reuses information between steps in clever ways. I think there was also something specific to this case where sampling prevented him from using some preconditioning that had been helpful. (Actually, I can’t say with confidence that what we were using is the same as SMD, since it’s been a while since I read the paper; basically, he was randomly sampling a subset of the error terms rather than using all of them.)

    In general, it sounds like SMD requires a careful choice of sampling in order to get sensible gradient estimates. (Incidentally, there seems to be some similarity with contrastive divergence, which does pretty well with only a single sample, and with REINFORCE, which, as far as I know, does not.)
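
    To illustrate the “just call a good optimization package” point above, here is a minimal example using SciPy's wrapper around that same Fortran L-BFGS code; the ill-conditioned quadratic is only a placeholder objective.

    ```python
    import numpy as np
    from scipy.optimize import minimize

    A = np.diag([1.0, 100.0])   # placeholder: an ill-conditioned quadratic

    def f(x):
        return 0.5 * x @ A @ x

    def grad(x):
        return A @ x

    # L-BFGS-B handles line searches and step sizes internally; passing
    # the analytic gradient via jac= avoids finite-difference estimates.
    result = minimize(f, x0=np.array([1.0, 1.0]), jac=grad, method="L-BFGS-B")
    print(result.x, result.nit)
    ```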

  3. Also, I think that some of the objections above are questionable.

    (1) It is often possible to use a deterministic initialization procedure. I’ve done this in all of my recent papers that use numerical optimization (e.g., “initialize with PCA”).

    (2) Is this referring to performance in the sense of speed or of finding a good solution? If the latter, gradient-based optimization is much more flexible than limiting yourself to problems that can be cleverly solved with a closed-form algorithm. As long as you know how to compute derivatives, you’re set. (Heck, you can even avoid that by using automatic differentiation software; a toy example follows this comment.)

    Local minima and speed are huge problems with iterative optimization algorithms, but limiting oneself to problems with closed-form optima seems to close many doors.
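
    On the automatic-differentiation aside, a toy example of the genre (JAX here, chosen only for familiarity; any autodiff package would do):

    ```python
    import jax
    import jax.numpy as jnp

    def f(x):
        return jnp.sum(jnp.tanh(x) ** 2)   # any differentiable composition

    grad_f = jax.grad(f)                   # derivative code generated for us
    print(grad_f(jnp.arange(3.0)))
    ```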

  4. Quasi-Newton is (a) O((number of weights)^2) and (b) inherently batch rather than online. There are a number of applications that have many weights and/or require online updates. You may not have encountered these applications yet, but they certainly exist. One example is Yann LeCun’s work.

  5. It’s definitely true that objection (1) has more to do with historic use than necessity.

    For (2), I meant ‘the latter’. I have no argument about flexibility, but rather about whether flexibility without guidance is good.

  6. It’s been a while since I’ve read Nic’s papers myself but I thought that the whole point of SMD was that it was an online algorithm that converged to an approximate second-order method.

    SMD worked great for me, though I was training an MLP on one of those problems with online-only updates and lots of weights. But SMD was not really what cracked the problem in the end; it was another of Nic’s insights that made all the difference: fitting the hidden units to predict the residuals after the short-cut weights had been fully trained.
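
    If I read the trick right, it is a two-stage fit along these lines; the least-squares stage and the placeholder data are my guesses at a minimal version, not Nic's actual recipe.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))   # placeholder inputs
    y = rng.normal(size=(500, 1))    # placeholder targets

    # Stage 1: fit the linear short-cut (input-to-output) weights fully,
    # here by ordinary least squares.
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ W

    # Stage 2: train the hidden units to predict `residual` instead of y,
    # so they only model the structure the linear part missed.
    print("residual norm:", np.linalg.norm(residual))
    ```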

  7. Except for batch vs. online, there is another problem which I run into all the time, so please share your insights if you have experience with this (do I get a mail notification here?). Suppose you have code to compute f(x) and its gradient g(x), but it’s approximate, so you can expect an ABSOLUTE error eps on BOTH f(x) and each component of g(x). eps can be made small by paying more, but it’s there and it’s absolute. Standard line-search methods do not cope with this at all, because now any finite difference [f(x+u)-f(x)]/u for small u is wildly wrong if u is of order eps. And as I understand it, standard code for L-BFGS (my favorite as well) relies heavily on standard line searches. Things like bracketing are also difficult now. On the other hand, if eps is small, g(x) clearly contains valuable information even with an error of eps, and I would expect at least to find some x_* such that f(x_*) is within about eps of a local minimum.
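
    A quick numeric illustration of why the line search breaks: with absolute error eps on each evaluation of f, the finite difference [f(x+u)-f(x)]/u carries noise of order 2*eps/u, which blows up exactly as u shrinks (the quadratic test function and the value of eps below are arbitrary choices).

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    eps = 1e-6                       # absolute error on each evaluation of f

    def f_noisy(x):
        return x**2 + rng.uniform(-eps, eps)

    x = 1.0                          # true derivative at x is 2.0
    for u in [1e-2, 1e-4, 1e-6, 1e-8]:
        fd = (f_noisy(x + u) - f_noisy(x)) / u
        # worst-case noise contribution is 2*eps/u: dominant once u ~ eps
        print(f"u={u:.0e}  finite diff={fd:+.4f}  noise bound={2*eps/u:.1e}")
    ```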

  8. Whatever your student has been using, it certainly wasn’t SMD if it used either fixed step sizes or line searches.

    The problem is indeed that many things that optimizers take for granted, such as line searches, conjugate search directions, etc. break down under stochastic approximation. In essence, you’re only allowed to use linear operations; naturally this makes it quite challenging to build an efficient stochastic gradient optimizer.

    On the other hand, stochastic gradient can give you a huge speed-up factor upfront by being sub-linear in the data set size. For large, redundant data sets, this often far outweighs any slower convergence. For instance, we find SMD beating the pants off (as in: converging 30 times faster than) L-BFGS for CRF training.

  9. Yes, it became clear that what we were doing wasn’t SMD, when we discussed it at NIPS. (I think I had gotten the wrong impression of it from talking to one of your students). Now I’d like to try doing it the right way…

  10. What are the advantages of using Levenberg-Marquardt over stochastic gradient descent for deep networks? Are there any good methods for fast training of deep networks?
