Gradient descent is currently untrendy in the machine learning community, but there remains a large number of people using gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent.
- Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”.
- The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable.
- Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be is unclear. Many people have developed many algorithms for adjusting the step size (and to some extent the step direction). Unfortunately, many of the more sophisticated algorithms are not robust to noise, scale badly with the number of parameters (Anything worse than O(n) is unacceptable for big applications) or both. Consequently, many people simply use gradient descent where the step size is adjusted by a simple momentum heuristic.
Many people would add point (4): gradient descent on many architectures does not result in a global optima. This seems like a confusion of goals to me. The goal is good performance on future examples in learning rather than achieving a global optima on the training set.
SMD addresses point (3). It is an O(n) algorithm for gradient descent that can compete with the sophisticed methods where the sophisticated methods work but remains fairly robust to noise. Exactly how well it addresses point (3) is not entirely clear, but a few interesting problems have been solved with the algorithm, and perhaps we will see more evidence in the near future.