Fast Gradient Descent

Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD).

Gradient descent is currently untrendy in the machine learning community, but a large number of people still use it on neural networks and other architectures dating from when it was trendy in the early 1990s. There are three problems with gradient descent.

  1. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”.
  2. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable.
  3. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be is unclear. Many people have developed many algorithms for adjusting the step size (and, to some extent, the step direction). Unfortunately, many of the more sophisticated algorithms are not robust to noise, scale badly with the number of parameters (anything worse than O(n) is unacceptable for big applications), or both. Consequently, many people simply use gradient descent where the step size is adjusted by a simple momentum heuristic, sketched below.
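
The momentum heuristic in point (3) is simple enough to sketch. Below is a minimal version; the quadratic objective, step size, and momentum coefficient are illustrative assumptions, not anyone's recommended settings.

```python
import numpy as np

def gd_momentum(grad, theta0, eta=0.01, mu=0.9, steps=1000):
    # Gradient descent where the step is a decaying average of past
    # gradients.  eta is the step size, mu the momentum coefficient;
    # both are illustrative and need tuning in practice.
    theta = np.array(theta0, dtype=float)
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        velocity = mu * velocity - eta * grad(theta)
        theta = theta + velocity
    return theta

# Usage on a badly scaled quadratic f(theta) = 0.5 * theta' A theta,
# the kind of problem where a single fixed step size does poorly.
A = np.diag([1.0, 10.0])
print(gd_momentum(lambda th: A @ th, [1.0, 1.0]))
```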

Many people would add point (4): gradient descent on many architectures does not find a global optimum. This seems like a confusion of goals to me. The goal in learning is good performance on future examples rather than achieving a global optimum on the training set.

SMD addresses point (3). It is an O(n) algorithm for gradient descent that can compete with the sophisticated methods where they work, while remaining fairly robust to noise. Exactly how well it addresses point (3) is not entirely clear, but a few interesting problems have been solved with the algorithm, and perhaps we will see more evidence in the near future.
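
For the curious, here is a heavily simplified sketch of the flavor of SMD as I understand it from Schraudolph's papers: each parameter gets its own gain, adapted multiplicatively using a trace of how past gain changes affected the parameters. The finite-difference Hessian-vector product below is a stand-in for his exact O(n) curvature products, and all the constants are illustrative.

```python
import numpy as np

def smd(grad, theta0, eta0=0.05, mu=0.1, lam=0.99, steps=1000, eps=1e-6):
    theta = np.array(theta0, dtype=float)
    eta = np.full_like(theta, eta0)   # per-parameter gains
    v = np.zeros_like(theta)          # trace of past gain effects
    for _ in range(steps):
        g = grad(theta)
        # multiplicative gain adaptation, clipped below at 1/2
        eta *= np.maximum(0.5, 1.0 - mu * g * v)
        theta -= eta * g
        # finite-difference stand-in for the Hessian-vector product H v
        Hv = (grad(theta + eps * v) - grad(theta)) / eps
        v = lam * v - eta * (g + lam * Hv)
    return theta
```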

“Failure” is an option

This is about the hard choices that graduate students must make.

The cultural definition of success in academic research is to:

  1. Produce good research which many other people appreciate.
  2. Produce many students who go on to do the same.

There are fundamental reasons why this is success in the local culture. Good research appreciated by others means access to jobs. Many students successful in the same way implies that there are a number of people who think in a similar way and appreciate your work.

In order to graduate, a PhD student must live in an academic culture for a period of several years. It is common to adopt the culture’s definition of success during this time. It is also common for many PhD students to discover they are not suited to an academic research lifestyle. This collision of values and abilities naturally results in depression.

The most fundamental advice when this happens is: change something. Pick a new advisor. Pick a new research topic. Or leave the program (and do something else with your life).

The first two are relatively easy, but “do something else with your life” is a hard choice for a PhD student to make, because they are immersed in (and have adopted) a value system that does not value that choice. Remember here that the academic value system is not a universal value system. For example, many people want to do something that is immediately constructive and find this at odds with academic research (which is almost defined by “not immediate”). The world is big enough and diverse enough to support multiple value systems. Realizing this may be the key to making very good decisions in your life. A number of my friends made this decision and went to Google or investment banking, places where they are deliriously happier (and more productive) than in their former lives.

Online Learning as the Mathematics of Accountability

Accountability is a social problem. When someone screws up, do you fire them? Or do you accept the error and let them continue? This is a very difficult problem and we all know of stories where the wrong decision was made.

Online learning (as meant here) is a subfield of learning theory which analyzes the online learning model.

In the online learning model, there is a set of hypotheses or “experts”. On any instance x, each expert makes a prediction y. A master algorithm A uses these predictions to form its own prediction y_A and then learns the correct prediction y*. This process repeats.

The goal of online learning is to find a master algorithm A which uses the advice of the experts to make good predictions. In particular, we typically want to guarantee that the master algorithm performs almost as well as the best expert. If L(e) is the loss of expert e and L(A) is the loss of the master algorithm, it is often possible to prove:

L(A) ≤ min_e L(e) + log(number of experts)

over all sequences.

In particular, there is no assumption of independent samples and there is no assumption that the experts perform well (or can perform well). This is not a high probability statement: it simply always holds. These assumption-free qualities are very important for application to the accountability problem, because the experts really can be adversarial.
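
For concreteness, here is a minimal sketch of one classic master algorithm of this type, exponential weights. The learning rate eta and the loss matrix layout are illustrative; the exact form of the guarantee depends on the loss (the clean log(number of experts) bound above holds for log loss, while general bounded losses pick up an extra factor).

```python
import numpy as np

def exponential_weights(expert_losses, eta=0.5):
    # expert_losses[t, e] = loss of expert e at round t, revealed
    # after the master commits to its weights for that round.
    T, N = expert_losses.shape
    log_w = np.zeros(N)                     # uniform initial weights
    weights_used = []
    for t in range(T):
        w = np.exp(log_w - log_w.max())     # stable exponentiation
        weights_used.append(w / w.sum())    # master's mixture over experts
        log_w -= eta * expert_losses[t]     # downweight losing experts
    return np.array(weights_used)

# Usage: 100 rounds, 5 experts, losses in [0, 1].
rng = np.random.default_rng(0)
weights = exponential_weights(rng.random((100, 5)))
```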

In any situation where we have a set of human experts giving advice on the same subject, we can hope to apply online learning algorithms to better distill the collective advice into a single prediction. Examples include:

  1. stock picking: Many of the big stocks have ‘analysts’ who predict whether they will go up or down. For an example, look here.
  2. surveys: It is common to see things like “A gain of 1.4 percent was expected, based on the median estimate in a Bloomberg survey of 53 economists” in news articles. Presumably, these economists are reused frequently, implying they have a record to which an online algorithm could be applied.

This application of online learning isn’t trivial. Even for the above examples, it isn’t clear how to handle issues like:

  1. A new expert starts making predictions. There are some reasonable ad-hoc mechanisms for coping with this in the context of particular algorithms.
  2. An expert declines to make a prediction. The modified “sleeping experts” setting handles this (see the sketch after this list), but the results are not quite as tight.
  3. The loss associated with individual predictions is highly variable rather than something simple like “0/1-loss” or “squared error loss”. One approach is to combine regret-minimizing learning reductions with online learning algorithms (drawback: the reduced predictions may not be about intuitive quantities). Another approach is simply to make very flexible master algorithms (drawback: flexibility often comes with a weakening of the theoretical guarantees).
  4. In the real world, we may not have feedback about a prediction until after the next 10 predictions (or more) need to be made.
  5. In the real world, there may be uncertainty about the measured truth. Quantifying GDP growth requires a lot of work and has some fundamental uncertainty associated with it, especially when immediate feedback is required.
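
As an example of point (2), here is a sketch of one round of a sleeping-experts style update, in the spirit of Freund, Schapire, Singer, and Warmuth. The renormalization rule and eta are my rendering for illustration, not a quote of any particular paper’s pseudocode; it assumes at least one expert is awake each round.

```python
import numpy as np

def sleeping_experts_round(w, awake, losses, eta=0.5):
    # w: current weights over all experts; awake: boolean mask of
    # experts predicting this round; losses: per-expert losses
    # (entries for sleeping experts are ignored).
    mass_before = w[awake].sum()
    p = np.zeros_like(w)
    p[awake] = w[awake] / mass_before       # master's weights this round
    w_new = w.copy()
    w_new[awake] *= np.exp(-eta * losses[awake])
    # Rescale so the awake pool keeps its total mass; sleeping
    # experts are untouched, so they are not penalized for abstaining.
    w_new[awake] *= mass_before / w_new[awake].sum()
    return p, w_new
```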

Site Update

I tweaked the site in a number of ways today, including:

  1. Updating to WordPress 1.5.
  2. Installing and heavily tweaking the Geekniche theme. Update: I switched back to a tweaked version of the old theme.
  3. Adding the Customizable Post Listings plugin.
  4. Installing the StatTraq plugin.
  5. Updating some of the links. I particularly recommend looking at the computer research policy blog.
  6. Adding threaded comments. This doesn’t thread old comments obviously, but the extra structure may be helpful for new ones.

Overall, I think this is an improvement, and it addresses a few of my earlier problems. If you have any difficulties or anything seems “not quite right”, please speak up. A few other tweaks to the site may happen in the near future.