There are a number of learning algorithms which explicitly incorporate randomness into their execution. This includes, amongst others:
- Neural Networks. Neural networks use randomization to assign initial weights.
- Boltzmann Machines/Deep Belief Networks. Boltzmann machines are something like a stochastic version of multinode logistic regression. The use of randomness is more essential in Boltzmann machines, because the predicted value at test time also uses randomness.
- Bagging. Bagging is a process where a learning algorithm is run several times on different bootstrap resamples of the training data, creating a final predictor which takes a majority vote (a minimal sketch appears after this list).
- Policy descent. Several algorithms in reinforcement learning such as Conservative Policy Iteration use random bits to create stochastic policies.
- Experts algorithms. The randomized weighted majority algorithm uses random bits as part of the prediction process to achieve better theoretical guarantees.
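To make the bagging item concrete, here is a minimal sketch. The decision stump base learner, the toy data, and the number of rounds are all made up for illustration; only the bootstrap-and-vote recipe is the point.

```python
# A minimal bagging sketch: train one base learner per bootstrap resample
# of the training set and predict by majority vote.
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Pick the single feature/threshold split with the lowest training error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(sign * (X[:, j] - t) > 0, 1, -1)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best[1:]

def predict_stump(stump, X):
    j, t, sign = stump
    return np.where(sign * (X[:, j] - t) > 0, 1, -1)

def bag(X, y, n_rounds=25):
    """Train one stump per bootstrap resample of the training set."""
    stumps = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

def predict_bagged(stumps, X):
    """Majority vote over the individual stumps."""
    votes = np.sum([predict_stump(s, X) for s in stumps], axis=0)
    return np.where(votes >= 0, 1, -1)

# Toy data: the label is the sign of a noisy linear function of two features.
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200) > 0, 1, -1)
print("single stump error:", np.mean(predict_stump(fit_stump(X, y), X) != y))
print("bagged vote error: ", np.mean(predict_bagged(bag(X, y), X) != y))
```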
A basic question is: “Should there be explicit randomization in learning algorithms?” It seems perverse to feed extra random bits into your prediction process since they don’t contain any information about the problem itself. Can we avoid using random numbers? This question is not just philosophy: we might hope that deterministic versions of learning algorithms are both more accurate and faster.
There seem to be several distinct uses for randomization.
- Symmetry breaking. In the case of a neural network, if every weight started at 0, the gradient of the loss with respect to every weight would be the same, implying that after updating, all weights remain the same. Using random numbers to initialize weights breaks this symmetry (the sketch after this list makes the failure mode concrete). It is easy to believe that there are good deterministic methods for symmetry breaking.
- Overfit avoidance. A basic observation is that deterministic learning algorithms tend to overfit. Bagging avoids this by randomizing the input of these learning algorithms in the hope that directions of overfit for individual predictions cancel out. Similarly, using random bits internally as in a deep belief network avoids overfitting by forcing the algorithm to learn a robust-to-noise set of internal weights, which are then robust-to-overfit. Large margin learning algorithms and maximum entropy learning algorithms can be understood as deterministic operations attempting to achieve the same goal. A significant gap remains between randomized and deterministic learning algorithms: the deterministic versions just deal with linear predictions while the randomized techniques seem to yield improvements in general.
- Continuizing. In reinforcement learning, it’s hard to optimize a policy over multiple timesteps because the optimal decision at timestep 2 is dependent on the decision at timestep 1 and vice versa. Randomized interpolation of policies offers a method to remove this cyclic dependency. PSDP can be understood as a derandomization of CPI which trades increased computation (learning a new predictor for each timestep individually) for determinism. Whether or not we can avoid a tradeoff in general is unclear.
- Adversary defeating. Some algorithms, such as randomized weighted majority are designed to work against adversaries who know your algorithm, except for random bits. The randomization here is provably essential, but the setting is often far more adversarial than the real world.
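Here is a small sketch of the symmetry-breaking point, assuming a tiny one-hidden-layer network trained by plain gradient descent (the network, data, and step size are invented for illustration): with identical initial weights every hidden unit receives the same gradient and the units never differentiate, while a random initialization breaks the tie.

```python
# With identical initial weights every hidden unit of a small network gets
# the same gradient, so gradient descent can never make the units differ.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # toy inputs
y = np.sin(X[:, 0]) - X[:, 1] ** 2              # toy regression targets

def train(W1, w2, steps=100, lr=0.1):
    for _ in range(steps):
        H = np.tanh(X @ W1.T)                   # hidden activations, (n, k)
        resid = H @ w2 - y                      # prediction error, (n,)
        grad_w2 = H.T @ resid / len(X)
        grad_W1 = ((resid[:, None] * (1 - H ** 2) * w2).T @ X) / len(X)
        w2 -= lr * grad_w2
        W1 -= lr * grad_W1
    return W1

k = 4                                           # number of hidden units
# Symmetric start: all hidden weights zero, output weights all one.
W1_sym = train(np.zeros((k, 3)), np.ones(k))
# Random start: tiny random hidden weights break the tie.
W1_rnd = train(0.1 * rng.normal(size=(k, 3)), np.ones(k))

print("rows identical after symmetric init:", np.allclose(W1_sym, W1_sym[0]))
print("rows identical after random init:   ", np.allclose(W1_rnd, W1_rnd[0]))
```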
The current state-of-the-art is that random bits provide performance (computational and predictive) which we don’t know (or at least can’t prove we know) how to achieve without randomization. Can randomization be removed or is it essential to good learning algorithms?
Some authors had already (some time ago) suggested using specific distributions of points (still probabilistically derived) for neural network initialization, specifically for the reason you mention: breaking the distribution of observations up. I have been experimenting with designed (deterministic) initializations, but have no results to report as yet.
Hi, I’d say randomization is almost unavoidable when you deal with the problem of estimating the parameters of a non-linear function: basically the function shape is unknown, so one has to explore regions of parameter space locally to find out what the local minimum is in the current region. Randomizing the initial parameter guess allows us to easily explore different regions without biasing our search in any direction. One could perhaps deterministically define “unbiased” starting points which would allow us to explore the parameter space satisfactorily, but I am unaware of previous such attempts.
One of the most important benefits of randomization is getting out of rarely occurring bad cases (pessimistic inputs, local minima, etc.). So, if our deterministic algorithm does well in the majority of cases but sometimes behaves badly, then randomization decreases performance a little on most cases but solves the problem on the bad cases.
This is especially important when the algorithm is iterative, i.e. repeats some action over and over again. In such a case the bad cases will occur often.
In RL, a stochastic policy often represents our belief about how good we think a particular action is in a particular state (e.g., an eps-greedy policy computed from a Q-function). Do you have any insight into how derandomization would work in this context?
Even-Dar and Mansour have a nice NIPS paper about derandomizing epsilon-greedy exploration in Q-learning. They do it with a deterministic “exploration bonus”, which is adjusted appropriately.
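This is not the Even-Dar and Mansour algorithm itself, just a rough sketch of the flavor of the idea: replace the random bits of epsilon-greedy with a deterministic, count-based exploration bonus that shrinks as a state-action pair gets visited. The chain MDP, the bonus schedule, and all constants below are made up.

```python
# Q-learning on a tiny chain MDP with a deterministic exploration bonus
# instead of epsilon-greedy randomness.
import numpy as np

n_states, n_actions = 5, 2          # small chain: action 1 moves right, 0 moves left
Q = np.zeros((n_states, n_actions))
counts = np.ones((n_states, n_actions))   # visit counts, start at 1 to avoid /0
gamma, lr = 0.9, 0.2

def step(s, a):
    """Toy dynamics: reward 1 only for reaching the right end of the chain."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

s = 0
for t in range(2000):
    # Deterministic action choice: greedy on Q plus a count-based bonus.
    bonus = 1.0 / np.sqrt(counts[s])
    a = int(np.argmax(Q[s] + bonus))
    counts[s, a] += 1
    s_next, r = step(s, a)
    # Standard Q-learning update.
    Q[s, a] += lr * (r + gamma * Q[s_next].max() - Q[s, a])
    s = 0 if s_next == n_states - 1 else s_next   # reset the episode at the goal

print("greedy policy (0=left, 1=right):", Q.argmax(axis=1))
```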
I’m not too familiar with the ML field (which is one reason I enjoy this blog!), but there are cases in general signal processing where randomization is essential.
As you said, noise (a random signal) is often used for symmetry breaking, to “kickstart” a system from a degenerate solution, but noise is usually used here more because it’s handy than because it’s necessary.
A more vital application is, for example, dithering during quantization. Quantization introduces error, which is unavoidable, but naive systematic quantization introduces an error signal which is correlated with the main signal, and thus can become a noticeable artifact. Injecting noise before quantization decorrelates the error signal, essentially spreading its energy unobtrusively over the entire spectrum.
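A small numerical sketch of this point (the sine wave, step size, and TPDF dither below are arbitrary choices): without dither the quantization error of a sine is concentrated in a few spectral lines, while adding dither before quantization spreads the error energy across the spectrum.

```python
# Compare the error spectrum of plain vs. dithered quantization of a sine.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(10_000) / 10_000
signal = np.sin(2 * np.pi * 5 * t)        # a plain 5-cycle sine
step = 0.25                               # coarse quantizer step size

def quantize(x):
    return step * np.round(x / step)

# Error of naive quantization, measured against the clean signal.
plain_err = quantize(signal) - signal
# TPDF dither: the sum of two uniforms spanning one quantizer step.
dither = (rng.uniform(-step / 2, step / 2, t.size)
          + rng.uniform(-step / 2, step / 2, t.size))
dithered_err = quantize(signal + dither) - signal

def spectral_peak_ratio(err):
    """How concentrated the error spectrum is: largest line over the average level."""
    mag = np.abs(np.fft.rfft(err))[1:]    # drop the DC bin
    return mag.max() / mag.mean()

print("error spectrum peak/average, no dither:  ", spectral_peak_ratio(plain_err))
print("error spectrum peak/average, with dither:", spectral_peak_ratio(dithered_err))
```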
Another application is in spread-spectrum communication, where a data signal is multiplied by a (deterministic) noise signal in order to spread its energy over a wide frequency range. This leaves it much less vulnerable to interference at any specific frequency.
In general, noise is not thought of as “randomization” so much as just another signal with certain useful characteristics (in particular, a continuous spectrum that is flat or otherwise interestingly-shaped). And noise needn’t be indeterministic — just reuse the same random generator with the same seed if determinism is a concern.
Again, I’m not too familiar with ML, but I think noise is too useful a tool to reject. If you are reducing information and want to ensure that you are not inadvertently biasing the information kept, or if you need a wide frequency spectrum (whatever that may mean in your system), noise is there for you.
When we design learning algorithms, making a good decision is often hard. An easy choice for making decisions is to simply randomize.
To think about this, consider the case where there is only one bit of randomization with states 0 and 1. Assuming that the problem being solved is fixed, each choice of the random bit has some value v0 or v1 in the context of the problem under consideration. Randomization induces an expectation over these values, and it is only optimal when v0 = v1.
Many applications seem to fit this setting, which implies that randomization (1) has expected performance no better than the best deterministic choice and (2) is being used to cover up our lack of understanding of the best deterministic algorithm. If this is your situation, it’s worth thinking about how a deterministic version of the algorithm would run.
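In symbols, if the bit comes up 1 with probability $p$, then

$$\mathbb{E}[\text{value}] = (1-p)\,v_0 + p\,v_1 \le \max(v_0, v_1),$$

with equality for every $p$ only when $v_0 = v_1$, so some deterministic setting of the bit is always at least as good.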
For example, in RL, when we have a Markov Decision Process, E3-style algorithms can provide a derandomization of epsilon-greedy Q-learning.
Sure, if we have a fixed learning problem, then some particular (deterministic) setting of the random bits will perform better than the random algorithm, but maybe this setting of the bits causes the algorithm to perform horribly on a different learning problem. Randomization allows us to make probabilistic guarantees without knowing what problem our algorithm might face — is it necessarily the case that there is a deterministic algorithm that can do better on all inputs?
I think the answer is “yes”. We can just let v0 and v1 be the expected value over the assortment of learning problems we encounter rather than for one problem.
An adaptive adversary generating the problems would make the answer “no”, but this is not the situation we typically encounter.
One very basic reason for randomization is missing: Monte Carlo methods. Sometimes exact inference (such as querying a Bayesian network) has no closed-form solution or is computationally intractable, and in those cases an approach based on random sampling is really the only option. Your average nonlinear function maximization/minimization – which some have already brought up – falls under this.
As far as overcoming overfitting goes, I believe randomization effectively acts as a low-pass filter, whether over model space (bagging), input space (jitter), or something else. A learning algorithm that incorporates this filtering analytically might remove the randomness quite effectively.
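A rough sketch of this reading, with everything (the jagged 1-nearest-neighbour predictor, the jitter width, the grid) made up for illustration: averaging predictions over randomly jittered inputs and applying the corresponding Gaussian smoothing deterministically give nearly the same answer, which is the sense in which the filter could be incorporated analytically.

```python
# Jitter-averaging a jagged predictor vs. applying the same Gaussian
# smoothing deterministically over a fine grid.
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(0, 1, 40)
ys = np.sin(6 * xs) + 0.3 * rng.normal(size=xs.size)       # noisy training data

def predict_1nn(q):
    """A deliberately jagged predictor: 1-nearest-neighbour regression."""
    return ys[np.abs(xs - q[:, None]).argmin(axis=1)]

q = np.linspace(0, 1, 200)
sigma = 0.05                                               # jitter / filter width

# Randomized version: average predictions over jittered copies of each query.
jitter_avg = np.mean(
    [predict_1nn(q + sigma * rng.normal(size=q.size)) for _ in range(500)], axis=0)

# Deterministic version: the same Gaussian smoothing applied analytically,
# here approximated by a weighted average over a fine grid of inputs.
grid = np.linspace(-0.2, 1.2, 2000)
w = np.exp(-0.5 * ((grid - q[:, None]) / sigma) ** 2)
analytic = (w * predict_1nn(grid)).sum(axis=1) / w.sum(axis=1)

print("max gap between jitter average and analytic smoothing:",
      np.abs(jitter_avg - analytic).max())
```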
Nearly every case that has been mentioned (overfitting, bagging, initialization) could be solved by being staunchly Bayesian about the problem. Set prior beliefs on all parameter settings, then run the algorithm for each combination, and take the weighted average of the results in accordance with the prior beliefs.
Since this is computationally intractable in most situations, we must sample from these distributions. “Deterministically sampling” from a probability distribution would not actually be pulling values from that distribution, so we must use randomness.
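A minimal sketch of this recipe for a toy one-parameter model (the model, prior, and noise level are invented for illustration): sample parameter settings from the prior, weight each setting’s prediction by how well it explains the data, and average.

```python
# Monte Carlo approximation to a posterior-weighted average prediction.
import numpy as np

rng = np.random.default_rng(0)

# Observed data, generated from a "true" slope of 1.5 for the sketch.
x = np.linspace(0, 1, 20)
y = 1.5 * x + 0.2 * rng.normal(size=x.size)

def log_likelihood(theta):
    return -0.5 * np.sum((y - theta * x) ** 2) / 0.2 ** 2

# Sample slopes from the prior N(0, 1) and weight them by their likelihood.
thetas = rng.normal(0.0, 1.0, size=5000)
log_w = np.array([log_likelihood(t) for t in thetas])
w = np.exp(log_w - log_w.max())               # normalize in log space for stability
w /= w.sum()

x_new = 2.0
prediction = np.sum(w * thetas * x_new)       # posterior-weighted average prediction
print("Bayes-averaged prediction at x=2:", prediction)
print("effective number of samples:", 1.0 / np.sum(w ** 2))
```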
I don’t understand the leap from Bayesian (which is a deterministic algorithm) to “use Monte Carlo sampling” (which is a randomized algorithm). It seems possible to imagine that deterministic forms of integration may work better, using the v_i logic above. Is there a reason to believe this cannot be so?
Bayesian inference is not an algorithm, it’s a definition. The statement P(Theta | Data) = P(Data | Theta)P(Theta)/P(Data) does not specify how these quantities are actually computed. Sometimes it can be computed in closed form, sometimes it cannot, and that’s where randomized algorithms are useful. Deterministic integration is also a useful tool, but usually such methods are approximations that are not expected to work as well as Monte Carlo integration (although in several problems few people would be willing to pay the Monte Carlo price for getting supposedly better predictions).
Notice that, in statistics, modeling and algorithms are two detached things, and I like to think of them that way. Bootstrapping, for instance, is an example of a non-Bayesian estimator that is defined without any reference to an algorithm. Although in a few cases one can get closed-form expressions for bootstrap estimators, for most problems one can only calculate such estimators through sampling.
I don’t know if I understood the v_i example… If we plug in the expected value of v_i (assuming the expected value is the actual estimator that we need; your function might otherwise be non-linear in v_i), it feels just like the usual “plug-in” estimators of statistical inference. But this doesn’t tell you anything about how to get such expectations to begin with.
Deterministic forms of integration exist, but they tend to scale badly with input dimension, and high dimensions are the typical situation in Machine Learning.
The nice part about sampling is that the variance of the estimate does not depend on the dimension, so it may be quite efficient.
John is right; randomness is not required for computing Bayesian integrals, even in high dimensions. Recall that the error of Monte Carlo scales as O(1/sqrt(n)) where n is the number of function evaluations. For integrating non-smooth functions in high dimensions, this convergence rate is optimal, but it is not uniquely optimal. Low-discrepancy number sequences and quadrature rules can do just as well for these functions, and even better in low dimensions or if the function is smooth. For more information, look up Quasi-Monte Carlo.
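A small sketch comparing the two on a smooth test integrand whose integral over the unit cube is exactly 1 (the integrand, dimension, and sample size are arbitrary choices; the Halton sequence stands in for any low-discrepancy rule):

```python
# Plain Monte Carlo vs. a Halton low-discrepancy sequence on a smooth
# integrand over [0,1]^d whose true integral is 1.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 4096
primes = [2, 3, 5, 7, 11]                      # one base per dimension

def radical_inverse(i, base):
    """Van der Corput radical inverse of i in the given base."""
    inv, f = 0.0, 1.0 / base
    while i > 0:
        inv += (i % base) * f
        i //= base
        f /= base
    return inv

def integrand(X):
    # prod_j (pi/2) * sin(pi * x_j); integrates to 1 over the unit cube
    return np.prod(0.5 * np.pi * np.sin(np.pi * X), axis=1)

# Plain Monte Carlo: i.i.d. uniform points.
mc_points = rng.uniform(size=(n, d))
# Quasi-Monte Carlo: the first n Halton points.
qmc_points = np.array([[radical_inverse(i, p) for p in primes]
                       for i in range(1, n + 1)])

print("true integral:        1.0")
print("Monte Carlo estimate: ", integrand(mc_points).mean())
print("Halton QMC estimate:  ", integrand(qmc_points).mean())
```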
Here is a simple thought experiment which boils the problem down to its essence. Consider a guessing game with two players. On the table is a list of 10 functions, e.g. f1(x)=3x^2+4, f2(x)=cos(exp(x)), etc. Next to each function is a number, which gives the value of the function’s integral from x=0 to 1. Both players can see the list, including the integral values. Player 1 secretly picks one of the functions. Player 2’s job is to guess the integral of that function (which is always one of the 10 listed numbers). Before guessing, player 2 can ask questions of the form “what is the value of your f at value x?”. Player 1 must give the answer, rounded to the nearest integer. (Without rounding the problem would be trivial.) After n consecutive questions and answers, player 2 must guess the integral, and is charged e dollars where e is the absolute difference between his guess and the true integral.
Any integration algorithm based on n evaluations of a function can be interpreted as a strategy for player 2. For example, naive Monte Carlo would correspond to drawing n random numbers between 0 and 1, and asking the function value at each one. Then you estimate the integral as the average of those function values. Of course, this strategy completely ignores the list of functions on the table. Another approach is quadrature. Using the list of 10 functions, you precompute the n most informative places at which to measure the function. Then you take a weighted average of the function values. This is better, but not optimal since it selects the n points in advance. You should be able to do better by choosing the next point on the basis of what you have already learned from the previous questions.
Now I ask the following:
1. What is player 2’s optimal strategy? You can assume that player 2 receives the answer to each question as soon as it is asked.
2. Is there any advantage for player 2 to be random?
3. How should player 2’s strategy change if the list were extended to 1 million functions? The expected payoff would clearly change, but does the basic strategy change?
4. How should player 2’s strategy change if the functions are now multi-dimensional, and the integral is over the unit cube [0,1]^d?
5. Does it matter that we are estimating the integral of the function, and not some other property, like the location of the function’s maximum? Does the number associated with the function have to have any meaning at all?
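To make the two strategies in the description above concrete, here is a tiny simulation with a made-up list of five functions, keeping the integer rounding of the answers. The “naive Monte Carlo” player ignores the list; the “elimination” player uses the rounded answers to rule out candidates and guesses from whatever remains. This is only an illustration, not an answer to the questions.

```python
# Simulate the guessing game: naive Monte Carlo vs. eliminating list entries
# that are inconsistent with the rounded answers.
import numpy as np

rng = np.random.default_rng(0)

funcs = [
    lambda x: 3 * x ** 2 + 4,
    lambda x: np.cos(np.exp(x)),
    lambda x: 5 * x,
    lambda x: np.exp(x),
    lambda x: 7 - 6 * x ** 3,
]
# The "numbers next to each function": their integrals over [0,1],
# computed here by a fine midpoint rule.
grid = (np.arange(100_000) + 0.5) / 100_000
integrals = np.array([f(grid).mean() for f in funcs])
n_questions = 8

def play(secret_idx):
    f = funcs[secret_idx]
    xs = rng.uniform(size=n_questions)        # the points player 2 asks about
    answers = np.round(f(xs))                 # player 1's rounded replies

    # Strategy A: naive Monte Carlo, ignoring the list entirely.
    mc_guess = answers.mean()

    # Strategy B: eliminate list entries inconsistent with the answers,
    # then guess the average listed integral of the survivors.
    alive = [i for i, g in enumerate(funcs) if np.all(np.round(g(xs)) == answers)]
    elim_guess = integrals[alive].mean()

    true_value = integrals[secret_idx]
    return abs(mc_guess - true_value), abs(elim_guess - true_value)

losses = np.array([play(i) for i in range(len(funcs))])
print("average loss, naive Monte Carlo guess:", losses[:, 0].mean())
print("average loss, elimination guess:      ", losses[:, 1].mean())
```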
Part of my dissertation will center around how much randomization is needed in information theoretic settings, and there, as a previous commenter said, randomization is key in getting around spurious “bad cases.” What is more interesting in the information theory world is that a small amount of randomization can mean the difference between 0 bits/second reliably and >> 0 bits/second reliably.
There are a couple of other situations that might require explicit randomization.
Some ML algorithms that train on-line, such as Adaptive Resonance Theory, are inherently dependent on training order. While this is helpful if you have a training set that has some order to its exemplars, you would otherwise want to randomize the training order and perhaps even create an ensemble of systems, each trained with a different order and each contributing votes to the final test predictions.
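A minimal sketch of this idea, with a simple online perceptron standing in for an order-sensitive learner like ART (the data and ensemble size are made up): train several copies on differently shuffled orderings of the same data and take a majority vote.

```python
# An ensemble of online perceptrons, each trained on a different random
# ordering of the same training set, voting at test time.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.where(X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) > 0, 1, -1)   # toy labels

def train_perceptron(order):
    """One online pass of the classic perceptron, visiting examples in `order`."""
    w = np.zeros(X.shape[1])
    for i in order:
        if y[i] * (X[i] @ w) <= 0:        # mistake-driven update
            w += y[i] * X[i]
    return w

# One model per random ordering.
weights = [train_perceptron(rng.permutation(len(X))) for _ in range(15)]
votes = np.sign(np.sum([np.sign(X @ w) for w in weights], axis=0))

print("error of a single ordering:    ", np.mean(np.sign(X @ weights[0]) != y))
print("error of the ordering ensemble:", np.mean(votes != y))
```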
Also, systems which hope to model biological mechanisms of learning and memory may benefit from explicit randomization. The human brain is not a deterministic system. Each neuron and its connections are incredibly noisy, which can be both a benefit and a curse.
Quasi-Monte Carlo methods are interesting, but they haven’t become popular in Bayesian computation for two main reasons:
* Even if the rate of convergence is faster for QMC than for MC, the constant in front can be much higher than for standard MC (especially if the function is not very smooth). Hence, for a fixed computational effort, QMC might perform very badly.
* QMC and MCMC do not get along very well. As soon as you plug low-discrepancy sequences into an MCMC algorithm, you essentially don’t know what is going on. There is a recent paper by Art Owen in PNAS (see http://www-stat.stanford.edu/~owen/reports/qmcmcmc2.pdf for a long version of the paper) showing that it still provides asymptotically consistent estimates under regularity assumptions… but nobody knows what the resulting rate of convergence is.