This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time.
The basic trick has to do with importance weighting for monte carlo integration. Consider the problem of finding:
N = Ex ~ D f(x)
given samples from D and knowledge of f.
Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q. In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula:
Ex ~ Q f(x) D(x)/Q(x)
A basic question is: How many samples from Q are required in order to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions.
Nevertheless, there is one special value Q(x) = f(x) D(x) / N where the sample complexity turns out to be 1, which is typically substantially better than the sample complexity of the original problem.
This observation underlies the motivation for voluntary importance weighting algorithms. Even under pretty terrible approximations, the logic of “Q(x) is something like f(x) D(x)” often yields substantial improvements over sampling directly from D(x).
One thing I don’t understand: In many cases we use importance sampling because it is difficult to sample from D(x) directly, but it is easy to sample from Q(x). So in those cases, is it easier or even feasible to sample from f(x) D(x)?
Importance reweighting can be done iteratively: 1. you draw K samples from Q, 2. if that’s enough, then you stop and make an estimate of N, 3. if it’s not enough, then (from the samples) you compute a new distribution Q’ that is closer to D (in the cross-entropy measure). This algorithm (by Reuven Rubinstein) is the cross-entropy method, more about it here
To do this, of course, you have to know something about D(x). For example, Rubinstein assumes that there is some objective function g(x), and D is a uniform distribution over some level set of g.
The method can be used
1. for rare events, get an accurate Monte Carlo estimation of its probability
2. for global optimization (where it turns out to be an estimation-of-distribution type algorithm)
Probably it can be also used for MC integration somehow… I have only experience with the global-optimization CEM, where it is almost unbeatable 🙂
I’ve never had a case where you could literally sample from Q(x). It’s an ideal to strive towards, but it’s rarely if ever achievable.
I am also a big fan of importance sampling. In estimating the minimax regret (under log-loss) it turns out that the thing we need to find out is the uniform average of maximized likelihood over all possible data-sets (or the normalization term of Shtarkov’s NML distribution). In particular, for multinomial (or Markov) models, using mixture distributions with Dirichlet(1/2,…,1/2) priors as Q(x) turns out to be much better than the uniform distribution. In fact the Dirichlet mixture is very close to optimal, and the sample complexity is practically 1. (paper)
I have also played around with the (probably doomed) idea of sampling from Q(x) \propto D(x) f(x) using MCMC, which only requires that we can evaluate the ratio f(x)D(x) / f(y)D(y) for all x,y. However, since I cannot evaluate the proportionality constant, N, I have to use something else than the importance sampling estimator. The harmonic mean estimator works in principle (is consistent) but it is notoriously unstable (i.e., the variance is extremely large). Plus the details of the MCMC sampler have to be tuned. All in all, this is very likely a bad idea.