6.4 Importance Sampling

With rejection sampling, we ultimately obtain a sample from the target density \(f\). With that sample, we can create any number of summaries, statistics, or visualizations. However, what if we are interested in the narrower problem of computing a mean, such as \(\mathbb{E}_f[h(X)]\) for some function \(h:\mathbb{R}^k\rightarrow\mathbb{R}\)? Clearly, this is a problem that can be solved with rejection sampling: first obtain a sample \(x_1,\dots,x_n\sim f\) and then compute \[ \hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n h(x_i) \] with the obtained sample. As \(n\rightarrow\infty\), we know by the Law of Large Numbers that \(\hat{\mu}_n\rightarrow\mathbb{E}_f[h(X)]\). Further, the Central Limit Theorem gives us \(\sqrt{n}(\hat{\mu}_n-\mathbb{E}_f[h(X)])\stackrel{D}{\longrightarrow}\mathcal{N}(0,\sigma^2)\), where \(\sigma^2=\text{Var}_f(h(X))\). So far so good.
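To make this concrete, here is a minimal sketch in Python; the particular target and test function are assumptions for illustration: the target \(f\) is the standard normal density and \(h(x)=x^2\), so that \(\mathbb{E}_f[h(X)]=1\).

```python
import numpy as np

# Minimal sketch: plain Monte Carlo estimate of E_f[h(X)] when we can
# sample from f directly.  Assumed example: f is the standard normal
# density and h(x) = x^2, so the true value of E_f[h(X)] is 1.
rng = np.random.default_rng(1)

def h(x):
    return x ** 2

n = 100_000
x = rng.normal(size=n)     # x_1, ..., x_n ~ f
mu_hat = np.mean(h(x))     # (1/n) * sum_i h(x_i)
print(mu_hat)              # close to 1 for large n
```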

However, with rejection sampling, in order to obtain a sample of size \(n\), we must generate, on average, \(c\times n\) candidates from \(g\), the candidate density, and then reject about \((c-1)\times n\) of them. If \(c\approx 1\) then this will not be too inefficient. But in general, if \(c\) is much larger than \(1\) then we will be generating a lot of candidates from \(g\) and ultimately throwing most of them away.
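To see this bookkeeping in action, here is a small sketch under assumed densities (not taken from the text): target \(f=\mathcal{N}(0,1)\) and candidate \(g\) the standard Cauchy, for which \(c=\sqrt{2\pi/e}\approx 1.52\), so roughly a third of the candidates are discarded.

```python
import numpy as np

# Minimal sketch: the acceptance rate in rejection sampling is about 1/c.
# Assumed example: target f = N(0, 1), candidate g = standard Cauchy,
# with c = sup_x f(x)/g(x) = sqrt(2*pi/e) ~= 1.52.
rng = np.random.default_rng(2)

def f(x):                              # target density: N(0, 1)
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def g(x):                              # candidate density: standard Cauchy
    return 1.0 / (np.pi * (1 + x ** 2))

c = np.sqrt(2 * np.pi / np.e)
x = rng.standard_cauchy(size=100_000)  # candidates from g
u = rng.uniform(size=x.size)
accepted = u <= f(x) / (c * g(x))
print(accepted.mean(), 1 / c)          # both roughly 0.66
```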

It’s worth noting that in most cases, the candidates generated from \(g\) fall within the domain of \(f\), so that they are in fact values that could plausibly come from \(f\). They are simply over- or under-represented in the frequency with which they appear. For example, if \(g\) has heavier tails than \(f\), then there will be too many extreme values generated from \(g\). Rejection sampling simply thins out those extreme values to obtain the right proportion. But what if we could take those rejected values and, instead of discarding them, simply downweight or upweight them in a specific way?

Note that we can rewrite the target expectation as follows, \[ \mathbb{E}_f[h(X)] = \int h(x)\,f(x)\,dx = \int h(x)\,\frac{f(x)}{g(x)}\,g(x)\,dx = \mathbb{E}_g\left[\frac{f(X)}{g(X)}h(X)\right], \] provided \(g(x) > 0\) whenever \(f(x)h(x)\ne 0\).

Hence, if \(x_1,\dots,x_n\sim g\), drawn from the candidate density, we can say \[ \tilde{\mu}_n = \frac{1}{n}\sum_{i=1}^n\frac{f(x_i)}{g(x_i)}h(x_i) = \frac{1}{n}\sum_{i=1}^n w_i h(x_i) \approx \mathbb{E}_f[h(X)] \]

In the equation above, the values \(w_i=f(x_i)/g(x_i)\) are referred to as the importance weights because they take each of the candidates \(x_i\) generated from \(g\) and reweight them when taking the average. Note that if \(f = g\), so that we are simply sampling from the target density, then this estimator is just the sample mean of the \(h(x_i)\)s. The estimator \(\tilde{\mu}_n\) is known as the importance sampling estimator.
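Here is a minimal sketch of the importance sampling estimator, reusing the assumed example above (standard normal target \(f\), standard Cauchy candidate \(g\), and \(h(x)=x^2\)); note that no candidate is discarded, only reweighted.

```python
import numpy as np

# Minimal sketch of the importance sampling estimator.  Assumed example:
# f = N(0, 1) (target), g = standard Cauchy (candidate), h(x) = x^2,
# so that E_f[h(X)] = 1.
rng = np.random.default_rng(3)

def f(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def g(x):
    return 1.0 / (np.pi * (1 + x ** 2))

def h(x):
    return x ** 2

n = 100_000
x = rng.standard_cauchy(size=n)  # x_1, ..., x_n ~ g (no rejection step)
w = f(x) / g(x)                  # importance weights w_i = f(x_i)/g(x_i)
mu_tilde = np.mean(w * h(x))     # (1/n) * sum_i w_i h(x_i)
print(mu_tilde)                  # close to 1 for large n
```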

When comparing rejection sampling with importance sampling, we can see that

  • Rejection sampling samples directly from \(f\) and then uses the samples to compute a simple mean

  • Importance sampling samples from \(g\) and then reweights those samples by \(f(x)/g(x)\)

For estimating expectations, one might reasonably believe that the importance sampling approach is more efficient than the rejection sampling approach because it does not discard any data.

In fact, we can see this by writing the rejection sampling estimator of the expectation in a different way. Let \(c=\sup_{x\in\mathcal{X}_f}f(x)/g(x)\). Given a sample \(x_1,\dots,x_n\sim g\) and \(u_1,\dots,u_n\sim\text{Unif}(0,1)\), then \[ \hat{\mu}_n = \frac{ \sum_i\mathbf{1}\left\{u_i\leq\frac{f(x_i)}{c\,g(x_i)}\right\}h(x_i) }{ \sum_i\mathbf{1}\left\{u_i\leq\frac{f(x_i)}{c\,g(x_i)}\right\} } \]

What importance sampling does, effectively, is replace the indicator functions in the above expression with their expectation. So instead of having a hard threshold, where observation \(x_i\) is either included (accepted) or not (rejected), importance sampling smooths out the acceptance/rejection process so that every observation plays some role.

If we take the expectation of the indicator functions above, we get (note that the \(c\)s cancel) \[ \tilde{\mu}_n = \frac{ \sum_i \frac{f(x_i)}{g(x_i)}h(x_i) }{ \sum_i \frac{f(x_i)}{g(x_i)} } = \frac{ \frac{1}{n}\sum_i \frac{f(x_i)}{g(x_i)}h(x_i) }{ \frac{1}{n}\sum_i \frac{f(x_i)}{g(x_i)} } \] which is roughly equivalent to the importance sampling estimate if we take into account that \[ \frac{1}{n}\sum_{i=1}^n\frac{f(x_i)}{g(x_i)}\approx 1 \] because \[ \mathbb{E}_g\left[\frac{f(X)}{g(X)}\right] = \int \frac{f(x)}{g(x)}g(x)\,dx = 1 \] The point of all this is to show that the importance sampling estimator of the mean can be seen as a “smoothed out” version of the rejection sampling estimator. The advantage of the importance sampling estimator is that it does not discard any data and thus is more efficient.
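A small sketch of this comparison, under the same assumed densities as before, computes both estimators on a single set of candidate draws: the hard accept/reject version and the smoothed, weighted version.

```python
import numpy as np

# Minimal sketch: rejection sampling estimator (hard indicators) versus the
# self-normalized importance sampling estimator (smoothed weights) on the
# same candidate draws.  Assumed example: f = N(0, 1), g = standard Cauchy,
# h(x) = x^2, c = sqrt(2*pi/e).
rng = np.random.default_rng(4)

def f(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def g(x):
    return 1.0 / (np.pi * (1 + x ** 2))

def h(x):
    return x ** 2

c = np.sqrt(2 * np.pi / np.e)
n = 100_000
x = rng.standard_cauchy(size=n)
u = rng.uniform(size=n)

keep = u <= f(x) / (c * g(x))                # hard accept/reject indicators
mu_hat = np.sum(keep * h(x)) / np.sum(keep)  # rejection sampling estimator

w = f(x) / g(x)                              # smoothed "acceptance" weights
mu_tilde = np.sum(w * h(x)) / np.sum(w)      # self-normalized IS estimator

print(mu_hat, mu_tilde)                      # both close to 1
```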

Note that we do not need to know the normalizing constants for the target density or the candidate density. If \(f^\star\) and \(g^\star\) are the unnormalized target and candidate densities, respectively, then we can use the modified importance sampling estimator, \[ \mu^\star_n = \frac{ \sum_i\frac{f^\star(x_i)}{g^\star(x_i)}h(x_i) }{ \sum_i\frac{f^\star(x_i)}{g^\star(x_i)} }. \] The unknown normalizing constants enter the numerator and the denominator only through a common multiplicative factor, which cancels in the ratio, and we can then use Slutsky’s Theorem to say that \(\mu^\star_n\rightarrow\mathbb{E}_f[h(X)]\).
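A minimal sketch of this estimator, assuming we only have the unnormalized forms (the normal density without its \(1/\sqrt{2\pi}\) constant and the Cauchy density without its \(1/\pi\) constant), is given below.

```python
import numpy as np

# Minimal sketch of importance sampling with *unnormalized* densities.
# Assumed example: f_star drops the 1/sqrt(2*pi) constant of N(0, 1) and
# g_star drops the 1/pi constant of the standard Cauchy.
rng = np.random.default_rng(5)

def f_star(x):                           # unnormalized target
    return np.exp(-x ** 2 / 2)

def g_star(x):                           # unnormalized candidate
    return 1.0 / (1 + x ** 2)

def h(x):
    return x ** 2

x = rng.standard_cauchy(size=100_000)    # still drawn from the normalized g
w = f_star(x) / g_star(x)                # weights off by a constant factor...
mu_star = np.sum(w * h(x)) / np.sum(w)   # ...which cancels in the ratio
print(mu_star)                           # close to E_f[h(X)] = 1
```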

6.4.1 Example: Bayesian Sensitivity Analysis

An interesting application of importance sampling is the examination of the sensitivity of posterior inferences with respect to prior specification. Suppose we observe data \(y\) with density \(f(y\mid\theta)\) and we specify a prior for \(\theta\) as \(\pi(\theta\mid\psi_0)\), where \(\psi_0\) is a hyperparameter. The posterior for \(\theta\) is thus

\[ p(\theta\mid y, \psi_0) \propto f(y\mid\theta)\pi(\theta\mid\psi_0) \] and we would like to compute the posterior mean of \(\theta\). If we can draw \(\theta_1,\dots,\theta_n\), a sample of size \(n\) from \(p(\theta\mid y,\psi_0)\), then we can estimate the posterior mean with \(\frac{1}{n}\sum_i\theta_i\). However, this posterior mean is estimated using a specific hyperparameter \(\psi_0\). What if we would like to see what the posterior mean would be for a different value of \(\psi\)? Do we need to draw a new sample of size \(n\)? Thankfully, the answer is no. We can simply take our existing sample \(\theta_1,\dots,\theta_n\) and reweight it to get our new posterior mean under a different value of \(\psi\).

Given a sample \(\theta_1,\dots,\theta_n\) drawn from \(p(\theta\mid y,\psi_0)\), we would like to know \(\mathbb{E}[\theta\mid y, \psi]\) for some \(\psi\ne\psi_0\). The idea is to treat our original \(p(\theta\mid y,\psi_0)\) as a “candidate density” from which we have already drawn a large sample \(\theta_1,\dots,\theta_n\). Then we want to know the posterior mean of \(\theta\) under a “target density” \(p(\theta\mid y,\psi)\). We can then write our importance sampling estimator as

\[\begin{eqnarray*} \frac{ \sum_i\theta_i\frac{p(\theta_i\mid y, \psi)}{p(\theta_i\mid y,\psi_0)} }{ \sum_i\frac{p(\theta_i\mid y, \psi)}{p(\theta_i\mid y,\psi_0)} } & = & \frac{ \sum_i\theta_i\frac{f(y\mid\theta_i)\pi(\theta_i\mid\psi)}{f(y\mid\theta_i)\pi(\theta_i\mid\psi_0)} }{ \sum_i\frac{f(y\mid\theta_i)\pi(\theta_i\mid\psi)}{f(y\mid\theta_i)\pi(\theta_i\mid\psi_0)} }\\ & = & \frac{ \sum_i\theta_i\frac{\pi(\theta_i\mid\psi)}{\pi(\theta_i\mid\psi_0)} }{ \sum_i\frac{\pi(\theta_i\mid\psi)}{\pi(\theta_i\mid\psi_0)} }\\ & \approx & \mathbb{E}[\theta\mid y,\psi] \end{eqnarray*}\]

In this case, the importance sampling weights are simply the ratio of the prior under \(\psi\) to the prior under \(\psi_0\).
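The following is a minimal sketch of this reweighting under an assumed conjugate model (not taken from the text): \(y_i\sim\mathcal{N}(\theta,\sigma^2)\) with \(\sigma\) known and prior \(\theta\sim\mathcal{N}(\psi,\tau^2)\), where the hyperparameter \(\psi\) is the prior mean. Conjugacy lets us sample the posterior under \(\psi_0\) directly and check the reweighted posterior mean against the exact answer under a new \(\psi\).

```python
import numpy as np

# Minimal sketch of prior-sensitivity reweighting.  Assumed conjugate model:
# y_i ~ N(theta, sigma^2) with sigma known, prior theta ~ N(psi, tau^2).
rng = np.random.default_rng(6)

sigma, tau = 1.0, 2.0
psi0, psi = 0.0, 3.0                     # original and new prior means
y = rng.normal(1.5, sigma, size=20)      # simulated observed data
n = y.size

def post_params(psi_val):                # exact conjugate posterior of theta
    prec = n / sigma ** 2 + 1 / tau ** 2
    mean = (n * y.mean() / sigma ** 2 + psi_val / tau ** 2) / prec
    return mean, np.sqrt(1 / prec)

m0, s0 = post_params(psi0)
theta = rng.normal(m0, s0, size=100_000) # draws from p(theta | y, psi0)

def prior(t, psi_val):                   # pi(theta | psi), constants omitted
    return np.exp(-(t - psi_val) ** 2 / (2 * tau ** 2))

w = prior(theta, psi) / prior(theta, psi0)  # importance weights: prior ratio
print(np.sum(w * theta) / np.sum(w))        # reweighted posterior mean
print(post_params(psi)[0])                  # exact posterior mean under psi
```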

6.4.2 Properties of the Importance Sampling Estimator

So far we’ve talked about how to estimate an expectation with respect to an arbitrary target density \(f\) using importance sampling. However, we have not yet discussed the variance of that estimator. An analysis of the variance of the importance sampling estimator is assisted by the Delta method and by viewing the importance sampling estimator as a ratio estimator.

Recall that the Delta method states that if \(Y_n\) is a \(k\)-dimensional random variable with mean \(\mu\), \(g:\mathbb{R}^k\rightarrow\mathbb{R}\) is differentiable (here \(g\) denotes a generic function, not the candidate density), and further we have \[ \sqrt{n}(Y_n-\mu)\stackrel{D}{\longrightarrow}\mathcal{N}(0,\Sigma) \] as \(n\rightarrow\infty\), then \[ \sqrt{n}(g(Y_n)-g(\mu)) \stackrel{D}{\longrightarrow} \mathcal{N}(0, g^\prime(\mu)^\prime\Sigma g^\prime(\mu)) \] as \(n\rightarrow\infty\).

For the importance sampling estimator, we have \(f\) is the target density, \(g\) is the candidate density, and \(x_1,\dots,x_n\) are samples from \(g\). The estimator of \(\mathbb{E}_f[h(X)]\) is written as \[ \frac{\frac{1}{n}\sum_i h(x_i) w(x_i)}{\frac{1}{n}\sum_i w(x_i)} \] where \[ w(x_i) = \frac{f(x_i)}{g(x_i)} \] are the importance sampling weights.

If we let \(g((a, b)) = a/b\), then \(g^\prime((a,b)) = (1/b, -a/b^2)\). If we define the vector \(Y_n = \left(\frac{1}{n}\sum h(x_i) w_i,\,\frac{1}{n}\sum w_i\right)\) then the importance sampling estimator is simply \(g(Y_n)\). Furthermore, we have \[ \mathbb{E}_g[Y_n] = \mathbb{E}_g\left[\left(\frac{1}{n}\sum h(x_i) w(x_i),\,\frac{1}{n}\sum w(x_i)\right)\right] = (\mathbb{E}_f[h(X)], 1) = \mu \] and \[ \Sigma = n\,\text{Var}(Y_n) = \left( \begin{array}{cc} \text{Var}(h(X)w(X)) & \text{Cov}(h(X)w(X), w(X))\\ \text{Cov}(h(X)w(X), w(X)) & \text{Var}(w(X)) \end{array} \right) \] Note that the above quantity can be estimated consistently using the sample versions of each quantity in the matrix.

Therefore, by the Delta method, the asymptotic variance of \(\sqrt{n}(\tilde{\mu}_n-\mathbb{E}_f[h(X)])\) is \(g^\prime(\mu)^\prime\Sigma g^\prime(\mu)\). Estimating \(\mu\) and \(\Sigma\) with their sample versions gives \(g^\prime(Y_n)^\prime\hat{\Sigma}\, g^\prime(Y_n)\), which we can expand to \[ n\left( \frac{\sum h(x_i)w(x_i)}{\sum w(x_i)} \right)^2 \left( \frac{\sum h(x_i)^2w(x_i)^2}{\left(\sum h(x_i)w(x_i)\right)^2} - 2\frac{\sum h(x_i)w(x_i)^2}{\left(\sum h(x_i)w(x_i)\right)\left(\sum w(x_i)\right)} + \frac{\sum w(x_i)^2}{\left(\sum w(x_i)\right)^2} \right). \] Dividing this quantity by \(n\) gives an approximate variance for \(\tilde{\mu}_n\) itself. Given this, for the variance of the importance sampling estimator to be finite, we need the following to be true, \[ \mathbb{E}_g\left[h(X)^2w(X)^2\right] = \mathbb{E}_g\left[h(X)^2\left(\frac{f(X)}{g(X)}\right)^2\right] < \infty, \] \[ \mathbb{E}_g[w(X)^2] = \mathbb{E}_g\left[\left(\frac{f(X)}{g(X)}\right)^2\right] < \infty, \] and \[ \mathbb{E}_g\left[h(X)w(X)^2\right] = \mathbb{E}_g\left[h(X)\left(\frac{f(X)}{g(X)}\right)^2\right] < \infty. \]
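As a rough illustration, the following sketch (same assumed densities and \(h\) as before) evaluates the expression above and converts it into an approximate standard error for \(\tilde{\mu}_n\).

```python
import numpy as np

# Minimal sketch: estimate the asymptotic variance of the self-normalized
# importance sampling estimator via the Delta method expression above.
# Assumed example: f = N(0, 1), g = standard Cauchy, h(x) = x^2.
rng = np.random.default_rng(7)

def f(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def g(x):
    return 1.0 / (np.pi * (1 + x ** 2))

def h(x):
    return x ** 2

n = 100_000
x = rng.standard_cauchy(size=n)
w = f(x) / g(x)
hw = h(x) * w

mu_tilde = np.sum(hw) / np.sum(w)
avar = n * mu_tilde ** 2 * (np.sum(hw ** 2) / np.sum(hw) ** 2
                            - 2 * np.sum(h(x) * w ** 2) / (np.sum(hw) * np.sum(w))
                            + np.sum(w ** 2) / np.sum(w) ** 2)
print(mu_tilde, np.sqrt(avar / n))  # estimate and its approximate standard error
```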

All of the above conditions hold if the condition for rejection sampling is satisfied, that is, if \(\sup_{x\in\mathcal{X}_f}\frac{f(x)}{g(x)}<\infty\), provided in addition that \(\mathbb{E}_f[h(X)^2]<\infty\).