7.5 Simulated Annealing
Simulated annealing is a technique for minimizing functions that makes use of ideas from Markov chain Monte Carlo samplers, which is why it is in this section of the book. It is particularly useful for functions that are badly behaved and wiggly, the kinds of functions that Newton-style optimizers tend to be bad at minimizing.
Suppose we want to find the global minimum of a function \(h(\theta)\), where \(\theta\) is a vector of parameters in a space \(S\). Ideally, we would simulate a Markov chain whose target density \(\pi(\theta)\) was a point mass on the global minimum. This would be great if we could do it! But then we wouldn’t have a problem. The idea with simulated annealing is that we build successive approximations to \(\pi(\theta)\) until we have an approximation that is very close to the target density.
Let \(S^\star = \{\theta\in S: h(\theta) = \min_{\theta'\in S} h(\theta')\}\) be the set of global minimizers. Then define \(\pi(\theta)\propto 1\) for all \(\theta\in S^\star\) and \(\pi(\theta)=0\) for all \(\theta\notin S^\star\). In other words, \(\pi(\theta)\) is the uniform distribution over all the global minimizers. The ultimate goal is to find some way to sample from \(\pi(\theta)\).
We will begin by building an approximate density called \(\pi_T(\theta)\) where
\[ \pi_T(\theta) \propto \exp(-h(\theta) / T), \] and where \(T > 0\) is called the “temperature”. This density has two key properties:
As \(T\rightarrow\infty\), \(\pi_T(\theta)\) approaches the uniform density;
As \(T\downarrow 0\), \(\pi_T(\theta)\rightarrow\pi(\theta)\).
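Both properties are easy to check numerically. The sketch below uses a made-up one-dimensional objective \(h\) with two local minima and discretizes \(\pi_T\) on a grid: at large \(T\) the probability mass is spread out, and as \(T\) shrinks it piles up on the global minimizer (here, the minimum with \(\theta > 0\)).

```python
import numpy as np

# Toy objective (illustrative choice): local minima near -1 and +1;
# the tilt term -0.3 * theta makes the minimum near +1 the global one.
def h(theta):
    return (theta**2 - 1)**2 - 0.3 * theta

grid = np.linspace(-2, 2, 4001)

for T in [10.0, 1.0, 0.1, 0.01]:
    w = np.exp(-h(grid) / T)
    p = w / w.sum()  # discretized pi_T on the grid
    print(f"T = {T:5.2f}: P(theta > 0) = {p[grid > 0].sum():.3f}")
```

At \(T = 10\) the probability of being near the global minimum is barely above one half, while at \(T = 0.01\) essentially all of the mass sits there.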
The aim is then to draw many samples from \(\pi_T(\theta)\), initially with a large value of \(T\), and to lower \(T\) towards \(0\) slowly. As we lower \(T\), the density \(\pi_T(\theta)\) will become more and more concentrated around the minima of \(h(\theta)\).
Given that \(\pi_T(\theta)\rightarrow\pi(\theta)\) as \(T\downarrow 0\), why not just start with \(T\) really small? The problem is that if \(T\) is small from the start, the sampler will quickly get “stuck” in whatever local mode is near our initial value. Once there, it will not be able to jump out and move to another mode. So the general strategy is to start with a large \(T\) so that the sample space can be adequately explored and we don’t get stuck in a local minimum. Then, as we lower the temperature, we can be confident that we have explored as much of the space as is feasible.
The sampling procedure is then to first choose a symmetric proposal density \(q(\cdot\mid\theta)\). Then, if we are at iteration \(n\) with state \(\theta_n\),
Sample \(\theta^\star\sim q(\theta\mid\theta_n)\).
Sample \(U\sim\text{Unif}(0,1)\).
Compute \[\begin{eqnarray*} \alpha(\theta^\star\mid\theta_n) & = & \min\left(\frac{ \exp(-h(\theta^\star) / T) }{ \exp(-h(\theta_n) / T) }, 1 \right)\\ & = & \min(\exp(-(h(\theta^\star)-h(\theta_n)) / T), 1) \end{eqnarray*}\]
Accept \(\theta^\star\) as the next state if \(U\leq\alpha(\theta^\star\mid\theta_n)\).
Decrease \(T\).
Note that if \(h(\theta^\star)\leq h(\theta_n)\), then we always accept the proposal, so we always go “downhill”. If \(h(\theta^\star) > h(\theta_n)\) then we accept \(\theta^\star\) with some probability \(< 1\). So there is a chance that we go “uphill”, which allows us to not get stuck in local minima. As \(T\) approaches \(0\), it becomes much less likely that we will accept an uphill proposal.
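The steps above can be put together in a short Python sketch. The Gaussian random-walk proposal, the wiggly test function, the proposal standard deviation, and the constants in the cooling schedule are all illustrative choices, not part of the algorithm's definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(theta):
    # Wiggly toy objective (illustrative): many local minima,
    # with the global minimum at theta = 0.
    return theta**2 + 10.0 * np.sin(theta)**2

def simulated_annealing(h, theta0, n_iter=20000, prop_sd=1.0, a=1.0, b=1.0):
    theta = theta0
    best = theta0
    for n in range(1, n_iter + 1):
        T = a / np.log(n + b)                          # cooling schedule T_n = a / log(n + b)
        theta_star = theta + rng.normal(0.0, prop_sd)  # symmetric proposal q
        delta = h(theta_star) - h(theta)
        # Accept with probability alpha = min(exp(-delta / T), 1):
        # downhill moves (delta <= 0) are always accepted.
        if delta <= 0 or rng.uniform() <= np.exp(-delta / T):
            theta = theta_star
        if h(theta) < h(best):                         # track the best state visited
            best = theta
    return best

best = simulated_annealing(h, theta0=8.0)
print(best)
```

On this toy problem, a chain started out at \(\theta_0 = 8\) typically works its way past several local minima and reports a best value close to the global minimizer at \(0\).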
If we have that
The temperature is decreasing, so that \(T_{n+1} < T_n\) for each iteration \(n\);
\(T_n\downarrow 0\) as \(n\rightarrow\infty\);
\(T_n\) decreases according to an appropriate “cooling schedule”;
then \(\|\pi_{T_n} - \pi\|\longrightarrow 0\) as \(n\rightarrow\infty\).
For the theory to work and for us to be guaranteed to find the global minimizer(s), we need to have a “cooling schedule” for \(T\) that is along the lines of
\[ T_n = \frac{a}{\log(n + b)} \] for some \(a, b > 0\). Because of the logarithm in the denominator, this cooling schedule is excruciatingly slow, making simulated annealing very slow to converge.
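To see just how slow this is, the snippet below evaluates \(T_n = a/\log(n + b)\) at a few iteration counts, with the illustrative choice \(a = b = 1\). Because \(\log n\) only doubles when \(n\) is squared, halving the temperature requires roughly squaring the number of iterations.

```python
import math

a, b = 1.0, 1.0  # illustrative schedule constants
for n in [10, 10**3, 10**6, 10**9]:
    T_n = a / math.log(n + b)
    print(f"n = {n:>10d}: T_n = {T_n:.4f}")
```

For instance, with \(a = b = 1\), the temperature is about \(0.145\) after a thousand iterations, yet still about \(0.072\) after a million.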