(PDF) Optimization assisted MCMC

Journal of Machine Learning Research 1–17 Optimization Assisted MCMC Ricky Fok

Department of Electrical Engineering & Computer Science and Engineering York University Toronto, Ontario, Canada, M3J 1P3 arXiv:1709.02888v1 [stat.CO] 9 Sep 2017 Aijun An

[email protected]

Department of Electrical Engineering & Computer Science and Engineering York University Toronto, Ontario, Canada, M3J 1P3 Xiaogang Wang

[email protected]

Department of Mathematics and Statistics York University Toronto, Ontario, Canada, M3J 1P3 Editor: ??? Abstract Markov Chain Monte Carlo (MCMC) sampling methods are widely used but often encounter either slow convergence or biased sampling when applied to multimodal high dimensional distributions. In this paper, we present a general framework of improving classical MCMC samplers by employing a global optimization method. The global optimization method first reduces a high dimensional search to an one dimensional geodesic to find a starting point close to a local mode. The search is accelerated and completed by using a local search method such as BFGS. We modify the target distribution by extracting a local Gaussian distribution aound the found mode. The process is repeated to find all the modes during sampling on the fly. We integrate the optimization algorithm into the Wormhole Hamiltonian Monte Carlo (WHMC) method. Experimental results show that, when applied to high dimensional, multimodal Gaussian mixture models and the network sensor localization problem, the proposed method achieves much faster convergence, with relative error from the mean improved by about an order of magnitude than WHMC in some cases. Keywords: High dimensions, MCMC 1. Introduction Many machine learning methods employ Bayesian analyses where the posterior probability densities are often analytically intractable. Markov Chain Monte Carlo (MCMC) is a technique admitting sampling from most posterior densities. In theory, MCMC samplers can provide unbiased sampling of posterior densities. In practice, however, Markov chains may require a rather long time to converge to the target distribution or even trapped in a local mode. This problem is especially acute for high dimensional multimodal target distributions which little knowledge of the target sampling landscape is available. MCMC samples could c R. Fok, A. An & X. Wang. Fok An Wang be drawn from just one of the modes. In high dimensional distributions that are multimodal, it is likely that no samples are drawn from major modes. This could lead to significant bias in the estimation of the mean and variance of the target distribution and results in false confidence on the quality of the inference. One may think that running multiple chains could alleviate this problem. For high dimensions, though, this may not be effective (Neal, 2012). The probability density function could be rather at in some subregions in a high dimensional space. This results in rather slow convergence and causes the strategy of running multiple chains almost computationally infeasible. Recent efforts have been made to sample from multimodal distributions, such as Regenerative Darting Monte Carlo (RDMC, Ahn et. al. (2013)) and Wormhole Hamiltonian Monte Carlo (WHMC, Lan et. al. (2013)). Both methods update the sampler to include new modes as they are found during regeneration times, when the Markov chains probabilistically restarts. Markov chains probabilistically jump to other modes when they enter predefined regions determined by known modes. During regeneration times, a local search method, such as BFGS, is used to search for new modes and redefine the jump regions. It has been shown in the literature that WHMC converges rather fast if detailed and accurate knowledge about modes are provided. This requirement, however could be unrealistic in high dimensions such as 100 dimensions or higher due to the cost in discovering new modes. In order to avoid rediscovering known modes in WHMC, Lan et. al. (2013) employed the so called residual potential in order to diminish the basins of attraction of known modes so that BFGS is more likely to find unknown modes when the residual potential is optimized. Even though the method has shown to be successful in specific cases, such as Gaussian mixture models, there are a few issues to this approach. First, a major drawback is that there is a possibility of fictitious modes as their method directly replaces the target distribution with the residual potential to be optimized. In fact, during our experiments we discovered that fictitious modes are indeed introduced, unstabilizing WHMC. Second, the starting points are still chosen randomly in the search space for the BFGS runs. This allows BFGS to rediscover already known modes and wasting computational resources. Another major drawback is that the independent sampler employed within WHMC requires the proposal distribution to be similar to the target distribution. Otherwise, the regeneration probability would be very small and the mode search practically never starts. This creates an awkward scenerio where one requires some knowledge of the target distribution for regeneration to be effective when no such knowledge is available apriori. Lastly, WHMC samples from the target distribution “on-the-fly”, where mode searching is conducted during sampling. This introduces an additional source of bias before major modes are found. At first glance, modifying the objective function to subtract out contribution from known modes seems trivial. However, such a subtraction scheme tends to create regions with vanishing gradient if done perfectly, slowing the convergence of any gradient based search algorithms. In differential geometry, a geodesic is a generalization of a straight line on Riemmanian manifolds. The vanishing gradient problem can be avoided if one finds a manifold on which geodesics correspond to straight lines in the at regions of the search space and where the geodesics would pass through the maxima of the objective function. Recently, Fok et. al. (2016) showed that geodesics on manifolds conformally related to Euclidean spaces possesses such properties and a single geodesic can travel through multiple maxima. Such geodesics appear to be an excellent candidate for finding major modes. 2 Optimization Assisted MCMC We propose a novel mode searching method that outperforms the residual potential method by addressing the drawbacks experienced by WHMC. Our method first constructs a conformal geodesic that is able to visit the basins of many local maxima1 . The trajectory travels on the residual log objective function φ(x) = log(f (x)) − log(fˆ(x)), where the modes have been subtracted out from the log objective function f (x). This method addresses the first two concerns mentioned above with WHMC. First, a starting point is chosen with a bias towards undiscovered modes so that undiscovered modes are more likely to be found. Second, because the original function is maximized, there will be no fictitious maxima introduced. Our experiments show that our approach requires fewer BFGS calls to for mode searching and was stable in the experiment where WHMC was not able to converge. Lastly, we show theoretically that WHMC can be made more effcient by having fewer samples during the early runs where not many major modes are found. Therefore we simply force our sampler to search for new modes after a number of samples have been acquired so as to discover new modes as fast as possible in order to reduce the additional source of bias introduced by on-the- y sampling. This article is organized as follows. Section 2 provides a brief review of MCMC literature. Our algorithm is proposed in Section 3. Numerical results are provided in Section 4. Section 5 concludes the paper with discussion and plan for future work. 2. Related Work The first MCMC sampler developed is the Metropolis-Hastings algorithm (MH) (Metropolis et. al., 1953; Hastings, 1970). The Gibbs sampler was proposed more recently (Gelfand and Smith, 1990). It relies on a user-specified proposal function, one that is easy to sample from, to generate candidate points. Whether the candidate points are accepted is based on the acceptance probability. The Metroplis-Hastings algorithm is rather inefficient in today’s standards, however. First, its computational time scales with dimensionality as d2 (Creutz, 1988) which makes it rather inefficient in high dimensions. As a consequence, the optimal acceptance rate is just 0.23 for random walk proposal functions2 . It also mixes poorly in colinear regions of the target distribution. A better alternative is the so called Hamiltonian Monte Carlo (HMC) (Neal, 2012). The candidate points are proposed by solving the Hamiltonian equations of motion under the potential energy U (x) = − log π(x), where π(x) is the target distribution and the kinetic energy K(x) = 12 pT M −1 p, where p is the momentum and M is a constant real and symmetric mass matrix, usually set to be the identity. First, the momentum p ∼ N (0, 1) is sampled. Then the Hamiltonian equations of motion is solved using the leapfrog integrator and a candidate point is chosen given the trajectory length and the leapfrog step size. HMC is much more efficient than MH with optimal acceptance rate at 0.65. Also, its complexity scales with dimensions as d5/4 which is less costly than MH in high dimensions. HMC has a few drawbacks. The leapfrog step size must be randomized to avoid periodic behaviors leading to poor mixing. Furthermore, the trajectory length and leapfrog step sizes need to be tuned from a prelimineary HMC run to obtain an optimal acceptance rate. 1. The same trajectory was used as a component of a global optimization algorithm in Fok et. al. (2016). The novel contribution is our mode finding method as opposed to that with a residual potential. 2. For a derivaton of the optimal acceptance rate, see Section 4 of (Neal, 2012) 3 Fok An Wang In recent years, MCMC samplers outperforming HMC has been proposed. One example is the Riemannian Manifold Hamiltonian Monte Carlo (RMHMC) (Girolami and Calderhead, 2011). In this algorithm the trajectories follow the Hamiltonian dynamics on a manifold defined by the Fisher-Rao metric tensor (Rao, 1945). In general, the Hamiltonian contains terms depending on x and p in an inseparable way. As a result the equations of motion are implicit and the solution requires a numerical intergrator such as fixed point iteration, which may be numerically unstable and computationally costly in some cases. To alleviate this, Riemannian MCMC with Lagrangian dynamics (LMC) (Lan et. al., 2012) was proposed. LMC uses a semi-explicit integrator, in which one of the updates is explicit. Attempts have been made to tune HMC parameters adaptively. The No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014) uses a recusrive algorithm to propose candidate points in a wide region of the target distribution and stops the trajectory as it makes a U-turn. The step sizes can also be adjusted adaptively in NUTS. The Adaptive Hamiltonian and Riemann Monte Carlo (AHRMC) sampler Wang et. al. (2013) uses a Bayesian optimization method to tune HMC parameters. There exists samplers that specializes in sampling from multimodal distributions. Darting Monte Carlo (DMC) (Andricioaiei et. al., 2001; Sminchisescu and Welling, 2007) defines regions around locations of high posterior density and attempts to jump to another mode when entered. RDMC (Ahn et. al., 2013), a variant of DMC, was developed so that new regions can be defined when the Markov chains reaches their regeneration times. Rather than using the Fisher Information metric, Wormhole Hamiltonian Monte Carlo (WHMC) (Lan et. al., 2013) designs wormhole metric tensors in which the distances between modes on the corresponding manifold are reduced. Mode searches are performed with BFGS at regeneration times and the wormhole metric tensors are updated. 3. Motivation We show formally the source of bias introduced by on-the-fly sampling. Suppose we are given a set of N samples X, the probability density can be estimated as N 1 X fˆ(x) = K(xi − x), N n=1 where xi ∈ X is a sample and K is the kernel. Now, suppose that the target distribution is multimodal with K modes and k ≤ K modes are known. Assume that X is obtained on-the-fly, we can partition the samples into two sets X = Xk<K ∪XK , where Xk<K denotes samples before all the modes are found and XK denotes samples obtained after the discovery of all modes. Similarly, fˆk<K and fˆK denotes the density estimate obtained from Xk<K and Xk , respectively and N = Nk<K + NK their corresponding sample sizes. Since WHMC converges to a target distribution of known modes k, the true mean and variances (and higher statistical moments) are obtained from XK as NK → ∞. Without the loss of generality, consider an estimation of the mean using X, µ̂ = 1 X ˆ xf (x). N x∈X 4 Optimization Assisted MCMC Partitioning X as above by splitting the sum in the function estimation into two corresponding terms, we get µ̂ = Nk<K P x∈Xk<K xfˆk<K (x) + NK Nk<K + NK P x∈XK xfˆK (x) . The above expression has two terms. The first one is the bias introduced by on-the-fly sampling. In the limit NK → ∞, the first term above is negligible compared to the second and the estimated mean converges to the true mean. Another way to reduce the on-the-fly bias is to find all the major modes as soon as possible so that fk<K ≃ fK . ThisP can beRseen in the asymptotic limit (N → ∞) where the sum can be replaced by an integral x → dx. Note that in the asymptotic limit x is in both XK and Xk<K so the distinction between each set can be dropped. Finally, set fk<K = fK and the denominator is canceled by the corresponding term in the numerator. One obtains and estimate of the true mean µ̂ = Z xfˆK (x)dx. The above steps only involves manipulation of the density estimate and therefore is valid for all higher statistical moments. From the above consideration one sees that the on-the-fly bias can be reduced by having an efficient mode finding algorithm such that major modes first so that fˆk<K ≃ fˆK , and that limiting the number of samples before all modes are found can speed up convergence. Even though it may be impractical to find all modes before sampling, we found that the proposed algorithm, when applied on-the-fly, still outperforms WHMC. 4. Experiments We perform comparisons between the tempered residual potential energy (TRPE) method of WHMC and SGEO-KDE using target distributions provided by Lan et. al. (2013). To justify the discussion in Section 3, the density around a mode is estimated by a Gaussian distribution with mean and covariance being the maximizer and inverse Hessian matrix at the mode in the first experiment involving Gaussian mixtures, so that the estimated density fk for mode k is very close to a Gaussian term in the target distribution. In the second experiment with the sensor network localization problem, KDEs were used. We found that SGEO-KDE converges much faster than TRPE (WHMC) in both experiments. Further, we found no ficititious modes using SGEO-KDE whereas the residual potential method ficititious modes are present which caused the WHMC sampler to be unstable. The following diagnostic quantities are used to assess convergence, the relative error of mean (REM), REM (t) = P (t) d |µ̂ Pd − µ∗d | ∗ d µd , where d = {1, . . . , D} denotes the components, µ̂(t) denotes the MCMC estimate of the mean at time t, and µ∗ is the mean of the target distribution. Similarly, the relative error 5 Fok An Wang of covariance (RECOV) is defined as RECOV (t) = P ij ∗| |σ̂ij (t) − σij P ∗ , ij σij where indices i and j donotes the components of the covariance matrix σ ∗ and its MCMC estimate σ̂. Since BFGS is the computational bottleneck, especially in high dimensions. We show the average number of BFGS attempts NQN as a measure of computation cost. 4.1. Mixture of Gaussians We test SGEO-KDE on Gaussian mixture models used by Lan et. al. (2013) in this section, the estimates are obtained from 4 MCMC chains run in parallel. The HMC acceptance rate is tuned to be near optimal, in the range of 0.6 to 0.7. The wormhole in uence factor is set to be F = 0.1. In addition to the convergence diagnostics, we also compute the average number of BFGS optimization needed to locate all the modes for the SGEO optimization algorithm and the method of tempered residual potential energy employed in WHMC. This quantity can be used as a measure for computational cost as BFGS optimization is the bottleneck for both algorithms. Results for SGEO-KDE and TRPE (WHMC) for Gaussian mixture models with equal weights are shown in Table 1. The corresponding results for Guassian mixtures with unequal weights wk ∝ k/K, where k = {1, . . . , K}, are shown in Table 2. The results are taken after running the samplers for 800 seconds, and 2000 seconds for the D = 100 case. Plots of convergence diagnostics over computational time are shown in Figure 1 and Figure 2 for the unequal weight and equal weight cases, respectively. From the results we see that SGEO-KDE outperforms TRPE (WHMC) in all diagnostic measures, with the exception of three cases in RECOV. In addition, the SGEO mode searching algorithm is shown to be superb in finding new modes, with only one failure out of a total 80 attempts in Tables 1 and 2. Furthermore, TRPE (WHMC) requires at least four times the number of BFGS attempts than SGEO-KDE. This shows that SGEO-KDE outperforms TRPE (WHMC) in terms of computational cost as the bottleneck is BFGS. This is confirmed by conisdering the REM measure from the most recent 1000 samples (third column of Figures 1 and 2). SGEO-KDE takes a much shorter time to converge to the true mean. The improvement is amplified as the number of dimensions increase. In the D = 100 case, TRPE (WHMC) was not able to locate all the modes in 2000 seconds whereas the proposed method where all the modes were discovered at about 500 seconds. Furthermore, in higher dimensions (D ≥ 40), the margin of improvement is larger when the Gaussian weights become unequal. The results from HMC are included in the appendix. The HMC sampler is not able to sample from all the modes and the convergence is significantly worse than TRPE (WHMC) and SGEO-KDE. 6 Optimization Assisted MCMC D = 10, K = 10 SGEO-KDE (all modes) SGEO-KDE (on the fly) TRPE (WHMC) D = 20, K = 10 SGEO-KDE (all modes) SGEO-KDE (on the fly) TRPE (WHMC) D = 40, K = 10 SGEO-KDE (all modes) SGEO-KDE (on the fly) TRPE (WHMC) D = 100, K = 10 SGEO-KDE (all modes) SGEO-KDE (on the fly) TRPE (WHMC) REM 0.018371 0.083307 1.2962 RECOV 0.95372 0.94318 3.0498 NBF GS 10 10 61.1 0.030689 0.032558 0.66759 0.98234 0.98191 3.5667 10 10 44.8 0.021922 0.17626 0.47974 0.99505 1.8093 1.0246 10 10 47.9 0.024986 0.12135 0.82776 0.99685 2.9112 1.3427 10 10 62.4 Table 1: Comparisons between SGEO-KDE and WHMC using Gaussian mixture models with equal weights. D = 10, K = 10 SGEO-KDE (all modes) SGEO-KDE (on the fly) TRPE (WHMC) D = 20, K = 10 SGEO-KDE (all modes) SGEO-KDE (on the fly) TRPE (WHMC) D = 40, K = 10 SGEO-KDE (all modes) SGEO-KDE (on the fly) TRPE (WHMC) D = 100, K = 10 SGEO-KDE (all modes) SGEO-KDE (on the fly) TRPE (WHMC) REM 0.17218 0.14069 0.9356 RECOV 0.95234 0.94555 0.96606 NBF GS 10.05 10.05 37.5 0.14588 0.10588 0.68653 0.98155 0.9638 0.96862 10 10 44.2 0.11964 0.1011 1.3478 0.99498 1.2841 15.659 10 10 43.5 0.07165 0.078958 0.73814 0.99667 2.5587 1.6344 10 10 61.1 Table 2: Same as Table 1 but using Gaussian mixture models with unequal weights pk ∝ k/K, where k = {1, . . . , K}. 7 Fok An Wang RECOV D=10,K=10 REM 2 6 4 1.5 4 2 1 2 0 0 500 TIME (s) 1000 D=20,K=10 4 D=40,K=10 0.5 0 500 TIME (s) 1000 3 0 0 500 TIME (s) 1000 0 500 TIME (s) 1000 0 500 TIME (s) 1000 0 1000 2000 TIME (s) 3000 4 2 2 2 1 0 0 500 TIME (s) 1000 0 0 500 TIME (s) 1000 0 4 4 4 2 2 2 0 0 500 TIME (s) 1000 6 D=100,K=10 REM (last n) 6 0 0 500 TIME (s) 1000 8 6 6 4 0 4 4 2 0 2 2 0 1000 TIME (s) 2000 0 0 1000 TIME (s) 2000 0 Figure 1: Convergence diagnostics of SGEO-KDE (blue) and TRPE (WHMC) (red) for Gaussian mixture models of equal weights. The columns represent the results for REM, RECOV and the REM using the most recent 1000 samples, respectively. The rows denotes the results for various mixture models. 4.2. Sensor Network Localization Given a set of Ns sensors scattered on a two dimensional plane, it is possible to infer the locations of each sensor given the distance measured between each sensors. The domain of 8 Optimization Assisted MCMC D=10,K=10 REM RECOV 8 4 3 6 3 2 4 2 1 2 1 0 0 500 TIME (s) 1000 D=20,K=10 4 D=40,K=10 0 0 500 TIME (s) 1000 2 3 0 0 500 TIME (s) 1000 0 500 TIME (s) 1000 0 500 TIME (s) 1000 0 1000 2000 TIME (s) 3000 4 3 1.5 2 2 1 1 0 0 500 TIME (s) 1000 0.5 3 15 2 10 1 5 1 0 500 TIME (s) 1000 0 4 3 2 0 D=100,K=10 REM (last n) 4 0 500 TIME (s) 1000 0 1 0 500 TIME (s) 1000 0 4 8 4 3 6 3 2 4 2 1 2 1 0 0 500 1000 1500 TIME (s) 2000 0 0 500 1000 1500 TIME (s) 2000 0 Figure 2: Same as Figure 1 but with unequal weights. SGEO-KDE (blue) and TRPE (WHMC) (red). the posterior distribution is a 2Ns dimensional configuration space comprises of the locations of each sensor {x1 , . . . , xNs }, where each xi is a 2-dimensional coordinate vector denoting the position of sensor i. The solution for the locations can be found by a maximum aposterior (MAP) configuration estimate, and its uncentainties estimated by MCMC samples of the posterior. Due to measurement error, the posterior distribution can be multimodal, which each mode denoting a possible set of locations for the Ns sensors that maximizes the 9 Fok An Wang posterior locally. Obviously, the modes are not known aprori and hence this problem is an excellent application for SGEO-KDE. We run two chains in parallel for this experiment. In general, each mode of the posterior is non-Gaussian. After finding a new mode, HMC samples are taken and used to obtain a Gaussian kernel estimate of the posterior around the new mode.Then the KDE estimate of each mode is subtracted out for SGEO. Obtaining the true mean and covariance matrix of the target distribution is intractible. Thus we perform a long SGEO-KDE run to obtain an estimate of the “true” mean and covariance of the target distribution. If the prior distribution is taken to be uniform, the target distribution has the form given by (Ihler et. al., 2005) π(x1 , . . . , xN ) = Y ψij (xi , xj ), i>j ψij (xi , xj ) = [1 − Po (xi , xj )](1−oij ) [Po (xi , xj )Pν (dij − ||xi − xj ||)]oij , ||xi − xj ||2 Po (xi , xj ) = exp(− ), and 2R2 Pν (·) = N (·|0, σ 2 ), (1) where oij = {0, 1} denotes whether sensors i and j detect each other. If so, dij denotes the distance measured between the sensors. The total number of unknown sensors is Ns . In accordance to the literature, we choose Ns = 8, R = 0.3 and σ = 0.02. The wormhole influence factor is F = 0.01, with the probability of jumping between modes being 0.68. The HMC parameters are set to be ǫ = 0.014, L = 0.14 and the mass matrix M = I. The HMC acceptance rate is 0.60. The results are shown in Table 3 and Figure 3. In Figure 3, it is shown that WHMC is not able to converge even after 1000 seconds in the plot on the right. For SGEO-KDE, convergence is achieved very quickly. The reason is that the tempered residual potential method finds fictitious modes that destabilize WHMC. To show this, we investigated whether another BFGS run starting from the recorded locations of the modes would lead to another point. Let x̂∗k be the k-th recorded mode (k = 1 being the first mode discovered and k = K denotes the last mode) from either the residual potential method or SGEO-KDE and x∗k be a local maximum of the target distribtion obtained by BFGS using x̂∗k as a starting point. Then we consider the displacement between x̂∗k and x∗k up to the k-th mode as a measure for fictitious modes, that is ∆xk = |x̂∗k − x∗k |. P Figure 4 shows the cumulative displacement, ki=1 ∆xk for the residual potential method and SGEO-KDE over five runs. The results show that significant error tends to develop for residual potential after some modes have been discovered. For SGEO-KDE this trend is not observed and the cumulative error remains small as new modes are being discovere. This is consistent with the fact that the residual potential introduces ficititious modes as genuine ones are being found. 10 Optimization Assisted MCMC REM RECOV 0.2 REM (last n) 35 0.25 30 0.2 0.15 25 0.15 20 0.1 15 0.1 10 0.05 0.05 5 0 0 500 TIME (s) 1000 0 0 500 TIME (s) 1000 0 0 500 TIME (s) 1000 Figure 3: Convergence diagnostics for the sensor network localization problem. SGEOWHMC (on the fly, blue, all modes, magenta) and WHMC with forced mode updates (green). REM (10−3 ) 7.01 17.8 66.4 SGEO-KDE (all modes) SGEO-KDE (on the fly) TRPE (WHMC) RECOV 0.14 4.45 13.93 Table 3: Convergence diagnostics for the sensor network localization problem averaged over 5 runs corresponding to Figure 3. Only the second half of the chains were used. 5. Discussion and Conclusions In principle, the mode searching algorithm described in this article can be applied to many MCMC methods without modification. We have also tried our algorithms on a Gaussian mixture model with 500 dimensions and obtained analogous results for mode searching. The next challenge is to extend our algorithm to much higher dimensions than described in this article. 11 Fok An Wang −4 Residual potential (WHMC) 2 3.5 x 10 SGEO−KDE 3 2.5 cumsum(Δ x) cumsum(Δ x) 1.5 1 2 1.5 1 0.5 0.5 0 0 5 10 k−th known modes 0 15 0 5 10 k−th known modes 15 P Figure 4: This figure shows the cumulative error in mode finding, k ∆xk , as each mode is found. The left shows the results for residual potential mode search. The corresponding results for SGEO-KDE is shown on the right. Note the scales of the two plots on the vertical axes. Appendix A. Review of MCMC algorithms A.1. HMC The HMC originates from Hamiltonian dynamics. The solution of the Hamiltonian equations of motion describes the trajectories of particles under a potential energy U (x) and the kinetic energy K(p), where x is the position and p is the momentum of the particle. In HMC, the potential is chosen to be the negated log-density, i.e. U (x) = − log π(x), where π is the target density. The kinetic energy is K(p) = 12 pT M −1 p, where M is a constant mass matrix, usually set to be the identity. Given a trajectory length L, each candidate point is chosen as the last position of its trajectory. Let H(x, p) = U (x) + K(p) be the Hamiltonian. The Hamiltonian equations of motion are dx dt dp dt = ∇p H, = −∇x H. The above can be solved numerically by using the leapfrog integrator with step size parameter ǫ 12 Optimization Assisted MCMC ǫ ǫ p(t + ) = p(t) − ∇x U (x(t)) 2 2 ǫ x(t + ǫ) = x(t) + ǫM −1 p(t + ) 2 ǫ ǫ p(t + ǫ) = p(t + ) − ∇x U (x(t + ǫ)). 2 2 Let a state be s = (x, p). Analogous to MH, the acceptance probability αM H (s → s′ ) = min(1, w(s′ )/w(s)), where w = p(s)/g(s → s′ ), w′ = p(s′ )/g(s′ → s), p(s) denotes the probability of being in state s and g(s → s′ ) denotes the transition kernel from state s to s′ . Since Hamiltonian dynamics is time reversible, we have g(s → s′ ) = g(s′ → s). Then the acceptance probability is determined by the probability of being in states s and s′ . From statistical mechanics, the probability of finding a state s is3 1 exp[−H(s)], Z where Z is a normalization constant. The acceptance probability for HMC is p(s) = αHM C = min{1, exp[H(s) − H(s′ )]}. The algorithm for HMC is shown in 1. The two input parameters are the leapfrog step size ǫ and the total number of leapfrog steps Nǫ . In practice, these parameters need to be tuned so that the acceptance probability is close to the optimal value of 0.65. Algorithm 1: HMC 1. Given input xt , generate pt ∼ N (0, 1) 2. Obtain (xt+1 , pt+1 ) with Leapfrog steps of size ǫ for Nǫ steps. 3. Accept xt+1 with probability αHM C , discard pt+1 . Despite the efficiency of HMC for target distribution with a single mode, the HMC trajectories can get trapped in only one of the modes when multiple are present. This causes biased sampling where only one of the modes are being sampled. Many methods exists to alleviate this problem. In the next subsection we review the state-of-the-art method for multimodal sampling using HMC, the Wormhole Monte Carlo (WHMC) sampler. Tables 4 and 5 shows the results with HMC on Gaussian mixture models. A.2. WHMC When multiple modes exist in the target distribution, the samples and HMC trajectories can be confined within a local maximum. Given the locations of the modes, the WHMC employs the HMC sampler extended onto a Riemannian manifold, endowed with a metric 3. We set the temperature parameter T = 1 here. The distribution is also called the Boltzmann distribution used in Boltzmann machines. 13 Fok An Wang D = 10, K = 10 D = 20, K = 10 D = 20, K = 20 D = 40, K = 10 D = 100, K = 10 REM 5.471 4.442 4.917 3.707 4.455 RECOV 7.72 17.09 3.835 27.55 56.74 Table 4: Same as Table 1 with results obtained from HMC. D = 10, K = 10 D = 20, K = 10 D = 20, K = 20 D = 40, K = 10 D = 100, K = 10 REM 5.582 5.068 4.767 4.742 4.956 RECOV 9.921 12.62 3.028 18.81 58.39 Table 5: Same as Table 2 with results obtained from HMC. that shortens the intermodal distances. Following Girolami and Calderhead (2011), the constant mass matrix M in the kinetic energy is replaced by the positional dependent metric, M = G(x). An immediate consequence of this is that, as we shall see, at least one of the leapfrog update equations become implicit and the updates require fixed point iterations. Consider we have two known modes at x∗1 and x∗2 . The metric in WHMC comprises of two terms, G = (1 − m(x))I + m(x)GW , where GW is the wormhole metric defined by GW = I − (1 − ǫW )vW (vW )T , with vW being a unit vector point from x∗1 to x∗2 and ǫW is a small and positive number ǫW ≪ 1. The mollifying function m(x) ∈ (0, 1) is defined by m(x) = exp{− 1 (||x − x∗1 || + ||x∗2 − x|| − ||x∗2 − x∗1 ||)}, F where F is the wormhole influence factor. Notice that, using the triangle inequality, ||x − x∗1 || + ||x∗2 − x|| ≥ ||x∗2 − x∗1 ||, m(x) = 1 if and only if x is on the line connecting the two modes and that the wormhole influence factor controls the extent the wormhole metric GW influences the metric G away from the intermodal line. Also, the wormhole metric GW shortens the distance between the two modes. Let v = kvW , where k = ||x∗1 − x∗2 || so that ||v|| = k in Euclidean space. Under GW , the 2-norm of v is v T GW v = ǫ W k 2 . 14 Optimization Assisted MCMC That is to say that the wormhole metric shortens the distance between two modes by a √ factor of ǫW . For more than two modes, a network of wormholes connecting each mode is built. To facilitate moving between modes, Lan et. al. (2013) introduced a vector field, f , along the intermodel line, with norm proportional to the mollifying function and the projection of the velocity4 v = G(x)−1 p onto the intermodal line f (x, v) = m(x)hv, vW ivW with the corresponding modification of the Hamilton equation dx = ∇p H + f (x, v). dt Lastly, Lan et. al. (2013) also introduced an auxiliary dimension to prevent interference between the wormholes and the HMC dynamics. The generalized leapfrog integrator for WHMC is 1 v(t + ) = v(t) − 2 ǫ ∇x U (x(t)) 2 1 1 1 1 x(t + 1) = x(t) + ǫ v(t + ) + f x(t), v(t + ) + f x(t + 1), v(t + ) 2 2 2 2 ǫ 1 v(t + 1) = v(t + ) − ∇x U (x(t + 1)). 2 2 We can see that the update equation for x(t + 1) is implicit and fixed point iteration is needed for this update. A.3. Updating Target Distribution During Regeneration Time Regeneration times are periods where a Markov chain restarts itself probabilistically. States after regeneration is independent of states prior to the regeneration time. In thoery, another set of sampler parameters can be used for the sampler while leaving the MCMC estimates unaffected. In addition, Lan et. al. (2013) and Ahn et. al. (2013) performs BFGS at regeneration times to search for new modes and updates the proposal distribution on-thefly. Regeneration is checked at each step. The regeneartion probability is r(xt , xt+1 ) = where S(xt )Q(xt+1 ) , T (xt+1 |xt ) c S(xt ) = min 1, π(xt )/q(xt ) π(xt+1 )/q(xt+1 ) Q(xt+1 ) = q(xt+1 ) min 1, c π(xt+1 )/q(xt+1 ) T (xt+1 |xt ) = q(xt+1 ) min 1, , π(xt )/q(xt ) 4. The definition of the velocity in classical mechanics is v = M−1 p and we are identifying the mass matrix with the metric here. 15 Fok An Wang q(·) is the proposal indepdence distribution. Note that, r = 1 if and only if q(·) = π(·). Regeneration probability is dependent on the choice of the independence proposal function, and the choice of the constant c. If the choice of the indenpendence sampler is poor, such as in the case where only a minority of the modes in a multimodal target distribution are known, then the regeneration probability can be too small that BFGS is never triggered to find new modes. We encountered such a scenario when testing the sensor network localization problem with WHMC. Another problem with updating the target distribution on-the-fly is that sample estimates are biased by the samples from the previous target distribution. In the following sections, we show that the mean and covariance estimates of the target distribution are significantly more accurate if all modes of the target distribution are found first. To this end we employ a global optimization algorithm SGEO modified to discover all the modes before sampling. Acknowledgments The research of second and third authors are supported by National Science and Engineering Research Grants. We thank Shiwei Lan for providing the code for WHMC. References S. Ahn, Y. Chen, and M. Welling. Distributed and adaptive darting Monte Carlo through regenerations. Journal of Machine Learning Research 31: W&CP, 31, 2013 I. Andricioaiei, J. Straub, and A. Voter. Smart darting Monte Carlo. 114(16), 2001. A.E. Brockwell and J.B. Kadane. Identification of regeneration times in mcmc simulation, with application to adaptive schemes. Journal of Computational and Graphical Statistics, 14(2):436–458, 2005. S. P. Brooks and A. Gelman. General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics 7(4):434–455, 1998. M. Creutz. Global Monte Carlo algorithms for many-fermion systems. Physical Review D, 38(4):1228–1238, 1988. R. Fok, A. An and X. Wang. Geodesic and Contour Optimization Using Conformal Mapping. Journal of Global Optimization, 2016. doi:10.1007/s10898-016-0467-8 A. E. Gelfand and A. F. M. Smith. Sampling-Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association 85 (410): 398–409. Walter R. Gilks; Gareth O. Roberts; Sujit K. Sahu. Adaptive Markov Chain Monte Carlo through Regeneration. Journal of the American Statistical Association 93:443:1045–1054, 1998. M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Statist. Soc. B 73(2),:1–37, 2011. 16 Optimization Assisted MCMC W. K. Hastings, Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 57(1):97–109, 1970. M. D. Hoffman and A. Gelman. Journal of Machine Learning Research 15:1351-1381, 2014 A. T. Ihler, J. W. Fisher, R. L. Moses and A. S. Willsky. IEEE Journal on Selected Areas in Communication 73(2):809–819, 2005. M. Kristan, A. Leonardis, D. Skoaj. Multivariate online Kernel Density Estimation with Gaussian Kernels, Pattern Recognition, 2011. N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equations of State Calculations by Fast Computing Machines. Journal of Chemical Physics 21 (6): 1087–1092, 1953. P. Mykland, L. Tierney, and B. Yu. Regeneration in markov chain samplers. Journal of the American Statistical Association, 90(429):233–241, 1995. S. Lan, V. Stathopoulos, B. Shahbaba and M. Girolami. Lagrangian Dynamical Monte Carlo. arXiv preprint arXiv:1211.3759, 2012. S. Lan, J. Streets and B. Shahbaba. Wormhole Hamiltonian Monte Carlo. arXiv preprint arXiv:1306.0063, 2013. R.M. Neal, MCMC using Hamiltonian dynamics. arXiv preprint arXiv:1206.1901, 2012 C. R. Rao. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calc. Math. Soc., 37, 81–91, 1945. C. Sminchisescu and M.Welling. Generalized dartingmonte carlo. In Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS2007), 2007. online proceedings. Z. Wang, S. Mohamed and N. de Freitas, Adaptive Hamiltonian and Riemann Monte Carlo Samplers. International Conference on Machine Learning (ICML), 2013. 17