Algorithms¶
Euler-Maruyama (SGLD)¶
The classic Euler-Maruyama scheme, also known as Stochastic Gradient Langevin Dynamics (SGLD) when a stochastic approximation to the gradient is used.
This scheme uses the gradient of the posterior to move through configuration space, without Metropolis correction. It produces a biased trajectory (though convergent as the learning rate decreases), but is a good first choice in high dimensions for general distributions.
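As an illustrative sketch (not the library's internal code), one Euler-Maruyama/SGLD step and a short run on a standard Gaussian target, whose log-posterior gradient is -x:

```python
import numpy as np

def sgld_step(x, grad_fn, h, rng):
    # Euler-Maruyama / SGLD update:
    #   x' = x + h * grad(log posterior)(x) + sqrt(2 h) * N(0, I)
    return x + h * grad_fn(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)

# Example: sample a 2D standard Gaussian (gradient of the log posterior is -x).
rng = np.random.default_rng(0)
x = np.zeros(2)
samples = np.empty((5000, 2))
for n in range(5000):
    x = sgld_step(x, lambda y: -y, 0.05, rng)
    samples[n] = x
```

The trajectory is biased at finite learning rate; the empirical statistics approach the true values as h is decreased.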
Usage¶
Set algo="sgld"
in The sampler module.
Requires¶
The llh function needs to output the following dictionary keys:
grad :: The (potentially approximate) gradient of the log posterior.
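The exact llh call signature is specified in the sampler module; as a hypothetical illustration, a log-posterior function for a standard Gaussian target that returns the required grad key could look like:

```python
import numpy as np

def llh(x):
    # Hypothetical log-posterior callable for a standard Gaussian target.
    # Returns the dictionary keys the schemes consume; SGLD only needs 'grad'.
    return {"llh": -0.5 * float(x @ x), "grad": -x}

out = llh(np.array([1.0, 2.0]))
```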
Params¶
The behavior of the sampler can be customized by including the following arguments in the sampler’s params dict.
This scheme takes no additional parameters.
References
This scheme is extensively studied and is covered in most good texts on stochastic numerical methods, e.g.:
Numerical Solution of Stochastic Differential Equations by Peter Kloeden and Eckhard Platen, Springer (1992)
Welling, Max, and Yee W. Teh. “Bayesian learning via stochastic gradient Langevin dynamics.” Proceedings of the 28th international conference on machine learning (ICML-11). 2011 https://www.ics.uci.edu/~welling/publications/papers/stoclangevin_v6.pdf
Leimkuhler-Matthews (LM)¶
A biased MCMC sampling method that provides significant accuracy improvements over the Euler-Maruyama scheme, without extra gradient evaluations. Gives exact sampling for Gaussian target distributions.
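A minimal sketch of the Leimkuhler-Matthews update (illustrative, not the library's code): it is the Euler-Maruyama step with the noise replaced by the average of the current and next Gaussian draws, which are reused between steps:

```python
import numpy as np

def lm_trajectory(x0, grad_fn, h, n_steps, rng):
    x = np.array(x0, dtype=float)
    out = np.empty((n_steps, x.size))
    r_old = rng.standard_normal(x.size)
    for n in range(n_steps):
        r_new = rng.standard_normal(x.size)
        # Same cost as Euler-Maruyama: one gradient, one fresh Gaussian draw.
        x = x + h * grad_fn(x) + np.sqrt(2.0 * h) * 0.5 * (r_old + r_new)
        r_old = r_new
        out[n] = x
    return out

# For a 1D standard Gaussian target the stationary variance is exactly 1.
rng = np.random.default_rng(1)
traj = lm_trajectory(np.zeros(1), lambda y: -y, 0.1, 20000, rng)
```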
Usage¶
Set algo="LM"
in The sampler module.
Requires¶
The llh function needs to output the following dictionary keys:
grad :: The gradient of the log posterior.
Params¶
The behavior of the sampler can be customized by including the following arguments in the sampler’s params dict.
This scheme takes no additional parameters.
References
Molecular Dynamics: With Deterministic and Stochastic Numerical Methods by Benedict Leimkuhler and Charles Matthews, Springer (2015)
Random Walk Metropolis¶
One of the simplest and most widely used MCMC sampling algorithms, the Metropolis algorithm proposes a new point from an isotropic Gaussian distribution and accepts it with a probability that preserves the detailed balance condition.
The Metropolis condition ensures this method is unbiased, but it can be extremely slow to converge (particularly in high dimension).
Note
The acceptance rate for the scheme can be checked by calling the acceptance_rate function on the sampler. A good rule of thumb is to choose a learning rate such that the acceptance rate is around 0.234.
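A self-contained sketch of the algorithm (the library's own interface differs), including the acceptance-rate bookkeeping used for tuning:

```python
import numpy as np

def rw_metropolis(x0, logpost, step, n_steps, rng):
    x = np.array(x0, dtype=float)
    lp = logpost(x)
    accepted = 0
    samples = np.empty((n_steps, x.size))
    for n in range(n_steps):
        # Isotropic Gaussian proposal
        prop = x + step * rng.standard_normal(x.size)
        lp_prop = logpost(prop)
        # Metropolis test: accept with probability min(1, pi(prop) / pi(x))
        if np.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
            accepted += 1
        samples[n] = x
    return samples, accepted / n_steps

rng = np.random.default_rng(2)
samples, acc = rw_metropolis(np.zeros(2), lambda y: -0.5 * y @ y, 1.0, 5000, rng)
```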
Usage¶
Set algo="RWMetropolis"
in The sampler module.
Requires¶
The llh function needs to output the following dictionary keys:
llh :: The value of the log posterior.
Params¶
The behavior of the sampler can be customized by including the following arguments in the sampler’s params dict.
This scheme takes no additional parameters.
BAOAB¶
A biased Langevin sampling routine that incorporates momentum to expedite sampling over barriers. A damping parameter can be chosen to customize the mixing with the heat bath. In the limit of infinite damping this method reduces to the LM scheme and, like it, is exact for Gaussian distributions.
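One BAOAB step can be sketched as follows (a minimal illustration assuming unit mass and temperature, not the packaged implementation), with friction parameter g:

```python
import numpy as np

def baoab_step(x, p, grad_fn, h, g, rng):
    p = p + 0.5 * h * grad_fn(x)   # B: half momentum kick
    x = x + 0.5 * h * p            # A: half position drift
    # O: exact Ornstein-Uhlenbeck solve on the momentum
    c = np.exp(-g * h)
    p = c * p + np.sqrt(1.0 - c * c) * rng.standard_normal(x.shape)
    x = x + 0.5 * h * p            # A: second half drift
    p = p + 0.5 * h * grad_fn(x)   # B: second half kick
    return x, p

# Sample a 1D standard Gaussian target (gradient of the log posterior is -x).
rng = np.random.default_rng(3)
x, p = np.zeros(1), np.zeros(1)
xs = np.empty(20000)
for n in range(20000):
    x, p = baoab_step(x, p, lambda y: -y, 0.2, 1.0, rng)
    xs[n] = x[0]
```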
Usage¶
Set algo="baoab"
in The sampler module.
Requires¶
The llh function needs to output the following dictionary keys:
grad :: The gradient of the log posterior.
Params¶
The behavior of the sampler can be customized by including the following arguments in the sampler’s params dict.
g :: positive float (Default 1.0). The isotropic friction constant (gamma) used in the Langevin dynamics.
References
Molecular Dynamics: With Deterministic and Stochastic Numerical Methods by Benedict Leimkuhler and Charles Matthews, Springer (2015)
Hybrid Monte Carlo (HMC)¶
Hybrid (or Hamiltonian) Monte Carlo takes steps of constant-energy dynamics after redrawing the kinetic energy. The symplectic scheme used for the constant-energy integration gives very favorable scaling with dimension, in spite of the Metropolization step.
Note
The acceptance rate for the scheme can be checked by calling the acceptance_rate function on the sampler. A good rule of thumb is to choose a learning rate such that the acceptance rate is between 0.7 and 0.8.
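The steps above can be sketched as a self-contained illustration (unit mass, not the library's code): redraw the momentum, take T leapfrog substeps, then apply the Metropolis test to the total-energy change:

```python
import numpy as np

def hmc_step(x, llh_fn, grad_fn, h, T, rng):
    # Redraw the momentum from N(0, I), then integrate with leapfrog.
    p = rng.standard_normal(x.shape)
    x_new, p_new = x.copy(), p.copy()
    p_new = p_new + 0.5 * h * grad_fn(x_new)
    for t in range(T):
        x_new = x_new + h * p_new
        kick = h if t < T - 1 else 0.5 * h
        p_new = p_new + kick * grad_fn(x_new)
    # Metropolis test on the change in total energy H = -llh + |p|^2 / 2
    dH = (llh_fn(x_new) - 0.5 * p_new @ p_new) - (llh_fn(x) - 0.5 * p @ p)
    if np.log(rng.random()) < dH:
        return x_new, True
    return x, False

rng = np.random.default_rng(4)
x = np.zeros(2)
xs = np.empty((2000, 2))
accepts = 0
for n in range(2000):
    x, ok = hmc_step(x, lambda y: -0.5 * y @ y, lambda y: -y, 0.2, 5, rng)
    accepts += ok
    xs[n] = x
rate = accepts / 2000
```

The symplectic leapfrog integrator keeps the energy error small, so the acceptance rate here stays high even with a fairly large step size.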
Usage¶
Set algo="hmc"
in The sampler module.
Requires¶
The llh function needs to output the following dictionary keys:
llh :: The value of the log posterior.
grad :: The gradient of the log posterior.
Params¶
The behavior of the sampler can be customized by including the following arguments in the sampler’s params dict.
T :: positive int (Default 5). The number of constant-energy substeps to take between Metropolis tests.
References
Duane, Simon, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. “Hybrid Monte Carlo.” Physics Letters B 195.2 (1987) https://doi.org/10.1016/0370-2693(87)91197-X
BADODAB¶
A scheme that adaptively learns a constant damping term required to correct the distribution when using stochastic gradients.
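One step of the BADODAB splitting can be sketched as follows, under stated assumptions (unit mass and temperature; the library's implementation may differ). The adaptive variable xi acts as a friction that equilibrates to absorb the injected noise:

```python
import numpy as np

def badodab_step(x, p, xi, grad_fn, h, g, mu, rng):
    n = x.size
    p = p + 0.5 * h * grad_fn(x)           # B: half momentum kick
    x = x + 0.5 * h * p                    # A: half position drift
    xi = xi + 0.5 * h * (p @ p - n) / mu   # D: adapt the friction
    # O: exact solve of dp = -xi p dt + sqrt(2 g) dW for fixed xi
    if abs(xi) > 1e-12:
        c = np.exp(-xi * h)
        p = c * p + np.sqrt(g * (1.0 - c * c) / xi) * rng.standard_normal(n)
    else:
        p = p + np.sqrt(2.0 * g * h) * rng.standard_normal(n)
    xi = xi + 0.5 * h * (p @ p - n) / mu   # D
    x = x + 0.5 * h * p                    # A
    p = p + 0.5 * h * grad_fn(x)           # B
    return x, p, xi

# With an exact gradient, xi settles so that the injected noise is dissipated.
rng = np.random.default_rng(5)
x, p, xi = np.zeros(2), np.zeros(2), 1.0
xs = np.empty((10000, 2))
for n in range(10000):
    x, p, xi = badodab_step(x, p, xi, lambda y: -y, 0.1, 1.0, 1.0, rng)
    xs[n] = x
```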
Usage¶
Set algo="badodab"
in The sampler module.
Requires¶
The llh function needs to output the following dictionary keys:
grad :: The gradient of the log posterior.
Params¶
The behavior of the sampler can be customized by including the following arguments in the sampler’s params dict.
g :: positive float (Default 1.0). The damping parameter used in the Langevin dynamics.
mu :: positive float (Default 1.0). The precision (reciprocal variance) of the auxiliary variables' distribution.
References
Leimkuhler, Benedict, and Xiaocheng Shang. “Adaptive thermostats for noisy gradient systems.” SIAM Journal on Scientific Computing 38.2 (2016)
Stochastic Gradient HMC¶
A biased scheme for use with stochastic gradients. Damps the dynamics to compensate for additional noise coming from a stochastic gradient.
The covariance of the stochastic gradient term, denoted B, is calculated at every step using a user-supplied function that takes as input the current value of B and the grad_data terms, if supplied.
For reasons of ergodicity (i.e. to ensure proper mixing behavior in the method) a matrix C >= B*h/2 must be given, where h is the learning rate or step size. In this implementation the identity matrix scaled by a parameter g is used as the C matrix.
The large damping parameter required means that this scheme can easily become unstable and slow to converge if the gradient noise is large.
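Under the notation above (C = g*I, with B the gradient-noise covariance, taken scalar here for simplicity), one step of the scheme can be sketched as follows; this is an illustration, not the packaged implementation:

```python
import numpy as np

def sghmc_step(x, v, grad_fn, h, g, B, rng):
    # Position drift, then a kicked, damped momentum update whose injected
    # noise has variance 2 * (C - B h / 2) * h with C = g * I (scalar here).
    x = x + h * v
    noise_var = 2.0 * (g - 0.5 * h * B) * h
    v = (v + h * grad_fn(x) - h * g * v
         + np.sqrt(noise_var) * rng.standard_normal(x.shape))
    return x, v

# With an exact gradient (B = 0) this reduces to damped, discretized
# Hamiltonian dynamics; here on a 2D standard Gaussian target.
rng = np.random.default_rng(6)
x, v = np.zeros(2), np.zeros(2)
xs = np.empty((10000, 2))
for n in range(10000):
    x, v = sghmc_step(x, v, lambda y: -y, 0.1, 1.0, 0.0, rng)
    xs[n] = x
```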
Usage¶
Set algo="sghmc"
in The sampler module.
Requires¶
The llh function needs to output the following dictionary keys:
grad :: The gradient of the log posterior.
grad_data :: The gradient of the log likelihood with respect to the current position, itemized over the data batch. Expects an (N, D) array, where the position space is N-dimensional and the batch size is D.
Params¶
The behavior of the sampler can be customized by including the following arguments in the sampler’s params dict.
g :: positive float (Default 1.0). The damping parameter used in the Langevin dynamics. Must be large enough to dominate the gradient noise.
auto_friction :: bool (Default True). Automatically increase the friction g if the target covariance matrix is not positive-definite.
B :: numpy array (Default 0). The initial value to use as the damping matrix for the stochastic gradient.
grad_cov :: function. A function estimating the covariance of the stochastic gradient, with signature grad_cov(B, grad_data). The current covariance estimate B and the grad_data term are supplied to the function, and it outputs an (N, N) numpy array as the new B matrix. If a function is not specified, the default behavior is to use the last estimate of B without any change.
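For illustration, a grad_cov function matching the stated signature could blend the running estimate with the empirical covariance of the itemized gradients (the smoothing factor alpha is an arbitrary choice here, not a library parameter):

```python
import numpy as np

def grad_cov(B, grad_data, alpha=0.1):
    # Exponential moving average of the empirical covariance of the
    # per-datum gradients; grad_data is (N, D): N parameters, D batch items.
    centered = grad_data - grad_data.mean(axis=1, keepdims=True)
    emp = centered @ centered.T / max(grad_data.shape[1] - 1, 1)
    return (1.0 - alpha) * np.asarray(B) + alpha * emp

rng = np.random.default_rng(7)
B_new = grad_cov(np.zeros((3, 3)), rng.standard_normal((3, 50)))
```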
References
Chen, Tianqi, Emily Fox, and Carlos Guestrin. “Stochastic Gradient Hamiltonian Monte Carlo.” International Conference on Machine Learning. (2014)
Racecar¶
A general purpose sampling algorithm, specifically designed for efficient sampling in systems with a stochastic gradient.
Usage¶
Set algo="racecar"
in The sampler module.
Requires¶
The llh function needs to output the following dictionary keys:
grad :: The gradient of the log posterior.
grad_data :: (optional) The gradient of the log likelihood with respect to the current position, itemized over the data batch. Expects an (N, D) array, where the position space is N-dimensional and the batch size is D.
Params¶
The behavior of the sampler can be customized by including the following arguments in the sampler’s params dict.
g :: positive float (Default 1.0). The damping parameter used in the Langevin dynamics. Must be large enough to dominate the force noise.
mu :: positive float (Default 1.0). The precision (reciprocal variance) of the auxiliary variables' distribution.
estimate_basis :: bool (Default True). If set to True and grad_data is given, then the method will approximate the dominant directions for the gradient noise.
estimate_time :: float (Default 0.5). The time in seconds to spend initially estimating the basis vectors by generating new grad_data. By default 0.5 s is spent building this estimate at setup. Set it to 0 to use only the initial grad_data for the guess.
basis_size :: positive int (Default Ndim). The number of basis vectors to use. A smaller number can improve efficiency in high dimensions, at a potential cost in accuracy.
basis :: N x K numpy array (optional). The rank-K array of basis vectors to use for damping the noisy gradient.
Note
The Racecar scheme damps the system in directions specified by basis vectors. If a basis is not given and estimate_basis is True, then these vectors are estimated from the grad_data value if it is output by the llh function. If there is no grad_data, the standard (diagonal) basis is used instead.
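The library's exact estimation procedure is not documented here; as one plausible illustration of the idea, the dominant noise directions of an (N, D) grad_data array can be extracted from an SVD of the centered per-datum gradients:

```python
import numpy as np

def estimate_basis(grad_data, k):
    # Top-k left singular vectors of the centered per-datum gradients,
    # returned as an (N, k) orthonormal basis for the noise directions.
    centered = grad_data - grad_data.mean(axis=1, keepdims=True)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k]

rng = np.random.default_rng(8)
basis = estimate_basis(rng.standard_normal((5, 40)), 2)
```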