Sampling from a different Gaussian

One way to evaluate integral ratios of the form
$$ \langle f \rangle_\varrho = \frac{ \int\! \mathrm{d} x \, \varrho(x) f(x) }{ \int\! \mathrm{d} x \, \varrho(x) } $$
is to use Monte Carlo integration. If $\varrho(x)$ is proportional to the pdf of a Gaussian distribution,
$$ \varrho(x) \propto \exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right), $$
then sampling is very easy, with many languages and environments offering a function for rejection-free sampling (e.g. Julia's `randn`).

For example, we choose the function $f(x) = x^2$ and a Gaussian distribution $\varrho(x)$ with mean $\mu = 2.5$ and standard deviation $\sigma = 1.5$. From the definition of the variance ($\sigma^2 = \langle x^2 \rangle - \mu^2$), we see that
$$ \langle f \rangle_\varrho = \mu^2 + \sigma^2 = 8.5. $$
If we perform the average
$$ \bar{f} = \frac{1}{N} \sum_{i=1}^N f(x_i) $$
for $N = 10^4$ samples and bin the results over many trials, the distribution looks like this:

[Figure: Monte Carlo integration example with Gaussian sampling]
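Here's a minimal sketch of one such trial in Julia (the histogram presumably comes from repeating this many times):

```julia
using Statistics

# One Monte Carlo trial: N samples from the Gaussian with μ = 2.5, σ = 1.5,
# pushed through f(x) = x². The exact value of ⟨f⟩ is μ² + σ² = 8.5.
N = 10^4
xs = 1.5 .* randn(N) .+ 2.5
fbar = mean(xs .^ 2)
```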

Single Gaussian

Suppose that we want to evaluate the same integral ratio as before, but by sampling from a different Gaussian distribution $\varrho'(x')$ with mean $\mu' = -1.2$ and standard deviation $\sigma' = 2.8$. We introduce
$$ x'(x) = \frac{\sigma'}{\sigma} (x - \mu) + \mu', $$
so that
$$ \varrho(x) \propto \varrho'(x'(x)). $$
This gives us
$$ \langle f \rangle_\varrho = \frac{ \int\! \mathrm{d} x \, \varrho'(x'(x)) f(x) }{ \int\! \mathrm{d} x \, \varrho'(x'(x)) }, $$
to which we apply a change of variables from $x$ to $x'$, yielding
$$ \langle f \rangle_\varrho = \frac{ \int\! \mathrm{d} x' \, \varrho'(x') f'(x') }{ \int\! \mathrm{d} x' \, \varrho'(x') }, $$
where
$$ f'(x') = f(x(x')) = f\left( \frac{\sigma}{\sigma'} (x' - \mu') + \mu \right). $$
We don't need to worry about the Jacobian
$$ \frac{\mathrm{d} x}{\mathrm{d} x'} = \frac{\sigma}{\sigma'}, $$
because it is a constant that appears in both the numerator and denominator, so it cancels.

Hence, we see that
$$ \langle f \rangle_\varrho = \langle f' \rangle_{\varrho'}. $$
We may verify that this is the case by sampling from $\varrho'$ and evaluating
$$ \bar{f}' = \frac{1}{N} \sum_{i=1}^N f\left( \frac{\sigma}{\sigma'} (x_i' - \mu') + \mu \right). $$
Binning the results from several trials, we obtain:

[Figure: Sampling from a single different Gaussian]

As expected, this distribution looks very similar to the previous one.

From an implementation perspective, this result is incredibly trivial. Since in practice we obtain random values from a standard Gaussian distribution (with zero mean and unit variance), sampling from a Gaussian with specific parameters amounts to a scaling and a shift:

```julia
# Scale and shift standard normal samples: x ~ N(2.5, 1.5²).
xs1 = 1.5 .* randn(SAMPLE_SIZE) .+ 2.5
```

It is then clear that sampling with any other parameters and performing the described transformation is functionally identical:

```julia
# Sample from the second Gaussian, then map back to the first.
xs2 = 2.8 .* randn(SAMPLE_SIZE) .+ (-1.2)
xs1 = (xs2 .- (-1.2)) .* (1.5 / 2.8) .+ 2.5
```
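Either construction of `xs1` feeds the same estimator; as a quick check (assuming `SAMPLE_SIZE` is defined and $f(x) = x^2$ as above):

```julia
using Statistics

# xs1 built directly and xs1 recovered from xs2 are both valid samples of
# the same Gaussian, so either estimate of ⟨x²⟩ should be close to 8.5.
mean(xs1 .^ 2)
```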

Multiple Gaussians

The case of several independent Gaussians is a straightforward generalization of the above. In principle, multiple correlated Gaussians may always be transformed into independent Gaussians, so we may stop here. However, it is interesting to see what the transformation for the sampled values is in terms of the original coordinates.

We now consider a distribution $\varrho$ with mean vector $\vec{\mu}$ and covariance matrix $\mathbf{\Sigma}$, whose pdf is
$$ \varrho(\vec{x}) \propto \exp\left( -\frac{1}{2} (\vec{x} - \vec{\mu})^\mathrm{T} \mathbf{\Sigma}^{-1} (\vec{x} - \vec{\mu}) \right). $$
For convenience, we introduce $\mathbf{A} = \mathbf{\Sigma}^{-1}$ and diagonalize it as $\mathbf{A} = \mathbf{S}^\mathrm{T} \mathbf{\Lambda} \mathbf{S}$. If we have another distribution $\varrho'$ with parameters $\vec{\mu}'$ and $\mathbf{\Sigma}'$, then the choice
$$ \vec{x}'(\vec{x}) = (\mathbf{S}')^\mathrm{T} \sqrt{(\mathbf{\Lambda}')^{-1} \mathbf{\Lambda}} \, \mathbf{S} (\vec{x} - \vec{\mu}) + \vec{\mu}' $$
results in
$$ \varrho(\vec{x}) \propto \varrho'(\vec{x}'(\vec{x})). $$
This is essentially identical to the univariate $x'(x)$, but with the introduction of the transformation matrices $\mathbf{S}$ and $\mathbf{S}'$ to decouple the coordinates.
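As a sketch in Julia (with `gaussian_map` a name of my choosing; `eigen` returns the decomposition $\mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^\mathrm{T}$, so $\mathbf{S} = \mathbf{V}^\mathrm{T}$ in the notation above):

```julia
using LinearAlgebra

# Map a sample x of ϱ (parameters μ, Σ) to a sample x′ of ϱ′ (parameters μ′, Σ′)
# via x′(x) = S′ᵀ √(Λ′⁻¹Λ) S (x − μ) + μ′, where Σ⁻¹ = SᵀΛS.
function gaussian_map(x, μ, Σ, μ′, Σ′)
    λ, V   = eigen(Symmetric(inv(Σ)))    # Σ⁻¹  = V Diagonal(λ) Vᵀ, so S = Vᵀ
    λ′, V′ = eigen(Symmetric(inv(Σ′)))   # Σ′⁻¹ = V′ Diagonal(λ′) V′ᵀ
    V′ * Diagonal(sqrt.(λ ./ λ′)) * (V' * (x - μ)) + μ′
end
```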

As before, we have that
$$ \langle f \rangle_\varrho = \langle f' \rangle_{\varrho'}, $$
where
$$ f'(\vec{x}') = f(\vec{x}(\vec{x}')) = f\left( \mathbf{S}^\mathrm{T} \sqrt{\mathbf{\Lambda}^{-1} \mathbf{\Lambda}'} \, \mathbf{S}' (\vec{x}' - \vec{\mu}') + \vec{\mu} \right). $$
Again, this is not surprising if we consider what is done operationally when sampling from independent standard Gaussian distributions.

We may check this result with
$$ \vec{\mu} = \begin{pmatrix} -0.6 & 1.4 \end{pmatrix}, \quad \mathbf{\Sigma} = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}, \quad \vec{\mu}' = \begin{pmatrix} 0.6 & -1.4 \end{pmatrix}, \quad \mathbf{\Sigma}' = \begin{pmatrix} 3 & -1 \\ -1 & 2 \end{pmatrix}, $$
and
$$ f(\vec{x}) = (x_1 + x_2)^2. $$
The resulting distributions are:

[Figure: Sampling from multiple different Gaussians]

The grey curve is from sampling $\varrho$, while the black curve is from sampling $\varrho'$ and then transforming the sampled coordinates. The exact result (dotted) is given by
$$ (\mu_1 + \mu_2)^2 + \frac{(s_{11} + s_{12})^2}{\lambda_1} + \frac{(s_{21} + s_{22})^2}{\lambda_2} = 7.64. $$
So far, we've seen that sampling from any Gaussian is as good as any other.
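A sketch of this check, reusing `gaussian_map` from above (the Cholesky construction is one common way to draw the correlated samples, not necessarily the one used for the plots):

```julia
using LinearAlgebra, Statistics

μ  = [-0.6, 1.4];   Σ  = [2.0 1.0; 1.0 3.0]
μ′ = [ 0.6, -1.4];  Σ′ = [3.0 -1.0; -1.0 2.0]
f(x) = (x[1] + x[2])^2

# Draw a correlated sample using the Cholesky factor of the covariance matrix.
draw(μ, Σ) = μ + cholesky(Σ).L * randn(2)

# Direct sampling from ϱ, and sampling from ϱ′ followed by the inverse map
# x(x′) (i.e. gaussian_map with the roles of the two distributions swapped).
direct      = mean(f(draw(μ, Σ)) for _ in 1:10^4)
transformed = mean(f(gaussian_map(draw(μ′, Σ′), μ′, Σ′, μ, Σ)) for _ in 1:10^4)
# Both should be close to the exact value 7.64.
```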

Application: Finite difference

Consider the function
$$ F(p) = \langle f(p) \rangle_p = \frac{ \int\! \mathrm{d} x \, \varrho(p, x) f(p, x) }{ \int\! \mathrm{d} x \, \varrho(p, x) }, $$
where $x$ is a scalar or a vector. We may approximate its derivative using a finite difference formula:
$$ \frac{\mathrm{d} F}{\mathrm{d} p} \approx \frac{1}{2 \Delta p} \left( F(p + \Delta p) - F(p - \Delta p) \right). $$
Similarly, its log derivative is given by
$$ \frac{\mathrm{d} \ln{F}}{\mathrm{d} p} \approx \frac{1}{2 \Delta p} \cdot \frac{ F(p + \Delta p) - F(p - \Delta p) }{ F(p) }. $$
This is equivalently written as
$$ \frac{\mathrm{d} \ln{F}}{\mathrm{d} p} \approx \frac{1}{2 \Delta p} \cdot \frac{ \langle f(p + \Delta p) \rangle_{p + \Delta p} - \langle f(p - \Delta p) \rangle_{p - \Delta p} }{ \langle f(p) \rangle_p }. $$
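As a sketch, the straightforward estimator draws fresh samples from all three distributions (`sample_ρ(p)` and `f(p, x)` are hypothetical problem-specific functions):

```julia
using Statistics

# Naive estimate of d(ln F)/dp: independent samples from ϱ(p+Δp, ·),
# ϱ(p−Δp, ·), and ϱ(p, ·).
function dlnF_dp_naive(p, Δp, N)
    Fbar(q) = mean(f(q, sample_ρ(q)) for _ in 1:N)
    (Fbar(p + Δp) - Fbar(p - Δp)) / (2Δp * Fbar(p))
end
```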

Suppose that sampling is the most expensive part of the calculation and so we only wish to sample values from a single distribution. From the above discussion, we know that there are functions $f^+(x)$ and $f^-(x)$ such that
$$ \frac{\mathrm{d} \ln{F}}{\mathrm{d} p} \approx \frac{1}{2 \Delta p} \cdot \frac{ \langle f^+ \rangle_p - \langle f^- \rangle_p }{ \langle f(p) \rangle_p }, $$
where the only approximation is due to the finite difference. Specifically,
$$ f^\pm(x) = f(p \pm \Delta p, x^\pm(x)), $$
where $x^\pm(x)$ are the appropriate coordinate transformations. Hence, we may estimate this derivative by sampling from just one distribution rather than three.
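A minimal sketch of the single-distribution estimator (again with hypothetical names; `coord_map(x, p, q)` is assumed to map samples of $\varrho(p, \cdot)$ to samples of $\varrho(q, \cdot)$):

```julia
using Statistics

# Estimate d(ln F)/dp from samples of ϱ(p, ·) only: all three averages
# reuse the same points xs, with f± evaluated at transformed coordinates.
function dlnF_dp(p, Δp, N)
    xs  = [sample_ρ(p) for _ in 1:N]
    num = mean(f(p + Δp, coord_map(x, p, p + Δp)) -
               f(p - Δp, coord_map(x, p, p - Δp)) for x in xs)
    den = mean(f(p, x) for x in xs)
    num / (2Δp * den)
end
```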

For example, consider the multivariate Gaussian $\varrho(p, \vec{x})$ with
$$ \vec{\mu} = \begin{pmatrix} -0.6 & 1.4 \end{pmatrix} \quad \text{and} \quad \mathbf{\Sigma}(p) = \begin{pmatrix} 2 p & 1 \\ 1 & 3/p \end{pmatrix}. $$
If we choose the function
$$ f(p, \vec{x}) = p^2 (x_1 + x_2)^2, $$
then $F(p=1) = 7.64$ is the value we were estimating in the previous section. Now we will estimate
$$ \left. \frac{\mathrm{d} \ln{F}}{\mathrm{d} p} \right|_{p=1} $$
with $\Delta p = 10^{-4}$. The distributions that we get are:

[Figure: Sampling a finite difference by substituting Gaussians]

The black curve is generated by sampling only from $\varrho(p, \vec{x})$ (and not $\varrho(p + \Delta p, \vec{x})$ or $\varrho(p - \Delta p, \vec{x})$). The dotted line is found by applying the finite difference method to the exact result. The grey curve (which looks like a horizontal line in the above plot) shows what happens if we sample from all three distributions separately. To really appreciate the difference between the two, it helps to zoom out:

[Figure: Sampling a finite difference by substituting Gaussians (wide view)]
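For this example, the hypothetical pieces above might be wired up as follows (reusing `gaussian_map` and `dlnF_dp`; the mean is independent of $p$, so only the covariance is transformed):

```julia
using LinearAlgebra

Σp(p) = [2p 1.0; 1.0 3/p]     # covariance matrix Σ(p)
μ = [-0.6, 1.4]
f(p, x) = p^2 * (x[1] + x[2])^2

sample_ρ(p) = μ + cholesky(Σp(p)).L * randn(2)
coord_map(x, p, q) = gaussian_map(x, μ, Σp(p), μ, Σp(q))

dlnF_dp(1.0, 1e-4, 10^4)      # single-distribution estimate at p = 1
```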

Surprisingly, it seems that we not only save on sampling effort using this technique, but we also significantly reduce the standard error of the mean! This reduction might be due to a favourable cancellation of errors that is only possible when the sampled points are the same.

We may also try to evaluate the derivative of a more convoluted expression. Starting from
$$ F(p) = \frac{ \int\! \mathrm{d} x \, g(p, x) }{ \int\! \mathrm{d} x \, \varrho(p, x) }, $$
we may always obtain
$$ F(p) = \langle f(p) \rangle_p = \frac{ \int\! \mathrm{d} x \, \varrho(p, x) f(p, x) }{ \int\! \mathrm{d} x \, \varrho(p, x) } $$
by setting
$$ f(p, x) = \frac{g(p, x)}{\varrho(p, x)}. $$

For our example, we'll use
$$ g(p, \vec{x}) = \exp\left( -p (x_1^2 + x_2^2) \right) $$
and the same $\varrho(p, \vec{x})$ as before. We obtain a simple form for the expression:
$$ F(p) \propto \frac{\sqrt{\lambda_1 \lambda_2}}{p}. $$
Because $\lambda_1 \lambda_2$ is the determinant of $\mathbf{A}$, which happens to be constant in $p$, we have that
$$ \frac{\mathrm{d} \ln{F}}{\mathrm{d} p} = -\frac{1}{p}. $$
Sampling the finite difference at $p=1$ yields:

[Figure: Sampling a finite difference by substituting Gaussians]

[Figure: Sampling a finite difference by substituting Gaussians (wide view)]

Once again, we see an improvement if we only sample once.
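A sketch of how this case plugs into the same machinery (note that $\varrho$ must enter $f$ in the same unnormalized form as in its definition; $p$-independent constant factors drop out of the log derivative anyway):

```julia
using LinearAlgebra

# Reuse μ, Σp, sample_ρ, and coord_map from the previous sketch; f is
# redefined as the ratio g/ϱ with ϱ in unnormalized form.
g(p, x) = exp(-p * (x[1]^2 + x[2]^2))
ϱ(p, x) = exp(-0.5 * dot(x - μ, Σp(p) \ (x - μ)))
f(p, x) = g(p, x) / ϱ(p, x)

dlnF_dp(1.0, 1e-4, 10^4)      # should be close to d(ln F)/dp = −1 at p = 1
```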

It appears that although sampling from one Gaussian and then swapping it out for a different one is formally a useless procedure, there do exist situations where it provides some benefit.