Background

A GP is a distribution over functions \(f\parenth{x}\) such that any finite set of \(n\) values \(\vec{f}=f\parenth{\vec{x}}\) evaluated at \(\vec{x}=\braces{x_1,\ldots,x_n}\) follows a multivariate normal distribution. The distribution is thus fully specified by its mean \(\mu\parenth{x}=\E{f\parenth{x}}\) and covariance (kernel) \(k\parenth{x,x'}=\cov\parenth{f\parenth{x},f\parenth{x'}}\). Evaluating the likelihood requires inverting the covariance matrix \(\mat{K}\) obtained by evaluating the kernel \(k\) at all pairs of observed locations \(\vec{x}\). Unfortunately, the computational cost of inverting \(\mat{K}\) scales as \(\BigO\parenth{n^3}\), making GPs prohibitively expensive for all but relatively small datasets.
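
For concreteness, the following minimal sketch implements this dense, centered formulation in Stan with a squared-exponential kernel and standard Stan functions; the variable names and the small jitter added to the diagonal are illustrative choices, not part of the text above.

```stan
// Dense GP likelihood: multi_normal factorizes the n x n covariance
// matrix internally, which is the O(n^3) bottleneck discussed above.
data {
    int<lower=1> n;
    array[n] real x;              // observation locations
    vector[n] y;                  // observed values
}
parameters {
    real<lower=0> sigma;          // kernel marginal scale
    real<lower=0> length_scale;   // kernel correlation length
    real<lower=0> kappa;          // observation noise scale
    vector[n] f;                  // latent GP values
}
model {
    // Squared-exponential covariance with a small jitter for stability.
    matrix[n, n] K = add_diag(gp_exp_quad_cov(x, sigma, length_scale), 1e-9);
    f ~ multi_normal(zeros_vector(n), K);
    y ~ normal(f, kappa);
}
```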

Sparse Approximation

The joint distribution of observations \(\vec{f}\) may be expressed as the product of conditional distributions

\[\proba{\vec{f}}=\proba{f_1}\prod_{j=2}^n \proba{f_j\mid f_{j-1}, \ldots, f_1}.\]

The conditional structure can be encoded by a directed acyclic graph (DAG) whose nodes represent observations, with a directed edge from each node \(j\) to each of its predecessors \(\pred_j=\braces{j-1,\ldots,1}\); the ordering is arbitrary. If two observations are conditionally independent, the corresponding edge can be removed from the DAG to reduce the computational cost. In particular, evaluating each factor above requires inverting a matrix whose size equals the number of predecessors of the corresponding node, a substantial saving if the graph is sparse. For example, nearest-neighbor methods, a special case of this construction, reduce the asymptotic runtime to \(\BigO\parenth{n q^3}\) by retaining only edges from each node to at most \(q\) of its nearest predecessors.
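
Concretely, each factor is the standard Gaussian conditional whose moments involve only the kernel submatrices indexed by \(j\) and its retained predecessors,

\[\proba{f_j\mid f_{j-1},\ldots,f_1}=\mathsf{Normal}\parenth{\mat{K}_{j,\pred_j}\mat{K}_{\pred_j,\pred_j}^{-1}\vec{f}_{\pred_j},\,\mat{K}_{j,j}-\mat{K}_{j,\pred_j}\mat{K}_{\pred_j,\pred_j}^{-1}\mat{K}_{\pred_j,j}},\]

so only \(\mat{K}_{\pred_j,\pred_j}\), whose size equals the number of retained predecessors, must be inverted for each node.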

Note

gp_graph_exp_quad_cov_lpdf(vector, vector, array [] vector, real, real, array [,] int) evaluates the likelihood of a GP on a DAG. gp_graph_matern32_cov_lpdf and gp_graph_matern52_cov_lpdf evaluate the corresponding likelihoods for Matérn kernels. Transformations for non-centered parameterizations are implemented as gp_inv_graph_exp_quad_cov(vector, vector, array [] vector, real, real, array [,] int), gp_inv_graph_matern32_cov, and gp_inv_graph_matern52_cov.
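
The sketch below illustrates how the graph likelihood might be used, based on the signature above; the interpretation of the arguments (location vector, observation coordinates, kernel scale, correlation length, and a two-row edge list) is an assumption here, so consult the function documentation for the exact conventions.

```stan
data {
    int<lower=1> n;              // number of observations (nodes)
    int<lower=0> m;              // number of directed edges in the DAG
    array[n] vector[1] x;        // observation locations
    vector[n] y;                 // observed GP values
    // Assumed layout: edges[1] holds child nodes, edges[2] their predecessors.
    array[2, m] int edges;
}
parameters {
    real<lower=0> sigma;         // kernel marginal scale
    real<lower=0> length_scale;  // kernel correlation length
}
model {
    // The sampling statement drops the _lpdf suffix; a zero mean is assumed.
    y ~ gp_graph_exp_quad_cov(zeros_vector(n), x, sigma, length_scale, edges);
}
```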

Fourier Methods

If the observation points \(\vec{x}\) form a regular grid and the kernel is stationary, i.e., \(k\parenth{x,x'}=k\parenth{x-x'}\), we can use the fast Fourier transform (FFT) to evaluate the likelihood exactly in \(\BigO\parenth{n\log n}\) time. As the Fourier transform is a linear operator and \(f\) is (infinite-dimensional) multivariate normal, the Fourier coefficients

\[\tilde f\parenth{\xi}=\int_{-\infty}^\infty dx\,\exp\parenth{-2\pi\imag\xi x} f\parenth{x}\]

are also multivariate normal. Assuming \(\mu\parenth{x}=0\) for simplicity, the mean of Fourier coefficients is zero and their expected complex-conjugate product at two different frequencies \(\xi\) and \(\xi'\) is

\[\begin{split}\E{\tilde f\parenth{\xi}\overline{\tilde f\parenth{\xi'}}}&=\int_{-\infty}^\infty dx\,dx'\,\exp\parenth{-2\pi\imag\parenth{x\xi-x'\xi'}}k\parenth{x-x'}\\&=\int_{-\infty}^\infty dx'\,\exp\parenth{-2\pi\imag x'\parenth{\xi-\xi'}}\int_{-\infty}^\infty d\Delta\,\exp\parenth{-2\pi\imag\Delta\xi}k\parenth{\Delta},\end{split}\]

where we changed variables to \(x=\Delta + x'\) and factorized the integrals in the second line. The first integral is a representation of the Dirac delta function \(\delta\parenth{\xi-\xi'}\), and the second integral is the Fourier transform of the kernel. Fourier coefficients are thus uncorrelated, and, subject to careful bookkeeping, we can evaluate the GP likelihood exactly. In other words, the Fourier transform diagonalizes stationary kernels.
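
The same argument carries over to the discrete grid with sums in place of integrals: if the kernel is treated as periodic on the grid (an assumption in this sketch) so that \(\mat{K}\) is circulant, the discrete Fourier transform diagonalizes \(\mat{K}\),

\[\E{\tilde f_m\overline{\tilde f_{m'}}}=n\,\delta_{mm'}\sum_{j=1}^n\exp\parenth{-\frac{2\pi\imag\parenth{j-1}m}{n}}k\parenth{x_j-x_1},\]

where \(\delta_{mm'}\) is the Kronecker delta and the sum is the discrete Fourier transform of the first row of \(\mat{K}\).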

Note

gp_rfft_lpdf and gp_rfft2_lpdf evaluate the likelihood of a GP on one- and two-dimensional grids, respectively. Transformations for non-centered parameterizations are implemented as gp_inv_rfft and gp_inv_rfft2.
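
As an illustrative sketch only: the exact signatures are not shown above, so the argument order, the length \(n/2+1\) of the kernel's real FFT, and the helper gp_periodic_exp_quad_cov_rfft (assumed to compute the real FFT of a periodic squared-exponential kernel) are all assumptions here.

```stan
data {
    int<lower=1> n;              // grid size
    vector[n] y;                 // observations on the regular grid
}
parameters {
    real<lower=0> sigma;         // kernel marginal scale
    real<lower=0> length_scale;  // kernel correlation length
}
model {
    // Assumed interface: the real FFT of the (periodic) kernel
    // parameterizes the diagonalized likelihood.
    vector[n %/% 2 + 1] cov_rfft
        = gp_periodic_exp_quad_cov_rfft(n, sigma, length_scale, n);
    // Assumed arguments: location vector, then the kernel's real FFT.
    y ~ gp_rfft(zeros_vector(n), cov_rfft);
}
```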

Parameterizations

Consider a simple model with a latent GP and normal observation noise with variance \(\kappa^2\)

\[\begin{split}\vec{f}&\dist\mathsf{MultivariateNormal}\parenth{0, \mat{K}}\\ \vec{y}&\dist\mathsf{Normal}\parenth{\vec{f}, \kappa^2}.\end{split}\]

The model employs the natural centered parameterization, i.e., each observation \(y_i\) is independent given the corresponding latent \(f_i\). This parameterization works well if the data are informative (small \(\kappa\)) because each observation \(y_i\) constrains the corresponding latent parameter \(f_i\). The elements of \(\vec{f}\) are thus relatively uncorrelated under the posterior, and the Hamiltonian sampler can explore the distribution efficiently.

However, if the data are weak (large \(\kappa\)), they cannot independently constrain each element of \(\vec{f}\), and the GP prior dominates the posterior. The resulting correlation among elements of \(\vec{f}\) frustrates the sampler, especially if the correlation length is large. We can overcome this challenge by employing a non-centered parameterization such that the parameters of the model are uncorrelated under the prior. Here, we reparameterize the model in terms of a white noise vector \(\vec{z}\) of the same size as \(\vec{f}\) and obtain realizations of the GP \(\vec{f}=\phi^{-1}\parenth{\vec{z}}\) using an inverse transform \(\phi^{-1}\), which must be chosen carefully to ensure \(\vec{f}\) follows the desired distribution. The reparameterized model is

\[\begin{split}\vec{z}&\dist\mathsf{Normal}\parenth{0, 1}\\ \vec{f}&=\phi^{-1}\parenth{\vec{z}, 0, \mat{K}}\\ \vec{y}&\dist\mathsf{Normal}\parenth{\vec{f}, \kappa^2}.\end{split}\]
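
For a dense covariance matrix, a standard choice of \(\phi^{-1}\) is multiplication by the Cholesky factor of \(\mat{K}\), i.e., \(\vec{f}=\mat{L}\vec{z}\) with \(\mat{K}=\mat{L}\mat{L}^\top\). The sketch below implements this generic transform with standard Stan functions; the specialized gp_inv_* transforms above play the same role for their respective structures at lower cost.

```stan
data {
    int<lower=1> n;
    array[n] real x;              // observation locations
    vector[n] y;                  // observed values
}
parameters {
    real<lower=0> sigma;          // kernel marginal scale
    real<lower=0> length_scale;   // kernel correlation length
    real<lower=0> kappa;          // observation noise scale
    vector[n] z;                  // white noise
}
transformed parameters {
    // f = phi^{-1}(z, 0, K): zero mean plus the Cholesky factor times z.
    vector[n] f = cholesky_decompose(
        add_diag(gp_exp_quad_cov(x, sigma, length_scale), 1e-9)) * z;
}
model {
    z ~ std_normal();
    y ~ normal(f, kappa);
}
```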