Logistic regression#

This notebook illustrates the use of the gptools package by fitting a univariate latent Gaussian process (GP) to synthetic binary outcomes. It comprises three steps:

  1. generating synthetic data from the model.

  2. defining the model in Stan using the Fourier methods provided by the package.

  3. fitting the model to the synthetic data and analyzing the results.

Generating synthetic data#

We consider a latent Gaussian process \(z(x)\), evaluated on a regular grid \(x\), which encodes the log odds of binary outcomes \(y\). More formally, the model is defined as

\[\begin{split}\begin{aligned} z&\sim\mathsf{MultivariateNormal}\left(0, K\right)\\ y&\sim\mathsf{Bernoulli}\left(\mathrm{expit} (z)\right), \end{aligned}\end{split}\]
where \(\mathrm{expit}(z) = 1/\left(1 + \exp(-z)\right)\) denotes the logistic sigmoid and \(K\) is the covariance matrix between elements of the Gaussian process evaluated on the grid. Specifically, we use a squared exponential kernel \(k\) such that
\[\begin{split}\begin{aligned} K_{ij}&=k\left(x_i,x_j\right) + \epsilon \delta_{ij}\\ &=\sigma^2\exp\left(-\frac{\left(x_i - x_j\right)^2}{2\ell^2}\right) + \epsilon \delta_{ij}, \end{aligned}\end{split}\]
where \(\sigma\) is the marginal scale of the Gaussian process, \(\ell\) is the correlation length of the kernel, and \(\delta_{ij}\) is the Kronecker delta. We add a so-called “nugget variance” \(\epsilon={10}^{-6}\) to the diagonal of the covariance matrix to ensure it is numerically positive definite.
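
The snippet below (a standalone illustration, not part of the model code) shows why the nugget matters: without it, the Cholesky factorization of a squared exponential covariance matrix with a large correlation length can fail due to floating-point round-off.

import numpy as np

# Squared exponential covariance on a regular grid, using the same
# hyperparameters as the synthetic data below.
x = np.arange(100)
cov = 2 ** 2 * np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 10 ** 2))
try:
    np.linalg.cholesky(cov)
except np.linalg.LinAlgError:
    print("factorization failed without the nugget")
# Adding the nugget to the diagonal restores numerical positive definiteness.
np.linalg.cholesky(cov + 1e-6 * np.eye(100))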

For the synthetic data, we consider \(n=100\) regularly spaced observations, a relatively large correlation length \(\ell=10\) to illustrate the smoothing effect of GPs, and marginal scale \(\sigma=2\).

from matplotlib import pyplot as plt
import numpy as np
from scipy import special


# Define hyperparameters.
n = 100
sigma = 2
length_scale = 10
epsilon = 1e-6
seed = 0

# Seed random number generator for reproducibility, create grid, and generate synthetic data.
np.random.seed(seed)
x = np.arange(n)
cov = sigma ** 2 * np.exp(- (x[:, None] - x[None, :]) ** 2 / (2 * length_scale ** 2)) \
    + epsilon * np.eye(n)
z = np.random.multivariate_normal(np.zeros_like(x), cov)
proba = special.expit(z)
y = np.random.binomial(1, proba)

The figure below shows the latent probability of a positive outcome as a curve and the individual observations as a scatter plot. We seek to infer the former given the latter.

def plot_synthetic_data(x, y, proba, ax=None):
    ax = ax or plt.gca()
    ax.plot(x, proba, label="latent probability\n$p=\\mathrm{expit}\\left(z\\right)$")
    ax.scatter(x, y, label="binary outcome $y$", marker=".")
    ax.set_xlabel("covariate $x$")

fig, ax = plt.subplots()
plot_synthetic_data(x, y, proba)
ax.legend()
fig.tight_layout()
[Figure: the latent probability curve \(p = \mathrm{expit}(z)\) and the binary outcomes \(y\) plotted against the covariate \(x\).]

Defining the model in Stan#

We employ gptools to approximate the GP using Fourier methods (see Fourier Methods for theoretical background). Fourier methods induce periodic boundary conditions, and we need to introduce padding \(p\) to attenuate their effect (see Effect and importance of padding for details). Here, we pad the domain by the number of observations, i.e., the padded domain has \(m=2n\) grid points, which is sufficient to overcome boundary effects, as the sketch below illustrates.
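
The sketch below uses a simplified periodic kernel (a wrapped-distance stand-in for the kernel gptools actually evaluates) to show the effect: without padding, the first and last observed grid points are strongly correlated; with padding \(p = n\), their correlation is negligible.

# On a periodic domain of size m, the effective distance between grid
# points 0 and j is min(j, m - j), so the ends of the observed domain
# become artificially correlated unless we pad.
for p in [0, n]:
    m = n + p
    delta = np.arange(m)
    wrapped = np.minimum(delta, m - delta)
    kernel_row = sigma ** 2 * np.exp(-wrapped ** 2 / (2 * length_scale ** 2))
    print(f"padding {p}: cov(first, last observed point) = {kernel_row[n - 1]:.3g}")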

Each binary outcome contains only a limited amount of information (think of trying to estimate whether a coin is fair by flipping it only once). The latent GP \(z\) is thus only weakly constrained by the data, and adjacent points are highly correlated under the posterior distribution. We use a non-centered parameterization of the GP, which is more efficient for weakly constrained GPs (see Parameterizations for details and the “Reparameterization” section of the Stan user guide for a general discussion).
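
In numpy terms, the non-centered parameterization samples white noise and transforms it into a GP draw. Here is a minimal sketch using the covariance matrix from the first cell; the Stan model performs the analogous transformation in the Fourier domain.

# White noise has simple geometry that is easy for the sampler to explore.
raw = np.random.normal(0, 1, size=n)
# Coloring the noise with a square root of the covariance (here a Cholesky
# factor) yields a draw distributed as MultivariateNormal(0, K).
z_noncentered = np.linalg.cholesky(cov) @ raw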

To complete the model, we need to specify priors for the marginal scale \(\sigma\) and correlation length \(\ell\). We use a half-normal distribution for the former. Specifying a prior for the latter can be challenging: the correlation length is not identifiable when it is smaller than the separation between adjacent grid points or larger than the domain (see the case study “Robust Gaussian Process Modeling” for details). Here, we use a log-uniform prior on the interval \((2, n / 2)\) to restrict the correlation length.
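
The log-uniform prior is implemented by declaring the log length scale uniform on the corresponding interval and exponentiating, as the short sketch below illustrates.

# Uniform samples of the log length scale induce a log-uniform prior on
# the length scale itself, confined to (2, n / 2).
log_ell = np.random.uniform(np.log(2), np.log(n / 2), size=1000)
ell = np.exp(log_ell)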

The complete model is shown below.

functions {
    #include gptools/util.stan
    #include gptools/fft1.stan
}

data {
    // Number of observed grid points and number of padding grid points.
    int n, p;
    // Binary outcomes at the observed grid points.
    array [n] int y;
    // Nugget variance for numerical stability.
    real<lower=0> epsilon;
}

transformed data {
    int m = n + p;
}

parameters {
    real<lower=0> sigma;
    real<lower=log(2), upper=log(n / 2.0)> log_length_scale;
    vector[m] raw;
}

transformed parameters {
    real length_scale = exp(log_length_scale);
    // Evaluate the covariance kernel in the Fourier domain. We add the nugget
    // variance because the Fourier transform of a delta function is a constant.
    vector[m %/% 2 + 1] cov_rfft = gp_periodic_exp_quad_cov_rfft(
        m, sigma, length_scale, m) + epsilon;
    // Transform the "raw" parameters to the latent log odds ratio.
    vector[m] z = gp_inv_rfft(raw, zeros_vector(m), cov_rfft);
}

model {
    // White noise implies that z ~ GaussianProcess(...).
    raw ~ normal(0, 1);
    y ~ bernoulli_logit(z[:n]);
    sigma ~ normal(0, 3);
    // Implicit log uniform prior on the length scale.
}

generated quantities {
    vector[n] proba = inv_logit(z[:n]);
}

Fitting the model and analyzing results#

We fit the model in two steps. First, we compile the model using the compile_model() wrapper for cmdstanpy. The function ensures include paths are configured such that gptools can be included using Stan’s #include syntax. If you intend to use a different interface, the library path needs to be added to the include_paths of that interface (see the cmdstanpy documentation for details); the library path can be obtained by running python -m gptools.stan from the command line. Second, we call the sample method of the compiled model to obtain posterior samples.
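
For illustration, here is a hypothetical sketch of compiling the model with plain cmdstanpy; the include path below is a placeholder for the path printed by python -m gptools.stan.

import cmdstanpy

model = cmdstanpy.CmdStanModel(
    stan_file="logistic_regression.stan",
    stanc_options={"include-paths": ["/path/printed/by/gptools"]},
)

In this notebook, we use compile_model, which configures the include paths for us.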

from gptools.stan import compile_model


model = compile_model(stan_file="logistic_regression.stan")
data = {
    "n": n,
    "p": n,
    "y": y,
    "epsilon": epsilon,
}
fit = model.sample(data, seed=seed, show_progress=False)
print(fit.diagnose())
Processing csv files: /tmp/tmpgsuh14ej/logistic_regressionxum9xe0b/logistic_regression-20250317164732_1.csv, /tmp/tmpgsuh14ej/logistic_regressionxum9xe0b/logistic_regression-20250317164732_2.csv, /tmp/tmpgsuh14ej/logistic_regressionxum9xe0b/logistic_regression-20250317164732_3.csv, /tmp/tmpgsuh14ej/logistic_regressionxum9xe0b/logistic_regression-20250317164732_4.csv

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.

Finally, we can visualize the inferred probability curve (shown in orange in the left panel with a band representing the posterior interquartile range) and compare it with the synthetic GP we generated (in blue).

fig, (ax1, ax2) = plt.subplots(1, 2, width_ratios=[3, 2])

plot_synthetic_data(x, y, proba, ax1)
lower, center, upper = np.quantile(fit.proba, [0.25, 0.5, 0.75], axis=0)
line, = ax1.plot(x, center, label="inferred latent\nprobability")
ax1.fill_between(x, lower, upper, color=line.get_color(), alpha=0.2)
ax1.legend(fontsize="small", loc=(0.05, .7))

ax2.scatter(fit.length_scale, fit.sigma, marker=".", alpha=0.2)
ax2.scatter(length_scale, sigma, marker="X", color="k", edgecolor="w")
ax2.set_xlabel(r"length scale $\ell$")
ax2.set_ylabel(r"marginal scale $\sigma$")

fig.tight_layout()
[Figure: left, synthetic data with the inferred latent probability and posterior interquartile band; right, posterior samples of the length scale \(\ell\) and marginal scale \(\sigma\) with the true values marked.]

The panel on the right shows posterior samples of the correlation length \(\ell\) and marginal scale \(\sigma\) of the kernel. The parameter values used to generate the data are shown as a black cross and are consistent with posterior samples. However, the kernel parameters remain weakly identified because the binary observations are not very informative.
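
To quantify this, we can inspect posterior quartiles of the kernel parameters directly; a small sketch using cmdstanpy’s stan_variable accessor is shown below.

for name in ["sigma", "length_scale"]:
    samples = fit.stan_variable(name)
    print(name, np.quantile(samples, [0.25, 0.5, 0.75]))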