Just like BUGS and JAGS, Stan is a piece of software that allows you to estimate Bayesian models. Stan uses an MCMC sampler known as Hamiltonian Monte Carlo to sample from the posterior distribution. Stan is under active development and has a number of advantages over BUGS and JAGS in certain situations.

Installing Stan

Instructions for installing Stan and the RStan frontend can be found at: https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started. This webpage contains information about installing Stan on both Windows and Mac/Unix systems. Stan requires a C++ toolchain, so be sure to carefully read the instructions on the page to ensure that you have everything installed correctly.

Once you’ve installed Stan according to the directions on the Stan site, the final step is to install the RStan package.

install.packages('rstan')

Stan Models

The Stan language is very similar to BUGS/JAGS, but there are a few key differences. A Stan model consists of several blocks, each of which performs a specific function. Every model needs data, parameters, and model blocks, and a model can also contain transformed parameters and generated quantities blocks. Like in BUGS/JAGS, you can either store your model as a character object in R or as a separate text file with the .stan extension. Although you can use a text file with any extension, using the .stan extension enables syntax highlighting in RStudio. Try opening the included file Stan Linear Regression.stan to see for yourself!

The Stan language has a few quirks that you need to follow. Each line needs to end with a semicolon, and the model file needs to end with a newline. Blocks are set off by their title, and the contents of blocks are enclosed in curly braces. Stan supports C style comments, with multiline comments opening with /* and ending with */, and single line comments beginning with //. The order of blocks is not flexible, and you need to include them in your model file in the order below.
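For instance, a minimal (hypothetical) intercept-only model stored as a character object in R might look like this; the names toy_model and mu are purely illustrative. Note the semicolons, the two comment styles, and the order of the blocks:

# a toy, hypothetical Stan program stored as an R character object
toy_model <- "
data {
  int<lower=0> N;        // number of observations
  vector[N] y;           // response variable
}
parameters {
  real mu;               /* mean of y */
  real<lower=0> sigma;   // standard deviation
}
model {
  y ~ normal(mu, sigma); // likelihood; implicit priors on mu and sigma
}
"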

The Data Block

The data block contains anything that is a fixed quantity. This includes your explanatory and response variables, as well as any indexing or dimensionality variables your model will use. Unlike BUGS and JAGS, Stan requires you to explicitly declare any object that you will be referencing in your model. This means that we have to assign specific data types to our objects. The two basic data types in Stan are int and real, which correspond to integer and numeric in R. While R doesn’t care if you have a dichotomous response variable stored in a numeric object, Stan does, and trying to estimate a logit on a binary dependent variable stored as a real object will cause an error.

data {
  
  /* dimensionality of data and model */
  int<lower=0> N; // number of observations
  int<lower=0> K; // number of predictors (p + 1)
  
  /* response and explanatory variables */
  vector[N] y; // response variable
  matrix[N,K] X; // design matrix
  
}

This model will carry out a linear regression of \(K\) independent variables on the dependent variable \(y\). The independent variables are stored in the \(N \times K\) matrix X, and the dependent variable is the vector y of length \(N\). You could also store your variables as individual objects like in JAGS.
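For example, a data block that stores the variables individually (with hypothetical names x1 and x2) might look something like this:

data {
  
  int<lower=0> N;  // number of observations
  vector[N] y;     // response variable
  vector[N] x1;    // first explanatory variable
  vector[N] x2;    // second explanatory variable
  
}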

The Parameters Block

The parameters block contains anything that is a random variable you wish to estimate. Due to the way that Stan’s sampler works, it cannot estimate discrete parameters. If you have a model with discrete parameters, you have to marginalize them out before writing the model in Stan. Practically, what this means is that your parameters will always be reals and never ints.

parameters {
  
  /* parameters to be estimated by model */
  vector[K] beta; // regression coefficients
  real<lower=0> sigma; // standard deviation
  
}

This model will estimate \(K\) \(\beta\) coefficients and the error term \(\sigma\). If you choose to store the variables individually in the data block, you will need to define individual \(\beta\)s in the parameters block.
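If you went that route, a matching parameters block (again with hypothetical names) might be:

parameters {
  
  real alpha;           // intercept
  real beta1;           // coefficient on x1
  real beta2;           // coefficient on x2
  real<lower=0> sigma;  // standard deviation
  
}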

The Model Block

The model block is very similar to a BUGS or JAGS model. Key differences are that you don’t have to put explicit priors on all parameters, and that normal distributions are parameterized by standard deviation (like rnorm() in R!) instead of precision. If you don’t specify a prior on a bounded parameter, then Stan uses an implicit uniform prior. If you don’t specify a prior on an unbounded parameter, Stan uses an improper prior. Notice that \(\mathbf{X}\beta\) in the normal() sampling statement is a matrix multiplication operation because \(\mathbf{X}\) is a matrix and \(\beta\) is a vector. This means that you need to have the terms in the correct order to satisfy conformability.

The other major change from BUGS/JAGS is that Stan is vectorized! This means we don’t have to use loops to carry out operations on vectors of data. This speeds up the estimation of our models and simplifies our code.

model {
  
  /* priors and data */
  beta ~ normal(0, 5); // prior on coefficients
  sigma ~ rayleigh(1.5); // prior on error standard deviation
  y ~ normal(X * beta, sigma); // likelihood
  
}

This model has a \(\mathcal{N}(0, 5)\) prior on the coefficients and a \(\text{Rayleigh}(1.5)\) prior on the error. Since Stan doesn’t use Gibbs sampling, and hence doesn’t gain increased efficiency from the use of conjugate priors, we’re free to use whatever distributions we want for our priors, which allows us to pick priors whose form better reflects our substantive knowledge. We could easily expand this model by replacing the fixed parameters for the priors on \(\beta\) and \(\sigma\) with hyperpriors to be estimated from the data.
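For comparison, a non-vectorized version of the model block would loop over the observations; it is equivalent to the vectorized statement above, just slower:

model {
  
  /* priors */
  beta ~ normal(0, 5);
  sigma ~ rayleigh(1.5);
  
  /* looped likelihood -- equivalent to y ~ normal(X * beta, sigma) */
  for (n in 1:N)
    y[n] ~ normal(X[n] * beta, sigma);
  
}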

Other Blocks

We can also include a transformed parameters block between the parameters and model blocks, and a generated quantities block after the model block. Transformed parameters are often used to speed up sampling, or to fix the values of specific elements of a parameter in latent variable models as an identification restriction. Generated quantities can include values such as \(\hat{y}\)s and residuals, or samples from the posterior predictive distribution. Both of these blocks are executed every iteration, allowing us to monitor objects in them and obtain a full posterior distribution for quantities of interest. In this case, our model file has a generated quantities block that calculates \(\hat{y}\) and the residual. Don’t forget the newline at the end of the file!

generated quantities {
  
  /* define predictions and residuals */
  vector[N] y_hat;
  vector[N] resid;
  
  /* calculate predictions and residuals */
  y_hat = X * beta;
  resid = y - y_hat;
  
}
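As a purely illustrative aside (not part of the included model file), the linear predictor could instead be computed once per iteration in a transformed parameters block and reused by both the model and generated quantities blocks:

transformed parameters {
  
  /* linear predictor, recomputed every iteration */
  vector[N] mu;
  mu = X * beta;
  
}

The likelihood line in the model block would then read y ~ normal(mu, sigma);, and the generated quantities block could set y_hat = mu;.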

Fitting Bayesian Models using Stan

Once we have our model written out, we use R to prepare our data, decide which parameters we’re interested in monitoring, and set up our sampler. This is done using the rstan package. One advantage of Stan over BUGS/JAGS is that Stan natively supports parallel computing and makes it exceptionally easy to run multiple chains in parallel. The following code loads the rstan package and sets it to use all of your machine’s cores when running parallel chains. When you load rstan, R will give you a message about the version you’re using, as well as suggestions to speed up estimation. Notice that the next two lines of code are exactly those suggestions.

library(car) # provides the Angell data
library(rstan)
## Loading required package: ggplot2
## Loading required package: StanHeaders
## rstan (Version 2.15.1, packaged: 2017-04-19 05:03:57 UTC, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## rstan_options(auto_write = TRUE)
## options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

Let’s use the Angell data to run the same linear regression you did for homework 2.

data(Angell)
moral <- Angell$moral
predictors <- model.matrix(~ scale(hetero) + scale(mobility) + as.factor(region), data = Angell)

Like JAGS, Stan requires variables in the form of a named list, where the names match the variable names declared in the data block. Since X is a matrix, we can use the nrow() and ncol() commands to get its dimensionality dynamically, ensuring that our code still works even if we decide to add another explanatory variable.

stan_data_list <- list(X = predictors, y = moral, N = nrow(predictors), K = ncol(predictors))

The syntax for the stan() function is closer to the one used by runjags than R2jags. The command automatically compiles the model and runs the sampler for a specified number of iterations; we don’t have to compile the model into an object and then use a separate function to obtain samples from it.

The stan() function has many arguments, but there are a few key ones. We need to point the function to our model, using either the file argument if we saved our model in a separate text file with the .stan extension, or model_code if our model is a character object in the R environment. We also need to tell the function where our data are, which we do by giving the data argument our named list of variables. If we have a complicated model with lots of parameters that aren’t substantively interesting, we can specify specific ones to monitor through the pars argument (leaving it blank will result in samples for all parameters). The chains argument tells the function how many different chains to run, and the iter argument tells the function how many iterations to run the sampler for.

Stan refers to the initial discarded draws at the beginning of a sampling run as the warmup. Although JAGS can adapt the properties of the sampler, it often does not need to, and the burnin draws are simply thrown away. Hamiltonian Monte Carlo only has advantages over Gibbs sampling when the auxiliary parameters are properly tuned, so the warmup phase in Stan always involves adjustments to the sampler. Unlike in JAGS, the warmup samples are saved, so we can diagnose exploration of the parameter space and check whether chains from different starting values converged to the same posterior mean. The default number of warmup iterations is half of iter.

m1 <- stan(file = 'Linear Regression.stan', data = stan_data_list,
           pars = c('beta', 'sigma', 'resid', 'y_hat'), chains = 4, iter = 1e4,
           seed = 981280)

You might get a lot of messages about C++ code while the model compiles, but you can ignore them. You might also get a few warning messages about rejected proposals, but unless there are a large number of these errors, you can ignore them as rejected proposals are a normal part of Metropolis and Hamiltonian Monte Carlo sampling.

That was fast! Each chain ran 5000 warmup iterations in an average of 0.47 seconds and 5000 sampling iterations in an average of 0.52 seconds. And remember that we used a non-conjugate prior on \(\sigma\). The speed of Stan’s sampler isn’t dependent on conjugacy like a Gibbs sampler.

Once our sampler has finished, the samples are stored in a stanfit object. Unlike with JAGS, we don’t have to convert our m1 object into another list object; stanfit objects can contain one or more chains by default.

str(m1)

Assessing Convergence

The RStan package also includes a number of useful functions for diagnosing convergence and ensuring that we’re sampling from the posterior distribution. We can use traceplot() to evaluate the mixing of the chains. If we set inc_warmup = T, then we can view all samples for each parameter, giving us increased confidence that we have fully explored the parameter space.

traceplot(m1, pars = paste('beta[', 2:4, ']', sep = ''), inc_warmup = T)

We can also evaluate the autocorrelation between samples for each parameter at different lags. Given the nature of Markov chains, there has to be some autocorrelation at the first lag, but we want to see a sharp drop off as the lags increase. Since we are performing this operation on multiple chains, these results are the average autocorrelation values from all four chains.

stan_ac(m1, pars = 'sigma', lags = 15)

What if we want a plot of the potential scale reduction factor \(\hat{R}\) at each iteration of the sampler? Unfortunately, rstan doesn’t include a built-in function to produce this plot. Fortunately, it does provide an incredibly easy way to convert a stanfit object to a coda object using the As.mcmc.list() function.

coda::gelman.plot(As.mcmc.list(m1, pars = 'beta'), autoburnin = F)

Presenting Results

Now that we’re confident that our samples are actually drawn from the posterior distribution, we need to communicate the information contained in them to our readers. Typing m1 will print the mean, Monte Carlo standard errors, standard deviation, various percentiles of the samples, the effective sample size, and the \(\hat{R}\) for each parameter, as well as some of the sampler parameters.

To save the parameter estimate summaries to an object, we use the summary() function. If we assign an object using just summary(m1), it will create a list, with the first entry summary containing the matrix of parameter estimates we just saw, and the second entry c_summary containing the same summaries for each individual chain. Since we’re confident that our chains mixed well, we want to do inference on all of them merged together, so our call to summary() should be followed by the $summary index to access the summary computed over all chains combined.

# rename parameters in stan model with predictor names
names(m1)[1:6] <- colnames(predictors)

# extract regression parameters
m1_sum <- summary(m1, pars = c('beta', 'sigma'))$summary
knitr::kable(m1_sum)
                      mean  se_mean    sd   2.5%    25%    50%    75%  97.5%  n_eff  Rhat
(Intercept)           13.0     0.01  1.01   11.0   12.3   13.0   13.7  14.96   5265     1
scale(hetero)         -1.6     0.01  0.56   -2.7   -2.0   -1.6   -1.3  -0.55   9057     1
scale(mobility)       -1.1     0.01  0.55   -2.2   -1.4   -1.1   -0.7   0.00   6408     1
as.factor(region)MW   -1.7     0.01  1.03   -3.7   -2.4   -1.7   -1.0   0.39   6823     1
as.factor(region)S    -3.0     0.02  1.69   -6.3   -4.2   -3.0   -1.9   0.34   5452     1
as.factor(region)W    -2.6     0.02  1.53   -5.5   -3.6   -2.6   -1.5   0.44   6465     1
sigma                  2.2     0.00  0.26    1.7    2.0    2.2    2.3   2.76  11482     1

Because we’re not interested in stars, this table isn’t super informative for the reader. As Bayesians we want to know where the majority of a parameter’s posterior distribution falls relative to zero, which is much more easily assessed graphically. We can plot the results from our model using the plot() command on a stanfit, which object will produce a dotplot of point estimates for parameters with credible interval bars.

plot(m1, pars = c('beta', 'sigma')) + geom_vline(xintercept = 0, linetype = 2)
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

The plot() command is actually a wrapper to the stan_plot() command, which has arguments unique to stanfit objects. To see more information about the various Stan plots available, type ?`plot,stanfit-method`.

If we want to view the actual distribution of posterior samples for a given parameter, rather than just credible intervals, we can use the stan_dens() command to generate density plots. We can use the pars argument to limit the results to specific parameters. This is especially useful if we have lots of hierarchical parameters, or if we are estimating many latent variables and a plot of all of them would be unreadable.

stan_dens(m1, pars = 'beta') + geom_vline(xintercept = 0, linetype = 2)

Notice that we can still use beta as an argument to pars, even though we renamed our parameters above. This is because Stan stores the names of each parameter object separately from the names accessed by names(). You can see this for yourself by typing m1@model_pars. If you rename parameters in a stanfit object, you can refer to entire vectors of parameters by their original names, but not individually. What this means is you can use pars = 'beta' to access all of the coefficient estimates, but you can’t use pars = 'beta[3]' to access the coefficient estimate for mobility.

For many Stan plots, we can use the separate_chains = T argument to view the resulting figure for all individual chains, rather than the averaged output.

stan_dens(m1, pars = 'sigma', separate_chains = T)

We can also generate scatterplots of the correlation between parameters using the stan_scat() command. If we have a complex model that generates a lot of divergent transitions during sampling (this is common in hierarchical models), we might use the pairs() function to identify the responsible parameters which could be candidates for reparameterization.

stan_scat(m1, pars = c('scale(mobility)', 'sigma'))
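A pairs() call on a stanfit object accepts a similar pars argument; a sketch might look like:

pairs(m1, pars = c('beta', 'sigma'))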

Accessing Samples

If we want to directly work with the samples from our model like with JAGS, we can do that as well. To extract all of the samples for a given parameter, we use the extract() function, which returns a list of the samples for each parameter.

samps <- extract(m1, pars = c('beta', 'sigma'))
str(samps)
## List of 2
##  $ beta : num [1:20000, 1:6] 13.5 12.2 13.1 13.6 12.4 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ iterations: NULL
##   .. ..$           : NULL
##  $ sigma: num [1:20000(1d)] 2.11 2.15 1.96 2.13 2.12 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ iterations: NULL
beta_samps <- samps$beta

Once we’ve extracted our samples from the posterior distribution, we can plot them just like with samples from JAGS. This approach is also useful if you want more control over the graphical parameters of your figures than the builtin Stan plotting functions provide.

# get samples for region dummies
beta_samps <- as.data.frame(beta_samps[, 4:6])

# drop the 'as.factor(region)' from names for cleaner legend
names(beta_samps) <- gsub('as\\.factor\\(region\\)', '', colnames(predictors[, 4:6]))

# prepare data for ggplot
beta_samps <- tidyr::gather(beta_samps)

# plot, grouping by region dummy, add line at 0, and drop x axis label
ggplot(data = beta_samps, aes(x = value, fill = key)) +
  geom_density(alpha = .75, linetype = 0) +
  geom_vline(xintercept = 0, linetype = 2) +
  guides(fill = guide_legend(title = 'region')) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x  = element_blank(),
        legend.position = 'right',
        plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.border = element_blank())

The ggridges package provides a way to create ridge plots which put multiple density plots on the same graph with a shared axis, allowing the viewer to easily compare entire posterior distributions, instead of just point estimates and summaries of uncertainty. This approach also lets the viewer visually assess convergence between chains.

# ridge plots
library(ggridges)

# create ggs object for ridge plot
ridge_ggs <- ggmcmc::ggs(m1, family = 'beta')

# keep only last 3 beta parameters, which are our region dummies
ridge_ggs <- ridge_ggs[grepl('[4-6]', ridge_ggs$Parameter), ]

# ridge plot; east as reference category
ggplot(ridge_ggs, aes(x = value, y = Parameter, height = ..density.., color = as.factor(Chain))) +
  geom_density_ridges(fill = NA, rel_min_height = .01, scale = 1.25, show.legend = F) +
  geom_vline(xintercept = 0, linetype = 5, col = 'lavenderblush4') +
  labs(x = '', y = '') +
  theme_bw() +
  scale_y_discrete(labels = c('MW', 'S', 'W')) + 
  coord_cartesian(xlim = c(-8, 5)) +
  theme(plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.border = element_blank(),
        axis.ticks.y = element_blank(),
        axis.ticks.x = element_line(color = 'lavenderblush4'))

We might also want to extract samples if we are estimating a GLM and want to generate predicted probabilities. One of the advantages of Bayes is that instead of assuming our coefficient estimates are drawn from a multivariate normal distribution and then simulating coefficients from that distribution, we can simply take the last thousand draws for each parameter and use those to calculate predicted probabilities.
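A minimal sketch of this approach for a hypothetical logit model (the objects logit_fit and x_new are assumptions for illustration, not part of the model above) might look like:

# hypothetical sketch: predicted probabilities from a logit model's posterior draws
logit_draws <- extract(logit_fit, pars = 'beta')$beta  # draws x K matrix of coefficients
logit_draws <- tail(logit_draws, 1000)                 # keep the last thousand draws
x_new <- c(1, 0, 1, 0.5)                               # hypothetical covariate profile (length K)
p_hat <- plogis(logit_draws %*% x_new)                 # one predicted probability per draw
quantile(p_hat, c(0.025, 0.5, 0.975))                  # posterior median and 95% credible interval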

Extracting samples can also let us conduct model fit checks such as residual plots.

# create dataframe for ggplot
df_resid <- data.frame(y = moral,
                       y_hat = summary(m1, pars = 'y_hat')$summary[, 'mean'],
                       residual = summary(m1, pars = 'resid')$summary[, 'mean'])

# plot predicted values against observed values
pred_plot <- ggplot(df_resid, aes(x = y_hat, y = y, color = residual)) +
  geom_point(alpha = .75) +
  scale_color_continuous(low = 'darkred', high = 'lightsalmon') +
  geom_abline(slope = 1, intercept = 0) +
  theme_bw() +
  theme(legend.position = 'right',
        plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.border = element_blank())

# plot residuals against predicted values
resid_plot <- ggplot(df_resid , aes(x = y_hat, y = residual, color = residual)) +
  geom_point(alpha = .75) +
  scale_color_continuous(low = 'darkred', high = 'lightsalmon') +
  geom_hline(yintercept = 0) +
  theme_bw() +
  theme(legend.position = 'right',
        plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.border = element_blank())

# arrange side by side as interactive plotly plots
plotly::subplot(pred_plot, resid_plot, nrows = 1)