Julian Faraway 2024-08-15
- Data
- Questions
- Linear model with fixed effects
- Likelihood inference
- INLA
- STAN
- BRMS
- MGCV
- Discussion
- Package version info
See the introduction for an overview. This example is discussed in more detail in my book, Extending the Linear Model with R.
Libraries used:
library(faraway)
library(ggplot2)
library(lme4)
library(INLA)
library(knitr)
library(cmdstanr)
library(brms)
library(mgcv)
register_knitr_engine(override = FALSE)
Load up and look at the data, which concerns the brightness of paper which may vary between operators of the production machinery.
data(pulp, package="faraway")
summary(pulp)
bright operator
Min. :59.8 a:5
1st Qu.:60.0 b:5
Median :60.5 c:5
Mean :60.4 d:5
3rd Qu.:60.7
Max. :61.0
ggplot(pulp, aes(x=operator, y=bright))+geom_point(position = position_jitter(width=0.1, height=0.0))
You can read more about the data by typing help(pulp) at the R prompt.
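A quick numerical complement to the plot is the mean and SD of brightness for each operator; a minimal base-R sketch:

# group means and SDs by operator
aggregate(bright ~ operator, pulp, function(x) c(mean = mean(x), sd = sd(x)))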
In this example, there are only five replicates per level. There is no strong reason to reject the normality assumption. We don’t care about the specific operators, who are named a, b, c and d, but we do want to know how they vary.
- Is there a difference between operators in general?
- How much variation is there between operators in general?
- How does the variation between operators compare to the variation within operators?
- What is the difference between these four operators?
We are mostly interested in the first three questions.
We start with the simplest analysis although it is not correct. It will be useful for comparisons. We treat the operator as a fixed effect meaning that the analysis refers to these four operators and not to other possible operators. Since we probably don’t care about these particular four operators, this would not be the best choice.
You can use the lm() or aov() functions:
amod = aov(bright ~ operator, pulp)
Now test for a difference between operators:
anova(amod)
Analysis of Variance Table
Response: bright
Df Sum Sq Mean Sq F value Pr(>F)
operator 3 1.34 0.447 4.2 0.023
Residuals 16 1.70 0.106
We find a statistically significant difference. We can estimate the coefficients:
coef(amod)
(Intercept) operatorb operatorc operatord
60.24 -0.18 0.38 0.44
The treatment coding sets operator a as the reference level. The intercept is the mean for operator a and the other estimates are differences in the mean from operator a. We can also test for a difference between pairs of operators:
TukeyHSD(amod)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = bright ~ operator, data = pulp)
$operator
diff lwr upr p adj
b-a -0.18 -0.769814 0.40981 0.81854
c-a 0.38 -0.209814 0.96981 0.29030
d-a 0.44 -0.149814 1.02981 0.18448
c-b 0.56 -0.029814 1.14981 0.06579
d-b 0.62 0.030186 1.20981 0.03767
d-c 0.06 -0.529814 0.64981 0.99108
Only the d to b difference is found significant.
We have answered the fourth question stated above. We could make some speculations on the first three questions (what can be said about operators in general) but our analysis was not designed to do this.
The aov() function has been available in R, and in S before that, i.e. for at least 30 years. I do not believe it has changed in a long time. It can handle some simple models but it has very limited functionality.
We use a model of the form: $$
y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \qquad i=1,\dots ,a,
\qquad j=1,\dots ,n_i,
$$ where the $\alpha_i$ and $\epsilon_{ij}$ are independent with $\alpha_i \sim N(0,\sigma^2_\alpha)$ and $\epsilon_{ij} \sim N(0,\sigma^2_\epsilon)$.
The default fit uses the REML estimation method:
mmod <- lmer(bright ~ 1+(1|operator), pulp)
faraway::sumary(mmod)
Fixed Effects:
coef.est coef.se
60.40 0.15
Random Effects:
Groups Name Std.Dev.
operator (Intercept) 0.26
Residual 0.33
---
number of obs: 20, groups: operator, 4
AIC = 24.6, DIC = 14.4
deviance = 16.5
We see slightly less variation between operators (SD 0.26) than within operators (SD 0.33).
We can also use the ML method:
smod <- lmer(bright ~ 1+(1|operator), pulp, REML = FALSE)
faraway::sumary(smod)
Fixed Effects:
coef.est coef.se
60.40 0.13
Random Effects:
Groups Name Std.Dev.
operator (Intercept) 0.21
Residual 0.33
---
number of obs: 20, groups: operator, 4
AIC = 22.5, DIC = 16.5
deviance = 16.5
The REML method is preferred for estimation but we must use the ML method if we wish to make hypothesis tests comparing models.
If we want to test for variation between operators, we fit a null model containing no operator, compute the likelihood ratio statistic and corresponding p-value:
nullmod <- lm(bright ~ 1, pulp)
lrtstat <- as.numeric(2*(logLik(smod)-logLik(nullmod)))
pvalue <- pchisq(lrtstat,1,lower=FALSE)
data.frame(lrtstat, pvalue)
lrtstat pvalue
1 2.5684 0.10902
Superficially, the p-value greater than 0.05 suggests no strong evidence against the hypothesis that there is no variation among the operators. But there is good reason to doubt the accuracy of the standard chi-squared approximation to the null distribution when testing a parameter on the boundary of its space (as we do here, at zero). A parametric bootstrap can be used, where we generate samples from the null model and compute the test statistic repeatedly:
lrstat <- numeric(1000)
set.seed(123)
for(i in 1:1000){
y <- unlist(simulate(nullmod))  # generate a response from the null model
bnull <- lm(y ~ 1)              # fit the null model
balt <- lmer(y ~ 1 + (1|operator), pulp, REML=FALSE)  # fit the alternative
lrstat[i] <- as.numeric(2*(logLik(balt)-logLik(bnull)))  # LR test statistic
}
Check the proportion of simulated test statistics that are close to zero:
mean(lrstat < 0.00001)
[1] 0.703
Clearly, the test statistic does not have a chi-squared distribution under the null. We can compute the proportion that exceed the observed test statistic of 2.5684:
mean(lrstat > 2.5684)
[1] 0.019
This is a more reliable p-value for our hypothesis test, which suggests there is good reason to reject the null hypothesis of no variation between operators.
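Since this p-value is itself estimated from only 1000 bootstrap samples, it carries some Monte Carlo error. A quick sketch of the usual binomial standard error for the estimate:

# Monte Carlo standard error of the bootstrap p-value
phat <- mean(lrstat > 2.5684)
sqrt(phat * (1 - phat) / length(lrstat))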
More sophisticated methods of inference are discussed in Extending the Linear Model with R.
We can use the bootstrap again to compute confidence intervals for the parameters of interest:
confint(mmod, method="boot")
2.5 % 97.5 %
.sig01 0.00000 0.51451
.sigma 0.21084 0.43020
(Intercept) 60.11213 60.69244
We see that the lower end of the confidence interval for the operator SD extends to zero.
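The bootstrap intervals are themselves subject to simulation variability. If this matters, one option (a sketch; nsim defaults to 500 in lme4's confint method) is to request more bootstrap samples at the cost of a longer run:

# more bootstrap samples give more stable interval endpoints (slower)
confint(mmod, method = "boot", nsim = 2000)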
Even though we are most interested in the variation between operators, we can still estimate their individual effects:
ranef(mmod)$operator
(Intercept)
a -0.12194
b -0.25912
c 0.16767
d 0.21340
Approximate 95% confidence intervals can be displayed with:
dd = as.data.frame(ranef(mmod))
ggplot(dd, aes(y=grp,x=condval)) +
geom_point() +
geom_errorbarh(aes(xmin=condval -2*condsd,
xmax=condval +2*condsd), height=0)
Integrated nested Laplace approximation is a method of Bayesian computation which uses approximation rather than simulation. More can be found on this topic in Bayesian Regression Modeling with INLA and the chapter on GLMMs.
At the time of writing, INLA has recently undergone some changes that have some impact on the fitting of mixed effect models. Previously, the model fitting proceeded in two stages - the first step computed the posteriors for most of the parameters while a second step made a refinement to compute posteriors for the hyperparameters (the variances of the random terms in this case). This was termed classic mode. In the new method, the idea is to compute all the posteriors in one stage. This was called experimental mode and is now called compact mode. This compact mode is now the default.
You can see an older analysis of the same data using the previous methodology. We use the most recent computational methodology and also opt for a shorter output summary:
inla.setOption(inla.mode="compact")
inla.setOption("short.summary",TRUE)
Run the INLA model with default priors:
imod <- inla(bright ~ f(operator, model="iid"),
family="gaussian",
data=pulp)
The summary of the posterior distribution for the fixed effects (which is only the intercept in this example):
imod$summary.fixed |> kable()
|             | mean | sd      | 0.025quant | 0.5quant | 0.975quant | mode | kld |
|-------------|------|---------|------------|----------|------------|------|-----|
| (Intercept) | 60.4 | 0.08809 | 60.226     | 60.4     | 60.574     | 60.4 | 0   |
The posterior mean is the same as the (RE)ML estimate. Here is the posterior distribution of the hyperparameters (the precisions of the error and operator terms):
imod$summary.hyperpar |> kable()
|                                         | mean       | sd         | 0.025quant | 0.5quant   | 0.975quant | mode      |
|-----------------------------------------|------------|------------|------------|------------|------------|-----------|
| Precision for the Gaussian observations | 6.8903     | 2.1299     | 3.4852     | 6.6368     | 11.791     | 6.1554    |
| Precision for operator                  | 17504.1469 | 17051.8380 | 1101.7099  | 12305.8966 | 62746.492  | 2981.3404 |
The precision for the operator term is unreasonably high. This implies a strong belief that there is no variation between the operators, which we would find hard to believe. This is due to the default diffuse gamma prior on the precisions, which puts almost all the weight on the error variation and not nearly enough on the operator variation. We need to change the prior.
We try a half-normal prior with low precision instead. A precision of 0.01 corresponds to an SD of 10. (It is possible to vary the mean but we have set this to zero to achieve a half-normal distribution.) This is substantially larger than the SD of the response so the information supplied is very weak.
tnprior <- list(prec = list(prior="logtnormal", param = c(0,0.01)))
imod <- inla(bright ~ f(operator, model="iid", hyper = tnprior),
family="gaussian",
data=pulp)
summary(imod)
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 60.4 0.306 59.774 60.4 61.026 60.4 0.059
Model hyperparameters:
mean sd 0.025quant 0.5quant 0.975quant mode
Precision for the Gaussian observations 10.65 3.52 5.106 10.20 18.81 9.35
Precision for operator 12.74 20.63 0.481 6.52 63.35 1.15
The results appear more plausible. Transform to the SD scale:
sigmaalpha <- inla.tmarginal(function(x)1/sqrt(exp(x)),
imod$internal.marginals.hyperpar[[2]])
sigmaepsilon <- inla.tmarginal(function(x)1/sqrt(exp(x)),
imod$internal.marginals.hyperpar[[1]])
and output the summary statistics (note that transforming the summary statistics on the precision scale only works for the quantiles, since quantiles are preserved under monotone transformation but means and SDs are not):
sigalpha = c(inla.zmarginal(sigmaalpha, silent = TRUE),
mode=inla.mmarginal(sigmaalpha))
sigepsilon = c(inla.zmarginal(sigmaepsilon, silent = TRUE),
mode=inla.mmarginal(sigmaepsilon))
rbind(sigalpha, sigepsilon)
mean sd quant0.025 quant0.25 quant0.5 quant0.75 quant0.975 mode
sigalpha 0.48835 0.34812 0.12583 0.26007 0.38996 0.60265 1.4264 0.26504
sigepsilon 0.3191 0.053609 0.23116 0.28071 0.31272 0.35101 0.44094 0.29913
The posterior mode is most comparable with the (RE)ML estimates computed above. In this respect, the results are similar.
We can also get summary statistics on the random effects:
imod$summary.random$operator |> kable()
| ID | mean     | sd      | 0.025quant | 0.5quant | 0.975quant | mode     | kld     |
|----|----------|---------|------------|----------|------------|----------|---------|
| a  | -0.13063 | 0.32019 | -0.79984   | -0.11960 | 0.49535    | -0.08799 | 0.04964 |
| b  | -0.27746 | 0.32512 | -0.97235   | -0.26118 | 0.32551    | -0.23350 | 0.04693 |
| c  | 0.17958  | 0.32151 | -0.43875   | 0.16625  | 0.85736    | 0.13448  | 0.04885 |
| d  | 0.22847  | 0.32302 | -0.38191   | 0.21329  | 0.91461    | 0.18576  | 0.04812 |
Plot the posterior densities for the two SD terms:
ddf <- data.frame(rbind(sigmaalpha,sigmaepsilon),
errterm=gl(2,dim(sigmaalpha)[1],labels = c("alpha","epsilon")))
ggplot(ddf, aes(x,y, linetype=errterm))+
geom_line()+xlab("bright")+ylab("density")+xlim(0,2)
We see that the operator SD is less precisely known than the error SD. Although the mode for the operator SD is smaller, there is a substantial chance it could be much higher than the error SD.
Is there any variation between operators? We framed this question as an hypothesis test previously but that is not sensible in this framework. We might ask the probability that the operator SD is zero. Since we have posited a continuous prior that places no discrete mass on zero, the posterior probability will be zero, regardless of the data. Instead we might ask the probability that the operator SD is small. Given the response is measured to one decimal place, 0.1 is a reasonable representation of small if we take this to mean the smallest amount we care about.
We can compute the probability that the operator SD is smaller than 0.1:
inla.pmarginal(0.1, sigmaalpha)
[1] 0.0085291
The probability is small but not entirely negligible.
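This answer depends on what we regard as small. A quick sketch of the sensitivity evaluates the same posterior probability at a few candidate thresholds (0.05 and 0.2 are arbitrary choices on either side of 0.1):

# posterior probability that the operator SD falls below each threshold
sapply(c(0.05, 0.1, 0.2), function(s) inla.pmarginal(s, sigmaalpha))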
Now try more informative gamma priors for the precisions. Define them so that the mean value of the gamma prior is set to the inverse of the variance of the response. We expect the two error variances to be lower than the response variance, so this is an overestimate. The variance of the gamma prior (for the precision) is controlled by the apar shape parameter in the code. Setting apar=1 gives the exponential distribution. Shape values less than one result in densities that have a mode at zero and decrease monotonically. These have greater variance and hence are less informative.
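As a quick numerical check of this construction (a sketch; apar and bpar are the same quantities defined in the next chunk), the mean of a Gamma(apar, bpar) prior on the precision is apar/bpar, which matches the inverse of the response variance:

apar <- 0.5
bpar <- var(pulp$bright) * apar
# prior mean precision and the inverse response variance should agree
c(prior_mean_precision = apar/bpar, inverse_response_variance = 1/var(pulp$bright))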
apar <- 0.5
bpar <- var(pulp$bright)*apar
lgprior <- list(prec = list(prior="loggamma", param = c(apar,bpar)))
imod <- inla(bright ~ f(operator, model="iid", hyper = lgprior),
family="gaussian",
data=pulp)
summary(imod)
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 60.4 0.207 59.984 60.4 60.816 60.4 0.002
Model hyperparameters:
mean sd 0.025quant 0.5quant 0.975quant mode
Precision for the Gaussian observations 10.63 3.51 5.09 10.18 18.77 9.33
Precision for operator 10.95 8.98 1.52 8.54 34.58 4.30
Compute the summaries as before:
sigmaalpha <- inla.tmarginal(function(x)1/sqrt(exp(x)),
imod$internal.marginals.hyperpar[[2]])
sigmaepsilon <- inla.tmarginal(function(x)1/sqrt(exp(x)),
imod$internal.marginals.hyperpar[[1]])
sigalpha = c(inla.zmarginal(sigmaalpha, silent = TRUE),
mode=inla.mmarginal(sigmaalpha))
sigepsilon = c(inla.zmarginal(sigmaepsilon, silent = TRUE),
mode=inla.mmarginal(sigmaepsilon))
rbind(sigalpha, sigepsilon)
mean sd quant0.025 quant0.25 quant0.5 quant0.75 quant0.975 mode
sigalpha 0.37932 0.16436 0.17076 0.26434 0.34096 0.45282 0.80274 0.2794
sigepsilon 0.31946 0.053727 0.2314 0.28098 0.31305 0.35143 0.44163 0.29936
Slightly different outcome.
Make the plots:
ddf <- data.frame(rbind(sigmaalpha,sigmaepsilon),
errterm=gl(2,dim(sigmaalpha)[1],labels = c("alpha","epsilon")))
ggplot(ddf, aes(x,y, linetype=errterm))+
geom_line()+xlab("bright")+ylab("density")+xlim(0,2)
The posterior for the error SD is quite similar to that seen previously but the operator SD is larger and bounded away from zero and less dispersed.
We can compute the probability that the operator SD is smaller than 0.1:
inla.pmarginal(0.1, sigmaalpha)
[1] 3.3572e-05
The probability is very small. The choice of prior may be unsuitable in that no density is placed on an SD=0 (or infinite precision). We also have very little prior weight on low SD/high precision values. This leads to a posterior for the operator with very little density assigned to small values of the SD. But we can see from looking at the data or from prior analyses of the data that there is some possibility that the operator SD is very small.
In Simpson (2017), penalized complexity priors are proposed. This requires that we specify a scaling for the SDs of the random effects. We use the SD of the residuals of the fixed effects only model (what might be called the base model in the paper) to provide this scaling. The two parameters below specify that the prior probability of the operator SD exceeding 3*sdres is 0.01.
sdres <- sd(pulp$bright)
pcprior <- list(prec = list(prior="pc.prec", param = c(3*sdres,0.01)))
imod <- inla(bright ~ f(operator, model="iid", hyper = pcprior),
family="gaussian",
data=pulp)
summary(imod)
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 60.4 0.172 60.05 60.4 60.75 60.4 0
Model hyperparameters:
mean sd 0.025quant 0.5quant 0.975quant mode
Precision for the Gaussian observations 10.56 3.51 5.05 10.12 18.70 9.26
Precision for operator 25.10 34.61 2.21 14.81 111.99 5.70
Compute the summaries as before:
sigmaalpha <- inla.tmarginal(function(x)1/sqrt(exp(x)),
imod$internal.marginals.hyperpar[[2]])
sigmaepsilon <- inla.tmarginal(function(x)1/sqrt(exp(x)),
imod$internal.marginals.hyperpar[[1]])
sigalpha = c(inla.zmarginal(sigmaalpha, silent = TRUE),
mode=inla.mmarginal(sigmaalpha))
sigepsilon = c(inla.zmarginal(sigmaepsilon, silent = TRUE),
mode=inla.mmarginal(sigmaepsilon))
rbind(sigalpha, sigepsilon)
mean sd quant0.025 quant0.25 quant0.5 quant0.75 quant0.975 mode
sigalpha 0.29077 0.14866 0.095096 0.18461 0.25985 0.36156 0.66617 0.20377
sigepsilon 0.32048 0.054035 0.23185 0.28178 0.31405 0.35264 0.4433 0.30036
We get a similar result to the truncated normal prior used earlier although the operator SD is generally smaller.
Make the plots:
ddf <- data.frame(rbind(sigmaalpha,sigmaepsilon),
errterm=gl(2,dim(sigmaalpha)[1],labels = c("alpha","epsilon")))
ggplot(ddf, aes(x,y, linetype=errterm))+
geom_line()+xlab("bright")+ylab("density")+xlim(0,2)
We can compute the probability that the operator SD is smaller than 0.1:
inla.pmarginal(0.1, sigmaalpha)
[1] 0.03128
The probability is small but not insubstantial.
We can plot the posterior density of the intercept, adding vertical lines at the 95% credible interval limits:
mu <- data.frame(imod$marginals.fixed[[1]])
cbound = inla.qmarginal(c(0.025,0.975),mu)
ggplot(mu, aes(x,y)) + geom_line() +
geom_vline(xintercept = cbound) +
xlab("brightness")+ylab("density")
We can plot the posterior marginals of the random effects:
nlevels = length(unique(pulp$operator))
rdf = data.frame(do.call(rbind,imod$marginals.random$operator))
rdf$operator = gl(nlevels,nrow(rdf)/nlevels,labels=1:nlevels)
ggplot(rdf,aes(x=x,y=y,group=operator, color=operator)) +
geom_line() +
xlab("") + ylab("Density")
We see that operators 1 and 2 tend to be lower than 3 and 4. There is substantial overlap so we would hesitate to declare any difference between a pair of operators.
STAN performs Bayesian inference using MCMC.
There are two ways to access Stan from within R: rstan and cmdstanr. Previously I have used rstan but there have been some difficulties in updating that interface with Stan, such that only an older version of Stan can be used, and I cannot get rstan to work with R version 4.3. Although updates to fix this have been promised, I am switching to cmdstanr for this section because it works, and works with the most recent version of Stan.
You see below the Stan code to fit our model. Rmarkdown allows the use of Stan chunks (elsewhere I have R chunks). The Stan chunk will be compiled into an object called mod; the chunk header is:
cmdstan, output.var="mod", override = FALSE
An alternative approach would be to keep the Stan code in a separate file. This might be more convenient in practice, but we use the chunk approach since we can easily see the code. Compilation of the Stan code will now occur.
data {
  int<lower=0> N;                          // number of observations
  int<lower=0> J;                          // number of operators
  array[N] int<lower=1,upper=J> predictor; // operator index for each observation
  array[N] real response;                  // brightness
}
parameters {
  vector[J] eta;               // standardized operator effects
  real mu;                     // overall mean
  real<lower=0> sigmaalpha;    // operator SD
  real<lower=0> sigmaepsilon;  // error SD
}
transformed parameters {
  vector[J] a;
  vector[N] yhat;
  a = mu + sigmaalpha * eta;   // non-centered parameterization of operator means
  for (i in 1:N)
    yhat[i] = a[predictor[i]];
}
model {
  eta ~ normal(0, 1);
  response ~ normal(yhat, sigmaepsilon);
}
We have used uninformative (flat) priors for the overall mean and the two SD parameters. Prepare the data in a format consistent with the Stan program. It needs to be a list. We can't use the word operator since this is reserved for system use in Stan.
pulpdat <- list(N=nrow(pulp),
J=length(unique(pulp$operator)),
response=pulp$bright,
predictor=as.numeric(pulp$operator))
Do the MCMC sampling:
fit <- mod$sample(
data = pulpdat,
seed = 123,
chains = 4,
parallel_chains = 4,
refresh = 500 # print update every 500 iters
)
Running MCMC with 4 parallel chains...
Chain 1 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 2 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 3 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 4 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 3 Iteration: 500 / 2000 [ 25%] (Warmup)
Chain 3 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 3 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 4 Iteration: 500 / 2000 [ 25%] (Warmup)
Chain 1 Iteration: 500 / 2000 [ 25%] (Warmup)
Chain 1 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1 Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 1 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 2 Iteration: 500 / 2000 [ 25%] (Warmup)
Chain 2 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 2 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2 Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 2 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 3 Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 3 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 4 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 4 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 4 Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 4 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1 finished in 0.5 seconds.
Chain 2 finished in 0.5 seconds.
Chain 3 finished in 0.5 seconds.
Chain 4 finished in 0.4 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.5 seconds.
Total execution time: 0.7 seconds.
By default, we use 2000 iterations, repeated with 4 independent starts, giving 4 chains. We can thin but do not by default. The warm-up period is half the number of iterations.
We get warning messages about the fit. Since the default number of 2000 iterations runs in seconds, we can simply run a lot more iterations. This is rather lazy and would not be viable for more expensive computations, but sometimes CPU effort is preferred to mental effort. We stick with the default here.
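If we did want a longer run, a sketch of the call (iter_warmup and iter_sampling are cmdstanr arguments; the values and the name longfit are arbitrary choices here) would be:

# not run: a longer run to stabilize tail probability estimates
longfit <- mod$sample(data = pulpdat, seed = 123,
                      chains = 4, parallel_chains = 4,
                      iter_warmup = 1000, iter_sampling = 10000)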
Diagnostics to check the convergence are worthwhile.
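For example, cmdstanr can tabulate the standard HMC sampler diagnostics (divergences, tree depth saturation, E-BFMI) per chain:

# quick check of the sampler diagnostics
fit$diagnostic_summary()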
Extract the draws into a convenient dataframe format:
draws_df <- fit$draws(format = "df")
Check the diagnostics on the most problematic parameter:
ggplot(draws_df,
aes(x=.iteration,y=sigmaalpha,color=factor(.chain))) + geom_line() +
labs(color = 'Chain', x="Iteration")
We see the traces of the four chains overlaid in different colors. The chains appear roughly stationary although there are some occasional larger excursions (which is why we would prefer more iterations). Similar plots can be produced for the other parameters although these are less likely to give us trouble.
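For instance, the same trace plot for the overall mean mu (reusing draws_df from above):

# trace plot for mu across the four chains
ggplot(draws_df, aes(x=.iteration, y=mu, color=factor(.chain))) +
  geom_line() + labs(color = 'Chain', x="Iteration")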
We consider only the parameters of immediate interest:
fit$summary(c("mu","sigmaalpha","sigmaepsilon","a"))
# A tibble: 7 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mu 60.4 60.4 0.268 0.191 60.0 60.9 1.01 622. 279.
2 sigmaalpha 0.472 0.374 0.371 0.242 0.0897 1.19 1.01 617. 496.
3 sigmaepsilon 0.360 0.347 0.0725 0.0630 0.264 0.492 1.00 1427. 1459.
4 a[1] 60.3 60.3 0.151 0.148 60.0 60.5 1.00 3017. 3228.
5 a[2] 60.1 60.1 0.169 0.166 59.9 60.4 1.00 2134. 2237.
6 a[3] 60.6 60.6 0.157 0.150 60.3 60.8 1.00 2325. 2321.
7 a[4] 60.6 60.6 0.159 0.157 60.4 60.9 1.00 2421. 2606.
We see the posterior mean, median, SD and MAD of the samples, along with some quantiles from which we could construct a 95% credible interval (for example). The ess values are rough measures of the sample size, taking into account the correlation in the samples. The effective sample sizes for the mean and operator SD primary parameters are not large (considering the number of iterations), although adequate enough for most purposes. The Rhat statistic is known as the potential scale reduction factor. Values much greater than one indicate that additional samples would significantly improve the inference. In this case, the factors are all close to one so we feel no inclination to draw more samples.
Show posterior densities on the two variance terms:
sdf = stack(draws_df[,startsWith(colnames(draws_df),"sigma")])
colnames(sdf) = c("bright","sigma")
levels(sdf$sigma) = c("alpha","epsilon")
ggplot(sdf, aes(x=bright,color=sigma)) + geom_density() + xlim(0,2)
We see that the error SD can be localized much more than the operator SD. We can also look at the operator random effects:
adf = stack(draws_df[,startsWith(colnames(draws_df),"a")])
colnames(adf) = c("bright","operator")
levels(adf$operator) = letters[1:4]
ggplot(adf, aes(x=bright,color=operator)) + geom_density()
We see that the four operator distributions overlap.
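Since we have draws of the four operator means, we can also attach a probability to a pairwise comparison. A sketch for the b versus d comparison (the one found significant in the fixed effects analysis), using the a[2] and a[4] columns of draws_df:

# posterior probability that operator d has higher mean brightness than b
mean(draws_df[["a[4]"]] > draws_df[["a[2]"]])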
Previously, we took an interest in whether there is any variation between operators and answered this question by computing the probability that the operator SD is less than 0.1. Here we do the same by computing the proportion of sampled values less than 0.1, i.e. the lower tail probability:
fit$summary("sigmaalpha", tailprob = ~ mean(. <= 0.1))
# A tibble: 1 × 2
variable tailprob
<chr> <dbl>
1 sigmaalpha 0.0572
This is a somewhat larger probability than seen previously. The value obtained is sensitive to the choice of prior on the error SD. This can be changed within STAN but it is easier to experiment with this using BRMS.
BRMS stands for Bayesian Regression Models using STAN. It provides a convenient wrapper to STAN functionality.
Fitting the model is very similar to lmer as seen above:
suppressMessages(bmod <- brm(bright ~ 1+(1|operator), pulp, backend = "cmdstanr"))
Notice that we have specified cmdstanr to be used internally due to the problems with the default choice of rstan. We get some warnings but not as severe as those seen with our STAN fit above. We can obtain some posterior densities and diagnostics with:
plot(bmod)
We can look at the STAN code that brms used with:
stancode(bmod)
// generated with brms 2.19.0
functions {
}
data {
int<lower=1> N; // total number of observations
vector[N] Y; // response variable
// data for group-level effects of ID 1
int<lower=1> N_1; // number of grouping levels
int<lower=1> M_1; // number of coefficients per level
array[N] int<lower=1> J_1; // grouping indicator per observation
// group-level predictor values
vector[N] Z_1_1;
int prior_only; // should the likelihood be ignored?
}
transformed data {
}
parameters {
real Intercept; // temporary intercept for centered predictors
real<lower=0> sigma; // dispersion parameter
vector<lower=0>[M_1] sd_1; // group-level standard deviations
array[M_1] vector[N_1] z_1; // standardized group-level effects
}
transformed parameters {
vector[N_1] r_1_1; // actual group-level effects
real lprior = 0; // prior contributions to the log posterior
r_1_1 = sd_1[1] * z_1[1];
lprior += student_t_lpdf(Intercept | 3, 60.5, 2.5);
lprior += student_t_lpdf(sigma | 3, 0, 2.5)
- 1 * student_t_lccdf(0 | 3, 0, 2.5);
lprior += student_t_lpdf(sd_1 | 3, 0, 2.5)
- 1 * student_t_lccdf(0 | 3, 0, 2.5);
}
model {
// likelihood including constants
if (!prior_only) {
// initialize linear predictor term
vector[N] mu = rep_vector(0.0, N);
mu += Intercept;
for (n in 1 : N) {
// add more terms to the linear predictor
mu[n] += r_1_1[J_1[n]] * Z_1_1[n];
}
target += normal_lpdf(Y | mu, sigma);
}
// priors including constants
target += lprior;
target += std_normal_lpdf(z_1[1]);
}
generated quantities {
// actual population-level intercept
real b_Intercept = Intercept;
}
We see that brms is using Student t distributions with 3 degrees of freedom for the priors. For the two SDs (operator and error), these will be truncated at zero to form half-t distributions. You can get a more explicit description of the priors with prior_summary(bmod). These are qualitatively similar to the half-normal and the PC prior used in the INLA fit. This explains why we encountered fewer problems in the fit: we are supplying more informative priors. Nevertheless, we do need to increase the number of iterations for more accurate estimation of tail probabilities.
bmod <- brm(bright ~ 1+(1|operator), pulp, iter=10000, cores = 4, backend = "cmdstanr")
Because the STAN program was compiled earlier, this takes much less time overall even though we are doing 5 times as many iterations as the default number of 2000. We examine the fit:
summary(bmod)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: bright ~ 1 + (1 | operator)
Data: pulp (Number of observations: 20)
Draws: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
total post-warmup draws = 20000
Multilevel Hyperparameters:
~operator (Number of levels: 4)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.46 0.36 0.05 1.45 1.00 1142 671
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 60.40 0.28 59.77 60.97 1.00 1015 488
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.36 0.07 0.25 0.53 1.00 6286 8555
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
We now have better effective sample sizes. We can estimate the tail probability as before:
bps = as_draws_df(bmod)
mean(bps$sd_operator__Intercept < 0.1)
[1] 0.05515
A similar value to that seen previously. The priors used here put greater weight on smaller values of the SD.
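brms makes it easy to experiment with the prior. A sketch of such an experiment (the half-normal scale of 1 is an arbitrary choice for illustration, not a recommendation) refits with a tighter prior on the group-level SD and recomputes the tail probability:

# refit with a (hypothetical) half-normal prior on the operator SD
bmod2 <- brm(bright ~ 1+(1|operator), pulp,
             prior = set_prior("normal(0, 1)", class = "sd"),
             iter = 10000, cores = 4, backend = "cmdstanr")
mean(as_draws_df(bmod2)$sd_operator__Intercept < 0.1)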
It is possible to fit some GLMMs within the GAM framework of the mgcv package. An explanation of this can be found in this blog and also in this blog. The operator term must be a factor for this to work:
gmod = gam(bright ~ 1 + s(operator, bs = 're'), data=pulp, method="REML")
and look at the summary output:
summary(gmod)
Family: gaussian
Link function: identity
Formula:
bright ~ 1 + s(operator, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.400 0.149 404 <2e-16
Approximate significance of smooth terms:
edf Ref.df F p-value
s(operator) 2.29 3 3.2 0.021
R-sq.(adj) = 0.336 Deviance explained = 41.6%
-REML = 9.3131 Scale est. = 0.10625 n = 20
We get the estimate and SE for the fixed effect (the intercept in this example). We also get a test on the random effect (as described in this article). The hypothesis of no variation between the operators is rejected.
We can get an estimate of the operator and error SD:
gam.vcomp(gmod)
Standard deviations and 0.95 confidence intervals:
std.dev lower upper
s(operator) 0.26093 0.090812 0.74971
scale 0.32596 0.230511 0.46093
Rank: 2/2
which is the same as the REML estimate from lmer earlier.
The random effect estimates for the four operators can be found with:
coef(gmod)
(Intercept) s(operator).1 s(operator).2 s(operator).3 s(operator).4
60.40000 -0.12194 -0.25912 0.16767 0.21340
which is again the same as before.
In Wood (2019), a simplified version of INLA is proposed. We first construct the GAM model without fitting it and then use the ginla() function to perform the computation.
gmod = gam(bright ~ 1 + s(operator, bs = 're'), data=pulp, fit = FALSE)
gimod = ginla(gmod)
We get the posterior density for the intercept as:
plot(gimod$beta[1,],gimod$density[1,],type="l",xlab="brightness",ylab="density")
and for the four operator random effects as:
plot(gimod$beta[2,],gimod$density[2,],type="l",xlim=c(-0.8,0.6),
xlab="brightness",ylab="density")
for(i in 3:5){
lines(gimod$beta[i,],gimod$density[i,],lty=i-1)
}
It is not straightforward to obtain the posterior densities of the hyperparameters.
It is interesting to compare these methods. We cannot make numerical comparisons across the whole set because the models, assumptions and even statistical philosophies are not the same. We cannot hope to claim one method is objectively better than another. Even so, the researcher, who wants to know the nature of the difference between the operators, must decide between them and so a qualitative comparison is worthwhile.
Let us assume the perspective of a researcher who is not a statistician. Although such researchers are usually tolerant of complexity and ambiguity in research questions, they are far less willing to wade into the theoretical and computational marshes of statistics. They would like to submit a manuscript for publication where questions won't be raised over their choice of statistical methodology. They want software that is easy to use and does not require opaque, complex and finicky computations.
All the software discussed here has been provided free. Researchers have gone beyond the call of duty to provide this. Academic research articles require only text and perhaps some demonstration of software. The production of functional software requires immensely more effort and is usually not rewarded in a career or financial sense. The efforts of these producers have advanced science immeasurably beyond what would have been achieved by purely commercial statistical software. Any criticism below should be measured against these huge achievements.
- The fixed effect linear model solution is by far the easiest to implement. The methodology has been established for decades and the manner in which such results should be presented is well understood and accepted. It requires only the installation of base R and runs almost instantly. The only problem is that it does not properly answer the main questions of interest.
- The likelihood-based analysis using lme4 is relatively straightforward. We need to install only one well-established R package: lme4. This package has been used and installed so many times that we are unlikely to encounter any unexpected problems. We do need to understand the difference between ML and REML estimation. Testing for the operator effect is more complicated than we might like.
- Many years ago, lme4 would produce a likelihood ratio test that was not correct for this class of models (more precisely, the null distribution was wrong). Some statisticians might have had reservations about the use of such tests but for a long time, they were routinely used in these circumstances. After all, there are many tests used now which we know to be approximations but feel comfortable with accepting. No one wants to open that can of worms. But eventually the evidence built up against these particular tests and the functionality was abruptly withdrawn from lme4. Having used these tests in the first edition of Extending the Linear Model with R, I was perturbed. I had recommended a procedure that was wrong. Many others who had published results were also discomfited. But getting it right is more important than any embarrassment. The episode illustrated the uncomfortable fact that not all statistical procedures can be regarded as incontrovertibly correct for all time.
- I used a parametric bootstrap procedure here, which requires some undesirable complication from the perspective of the non-statistician. There are R packages which provide other good tests but these also require more advanced understanding. All this is a nuisance for the non-statistician writing up results for publication in a non-statistics journal. The referees may well not be familiar with these tests and trouble may result. The fixed effects analysis would not be questioned in this way.
- The Bayesian approaches introduce additional aspects of complexity into the analysis. Our non-statistician researcher will have to decide whether to brave the possible complications of submitting a Bayesian analysis to an academic journal. It's fine for me to extol the virtues of a Bayesian analysis but no good if the article gets rejected as a result. Putting that aside, we must specify priors. In some cases, the priors do not make much difference to the conclusions. In this example, the priors make a large difference to the chance that the operator variation is negligible. This reflects the small size of the data and that we have information on only four operators. Although this is the reality, it does contrast with the definitive answers provided by the previous analyses.
- The Bayesian approaches also entail practical and computational complications. Installing INLA or STAN involves various difficulties that may or may not arise depending on the operating system and other software on your computer. If I were to ask a class of 100 undergraduates to get INLA or STAN up and running, I would consign myself to many hours of dealing with frustrated emails involving computer problems that I have only the vaguest ideas about solving. It would be more practical to ask them to use a central server where everything is installed but this results in other problems regarding access and usage. In this example, STAN outputs a slew of obscure compiler warnings and even some errors. Some googling reveals that some other people receive much the same set of warnings as me. But many more do not, because otherwise the developers would have fixed it. The errors and warnings are probably due to the specific configuration of my computer. I never did discover the source of the problem - suggestions on the internet required uninstalling and reinstalling large pieces of software. Fortunately, I could see that the errors and warnings could be ignored. Update: Now that I run the same code again a few months later, the errors and warnings have gone away. I don't know why. They might come back. Update no.2: After updating R to version 4.3, Stan and BRMS stopped working. After much research and experimentation, I found that the solution was to switch from rstan to cmdstanr. Update no.3: Now on R version 4.4 in August 2024, installation and execution of INLA and STAN ran without (obvious) problems. After several decades of fiddling with computers to get software to work, I (mostly) possess the patience and experience to deal with these sorts of problems. That's far from true for many potential users.
- Both STAN and INLA are being actively developed. Of course, it's good to know that functionality and performance are being improved. It's not unusual to complete an analysis, come back to it a year later and find that the results are different in some important way or that the code does not run at all. Again I can ride along with this but less experienced users find this problematic. At some point, they may ask "How do I know if this is the right answer?" and the reassurance is not entirely convincing. In contrast, it is not unusual to feed S vintage code into base R and have it give the same results as it did 30 years ago. Update August 2024: INLA gave moderately different results from the previous year, particularly for the operator SD, presumably due to a change in the algorithm. The STAN results were also slightly different but the MCMC sampling is subject to the RNG; a larger MCMC iteration size than the default would help with this. BRMS gave very similar results but it is using a larger iteration size by default. Nothing obvious changed with the other computations based on other packages.
- Specifying and fitting Bayesian models is inherently more complicated than standard linear models or those fit by lme4. We expect to work harder but surely there is some scope for a notational advance in model specification. The extended Wilkinson-Rogers model notation and the grammar of graphics ideas seen in ggplot2 are examples where notation has helped advance understanding beyond simple convenience. STAN requires us to learn a new language merely to specify the model. Fortunately, brms (and rstanarm) allow less advanced users to skip these complications.
- Fitting Bayesian models is more likely to go wrong than fitting GLMMs. For our STAN model, there were perhaps insufficient iterations. We did get a warning that suggested the solution of more iterations. Some warnings remained but we knew that it was safe to ignore them. With the INLA model, the default prior led to some unbelievable results. It took some knowledge to know that the solution was to use a more informative prior (and this would have been another solution to the STAN fitting problem). There were fewer problems in using lme4, although the bootstrapping does throw large numbers of warnings when the parameter estimate falls on the boundary. Again we had to know it was safe to let this pass. All this requires some expertise and may confuse the non-statistician.
These analyses required a shockingly large number of packages. One worries that a small change in just one of these packages might cause the analysis above to fail or change in some unexpected manner. Not all of the attached packages are actually used but it is hard to know what we could do without.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mgcv_1.9-1 nlme_3.1-165 brms_2.21.0 Rcpp_1.0.13 cmdstanr_0.8.1 knitr_1.48 INLA_24.06.27
[8] sp_2.1-4 lme4_1.1-35.5 Matrix_1.7-0 ggplot2_3.5.1 faraway_1.0.8
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 farver_2.1.2 dplyr_1.1.4 loo_2.8.0 fastmap_1.2.0
[6] tensorA_0.36.2.1 digest_0.6.36 estimability_1.5.1 lifecycle_1.0.4 Deriv_4.1.3
[11] sf_1.0-16 StanHeaders_2.32.10 processx_3.8.4 magrittr_2.0.3 posterior_1.6.0
[16] compiler_4.4.1 rlang_1.1.4 tools_4.4.1 utf8_1.2.4 yaml_2.3.10
[21] data.table_1.15.4 labeling_0.4.3 bridgesampling_1.1-2 pkgbuild_1.4.4 classInt_0.4-10
[26] plyr_1.8.9 abind_1.4-5 KernSmooth_2.23-24 withr_3.0.1 grid_4.4.1
[31] stats4_4.4.1 fansi_1.0.6 xtable_1.8-4 e1071_1.7-14 colorspace_2.1-1
[36] inline_0.3.19 emmeans_1.10.3 scales_1.3.0 MASS_7.3-61 cli_3.6.3
[41] mvtnorm_1.2-5 rmarkdown_2.27 generics_0.1.3 RcppParallel_5.1.8 rstudioapi_0.16.0
[46] reshape2_1.4.4 minqa_1.2.7 DBI_1.2.3 proxy_0.4-27 rstan_2.32.6
[51] stringr_1.5.1 splines_4.4.1 bayesplot_1.11.1 parallel_4.4.1 matrixStats_1.3.0
[56] vctrs_0.6.5 boot_1.3-30 jsonlite_1.8.8 systemfonts_1.1.0 units_0.8-5
[61] glue_1.7.0 nloptr_2.1.1 codetools_0.2-20 ps_1.7.7 distributional_0.4.0
[66] stringi_1.8.4 gtable_0.3.5 QuickJSR_1.3.1 munsell_0.5.1 tibble_3.2.1
[71] pillar_1.9.0 htmltools_0.5.8.1 Brobdingnag_1.2-9 R6_2.5.1 fmesher_0.1.7
[76] evaluate_0.24.0 lattice_0.22-6 backports_1.5.0 rstantools_2.4.0 class_7.3-22
[81] svglite_2.1.3 coda_0.19-4.1 gridExtra_2.3 checkmate_2.3.2 xfun_0.46
[86] pkgconfig_2.0.3