Hand rolling empirical Bayes estimation of a hierarchical model to learn how it works
This is a quick experiment to teach myself whether I can just use full likelihood optimization to get well-behaved empirical Bayes estimates of both the trial-level random effects and the metastudy group-level hyperparameters in a mixed model.
Why? Because proper statisticians say you aren't supposed to use the full likelihood to estimate random effects and hyperparameters jointly, since it's biased for the group-level variance components. But how much does that matter for model-fitting metastudy applications?
I care because I like to use simple optimizations when initially calibrating models on the kinds of strange data I get from papers when I don't have the individual-level data. I want to better understand when I can take shortcuts and still get decent statistical properties during model development. Then, if doing the stats better proves to be important, relative to other limitations of the model, we can do better later.
Also, I use packages to do mixed-effects modeling all the time, but I’ve never worked out a non-Gaussian example. That is not a morally-sustainable position.
What? Compare the full maximum likelihood (ML) estimate, where I jointly optimize over the trial-level random effects and the group-level hyperparameters, vs the restricted maximum likelihood (REML) estimate, where I first optimize only over the hyperparameters and then find an empirical Bayes estimate of the random effects given the hyperparameters.
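Concretely, in my notation (not the MathWorks doc's, linked below): with trial arms $i = 1, \dots, 10$, $y_i$ positive outcomes out of $n_i$ subjects, and trial-level probabilities $p_i$ drawn from a beta distribution with mean $m$ and precision $s$ (shapes $\alpha = ms$, $\beta = s - ms$), the full-likelihood approach solves

$$
(\hat m, \hat s, \hat p_1, \ldots, \hat p_{10}) = \arg\max \sum_{i=1}^{10} \left[ \log \mathrm{Beta}(p_i \mid ms,\, s - ms) + \log \mathrm{Binom}(y_i \mid n_i, p_i) \right],
$$

while the REML-style approach first solves

$$
(\hat m, \hat s) = \arg\max \sum_{i=1}^{10} \log \mathrm{BetaBinom}(y_i \mid n_i, m, s)
$$

and then maximizes the first objective over the $p_i$ alone, with $(\hat m, \hat s)$ held fixed.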
For the best tight description of the two algorithms that I've found, see the documentation for Estimating Parameters in Linear Mixed-Effects Models from The MathWorks.
What did I learn? WHELP, it looks like full likelihood optimization is just fine, at least for this example, which is on a relevant scale for a vaccine model-building metastudy. (Of order 10 trial arms, with of order 100 subjects per arm, and binary outcomes.)
I also learned that it was easy to implement the REML approach in this
example. So, if it’s also easy on the real calibration problem to come,
I’ll probably just do that.
To learn more (very quickly) about what I'm trying to do, here's a useful chat log with ChatGPT4.5 to go along with this post.
What am I looking for from you? I’ve learned a lot of stats over the last 15 years, but I’m not trained as a statistician and I’m always learning more. Please comment if you see anything you’d like to correct or expand upon. Also, let me know if this was useful to you. Thanks!
Note about how this post was generated. This post was generated directly from the commented R script using the knitr::spin functionality of rmarkdown::render. This gives all the advantages of scripts and R Markdown, without the disadvantages. Check out this blog post by Dean Attali to learn more.
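If you haven't seen this workflow before, the whole pipeline is one call. A minimal sketch (the script file name here is a placeholder):

# render a commented .R script directly to a report; rmarkdown::render()
# detects the .R extension and runs the script through knitr::spin
rmarkdown::render("my_commented_script.R")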
Setting up the study
First, we set up a study with 10 trial arms and 100 people measured per arm. The group level is the ensemble of trial arms, and the "individuals" in this example are the trial arms themselves. The observations for each trial arm are zero or one for the outcome, and each arm has a true probability of the outcome drawn from a beta distribution describing the ensemble of studies. Note that this isn't set up as a real vaccine efficacy measurement; I'm just interested in cohort parameter estimation for now. VE would be a transform on that, if the arms were labeled treatment vs control.
# set up the environment
library(tidyverse)
library(rmutil)
library(stats4)
library(knitr)
# set seed for reproducibility. Comment out to run many examples to see that this example is representative.
set.seed(100)
## simulate binomial draws from a beta random-effects model
# typical example of a metastudy, 10 trials, ~100 subjects per trial
n_trials = 10 # don't change! I hard-coded 10 later for dumb reasons w/r/t variable names in stats4::mle that I didn't feel like debugging.
n_subjects=100 # you can change this if you want.
# set up the measurement-level data frame
subjects = expand_grid(ID=1:n_trials,rep = 1:n_subjects)
## draw probability of positive for each ID from a beta distribution
# rmutil::dbetabinom parameterization: m = mean, s = precision (alpha + beta)
m=0.3
s=8
# equivalent rbeta shape parameters
alpha=m*s
beta=s-alpha
# add the true trial-level outcome probability to the subject data
subjects = subjects |> left_join(
  data.frame(ID=1:n_trials,p=rbeta(n_trials,shape1=alpha,shape2=beta)),
  by='ID')
# draw binomial samples for each subject
subjects = subjects |> cbind(
data.frame(response=rbinom(n=nrow(subjects),size=1,prob=subjects$p)))
# collapse the measurement-level data into trial-level data for fitting later.
# NB: the n_trials column counts subjects per arm (the binomial size); it
# shadows the global n_trials, which is the number of arms.
observed = subjects |> group_by(ID,p) |>
  summarize(n_pos=sum(response),
            n_trials=n(),
            p_hat = n_pos/n_trials)
observed |> kable()
| ID | p         | n_pos | n_trials | p_hat |
|----|-----------|-------|----------|-------|
|  1 | 0.2134503 |    22 |      100 |  0.22 |
|  2 | 0.3254404 |    34 |      100 |  0.34 |
|  3 | 0.2853022 |    27 |      100 |  0.27 |
|  4 | 0.4947419 |    48 |      100 |  0.48 |
|  5 | 0.3225656 |    29 |      100 |  0.29 |
|  6 | 0.3637010 |    49 |      100 |  0.49 |
|  7 | 0.2012123 |    23 |      100 |  0.23 |
|  8 | 0.4527466 |    45 |      100 |  0.45 |
|  9 | 0.1662158 |    17 |      100 |  0.17 |
| 10 | 0.2363248 |    34 |      100 |  0.34 |
Parameter inference with the full likelihood
The first approach I'm considering is maximum likelihood estimation of both the population-level hyperparameters and the trial-level outcome probabilities. We can estimate both the MLE and the Wald confidence intervals from the variance-covariance matrix using stats4::mle.
In the code chunk below, I define the likelihood function, run the optimization (using method = 'L-BFGS-B' with appropriate bounds to respect the parameter domains), and tidy up the results for display and plotting later.
# define the negative log likelihood for the fully parameterized model: beta-distributed population + individual binomial probabilities
# Note that I'm hardcoding data for this demo so I don't have to worry about passing data to mle later. This is for convenience.
minus_log_lik = function(m=alpha/(alpha+beta),s=(alpha+beta),
p1=observed$p[1],p2=observed$p[2],p3=observed$p[3],p4=observed$p[4], # there's surely a nicer way to do this...
p5=observed$p[5],p6=observed$p[6],p7=observed$p[7],p8=observed$p[8],
p9=observed$p[9],p10=observed$p[10]){
a = m*s
b = s-a
p =c(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10)
-sum(dbeta(x=p,shape1=a,shape2=b,log=TRUE) + dbinom(x=observed$n_pos,size=observed$n_trials,prob=p,log=TRUE))
}
minus_log_lik()
# find the maximum likelihood estimate simultaneously for both the trial-level random effects and the hyperparameters.
model_lik = stats4::mle(minus_log_lik, start = list(m=0.5,s=10,p1=0.5,p2=0.5,p3=0.5,p4=0.5,
p5=0.5,p6=0.5,p7=0.5,p8=0.5,p9=0.5,p10=0.5),
method='L-BFGS-B',
upper=c(0.99999,Inf,0.99999,0.99999,0.99999,0.99999,0.99999,0.99999,0.99999,0.99999,0.99999,0.99999),
lower=c(0.00001,0,0.00001,0.00001,0.00001,0.00001,0.00001,0.00001,0.00001,0.00001,0.00001,0.00001))
summary(model_lik)
# tidy up the results for display and later plotting
full_lik_params = data.frame(true=c(m=m,s=s,observed$p), estimate=coef(model_lik), se=sqrt(diag(vcov(model_lik)))) |>
mutate( z = (true-estimate)^2/se^2, # squared standardized error; above 1.96^2 ~ 3.84, the truth lies outside the 95% Wald interval
lower = estimate - 1.96*se, # Wald confidence interval
upper = estimate + 1.96*se) |>
rownames_to_column(var='param') |>
select(param, everything()) |>
mutate(param=factor(param,levels=c('m','s','p1','p2','p3','p4','p5','p6','p7','p8','p9','p10'))) |>
mutate(level=factor(c('group_mean','group_dispersion',rep('trial',10)))) |>
mutate(param_index = 1:12)
full_lik_params
## param true estimate se z lower upper level param_index
## 1 m 0.3000000 0.3246235 0.02967484 0.68853213 0.2664609 0.3827862 group_mean 1
## 2 s 8.0000000 32.0202634 19.93898104 1.45127465 -7.0601394 71.1006663 group_dispersion 2
## 3 p1 0.2134503 0.2414595 0.04012803 0.48719936 0.1628086 0.3201105 trial 3
## 4 p2 0.3254404 0.3337523 0.04203733 0.03909562 0.2513592 0.4161455 trial 4
## 5 p3 0.2853022 0.2799146 0.04053223 0.01766756 0.2004715 0.3593578 trial 5
## 6 p4 0.4947419 0.4414270 0.04786335 1.24076659 0.3476149 0.5352392 trial 6
## 7 p5 0.3225656 0.2952969 0.04085350 0.44552396 0.2152241 0.3753698 trial 7
## 8 p6 0.3637010 0.4491181 0.04839787 3.11484948 0.3542583 0.5439780 trial 8
## 9 p7 0.2012123 0.2491503 0.04016259 1.42467515 0.1704316 0.3278690 trial 9
## 10 p8 0.4527466 0.4183538 0.04634406 0.55074260 0.3275194 0.5091881 trial 10
## 11 p9 0.1662158 0.2030039 0.04030446 0.83312302 0.1240072 0.2820006 trial 11
## 12 p10 0.2363248 0.3337523 0.04203733 5.37146452 0.2513592 0.4161455 trial 12
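As an aside, the p1...p10 boilerplate above (and the hard-coded 10 arms) is an artifact of stats4::mle's named-argument interface. Here's a hedged sketch of the same fit with a single parameter vector via stats::optim; the names nll_vec and fit are mine, and the Hessian plays the role of mle's variance-covariance machinery:

# same joint negative log-likelihood, but over one vector theta = c(m, s, p_1..p_n)
nll_vec = function(theta){
  m = theta[1]; s = theta[2]; p = theta[-(1:2)]
  -sum(dbeta(p, shape1=m*s, shape2=s-m*s, log=TRUE) +
       dbinom(observed$n_pos, size=observed$n_trials, prob=p, log=TRUE))
}
fit = optim(par=c(0.5, 10, rep(0.5, n_trials)), fn=nll_vec,
            method='L-BFGS-B', hessian=TRUE,
            lower=c(1e-5, 0, rep(1e-5, n_trials)),
            upper=c(1-1e-5, Inf, rep(1-1e-5, n_trials)))
sqrt(diag(solve(fit$hessian)))  # Wald standard errors, as from vcov(model_lik)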
Parameter inference with REML and empirical Bayes
Before looking too closely at the results, let's look at the other (statistician-preferred) approach: estimate the hyperparameters first with the trial-level random effects integrated out, and then estimate the trial-level random effects given the hyperparameter MLEs.
For the beta-binomial model, the restricted likelihood is just the analytic beta-binomial distribution, which I wrap into a negative log likelihood for fitting.
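That is, the trial-level probability integrates out analytically (a standard identity, written in the notation above):

$$
\mathrm{BetaBinom}(y \mid n, m, s) = \int_0^1 \mathrm{Binom}(y \mid n, p)\, \mathrm{Beta}(p \mid ms,\, s - ms)\, dp.
$$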
# hyperparameter likelihood
minus_log_REML = function(m=alpha/(alpha+beta),s=(alpha+beta)){
-sum(rmutil::dbetabinom(y=observed$n_pos,size=observed$n_trials,m=m,s=s,log=TRUE))
}
minus_log_REML()
# hyperparameter MLE
model_REML = stats4::mle(minus_log_REML, start = list(m=0.5,s=1),
method='L-BFGS-B',
upper=c(0.99999,Inf),
lower=c(0.00001,0))
summary(model_REML)
Next, given the MLEs for the hyperparameters, I estimate the trial-level outcome probabilities and tidy up all the results.
# empirical Bayes likelihood for the individual-level random effects. Plug in MLE of the hyperparameters.
minus_log_REML_p = function(p1=observed$p[1],p2=observed$p[2],p3=observed$p[3],p4=observed$p[4], # there's probably a nicer way to do this...
p5=observed$p[5],p6=observed$p[6],p7=observed$p[7],p8=observed$p[8],
p9=observed$p[9],p10=observed$p[10]){
a = coef(model_REML)[1]*coef(model_REML)[2]
b = coef(model_REML)[2]-a
p =c(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10)
-sum(dbeta(x=p,shape1=a,shape2=b,log=TRUE) + dbinom(x=observed$n_pos,size=observed$n_trials,prob=p,log=TRUE))
}
minus_log_REML_p()
# estimate the trial-level probabilities
model_REML_p = stats4::mle(minus_log_REML_p, start = list(p1=0.5,p2=0.5,p3=0.5,p4=0.5,
p5=0.5,p6=0.5,p7=0.5,p8=0.5,p9=0.5,p10=0.5),
method='L-BFGS-B',
upper=c(0.99999,0.99999,0.99999,0.99999,0.99999,0.99999,0.99999,0.99999,0.99999,0.99999),
lower=c(0.00001,0.00001,0.00001,0.00001,0.00001,0.00001,0.00001,0.00001,0.00001,0.00001))
summary(model_REML_p)
# and tidy up the results
REML_params=data.frame(true=c(m=m,s=s,observed$p), estimate =c(coef(model_REML),coef(model_REML_p)), se = c(sqrt(diag(vcov(model_REML))),sqrt(diag(vcov(model_REML_p))))) |>
mutate( z = (true-estimate)^2/se^2,
lower = estimate - 1.96*se,
upper = estimate + 1.96*se) |>
rownames_to_column(var='param') |>
select(param, everything()) |>
mutate(param=factor(param,levels=c('m','s','p1','p2','p3','p4','p5','p6','p7','p8','p9','p10'))) |>
mutate(level=factor(c('group_mean','group_dispersion',rep('trial',10)))) |>
mutate(param_index = 1:12)
REML_params
## param true estimate se z lower upper level param_index
## 1 m 0.3000000 0.3280964 0.03351673 0.70271450 0.2624036 0.3937892 group_mean 1
## 2 s 8.0000000 23.1436114 12.56215846 1.45321669 -1.4782192 47.7654420 group_dispersion 2
## 3 p1 0.2134503 0.2360292 0.03858028 0.34251204 0.1604119 0.3116466 trial 3
## 4 p2 0.3254404 0.3350849 0.04288521 0.05057560 0.2510299 0.4191399 trial 4
## 5 p3 0.2853022 0.2773025 0.04067250 0.03868527 0.1975843 0.3570206 trial 5
## 6 p4 0.4947419 0.4506498 0.04520559 0.95134048 0.3620469 0.5392528 trial 6
## 7 p5 0.3225656 0.2938117 0.04138481 0.48273865 0.2126975 0.3749259 trial 7
## 8 p6 0.3637010 0.4589045 0.04527371 4.42194639 0.3701680 0.5476410 trial 8
## 9 p7 0.2012123 0.2442839 0.03903654 1.21741661 0.1677723 0.3207956 trial 9
## 10 p8 0.4527466 0.4258858 0.04492556 0.35748004 0.3378317 0.5139399 trial 10
## 11 p9 0.1662158 0.1947565 0.03597917 0.62925532 0.1242373 0.2652756 trial 11
## 12 p10 0.2363248 0.3350849 0.04288521 5.30331730 0.2510299 0.4191399 trial 12
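As a sanity check (not part of the original script): because the beta prior is conjugate to the binomial likelihood, each trial's empirical Bayes posterior is available in closed form as Beta(a + n_pos, b + n_trials - n_pos), and the optimizer above should land on its mode. The variable names below are mine:

# closed-form check of the empirical Bayes estimates via beta-binomial conjugacy
a = coef(model_REML)['m']*coef(model_REML)['s'] # prior shape1 at the hyperparameter MLE
b = coef(model_REML)['s'] - a                   # prior shape2
# the posterior for each arm is Beta(a + n_pos, b + n_trials - n_pos); its mode is
analytic_mode = (a + observed$n_pos - 1)/(a + b + observed$n_trials - 2)
cbind(optimized = coef(model_REML_p), analytic = analytic_mode)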
Results
Now that we’ve run both estimation processes, we can compare the results to each other and the underlying simulated truth.
ggplot() +
geom_point(data=full_lik_params,aes(x=param_index,y=estimate),color='red') +
geom_segment(data=full_lik_params,aes(x=param_index,y=lower,yend=upper,group=param),color='red') +
geom_point(data=REML_params,aes(x=param_index+0.1,y=estimate),color='blue') +
geom_segment(data=REML_params,aes(x=param_index+0.1,y=lower,yend=upper,group=param),color='blue') +
geom_point(data=full_lik_params,aes(x=param_index-0.1,y=true)) +
theme_bw()+
facet_wrap('level',scales='free') +
scale_x_continuous(breaks=1:12,labels=levels(full_lik_params$param),minor_breaks = NULL) +
xlab('') +
geom_text(data=data.frame(param_index=rep(1.945,3),y=full_lik_params$upper[2]*c(0.95,0.85,0.75),label=c('truth','full likelihood','REML'),level=rep('group_dispersion',3)),
aes(x=param_index,y=y,label=label),color=c('black','red','blue'))
In this example, the results are qualitatively similar.

- Nominal 95% confidence interval coverage is acceptable for my purposes for both methods. (The z column above is the squared standardized error, so values above 1.96^2 ≈ 3.84 flag parameters whose truth falls outside the Wald interval; that happens for only one or two of the twelve parameters under each method.) This is especially acceptable given that Wald intervals are known to be too narrow and, because I worked on the outcome scale rather than the link scale, don't enforce that the dispersion hyperparameter must be positive (see the sketch after this list).
- The trial-level effect estimates and the group-level mean estimates are nearly identical between the two methods.
- The group-level dispersion parameter is biased high with the full likelihood vs the REML estimate, as expected from the stats theory, and in this case its confidence interval is a bit wider. (The direction makes sense: s is a precision-like parameter, with between-trial variance m(1-m)/(s+1), so ML's classic underestimation of variance components appears here as an overestimate of s.)
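For completeness, here's a minimal sketch (not run above; the names are mine) of refitting the marginal likelihood on the link scale, logit(m) and log(s), so the Wald machinery respects the parameter domains:

# refit the beta-binomial marginal likelihood on the link scale so that
# back-transformed Wald intervals respect 0 < m < 1 and s > 0
minus_log_REML_link = function(logit_m=0, log_s=0){
  -sum(rmutil::dbetabinom(y=observed$n_pos, size=observed$n_trials,
                          m=plogis(logit_m), s=exp(log_s), log=TRUE))
}
model_REML_link = stats4::mle(minus_log_REML_link,
                              start=list(logit_m=0, log_s=0))
# Wald interval for log(s), exponentiated back to the s scale (strictly positive)
exp(coef(model_REML_link)['log_s'] +
    c(-1.96,1.96)*sqrt(vcov(model_REML_link)['log_s','log_s']))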
My take-away
The most interesting parameters for the analogous problem in the immunity model building to come are the group-level mean and the trial-level individual effects, both of which are estimated with similar quality regardless of the method. The group-level dispersion is the least relevant parameter for that activity, so my take-away is that I'm probably fine to do whatever is easiest to implement.
And regardless, this will be a great place to engage with some peer review after the model is prototyped, so that we can do better if needed.
Thanks for reading!