I want to fit a random-intercept complementary log-log regression in R, in order to check for unobserved user heterogeneity. I have searched through the internet and in books but haven't found a solution.
This works fine for me (more or less: see the notes below):
## added data.frame()
df <- data.frame(people     = c(1,1,1,2,2,3,3,4,4,5,5),
                 activity   = c(1,1,1,2,2,3,4,5,5,6,6),
                 completion = c(0,0,1,0,1,1,1,0,1,0,1),
                 sunshine   = c(1,2,3,4,5,4,6,2,4,8,4))
library(lme4)  ## glmer() is in the lme4 package
model_re1 <- completion ~ (1|people) + sunshine
clog_re1 <- glmer(model_re1, data = df,
                  family = binomial(link = cloglog))
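Since the random intercept is what captures the user-level heterogeneity you're interested in, the estimated variance of (1|people) is the quantity to inspect; for example, with the standard lme4 accessors:
summary(clog_re1)   ## fixed effects plus the random-intercept variance
VarCorr(clog_re1)   ## just the estimated variance/SD for (1|people)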
update: I'm sorry to tell you that mixed models (GLMMs) are significantly more computationally intensive than standard GLMs: 500K observations with 68 predictor variables is definitely a large problem, and you should expect the fit to take hours. I have a few suggestions:
The glmmTMB package is, as far as I know, the fastest option within R for this problem (lme4 will scale badly with large numbers of predictor variables), but it might run into memory constraints. The MixedModels.jl Julia package might be faster still. Here's an example with 10,000 observations and a single predictor variable:
n <- 10000
set.seed(101)
## 2000 people with 5 observations each (n = 10,000 rows in total)
dd <- data.frame(people   = factor(rep(1:2000, each = 5)),
                 sunshine = sample(1:8, size = n, replace = TRUE))
## simulate a cloglog response from a random-intercept model
dd$completion <- simulate(~ (1|people) + sunshine,
                          newdata = dd,
                          family = binomial(link = "cloglog"),
                          newparams = list(beta = c(0, 1),
                                           theta = 1))[[1]]
glmer runs for 80 seconds and then fails:
## refit the original glmer model on the larger simulated data set
system.time(update(clog_re1, data = dd, verbose = 100))
On the other hand, glmmTMB does this problem in about 20 seconds (I have 8 cores on my computer, and glmmTMB uses all of them, so the CPU allocation to this job goes up to 750%; if you have fewer cores the elapsed computational time will increase accordingly).
library(glmmTMB)
system.time(glmmTMB(model_re1, data = dd,
                    family = binomial(link = cloglog),
                    verbose = TRUE))
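If you want to control how many threads glmmTMB uses (e.g. to leave some cores free for other work), recent versions expose a parallel argument in glmmTMBControl(); this is just a sketch, so check ?glmmTMBControl for the version you have installed:
## restrict the fit to 4 threads (available in recent glmmTMB versions)
system.time(glmmTMB(model_re1, data = dd,
                    family = binomial(link = cloglog),
                    control = glmmTMBControl(parallel = 4)))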
If I add four more predictor variables (for a total of 5), the computation time goes up to 46 seconds (again using 8 cores: the total computation time across all cores is 320 seconds). With 13 times as many predictor variables and 50 times as many observations, you should definitely expect this to be a challenging computation.
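For what it's worth, a sketch of that five-predictor timing run could look like the following; the extra predictors x2-x5 and their coefficient values are invented here purely for illustration:
## add four made-up continuous predictors to the simulated data
set.seed(102)
dd5 <- dd
dd5[paste0("x", 2:5)] <- replicate(4, rnorm(n), simplify = FALSE)
dd5$completion <- simulate(~ (1|people) + sunshine + x2 + x3 + x4 + x5,
                           newdata = dd5,
                           family = binomial(link = "cloglog"),
                           newparams = list(beta = c(0, 1, rep(0.5, 4)),
                                            theta = 1))[[1]]
system.time(glmmTMB(completion ~ (1|people) + sunshine + x2 + x3 + x4 + x5,
                    data = dd5, family = binomial(link = cloglog)))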
A very crude assessment of heterogeneity would be to fit the homogeneous model and compare the residual deviance (or sum of squared Pearson residuals) to the residual degrees of freedom of the model; if the former is much larger, that's evidence of some form of mis-fit (heterogeneity or something else).
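For example, with an ordinary glm() fit of the homogeneous model (a sketch; how much larger counts as "much larger" is a judgment call):
## homogeneous model: same cloglog link, no random effect
clog_glm <- glm(completion ~ sunshine, data = dd,
                family = binomial(link = cloglog))
deviance(clog_glm)                             ## residual deviance
sum(residuals(clog_glm, type = "pearson")^2)   ## sum of squared Pearson residuals
df.residual(clog_glm)                          ## residual degrees of freedom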