Random Intercept GLM

Submitted by 我的梦境 on 2020-06-09 05:39:12

Question


I want to fit a random-intercept complementary log-log regression in R, in order to check for unobserved user heterogeneity. I have searched through the internet and books and have only found a solution in Stata; maybe someone can adapt that to R. In Stata there are two commands available:

  1. xtcloglog for two-level random intercept
  2. gllamm for random-coefficient and higher-level models

My data records whether people's activities are completed or not, and whether completion is affected by sunshine: completion is the outcome variable, and sunshine and the other variables listed below are the explanatory variables. This is a simplified version:

    581755 obs. of 68 variables:
     $ activity          : int  37033 37033 37033 37033 37033 37033 37033 37033 37033 37033 ...
     $ people         : int  5272 5272 5272 5272 5272 5272 5272 5272 5272 5272 ...
     $ completion: num 0 0 0 0 0 0 0 0 0 0 ...
     $ active            : int  0 2 2 2 2 2 2 2 2 2 ...
     $ overdue           : int  0 0 0 0 0 0 0 0 0 0 ...
     $ wdsp              : num  5.7 5.7 7.7 6.4 3.9 5.8 3.5 6.3 4.8 9.4 ...
     $ rain              : num  0 0 0 0 0 0 0 0 0 0 ...
     $ UserCompletionRate: num [1:581755, 1] NaN -1.55 -1.55 -1.55 -1.55 ...
      ..- attr(*, "scaled:center")= num 0.462
      ..- attr(*, "scaled:scale")= num 0.298
     $ DayofWeekSu       : num  0 0 0 0 0 1 0 0 0 0 ...
     $ DayofWeekMo       : num  0 0 0 0 0 0 1 0 0 0 ...
     $ DayofWeekTu       : num  1 0 0 0 0 0 0 1 0 0 ...
     $ DayofWeekWe       : num  0 1 0 0 0 0 0 0 1 0 ...
     $ DayofWeekTh       : num  0 0 1 0 0 0 0 0 0 1 ...
     $ DayofWeekFr       : num  0 0 0 1 0 0 0 0 0 0 ...

     $ MonthofYearJan    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearFeb    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearMar    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearApr    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearMay    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearJun    : num  1 1 1 1 1 1 1 1 1 1 ...
     $ MonthofYearJul    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearAug    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearSep    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearOct    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearNov    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ cold              : num  0 0 0 0 0 0 0 0 0 0 ...
     $ hot               : num  0 0 0 0 0 0 0 0 0 0 ...
     $ overduetask       : num  0 0 0 0 0 0 0 0 0 0 ...

Original (simplified) data:

 df <-  people = c(1,1,1,2,2,3,3,4,4,5,5),
        activity = c(1,1,1,2,2,3,4,5,5,6,6),
        completion = c(0,0,1,0,1,1,1,0,1,0,1),
        sunshine = c(1,2,3,4,5,4,6,2,4,8,4)

So far I've used this code for the cloglog:

model <- as.formula("completion ~  sunshine")
clog_full <- glm(model, data = df, family = binomial(link = cloglog))
summary(clog_full)

Using package glmmML:

library(glmmML)
model_re <- as.formula("completion ~ sunshine")
clog_re <- glmmML(model_re, cluster = people, data = df,
                  family = binomial(link = cloglog))
summary(clog_re)

Using package lme4:

library(lme4)
model_re1 <- as.formula("completion ~ (1|people) + sunshine")
clog_re1 <- glmer(model_re1, data = df,
                  family = binomial(link = cloglog))
summary(clog_re1)

However, R never returns a result from any of these: the models run but never finish. Should I use people or activities as the cluster?

If anyone also has an idea of how to run this model with a fixed intercept, I would be happy to know.


Answer 1:


This works fine for me (more or less: see notes below):

## added data.frame()
df <-  data.frame(people = c(1,1,1,2,2,3,3,4,4,5,5),
        activity = c(1,1,1,2,2,3,4,5,5,6,6),
        completion = c(0,0,1,0,1,1,1,0,1,0,1),
        sunshine = c(1,2,3,4,5,4,6,2,4,8,4)
        )

model_re1 <- completion ~  (1|people) + sunshine
clog_re1 <- glmer(model_re1, data=df,
                  family = binomial(link = cloglog))
  • This finishes very quickly (less than a second): maybe you forgot to close a quote or a parenthesis or something ... ?
  • However, it does produce a message "boundary (singular) fit: see ?isSingular", which occurs because your data set is so small/noisy that the best estimate of among-person variation is zero (because it can't be negative).
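The singular fit described above can be confirmed directly after fitting. This is a self-contained sketch using the toy data from the question; with real data the variance estimate may well be positive:

```r
library(lme4)

df <- data.frame(people = c(1,1,1,2,2,3,3,4,4,5,5),
                 activity = c(1,1,1,2,2,3,4,5,5,6,6),
                 completion = c(0,0,1,0,1,1,1,0,1,0,1),
                 sunshine = c(1,2,3,4,5,4,6,2,4,8,4))

clog_re1 <- glmer(completion ~ (1|people) + sunshine,
                  data = df, family = binomial(link = cloglog))

isSingular(clog_re1)  # TRUE: the among-person variance is estimated at zero
VarCorr(clog_re1)     # shows the (Intercept) std.dev. for 'people'
```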

update: I'm sorry to tell you that mixed models (GLMMs) are significantly more computationally intensive than standard GLMs: 500K observations with 68 predictor variables is definitely a large problem, and you should expect the fit to take hours. I have a few suggestions:

  • you should definitely try out subsets of your data (both observations and predictor variables) to get a sense of how the computation time will scale
  • the glmmTMB package is as far as I know the fastest option within R for this problem (lme4 will scale badly with large numbers of predictor variables), but might run into memory constraints. The MixedModels.jl Julia package might be faster.
  • you can usually turn on a "verbose" or "tracing" option to let you know that the model is at least working on the problem, rather than completely stuck (it's not really feasible to know how long it will take to complete, but at least you know something is happening ...)
  • If Stata is much faster (I doubt it, but it's possible) you could use it
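The first suggestion could be implemented roughly as follows. This is a sketch using simulated stand-in data (the real model would include all 68 predictors); only the timing mechanics matter here:

```r
library(glmmTMB)
set.seed(101)

# Simulated stand-in for the real data set
n <- 6000
full <- data.frame(people = factor(rep(1:2000, each = 3)),
                   sunshine = sample(1:8, n, replace = TRUE),
                   completion = rbinom(n, 1, 0.4))

# Fit on increasing fractions of the rows to see how cost scales
times <- sapply(c(0.25, 0.5, 1), function(frac) {
  idx <- sample(n, size = frac * n)
  system.time(
    glmmTMB(completion ~ sunshine + (1 | people),
            data = full[idx, ], family = binomial(link = cloglog))
  )["elapsed"]
})
round(times, 1)  # elapsed seconds per fraction
```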

Here's an example with 10,000 observations and a single predictor variable.

n <- 10000
set.seed(101)
dd <- data.frame(people=factor(rep(1:10000,each=3)),
                 sunshine=sample(1:8,size=n, replace=TRUE))
dd$completion <- simulate(~(1|people)+sunshine,
                          newdata=dd,
                          family=binomial(link="cloglog"),
                          newparams=list(beta=c(0,1),
                                         theta=1))[[1]]

glmer runs for 80 seconds and then fails:

system.time(update(clog_re1, data=dd, verbose=100))

On the other hand, glmmTMB does this problem in about 20 seconds (I have 8 cores on my computer, and glmmTMB uses all of them, so the CPU allocation to this job goes up to 750%; if you have fewer cores the elapsed computational time will increase accordingly).

library(glmmTMB)
system.time(glmmTMB(model_re1,data=dd,family = binomial(link = cloglog),
                    verbose=TRUE))

If I add four more predictor variables (for a total of 5), the computation time goes up to 46 seconds (again using 8 cores: the total computation time across all cores is 320 seconds). With 13 times as many predictor variables and 50 times as many observations, you should definitely expect this to be a challenging computation.

A very crude assessment of heterogeneity would be to fit the homogeneous model and compare the residual deviance (or sum of squared Pearson residuals) to the residual degrees of freedom of the model; if the former is much larger, that's evidence of some form of mis-fit (heterogeneity or something else).
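In code, that crude check might look like this (a sketch using the toy data from the question; with only 11 rows the comparison is not meaningful, but the mechanics are the same):

```r
df <- data.frame(people = c(1,1,1,2,2,3,3,4,4,5,5),
                 completion = c(0,0,1,0,1,1,1,0,1,0,1),
                 sunshine = c(1,2,3,4,5,4,6,2,4,8,4))

# Fit the homogeneous (no random effect) model
clog_full <- glm(completion ~ sunshine, data = df,
                 family = binomial(link = cloglog))

# Compare residual deviance to residual degrees of freedom
c(deviance = deviance(clog_full), df = df.residual(clog_full))

# Alternative: sum of squared Pearson residuals
sum(residuals(clog_full, type = "pearson")^2)
```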



Source: https://stackoverflow.com/questions/62234957/random-intercept-glm
