Random Intercept GLM

Asked 2021-01-26 17:57

I want to fit a random-intercept complementary log-log regression in R, in order to check for unobserved user heterogeneity. I have searched through the internet and books and h

1 Answer
  • 2021-01-26 18:43

    This works fine for me (more or less: see the notes below):

    ## added data.frame(); glmer() comes from lme4
    library(lme4)
    df <- data.frame(people     = c(1,1,1,2,2,3,3,4,4,5,5),
                     activity   = c(1,1,1,2,2,3,4,5,5,6,6),
                     completion = c(0,0,1,0,1,1,1,0,1,0,1),
                     sunshine   = c(1,2,3,4,5,4,6,2,4,8,4))

    ## random intercept per person plus a fixed effect of sunshine
    model_re1 <- completion ~ (1|people) + sunshine
    clog_re1 <- glmer(model_re1, data = df,
                      family = binomial(link = cloglog))
    
    • This finishes very quickly (in less than a second): maybe you forgot to close a quote or a parenthesis somewhere?
    • However, it does produce the message "boundary (singular) fit: see ?isSingular", which occurs because your data set is so small/noisy that the best estimate of the among-person variation is zero (it can't be negative); you can confirm this directly, as in the sketch below.
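
    A minimal way to confirm the singular fit, using lme4's isSingular() and VarCorr() on the clog_re1 fit above:

    ## is the random-effects variance estimate at the boundary (zero)?
    isSingular(clog_re1)
    ## show the estimated among-person standard deviation (zero here)
    VarCorr(clog_re1)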

    Update: I'm sorry to tell you that mixed models (GLMMs) are significantly more computationally intensive than standard GLMs: 500K observations with 68 predictor variables is definitely a large problem, and you should expect the fit to take hours. I have a few suggestions:

    • you should definitely try fitting subsets of your data (both fewer observations and fewer predictor variables) to get a sense of how the computation time will scale; see the timing sketch after this list
    • the glmmTMB package is as far as I know the fastest option within R for this problem (lme4 will scale badly with large numbers of predictor variables), but might run into memory constraints. The MixedModels.jl Julia package might be faster.
    • you can usually turn on a "verbose" or "tracing" option to let you know that the model is at least working on the problem, rather than completely stuck (it's not really feasible to know how long it will take to complete, but at least you know something is happening ...)
    • if Stata is much faster (I doubt it, but it's possible), you could use it instead
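
    Here's a minimal sketch of the subsetting strategy from the first bullet. It assumes your full data are in a data frame called dat with columns completion, people, and sunshine (placeholder names; substitute your own), and times the same glmmTMB fit on increasing fractions of the rows:

    ## hypothetical sketch: time the same fit on growing subsets of 'dat'
    library(glmmTMB)
    for (frac in c(0.01, 0.05, 0.10)) {
        idx <- sample(nrow(dat), size = round(frac * nrow(dat)))
        tt <- system.time(
            glmmTMB(completion ~ (1|people) + sunshine,
                    data = dat[idx, ],
                    family = binomial(link = cloglog))
        )
        cat(sprintf("fraction %.2f: %.1f seconds elapsed\n",
                    frac, tt[["elapsed"]]))
    }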

    Here's an example with 10,000 observations and a single predictor variable.

    n <- 10000
    set.seed(101)
    ## 2500 people with 4 observations each, for n = 10000 rows in total
    dd <- data.frame(people = factor(rep(1:(n/4), each = 4)),
                     sunshine = sample(1:8, size = n, replace = TRUE))
    ## simulate a cloglog response from the random-intercept model, with
    ## fixed effects beta = (0, 1) and among-person SD theta = 1
    dd$completion <- simulate(~ (1|people) + sunshine,
                              newdata = dd,
                              family = binomial(link = "cloglog"),
                              newparams = list(beta = c(0, 1),
                                               theta = 1))[[1]]
    

    glmer runs for 80 seconds and then fails:

    ## refit the earlier glmer model on the simulated data; verbose
    ## prints optimizer progress so you can see it working
    system.time(update(clog_re1, data = dd, verbose = 100))
    

    On the other hand, glmmTMB does this problem in about 20 seconds (I have 8 cores on my computer, and glmmTMB uses all of them, so the CPU allocation for this job goes up to 750%; if you have fewer cores, the elapsed time will increase accordingly).

    library(glmmTMB)
    ## same formula and family as before; verbose = TRUE prints progress
    system.time(glmmTMB(model_re1, data = dd,
                        family = binomial(link = cloglog),
                        verbose = TRUE))
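
    If you want to check or limit how many of those cores are used, newer versions of glmmTMB expose an OpenMP thread count through glmmTMBControl(); a hedged sketch, assuming your installed version supports the parallel control argument:

    ## assumption: your glmmTMB version accepts glmmTMBControl(parallel = n)
    system.time(glmmTMB(model_re1, data = dd,
                        family = binomial(link = cloglog),
                        control = glmmTMBControl(parallel = 4)))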
    

    If I add four more predictor variables (for a total of 5), the computation time goes up to 46 seconds (again using 8 cores; the total computation time across all cores is about 320 seconds). With 13 times as many predictor variables and 50 times as many observations, you should definitely expect this to be a challenging computation. A sketch of how to tack on extra predictors for such timing experiments follows.
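
    For reference, here's a hedged sketch of adding noise predictors (x1 through x4 are hypothetical columns, invented purely for timing) and building the larger formula with reformulate():

    ## hypothetical: four noise predictors for a timing experiment
    for (v in paste0("x", 1:4)) dd[[v]] <- rnorm(nrow(dd))
    form5 <- reformulate(c("(1|people)", "sunshine", paste0("x", 1:4)),
                         response = "completion")
    system.time(glmmTMB(form5, data = dd,
                        family = binomial(link = cloglog)))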

    A very crude assessment of heterogeneity would be to fit the homogeneous model and compare the residual deviance (or the sum of squared Pearson residuals) to the residual degrees of freedom of the model; if the former is much larger than the latter, that's evidence of some form of mis-fit (heterogeneity or something else). A sketch of this check is below.
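
    A minimal sketch of that check, using plain glm() on the small df example above (keep in mind that the deviance/df rule of thumb is rough for binary responses):

    ## homogeneous model: no random effect
    clog_hom <- glm(completion ~ sunshine, data = df,
                    family = binomial(link = cloglog))
    deviance(clog_hom)                            ## residual deviance
    sum(residuals(clog_hom, type = "pearson")^2)  ## Pearson chi-square
    df.residual(clog_hom)                         ## residual degrees of freedom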
