Large fixed effects binomial regression in R

别跟我提以往 2021-02-01 07:56

I need to run a logistic regression on a relatively large data frame with 480,000 entries and 3 fixed-effect variables. Fixed-effect var A has 3233 levels, var B has 2326 levels …

3 Answers
  • 2021-02-01 08:41

    Check out

    glmmboot{glmmML}
    

    http://cran.r-project.org/web/packages/glmmML/glmmML.pdf

    There is also a nice document by Broström and Holmberg (http://cran.r-project.org/web/packages/eha/vignettes/glmmML.pdf).

    Here is the example from their document:

    ## Simulated data: 5000 observations in 1000 groups of 5
    dat <- data.frame(y = rbinom(5000, size = 1, prob = 0.5),
                      x = rnorm(5000), group = rep(1:1000, each = 5))

    ## Standard glm: the group effects enter as thousands of dummy variables
    fit1 <- glm(y ~ factor(group) + x, data = dat, family = binomial)

    ## glmmboot handles the per-group intercepts by profiling instead of dummy coding
    require(glmmML)
    fit2 <- glmmboot(y ~ x, cluster = group, data = dat)

    The computing time difference is "huge"!
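
    A quick way to see that for yourself is to wrap both calls in system.time(); the exact numbers depend on your machine, but the gap is large. A minimal sketch using the simulated data above:

    ## Rough timing comparison on the simulated data above (machine-dependent)
    system.time(glm(y ~ factor(group) + x, data = dat, family = binomial))
    system.time(glmmboot(y ~ x, cluster = group, data = dat))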

  • 2021-02-01 08:50

    For posterity, I'd also like to recommend the package speedglm, which I have found useful when trying to perform logistic regression on large data sets. It seems to use about half as much memory and finishes a lot quicker than glm.
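
    The call mirrors glm()'s formula interface; a minimal sketch, with dat, y, A, B and x standing in for your data frame, outcome and fixed-effect variables:

    ## speedglm mirrors the glm() interface (A and B should be factors)
    library(speedglm)
    fit <- speedglm(y ~ A + B + x, data = dat, family = binomial())
    summary(fit)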

  • 2021-02-01 08:51

    I agree with whoever (@Ben Bolker, I guess?) suggested that you use the glm4 function from the MatrixModels package. Firstly, it solves your memory problem if you use the sparse argument. A dense design matrix with 480,000 entries and 6370 fixed effects would require 6371 * 480,000 * 8 = 24,464,640,000 bytes (roughly 23 GB). However, your design matrix will be very sparse (mostly zeros), so you can get by with a much smaller (in memory) design matrix by using a sparse one. Secondly, you can exploit the sparsity for much faster estimation.
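
    A minimal sketch of both points, using simulated stand-ins (a and b play the role of your var A and var B; the sparse model matrix and the sparse = TRUE argument are what keep memory in check):

    ## Simulated stand-in data: two high-cardinality factors plus a covariate
    set.seed(1)
    n   <- 50000
    dat <- data.frame(y = rbinom(n, 1, 0.5),
                      a = factor(sample(500, n, replace = TRUE)),
                      b = factor(sample(300, n, replace = TRUE)),
                      x = rnorm(n))

    ## Memory footprint: dense vs. sparse design matrix
    library(Matrix)
    object.size(model.matrix(y ~ a + b + x, data = dat))         # dense, large
    object.size(sparse.model.matrix(y ~ a + b + x, data = dat))  # sparse, much smaller

    ## Sparse fixed-effects logistic regression via MatrixModels::glm4
    library(MatrixModels)
    fit <- glm4(y ~ a + b + x, data = dat, family = binomial(), sparse = TRUE)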

    As for other options, a quick search shows that speedglm also has a sparse argument, although I have not tried it. A key thing with whatever method you end up with is that it should exploit the fact that your design matrix is sparse, both to reduce computation time and to reduce the memory requirements.
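
    If you go that route, the call would look something like the sketch below; I am only going by the speedglm documentation here and have not benchmarked it:

    ## speedglm with its sparse option (untested; see ?speedglm)
    library(speedglm)
    fit_sp <- speedglm(y ~ a + b + x, data = dat, family = binomial(), sparse = TRUE)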

    The error you get ("Error in Cholesky(crossprod(from), LDL = FALSE) : internal_chm_factor: Cholesky factorization failed") is likely because your design matrix is singular. In that case your problem does not have a unique solution, and some options are to merge some of the group levels, use penalization, or use a random-effects model.
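
    One way to check for this before fitting is to compare the rank of the (sparse) design matrix with its number of columns; a sketch, noting that rankMatrix() can itself be slow on very large matrices:

    ## A rank-deficient design matrix implies a singular crossprod()
    library(Matrix)
    X <- sparse.model.matrix(y ~ a + b + x, data = dat)
    rankMatrix(X) < ncol(X)  # TRUE means the design is singular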

    You are right that there does not seem to be a summary method for the glpModel class. However, the slots seem to have obvious names, so it should not take you long to get, e.g., standard errors for your estimates, compute a variance estimate, etc.
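
    For instance, the usual GLM covariance matrix is (X'WX)^-1 with W = diag(mu * (1 - mu)), so you can rebuild it from the sparse design matrix and the fitted means. A sketch for the glm4 fit above; the slot names fit@pred@coef and fit@resp@mu are assumptions on my part, so confirm them with str(fit):

    ## Standard errors for the glm4/glpModel fit above.
    ## Slot names below are assumptions -- inspect str(fit) to confirm.
    beta <- fit@pred@coef   # estimated coefficients
    mu   <- fit@resp@mu     # fitted means

    ## Fisher information for a binomial GLM: X' diag(mu * (1 - mu)) X
    X    <- sparse.model.matrix(y ~ a + b + x, data = dat)
    W    <- Diagonal(x = mu * (1 - mu))
    XtWX <- crossprod(X, W %*% X)

    ## Covariance matrix, standard errors and Wald z-statistics
    vc <- solve(XtWX)
    se <- sqrt(diag(vc))
    z  <- beta / se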
