How to minimize size of object of class “lm” without compromising it being passed to predict()

有刺的猬 2020-12-06 12:14

I want to run lm() on a large dataset with 50M+ observations with 2 predictors. The analysis is run on a remote server with only 10GB for storing the data. I ha

3 answers
  • 2020-12-06 12:44

    I'm trying to deal with the same issue. What I use is not suitable for other purposes but works for predict(): you can drop the qr matrix from the model's qr component, along with a few other large components:

    lmFull <- lm(Volume ~ Girth + Height, data = trees)
    lmSlim <- lmFull
    # Drop the components that predict() on new data does not need
    lmSlim$fitted.values <- lmSlim$qr$qr <- lmSlim$residuals <- lmSlim$model <- lmSlim$effects <- NULL
    newObs <- data.frame(Girth = c(1, 2, 3), Height = c(2, 3, 4))
    pred1 <- predict(lmFull, newdata = newObs)
    pred2 <- predict(lmSlim, newdata = newObs)
    identical(pred1, pred2)
    [1] TRUE

    # Fraction of the original object size that was saved
    as.numeric((object.size(lmFull) - object.size(lmSlim)) / object.size(lmFull))
    [1] 0.6550523
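
    A related sketch, in case it helps: part of the bulk can be avoided at fit time with lm()'s model = FALSE argument, so the model frame is never stored, and the remaining large components can then be stripped as above. The names fit_full, fit_lean and new_obs are just illustrative, and this only covers plain predict() on new data (no standard errors or intervals).

    ```r
    # Fit once normally and once without storing the model frame.
    fit_full <- lm(Volume ~ Girth + Height, data = trees)
    fit_lean <- lm(Volume ~ Girth + Height, data = trees, model = FALSE)

    # Remove per-observation components that plain predict(newdata = ...)
    # does not read.
    for (part in c("fitted.values", "residuals", "effects")) {
      fit_lean[[part]] <- NULL
    }
    fit_lean$qr$qr <- NULL  # keep the rest of $qr (rank, pivot)

    new_obs <- data.frame(Girth = c(10, 12, 14), Height = c(70, 75, 80))
    same_pred <- all.equal(predict(fit_full, newdata = new_obs),
                           predict(fit_lean, newdata = new_obs))
    size_ratio <- as.numeric(object.size(fit_lean)) /
      as.numeric(object.size(fit_full))
    print(same_pred)
    print(size_ratio)
    ```

    With 50M+ rows the per-observation vectors dominate, so the relative saving should be much larger than on the small trees data.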
    
  • 2020-12-06 12:53

    The link below gives a relevant answer (for a glm object, which is very similar to an lm object).

    http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r/

    Basically, predict() essentially only uses the coefficients, which are a very small portion of the glm output. The function below (copied from the link) trims information that predict() does not use.

    It does have a caveat, though. After trimming, the object can't be passed to summary(fit) or other summary functions, since those functions need more than what predict() requires.

    cleanModel1 = function(cm) {
      # just in case we forgot to set
      # y=FALSE and model=FALSE
      cm$y = c()
      cm$model = c()
    
      cm$residuals = c()
      cm$fitted.values = c()
      cm$effects = c()
      cm$qr$qr = c()
      cm$linear.predictors = c()
      cm$weights = c()
      cm$prior.weights = c()
      cm$data = c()
      cm
    }
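
    To get a feel for how small the coefficients are relative to everything else, here is a rough sketch that lists the size of each component of a fitted glm. The mtcars model is only a stand-in, and the names component_bytes and coef_share are made up for this example.

    ```r
    # Stand-in model; any fitted lm/glm object can be inspected the same way.
    fit <- glm(carb ~ wt + hp, family = poisson, data = mtcars)

    # Size in bytes of each top-level component, largest first.
    component_bytes <- sapply(fit, function(part) as.numeric(object.size(part)))
    component_bytes <- sort(component_bytes, decreasing = TRUE)
    print(head(component_bytes, 5))

    # Share of the summed component sizes taken by the coefficients.
    coef_share <- unname(component_bytes["coefficients"] / sum(component_bytes))
    print(coef_share)
    ```

    Note that object.size() does not count environments, so this view misses anything held by the formula's environment.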
    
  • 2020-12-06 12:55

    xappp's answer is nice but not the whole story. The model also holds a reference to a potentially huge environment that you can do something about (see: https://blogs.oracle.com/R/entry/is_the_size_of_your)

    Either add this to xappp's function:

    e <- attr(cm$terms, ".Environment")
    parent.env(e) <- emptyenv()
    rm(list=ls(envir=e), envir=e)
    

    Or use this version, which removes less data but still lets you use summary():

    cleanModel1 = function(cm) {
      # just in case we forgot to set
      # y=FALSE and model=FALSE
      cm$y = c()
      cm$model = c()

      e <- attr(cm$terms, ".Environment")
      parent.env(e) <- emptyenv()
      rm(list=ls(envir=e), envir=e)
      cm
    }
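
    The environment effect can be checked with a small experiment. This is a sketch under some assumptions: the model is fit inside a function (here a hypothetical fit_in_function) so the formula captures a local environment holding a large unrelated object, and the size is measured with serialize(), which is what saveRDS() writes. Instead of emptying the captured environment in place, it swaps the attribute for globalenv(), which is serialized by reference. One reason for that choice: if the model was fit at top level, the terms environment is the global environment itself, and emptying it in place would delete the workspace.

    ```r
    # Hypothetical helper: the formula's environment is this function's
    # frame, which also holds big_local.
    fit_in_function <- function() {
      big_local <- rnorm(1e6)  # stands in for large intermediate data
      # model = FALSE so no second copy of the terms (with the same
      # environment) is kept inside a stored model frame.
      lm(Volume ~ Girth + Height, data = trees, model = FALSE)
    }

    fit <- fit_in_function()
    bytes_before <- length(serialize(fit, NULL))

    # Replace the captured environment rather than mutating it.
    fit_small <- fit
    attr(fit_small$terms, ".Environment") <- globalenv()
    bytes_after <- length(serialize(fit_small, NULL))

    new_obs <- data.frame(Girth = c(10, 12), Height = c(70, 75))
    same_pred <- all.equal(predict(fit, newdata = new_obs),
                           predict(fit_small, newdata = new_obs))
    print(c(before = bytes_before, after = bytes_after))
    print(same_pred)
    ```

    For glm objects there may be further environment references (for example on the stored formula), so it is worth re-measuring the serialized size after each change.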
    