How to minimize size of object of class “lm” without compromising it being passed to predict()

Posted by 主宰稳场 on 2019-11-27 06:09:22

Question


I want to run lm() on a large dataset with 50M+ observations and 2 predictors. The analysis runs on a remote server with only 10GB of storage for the data. I tested lm() on 10K observations sampled from the data, and the resulting object was over 2GB in size.

I need the object of class "lm" returned from lm() ONLY to produce the summary statistics of the model (summary(lm_object)) and to make predictions (predict(lm_object)).

I have experimented with the model, x, y and qr options of lm(). Setting them all to FALSE reduces the size by 38%:

library(MASS)
fit1 <- lm(medv ~ lstat, data = Boston)
size1 <- object.size(fit1)
print(size1, units = "Kb")
# 127.4 Kb
fit2 <- lm(medv ~ lstat, data = Boston, model = FALSE, x = FALSE, y = FALSE, qr = FALSE)
size2 <- object.size(fit2)
print(size2, units = "Kb")
# 78.5 Kb
- ((as.integer(size1) - as.integer(size2)) / as.integer(size1)) * 100
# -38.37994

but

summary(fit2)
# Error in qr.lm(object) : lm object does not have a proper 'qr' component.
#  Rank zero or should not have used lm(.., qr=FALSE).
predict(fit2, newdata = Boston)
# Error in qr.lm(object) : lm object does not have a proper 'qr' component.
#  Rank zero or should not have used lm(.., qr=FALSE).

Apparently I need to keep qr=TRUE, which reduces the object size by only 9% compared with the default object:

fit3 <- lm(medv ~ lstat, data = Boston, model = FALSE, x = FALSE, y = FALSE, qr = TRUE)
size3 <- object.size(fit3)
print(size3, units = "Kb")
# 115.8 Kb
- ((as.integer(size1) - as.integer(size3)) / as.integer(size1)) * 100
# -9.142752

How do I bring the size of the "lm" object to a minimum without dumping a lot of unneeded information in memory and storage?


Answer 1:


The link below gives a relevant answer (for glm objects, which are very similar to lm objects).

http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r/

Basically, predict() uses mainly the coefficients, which are a very small part of the glm output. The function below (copied from the link) trims the information that predict() does not use.

There is a caveat, though: after trimming, the object can no longer be passed to summary(fit) or other summary functions, since those need more than what predict() requires.

cleanModel1 = function(cm) {
  # just in case we forgot to set
  # y=FALSE and model=FALSE
  cm$y = c()
  cm$model = c()

  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()  # drop only the large matrix; predict() still looks up qr$pivot
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()
  cm
}
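
If the summary statistics are needed as well, one possible workaround for the caveat above is to extract them before trimming and keep them next to the slimmed model. The sketch below is an illustration, not part of the linked post: it uses the built-in trees data as a stand-in, the names stats and slim are made up for the example, and it removes only qr$qr (not the whole qr component) so that predict() on an lm object can still find the pivot information.

```r
# Fit on a small built-in dataset (stand-in for the real data).
fit <- lm(Volume ~ Girth + Height, data = trees)

# 1. Pull out the summary numbers while the full object still exists.
s <- summary(fit)
stats <- list(
  coefficients  = coef(s),        # estimates, std. errors, t and p values
  sigma         = s$sigma,        # residual standard error
  r.squared     = s$r.squared,
  adj.r.squared = s$adj.r.squared,
  fstatistic    = s$fstatistic,
  df            = s$df
)

# 2. Drop the per-observation components from a copy of the model.
slim <- fit
slim$model         <- NULL
slim$residuals     <- NULL
slim$fitted.values <- NULL
slim$effects       <- NULL
slim$qr$qr         <- NULL        # keep qr$pivot and the other small elements

# 3. Point predictions from the slim object should match the full one.
nd <- data.frame(Girth = c(10, 12), Height = c(70, 80))
all.equal(predict(fit, newdata = nd), predict(slim, newdata = nd))
```

This covers plain point predictions only; standard errors or intervals from predict() rely on components that were removed here, so they should be computed before trimming if they are needed.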



Answer 2:


xappp's answer is good, but it is not the whole story. The environment attached to the model's terms can also be huge, and you can do something about it (see: https://blogs.oracle.com/R/entry/is_the_size_of_your)

Either add this to xappp's function

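     # Caution: for a model fitted at top level, e is the global environment,
     # and the lines below would then empty your whole workspace.
     e <- attr(cm$terms, ".Environment")
     parent.env(e) <- emptyenv()
     rm(list=ls(envir=e), envir=e)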

Or use this version, which removes less data but still lets you use summary():

      cleanModel1 = function(cm) {
        # just in case we forgot to set
        # y=FALSE and model=FALSE
        cm$y = c()
        cm$model = c()

        e <- attr(cm$terms, ".Environment")
        parent.env(e) <- emptyenv()
        rm(list=ls(envir=e), envir=e)
        cm
      }
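
Note that object.size() does not count environments, so the effect of this step shows up mainly when the model is saved or serialized. The sketch below illustrates that under some assumptions: the model is fitted inside a hypothetical function fit_in_function() that also holds a large unused vector, and instead of emptying the captured environment it points the terms at the global environment (which is what a top-level fit would carry anyway). It is a gentler variant for illustration, not the code from the linked post.

```r
# Hypothetical wrapper: the formula captures this function's environment,
# including the large local object `big`.
fit_in_function <- function() {
  big <- rnorm(1e6)   # roughly 8 MB that the model never uses
  lm(Volume ~ Girth + Height, data = trees)
}

fit <- fit_in_function()
nd  <- data.frame(Girth = c(10, 12), Height = c(70, 80))
pred_before <- predict(fit, newdata = nd)

# Serialized size includes the function environment captured by the formula.
size_before <- length(serialize(fit, NULL))

# Re-point the captured environment at the global environment, which is
# stored as a reference rather than by content.
attr(fit$terms, ".Environment") <- globalenv()
attr(attr(fit$model, "terms"), ".Environment") <- globalenv()

size_after <- length(serialize(fit, NULL))
pred_after <- predict(fit, newdata = nd)

c(before = size_before, after = size_after)
identical(pred_before, pred_after)
```

Because nothing is deleted from the captured environment, this variant leaves the workspace alone and summary() keeps working on the result.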



Answer 3:


I'm dealing with the same issue. What I use is not suitable for everything, but it works for predict(): you can remove the inner qr element of the qr component of the lm object:

lmFull <- lm(Volume~Girth+Height,data=trees)
lmSlim <- lmFull
lmSlim$fitted.values <- lmSlim$qr$qr <- lmSlim$residuals <- lmSlim$model <- lmSlim$effects <- NULL
pred1 <- predict(lmFull,newdata=data.frame(Girth=c(1,2,3),Height=c(2,3,4)))
pred2 <- predict(lmSlim,newdata=data.frame(Girth=c(1,2,3),Height=c(2,3,4)))
identical(pred1,pred2)
[1] TRUE

as.numeric((object.size(lmFull) - object.size(lmSlim)) / object.size(lmFull))
[1] 0.6550523
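
To decide which components are worth removing for a given model, it can help to look at the size of each element first. A small sketch on the same trees model (the variable name sizes is just for illustration, and the actual numbers depend on the data and the R version):

```r
lmFull <- lm(Volume ~ Girth + Height, data = trees)

# Size in bytes of every top-level component, largest first.
sizes <- sort(sapply(lmFull, object.size), decreasing = TRUE)
sizes

# The same breakdown inside the qr component.
sort(sapply(lmFull$qr, object.size), decreasing = TRUE)
```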


Source: https://stackoverflow.com/questions/21896265/how-to-minimize-size-of-object-of-class-lm-without-compromising-it-being-passe
