Linear regression in R without copying data in memory?

Submitted by 雨燕双飞 on 2019-12-01 20:10:25

You can use biglm:

m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)

Since biglm does not store the data in the output object, you need to supply the data again when making predictions:

p <- predict(m, newdata=iris)

The size of the fitted model object depends only on the number of parameters, not on the number of observations — doubling the data leaves it unchanged:

> object.size(m)
6720 bytes
> d <- rbind(iris, iris)
> m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, data=d)
> object.size(m)
6720 bytes

biglm also lets you update the model with a new chunk of data via its update method. This way you can estimate models even when the complete dataset does not fit in memory.
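As a sketch of that chunked workflow (splitting iris in two here stands in for reading a large file piece by piece):

```r
library(biglm)

# Fit on the first chunk, then fold in the remaining rows with update()
chunk1 <- iris[1:75, ]
chunk2 <- iris[76:150, ]

m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, data = chunk1)
m <- update(m, chunk2)

# Coefficients agree with a single fit on the full data
coef(m)
```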

The only components of the lm object needed to compute predicted values are terms and coefficients. However, you'll have to roll your own prediction function, because predict.lm complains if you delete the qr component (which it needs to compute term-by-term effects and standard errors). Something like this should do:

m <- lm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
m$effects <- m$fitted.values <- m$residuals <- m$model <- m$qr <-
     m$rank <- m$assign <- NULL

predict0 <- function(object, newdata)
{
    mm <- model.matrix(terms(object), newdata)
    mm %*% object$coefficients
}

predict0(m, iris[1:10,])
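As a quick sanity check that the hand-rolled predictor agrees with predict.lm — done here on a fresh, untrimmed fit, since predict.lm needs the qr component that was deleted above:

```r
# predict0 as defined above
predict0 <- function(object, newdata) {
    mm <- model.matrix(terms(object), newdata)
    mm %*% object$coefficients
}

m_full <- lm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
p0 <- as.vector(predict0(m_full, iris[1:10, ]))
p1 <- unname(predict(m_full, iris[1:10, ]))
all.equal(p0, p1)
```

Note that model.matrix on terms plus newdata does not reapply the original factor contrasts or levels; that's fine here because all predictors are numeric.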

I think there are two approaches to deal with this:

  • Use lm and trim the fat afterwards. For quite nice and instructive discussions, see e.g. here and here. This will not solve the "computation time" issue.
  • Do not use lm.

If you go for the second option, you could easily write out the matrix operations yourself so that you only compute the predicted values. If you prefer a canned routine, you could try other packages that implement least squares while storing much less information, e.g. fastLm from the RcppArmadillo package (or its RcppEigen counterpart, or, as others pointed out, biglm). This approach keeps some conveniences, such as a formula interface. fastLm is also quite fast, if computation time is a concern for you.
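If you do roll the matrix algebra yourself, a minimal sketch via QR (the same decomposition lm uses internally) needs to keep only the coefficient vector:

```r
# Least squares by hand: solve min ||y - Xb|| via QR, keep only b
X <- model.matrix(~ Petal.Length + Petal.Width, iris)  # design matrix with intercept
y <- iris$Sepal.Length
b <- qr.coef(qr(X), y)   # least-squares coefficients
pred <- X %*% b          # predicted values

object.size(b)  # a named numeric vector of length 3 -- nothing else stored
```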

For comparison, here a small benchmark:

l <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data=iris)
library(biglm)
m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
library(RcppArmadillo)
a <- fastLm(Sepal.Length ~ Petal.Length + Petal.Width, iris)

object.size(l)
# 52704 bytes
object.size(m)
# 6664 bytes
object.size(a)
# 6344 bytes