linear regression in R without copying data in memory?

问题

The standard way of doing a linear regression is something like this:

l <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris)

and then use predict(l, new_data) to make predictions, where new_data is a dataframe with columns matching the formula. But lm() returns an lm object, which is a list that contains crap-loads of stuff that is mostly irrelevant in most situations. This includes a copy of the original data, and a bunch of named vectors and arrays the length/size of the data:

R> str(l)
List of 12
 $ coefficients : Named num [1:3] 3.587 -0.257 0.364
  ..- attr(*, "names")= chr [1:3] "(Intercept)" "Petal.Length" "Petal.Width"
 $ residuals    : Named num [1:150] 0.2 -0.3 -0.126 -0.174 0.3 ...
  ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
 $ effects      : Named num [1:150] -37.445 -2.279 -0.914 -0.164 0.313 ...
  ..- attr(*, "names")= chr [1:150] "(Intercept)" "Petal.Length" "Petal.Width" "" ...
 $ rank         : int 3
 $ fitted.values: Named num [1:150] 3.3 3.3 3.33 3.27 3.3 ...
  ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
 $ assign       : int [1:3] 0 1 2
 $ qr           :List of 5
  ..$ qr   : num [1:150, 1:3] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:3] "(Intercept)" "Petal.Length" "Petal.Width"
  .. ..- attr(*, "assign")= int [1:3] 0 1 2
  ..$ qraux: num [1:3] 1.08 1.1 1.01
  ..$ pivot: int [1:3] 1 2 3
  ..$ tol  : num 1e-07
  ..$ rank : int 3
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 147
 $ xlevels      : Named list()
 $ call         : language lm(formula = Sepal.Width ~ Petal.Length + Petal.Width, data = iris)
 $ terms        :Classes 'terms', 'formula' length 3 Sepal.Width ~ Petal.Length + Petal.Width
  .. ..- attr(*, "variables")= language list(Sepal.Width, Petal.Length, Petal.Width)
  .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
  .. .. .. ..$ : chr [1:2] "Petal.Length" "Petal.Width"
  .. ..- attr(*, "term.labels")= chr [1:2] "Petal.Length" "Petal.Width"
  .. ..- attr(*, "order")= int [1:2] 1 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(Sepal.Width, Petal.Length, Petal.Width)
  .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
 $ model        :'data.frame':  150 obs. of  3 variables:
  ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
  ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
  ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 Sepal.Width ~ Petal.Length + Petal.Width
  .. .. ..- attr(*, "variables")= language list(Sepal.Width, Petal.Length, Petal.Width)
  .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
  .. .. .. .. ..$ : chr [1:2] "Petal.Length" "Petal.Width"
  .. .. ..- attr(*, "term.labels")= chr [1:2] "Petal.Length" "Petal.Width"
  .. .. ..- attr(*, "order")= int [1:2] 1 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(Sepal.Width, Petal.Length, Petal.Width)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
 - attr(*, "class")= chr "lm"

That stuff takes up a lot of space, and the lm object ends up being almost an order of magnitude larger than the original dataset:

R> object.size(iris)
7088 bytes
R> object.size(l)
52704 bytes

This isn't a problem with a dataset as small as that, but it can be really problematic with a 170Mb dataset that produces a 450mb lm object. Even with all the return options set to false, the lm object is still 5 times the original dataset:

R> ls <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris, model=FALSE, x=FALSE, y=FALSE, qr=FALSE)
R> object.size(ls)
30568 bytes

Is there any way of fitting a model in R, and then being able to predict output values on new input data, without storing crap tonnes of extra unnecessary data? In other words, is there a way to just store the model coefficients, but still be able to use those coefficients to predict on new data?

Edit: I guess, as well as not storing all that excess data, I'm also really interested in a way of using lm so that it doesn't even calculate that data - it's just wasted CPU time...

回答1:

You can use biglm:

m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)

Since biglm does not store the data in the output object you need to provide your data when making predictions:

p <- predict(m, newdata=iris)

The amount of data biglm uses is proportional to the number of parameters:

> object.size(m)
6720 bytes
> d <- rbind(iris, iris)
> m <- biglm(Sepal.Width ~ Petal.Length + Petal.Width, data=d)
> object.size(m)
6720 bytes

biglm also allows you to update the model with a new chunk of data using the update method. Using this you can also estimate models when the complete dataset does not fit in memory.

回答2:

The only components of the lm object that you need to calculate predicted values are terms and coefficients. However, you'll need to roll your own prediction function as predict.lm complains if you delete the qr component (which is needed to compute term-by-term effects and standard errors). Something like this should do.

m <- lm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
m$effects <- m$fitted.values <- m$residuals <- m$model <- m$qr <-
     m$rank <- m$assign <- NULL

predict0 <- function(object, newdata)
{
    mm <- model.matrix(terms(object), newdata)
    mm %*% object$coefficients
}

predict0(m, iris[1:10,])

回答3:

I think there are two approaches to deal with this:

Use lm and trim the fat afterwards. For quite nice and instructive discussions, see e.g. here and here. This will not solve the "computation time" issue.
Do not use lm.

If you go for the second option, you could easily write up the matrix operations yourself so that you only get the predicted values. If you prefer to use a canned routine, you could try other packages that implement least squares, e.g. fastLm from the RcppArmadillo-package (or the Eigen version of it, or as others pointed out biglm), which stores much less information. Using this approach has some benefits, e.g. providing a formula-interface and such things. fastLm is also quite fast, if computation time is a concern for you.

For comparison, here a small benchmark:

l <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris)
library(biglm)
m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
library(RcppArmadillo)
a <- fastLm(Sepal.Length ~ Petal.Length + Petal.Width, iris)

object.size(l)
# 52704 bytes
object.size(m)
# 6664 bytes
object.size(a)
# 6344 bytes

来源：https://stackoverflow.com/questions/25679050/linear-regression-in-r-without-copying-data-in-memory

标签

memory-management

linear-regression