The standard way of doing a linear regression is something like this:
l <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris)
and then use predict(l, new_data)
to make predictions, where new_data is a dataframe with columns matching the formula. But lm()
returns an lm
object, which is a list that contains crap-loads of stuff that is mostly irrelevant in most situations. This includes a copy of the original data, and a bunch of named vectors and arrays the length/size of the data:
R> str(l)
List of 12
$ coefficients : Named num [1:3] 3.587 -0.257 0.364
..- attr(*, "names")= chr [1:3] "(Intercept)" "Petal.Length" "Petal.Width"
$ residuals : Named num [1:150] 0.2 -0.3 -0.126 -0.174 0.3 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ effects : Named num [1:150] -37.445 -2.279 -0.914 -0.164 0.313 ...
..- attr(*, "names")= chr [1:150] "(Intercept)" "Petal.Length" "Petal.Width" "" ...
$ rank : int 3
$ fitted.values: Named num [1:150] 3.3 3.3 3.33 3.27 3.3 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ assign : int [1:3] 0 1 2
$ qr :List of 5
..$ qr : num [1:150, 1:3] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:3] "(Intercept)" "Petal.Length" "Petal.Width"
.. ..- attr(*, "assign")= int [1:3] 0 1 2
..$ qraux: num [1:3] 1.08 1.1 1.01
..$ pivot: int [1:3] 1 2 3
..$ tol : num 1e-07
..$ rank : int 3
..- attr(*, "class")= chr "qr"
$ df.residual : int 147
$ xlevels : Named list()
$ call : language lm(formula = Sepal.Width ~ Petal.Length + Petal.Width, data = iris)
$ terms :Classes 'terms', 'formula' length 3 Sepal.Width ~ Petal.Length + Petal.Width
.. ..- attr(*, "variables")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
.. .. .. ..$ : chr [1:2] "Petal.Length" "Petal.Width"
.. ..- attr(*, "term.labels")= chr [1:2] "Petal.Length" "Petal.Width"
.. ..- attr(*, "order")= int [1:2] 1 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
$ model :'data.frame': 150 obs. of 3 variables:
..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
..- attr(*, "terms")=Classes 'terms', 'formula' length 3 Sepal.Width ~ Petal.Length + Petal.Width
.. .. ..- attr(*, "variables")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
.. .. .. .. ..$ : chr [1:2] "Petal.Length" "Petal.Width"
.. .. ..- attr(*, "term.labels")= chr [1:2] "Petal.Length" "Petal.Width"
.. .. ..- attr(*, "order")= int [1:2] 1 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(Sepal.Width, Petal.Length, Petal.Width)
.. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:3] "Sepal.Width" "Petal.Length" "Petal.Width"
- attr(*, "class")= chr "lm"
That stuff takes up a lot of space, and the lm
object ends up being almost an order of magnitude larger than the original dataset:
R> object.size(iris)
7088 bytes
R> object.size(l)
52704 bytes
This isn't a problem with a dataset as small as that, but it can be really problematic with a 170Mb dataset that produces a 450mb lm
object. Even with all the return options set to false, the lm object is still 5 times the original dataset:
R> ls <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris, model=FALSE, x=FALSE, y=FALSE, qr=FALSE)
R> object.size(ls)
30568 bytes
Is there any way of fitting a model in R, and then being able to predict output values on new input data, without storing crap tonnes of extra unnecessary data? In other words, is there a way to just store the model coefficients, but still be able to use those coefficients to predict on new data?
Edit: I guess, as well as not storing all that excess data, I'm also really interested in a way of using lm so that it doesn't even calculate that data - it's just wasted CPU time...
You can use biglm
:
m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
Since biglm
does not store the data in the output object you need to provide your data when making predictions:
p <- predict(m, newdata=iris)
The amount of data biglm
uses is proportional to the number of parameters:
> object.size(m)
6720 bytes
> d <- rbind(iris, iris)
> m <- biglm(Sepal.Width ~ Petal.Length + Petal.Width, data=d)
> object.size(m)
6720 bytes
biglm
also allows you to update the model with a new chunk of data using the update
method. Using this you can also estimate models when the complete dataset does not fit in memory.
The only components of the lm
object that you need to calculate predicted values are terms
and coefficients
. However, you'll need to roll your own prediction function as predict.lm
complains if you delete the qr
component (which is needed to compute term-by-term effects and standard errors). Something like this should do.
m <- lm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
m$effects <- m$fitted.values <- m$residuals <- m$model <- m$qr <-
m$rank <- m$assign <- NULL
predict0 <- function(object, newdata)
{
mm <- model.matrix(terms(object), newdata)
mm %*% object$coefficients
}
predict0(m, iris[1:10,])
I think there are two approaches to deal with this:
- Use
lm
and trim the fat afterwards. For quite nice and instructive discussions, see e.g. here and here. This will not solve the "computation time" issue. - Do not use
lm
.
If you go for the second option, you could easily write up the matrix operations yourself so that you only get the predicted values. If you prefer to use a canned routine, you could try other packages that implement least squares, e.g. fastLm
from the RcppArmadillo
-package (or the Eigen
version of it, or as others pointed out biglm
), which stores much less information. Using this approach has some benefits, e.g. providing a formula-interface and such things. fastLm
is also quite fast, if computation time is a concern for you.
For comparison, here a small benchmark:
l <- lm(Sepal.Width ~ Petal.Length + Petal.Width, data=iris)
library(biglm)
m <- biglm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
library(RcppArmadillo)
a <- fastLm(Sepal.Length ~ Petal.Length + Petal.Width, iris)
object.size(l)
# 52704 bytes
object.size(m)
# 6664 bytes
object.size(a)
# 6344 bytes
来源:https://stackoverflow.com/questions/25679050/linear-regression-in-r-without-copying-data-in-memory